Preparing Data for Keras


This article is an excerpt from Packt’s upcoming book, Machine Learning for Finance by Jannes Klaas.

Before Keras can work with the data meaningfully, we have to prepare it. There are three types of data:

Nominal: Data comes in discrete categories which cannot be ordered. In our case, the type of transfer is a nominal variable. There are five discrete types, but it does not make sense to put them in any order. "TRANSFER" cannot be "more" than "CASH_OUT". They are just separate categories.

Ordinal: Data also comes in discrete categories, but these can be ordered. For example, if coffee comes in sizes large, medium, and small, those are distinct categories, but they can be compared: large is more than small.

Numerical: Data can be ordered but we can also perform mathematical operations on it. An example in our data is the amount. We can compare amounts, but we can also subtract them or add them up.
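As a rough illustration of the three types, the sketch below uses a small made-up data frame (the column names and values are assumptions, not the article's dataset). pandas can mark a column as an unordered or an ordered categorical, which mirrors the distinction between nominal and ordinal data:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "type": ["TRANSFER", "CASH_OUT", "TRANSFER"],   # nominal
    "size": ["small", "large", "medium"],           # ordinal
    "amount": [100.0, 250.5, 30.0],                 # numerical
})

# Nominal: categories without an order
toy["type"] = pd.Categorical(toy["type"])

# Ordinal: categories with an explicit order
toy["size"] = pd.Categorical(
    toy["size"], categories=["small", "medium", "large"], ordered=True
)

# Ordered categories can be compared ...
large_is_biggest = toy["size"].max() == "large"

# ... and numerical data additionally supports arithmetic
total_amount = toy["amount"].sum()
```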

Nominal and ordinal data are both categorical data, as they describe discrete categories. Numerical data works fine with neural networks out of the box. Categorical data needs special treatment. There are two ways of preparing categorical data:

One Hot Encoding

The most commonly used method to encode categorical data is called "one hot" encoding. In one hot encoding, we create a new variable, a so-called dummy variable, for each category. We then set the dummy variable to 1 if the transaction is a member of that category and to 0 otherwise:

[Figure: One Hot]

Pandas offers a function to create dummy variables out of the box. Before using it, however, it makes sense to add 'Type_' in front of all actual transaction types. The dummy variables will be named after the category, so by adding 'Type_' to the beginning we know that these dummy variables indicate the type:

df['type'] = 'Type_' + df['type'].astype(str)

This line does three things. df['type'].astype(str) converts all entries in the 'type' column to strings. Then the prefix 'Type_' is added by combining the strings. This new column of combined strings then replaces the original 'type' column.

We can now get the dummy variables.

dummies = pd.get_dummies(df['type']) 

Note that the get_dummies() function creates a new data frame. We now have to attach this data frame to the main data frame:

df = pd.concat([df, dummies], axis=1)

concat() concatenates two data frames. We concatenate along axis 1 to add the data frame as new columns. Now that the dummy variables are in our main data frame, we can remove the original column:

del df['type']

And, voila, we turned our categorical variable into something a neural network can work with.
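The steps above can be tried end to end on a tiny data frame. This is a minimal sketch on invented data (df_toy and its values are assumptions). It passes dtype=int to get_dummies() because recent pandas versions return boolean dummy columns by default, while the text above describes 1/0 values:

```python
import pandas as pd

# Toy transactions, invented for illustration only
df_toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "PAYMENT", "CASH_OUT"],
    "amount": [10.0, 200.0, 35.5, 80.0],
})

# Prefix the categories so the dummy columns are recognizable
df_toy["type"] = "Type_" + df_toy["type"].astype(str)

# One dummy column per category, holding 1 or 0
dummies = pd.get_dummies(df_toy["type"], dtype=int)

# Attach the dummies as new columns and drop the original column
df_toy = pd.concat([df_toy, dummies], axis=1)
del df_toy["type"]

print(df_toy.columns.tolist())
```

As a shortcut, pd.get_dummies(df, columns=['type'], prefix='Type') performs the prefixing, the encoding, and the column replacement in a single call.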

Entity Embeddings

This section makes use of embeddings and the Keras functional API. It walks you through the general workflow, and it is fine if you do not understand everything that is going on just yet. This is an advanced section, after all. You can focus on the general ideas for now and worry about the implementation details later.

In this section, we will create embedding vectors for categorical data. Embedding vectors are vectors representing categorical values. They are used as inputs for the neural network. We train embeddings together with the neural network so that we obtain more useful embeddings over time. Embeddings are an extremely useful tool. Not only do they reduce the number of dimensions needed for encoding compared to one-hot encoding, and thus decrease memory usage, they also reduce sparsity in the input activations, which helps reduce overfitting, and they can encode semantic meanings as vectors. The same advantages that made embeddings useful for text make them useful for categorical data.
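To make the relationship to one-hot encoding concrete, here is a small NumPy sketch (not Keras code; the sizes and random values are assumptions chosen for illustration). An embedding layer can be viewed as a lookup into a matrix with one row per category, which gives the same result as multiplying a one-hot vector by that matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

num_categories = 5   # e.g. five transaction types
embedding_dim = 3    # size of each embedding vector

# The embedding matrix holds one (trainable) row per category
embedding_matrix = rng.normal(size=(num_categories, embedding_dim))

# Tokenized categories for four samples
tokens = np.array([0, 3, 3, 1])

# An embedding layer is a row lookup ...
looked_up = embedding_matrix[tokens]

# ... equivalent to one-hot encoding followed by a matrix product
one_hot = np.eye(num_categories)[tokens]
via_one_hot = one_hot @ embedding_matrix
```

Each sample is represented by 3 dense values instead of 5 mostly-zero values; with many categories the saving grows accordingly.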

Tokenizing Categories

Just as with text, we have to tokenize inputs before feeding them into the embeddings layer. To do so, we have to create a mapping dictionary that maps categories to a token:

map_dict = {}
for token, value in enumerate(df['type'].unique()):
    map_dict[value] = token

This code loops over all unique type categories while counting up. The first category gets the token 0, the second 1, and so on. Our map_dict looks like this:

{'CASH_IN': 4, 'CASH_OUT': 2, 'DEBIT': 3, 'PAYMENT': 0, 'TRANSFER': 1}

We can now apply this mapping to our data frame:

df['type'] = df['type'].map(map_dict)

All types will now be replaced by their token.
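The tokenization step can be checked on a few invented rows. This is a minimal sketch (the toy data frame is an assumption); it assigns tokens in order of first appearance, which is why the numbering in map_dict depends on the order of the data:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"type": ["PAYMENT", "TRANSFER", "PAYMENT", "CASH_OUT"]})

# Build the category -> token mapping in order of first appearance
map_dict = {value: token for token, value in enumerate(toy["type"].unique())}

# Series.map returns a new column of tokens; a category missing from
# map_dict would become NaN, which makes unseen values easy to spot
toy["type"] = toy["type"].map(map_dict)

print(map_dict)
print(toy["type"].tolist())
```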

We have to deal with the non-categorical values in our data frame separately. We can create a list of columns that are neither the type nor the target like this:

other_cols = [c for c in df.columns if ((c != 'type') and (c != 'isFraud'))]

Creating Input Models

Our model will have two inputs: one for the types, with an embedding layer, and one for all other, non-categorical variables. To combine them easily later, we keep track of their inputs and outputs with two lists:

inputs = []

outputs = [] 

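The model that acts as an input for the type receives a one-dimensional input and passes it through an embedding layer. The outputs of the embedding layer are then reshaped into flat arrays.

from keras.layers import Input, Embedding, Reshape, Dense, Concatenate, Activation
from keras.models import Model

num_types = len(df['type'].unique())
type_embedding_dim = 3

type_in = Input(shape=(1,))
type_embedding = Embedding(num_types, type_embedding_dim, input_length=1)(type_in)
type_out = Reshape(target_shape=(type_embedding_dim,))(type_embedding)
type_model = Model(type_in, type_out)

# Keep track of this branch so it can be combined with the other one later
inputs.append(type_in)
outputs.append(type_out)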



The type embeddings have three dimensions here. This is an arbitrary choice, and experimenting with different numbers of dimensions might improve results.

For all other inputs, we create a second input model. Its input has as many dimensions as there are non-categorical variables, and the model consists of a single dense layer with no activation function. The dense layer is optional; the inputs could also be passed directly into the head model. More layers could also be added.

num_rest = len(other_cols)
rest_in = Input(shape=(num_rest,))
rest_out = Dense(16)(rest_in)
rest_model = Model(rest_in, rest_out)

# Register this branch as well so that both can be concatenated later
inputs.append(rest_in)
outputs.append(rest_out)



Now that we have created the two input models, we can concatenate their outputs. On top of the two concatenated inputs, we will build our head model:

concatenated = Concatenate()(outputs) 
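To see what the concatenation produces, the following NumPy sketch mimics the two branches with random weights (this is not Keras code; the batch size, the number of other columns, and all values are assumptions). The point is the shapes: 3 embedding values and 16 dense outputs are joined into 19 features per sample:

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_types, type_embedding_dim, num_rest = 4, 5, 3, 6

# Type branch: embedding lookup, flat after the reshape step
type_tokens = rng.integers(0, num_types, size=batch_size)
embedding_matrix = rng.normal(size=(num_types, type_embedding_dim))
type_out = embedding_matrix[type_tokens]               # shape (4, 3)

# Rest branch: dense layer with 16 units and no activation
rest_in = rng.normal(size=(batch_size, num_rest))
weights = rng.normal(size=(num_rest, 16))
bias = np.zeros(16)
rest_out = rest_in @ weights + bias                    # shape (4, 16)

# Concatenation joins the branch outputs along the feature axis
concatenated = np.concatenate([type_out, rest_out], axis=1)

print(type_out.shape, rest_out.shape, concatenated.shape)
```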

Now we can build and compile the overall model:

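x = Dense(16)(concatenated)
x = Activation('sigmoid')(x)
x = Dense(1)(x)  # feed the hidden layer's output, not `concatenated`, into the final layer
model_out = Activation('sigmoid')(x)
merged_model = Model(inputs, model_out)

# The excerpt does not show the compile call; these settings are assumptions
merged_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])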




Training the Model

To train a model with multiple inputs, we need to provide a list of X values with one entry per input. So we first split up our data frame:

types = df['type']

rest = df[other_cols]

target = df['isFraud'] 
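Before training, it can help to check that the pieces line up. The sketch below repeats the split on a tiny invented data frame (the columns amount and oldBalance and all values are assumptions) and shows that the list holds one array per input, in the same order as the model's inputs:

```python
import pandas as pd

# Toy data with already tokenized types, invented for illustration only
toy = pd.DataFrame({
    "type": [0, 1, 0, 2],
    "amount": [10.0, 200.0, 35.5, 80.0],
    "oldBalance": [50.0, 500.0, 40.0, 80.0],
    "isFraud": [0, 1, 0, 0],
})
other_cols = [c for c in toy.columns if c not in ("type", "isFraud")]

# One array per model input: first the types, then all other columns
x_inputs = [toy["type"].values, toy[other_cols].values]
y = toy["isFraud"].values

shapes = [x.shape for x in x_inputs]
print(shapes, y.shape)
```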

We then train the model by providing a list of the two inputs and the target:

history = merged_model.fit([types.values, rest.values], target.values,
                           epochs=1, batch_size=128)


Epoch 1/1
6362620/6362620 [==============================] - 78s 12us/step - loss: 0.0208 - acc: 0.9987

In this article, we have taken a structured data problem from raw data to strong predictive models. We have learned about end-to-end modeling.

Packt’s upcoming book, Machine Learning for Finance, is a guide for practitioners in the fintech field. You'll learn how to analyze financial data using modern machine learning techniques. Jannes Klaas takes you through data munging, generative adversarial networks, deep learning algorithms, and reinforcement learning policies to work with sophisticated approaches to modern finance.

Have a look at the Preview of the book and get a glimpse of what it has to offer.
