Crime Analysis Using H2O Autoencoders (Part 1)
Learn how to build an analytical pipeline and apply deep learning to predict the status of crimes happening in Los Angeles.
Nowadays, deep learning (DL) and machine learning (ML) are widely used to analyze data and make accurate predictions. Machine learning models can be applied to predict crimes; crime prediction not only helps in crime prevention but also enhances public safety. An autoencoder, a simple three-layer neural network, is used here for dimensionality reduction and for extracting key features from the data.
Data engineers spend considerable time building analytical models with proper validation metrics to improve model performance, while data analysts build data pipelines as part of big data analytics. Machine learning models are developed within these pipelines with their own functionality and features. Once passed through the analytical pipeline, these models can easily be deployed for real-time processing.
This blog is Part 1 of a two-part series of crime analysis using H2O autoencoders. In this blog, let's discuss building the analytical pipeline and applying deep learning to predict the status of crimes happening in Los Angeles.
Install the following in R:
- Command to install:
install.packages("h2o", type = "source", repos = "https://h2o-release.s3.amazonaws.com/h2o/rel-weierstrass/2/R")
A Los Angeles crime dataset covering 2016 to 2017, with 224K records and 27 attributes, is used as the source file. This dataset is an open data resource for governments, non-profit organizations, and NGOs.
- Predict the arrest status of crimes happening in Los Angeles.
- Achieve an analytical pipeline.
- Analyze the performance of autoencoders.
- Build deep learning and machine learning models.
- Apply required mechanisms to increase the performance of the models.
- Access data
- Prepare data
- Clean data
- Preprocess data
- Perform exploratory data analysis (EDA)
- Build a machine learning model
- Initialize H2O cluster
- Impute data
- Train model
- Validate model
- Execute model
- Pre-trained supervised model
The crime dataset is obtained from here and imported into the database. Socrata APIs provide rich query functionality through a query language called Socrata Query Language, or SoQL. The data structure is as follows:
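As a rough sketch of how such a dataset can be pulled through a Socrata endpoint with a SoQL filter, the snippet below uses the RSocrata package; the resource URL and date filter are illustrative, not necessarily the exact ones used for this article:

```r
# Sketch: fetching crime records via the Socrata API with a SoQL $where filter.
# The endpoint URL and column names below are assumptions for illustration.
library(RSocrata)

crime <- read.socrata(
  "https://data.lacity.org/resource/63jg-8b9z.json?$where=date_occ between '2016-01-01' and '2017-12-31'"
)
str(crime)  # inspect the imported attributes
```

read.socrata returns a plain R data frame, which can then be imported into a database or handed to the rest of the pipeline.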
In this section, let's discuss data preparation for building a model.
Data cleansing is performed to find NA values in the dataset. These NA values should be either removed or imputed with some imputation techniques to get the desired data.
To get the count of NA values and view the results, use the below commands:
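A minimal sketch of those commands, assuming the dataset has been loaded into a data frame named crime, is:

```r
# Count NA values in each column of the crime data frame.
na_counts <- sapply(crime, function(col) sum(is.na(col)))

# Show only the columns that actually contain NA values.
na_counts[na_counts > 0]

# View the counts as a table in the RStudio viewer.
View(data.frame(column = names(na_counts), na_count = na_counts))
```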
The total number of NA values for each column:
From the above diagram, it is evident that attributes such as weapon_used_cd are repeated and should be removed. These attributes are removed from the dataset.
Data preprocessing such as data type conversion, date conversion, month, year, and week derivation from the date field, new attributes derivation, and so on is performed on the dataset. The date attribute is converted from factor to POSIXct object. The lubridate package is used to get various fields such as the month, year, and week using this object. The chron package is used along with the time attribute to derive crime time interval (morning, afternoon, midnight, and so on).
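A sketch of this preprocessing is shown below; the column names date_occ and time_occ are assumptions for illustration, and for brevity the time interval is derived by binning the hour directly rather than through chron's time objects:

```r
library(lubridate)

# Convert the date attribute from factor to POSIXct (format is assumed).
crime$date_occ <- as.POSIXct(as.character(crime$date_occ), format = "%m/%d/%Y")

# Derive month, year, and week fields with lubridate.
crime$month <- month(crime$date_occ, label = TRUE)
crime$year  <- year(crime$date_occ)
crime$week  <- week(crime$date_occ)

# Derive a crime time interval; time_occ is assumed to be in HHMM form.
hour <- as.numeric(substr(sprintf("%04d", as.numeric(crime$time_occ)), 1, 2))
crime$time_interval <- cut(hour,
                           breaks = c(-1, 5, 11, 17, 23),
                           labels = c("midnight", "morning",
                                      "afternoon", "evening"))
```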
Performing Exploratory Data Analysis
EDA is performed on the crime dataset to uncover useful patterns, such as the following:
Top 20 crimes in Los Angeles:
Month with highest crimes:
Area with highest crime percentage:
Top 10 descent groups being affected:
Top 10 frequently used weapons for crime:
Safest living places in Los Angeles:
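As one example of how these views can be produced, the sketch below plots the top 20 crime types with dplyr and ggplot2; the column name crm_cd_desc is an assumption for illustration:

```r
library(dplyr)
library(ggplot2)

# Top 20 crime types by frequency (crm_cd_desc is an illustrative name).
top_crimes <- crime %>%
  count(crm_cd_desc, sort = TRUE) %>%
  head(20)

# Horizontal bar chart, most frequent crime at the top.
ggplot(top_crimes, aes(x = reorder(crm_cd_desc, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Crime type", y = "Count",
       title = "Top 20 crimes in Los Angeles")
```

The other views (crimes by month, by area, by descent group, and by weapon) follow the same count-and-plot pattern on different columns.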
Building a Machine Learning Model
In this section, let's discuss building the best machine learning model for our dataset using machine learning algorithms.
Initializing H2O Cluster
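Before any H2O operation, a local cluster is started and the prepared data frame is imported into it. A minimal sketch (memory size is an arbitrary choice):

```r
library(h2o)

# Start (or connect to) a local H2O cluster; nthreads = -1 uses all cores.
h2o.init(nthreads = -1, max_mem_size = "4g")

# Import the prepared R data frame into the cluster as an H2OFrame.
crime_hex <- as.h2o(crime, destination_frame = "crime_hex")
```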
In H2O, data imputation is performed using h2o.impute() to fill NA values with default methods such as mean, median, and mode. The method is chosen based on the data type of each column; for example, factor (categorical) columns are imputed using the mode method.
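A sketch of such imputation, with illustrative column names, is:

```r
# Impute NA values column by column on the H2OFrame.
# Numeric columns take the mean; categorical columns take the mode.
# vict_age and vict_sex are illustrative column names.
h2o.impute(crime_hex, column = "vict_age", method = "mean")
h2o.impute(crime_hex, column = "vict_sex", method = "mode")
```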
The dataset is split into train, test, and validation frames based on ratios specified with h2o.splitFrame(). Each frame is then assigned to a separate variable.
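A sketch of the split (the ratios here are illustrative; the remainder after the listed ratios forms the last frame):

```r
# Split the H2OFrame into three parts: 60% train, 20% validation, 20% test.
splits <- h2o.splitFrame(crime_hex, ratios = c(0.6, 0.2), seed = 42)

train <- splits[[1]]
valid <- splits[[2]]
test  <- splits[[3]]
```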
To train the model, perform the following:
- Take the data pertaining to the year 2016 as the training set.
- Take the data pertaining to the year 2017 as the test set.
- Apply deep learning to the model.
- Perform unsupervised classification methods to predict the arrest status of the crimes.
- Make the autoencoder model learn the patterns of the input data regardless of the given class labels.
- Make the model learn the status behavior based on the features.
Functions Used to Apply Deep Learning to Our Data
@param x: Features used by the model.
@param training_frame: Dataset on which the model is trained.
@param model_id: String identifying the model for saving and loading.
@param seed: Seed for reproducibility.
@param hidden: Sizes of the hidden layers (the number of neurons in each).
@param epochs: Number of passes the model makes over the dataset.
@param activation: String naming the activation function to use.
@param export_weights_and_biases: Whether to export the learned weights and biases as H2O frames.
@param autoencoder: Logical flag indicating whether to train the model as an autoencoder.
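Putting these parameters together, a sketch of the autoencoder call looks like the following; the layer sizes, epochs, and response column name are assumptions for illustration:

```r
# Exclude the response from the feature set; arrest_status is illustrative.
features <- setdiff(colnames(train), "arrest_status")

autoencoder <- h2o.deeplearning(
  x                         = features,
  training_frame            = train,
  model_id                  = "crime_autoencoder",
  autoencoder               = TRUE,          # unsupervised reconstruction
  hidden                    = c(10, 2, 10),  # bottleneck layer of 2 units
  epochs                    = 100,
  activation                = "Tanh",
  export_weights_and_biases = TRUE,
  seed                      = 1234
)

summary(autoencoder)  # model summary and training performance
```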
The above diagram shows the summary of our autoencoder model and its performance on the training set. Note that the model reports a Gaussian distribution rather than a binomial one: an autoencoder reconstructs its numeric input instead of directly solving the classification problem.
As the above results are not satisfactory, the dimensionality of the model is reduced to get better results. The features of one of the hidden layers are extracted, and the results are plotted to classify the arrest status using the h2o.deepfeatures() function from the H2O package.
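A sketch of this extraction and plot, assuming the autoencoder above with a 2-unit bottleneck in layer 2 and an illustrative arrest_status column:

```r
# Extract the 2-unit bottleneck representation (hidden layer 2).
deep_feats <- as.data.frame(h2o.deepfeatures(autoencoder, train, layer = 2))

# Attach the actual class labels for coloring the plot.
deep_feats$arrest_status <- as.vector(train$arrest_status)

# Plot the two deep features to see whether the classes separate.
library(ggplot2)
ggplot(deep_feats, aes(DF.L2.C1, DF.L2.C2, color = arrest_status)) +
  geom_point(alpha = 0.5)
```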
From the above results, the arrest status of the crimes that happened cannot be exactly obtained.
So, dimensionality reduction with the autoencoder model alone is not sufficient to identify the arrest status in this dataset. Instead, the lower-dimensional representation from one of the hidden layers is used as the feature set for model training: supervised classification is applied to the extracted features, and the results are tested.
To validate the performance of the model, the cross-validation parameters used while building it are used to plot ROC curves and compute the AUC value on the validation frames, giving a detailed overview of the model.
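A sketch of this validation step, assuming a supervised H2O model stored in a variable named model and the validation frame from the earlier split:

```r
# Evaluate the supervised model on the validation frame.
perf <- h2o.performance(model, newdata = valid)

h2o.auc(perf)             # area under the ROC curve
plot(perf, type = "roc")  # ROC curve (sensitivity vs. 1 - specificity)
```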
To predict the arrest status of the crimes, perform the following:
- Apply the deep features to the dataset.
- Use our model to predict the arrest status.
- Plot the ROC curve with AUC values based on sensitivity and specificity.
- Group the results based on the predicted and actual values with the total number of classes and its frequencies.
- Decide the performance of our model on the arrest status of the crimes.
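The steps above can be sketched as follows, assuming the autoencoder and a supervised model (model) trained on its deep features, with arrest_status as an illustrative response name:

```r
# Apply the same deep-feature extraction to the test set.
test_feats <- h2o.deepfeatures(autoencoder, test, layer = 2)

# Predict the arrest status from the deep features.
pred   <- as.data.frame(h2o.predict(model, test_feats))
actual <- as.vector(test$arrest_status)

# Group results: predicted vs. actual classes and their frequencies.
table(predicted = pred$predict, actual = actual)
```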
From the above diagram, the predicted number of Not Arrested cases is 28 and the predicted number of Arrested cases is 150. Since these counts are low, the model may have trouble maintaining the historical records reliably when used in real time.
Pretrained Supervised Model
The autoencoder model is used as a pre-training input for a supervised model, and its weights are used for model fitting. The same training and validation sets are used for the supervised model, and a parameter called pretrained_autoencoder is added to the model along with the autoencoder model's name.
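A sketch of the pre-trained supervised model; the response name and hyperparameters are illustrative, and the hidden layer sizes must match those of the autoencoder:

```r
pretrained_model <- h2o.deeplearning(
  x                      = features,
  y                      = "arrest_status",      # illustrative response name
  training_frame         = train,
  validation_frame       = valid,
  model_id               = "crime_pretrained",
  pretrained_autoencoder = "crime_autoencoder",  # reuse autoencoder weights
  hidden                 = c(10, 2, 10),         # must match the autoencoder
  epochs                 = 100,
  activation             = "Tanh",
  seed                   = 1234
)
```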
This pre-trained model is used to predict the results of our new data and to find the probability of classes for our new data.
The results are grouped based on the actual and predicted values, and the performance of our model is decided based on the arrest status of the crimes.
From the above results, it is evident that there are only minor differences from the previous results obtained with the lower-dimensional representation. Let's plot the ROC curves and AUC values to compare the two.
In this blog, we discussed creating the analytical pipeline for the Los Angeles crime dataset, applying the autoencoders to the dataset, performing both unsupervised and supervised classifications, extracting the dimensionality representation of our model, and applying the supervised model.
In our next blog, we'll discuss deploying the model by converting it into POJO/MOJO objects with the help of H2O functions.
Published at DZone with permission of Rathnadevi Manivannan. See the original article here.