Crime Analysis Using H2O Autoencoders (Part 1)

Learn how to build an analytical pipeline and apply deep learning to predict the status of crimes happening in Los Angeles.

Deep learning (DL) and machine learning (ML) are now widely used to analyze data and make predictions, including predicting crimes. Crime prediction not only helps in crime prevention but also enhances public safety. An autoencoder, a simple three-layer neural network, is used for dimensionality reduction and for extracting key features from the data.

Data engineers spend much of their time building analytic models with proper validation metrics in order to improve model performance. Data analysts spend time building data pipelines as a part of big data analytics. Machine learning models are developed in these pipelines with their own functionalities and features. Once the models have passed through the analytical pipeline, they can be deployed easily for real-time processing.

This blog is Part 1 of a two-part series on crime analysis using H2O autoencoders. In this blog, let's discuss building the analytical pipeline and applying deep learning to predict the arrest status of crimes happening in Los Angeles.


Install the following in R:

  • H2O
    • Command to install:
    install.packages("h2o", type="source", repos="https://h2o-release.s3.amazonaws.com/h2o/rel-weierstrass/2/R")
  • dplyr
  • ggplot2
  • lubridate
  • chron
  • pROC

Dataset Description

A crime dataset of Los Angeles, from 2016-2017, with 224K records and 27 attributes is used as the source file. This dataset is an open data resource for governments, non-profit organizations, and NGOs.

Sample dataset:

Use Case

  • Predict the arrest status of crimes happening in Los Angeles.
  • Achieve an analytical pipeline.
  • Analyze the performance of autoencoders.
  • Build deep learning and machine learning models.
  • Apply required mechanisms to increase the performance of the models.


  • Access data
  • Prepare data
    • Clean data
    • Preprocess data
  • Perform exploratory data analysis (EDA)
  • Build a machine learning model
    • Initialize H2O cluster
    • Impute data
    • Train model
  • Validate model
  • Execute model
    • Pre-trained supervised model

Accessing Data

The crime dataset is obtained from the City of Los Angeles open data API and imported into the database. Socrata APIs provide rich query functionality through a query language called Socrata Query Language, or SoQL. The data structure is as follows:
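As a rough illustration of how a SoQL query can be expressed from R, the sketch below only builds a request URL from the standard `$where` and `$limit` parameters. The `<dataset-id>` in the endpoint is a deliberate placeholder, and the `date_occ` column name and the `build_soql_url()` helper are assumptions made for this example, not part of the original pipeline.

```r
# Hypothetical endpoint: replace <dataset-id> with the identifier of the
# crime dataset on the open data portal.
base_url <- "https://data.lacity.org/resource/<dataset-id>.csv"

# Illustrative helper (not from the original post): appends SoQL
# parameters to a Socrata resource URL.
build_soql_url <- function(base_url, where = NULL, limit = NULL) {
  params <- c()
  if (!is.null(where)) {
    # URL-encode the filter expression so spaces and quotes are safe
    params <- c(params, paste0("$where=", URLencode(where, reserved = TRUE)))
  }
  if (!is.null(limit)) {
    params <- c(params, paste0("$limit=", as.integer(limit)))
  }
  if (length(params) == 0) return(base_url)
  paste0(base_url, "?", paste(params, collapse = "&"))
}

# Assumed column name: date_occ (date on which the crime occurred)
query_url <- build_soql_url(
  base_url,
  where = "date_occ >= '2016-01-01T00:00:00' AND date_occ < '2018-01-01T00:00:00'",
  limit = 1000
)
print(query_url)

# Uncomment once the placeholder has been replaced with a real dataset ID:
# crime <- read.csv(query_url, stringsAsFactors = FALSE)
```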

Preparing Data

In this section, let's discuss data preparation for building a model.

Cleansing Data

Data cleansing is performed to find NA values in the dataset. These NA values should be either removed or imputed with some imputation techniques to get the desired data.

To get the count of NA values and view the results, use the following commands:
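The original commands are not reproduced here, but one common way to count NA values per column in base R looks like the sketch below. The small `crime` data frame is made-up toy data used only so the snippet runs on its own.

```r
# Toy stand-in for the crime dataset (values are made up)
crime <- data.frame(
  dr_no          = c(1, 2, 3, 4),
  crm_cd_2       = c(NA, NA, 998, NA),
  weapon_used_cd = c(NA, 400, NA, NA),
  stringsAsFactors = FALSE
)

# Count the NA values in every column
na_counts <- colSums(is.na(crime))

# Arrange the counts as a table, largest first, for easier inspection
na_summary <- data.frame(
  attribute = names(na_counts),
  na_count  = as.integer(na_counts),
  row.names = NULL
)
na_summary <- na_summary[order(-na_summary$na_count), ]
print(na_summary)
```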

The total number of NA values for each column:


From the above diagram, it is evident that attributes such as crm_cd_2, crm_cd_3, crm_cd_4, cross_street, premis_cd, and weapon_used_cd are repeated, so these attributes are removed from the dataset.

Preprocessing Data

Data preprocessing is performed on the dataset. It includes data type conversion, date conversion, deriving the month, year, and week from the date field, and deriving other new attributes. The date attribute is converted from a factor to a POSIXct object. The lubridate package is used to get fields such as the month, year, and week from this object. The chron package is used along with the time attribute to derive the crime time interval (morning, afternoon, midnight, and so on).
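A minimal sketch of these derivations is shown below. It assumes a `date_occ` column in YYYY-MM-DD form and a `time_occ` column in "HHMM" form, and it uses base R's `cut()` in place of chron for the time interval; the interval boundaries and labels are illustrative choices, not the ones used in the original analysis.

```r
library(lubridate)

# Toy rows; column names and formats are assumptions for this example
crime <- data.frame(
  date_occ = c("2016-03-15", "2017-11-02"),
  time_occ = c("0830", "2315"),
  stringsAsFactors = FALSE
)

# Convert the date to POSIXct, then derive year, month, and week
crime$date_occ <- as.POSIXct(crime$date_occ, format = "%Y-%m-%d", tz = "UTC")
crime$year  <- year(crime$date_occ)
crime$month <- month(crime$date_occ)
crime$week  <- week(crime$date_occ)

# Derive the hour from the "HHMM" string and bucket it into intervals
# (boundaries below are assumed, not taken from the original post)
crime$hour <- as.integer(substr(crime$time_occ, 1, 2))
crime$time_interval <- cut(
  crime$hour,
  breaks = c(0, 6, 12, 18, 24),
  labels = c("Midnight", "Morning", "Afternoon", "Evening"),
  right = FALSE,
  include.lowest = TRUE
)
print(crime)
```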

Performing Exploratory Data Analysis

EDA is performed on the crime dataset to draw useful insights from it.

Top 20 crimes in Los Angeles:
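A ranking like this can be produced with dplyr and ggplot2 roughly as follows. The `crm_cd_desc` column name and the toy rows are assumptions for the sake of a runnable example.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the crime dataset (assumed column: crm_cd_desc)
crime <- data.frame(
  crm_cd_desc = c("BURGLARY", "BATTERY - SIMPLE ASSAULT", "BURGLARY",
                  "VEHICLE - STOLEN", "BURGLARY", "VEHICLE - STOLEN"),
  stringsAsFactors = FALSE
)

# Count incidents per crime type and keep the 20 most frequent
top_crimes <- crime %>%
  count(crm_cd_desc, sort = TRUE, name = "incidents") %>%
  head(20)

# Horizontal bar chart, most frequent crime at the top
p <- ggplot(top_crimes, aes(x = reorder(crm_cd_desc, incidents), y = incidents)) +
  geom_col() +
  coord_flip() +
  labs(x = "Crime", y = "Number of incidents", title = "Top crimes")
print(p)
```

The same count-and-plot pattern applies to the other charts in this section by swapping the grouping column.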

Crime timings:


Month with highest crimes:


Area with highest crime percentage:


Top 10 descent groups being affected:


Top 10 frequently used weapons for crime:


Safest living places in Los Angeles:


Building a Machine Learning Model

In this section, let's discuss building the best machine learning model for our dataset using machine learning algorithms.

Initializing H2O Cluster

Before imputing the data, start an H2O cluster on port 12345 using h2o.init(). The cluster can then be accessed at http://localhost:12345/flow/index.html#.
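A minimal sketch of the initialization is shown below. The `nthreads = -1` setting (use all available cores) is an assumption added for this example.

```r
library(h2o)

# Start (or connect to) a local H2O cluster on the port used in this post
h2o.init(ip = "localhost", port = 12345, nthreads = -1)

# Optional: print cluster details to confirm it is reachable
h2o.clusterInfo()
```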

Imputing Data

In H2O, data imputation is performed using h2o.impute() to fill the NA values using default methods such as mean, median, and mode. The method is chosen based on the data type of each column. For example, factor or categorical columns are imputed using the mode method.
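The sketch below shows the idea on a toy frame, assuming one numeric and one categorical column. The column names `vict_age` and `vict_sex` and the values are illustrative only.

```r
library(h2o)
h2o.init(port = 12345)

# Toy frame with one numeric and one categorical column (made-up values)
toy <- data.frame(
  vict_age = c(25, NA, 40, 31),
  vict_sex = factor(c("M", "F", NA, "F"))
)
crime_hf <- as.h2o(toy)

# h2o.impute() fills the NA values in place:
# mean for the numeric column, mode for the categorical column
h2o.impute(crime_hf, column = "vict_age", method = "mean")
h2o.impute(crime_hf, column = "vict_sex", method = "mode")

imputed <- as.data.frame(crime_hf)
print(imputed)
```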

The dependent variable is grouped based on the status codes of the crimes that occurred. The arrest status codes are grouped into Not Arrested and Arrested.
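One way to express this grouping in R is sketched below. The mapping is an assumption: it treats the codes "AA" and "JA" as arrests and everything else as Not Arrested, so check it against the dataset's attribute description before relying on it.

```r
# Toy status codes (made-up rows)
crime <- data.frame(
  status = c("AA", "IC", "JA", "AO", "IC"),
  stringsAsFactors = FALSE
)

# ASSUMPTION: these codes denote an arrest; all other codes do not
arrest_codes <- c("AA", "JA")

crime$arrest_status <- factor(
  ifelse(crime$status %in% arrest_codes, "Arrested", "Not Arrested"),
  levels = c("Not Arrested", "Arrested")
)
print(table(crime$arrest_status))
```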

Training Model

The dataset is split into Train, Test, and Validation frames based on the ratios passed to h2o.splitFrame(). Each frame is assigned to a separate variable using h2o.assign().
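The split can be sketched as follows. The 60/20/20 ratios, the seed, and the toy data are assumptions for illustration; the original post does not state the ratios it used.

```r
library(h2o)
h2o.init(port = 12345)

# Toy data standing in for the prepared crime dataset
set.seed(42)
toy <- data.frame(
  x1 = rnorm(200),
  x2 = rnorm(200),
  arrest_status = factor(sample(c("Arrested", "Not Arrested"), 200, replace = TRUE))
)
crime_hf <- as.h2o(toy)

# Two ratios give three frames; the remainder (about 20%) is the last one
splits <- h2o.splitFrame(crime_hf, ratios = c(0.6, 0.2), seed = 42)

# Give each frame its own key in the cluster and its own R variable
train <- h2o.assign(splits[[1]], "train")
valid <- h2o.assign(splits[[2]], "valid")
test  <- h2o.assign(splits[[3]], "test")

print(c(train = nrow(train), valid = nrow(valid), test = nrow(test)))
```

Note that h2o.splitFrame() splits probabilistically, so the frame sizes only approximate the requested ratios.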

To train the model, perform the following:

  • Take the data pertaining to the year 2016 as the training set.
  • Take the data pertaining to the year 2017 as the test set.
  • Apply deep learning to the model.
  • Perform unsupervised classification methods to predict the arrest status of the crimes.
  • Make the autoencoder model learn the patterns of the input data regardless of the given class labels.
  • Make the model learn the status behavior based on the features.

Functions Used to Apply Deep Learning to Our Data

  • @param x: Features for our model

  • @param training_frame: Dataset on which the model is trained.

  • @param model_id: String identifying our model, used to save and load it.

  • @param seed: For reproducibility.

  • @param hidden: Sizes of the hidden layers (one number of units per layer).

  • @param epochs: Number of passes the model makes over the training dataset.

  • @param activation: A string representing the activation function to be used.

  • @params stopping_rounds, stopping_metric, export_weights_and_biases: Used for early stopping and for exporting the weights and biases of the model.

  • @param autoencoder: Logical value indicating whether an autoencoder should be trained or not.
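Putting these parameters together, a call to h2o.deeplearning() for an autoencoder might look like the sketch below. The toy features, the `c(10, 2, 10)` layer sizes, the epoch count, and the "Tanh" activation are assumptions for illustration, not the settings confirmed by the original post.

```r
library(h2o)
h2o.init(port = 12345)

# Toy numeric features standing in for the prepared crime data
set.seed(42)
toy <- data.frame(f1 = rnorm(300), f2 = rnorm(300), f3 = rnorm(300), f4 = rnorm(300))
train <- as.h2o(toy)
features <- colnames(train)

# Unsupervised autoencoder: no response column (y) is supplied
model_nn <- h2o.deeplearning(
  x = features,
  training_frame = train,
  model_id = "model_nn",
  autoencoder = TRUE,
  seed = 42,
  hidden = c(10, 2, 10),  # assumed: 2-unit bottleneck in the middle
  epochs = 100,
  activation = "Tanh"
)

summary(model_nn)
```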



The above diagram shows the summary of our autoencoder model and its performance on our training set. Note that the summary reports a Gaussian distribution rather than a binomial classification, because the autoencoder is trained to reconstruct its inputs and does not use the class labels.

As the above results are not satisfactory, dimensionality reduction is applied to get better results. The features of one of the hidden layers are extracted using the h2o.deepfeatures() function in the H2O package, and the results are plotted to classify the arrest status.
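A sketch of this extraction step is shown below. It assumes an autoencoder with a two-unit middle layer so that the extracted features can be plotted directly; the toy data, layer sizes, and column names are illustrative.

```r
library(h2o)
library(ggplot2)
h2o.init(port = 12345)

# Toy data: four numeric features plus a label kept on the R side
set.seed(42)
toy <- data.frame(f1 = rnorm(300), f2 = rnorm(300), f3 = rnorm(300), f4 = rnorm(300))
toy$arrest_status <- factor(sample(c("Arrested", "Not Arrested"), 300, replace = TRUE))
features <- c("f1", "f2", "f3", "f4")
train <- as.h2o(toy[, features])

model_nn <- h2o.deeplearning(
  x = features, training_frame = train, autoencoder = TRUE,
  hidden = c(10, 2, 10), epochs = 50, activation = "Tanh", seed = 42
)

# Layer 2 is the assumed 2-unit bottleneck; extract its activations
deep_feats <- as.data.frame(h2o.deepfeatures(model_nn, train, layer = 2))
colnames(deep_feats) <- c("dim1", "dim2")
deep_feats$arrest_status <- toy$arrest_status

# Scatter plot of the two deep features, colored by the known label
p <- ggplot(deep_feats, aes(x = dim1, y = dim2, color = arrest_status)) +
  geom_point(alpha = 0.5)
print(p)
```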

From the above results, the arrest status of the crimes cannot be clearly distinguished.


So, dimensionality reduction with our autoencoder model alone is not sufficient to identify the arrest status in this dataset. Instead, the reduced-dimension representation from one of our hidden layers is used as the set of features for model training. Supervised classification is applied to the extracted features and the results are tested.


Validating Model

To validate the performance of our model, the validation parameters set while building the model are used to plot the ROC curves and get the AUC value on our validation frame. A detailed overview of our model is obtained using the summary() function.
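As a small illustration of the ROC and AUC step, the sketch below uses the pROC package on made-up vectors. Here `actual` and `predicted_prob` stand in for the true arrest status (1 = Arrested) and the model's predicted probability on the validation frame.

```r
library(pROC)

# Made-up values standing in for validation labels and model probabilities
actual         <- c(0, 0, 1, 1)
predicted_prob <- c(0.1, 0.4, 0.35, 0.8)

# Higher probabilities are expected for the positive class (direction "<")
roc_obj <- roc(
  response  = actual,
  predictor = predicted_prob,
  levels    = c(0, 1),
  direction = "<"
)

auc_value <- as.numeric(auc(roc_obj))
print(auc_value)

plot(roc_obj, main = "ROC curve (toy data)")
```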

Executing Model

To predict the arrest status of the crimes, perform the following:

    • Apply the deep features to the dataset.
    • Use our model to predict the arrest status.


    • Plot the ROC curve with AUC values based on sensitivity and specificity.


    • Group the results based on the predicted and actual values with the total number of classes and its frequencies.
    • Decide the performance of our model on the arrest status of the crimes.
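The grouping of predicted against actual values can be sketched in base R as below. The `results` data frame is made-up toy data; in practice it would hold the model's predictions next to the true arrest status.

```r
# Toy predictions next to the true labels (made-up rows)
results <- data.frame(
  actual    = c("Arrested", "Not Arrested", "Not Arrested", "Arrested", "Not Arrested"),
  predicted = c("Arrested", "Not Arrested", "Arrested", "Not Arrested", "Not Arrested"),
  stringsAsFactors = FALSE
)

# Count every actual/predicted combination
confusion <- as.data.frame(
  table(actual = results$actual, predicted = results$predicted),
  responseName = "n"
)

# Frequency of each prediction within its actual class
confusion$freq <- confusion$n / ave(confusion$n, confusion$actual, FUN = sum)
print(confusion)
```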


From the above diagram, the predicted number of Not Arrested cases is 28 and the predicted number of Arrested cases is 150. As these numbers are low, this model will cause a slight problem in maintaining the historical records when used in real time.

Pretrained Supervised Model

The autoencoder model is used as a pre-training input for a supervised model, and its weights are used for model fitting. The same training and validation sets are used for the supervised model. The pretrained_autoencoder parameter is added to our model and set to the name (model ID) of the autoencoder model.
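A sketch of this two-stage setup is shown below, assuming the supervised network uses the same hidden layer sizes as the autoencoder. The toy data, layer sizes, and epoch counts are illustrative, not the settings used in the original post.

```r
library(h2o)
h2o.init(port = 12345)

# Toy data with a label that depends loosely on the first feature
set.seed(42)
n <- 400
toy <- data.frame(f1 = rnorm(n), f2 = rnorm(n), f3 = rnorm(n))
toy$arrest_status <- factor(
  ifelse(toy$f1 + rnorm(n, sd = 0.5) > 0, "Arrested", "Not Arrested")
)
crime_hf <- as.h2o(toy)
splits <- h2o.splitFrame(crime_hf, ratios = 0.8, seed = 42)
train <- splits[[1]]
valid <- splits[[2]]
features <- c("f1", "f2", "f3")
response <- "arrest_status"

# Stage 1: unsupervised autoencoder on the features only
model_nn <- h2o.deeplearning(
  x = features, training_frame = train, model_id = "model_nn",
  autoencoder = TRUE, hidden = c(10, 2, 10), epochs = 50,
  activation = "Tanh", seed = 42
)

# Stage 2: supervised model initialized from the autoencoder's weights
model_pretrained <- h2o.deeplearning(
  y = response, x = features,
  training_frame = train, validation_frame = valid,
  pretrained_autoencoder = "model_nn",
  hidden = c(10, 2, 10), epochs = 50,
  activation = "Tanh", seed = 42
)

# Predicted class and class probabilities for the validation frame
pred <- as.data.frame(h2o.predict(model_pretrained, valid))
print(head(pred))
```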


This pre-trained model is used to predict the results of our new data and to find the probability of classes for our new data.


The results are grouped based on the actual and predicted values, and the performance of our model is decided based on the arrest status of the crimes.


From the above results, it is evident that there are only minor changes in the results from our previous results with the dimensionality representation. Let's plot the ROC curves and AUC values to compare both the results.




In this blog, we discussed creating the analytical pipeline for the Los Angeles crime dataset, applying the autoencoders to the dataset, performing both unsupervised and supervised classifications, extracting the dimensionality representation of our model, and applying the supervised model.

In our next blog, we'll discuss deploying the model by converting it into POJO/MOJO objects with the help of H2O functions.


  • Sample dataset
  • Sample dataset attribute description
  • City of Los Angeles dataset API
  • Queries using SODA
  • Building deep neural nets with H2O
  • H2O deep learning
  • Sample dataset in GitHub

    Published at DZone with permission of
