DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

The software you build is only as secure as the code that powers it. Learn how malicious code creeps into your software supply chain.

Apache Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workloads.

Related

  • Explainable AI: Making the Black Box Transparent
  • AI, ML, and Data Science: Shaping the Future of Automation
  • Dark Data: Recovering the Lost Opportunities
  • Data Processing With Python: Choosing Between MPI and Spark

Trending

  • Apple and Anthropic Partner on AI-Powered Vibe-Coding Tool – Public Release TBD
  • Secrets Sprawl and AI: Why Your Non-Human Identities Need Attention Before You Deploy That LLM
  • Distributed Consensus: Paxos vs. Raft and Modern Implementations
  • Securing the Future: Best Practices for Privacy and Data Governance in LLMOps
  1. DZone
  2. Data Engineering
  3. Big Data
  4. Anomaly Detection Using H2O Deep Learning

Anomaly Detection Using H2O Deep Learning

In this article, we jump straight into creating an anomaly detection model using Deep Learning and anomaly package from H2O.

By 
Sibanjan Das user avatar
Sibanjan Das
DZone Core CORE ·
Dec. 06, 16 · Tutorial
Likes (7)
Comment
Save
Tweet
Share
25.4K Views

Join the DZone community and get the full member experience.

Join For Free

In a previous article, we had an overview of the applications of Deep Learning and touched upon some basic points to consider while creating a Deep Learning model. We also had an overview of what it is and methods to get started with deep learning. In this article, we jump straight into creating an anomaly detection model using Deep Learning and anomaly package from H2O. 

Anomaly Detection

Readers who don't know what it is can view it as anything that occurs unexpected and is a rare event. It is a deviation from the standard pattern and does not confirm to the usual behavior of the data. Let's say we work in a steel manufacturing industry, and we see the quality of the steel suddenly drops down below the permissible limits. This is an anomaly; if not detected and resolved soon will cost the organization millions. So to detect an anomaly at an early stage of its occurrence is very crucial.

There are various methods for anomaly detection which can be either rule-based or machine-learning-based detection systems. Rule-based systems are usually designed by defining rules that describe an anomaly and assigning thresholds and limits. For example, anything above the threshold or below is classified as an anomaly. Rule-based systems are designed from the experience of the industry experts and can be thought of as systems that are designed to detect “known anomalies”. They are "known anomalies" because we recognize what is usual and what is not usual.

However, if there is something that is not known and has never been discovered so far, then creating rules for detecting anomalies is difficult. In such cases, machine learning-based detection systems come in handy. They are a bit complicated but can deal with many uncertain situations. Mostly, they can deduce the patterns that are unusual and alert the users. The one technique we demonstrate here is using H2O's autoencoder deep learning with anomaly package.

For some more techniques for anomaly detection such as one class SVM, you can refer my upcoming book on Data Science using Oracle Data Miner and Oracle R Enterprise published by Apress (Aforementioned is a short excerpt from my book).

The Dataset Used for This Demo

The data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and ten aspects of automobile design and performance for 32 automobiles (1973–74 models). The Data Set was modified a bit to introduce some data anomalies.The data set and code for this demonstration were prepared for the meetup Data Science for IoT using H2O, London by me, Ajit Jaokar and Jo-Fai Chow from H2O.ai who had guided me on H2O stuff. However, I missed attending the D-Day. 

We can install and get started using H2O in R using the below script.

 Note: We are using H2O with R for the demo. However, the beauty of H2O is that you can kick it up for your machine learning tasks using any of the popular languages - Python and Java.

So, let's get started with detecting the anomalies and separate the unknowns from the knowns. 

Installation

If you have an older version of H2O installed, this script will remove it and install the latest one.

# ------------------------------------------------------------------------------
# Step 1: Install R Packages
#
# Install "h2o" for machine learning
# Reference: http://www.h2o.ai/download/h2o/r
# ------------------------------------------------------------------------------

# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Let's download packages that H2O depends on.
if (! ("methods" %in% rownames(installed.packages()))) { install.packages("methods") }
if (! ("statmod" %in% rownames(installed.packages()))) { install.packages("statmod") }
if (! ("stats" %in% rownames(installed.packages()))) { install.packages("stats") }
if (! ("graphics" %in% rownames(installed.packages()))) { install.packages("graphics") }
if (! ("RCurl" %in% rownames(installed.packages()))) { install.packages("RCurl") }
if (! ("jsonlite" %in% rownames(installed.packages()))) { install.packages("jsonlite") }
if (! ("tools" %in% rownames(installed.packages()))) { install.packages("tools") }
if (! ("utils" %in% rownames(installed.packages()))) { install.packages("utils") }

# Download, install and initialize the H2O package for R.
install.packages("h2o", type="source", 
                 repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-turing/10/R")))

# Test, if the H2O library is installed properly
suppressPackageStartupMessages(library(h2o))
h2o.init(nthreads = -1)

# Finally, let's run a demo on K-means clustering to see H2O at work.
demo(h2o.kmeans)

If you find difficulty in installing H2O using the above script, then download the tar file from http://h2o-release.s3.amazonaws.com/h2o/rel-turing/10/R/src/contrib/h2o_3.10.0.10.tar.gz  Install it by loading the downloaded package archive from tools -> install package as shown in below figure.

Image title

If you already have H2O installed, then we can skip running the above script and load the H2O libraries as shown in the below figure. The code h20.init() start the h2o server. It provides the method to specify the number of threads, IP and Port to be used by H20. By default, H2O is limited to the CRAN default of 2 CPUs. If you have more than 2 CPU, you can use all of those by passing the parameter nthreads=-1 to the h2o.init function.

library(h2o)
h2o.init()

After starting H2O, you can use and view every operation done over H2O in the Web UI at http://localhost:54321

Reading the Data

Next, we read the data into the mtcar variables and change the datatype of some variables to factors as shown below. We convert them into factors as there have few distinct values and so treating them as categories will help for better analysis.

Alternatively, we can use h2o.importFile and  h2o.uploadFile to read files. You can use h2o.uploadFile to upload a file from your local disk. It is a push from the client to the server and is only intended for smaller data sizes. h2o.importFile is another function which pulls information from the server from a location specified by the client. It is more suitable for big-data operations. Using h2o.importFile, you can also import files stored in HDFS as well as load an entire folder of files.

You can also use as.h2o() to import a local R data frame to the H2O. It serially uploads the file to H2O and so might be a bit slower than other file import methods.

mtcar <- read.csv("F:/h20/predicitve_mantainance/auto_design.csv")
mtcar$gear <- as.factor(mtcar$gear)  
mtcar$carb <- as.factor(mtcar$carb) 
mtcar$cyl <- as.factor(mtcar$cyl)  
mtcar$vs  <- as.factor(mtcar$vs)  
mtcar$am  <- as.factor(mtcar$am)
mtcar$ID  <- 1:nrow(mtcar)  
mtcar.hex  <- as.h2o(mtcar)

Training an Autoencoder

An autoencoder neural network is a class of Deep Learning that can be used for unsupervised learning.  Anomaly detection depends essentially on unsupervised techniques as we tend to find unknown from the knowns. So, we can say the data set for anomaly detection has just one class i.e. "known" and is not suitable for supervised learning. The autoencoder algorithm applies backpropagation, setting the target values to be equal to the inputs.

We can use the h2o.deeplearning library setting autoencoder= TRUE to train our autoencoder model.

To go through the complete list of parameters, one can refer r_h20_deep_learning reference page. Also, there is a great tutorial on Deep Learning using H2O by Erin LeDell, Data Scientist, H2O.ai

# Use a bigger DNN
mtcar.dl = h2o.deeplearning(x = 1:10, training_frame = mtcar.hex, autoencoder = TRUE,
                            hidden = c(50, 50, 50), epochs = 100,seed=1)

Image title

The h2o.anomaly function can be used to detect anomalies in a dataset. This function uses an H2O autoencoder model. This function reconstructs the original data set using the model and calculates the Mean Squared Error(MSE) for each data point.

errors <- h2o.anomaly(mtcar.dl, mtcar.hex, per_feature = TRUE)
print(errors)

Image title

Next, we plot the reconstruction.MSE and see where the model grapples to reconstruct the original records. This means that the model was not able to learn patterns for those records correctly which can be an anomaly.

Image title

It is normal for a good generalized model to show some variations in reconstruction. So, every little variation in reconstruction cannot be considered as an anomaly. We define a cutoff point to hold only records above reconstruction error as 0.099 as an anomaly.

# Outliers (define 0.099 as the cut-off point)
row_outliers <- which(errors > 0.099) # based on plot above
mtcar[row_outliers,]

Image title

We can notice that there are two records which were identified by the model as an anomaly. And, precisely, these were the records which we shaped to introduced defect in the mtcars data set.

H2O (web server) Deep learning Anomaly detection Data science Big data

Opinions expressed by DZone contributors are their own.

Related

  • Explainable AI: Making the Black Box Transparent
  • AI, ML, and Data Science: Shaping the Future of Automation
  • Dark Data: Recovering the Lost Opportunities
  • Data Processing With Python: Choosing Between MPI and Spark

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!