# Explainable AI (XAI) Design for Detecting Out-of-Distribution Samples and Adversarial Attacks

### In this article, we will discuss an interpretable prototype of an unsupervised deep convolutional neural network and autoencoders for anomaly detection.

## Introduction

Hello, friends.

In this article, we will discuss an interpretable and explainable prototype of an unsupervised deep convolutional neural network. We will also discuss LSTM autoencoder-based real-time anomaly detectors for high-dimensional, heterogeneous or homogeneous, multi-sensor time-series data.

Further, I will take you through the new features of the MSDA package. More details can be found on the GitHub page here.

## What's New in MSDA v1.10.0?

MSDA is an open-source, `low-code` multi-sensor data analysis library in Python that aims to reduce the hypothesis-to-insight cycle time in time-series multi-sensor data analysis and experiments. It enables users to perform end-to-end proof-of-concept experiments quickly and efficiently. The module identifies events in multidimensional time series by capturing variation and trend, establishes relationships that identify correlated features, and so helps with feature selection from raw sensor signals.

It can also detect anomalies in real-time streaming data using an unsupervised deep convolutional neural network or LSTM autoencoder-based detectors, both designed to run on GPU or CPU. Finally, a game-theoretic approach is used to explain the output of the built anomaly detector model.

The package includes:

- Time series analysis.
  - The variation of each sensor column with respect to time (increasing, decreasing, equal).
  - How each column's values vary with respect to every other column, and the maximum variation ratio between each pair of columns.
  - Relationship establishment with a trend array to identify the most appropriate sensor.
  - Users can select a window length and then check the average value and standard deviation across each window for each sensor column.
  - A count of growth/decay values for each sensor column above or below a threshold value.
- Feature engineering.
  - Features involving the trend of values across various aggregation windows: change and rate of change in average and standard deviation across a window.
  - Ratio of changes, growth rate with standard deviation.
  - Change over time.
  - Rate of change over time.
  - Growth or decay.
  - Rate of growth or decay.
  - Count of values above or below a threshold value.
- **Unsupervised deep time-series anomaly detector.**
- **Game-theoretic approach to explain the time-series data model.**
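To make the windowed features in the list above concrete, here is a minimal pandas sketch. It does not use the MSDA API; the function name `window_features`, the column names, and the toy sensor values are all hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd


def window_features(series: pd.Series, window: int, threshold: float) -> pd.DataFrame:
    """Rolling statistics and change-based features for one sensor column (illustrative only)."""
    out = pd.DataFrame(index=series.index)
    out["mean"] = series.rolling(window).mean()          # average across each window
    out["std"] = series.rolling(window).std()            # standard deviation across each window
    out["change"] = series.diff()                        # change over time
    out["rate_of_change"] = series.pct_change()          # relative change over time
    out["growth_or_decay"] = np.sign(out["change"])      # +1 growth, -1 decay, 0 equal
    # count of values above the threshold inside each window
    out["count_above"] = (series > threshold).astype(int).rolling(window).sum()
    return out


sensor = pd.Series([1.0, 2.0, 4.0, 3.0, 3.0, 6.0], name="sensor_a")
features = window_features(sensor, window=3, threshold=2.5)
print(features)
```

The first `window - 1` rows of the rolling columns are `NaN` because a full window is not yet available; how MSDA itself handles those edge rows is not covered here.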

MSDA is simple, easy to use, and low-code. The key features are shown in the figure below:

## Who Should Use MSDA?

MSDA is an open-source library that anybody can use. In our view, the ideal target audience of MSDA is:

- Researchers for quick POC testing
- Experienced data scientists who want to increase productivity
- Citizen data scientists who prefer a low code solution
- Students of data science
- Data science professionals and consultants involved in building Proof of Concept projects

## What is an Anomaly?

What is an anomaly, and why should it be of any concern? In layman's terms, “anomalies” or “outliers” are data points in a data space that are abnormal or out of trend. Anomaly detection focuses on identifying examples in the data that somehow deviate from what is expected or typical. Now, the question is, **“How do you define whether something is abnormal or an outlier?”** The quick, rational answer is: all those points that don’t follow the trend of the neighboring points in the sample space.

For any business domain, detecting suspicious patterns in a huge set of data is critical. In the banking domain, for example, fraudulent transactions pose a serious threat and cause losses and liabilities for the bank. In this article, we will learn about detecting anomalies without training the model on labeled anomalies beforehand, because you can’t train a model on data you don’t know about! That’s where the whole idea of **unsupervised learning** helps. We will see two network architectures for building a real-time anomaly detector: a) a deep CNN and b) an LSTM autoencoder.

These networks are suited to detecting a wide range of anomalies, i.e., point anomalies, contextual anomalies, and discords in time series data. Since the approach is unsupervised, it requires no labels for anomalies. We use the unlabeled data to capture and learn the data distribution, which is then used to forecast the normal behavior of a time series. The first architecture is inspired by the IEEE paper DeepAnT; it consists of two components: a **time series predictor** and an **anomaly detector**. The time series predictor uses a deep convolutional neural network (CNN) to predict the next timestamp on the defined horizon. This component takes a window of the time series (used as a reference context) and attempts to predict the next timestamp. The predicted value is then passed to the anomaly detector component, which is responsible for labeling the corresponding timestamp as **Non-Anomaly** or **Anomaly**.

The second architecture is inspired by this Nature paper: Deep LSTM-based Stacked Autoencoder for Multivariate Time Series.

Let us first understand what an **autoencoder neural network** is. The autoencoder architecture is used to learn efficient data representations in an unsupervised manner. There are **three components** to an autoencoder:

- an encoding (input) portion that compresses the data, and in the process learns a representation (encoding) for the set of data,
- a component that handles the compressed data (size reduction),
- and a decoder (output) portion that reconstructs the learned representation as close as possible to the original input from the compressed data while minimizing the overall loss function.

Put simply, when data is fed into an autoencoder, it is encoded and compressed down to a smaller size, and that smaller representation is then decoded to reconstruct the original input.
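The encode, compress, decode cycle can be sketched without a deep learning framework. The example below swaps in a linear autoencoder built from a truncated SVD (equivalent to PCA) rather than a trained neural network, so it is only an analogy for the models discussed in this article; the toy data and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples with 6 features that lie close to a 2-D subspace.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 6))

# "Training": find the 2 directions that best reconstruct the centered data.
mean = X.mean(axis=0)
_, _, vt = np.linalg.svd(X - mean, full_matrices=False)
components = vt[:2]                      # shape (2, 6): the bottleneck has 2 units


def encode(x):
    """Compress 6 features down to a 2-D code."""
    return (x - mean) @ components.T


def decode(z):
    """Reconstruct 6 features from the 2-D code."""
    return z @ components + mean


code = encode(X)
X_hat = decode(code)
recon_error = np.mean((X - X_hat) ** 2, axis=1)   # per-sample reconstruction error

# A point pushed off the learned subspace reconstructs poorly.
outlier = X[0] + 5.0 * vt[-1]
outlier_error = np.mean((outlier - decode(encode(outlier))) ** 2)
print(recon_error.max(), outlier_error)
```

Samples that resemble the training data reconstruct with a small error, while the off-subspace point does not, which is the property an autoencoder-based anomaly detector relies on.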

Next, let us understand why LSTM is appropriate here. LSTM stands for **long short-term memory** and is a neural network architecture capable of learning order dependencies in sequence prediction problems. An LSTM network is a type of recurrent neural network (RNN).

A plain RNN mainly suffers from vanishing gradients. Gradients carry information, and if they vanish over time, important localized information is lost. This is where the LSTM comes in handy, as its cell state helps preserve that information. The basic idea is that the LSTM network has multiple “gates” inside it with trained parameters. Some of these gates control the module's “output,” and other gates control its “forgetting.”
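The gating idea can be written out as a single LSTM time step in NumPy. This is a textbook-style sketch with random, untrained weights; the gate ordering inside the stacked weight matrices is a convention chosen here, and none of the names come from MSDA.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias,
    where the four stacked blocks are input, forget, candidate, and output.
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: how much new information to write
    f = sigmoid(z[H:2 * H])        # forget gate: how much of the old cell state to keep
    g = np.tanh(z[2 * H:3 * H])    # candidate values to write
    o = sigmoid(z[3 * H:4 * H])    # output gate: how much of the cell state to expose
    c = f * c_prev + i * g         # updated cell state (the "memory")
    h = o * np.tanh(c)             # updated hidden state (the "output")
    return h, c


rng = np.random.default_rng(0)
D, H = 3, 4                        # input features, hidden units
W = rng.normal(scale=0.5, size=(4 * H, D))
U = rng.normal(scale=0.5, size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):  # a sequence of 5 time steps
    h, c = lstm_step(x, h, c, W, U, b)
print(h)
```

Because the cell state is updated additively (`f * c_prev + i * g`), information can persist across many steps when the forget gate stays close to 1.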

LSTM networks are a good fit for classifying, processing, and making predictions based on time series data since there can be lags of unknown duration between important events in a time series.

An **LSTM Autoencoder** is an implementation of an **autoencoder** for sequence data using an Encoder-Decoder **LSTM** network architecture. Now that we have seen the basic concepts of each network, let us go through the design of both networks as shown below.

The **DeepCNN** consists of two convolutional layers. Typically, a CNN consists of a sequence of layers that includes convolutional layers, pooling layers, and fully connected layers. Each convolutional layer normally has two stages. In the first stage, the layer performs the mathematical operation called convolution, which results in linear activations. In the second stage, a non-linear activation function is applied to each linear activation.
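The two stages can be shown on a tiny one-dimensional signal. The sketch below uses plain NumPy with a hand-picked kernel (in a real network the kernel is learned), and, like most deep learning libraries, it computes a sliding dot product without flipping the kernel.

```python
import numpy as np


def conv1d_valid(signal, kernel, bias=0.0):
    """Stage 1: sliding dot product with no padding, giving linear activations."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel + bias
                     for i in range(len(signal) - k + 1)])


def relu(x):
    """Stage 2: non-linear activation applied element-wise."""
    return np.maximum(x, 0.0)


signal = np.array([1.0, 2.0, 0.0, -1.0, 3.0])
kernel = np.array([1.0, 0.0, -1.0])   # hand-picked for illustration

linear = conv1d_valid(signal, kernel)
activated = relu(linear)
print(linear)      # [ 1.  3. -3.]
print(activated)   # [1. 3. 0.]
```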

Like other neural networks, the CNN uses training data to adapt its parameters (weights and biases) to perform the learning task. The parameters of the network are optimized using the Adam optimizer. The kernel size and the number of filters can be tuned further to perform better depending on the dataset. Further, the dropout rate, learning rate, etc. can be fine-tuned to validate the performance of the network.

The loss function used was `MSELoss` (squared L2 norm), which measures the mean squared error between each element in the input ‘x’ and the target ‘y’. The `LSTMAENN` consists of multiple stacked LSTM layers with `input_size` (the number of expected features in the input x), `hidden_size` (the number of features in the hidden state h), `num_layers` (the number of recurrent layers, default 1), etc.

For more details, refer here. To reduce the chance of interpreting noise in the data as anomalies, we can tune additional hyper-parameters such as ‘lookback’ (time series window size), the number of units in the hidden layers, and many more.

## Unsupervised Deep Anomaly Detector Models

`DeepCNN(`

` (conv1d_1_layer): Conv1d(10, 16, kernel_size=(3,), stride=(1,))`

` (relu_1_layer): ReLU()`

` (maxpooling_1_layer): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)`

` (conv1d_2_layer): Conv1d(16, 16, kernel_size=(3,), stride=(1,))`

` (relu_2_layer): ReLU()`

` (maxpooling_2_layer): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)`

` (flatten_layer): Flatten()`

` (dense_1_layer): Linear(in_features=80, out_features=40, bias=True)`

` (relu_3_layer): ReLU()`

` (dropout_layer): Dropout(p=0.25, inplace=False)`

` (dense_2_layer): Linear(in_features=40, out_features=26, bias=True)`

`)`
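The layer sizes in the printout above can be checked with a little arithmetic. The sketch below assumes, as inferred from `Conv1d(10, 16, ...)` and `in_features=80`, that each input sample is laid out as 10 channels (the lookback window) by 26 values (the numerical features); that layout is an inference from the printout, not something stated by the library. The length formulas are the standard ones for unpadded 1-D convolution and pooling.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def pool1d_out_len(length, kernel_size=2, stride=2):
    """Output length of 1-D max pooling with no padding."""
    return (length - kernel_size) // stride + 1


length = 26                                          # assumed: 26 numerical features per time step
out_channels = 16                                    # filters in both convolutional layers

length = conv1d_out_len(length, kernel_size=3)       # conv1d_1_layer: 26 -> 24
length = pool1d_out_len(length)                      # maxpooling_1_layer: 24 -> 12
length = conv1d_out_len(length, kernel_size=3)       # conv1d_2_layer: 12 -> 10
length = pool1d_out_len(length)                      # maxpooling_2_layer: 10 -> 5

flattened = out_channels * length                    # flatten_layer: 16 * 5
print(flattened)                                     # 80, matching dense_1_layer in_features
```

The same arithmetic shows why changing the kernel size or the number of features changes the size of the first dense layer.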

**LSTM Autoencoder**

`LSTMAENN(`

` (lstm_1_layer): LSTM(26, 128)`

` (dropout_1_layer): Dropout(p=0.2, inplace=False)`

` (lstm_2_layer): LSTM(128, 64)`

` (dropout_2_layer): Dropout(p=0.2, inplace=False)`

` (lstm_3_layer): LSTM(64, 64)`

` (dropout_3_layer): Dropout(p=0.2, inplace=False)`

` (lstm_4_layer): LSTM(64, 128)`

` (dropout_4_layer): Dropout(p=0.2, inplace=False)`

` (linear_layer): Linear(in_features=128, out_features=26, bias=True)`

`)`

Now that we have designed the network architectures, we will go through the remaining steps with a hands-on demonstration, as given below.

**Getting Started**

### 1) Install the Package

The easiest way to install MSDA is by using pip.

`pip install msda`

Or install from source:

`$ git clone https://github.com/ajayarunachalam/msda`

`$ cd msda`

`$ python setup.py install`

### Notebook

`!pip install msda`

### 2) Import Time-Series Data

Here, we will use the climate data from here. This dataset is compiled from several public sources. The dataset consists of daily temperatures and precipitation from 13 Canadian centers. Precipitation is either rain or snow (likely snow in the winter months). In 1940, there is daily data for seven out of the 13 centers, but by 1960 there is daily data from all 13 centers, with the occasional missing value. We have around 80 years of records (daily frequency of data), and we want to identify the anomalies from that climate data. As seen below, this data has 27 features and around 30K records.

`import pandas as pd`

`df = pd.read_csv('Canadian_climate_history.csv')`

`df.shape`

`=============`

`(29221, 27)`

### 3) Data Validation, Pre-Processing, etc.

We start by checking for missing values and imputing those missing values.

The functions `missing()` and `impute()` from the Preprocessing & ExploratoryDataAnalysis class can be used to find missing values and fill in the missing information. We replace the missing values with the mean values (hence, `modes=1`). There are several utility functions within these classes that can be used for profiling your dataset, manually filtering outliers, etc. Other options include DateTime conversions, descriptive statistics of the data, normality distribution tests, etc. For more details, peek here.

`'''`

`Impute missing values with impute function (modes=0,1, 2, else use backfill)`

`0: impute with zero, 1: impute with mean, 2: impute with median, else impute with backfill method`

`'''`

`ExploratoryDataAnalysis.impute(df=df, modes=1)`
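For readers who want to see what the four imputation modes described in the comment above amount to, here is a plain pandas sketch. The function `impute_sketch` is hypothetical and is not MSDA's implementation; it only mirrors the documented meaning of `modes` (0: zero, 1: mean, 2: median, anything else: backfill).

```python
import numpy as np
import pandas as pd


def impute_sketch(df: pd.DataFrame, modes: int) -> pd.DataFrame:
    """Illustrative stand-in for the documented imputation modes (not the MSDA code)."""
    if modes == 0:
        return df.fillna(0)                                  # impute with zero
    if modes == 1:
        return df.fillna(df.mean(numeric_only=True))         # impute with column mean
    if modes == 2:
        return df.fillna(df.median(numeric_only=True))       # impute with column median
    return df.bfill()                                        # otherwise: backfill


toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})
filled_mean = impute_sketch(toy, modes=1)
filled_backfill = impute_sketch(toy, modes=3)
print(filled_mean)
print(filled_backfill)
```

Mean imputation keeps each column's average unchanged, while backfill copies the next observed value, which can be preferable for slowly varying sensor readings.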

### 4) Post-Processing Data To Input Into the Anomaly Detector

Next, we pass in the data with no missing values, remove unwanted fields, and set the timestamp field. Here, the user specifies the column to drop and the timestamp column by their index values. This returns two data frames: one has all the numerical fields without the timestamp index, while the other has all the numerical fields indexed by timestamp. We need to use the one with the timestamp as the index for the further steps.

`Anamoly.read_data(data=df_no_na, column_index_to_drop=0, timestamp_column_index=0)`

### 5) Data Processing With User-Input Time Window Size

The time window size (lookback size) is given as an input to the function `data_pre_processing` from the `Anamoly` class.

`X,Y,timesteps,X_data = Anamoly.data_pre_processing(df=anamoly_df, LOOKBACK_SIZE=10)`

With this function, we also normalize the data to the range `[0,1]` and then modify the dataset by adding "time-steps" as an additional dimension. The idea is to convert the two-dimensional data set of shape `[Batch Size, Features]` into a three-dimensional data set of shape `[Batch Size, Lookback Size, Features]`.

For more details, inspect here.
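The scaling and reshaping described above can be sketched in NumPy. This is not the body of `data_pre_processing`; the helper names are hypothetical, and the choice of the row immediately after each window as the prediction target is an assumption made for illustration.

```python
import numpy as np


def min_max_scale(data):
    """Scale each feature column to the range [0, 1]."""
    lo, hi = data.min(axis=0), data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)     # avoid division by zero for constant columns
    return (data - lo) / span


def make_windows(data, lookback):
    """Turn [rows, features] into X: [samples, lookback, features] and Y: [samples, features].

    X[i] holds rows i .. i+lookback-1 and Y[i] is the row that follows them (assumed target).
    """
    n = len(data) - lookback
    X = np.stack([data[i:i + lookback] for i in range(n)])
    Y = data[lookback:]
    return X, Y


raw = np.arange(40, dtype=float).reshape(20, 2)   # 20 time steps, 2 toy features
scaled = min_max_scale(raw)
X, Y = make_windows(scaled, lookback=10)
print(X.shape, Y.shape)                           # (10, 10, 2) (10, 2)
```

Each sample now carries its own context window, which is what lets a convolutional or recurrent model predict the next time step from the preceding ones.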

### 6) Selecting Custom User Selection Input Configurations To Train the Anomaly Model

Using the `set_config()` function, the user can select one of the deep network architectures, set the time window size, and tune the kernel size. The available models are the deep convolutional neural network and the LSTM autoencoder, selected with the values `deepcnn` and `lstmaenn`, respectively. We choose a time-series window size of 10 and a kernel size of 3 for the convolutional network.

`MODEL_SELECTED, LOOKBACK_SIZE, KERNEL_SIZE = Anamoly.set_config(MODEL_SELECTED='deepcnn', LOOKBACK_SIZE=10, KERNEL_SIZE=3)`

`==================`

`MODEL_SELECTED = deepcnn`

`LOOKBACK_SIZE = 10`

`KERNEL_SIZE = 3`

### 7) Training the Selected Anomaly Detector Model

One can train the model on either GPU or CPU, based on availability. The compute function will use the GPU if one is available; otherwise, it will use the CPU. Google Colab provides an NVIDIA Tesla K80, a widely used GPU, while the NVIDIA Tesla V100 was the first Tensor Core GPU. The number of training epochs can be set by the user. The device being used is printed to the console.

`Anamoly.compute(X, Y, LOOKBACK_SIZE=10, num_of_numerical_features=26, MODEL_SELECTED=MODEL_SELECTED, KERNEL_SIZE=KERNEL_SIZE, epocs=30)`

`==================`

`Training Loss: 0.2189370188678473 - Epoch: 1`

`Training Loss: 0.18122351250783636 - Epoch: 2`

`Training Loss: 0.09276176958476466 - Epoch: 3`

`Training Loss: 0.04396845106961693 - Epoch: 4`

`Training Loss: 0.03315385463795454 - Epoch: 5`

`Training Loss: 0.027696743746250377 - Epoch: 6`

`Training Loss: 0.024318942805264566 - Epoch: 7`

`Training Loss: 0.021794179179027335 - Epoch: 8`

`Training Loss: 0.019968783528812286 - Epoch: 9`

`Training Loss: 0.0185430530715746 - Epoch: 10`

`Training Loss: 0.01731374272046384 - Epoch: 11`

`Training Loss: 0.016200231966590112 - Epoch: 12`

`Training Loss: 0.015432962290901867 - Epoch: 13`

`Training Loss: 0.014561152689542462 - Epoch: 14`

`Training Loss: 0.013974714691690522 - Epoch: 15`

`Training Loss: 0.013378228182289321 - Epoch: 16`

`Training Loss: 0.012861106097943028 - Epoch: 17`

`Training Loss: 0.012339938251426095 - Epoch: 18`

`Training Loss: 0.011948177564954476 - Epoch: 19`

`Training Loss: 0.011574006228333366 - Epoch: 20`

`Training Loss: 0.011185694509874397 - Epoch: 21`

`Training Loss: 0.010946418002639517 - Epoch: 22`

`Training Loss: 0.010724217305010896 - Epoch: 23`

`Training Loss: 0.010427865211985524 - Epoch: 24`

`Training Loss: 0.010206768034701313 - Epoch: 25`

`Training Loss: 0.009942568653453904 - Epoch: 26`

`Training Loss: 0.009779498535478721 - Epoch: 27`

`Training Loss: 0.00969111187656911 - Epoch: 28`

`Training Loss: 0.009527427295318766 - Epoch: 29`

`Training Loss: 0.009236675929400544 - Epoch: 30`

### 8) Finding Anomalies

Once the training is completed, the next step is to find the anomalies. This brings us back to our fundamental question: how exactly can we estimate and trace what is an anomaly? One can use the Anomaly Score, the Anomaly Likelihood, and some recently developed metrics like the Mahalanobis distance-based confidence score. The Mahalanobis confidence score assumes that the intermediate features of pre-trained neural classifiers follow class-conditional Gaussian distributions whose covariances are tied across all distributions, and the confidence score for a new input is defined as the Mahalanobis distance from the closest class-conditional distribution.
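The Mahalanobis idea can be illustrated on toy data. In the sketch below, two synthetic 2-D point clouds stand in for the intermediate features of a classifier (an assumption made to keep the example small), one covariance matrix is shared by both classes, and the score is the distance to the closest class mean as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy "classes" of 2-D features (stand-ins for intermediate network features).
class_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
class_b = rng.normal(loc=[6.0, 6.0], scale=1.0, size=(200, 2))

# Class-conditional means and a single tied covariance shared by both classes.
means = np.stack([class_a.mean(axis=0), class_b.mean(axis=0)])
centered = np.vstack([class_a - means[0], class_b - means[1]])
tied_cov = centered.T @ centered / len(centered)
precision = np.linalg.inv(tied_cov)


def mahalanobis_to_closest_class(x):
    """Mahalanobis distance from x to the nearest class-conditional Gaussian."""
    diffs = means - x                                          # (classes, dims)
    d2 = np.einsum("cd,de,ce->c", diffs, precision, diffs)     # squared distance per class
    return float(np.sqrt(d2.min()))


typical = mahalanobis_to_closest_class(np.array([0.2, -0.1]))
unusual = mahalanobis_to_closest_class(np.array([3.0, -4.0]))
print(typical, unusual)
```

A point near either class mean gets a small distance, while a point far from both gets a large one and would be treated as out-of-distribution.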

An Anomaly Score is the fraction of active columns that were not predicted correctly. In contrast, Anomaly Likelihood is the likelihood that a given anomaly score represents a true anomaly. In any dataset, there is a natural level of uncertainty that creates a certain “normal” number of errors in prediction, and anomaly likelihood accounts for this natural level of error. Since we don’t have ground-truth anomaly labels, we cannot use this metric in our case. The `find_anamoly()` function detects anomalies by generating the hypothesis and calculating losses, which serve as the anomaly confidence scores for the individual timestamps in the data set.

`loss_df = Anamoly.find_anamoly(loss=loss, T=timesteps)`

`hypothesis = model(torch.from_numpy(X.astype(np.float32)).to(device)).detach().cpu().numpy()`

`loss = np.linalg.norm(hypothesis - Y, axis=1)`

`return loss.reshape(len(loss),1)`
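To see what the row-wise norm in the excerpt above produces, here is a small NumPy example with made-up predictions and targets. The arrays are invented for illustration; only the `np.linalg.norm(..., axis=1)` and `reshape` calls mirror the excerpt.

```python
import numpy as np

# Made-up targets and model predictions: 3 timestamps, 3 features each.
Y = np.array([[0.1, 0.2, 0.3],
              [0.5, 0.5, 0.5],
              [0.9, 0.1, 0.4]])
hypothesis = np.array([[0.1, 0.2, 0.3],
                       [0.5, 0.8, 0.1],
                       [0.9, 0.1, 0.4]])

# One Euclidean (L2) distance per timestamp: larger means the prediction was further off.
loss = np.linalg.norm(hypothesis - Y, axis=1)
loss = loss.reshape(len(loss), 1)      # column vector, one score per row
print(loss)
```

Only the second timestamp, where the prediction misses the target, receives a non-zero score (0.5 here), so it is the one that would stand out as a candidate anomaly.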

### 9) Plotting Samples With Confidence Score: DeepCNN Example

Next, we need to visualize the anomalies; the samples are assigned anomaly confidence scores for each timestamp record. The `plot_anamoly_results` function can be used to plot the anomaly score with respect to frequencies (bins) and the confidence scores for every timestamp record.

`Anamoly.plot_anamoly_results(loss_df=loss_df)`

From the above graphs, one can presume that the timestamps/instances which have anomaly confidence scores greater than or equal to 1.2 are likely examples that deviate from what is expected or typical, and thus can be treated as potential anomalies.
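Applying such a cut-off is a one-liner once the scores are in a data frame. The sketch below uses invented scores and assumes a column named `loss` indexed by timestamp; the actual layout of `loss_df` returned by the library may differ.

```python
import numpy as np
import pandas as pd

# Invented confidence scores for six consecutive days (the "loss" column name is an assumption).
timestamps = pd.date_range("1940-01-01", periods=6, freq="D")
loss_df = pd.DataFrame({"loss": [0.31, 0.45, 1.37, 0.52, 1.20, 0.28]}, index=timestamps)

THRESHOLD = 1.2                                    # cut-off read off the score plots
loss_df["is_anomaly"] = loss_df["loss"] >= THRESHOLD
flagged = loss_df[loss_df["is_anomaly"]]

# The same scores binned by frequency, as in a histogram of anomaly scores.
counts, bin_edges = np.histogram(loss_df["loss"], bins=4)

print(flagged)
print(counts)
```

The threshold is a judgment call made from the plots rather than a learned parameter, so it is worth revisiting for each dataset.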

### 10) Interpretable Results of Predictions From the Anomaly Detector: DeepCNN

Finally, a prototype of Explainable AI for the built time-series predictor is designed. Before we go through this step, let us understand what is needed for interpretable models/explainable models.

#### Why Explainable AI (XAI) Is the Buzz and Need of the Hour?

Data is everywhere, and machine learning can mine it for information. Representation learning becomes far more valuable if the results generated by machine learning models can be easily understood, interpreted, and trusted by humans. That is where Explainable AI comes in, so that the model is no longer a black box.

The `explainable_results()` function uses the game-theoretic approach to explain the output of the model. To understand, interpret, and trust the results of the deep models at the individual/sample level, we use the Kernel Explainer. One of the fundamental properties of Shapley values is that they always sum up to the difference between the game outcome when all players are present and the game outcome when no players are present. For machine learning models, this means that the SHAP values of all the input features will always sum up to the difference between the baseline (expected) model output and the current model output for the prediction being explained.
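The additivity property can be verified on a toy model by computing exact Shapley values through brute-force enumeration of feature subsets. This is not what the Kernel Explainer does internally (it approximates these values by sampling); the model, the input, and the all-zero baseline below are invented for illustration.

```python
from itertools import combinations
from math import factorial


def model(x):
    """Toy model with an interaction term between the first two features."""
    return x[0] * x[1] + 2.0 * x[2]


def shapley_values(f, x, baseline):
    """Exact Shapley values: features outside a coalition are set to the baseline."""
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                total += weight * (value(set(subset) | {i}) - value(set(subset)))
        phi.append(total)
    return phi


x = [2.0, 3.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)

print(phi)                                  # per-feature contributions
print(sum(phi), model(x) - model(baseline)) # both equal 8.0
```

The interaction term `x[0] * x[1]` is split evenly between the two features involved, and the three contributions add up exactly to the gap between the explained prediction and the baseline output.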

The `explainable_results` function takes the index of the specific row/instance/sample whose prediction is to be interpreted. It also takes the input features (X) and the time-series window size difference (Y). We can get explainable results at the individual instance level and also for a batch of data (for example, the first 200 rows or the last 50 samples).

`Anamoly.explainable_results(X=anamoly_data, Y=Y, specific_prediction_sample_to_explain=10,input_label_index_value=16, num_labels=26)`

The above graph is the result of the 10th example/sample/record/instance. It can be seen that the features that contributed significantly to the corresponding anomaly confidence score result were the temperature readings from the weather stations of Vancouver, Toronto, Saskatoon, Winnipeg, and Calgary.

## Important Resources

- Example Unsupervised Feature Selection Demo Notebook
- Example Unsupervised Anomaly Detector & Explainable AI Demo Notebook

The complete code is made available here. Refer to this notebook.

Published at DZone with permission of Ajay Arunachalam. See the original article here.
