Malware Detection With Convolutional Neural Networks in Python

Learn the basics of artificial neural network architectures and how to use Convolutional Neural Networks to help malware analysts and information security professionals detect and classify malicious code.

Sugandha Lahoti

Updated Sep. 19, 18 · Tutorial


In this post, we will learn about artificial neural network architectures and how to use one of them, the Convolutional Neural Network, to help malware analysts and information security professionals detect and classify malicious code.

Malware is a nightmare for every modern organization. Attackers and cybercriminals are always coming up with new malicious software to attack their targets. Security vendors do their best to defend against malware attacks but, with millions of new malware samples discovered every month, they cannot keep up. Thus, novel approaches such as deep learning are needed.

Before diving into the technical details and the practical implementation of the deep learning (DL) method, it is useful to review the main architectures of artificial neural networks, which are discussed next.

This excerpt is taken from the book Mastering Machine Learning for Penetration Testing by Packt Publishing. This book teaches you extensive skills to become a master at penetration testing using machine learning with Python.

Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) are a deep learning approach to image classification and, more broadly, to computer vision problems. Classic computer programs struggle to identify objects in images for many reasons, including lighting, viewpoint, deformation, and segmentation.

This technique is inspired by how vision works in animals, in particular by the visual cortex. CNNs are arranged in three-dimensional structures with width, height, and depth as characteristics. In the case of images, the height is the image height, the width is the image width, and the depth is the number of color channels (three for RGB).

To build a CNN, we need three main types of layers:

Convolutional layer: The convolution operation extracts features from the input image by sliding a filter over it and, at each position, multiplying the filter values with the underlying pixel values and summing the results

Pooling layer: The pooling operation reduces the dimensionality of each feature map

Fully-connected layer: The fully-connected layer is a classic multi-layer perceptron with a softmax activation function in the output layer
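To make the first two layer types concrete, here is a minimal NumPy sketch of what a single filter and a pooling step compute. The function names, the toy 6 x 6 input, and the 2 x 2 filter are illustrative assumptions, not part of any library, and real frameworks use much faster implementations. Note that, as in deep learning libraries, the "convolution" here is technically a cross-correlation (the filter is not flipped).

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a filter over the image with no padding and stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the filter with the pixel patch under it and sum the result.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h = feature_map.shape[0] // size
    w = feature_map.shape[1] // size
    trimmed = feature_map[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # toy 2x2 filter
features = convolve2d(image, kernel)               # feature map, shape (5, 5)
pooled = max_pool2d(features)                      # reduced map, shape (2, 2)
```

The pooling step is what reduces the dimensionality of the feature map: here a 5 x 5 map shrinks to 2 x 2.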

To implement a CNN with Python, you can use the following Python script:

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D

# MNIST has 10 digit classes.
num_classes = 10

model = Sequential()
# Channels-last layout (the Keras default): 28 x 28 pixels, 1 grayscale channel.
# Assumption: 'relu' in the hidden layers; 'softmax' in the output layer follows the text above.
model.add(Conv2D(32, (5, 5), input_shape=(28, 28, 1), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
# Same loss and optimizer family as the malware model built later in this post.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Recurrent Neural Networks (RNNs)

Recurrent Neural Networks (RNNs) are artificial neural networks that make use of sequential information, such as sentences. In other words, RNNs perform the same task for every element of a sequence, with the output depending on the previous computations. RNNs are widely used in language modeling and text generation, as well as in machine translation, speech recognition, and many other applications. Plain RNNs, however, do not retain information over long sequences.
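The idea of "the same task for every element, with the output depending on previous computations" can be sketched in a few lines of NumPy. This is a plain tanh recurrence under assumed names and sizes (rnn_forward, 3 input features, 4 hidden units), not the API of any particular library.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a plain RNN over a sequence and return every hidden state."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        # The same weights are reused at every step; the new state
        # depends on the current input and on the previous state.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 3))          # 5 time steps, 3 features each
W_x = rng.normal(size=(4, 3)) * 0.1    # input-to-hidden weights
W_h = rng.normal(size=(4, 4)) * 0.1    # hidden-to-hidden weights
b = np.zeros(4)
hidden_states = rnn_forward(seq, W_x, W_h, b)   # shape (5, 4)
```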

Long Short Term Memory networks

Long Short Term Memory (LSTM) networks address the short-memory issue of recurrent neural networks by adding a memory block, sometimes called a memory cell.
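As a rough illustration of how a memory cell retains information, the following NumPy sketch implements one step of a standard LSTM cell. The function name, the gate ordering inside W, U, and b, and the toy sizes are assumptions made for this example. In the usage part, the forget gate is forced open and the input gate closed, so the cell simply carries its previous content forward.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the gates as [input, forget, output, candidate]."""
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:n])            # input gate: how much new content to write
    f = sigmoid(z[n:2 * n])        # forget gate: how much old content to keep
    o = sigmoid(z[2 * n:3 * n])    # output gate: how much of the cell to expose
    g = np.tanh(z[3 * n:4 * n])    # candidate content
    c = f * c_prev + i * g         # updated memory cell
    h = o * np.tanh(c)             # hidden state passed to the next step
    return h, c

n, d = 2, 3                        # hidden size and input size (toy values)
W = np.zeros((4 * n, d))
U = np.zeros((4 * n, n))
# Bias the input gate shut (-20) and the forget gate open (+20).
b = np.concatenate([np.full(n, -20.0), np.full(n, 20.0), np.zeros(n), np.zeros(n)])
c_prev = np.array([0.5, -0.3])
h, c = lstm_step(np.ones(d), np.zeros(n), c_prev, W, U, b)
```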

Hopfield networks

Hopfield networks were developed by John Hopfield in 1982. The main goal of Hopfield networks is auto-association and optimization. We have two categories of Hopfield network: discrete and continuous.
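Auto-association means that the network recovers a stored pattern from a corrupted version of it. A minimal sketch of a discrete Hopfield network, assuming Hebbian learning, +1/-1 states, and synchronous updates (the function names and the 8-unit pattern are illustrative), could look like this:

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian learning: sum of outer products of the +1/-1 patterns, zero diagonal."""
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, state, steps=5):
    """Update all units together until the state stops changing or steps run out."""
    state = state.copy()
    for _ in range(steps):
        new_state = np.where(W @ state >= 0, 1, -1)
        if np.array_equal(new_state, state):
            break
        state = new_state
    return state

stored = np.array([[1, -1, 1, -1, 1, -1, 1, -1]])
W = train_hopfield(stored)
noisy = stored[0].copy()
noisy[0] = -noisy[0]                 # corrupt one unit
recovered = recall(W, noisy)
```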

Boltzmann Machine Networks

Boltzmann machine networks use recurrent structures and rely only on locally available information. They were developed by Geoffrey Hinton and Terry Sejnowski in 1985. The goal of a Boltzmann machine is to optimize the solution to a problem.

Malware Detection With CNNs

In this section, we are going to discover how to build a malware classifier with CNNs. But I bet you are wondering how we can do that when CNNs take images as inputs. The answer is simple: the trick is to convert malware into an image. Is this possible? Yes, it is. Malware visualization has been an active research topic over the past few years. One of the proposed solutions comes from a research study called Malware Images: Visualization and Automatic Classification by Lakshmanan Nataraj et al. from the Vision Research Lab, University of California, Santa Barbara.

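The conversion works as follows: the malware binary is read as a sequence of 8-bit unsigned integers (one per byte), the sequence is reshaped into a two-dimensional array of fixed width, and the array is rendered as a grayscale image in which each value from 0 to 255 is a pixel intensity.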

Figure: image of the Alueron.gen!J malware.

This technique also gives us the ability to visualize the individual sections of a malware binary in detail.

Once malware can be fed to CNN-based classifiers as images, information security professionals can use the power of CNNs to train models. One of the malware datasets most often used for this purpose is the Malimg dataset. It contains 9,339 malware samples from 25 different malware families. You can download it from Kaggle.

These are the malware families:

Allaple.L

Allaple.A

Yuner.A

Lolyda.AA 1

Lolyda.AA 2

Lolyda.AA 3

C2Lop.P

C2Lop.gen!G

Instantaccess

Swizzor.gen!I

Swizzor.gen!E

VB.AT

Fakerean

Alueron.gen!J

Malex.gen!J

Lolyda.AT

Adialer.C

Wintrim.BX

Dialplatform.B

Dontovo.A

Obfuscator.AD

Agent.FYI

Autorun.K

Rbot!gen

Skintrim.N

After converting malware into grayscale images, you get a representation that can later be fed to the machine learning model.

The conversion of each malware sample to a grayscale image can be done using the following Python script:

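import os
import array
import numpy
from PIL import Image

filename = '<Malware_File_Name_Here>'
width = 256

# Read the binary as unsigned bytes, dropping the tail that does not fill a row.
ln = os.path.getsize(filename)
rem = ln % width
a = array.array("B")
with open(filename, 'rb') as f:
    a.fromfile(f, ln - rem)

# Reshape into rows of `width` pixels; // keeps the row count an integer in Python 3.
g = numpy.array(a, dtype=numpy.uint8).reshape(len(a) // width, width)

# scipy.misc.imsave was removed from SciPy, so Pillow writes the PNG instead.
Image.fromarray(g).save('<Malware_File_Name_Here>.png')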

For feature selection, you can extract any image characteristics, such as texture patterns, frequencies in the image, intensity, or color features, using techniques such as Euclidean distance or mean and standard deviation, to generate feature vectors later. In our case, we can use algorithms such as a color layout descriptor, homogeneous texture descriptor, or global image descriptors (GIST). Let's suppose that we selected GIST; pyleargist is a Python library that computes it. To install it, use pip as usual:

# pip install pyleargist

As a use case, to compute a GIST, you can use the following Python script:

from PIL import Image
import leargist

image = Image.open('<Image_Name_Here>.png')
new_im = image.resize((64, 64))
des = leargist.color_gist(new_im)
feature_vector = des[0:320]

Here, des[0:320] keeps the first 320 values of the descriptor, which is sufficient because we are using grayscale images. Don't forget to save the feature vectors as NumPy arrays so you can use them later to train the model.
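One possible way to persist the vectors is sketched below. The random vectors, the label values, and the file name malimg_gist.npz are placeholders invented for this example; in practice each vector would come from the GIST step above.

```python
import os
import tempfile
import numpy as np

# Hypothetical stand-ins for per-sample GIST vectors of length 320.
feature_vectors = [np.random.rand(320).astype(np.float32) for _ in range(4)]
labels = np.array([0, 0, 1, 1])        # hypothetical family indices

X = np.stack(feature_vectors)          # shape (n_samples, 320)

# Store features and labels together in one compressed archive.
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "malimg_gist.npz")
np.savez_compressed(path, X=X, y=labels)

# Later, in the training script:
with np.load(path) as data:
    X_loaded, y_loaded = data["X"], data["y"]
```

Keeping features and labels in the same archive avoids the two getting out of sync between the extraction and training scripts.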

After getting the feature vectors, we can train many different models, including SVMs, k-means, and artificial neural networks. One of the most useful is the CNN.
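Before training anything heavier, it can help to check that the feature vectors separate the families at all. The sketch below uses a nearest-neighbour lookup with Euclidean distance, which is not one of the models named above but is about the simplest baseline available. The data is synthetic, and the names features, family_ids, and nearest_neighbor_predict are made up for the example.

```python
import numpy as np

def nearest_neighbor_predict(features, family_ids, query):
    """Return the label of the stored vector closest to query (Euclidean distance)."""
    distances = np.linalg.norm(features - query, axis=1)
    return family_ids[int(np.argmin(distances))]

# Synthetic 320-dimensional vectors standing in for GIST descriptors.
rng = np.random.default_rng(0)
family_a = rng.normal(loc=0.0, scale=0.1, size=(5, 320))
family_b = rng.normal(loc=1.0, scale=0.1, size=(5, 320))
features = np.vstack([family_a, family_b])
family_ids = np.array([0] * 5 + [1] * 5)

query = np.full(320, 0.9)              # lies close to family_b
predicted = nearest_neighbor_predict(features, family_ids, query)
```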

Once feature selection and engineering are done, we can build a CNN. For our model, for example, we will build a convolutional network with two convolutional layers and 32 * 32 inputs. To build the model using Python libraries, we can implement it with the previously installed TensorFlow and utils libraries.

The overall CNN architecture therefore consists of two convolutional layers followed by fully-connected layers. This CNN architecture is not the only way to build the model, but it is the one we are going to use for the implementation.

To build the model, and CNNs in general, I highly recommend Keras. The required imports are the following:

import keras
from keras.models import Sequential, Model
from keras.layers import Input, Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import BatchNormalization, LeakyReLU

As we discussed before, a grayscale image has pixel values that range from 0 to 255. The network expects images of dimension 32 * 32 * 1, so we reshape the data accordingly:

train_X = train_X.reshape(-1, 32,32, 1)
test_X = test_X.reshape(-1, 32,32, 1)
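The reshape alone leaves the pixel values in the 0 to 255 range. A common extra step, which is an assumption here rather than something the excerpt prescribes, is to scale them to [0, 1] as 32-bit floats before training. A minimal sketch with a hypothetical helper and synthetic data:

```python
import numpy as np

def prepare_images(images):
    """Scale 8-bit pixels to [0, 1] and add the single grayscale channel axis."""
    images = np.asarray(images, dtype=np.float32) / 255.0
    return images.reshape(-1, 32, 32, 1)

# Synthetic stand-in for ten 32 x 32 grayscale malware images.
raw = np.random.randint(0, 256, size=(10, 32, 32), dtype=np.uint8)
batch = prepare_images(raw)
```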

We will train our network with these parameters:

batch_size = 64

epochs = 20

num_classes = 25
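With 25 classes and a categorical cross-entropy loss, the labels need to be one-hot encoded. The sketch below shows what that encoding looks like in plain NumPy; the label values are hypothetical, and keras.utils.to_categorical performs the same transformation.

```python
import numpy as np

num_classes = 25

def one_hot(labels, num_classes):
    """Turn integer class indices into rows with a single 1 at the class position."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

train_Y = np.array([0, 3, 24])          # hypothetical family indices
train_Y_one_hot = one_hot(train_Y, num_classes)
```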

To build the architecture, use the following:

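Malware_Model = Sequential()
# Arguments left unspecified use the Keras defaults (linear activation, 'valid' padding).
Malware_Model.add(Conv2D(32, kernel_size=(3, 3), input_shape=(32, 32, 1)))
Malware_Model.add(LeakyReLU(0.1))
Malware_Model.add(MaxPooling2D(pool_size=(2, 2)))
Malware_Model.add(Conv2D(64, (3, 3)))
Malware_Model.add(LeakyReLU(0.1))
# Flatten the feature maps so the Dense layers receive one vector per image.
Malware_Model.add(Flatten())
Malware_Model.add(Dense(1024))
Malware_Model.add(LeakyReLU(0.1))
Malware_Model.add(Dropout(0.4))
# Assumption: softmax output, matching the categorical cross-entropy loss.
Malware_Model.add(Dense(num_classes, activation='softmax'))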

To compile the model, use the following:

Malware_Model.compile(loss=keras.losses.categorical_crossentropy, optimizer=keras.optimizers.Adam(), metrics=['accuracy'])

Fit and train the model:

Malware_Model.fit(train_X, train_label, batch_size=batch_size, epochs=epochs, validation_data=(valid_X, valid_label))
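The fit call assumes that valid_X and valid_label already exist. One way to carve a validation set out of the training data is sketched below; the helper name, the 20% fraction, and the seed are arbitrary choices for illustration, and the arrays are synthetic.

```python
import numpy as np

def split_train_valid(X, y, valid_fraction=0.2, seed=13):
    """Shuffle the samples once and hold out a fraction for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_valid = int(len(X) * valid_fraction)
    valid_idx, train_idx = order[:n_valid], order[n_valid:]
    return X[train_idx], X[valid_idx], y[train_idx], y[valid_idx]

X = np.zeros((100, 32, 32, 1), dtype=np.float32)   # synthetic images
y = np.zeros((100, 25), dtype=np.float32)          # synthetic one-hot labels
train_X, valid_X, train_label, valid_label = split_train_valid(X, y)
```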

As you noticed, we are respecting the flow of training a neural network that was discussed in previous chapters. To evaluate the model, use the following code:

test_eval = Malware_Model.evaluate(test_X, test_Y_one_hot)
print('The accuracy of the Test is:', test_eval[1])
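A single accuracy figure hides which families get confused with each other. As a rough sketch, the class probabilities returned by Malware_Model.predict(test_X) could be turned into a confusion matrix as follows. The helper name is hypothetical, and the example uses a synthetic 3-class case instead of real model output.

```python
import numpy as np

def accuracy_and_confusion(probabilities, one_hot_labels):
    """Compare arg-max predictions with the true classes."""
    predicted = probabilities.argmax(axis=1)
    actual = one_hot_labels.argmax(axis=1)
    n = one_hot_labels.shape[1]
    confusion = np.zeros((n, n), dtype=int)
    # Rows are true classes, columns are predicted classes.
    np.add.at(confusion, (actual, predicted), 1)
    return float((predicted == actual).mean()), confusion

# Synthetic predictions for four samples and three classes.
probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.2, 0.7]])
truth = np.eye(3)[[0, 1, 1, 2]]
acc, cm = accuracy_and_confusion(probs, truth)
```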

Promises and Challenges in Applying Deep Learning to Malware Detection

Many different deep network architectures were proposed by machine learning practitioners and malware analysts to detect both known and unknown malware; some of the proposed architectures include restricted Boltzmann machines and hybrid methods.

New approaches to detecting malware show many promising results. However, malware analysts face many challenges when detecting malware with deep learning networks, especially when analyzing PE files. To analyze a PE file, we take each byte as an input unit, so we have to classify sequences with millions of steps, while also preserving the complicated spatial correlations across functions that arise from function calls and jump commands.

You just read an excerpt from the book Mastering Machine Learning for Penetration Testing written by Chiheb Chebbi and published by Packt Publishing.

We just discovered how to build malware detectors using different machine learning algorithms, especially using the power of deep learning techniques.

