Dynamic Intrusion Detection for Authorization Systems Like Udaru

A discussion of a low-level security topic, dynamic intrusion detection, and the algorithms that make it work. In this post, we use the open source Udaru as an example.

Security is an ever-increasing risk. As we add more components to our infrastructure, the number of attack vectors increases. When building an authorization system such as Udaru, it is just as important to think about the security of Udaru itself as about the security of the systems that Udaru protects.

Creating an intrusion-detection system for Udaru poses some challenges. First, Udaru itself doesn't restrict what type of application or domain it can be used for, so it is very difficult to create feature- or application-specific rules. Secondly, there is often little or no data about known intrusion attempts, especially when we want to detect intrusion patterns that haven't been considered before.

Therefore, our intrusion-detection system dynamically creates rules based on a training dataset of previously seen valid requests. This kind of data modeling is known as zero-negative classification: the training data contains no examples of intrusions. Typical modern approaches such as machine learning are not well suited to these kinds of problems. Instead, we use frequentist and Bayesian statistical models, which are better suited to this type of problem.


Udaru is a Policy Based Access Control (PBAC) authorization system. Consider a user in an organization who asks to perform an action on a certain resource. Udaru policies control whether this action is allowed. For example, we can validate whether a user can write to a specific file on a server.
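As a concrete illustration of a PBAC-style check, here is a minimal sketch in Python. This is not the actual Udaru API or policy format; the policy fields (`users`, `actions`, `resource_prefix`, `effect`) and the prefix-matching rule are assumptions for illustration only.

```python
# Hypothetical PBAC check: a policy grants an action on a resource to a user.
# This is a simplified sketch, not Udaru's actual policy engine.

def is_allowed(policies, user, action, resource):
    """Return True if any policy grants `user` the `action` on `resource`."""
    for policy in policies:
        if (user in policy["users"]
                and action in policy["actions"]
                and resource.startswith(policy["resource_prefix"])
                and policy["effect"] == "Allow"):
            return True
    return False

policies = [{
    "users": {"alice"},
    "actions": {"file:write"},
    "resource_prefix": "resource:server:/public/",
    "effect": "Allow",
}]

print(is_allowed(policies, "alice", "file:write", "resource:server:/public/report.txt"))  # True
print(is_allowed(policies, "bob", "file:write", "resource:server:/public/report.txt"))    # False
```

Note how the prefix match makes the policy's behavior on unusual resource strings interesting from a security perspective, which is exactly what the models below probe.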

Udaru information flow: Udaru decouples the user login from the access control itself. This means when Udaru gets an authorization request, it assumes that the user login has already been validated.

It is important to understand that Udaru itself is not responsible for the user login. Intrusion detection for attack vectors that are meant to obtain user credentials, such as brute-force attacks, timing attacks, or SQL injections, is therefore not within the scope of intrusion detection for Udaru.

Intrusion Strategies

The intrusion-detection system looks at IP addresses and timestamps to see if a user moves too fast. This may indicate that an intruder has gained another person’s user credentials.

Secondly, the intrusion-detection system looks at the resource string to see if it is similar to previously seen resource strings. If not, this might indicate that an intruder is trying to gain access to resources through obscure means, such as special characters (resource:server:/public/server/../../../etc/passwd) or unusual syntax (resource:server:/public/server:/etc/passwd). In these cases, a poorly written policy may allow access because resource:server:/public/server matches the poorly written policy. Even if the policies are fine, it is still important to know that someone is attempting to break the authorization system.

Udaru poses some challenges in the second case because it doesn't assume anything about the application. By default, it has no restrictions on how resource strings are supposed to look. In the example above, the resources are directories on servers, but they could just as well be physical doors in a corporate building or personal information records.

Transaction Audit Log

To facilitate logging the authorization data, nearForm has recently developed a new audit-log service called Trail. Trail indexes the authorization data and makes it easily accessible to the intrusion-detection system. To use Trail with Udaru, there is also an integration module called Udaru-Trail.

Geolocation Anomalies

For detecting users who “move” too fast, the IP address is translated to longitude and latitude using the MaxMind GeoLite2 database. Using the longitude and latitude of the IP addresses of the last login and the new login, the distance can be calculated using the Haversine formula:

d = 2r · arcsin(√(sin²((ϕ2 − ϕ1)/2) + cos(ϕ1) · cos(ϕ2) · sin²((λ2 − λ1)/2)))

Based on the fact that the Earth’s radius is approximately:

r ≈ 6,371,000 m

With the traveled distance calculated, it is straightforward to compute the velocity assuming infinite acceleration:

v = d / (t2 − t1)

These equations require the following parameters:

  • (λ1, ϕ1, t1) longitude, latitude, and timestamp for the last login.
  • (λ2, ϕ2, t2) longitude, latitude, and timestamp for the new login.

For the Udaru intrusion-detection, a velocity above the speed of sound (343 m/s) is considered an intrusion.

The Haversine functions are not part of the standard Python math library, but they can easily be implemented using the following trigonometric identities:

hav(θ) = sin²(θ/2) = (1 − cos(θ))/2, and archav(h) = 2 · arcsin(√h)
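Putting the formula and identities above together, the distance and velocity check can be sketched in Python. This is illustrative only; the Earth radius and speed-of-sound threshold follow the article, but the function names and login tuples are assumptions, not the tool's actual code.

```python
import math

EARTH_RADIUS_M = 6_371_000   # approximate mean Earth radius in metres
SPEED_OF_SOUND_MS = 343      # velocity threshold used by the intrusion detection

def haversine_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres, using hav(theta) = sin^2(theta / 2)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))

def is_velocity_anomaly(login1, login2):
    """Each login is (latitude, longitude, unix_timestamp); assumes infinite acceleration."""
    lat1, lon1, t1 = login1
    lat2, lon2, t2 = login2
    distance = haversine_distance(lat1, lon1, lat2, lon2)
    if t2 == t1:
        return distance > 0  # different places at the same instant
    return distance / abs(t2 - t1) > SPEED_OF_SOUND_MS

# Dublin -> New York (roughly 5,100 km) in one hour is far above the speed of sound.
print(is_velocity_anomaly((53.35, -6.26, 0), (40.71, -74.01, 3600)))  # True
print(is_velocity_anomaly((53.35, -6.26, 0), (53.35, -6.26, 60)))     # False
```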

Resource Anomalies

To detect anomalies in the resource strings, three different models are used:

  1. Check if the string length matches what was previously seen.
  2. Check if the characters used match what was previously seen.
  3. Check if the grammatical structure matches what was previously seen.

These models are based on the paper “Anomaly Detection of Web-based Attacks,” but with some modifications, especially in the case of the grammatical structure model.

When validating a resource string, the fastest approach is to validate with the length model first, since it is very cheap while the other models have a computational complexity that scales with the string length; next, one should use the character model. However, for the purpose of demonstration, the results of all three models are presented here.

1. String Length Model

A big problem in creating a statistical model based on string length is that it is very hard to assume anything about the length distribution. The same instance of Udaru can be used for handling resources of different types, such as door numbers and file paths. Therefore, the empirical distribution of resource string lengths can easily look something like this:

Hypothetical string length histogram: bimodal distributions or worse can be expected. In general, it is very hard to assume anything about the distribution of the resource string length.

Because it is very hard to assume anything about the distribution, a good idea is to use the Chebyshev inequality, as it makes no assumptions about the distribution:

P(|x − μ| ≥ t) ≤ σ² / t²

The Chebyshev inequality says that the probability that a value (x) deviates from the mean by more than a threshold (t) is at most σ²/t². This is reformulated as the probability that x deviates from the mean by more than the current deviation |l − μ|, where l is the current resource string length:

P(|x − μ| ≥ |l − μ|) ≤ σ² / (l − μ)²

To validate a string length, the bound on the probability P(|x − μ| > |l − μ|) is then thresholded by some fixed value; in this case, 10% is used. If the bound is greater than 10%, the string length is considered valid:

valid(l): σ² / (l − μ)² > 0.1
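A minimal sketch of this length model in Python. The training strings, the use of the population variance, and the function names are assumptions for illustration; this is not the tool's implementation.

```python
# String-length model: Chebyshev bound on the length deviation, thresholded at 10%.

def train_length_model(strings):
    """Estimate mean and (population) variance of the training string lengths."""
    lengths = [len(s) for s in strings]
    mean = sum(lengths) / len(lengths)
    variance = sum((l - mean) ** 2 for l in lengths) / len(lengths)
    return mean, variance

def is_valid_length(model, resource, threshold=0.1):
    mean, variance = model
    deviation = abs(len(resource) - mean)
    if deviation == 0:
        return True  # exactly the mean length is always valid
    bound = variance / deviation ** 2  # Chebyshev bound on P(|x - mean| >= deviation)
    return bound > threshold

model = train_length_model([
    "resource:server:/public/server",
    "resource:server:/public/files",
    "resource:server:/public/docs01",
])
print(is_valid_length(model, "resource:server:/public/stuff"))  # True
print(is_valid_length(model, "resource:server:/public/server/../../../etc/passwd"))  # False
```

The path-traversal string is roughly 20 characters longer than the training mean, so the bound collapses far below 10% and the string is rejected.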

2. Character Distribution Model

The character distribution model computes the empirical distribution from the training dataset. For validation, it then compares this with the character distribution of new resource strings using a χ2-test. If the p-value is greater than 10%, it is considered a valid string.

To increase the degrees of freedom, some characters are grouped into the same symbol. The character groupings are 0-9, a-f, g-z, A-F, and G-Z. The letters are split on F and G to allow the model to easily treat hexadecimal strings differently from strings that use all English letters.
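A sketch of the character distribution model using the groupings above. One deliberate simplification, named here so it is not mistaken for the article's method: instead of computing a full χ² p-value, the statistic is compared against the hardcoded upper-tail critical value for p = 0.10 at 5 degrees of freedom (6 character groups minus 1); the training strings are also invented for illustration.

```python
# Character distribution model with grouped symbols and a chi-squared check.

GROUPS = ["0123456789", "abcdef", "ghijklmnopqrstuvwxyz",
          "ABCDEF", "GHIJKLMNOPQRSTUVWXYZ"]
CHI2_CRITICAL_10PCT_DF5 = 9.236  # upper-tail critical value for p = 0.10, df = 5

def char_group(ch):
    for i, group in enumerate(GROUPS):
        if ch in group:
            return i
    return len(GROUPS)  # everything else: punctuation, separators, ...

def train_char_model(strings):
    """Empirical character-group distribution over the training dataset."""
    counts = [0] * (len(GROUPS) + 1)
    for s in strings:
        for ch in s:
            counts[char_group(ch)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def is_valid_characters(expected_dist, resource):
    observed = [0] * (len(GROUPS) + 1)
    for ch in resource:
        observed[char_group(ch)] += 1
    statistic = 0.0
    for obs, p in zip(observed, expected_dist):
        expected = p * len(resource)
        if expected > 0:
            statistic += (obs - expected) ** 2 / expected
        elif obs > 0:
            return False  # a character group never seen during training
    return statistic < CHI2_CRITICAL_10PCT_DF5

model = train_char_model([
    "resource:server:/public/server",
    "resource:server:/public/files",
    "resource:server:/public/docs",
])
print(is_valid_characters(model, "resource:server:/public/images"))    # True
print(is_valid_characters(model, "resource:server:/public/4e7f2a91"))  # False
```

The hexadecimal-looking suffix introduces digits, a group never observed in training, so the string is rejected immediately.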

3. Grammatical Model

The Grammatical Model is by far the most complex model. The idea is to dynamically build a finite automaton that represents all valid resource strings. The difficult part is creating a finite automaton that doesn’t just fit the training dataset but generalizes to the actual pattern. However, the finite automaton must not end up being so general that it fits everything.

Grammatical Model: This describes resource strings such as resource:server:/public/server and rejects resource strings such as resource:server:/public/server:/etc/passwd.

This balance between fitting the training dataset and generalizing is accomplished using Bayesian statistics.

P(Model | Dataset) ∝ P(Dataset | Model) · P(Model)

When building the model, the finite automaton is augmented with emission and transition probabilities, thereby making it a Markov Chain. The process of dynamically building this Markov Chain is called Bayesian Model Merging (see Inducing Probabilistic Grammars by Bayesian Model Merging).

As Markov Chains make it theoretically trivial to compute the probability of a sequence, computing P(Dataset | Model) is likewise trivial. As for the Bayesian prior P(Model), there is no universally correct choice. In the udaru-anomaly-detection tool, the following prior is used:

[prior equation not reproduced here; it is a function of the per-state emission and transition counts]

where es and ts are the numbers of emissions and transitions for state s.

Now that the unnormalized P(Model | Dataset) can be calculated, the model-merging strategy can be described.

For each sequence in the dataset, do the following greedily, sequence by sequence:

  1. Extend the Markov Model by creating a linked list of states between the start and the end. Mark each of these states as unmergeable, meaning another state cannot be merged into an unmergeable state; however, an unmergeable state can be merged into a mergeable state.
  2. Attempt to merge each unmergeable state into all other mergeable states; do this greedily from start to finish.
  3. If any of these merges causes P(Model | Dataset) to increase, consider the new model the best model. If no merge increases P(Model | Dataset), mark the unmergeable state as mergeable.

To make Bayesian Model Merging work in practice, you have to go quite a bit beyond what the original paper describes. Some key observations:

  • Use beam search to approximate P(Dataset | Model). The final grammatical model often has a polynomial complexity. However, when exploring possible models, it is almost unavoidable to encounter Markov Models containing cycles that cause an exponential computational complexity when calculating P(Dataset | Model).
  • If a string already fits the model, simply increase the probabilities along the most likely path.
  • Compute all probabilities in log-space. This is much more numerically stable and it makes the Bayesian prior very cheap to compute.
  • Consider checking for model pruning opportunities. If two states are self-linking and link to each other, attempt to merge those two states.
  • Before merging the new sequence into the model, merge equivalent subsequent emissions together. This dramatically shortens the new sequence and therefore the number of merges to check.

Together these optimizations bring the training time down from a couple of days to less than 5 minutes on 100 resource strings.

Running the Tool

The udaru-anomaly-detection tool is an experiment, which is why there is no UI and the CLI offers limited configuration.

To install the udaru-anomaly-detection tool, run:

git clone https://github.com/nearform/udaru-anomaly-detection
cd udaru-anomaly-detection
python3 setup.py develop

You can then set up Udaru with Trail integration using the following code:

const TrailPlugin = require('@nearform/trail-hapi-plugin')
const UdaruPlugin = require('@nearform/udaru-hapi-plugin')
const { UdaruTrailHapiPlugin } = require('@nearform/udaru-trail')
const Hapi = require('hapi')

const server = Hapi.server({ port: 8080 })

await server.register({ plugin: UdaruPlugin })
await server.register({ plugin: TrailPlugin })
await server.register({ plugin: UdaruTrailHapiPlugin })

await server.start()

Alternatively, if you just want to try it out, you can insert some data into the Trail server. In this case, you still need the Trail server but not the Udaru server. To start a stand-alone Trail server, use npx run trail-hapi-server. With the Trail server running, you can insert some fake data with:

udaru-anomaly-detection --insert

It is then possible to train and test the model with:

udaru-anomaly-detection --train --from 2017-01-01 --to 2017-12-31 --modeldir=./cli_models
udaru-anomaly-detection --test --from 2018-01-01 --to 2018-12-31 --modeldir=./cli_models

Future Work

An area of intrusion detection that we didn't look into is anomalies in a single user's behavior. For example, a user suddenly makes a lot of otherwise valid authorization requests at a time when that user normally doesn't make many requests. Such systems could also be used for automatically throttling requests or raising soft alarms. We didn't look into this as it requires real-life data, and we want to protect our users!

P.S. To see a demo, follow the link to the original source below. 


Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
