# Clickstream Analysis and Data Mining Techniques 101: An Introduction

# Clickstream Analysis and Data Mining Techniques 101: An Introduction

### Clickstream data can be elusive—but it's not impossible to analyze. Learn how to extract data-driven personas, predict next actions, and extract frequent sequential patterns using clickstream data.

Join the DZone community and get the full member experience.

Join For FreeHortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

Ah, the elusive clickstream data. Many platforms, like Facebook and Google Ads, rely on the data generated based on what a user clicks — and what they don't. To start analyzing clickstream data, we need first to be able to capture step-by-step a user's activity on a web page or application. And that is of great value in the hands of any internet marketer. Getting a 360-degree view of a customer by knowing what they are and are not clicking can bring you a huge improvement in both your products and your customers' experience.

## Data Collection

Either you have your data in your data warehouse, or you need to enrich it with more data sources you need to have a way to collect and store data consistently into a database. Simple as that.

## Data Preparation

Raw data is like a rough diamond; it requires some refinement before being truly valuable.

In the data world, refinement includes data processing, cleaning, and transformation of the initial data into something convenient for the analysis you are going to carry out.

In this case, we would like to have our data grouped into sessions. It would be good, too, if we could arrange the events of each session in time order before moving to the actual analysis.

In the above description, it is important to define what do we mean by the term "session." A session involves the consideration of one of the following:

- The time between two consequent application start events in the case of an
*application*. - The time from entry until logout or timeout (i.e. 20 minutes of no activity) in the case of a
*web page*.

In contrast to other data sequences, clickstream data can have varying lengths for different sessions and different users.

In order to transform the initially collected event log into clickstream data, we need to:

- Identify events/actions performed by the same user and group them together.
- Split them further into subgroups of events based what was performed during the same session according to the session's definition given above.

At this point, the dataset we are going to use for the rest of the analysis should look like this:

```
Session1 A8
Session2 A14 A4 A8 A11 A12
Session3 A14 A4 A8 A11 A12
Session4 A14 A4 A9 A8 A9 A8 A11 A12
Session5 A14 A4 A9 A8 A11 A24 A9 A9 A8 A1 A14 A4 A8 A11 A12
```

In this representation, each line corresponds to a session. The first field is the session's name, while the next fields are the actions performed by the user during this session.

## Model Construction

As in most cases, the methods we can deploy for solving this problem are many. In this post, we are going to evaluate two of them, as they are widely used and easy to understand.

### Markov Chains

Markov Chains work with sequential data, which is the type of data we're dealing with in this post.

The Markov process is a stochastic process that satisfies the Markov property of memorylessness. A Markov chain is a Markov process in either discrete or continuous time with a countable state space.

It can be graphically represented as a transition diagram along with the corresponding probabilities:

In clickstream analysis, we usually utilize these Markov Chains. The process:...takes the state:...from a finite set:...at each time:The order of a Markov Chain is derived from the number of recent states on which the current state, we assume, depends. Based on this, zero-order chains imply that the probability of being in a state in the next step is independent of all previous states.

The higher-order Markov Chain introduced by the Raftery (1985) will lead to more realistic models. At the same time, the parameters needed for the representation increase exponentially, so it is important to find a right balance between these two.

#### Fitting a Markov Chain

As mentioned before, at this point, our dataset looks like:

```
Session1 A8
Session2 A14 A4 A8 A11 A12
Session3 A14 A4 A8 A11 A12
Session4 A14 A4 A9 A8 A9 A8 A11 A12
Session5 A14 A4 A9 A8 A11 A24 A9 A9 A8 A1 A14 A4 A8 A11 A12
```

We chose to use the third-order Markov Chain on the above-produced data, as:

**The number of parameters needed for the chain's representation remains manageable**. As the order increases, the parameters necessary for the representation increase exponentially and thus managing them requires significant computational power.**As a rule of thumb, we would like at least half of the clickstreams to consist of as many clicks as the order of the Markov Chain that should be fitted**. There is no point in selecting a third-order chain if the majority of the clickstream consists of two states and so there is no state three steps behind to take into consideration.

Fitting the Markov Chain model gives us transition probabilities matrices and the lambda parameters of the chain for each one of the three lags, along with the start and end probabilities.

Start and end probabilities correspond to the probability that a clickstream will start or end with this specific event.

The transition probability matrix can be represented as a heat map with the y-axis representing the current state and x-axis representing the next one. The more red-ish the color, the more probable the indicated transition will occur. For example, the transition from Action 23 to Action 1 is very likely while the transition from Action 21 to Action 10 is not.

On the other hand, Akaike's information criterion (AIC) and Bayes' information criterion (BIC) (computed based on the log likelihood) can be used to compare two fitted Markov Chain models.

#### Predicting Clicks

In clickstream analysis, it is often very useful to predict the next click or final click (state) of a user given the pattern they have followed until now. In this way, data-driven personas can be constructed that will incorporate users' behavior.

Along with the click prediction, the probability of transition is also calculated.

For example, when in state A14, the most probable next transitions are:

This can be extended to find the most common use case scenario of an app or a web page.

Starting from the state with the highest start probability and following the most probable transitions, we end up with the data-driven persona.

#### Clustering Clickstream Data

In most cases, due to the complexity of websites or applications, same clickstreams are difficult to occur as the different paths a user can follow are many. A large number of monitored clickstreams makes the analysis more difficult unless we group together similar clickstreams and user profiles.

This way, a company can:

- Find customer segments
- Identify communities of visitors with similar interests

In our case, we performed a k-means clustering with two centers. At this point, it is important to provide a meaningful interpretation for each of the clusters.

For our clustering, we noticed that cluster by cluster, the average length of clickstreams increase. This means that k-means clustered the clickstreams based on the number of actions the user that produced them performed during a session.

Graphically, the clusters can be represented as:

The y-axis represents a unique identifier for each session while the x-axis corresponds to the total number of states changed during each session.

Based on this, the light blue can be interpreted as users who perform a few actions and probably don't spend much time on our page or application. Especially for an application, it may represent users who achieved their goal easily and had no problem using the interface.

The dark blue cluster represents users that perform many actions and spend much more time navigating.

This interpretation can completely change on different data. In fact, there is no rule of thumb about how to perform the cluster interpretation and in most cases requires deep understanding on the data and field expertise.

### Mining Associations With cSPADE

Instead of modeling clickstream data as transition probabilities, we can represent them as sequential patterns. We can then mine sequential patterns for finding those patterns that have a particular minimum support, or in other words, occur a small number of times in user's clickstream data.

The SPADE algorithm mines frequent patterns utilizing temporal joins along with efficient lattice search techniques. In the first step, the algorithm computes the frequencies of sequences with only one item, in the second step with two items and so on.

With this approach, a company can explore, understand, or predict the visitors' navigation patterns through a website or application.

Using the cSPADE algorithm, we extract all pattern sequences with a minimum support that we've defined.

For example, the following 22 pattern sequences are supported at least 40% of the clickstreams:

For a given sequence pattern S, we can predict the next click by searching for the pattern sequence with the highest support that starts with S.

For example, after having just performed (A14), the most probable next action is (A4) according to pattern sequence 8 with support 73.5%.

Of course, lowering the support would give us pattern sequences that are less frequent in our clickstreams. This may be useful in the case we want to extract pattern sequences that yield errors or failures in our software.

## Conclusion

At this point, we have explored Markov Chains and the SPADE algorithm for mining clickstream sequence data. By deploying the proposed models, we can:

- Extract data-driven personas for the most frequent digital journeys of our customers through an app or a web page.
- Predict the next actions based on those performed so far.
- Extract frequent sequential patterns.

However, all this analysis is worthless unless it is driving actions. Based on the above results, a recurring process of reviewing must be initialized regarding web or application design and content and marketing strategy.

The code of the post can be found on GitHub.

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub. Join the discussion.

Published at DZone with permission of Eleni Markou , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

## {{ parent.title || parent.header.title}}

## {{ parent.tldr }}

## {{ parent.linkDescription }}

{{ parent.urlSource.name }}