Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Call Detail Record (CDR) Analysis K-Means Clustering Using Tableau

DZone's Guide to

Call Detail Record (CDR) Analysis K-Means Clustering Using Tableau

Learn about the clustering of customer activities for 24 hours with Tableau 10's K-means clustering feature, which automatically groups similar data points.

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

In this blog, we will discuss the clustering of customer activities for 24 hours by using the K-means clustering feature in Tableau 10. Tableau 10's clustering feature automatically groups similar data points together. This type of clustering helps you create statistically based segments that provide insights about similarities in different groups and the performance of the groups when compared to each other.

You can use clustering on any type of visualization, ranging from scatter plots to text tables and even maps. In our previous blog post, Call Detail Record Analysis – K-Means Clustering With R, we discussed CDR analysis using the unsupervised K-means clustering algorithm.

Data Description

A daily activity file from Dandelion API is used as a data source, where the file contains CDR records generated by the Telecom Italia cellular network over the city of Milano. The daily CDR activity file contains information for 10,000 grids about SMS in and out, call in and out, and Internet activity.This dataset has five million records and the size of the dataset is 314 MB. selectThe below table, created in Tableau, shows the total activity of SMS, call, and Internet activity by hours and the total number of records per hour. The Grand Total section shows the cumulative total activity for SMS, call, and Internet. select

Data Preprocessing

To preprocess data, perform the following steps:

  • Derive new fields such as activity_start_time, activity_date, and activity hour from the time interval field.
  • Find the total activity, which is the sum of SMS in and out activity, call in and out activity, and Internet traffic activity.
  • Find the total SMS activity, which is the sum of SMS in and out activity.
  • Find the total call activity, which is the sum of call in and out activity.

The calculation of the above new fields can be done in Tableau easily with Create Calculated Field features.

Calculated Fields in Tableau

The below screenshot shows the formula used in Tableau for calculating activity_start_time. select If the epoch time is in milliseconds, then divide the value by 1,000 to convert into seconds. select The formulas for calculating other fields are as follows:

activity_date:

DATE ([activity_start_time])

activity_hour:

DATEPART ('hour', [activity_start_time])

total_activity:

SUM(IFNULL([call_in_activity],0) + IFNULL([call_out_activity],0) + IFNULL([sms_in_activity],0) + IFNULL ([sms_out_activity],0) + IFNULL([internet_traffic_activity],0))

total_sms_activity:

SUM (IFNULL ([sms_in_activity], 0) + IFNULL ([sms_out_activity], 0))

total_call_activity:

SUM (IFNULL ([call_in_activity], 0) + IFNULL ([call_out_activity], 0))

Note: IFNULL () is used to replace null values with zero while doing SUM (). The below screenshot shows the derived fields in Tableau:select

CDR Exploratory Data Analysis (EDA)

Tableau is bundled with rich sets of visualization to analyze the data. Exploratory data analysis is the process of analyzing the data visually. It involves outlier detection, anomaly detection, missing values detection, aggregating the values, and producing the meaningful insights. The following visualizations are created as part of EDA on five million data.

Total Activity by Activity Hours

This visualization is used to find out:

  • Total activities by hour.
  • Hours producing more traffic respective to total activity.
  • Hours producing less traffic respective to total activity.

selectFrom the above visualization, it is evident that most of the activities happened in the hour of 23 and much less activity happened in the hours of 5 and 6.

Top and Bottom 10 Square Grids by Total Activity

This visualization is used to find out:

  • Top 10 square grids producing more traffic with total activity in those grids.
  • Bottom 10 square grids producing more traffic with total activity in those grids.

selectFrom the above visualization, it is evident that most of the activities happened in the square grid ID 5059 and less activities happened in the square grid ID 497.

Call Detail Record Clustering in Tableau

Let's take a look at some clustering.

Clustering in Tableau

If the number of clusters is not specified by a user, Tableau picks the number of clusters corresponding to the first local maximum of the Calinski-Harabasz index. Tableau uses the following:

  • K-means algorithm for clustering.
  • Lloyd’s algorithm with squared Euclidean distances to compute the K-means clustering for each K.
  • Calinski-Harabasz criterion to assess cluster quality.

By default, K-means will be run for up to 25 clusters if the first local maximum of the index is not reached for a smaller value of K. A maximum value of 50 clusters can be set. To determine optimal number of clusters (K), consider between-group sum of squares (SSB), within group sum of squares (SSW), and the total sum of squares from the cluster result.

Total Activity by Activity Hours Cluster

Sometimes, data groupings make immediate sense. total_activity measure is used as a variable for clustering and K value is provided as 8. select The Describe clusters dialog box provides information about the models that Tableau computed for clustering. These statistics can be used to assess clustering quality.

Describe Clusters: Models Tab

Analysis of Variance (ANOVA) is a collection of statistical models and associated procedures useful for analyzing variation within and between observations that have been partitioned into groups or clusters. select In this use case, ANOVA is computed for the total_activity variable and the resulting analysis of variance table is used to determine total_activity variable effectiveness to distinguish clusters.

Describe Clusters: Summary Tab

The Summary tab identifies the inputs used to generate the clusters and provides some statistics characterizing the clusters. select The clusters are ranked using the Total Activity by Activity Hours cluster from more to less traffic as follows:

  • Cluster 1 produced more traffic activities, which include only activity hour 23.
  • Cluster 7 is second and includes activity hours 11, 15, 16, 17, and 18.
  • Cluster 2 is third and includes activity hours 0, 10, 12, 13, 14, 19, and 20.
  • Cluster 8 is fourth and includes activity hour 21.
  • Cluster 3 is fifth and includes activity hours 1, 9, and 22.
  • Cluster 4 is sixth and includes activity hour 2.
  • Cluster 5 is seventh and includes activity hours 3 and 8.
  • Cluster 6 is less traffic and includes activity hours 4, 5, 6, and 7.

Total SMS Activity by Activity Hours Cluster

This cluster is based on total_sms_activity measure as the variable for clustering and K value provided as 8. select

Internet Activity by Activity Hours Cluster

This cluster is based on sum (internet_traffic_activity) measure as the variable for clustering and K value provided as 8. select

SMS in and Out Activity by Activity Hours Cluster

This cluster is based on sum (sms_in_activity) and sum (sms_out_activity) measures as the variable for clustering and K value provided as 6. select

Call in and Out Activity by Activity Hours Cluster

This cluster is based on sum (call_in_activity) and sum (call_out_activity)  measures as the variable for clustering and K value provided as 6. select

CDR analysis (by Treselle Systems).

Conclusion

By using this clustering mechanism, you can find the clusters that are making more traffic to the telecom network in the measure of total activity. Similarly, you can obtain more information like square grid and country code information to understand the square grid likely creating more revenue and more traffic to the telecom network and to target high customers based on their geo-location.

References

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

Topics:
tableau ,big data ,tutorial ,data analytics ,k-means clustering

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}