
Call Detail Record Analysis: K-Means Clustering With R


By using this clustering mechanism, you can find the clusters that generate the most traffic on the telecom network, measured by total activity.


A Call Detail Record (CDR) is the information captured by telecom companies during a customer's call, SMS, and Internet activity. Combined with customer demographics, this information provides greater insight into the customer's needs. Most telecom companies use CDR information for fraud detection by clustering user profiles, for reducing customer churn based on usage activity, and for targeting profitable customers by using RFM analysis.

In this blog, we will discuss clustering customer activity over 24 hours by using the unsupervised K-means clustering algorithm. It is used to understand segments of customers with respect to their usage by hour.

For example, a customer segment with high activity may generate more revenue. A customer segment with high activity in the night hours might be fraudulent.

Data Description

A daily activity file from the Dandelion API is used as the data source. The file contains CDR records generated by the Telecom Italia cellular network over the city of Milano. The daily CDR activity file contains information for 10,000 grids about SMS in and out, call in and out, and Internet activity. As the file has five million records, a subset containing activity information for 500 square IDs is used for this use case. The structure of the dataset is as follows:

[Figure: structure of the dataset]

Data Source Features Description

The actual dataset contains eight numerical features: SMS in activity, SMS out activity, call in activity, call out activity, Internet traffic activity, the square grid ID where the activity happened, the country code, and a timestamp indicating when the activity started.

[Figure: description of the data source features]
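As a rough illustration of loading such a file and taking a 500-square subset, here is a minimal base R sketch. The column names, the tab-separated headerless layout, and the choice of square IDs 1 to 500 are all assumptions made for this example; the article does not state which IDs were kept. A tiny temporary file stands in for the real daily file.

```r
# Write a tiny stand-in file; the real daily file is assumed to be
# tab-separated with no header row and empty fields for missing activity.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "1\t1383264000000\t39\t0.5\t0.3\t0.1\t0.2\t10.0",
  "1\t1383267600000\t39\t\t0.2\t0.4\t0.1\t12.5",
  "501\t1383264000000\t39\t1.2\t\t0.6\t0.3\t4.0",
  "2\t1383346800000\t49\t0.1\t0.4\t\t0.2\t8.0"
), tmp)

# Hypothetical column names for the eight features described above
cdr_cols <- c("square_id", "time_interval", "country_code",
              "sms_in", "sms_out", "call_in", "call_out", "internet")

cdr <- read.table(tmp, sep = "\t", header = FALSE, col.names = cdr_cols)

# Keep 500 square IDs; using IDs 1..500 is an assumption for illustration
cdr_subset <- cdr[cdr$square_id <= 500, ]
str(cdr_subset)
```

Empty numeric fields are read as NA, which matters later when the activity columns are summed.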

Data Preprocessing

Data preprocessing involves data cleansing, data type conversion, and wrangling. To preprocess the data, perform the following steps:

  • Convert the square ID and the country code into factor columns as part of type conversion.
  • Derive new fields such as “activity start date” and “activity hour” from “time interval” field.
  • Find the total activity, which is the sum of SMS in and out activity, call in and out activity, and Internet traffic activity.


  • Create new derived fields as mentioned above.
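The preprocessing steps above can be sketched in base R as follows. This is an illustration on a toy data frame, not the article's code: the column names are hypothetical, the time interval is assumed to be a Unix timestamp in milliseconds interpreted in UTC, and missing activity values are assumed to count as zero.

```r
# Toy stand-in for the CDR subset (hypothetical column names)
cdr <- data.frame(
  square_id     = c(1, 1, 2, 2),
  time_interval = c(1383264000000, 1383267600000, 1383264000000, 1383346800000),
  country_code  = c(39, 39, 39, 49),
  sms_in        = c(0.5, NA, 1.2, 0.1),
  sms_out       = c(0.3, 0.2, NA, 0.4),
  call_in       = c(0.1, 0.4, 0.6, NA),
  call_out      = c(0.2, 0.1, 0.3, 0.2),
  internet      = c(10.0, 12.5, NA, 8.0)
)

# Type conversion: identifiers become factors
cdr$square_id    <- as.factor(cdr$square_id)
cdr$country_code <- as.factor(cdr$country_code)

# Derive the activity start date and hour (assumes epoch milliseconds, UTC)
ts <- as.POSIXct(cdr$time_interval / 1000, origin = "1970-01-01", tz = "UTC")
cdr$activity_start_date <- as.Date(ts, tz = "UTC")
cdr$activity_hour       <- as.integer(format(ts, "%H"))

# Total activity: sum of the five activity columns, treating NA as zero
activity_cols <- c("sms_in", "sms_out", "call_in", "call_out", "internet")
cdr$total_activity <- rowSums(cdr[, activity_cols], na.rm = TRUE)

cdr[, c("square_id", "activity_start_date", "activity_hour", "total_activity")]
```

If the timestamps should be read in local Milan time instead, the `tz` argument would need to change accordingly.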


CDR Exploratory Data Analysis (EDA)

Exploratory Data Analysis is the process of analyzing the data visually. It involves outlier detection, anomaly detection, missing value detection, aggregating values, and producing meaningful insights. The plot for “Total activity by activity hours” is as follows:

[Figure: total activity by activity hours]

From the above plot, it is evident that the most activity happened in hour 23 and the least activity happened in hour 06. The plot for “Top 25 square grids by total activity” is as follows:

[Figure: top 25 square grids by total activity]

From the above plot, it is evident that the most activity happened in square grid ID 147. The plot for “Top 10 countries by total activity” is as follows:

[Figure: top 10 countries by total activity]

From the above plot, it is evident that the country code 39 has the highest activity.
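The aggregations behind plots like these can be sketched in base R as below. The data frame, its column names, and the `top_n_by` helper are hypothetical and exist only for this example; the toy values are not taken from the dataset.

```r
# Toy preprocessed data (hypothetical column names and values)
cdr <- data.frame(
  square_id      = factor(c(147, 147, 12, 12, 30)),
  country_code   = factor(c(39, 39, 39, 49, 39)),
  activity_hour  = c(23L, 6L, 23L, 23L, 6L),
  total_activity = c(50, 2, 30, 5, 1)
)

# Total activity per activity hour
by_hour <- aggregate(total_activity ~ activity_hour, data = cdr, FUN = sum)

# Hypothetical helper: total activity per group, largest first, top n rows
top_n_by <- function(df, group_col, n) {
  totals <- aggregate(df$total_activity,
                      by = list(group = df[[group_col]]), FUN = sum)
  names(totals) <- c(group_col, "total_activity")
  head(totals[order(-totals$total_activity), ], n)
}

top_squares   <- top_n_by(cdr, "square_id", 25)
top_countries <- top_n_by(cdr, "country_code", 10)

by_hour
top_squares
top_countries
```

Each result is a small data frame that could be passed to a plotting call, for example `barplot(by_hour$total_activity, names.arg = by_hour$activity_hour)`.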

Call Detail Record Clustering

K-means clustering is a popular unsupervised clustering algorithm used to find patterns in data. Here, K-means is applied to “total activity and activity hours” to find the usage pattern with respect to the activity hours.

The elbow method is used to find the optimal number of clusters for the K-means algorithm.

[Figure: elbow plot of SSE by number of clusters]

By looking at the above plot, it is evident that the Sum of Squared Error (SSE) decreases only minimally after cluster number 10 and that there is no unexpected increase in the error distance. So, the best number of clusters for K-means on this dataset is 10. The summary of the CDR K-means model and the center calculated for each cluster is as follows:

[Figure: summary of the CDR K-means model and its cluster centers]
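A minimal sketch of the elbow computation and the final fit, using `kmeans()` from base R's stats package. The data here is synthetic and the column names are hypothetical; whether the article scaled the features or used `nstart` is not known, so those choices are assumptions of this example.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Synthetic (activity hour, total activity) pairs standing in for the CDR subset
features <- data.frame(
  activity_hour  = sample(0:23, 600, replace = TRUE),
  total_activity = c(rnorm(200, mean = 20,  sd = 5),
                     rnorm(200, mean = 120, sd = 15),
                     rnorm(200, mean = 400, sd = 40))
)

# Elbow method: total within-cluster sum of squares (SSE) for k = 1..15
max_k <- 15
wss <- sapply(1:max_k, function(k) {
  kmeans(features, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# Fit the final model with the number of clusters read off the elbow plot
# (10 in the article; the right value depends on the data)
cdr_kmeans <- kmeans(features, centers = 10, nstart = 10, iter.max = 50)
cdr_kmeans$centers
table(cdr_kmeans$cluster)
```

Plotting `wss` against `1:max_k`, for example with `plot(1:max_k, wss, type = "b")`, gives the elbow curve. Because the two features are on different scales, total activity dominates the distance here; standardizing with `scale()` first is a common alternative.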

The heat map of cluster, activity hour, and total activity is as follows:

[Figure: heat map of cluster, activity hour, and total activity]

From the above plot, it is evident that clusters 1, 7, and 9 have activity across all 24 hours and are the higher revenue-generating clusters. Clusters 1, 5, 7, 9, and 10 have activity during the night hours. Cluster 5 has activity from 11.5 to 17 hours.
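The table behind a heat map like this can be sketched in base R by cross-tabulating total activity by cluster and hour. The toy data and column names are hypothetical, and summing total activity per cell is an assumption about what the heat map shows.

```r
# Toy records with a cluster label already assigned (hypothetical values)
clustered <- data.frame(
  cluster        = c(1, 1, 2, 2, 2, 3),
  activity_hour  = factor(c(0, 23, 12, 12, 13, 23), levels = 0:23),
  total_activity = c(10, 20, 5, 7, 3, 40)
)

# Cluster x hour table of summed total activity; hours with no records are 0
heat <- xtabs(total_activity ~ cluster + activity_hour, data = clustered)

# Number of distinct hours in which each cluster shows any activity
hours_active <- rowSums(heat > 0)

heat
hours_active
```

A cluster whose `hours_active` value is 24 would correspond to the "activity for all 24 hours" reading above. The table could be drawn with base R's `heatmap(unclass(heat), Rowv = NA, Colv = NA, scale = "none")`.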

Conclusion

By using this clustering mechanism, you can find the clusters that generate the most traffic on the telecom network, measured by total activity. Similarly, you can bring in more information, such as the square grid and country code, to understand which square grids are likely to create more revenue and more traffic on the telecom network, and to target high-activity customers based on their geo-location. In an upcoming blog, we will discuss how RFM can be used to analyze call detail records. For more, here's the GitHub location.


Topics:
big data, k means clustering, r, call detail record

Published at DZone with permission of Rathnadevi Manivannan. See the original article here.

