Cluster Analysis and Big Data: The Basics

Cluster analysis is an unsupervised way to gain insight from Big Data. It can show you relationships in data that you may not realize are there.

My last article on machine learning discussed different types of artificial intelligence that can be applied to Big Data. This article will discuss cluster analysis, which is a form of unsupervised pattern recognition.

Let's start with a basic definition. Pattern recognition algorithms are used to detect regularities in data, and they come in two basic flavors: supervised and unsupervised. In supervised pattern recognition, the algorithm is first trained against a dataset whose correct answers are already known, which helps it detect patterns. In unsupervised pattern recognition, no such training data is provided; patterns are detected by other means, such as statistical analysis.

What are the benefits of unsupervised pattern recognition compared with supervised? To answer this question, bear in mind that some prior knowledge must go into designing supervised pattern recognition software, because the data used to train the software must be pre-selected.

In unsupervised pattern recognition, this is unnecessary. A group of data is simply run through an algorithm to observe what's "interesting." We can ask questions about data "on the fly," without deciding in advance which relationships to look for.

With supervised pattern recognition, if a few weeks down the road it becomes apparent that other data should have been accounted for, the algorithm will need to be re-trained, and this will involve some additional software development. With unsupervised pattern recognition, the algorithm is simply run against the new data.
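To make that last point concrete, here is a minimal sketch of an unsupervised algorithm being run, and then simply re-run when more data shows up. It uses k-means (Lloyd's algorithm), which is my choice for illustration and not something prescribed above. The function name `k_means`, the sample points, the cluster counts, and the seed are all assumptions.

```python
import math
import random


def k_means(points, k, iterations=100, seed=0):
    """Group points (tuples of numbers) into k clusters with Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break  # nothing moved, so the assignment is stable
        centroids = new_centroids
    return labels, centroids


# Hypothetical data: no labels are supplied anywhere.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)

# More data arrives later: there is no retraining step, just another run.
more_points = points + [(15.0, 1.0), (16.0, 1.5), (15.5, 0.5)]
new_labels, new_centroids = k_means(more_points, k=3)
```

Note that the number of clusters `k` still has to be chosen by the person running the analysis, so "no training" does not mean "no decisions."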

Cluster Analysis is a form of unsupervised pattern recognition, and is defined by Wikipedia as follows:

"Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)."

This is easily explained visually. Please see the following diagram (by Chire, own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=17087089).

[Image: cluster analysis diagram]

* Cluster analysis works on numeric representations of data, so a conversion must often take place first.

Think of each point as a relationship between two pieces of data. For example, a point may represent yearly spending by a department (the y-axis showing spending in hundreds of thousands of dollars and the x-axis showing numeric representations of departments), or sales by geographic location (the y-axis showing sales in hundreds of thousands of dollars and the x-axis showing numeric representations of geographic coordinates).* The preceding diagram shows how such points fall into clusters. On its own, this may not lead to immediate insight.
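The conversion to numbers can be sketched as follows. This is only an illustration under my own assumptions: the helper `encode_categories`, the department names, and the spending figures are all made up, and assigning codes in first-seen order is just one simple option.

```python
def encode_categories(values):
    """Map each distinct category to a numeric code (0.0, 1.0, ...) in first-seen order."""
    codes = {}
    for value in values:
        codes.setdefault(value, float(len(codes)))
    return codes


# Hypothetical records: (department, yearly spending in hundreds of thousands of dollars).
records = [("Sales", 4.2), ("Marketing", 3.9), ("Sales", 4.5), ("Engineering", 1.1)]

codes = encode_categories(dept for dept, _ in records)

# Each record becomes an (x, y) point that a clustering algorithm can work with.
points = [(codes[dept], spend) for dept, spend in records]
```

One caveat with a scheme like this: the codes put departments at arbitrary distances from one another, so the encoding itself can influence which points end up clustered together.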

The next step is for an analyst to look at the data that makes up each cluster. For instance, an examination of the green cluster might reveal a concentration of expenses made by departments connected to sales. Or perhaps the blue cluster consists of geographic locations in the Northeast. The analyst is asking: 1) what is interesting about the clusters, and 2) what data attributes could be causing the data to cluster in this way? Running a cluster analysis on data that one wouldn't necessarily expect to be related makes it possible to determine whether relationships do in fact exist.
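Here is a rough sketch of what that examination might look like once each record has a cluster label. The records, the `function` attribute, the labels, and the helper `summarize_clusters` are hypothetical; the labels are hard-coded here and would normally come from whichever clustering algorithm was run.

```python
from collections import Counter, defaultdict


def summarize_clusters(records, labels, attribute):
    """Count how often each value of `attribute` occurs within each cluster."""
    summary = defaultdict(Counter)
    for record, label in zip(records, labels):
        summary[label][record[attribute]] += 1
    # Most frequent values first, so the dominant attribute stands out.
    return {label: counts.most_common() for label, counts in summary.items()}


# Hypothetical expense records and the cluster label assigned to each one.
records = [
    {"department": "Inside Sales", "function": "sales"},
    {"department": "Field Sales", "function": "sales"},
    {"department": "Platform", "function": "engineering"},
    {"department": "Accounts", "function": "finance"},
    {"department": "Mobile", "function": "engineering"},
]
labels = [0, 0, 1, 0, 1]

summary = summarize_clusters(records, labels, "function")
for label, counts in sorted(summary.items()):
    print(f"cluster {label}: {counts}")
```

A cluster dominated by one attribute value, as cluster 0 is by sales-related departments in this made-up data, is the kind of lead the analyst would then follow up on.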

Several types of clustering algorithms are available, such as connectivity-, centroid-, distribution-, and density-based algorithms. I will leave it to the reader to research the various algorithms and how they work. Hopefully, this blog has given you an idea of the practical applications of clustering.
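As a starting point for that research, here is a small sketch of the connectivity-based idea: points that sit within some distance of each other are linked, and each chain of linked points forms a cluster (single linkage with a fixed distance cutoff). The function name `connectivity_clusters`, the `max_gap` parameter, and the sample points are my own assumptions, and this is far from a production implementation.

```python
import math


def connectivity_clusters(points, max_gap):
    """Label points so that points linked by hops of at most max_gap share a cluster."""
    parent = list(range(len(points)))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    # Link every pair of points that lies within max_gap of each other.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= max_gap:
                parent[find(j)] = find(i)

    # Number the clusters 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]


# Hypothetical points: a chain of three close points and a separate pair.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
labels = connectivity_clusters(points, max_gap=1.5)
print(labels)  # [0, 0, 0, 1, 1]
```

Notice that the first and third points are more than `max_gap` apart yet still share a cluster, because the second point connects them. A centroid-based method groups points around a center instead, which is one reason the different families can disagree on the same data.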

Summary

In summary, cluster analysis is an unsupervised way to gain insight from Big Data. It can show you relationships in data that you may not realize are there. jKool is a Big Data analysis solution that takes advantage of clustering. Stay tuned for follow-up articles with more machine learning examples on Big Data, and see the jKool website at www.jkoolcloud.com.

Published at DZone with permission of
