
Clustering for Everyday Life


k-means is an algorithm that finds k groups (where k is defined) on a given dataset. See how it can be used to help plan a trip.


Let’s consider this scenario: I love walking, so when I visit a city, I want to walk as much as possible — but I also want to optimize my time to see as many attractions as possible. Now I want to plan my next trip to Gotham City to visit some Batman places. I found 1,000 places where Batman has appeared, and I have, at most, four days. I need to group those 1,000 places into four buckets so that the points in each bucket are close to a center where I can leave my car, and then plan each day of my trip around one bucket.

How can I do this?

This kind of problem can be classified as a clustering problem. But what is clustering? Clustering, or cluster analysis, is the task of grouping a set of data into subsets of homogeneous or similar items. To solve this kind of problem, it is necessary to:

  • Define the “resemblance” measure between elements (the concept of similarity).
  • Find the subsets of elements that are “similar” according to the chosen measure.

The algorithm determines which elements form a cluster and what degree of similarity unites them within a cluster. As I said in my previous post, clustering is a problem that can be solved with algorithms belonging to the unsupervised methods, because the algorithm doesn’t have any prior information about the structure and characteristics of the clusters.

In particular, for this problem, I'll use the k-means algorithm. k-means is an algorithm that finds k groups (where k is defined a priori) on a given dataset. Each group is described by a centroid that represents the “center” of the cluster. The concept of “center” refers to the concept of distance that we have chosen for the specific problem.

For our problem, the concept of distance is simple: it is the real distance between two points defined by a latitude and a longitude. For this reason, the Euclidean distance can't be used; it is necessary to introduce the spherical law of cosines to compute the correct distance between geographical points.
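For reference, here is a minimal sketch of the spherical law of cosines in Python (the Earth radius of 6,371 km is the usual approximation):

from math import acos, cos, sin, radians

EARTH_RADIUS_KM = 6371.0

def spherical_distance(lat1, lon1, lat2, lon2):
    # Distance in km between two (latitude, longitude) points given in degrees.
    phi1, phi2 = radians(lat1), radians(lat2)
    delta_lambda = radians(lon2 - lon1)
    # min() guards against floating-point values slightly above 1.0.
    central_angle = acos(min(1.0, sin(phi1) * sin(phi2) +
                             cos(phi1) * cos(phi2) * cos(delta_lambda)))
    return EARTH_RADIUS_KM * central_angle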

But how does the k-means algorithm work? It follows an iterative procedure:

  1. Choose k initial centroids (for example, k random points of the dataset).
  2. Assign each point to the cluster of its nearest centroid.
  3. Recompute each centroid as the mean of the points assigned to it.
  4. Repeat steps 2 and 3 until the centroids stop moving (or for a fixed number of steps).

[Figure: flow graph of the k-means iterative procedure]
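To make the procedure concrete, here is a minimal NumPy sketch of that loop (it uses plain Euclidean distance and ignores the empty-cluster edge case for brevity):

import numpy as np

def kmeans(points, k, num_steps=100):
    points = np.asarray(points)
    # Step 1: pick k random points of the dataset as the initial centroids.
    centroids = points[np.random.choice(len(points), k, replace=False)]
    for _ in range(num_steps):
        # Step 2: assign each point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assignments = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it.
        centroids = np.array([points[assignments == j].mean(axis=0)
                              for j in range(k)])
    return centroids, assignments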

The popularity of this algorithm comes from its:

  • Convergence speed.
  • Ease of implementation.

On the other hand, the algorithm doesn’t guarantee that you'll achieve the global optimum. The quality of the final solution strongly depends on the initial set of clusters. Since the algorithm is extremely fast, it’s possible to apply it several times and choose the best solution.

This algorithm starts with the definition of k clusters, where k is defined by the user. But how does the user know if k is the correct number? And how does the user know if the clusters are “good” clusters? One possible metric to measure the quality of the clusters is the SSE (sum of squared errors), where the error is the distance from each point to its assigned centroid. Because this error is squared, more emphasis is placed on the points far from the centroid.
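A minimal sketch of the SSE computation, given the points, the centroids, and the assignments produced by the algorithm:

import numpy as np

def sse(points, centroids, assignments):
    # Sum of squared distances from each point to its assigned centroid.
    points, centroids = np.asarray(points), np.asarray(centroids)
    assignments = np.asarray(assignments)
    return sum(np.sum((points[assignments == j] - centroids[j]) ** 2)
               for j in range(len(centroids)))

Running the algorithm for several values of k and plotting the SSE against k is a common way to choose k (the so-called elbow method); likewise, for a fixed k, you can run several random restarts and keep the solution with the lowest SSE.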

Implementation

For this implementation, we need Python (I use 3.7, but 2.x is OK) and some packages:

  • matplotlib.
  • TensorFlow.
  • numpy.
  • pandas.

Installing those packages is simple:

  • For Python 2.x:
    • pip install <package name>
  • For Python 3.x:
    • pip3 install <package name>

In any case, you can follow the installation instructions in the documentation of each package.
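For example, to install all four packages at once with Python 3:

pip3 install matplotlib tensorflow numpy pandas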

So far, so good. OK, let's go deep into the code:

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import pandas as pd

num_clusters = 4
num_steps = 100
num_vectors = 1000

lat_values = []
lon_values = []
vector_values = []


def display_partition(x_values, y_values, assignment_values):
    # Color each point by the cluster it was assigned to.
    colors = ["red", "blue", "green", "yellow"]
    labels = [colors[assignment] for assignment in assignment_values]
    df = pd.DataFrame(dict(x=x_values, y=y_values, color=labels))
    fig, ax = plt.subplots()
    ax.scatter(df['x'], df['y'], c=df['color'])
    plt.show()


# Generate a random training set of GPS positions around (45.7, 9.1).
for i in range(num_vectors):
    lat = np.random.normal(45.7, 0.7)
    lon = np.random.normal(9.1, 0.8)
    vector_values.append((lat, lon))
    lat_values.append(lat)
    lon_values.append(lon)

plt.plot(lat_values, lon_values, 'o', label='Input Data')
plt.legend()
plt.show()

print('lat: {}'.format(lat_values))
print('lon: {}'.format(lon_values))
print('vector: {}'.format(vector_values))

# Convert the training set into a tensor.
vectors = tf.constant(vector_values)

# Pick num_clusters random points of the dataset as the initial centroids.
n_samples = tf.shape(vectors)[0]
random_indices = tf.random_shuffle(tf.range(0, n_samples))
centroid_indices = tf.slice(random_indices, [0], [num_clusters])
centroids = tf.Variable(tf.gather(vector_values, centroid_indices))

# Expand dimensions so that points and centroids broadcast pairwise.
expanded_point = tf.expand_dims(vectors, 0)
expanded_centroid = tf.expand_dims(centroids, 1)

# Approximate squared distance between GPS points: scale the longitude
# difference by the cosine of the latitude (converted to radians), then
# square; one degree is roughly 110.25 km (see the approximation below).
diff = tf.subtract(expanded_point, expanded_centroid)
x = diff[:, :, 0]
y = tf.multiply(diff[:, :, 1], tf.cos(expanded_centroid[:, :, 0] * np.pi / 180))
dist = tf.multiply(tf.square(x) + tf.square(y), 110.25 ** 2)

# Assignment step: each point goes to its nearest centroid.
assignments = tf.to_int32(tf.argmin(dist, 0))
partitions = tf.dynamic_partition(vectors, assignments, num_clusters)

# Update step: move each centroid to the mean of its partition.
means = tf.concat([tf.expand_dims(tf.reduce_mean(partition, 0), 0)
                   for partition in partitions], 0)
update_centroids = tf.assign(centroids, means)

# TensorFlow session:
init_op = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init_op)

for step in range(num_steps):
    _, centroid_values, assignment_values = sess.run(
        [update_centroids, centroids, assignments])

display_partition(lat_values, lon_values, assignment_values)

First of all, I defined the parameters of my problem:

  • The number of points: 1,000.
  • The number of clusters: 4.
  • The number of computational steps: 100.

In this particular example, I used as the training set a set of 1,000 GPS positions generated randomly (the for loop over num_vectors) around the position (45.7, 9.1). If you have a file with real coordinates, you can load and use those instead; a sketch follows the plot below. The plt.plot call displays the training set:

[Plot of the randomly generated input data]
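For instance, if your places were stored in a CSV file, the loading step could look like this sketch (the file name and column names here are hypothetical):

import pandas as pd

# Hypothetical file with one row per place and 'lat'/'lon' columns.
places = pd.read_csv('batman_places.csv')
lat_values = places['lat'].tolist()
lon_values = places['lon'].tolist()
vector_values = list(zip(lat_values, lon_values))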

The tf.constant call converts the vector values into a tensor usable by TensorFlow.

After randomly building the training set, we pick the initial centroids (the tf.random_shuffle, tf.slice, and tf.gather calls) and convert them into a variable that will be manipulated by TensorFlow. This is the key of the k-means algorithm: we need a set of centroids to start the iterations.

The cost function for k-means is the distance between each point and its centroid, and the algorithm tries to minimize this cost function. As I wrote in my previous post, the distance between two GPS points can't be calculated with the Euclidean distance, and it is necessary to introduce a more precise method. One of these methods is the spherical law of cosines. For this example, I used an approximation of the spherical law of cosines (the block that computes dist). This approximation works very well for city-scale distances and is computationally cheaper than the full formula. To learn more about this approximation and its error, read this interesting post.
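In plain Python, the approximation used in the dist computation corresponds to something like this sketch (using the same 110.25 km-per-degree constant as the code above):

import numpy as np

def approx_distance_km(lat1, lon1, lat2, lon2):
    # Equirectangular approximation: scale the longitude difference
    # by the cosine of the latitude before applying Pythagoras.
    x = lat2 - lat1
    y = (lon2 - lon1) * np.cos(np.radians(lat1))
    return 110.25 * np.sqrt(x ** 2 + y ** 2)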

And finally, the centroids are updated (the tf.assign call).

The final block initializes all the variables, instantiates the evaluation graph, runs the algorithm, and visualizes the results:

[Plot of the points colored by their final cluster assignment]

Conclusion

This post is focused on an algorithm for clustering problems: k-means. This algorithm makes some assumptions about the data:

  • The variance of the distribution of each attribute is spherical.
  • All variables have the same variance.
  • The prior probability of each cluster is the same.

If one of those assumptions is violated, the algorithm fails.

A possible con of this algorithm is the necessity to define the number of clusters a priori. If you don't have any idea of what your clusters look like, you can choose another clustering algorithm, like DBSCAN or OPTICS (these algorithms work on a density model instead of a centroid model). Or you can introduce a postprocessing step in k-means that aggregates (or splits) two or more centroids and then relaunches the entire algorithm on the same training set with the new set of centroids.
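As a pointer, a density-based alternative takes only a couple of lines with scikit-learn (not among the packages installed above, and the eps and min_samples values here are assumptions you would have to tune for your data):

from sklearn.cluster import DBSCAN

# eps: neighborhood radius (here in degrees); min_samples: density threshold.
db = DBSCAN(eps=0.05, min_samples=5).fit(vector_values)
labels = db.labels_  # -1 marks noise; the number of clusters is not fixed a priori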

From the computational point of view, the k-means algorithm is linear in the number of data objects, while other clustering algorithms have quadratic complexity. This can be an important point to keep in mind.
