
Optimizing K-Means Clustering for Time Series Data

Given the amount of data we collect, faster clustering times are crucial. Here's how we sped up our k-means clustering process!

By Jason Snell · Oct. 30, 2017 · Tutorial

Here at New Relic, we collect 1.37 billion data points per minute. A vast amount of the data we collect, analyze, and display for our customers is stored as time series. In an effort to build relationships between applications and other entities, such as servers and containers, for new, intelligent products like New Relic Radar, we're constantly exploring faster and more efficient methods of grouping time series data. Given the amount of data we collect, faster clustering times are crucial.

Speeding Up K-Means Clustering

A popular method of grouping data is k-means clustering. The basic principle of k-means involves computing the distance between each data point and a set of cluster centroids, assigning each point to its nearest centroid, and thereby grouping the points into meaningful clusters. This process is usually demonstrated using two-dimensional data on a plane. It's certainly possible to cluster in more than two dimensions — but our ability to visualize the data quickly becomes more complicated. For example, the following graphic shows the convergence of k-means clustering in two arbitrary dimensions over several iterations:

[Animated GIF: convergence of k-means clustering in two dimensions. By Chire, own work, GFDL]
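
For intuition, a single iteration on two-dimensional points might look like the following minimal sketch (illustrative only, and it assumes every centroid keeps at least one point; the time-series implementations we actually tested follow below):

import numpy as np

points = np.random.rand(200, 2)                              # 200 points on the plane
centroids = points[np.random.choice(200, 3, replace=False)]  # k = 3 initial centroids

# One k-means iteration: assign each point to its nearest centroid,
# then move each centroid to the mean of its assigned points
dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
closest = dist.argmin(axis=1)
centroids = np.array([points[closest == c].mean(axis=0) for c in range(3)])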

Unfortunately, this method does not map well to time series since they are generally one-dimensional over time. However, we can still compute a distance factor between two time series using a number of different functions. In these examples, we use the mean squared error (MSE) to explore different k-means implementations. In testing these implementations, we noticed many suffered from serious performance issues, but we were still able to demonstrate how it's possible to speed up k-means clustering, in some cases, by an order of magnitude.
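
For concreteness, here is a minimal sketch of the MSE distance mentioned above. (Note that the euclid_dist functions in the implementations below compute the closely related Euclidean distance; for series of a fixed length, both are monotone in the sum of squared errors, so either yields the same nearest-centroid assignments.)

import numpy as np

def mse(t1, t2):
    # Mean squared error between two equal-length time series
    return ((t1 - t2) ** 2).mean()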

For this exercise, we'll use Python's NumPy package. If you're following along on your own, you can copy and paste the code right into a Jupyter Notebook. Let's start by importing the packages we'll use along the way:

import time
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In the following tests, we start by generating 10,000 random time series, each 500 samples long, by modulating sine waves of random phase and frequency with Gaussian noise. While this kind of data isn't really ideal for k-means clustering, it should be enough to strain unoptimized implementations.

n = 10000     # number of time series
ts_len = 500  # samples per series

phases = np.array(np.random.randint(0, 50, [n, 2]))
pure = np.sin([np.linspace(-np.pi * x[0], -np.pi * x[1], ts_len) for x in phases])
noise = np.array([np.random.normal(0, 1, ts_len) for x in range(n)])

# Modulate the sine waves with the Gaussian noise
signals = pure * noise

# Normalize everything between 0 and 1
signals += np.abs(np.min(signals))
signals /= np.max(signals)

plt.plot(signals[0])

The First Implementation

Let's start with the most basic and straightforward implementation we encountered. euclid_dist implements a simple Euclidean distance function (which, for our equal-length series, ranks pairs exactly as MSE does), and k_means implements the basic k-means algorithm. We chose num_clust random time series from our initial dataset as centroids (which represent the middle of each cluster). For num_iter iterations, we continually moved the centroids around while minimizing the distance between them and the other time series.

def euclid_dist(t1, t2):
    return np.sqrt(((t1 - t2) ** 2).sum())

def k_means(data, num_clust, num_iter):
    # Pick num_clust random time series from the dataset as initial centroids
    centroids = data[np.random.randint(0, data.shape[0], num_clust)]

    for n in range(num_iter):
        assignments = {}
        # Assign each time series to its closest centroid
        for ind, i in enumerate(data):
            min_dist = float('inf')
            closest_clust = None
            for c_ind, j in enumerate(centroids):
                dist = euclid_dist(i, j)
                if dist < min_dist:
                    min_dist = dist
                    closest_clust = c_ind

            if closest_clust in assignments:
                assignments[closest_clust].append(ind)
            else:
                assignments[closest_clust] = [ind]

        # Move each centroid to the mean of its assigned time series
        for key in assignments:
            clust_sum = 0
            for k in assignments[key]:
                clust_sum = clust_sum + data[k]
            centroids[key] = [m / len(assignments[key]) for m in clust_sum]

    return centroids
t1 = time.time()
centroids = k_means(signals, 100, 100)
t2 = time.time()
print("Took {} seconds".format(t2 - t1))

Took 1138.8745470046997 seconds

It took almost 20 minutes to cluster the data. That isn't terrible, but it's certainly not great. To speed this up for the next implementation, we decided to eliminate as many for loops as possible.

The Vectorized Implementation

One of the awesome benefits of using NumPy is vectorized operations. (If you're unfamiliar with vector operations, check out this excellent review.)

The k-means algorithm calls for pairwise comparisons between each centroid and data point. This means, in our previous iteration, we compared each of our 100 centroids to 10,000 time series for a total of one million comparisons per iteration. Keep in mind that each comparison involved two sets of 500 samples. Since we iterated 100 times, that gave us a total of 100 million comparisons, quite a bit of work for a single CPU. While Python is a reasonably efficient language, it's hard to beat operations written in C. For this reason, most of NumPy's core operations are written in C and vectorized to minimize overhead incurred by loops.
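
To see why this matters, compare a pure-Python distance loop against the equivalent vectorized expression (a toy benchmark; exact timings will vary by machine):

import time
import numpy as np

a = np.random.rand(10000, 500)   # 10,000 series, 500 samples each
b = np.random.rand(500)          # one centroid

# Pure-Python loop over every sample of every series
t0 = time.time()
slow = [sum((x - y) ** 2 for x, y in zip(row, b)) ** 0.5 for row in a]
t1 = time.time()

# One vectorized NumPy expression computing all 10,000 distances
fast = np.sqrt(((a - b) ** 2).sum(axis=1))
t2 = time.time()

print("loop: {:.2f}s, vectorized: {:.4f}s".format(t1 - t0, t2 - t1))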

Let's explore how we vectorized our code in order to eliminate as many for loops as possible.

First, we divided the code into functional blocks. This gave us a better idea of what each section was responsible for. Next, we altered the calc_centroids step to iterate only over the centroids (instead of over each time series). This way, we passed all the time series and one centroid at a time to euclid_dist. We also allocated the dist matrix up front instead of treating it like a dictionary and expanding it over time. NumPy's argmin then found each time series' closest centroid in one shot, with no explicit comparison loop.

In move_centroids, we collapsed another for loop using vector operations, and we iterated only over the unique set of centroids. If we lost a centroid, we added the appropriate number by randomly choosing from our time series dataset (this rarely happens in the wild).

Lastly, we added an early stopping check to k_means to halt our iterations if there were no updates to the centroids.

Here's a look at the code:

def euclid_dist(t1, t2):
    # Distances between one centroid and every time series at once
    return np.sqrt(((t1 - t2) ** 2).sum(axis=1))

def calc_centroids(data, centroids):
    # Preallocate the distance matrix instead of growing a dictionary
    dist = np.zeros([data.shape[0], centroids.shape[0]])

    for idx, centroid in enumerate(centroids):
        dist[:, idx] = euclid_dist(centroid, data)

    return dist

def closest_centroids(data, centroids):
    dist = calc_centroids(data, centroids)
    return np.argmin(dist, axis=1)

def move_centroids(data, closest, centroids):
    k = centroids.shape[0]
    new_centroids = np.array([data[closest == c].mean(axis=0) for c in np.unique(closest)])

    # If any centroid lost all of its members, reseed it from the dataset
    if k - new_centroids.shape[0] > 0:
        print("adding {} centroid(s)".format(k - new_centroids.shape[0]))
        additional_centroids = data[np.random.randint(0, data.shape[0], k - new_centroids.shape[0])]
        new_centroids = np.append(new_centroids, additional_centroids, axis=0)

    return new_centroids

def k_means(data, num_clust, num_iter):
    centroids = data[np.random.randint(0, data.shape[0], num_clust)]
    last_centroids = centroids

    for n in range(num_iter):
        closest = closest_centroids(data, centroids)
        centroids = move_centroids(data, closest, centroids)
        # Stop early if the centroids did not move this iteration
        if not np.any(last_centroids != centroids):
            print("early finish!")
            break
        last_centroids = centroids

    return centroids
t1 = time.time()
centroids = k_means(signals, 100, 100)
t2 = time.time()
print("Took {} seconds".format(t2 - t1))

adding 1 centroid(s)
early finish!
Took 206.72993397712708 seconds

Just under 3.5 minutes. Not bad! But we wanted to finish even faster.

The K-Means++ Implementation

For the next iteration, we applied the k-means++ algorithm, which seeks to choose better initial centroids: each new centroid is sampled from the data with probability proportional to its squared distance (D²) from the nearest centroid chosen so far. Let's see if this optimization helped...

def init_centroids(data, num_clust):
    centroids = np.zeros([num_clust, data.shape[1]])
    # Choose the first centroid uniformly at random
    centroids[0, :] = data[np.random.randint(0, data.shape[0], 1)]

    for i in range(1, num_clust):
        # Squared distance from each series to its nearest existing centroid
        D2 = np.min([np.linalg.norm(data - c, axis=1) ** 2 for c in centroids[0:i, :]], axis=0)
        # Sample the next centroid with probability proportional to D2
        probs = D2 / D2.sum()
        cumprobs = probs.cumsum()
        ind = np.where(cumprobs >= np.random.random())[0][0]
        centroids[i, :] = np.expand_dims(data[ind], axis=0)

    return centroids

def k_means(data, num_clust, num_iter):
    centroids = init_centroids(data, num_clust)
    last_centroids = centroids

    for n in range(num_iter):
        closest = closest_centroids(data, centroids)
        centroids = move_centroids(data, closest, centroids)
        if not np.any(last_centroids != centroids):
            print("early finish!")
            break
        last_centroids = centroids

    return centroids
t1 = time.time()
centroids = k_means(signals, 100, 100)
t2 = time.time()
print("Took {} seconds".format(t2 - t1))

early finish!
Took 180.91435194015503 seconds

Adding the k-means++ algorithm yielded slightly better performance compared to our previous iteration. However, this optimization really started to pay off once we parallelized it.

The Parallel Implementation

Until this point, all of our implementations were single-threaded, so we decided to explore parallelizing parts of the k-means++ algorithm. Since we were using a Jupyter Notebook, we opted to manage parallelism with ipyparallel. With ipyparallel, we didn't have to worry about forking the entire server, but there were a few idiosyncrasies we needed to address. For instance, we had to instruct our worker nodes to load NumPy.

import ipyparallel as ipp
c = ipp.Client()
v = c[:]
v.use_cloudpickle()

with v.sync_imports():
    import numpy as np

(For details on launching workers, see the ipyparallel getting started guide.)
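
If you're following along, a local worker pool can typically be started from a shell before opening the notebook (four engines here, but the count is up to you):

ipcluster start -n 4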

In this iteration, we focused on two areas for parallelization. First, calc_centroids had a loop that iterated over each centroid and compared it with our time series data. We used map_sync to send each of these iterations to our workers.

Next, we parallelized a similar loop in the k-means++ centroid search. Note the call to v.push: since our lambda referenced data, we needed to make sure it was available on the worker nodes. We accomplished this by calling ipyparallel's push method to copy the variable into the workers' global scope.

Take a look:

def calc_centroids(data, centroids):
    # Each worker computes the distances from one centroid to every series;
    # the result is shaped [num_centroids, num_series]
    return np.array(v.map_sync(lambda x: np.sqrt(((x - data) ** 2).sum(axis=1)), centroids))

def closest_centroids(points, centroids):
    dist = calc_centroids(points, centroids)
    # The distance matrix is transposed relative to the serial version,
    # so we take the argmin along axis 0
    return np.argmin(dist, axis=0)

def init_centroids(data, num_clust):
    # Copy the dataset into each worker's global scope for the lambdas below
    v.push(dict(data=data))
    centroids = np.zeros([num_clust, data.shape[1]])
    centroids[0, :] = data[np.random.randint(0, data.shape[0], 1)]

    for i in range(1, num_clust):
        D2 = np.min(v.map_sync(lambda c: np.linalg.norm(data - c, axis=1) ** 2, centroids[0:i, :]), axis=0)
        probs = D2 / D2.sum()
        cumprobs = probs.cumsum()
        ind = np.where(cumprobs >= np.random.random())[0][0]
        centroids[i, :] = np.expand_dims(data[ind], axis=0)

    return centroids
t1 = time.time()
centroids = k_means(signals, 100, 100)
t2 = time.time()
print("Took {} seconds".format(t2 - t1))

adding 2 centroid(s)
early finish!
Took 143.49819207191467 seconds

At just a bit more than two minutes, this iteration achieved our fastest time yet!

Next Steps: Go Faster!

In these tests, we relied on central processing unit (CPU) cores. CPUs offer convenient parallelization, but we suspect that with a little more effort, we could implement clustering orders of magnitude faster using graphics processing units (GPUs). We might be able to do this using TensorFlow, open-source software for numerical computation and machine learning. In fact, TensorFlow already includes a k-means implementation, but we'll almost certainly have to tweak it to support time series clustering. At any rate, we'll never stop looking for faster and more efficient clustering algorithms to help manage our users' data.
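
As a rough sketch of that direction (our assumption of what a GPU port could look like, not TensorFlow's built-in k-means), here is a single k-means assignment-and-update step expressed in TensorFlow ops, which can execute on a GPU:

import tensorflow as tf  # assumes TensorFlow 2.x

def kmeans_step(data, centroids):
    # Pairwise squared distances via broadcasting: [n, 1, d] vs. [1, k, d].
    # For 10,000 x 100 x 500 this is a large intermediate tensor; real code
    # would likely process the data in batches.
    d2 = tf.reduce_sum((data[:, None, :] - centroids[None, :, :]) ** 2, axis=-1)
    closest = tf.argmin(d2, axis=1)
    # Move each centroid to the mean of its assigned series. Empty clusters
    # come back as zeros and would need reseeding, as move_centroids does above.
    return tf.math.unsorted_segment_mean(data, closest, tf.shape(centroids)[0])

On a GPU, both the broadcast subtraction and the segment mean run as single kernels, which is where we'd expect the speedup to come from.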

This article was originally posted here.


Published at DZone with permission of Jason Snell, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
