K-Means Clustering With SAS

K-means clustering partitions observations into clusters in which each observation belongs to the cluster with the nearest mean.

PROC FASTCLUS performs disjoint cluster analysis on the basis of distances computed from one or more quantitative variables.

PROC FASTCLUS is the SAS procedure for k-means clustering and one of the most commonly used cluster analysis procedures. K-means clustering aims to partition n observations into k clusters so that each observation belongs to the cluster with the nearest mean.
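
The "nearest mean" criterion is usually formalized as minimizing the within-cluster sum of squared Euclidean distances. The block below is the standard textbook formulation, not something taken from the SAS output; the symbols x_i (an observation), C_j (the j-th cluster), and mu_j (its mean) are notation introduced here for illustration.

```latex
% k-means objective: choose clusters C_1, ..., C_k that minimize the
% total squared distance between each observation and its cluster mean.
\underset{C_1,\dots,C_k}{\arg\min} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```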

K-means clustering is an unsupervised learning method. Unsupervised learning is a type of machine learning used to draw inferences from datasets consisting of input data without labeled responses.

The most common unsupervised learning method is cluster analysis, which is used in exploratory data analysis to find hidden patterns or groupings in data. Unsupervised learning does not use a dependent variable.

Clustering means the grouping of a particular set of objects based on their characteristics, aggregating them according to their similarities.

PROC FASTCLUS is used in a variety of analytic, business intelligence, reporting, and data management situations. Its general syntax is:

PROC FASTCLUS <MAXCLUSTERS= n> <RADIUS= t> <options>; 
VAR variables; 
ID variables; 
FREQ variable; 
WEIGHT variable; 
BY variables;

The PROC FASTCLUS statement calls the FASTCLUS procedure.

  • OUT= names the output data set.

  • RADIUS= t sets the minimum distance criterion for selecting new seeds: an observation is considered as a new seed only if its minimum distance to the previous seeds exceeds t; by default, t = 0.

  • MAXCLUSTERS= n specifies the maximum number of clusters permitted; by default, n is 100.
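
As a minimal sketch of how these options combine, the step below clusters a small made-up data set. The data set names work.toy and work.toy_clus, the variables x and y, and the values MAXCLUSTERS=3 and RADIUS=2 are all assumptions chosen for illustration; they do not come from the insurance example.

```sas
/* Hypothetical toy data: six points forming three well-separated pairs */
data work.toy;
   input x y;
   datalines;
1 1
1.2 0.9
5 5
5.1 4.8
9 1
8.8 1.3
;
run;

/* Allow at most 3 clusters. With RADIUS=2, an observation is considered */
/* as a new seed only if it lies more than 2 units from every earlier seed. */
proc fastclus data=work.toy out=work.toy_clus maxclusters=3 radius=2;
   var x y;   /* quantitative variables used to compute distances */
run;
```

Because the pairs are far apart relative to the radius, one would expect each pair to end up in its own cluster, but it is worth checking the printed cluster summary rather than relying on that expectation.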

Let's understand k-means clustering with the help of an example. We will perform k-means clustering on insurance data that contains 100 observations and 5 variables (Premium_Paid, Age, Days_to_Renew, Claims_made, Income).

The Income and Age variables are used to perform k-means clustering.

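/* Cluster the insurance data on Income and Age into at most 3 clusters. */
/* libref is a placeholder for the library that holds the data set.      */
proc fastclus data=libref.cluster out=out maxclusters=3;
   var Income Age;
   title 'FASTCLUS ANALYSIS';
run;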
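
Besides the printed report, the OUT= data set can be inspected directly. In addition to the original variables, it contains CLUSTER (the assigned cluster number) and DISTANCE (the distance from the observation to its cluster seed). The following is one possible way to summarize it, assuming the step above created the data set out in the WORK library; the choice of PROC FREQ and PROC MEANS is an illustration, not part of the original analysis.

```sas
/* Count how many observations fall into each cluster */
proc freq data=out;
   tables cluster;
run;

/* Profile each cluster: mean, minimum, and maximum of the clustering */
/* variables and of the distance to the cluster seed                  */
proc means data=out mean min max;
   class cluster;
   var Income Age distance;
run;
```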

When you run this code, the output is displayed on the screen. Each observation is assigned to the cluster with the nearest seed, and the cluster summary reports, for each cluster, the maximum distance from the seed to an observation.

The R-squared value for the model is 0.89444 (> 0.70), so the model is a good fit. The maximum distance from the seed to an observation is 18750 for the first cluster, and the last cluster has the largest value.
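
A distance of this size suggests that Income, which is presumably measured on a much larger scale than Age, accounts for most of the Euclidean distance. Variables with large variances tend to have more influence on the clusters than variables with small variances. One possible follow-up, which is not part of the original analysis, is to standardize both variables before clustering. The sketch below assumes the same libref.cluster data set; the names work.cluster_std and work.out_std are hypothetical.

```sas
/* Rescale Income and Age to mean 0 and standard deviation 1.    */
/* In the output data set the two variables hold the standardized */
/* values in place of the original ones.                          */
proc standard data=libref.cluster out=work.cluster_std mean=0 std=1;
   var Income Age;
run;

/* Repeat the k-means step on the standardized variables */
proc fastclus data=work.cluster_std out=work.out_std maxclusters=3;
   var Income Age;
run;
```

The resulting clusters and R-squared value may differ from the unstandardized run, so the two solutions should be compared rather than assumed to agree.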

Now, you know about k-means clustering with SAS.
