K-Means Clustering With SAS

K-means clustering partitions observations into clusters in which each observation belongs to the cluster with the nearest mean.

PROC FASTCLUS performs disjoint cluster analysis on the basis of distances computed from one or more quantitative variables.

The most commonly used cluster analysis procedure in SAS is PROC FASTCLUS, which performs k-means clustering. K-means clustering aims to partition n observations into k clusters so that each observation belongs to the cluster with the nearest mean.
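Stated as a formula, this is the standard k-means objective. The notation is introduced here for illustration and does not come from the original text: x_1, ..., x_n are the observations, C_1, ..., C_k are the clusters, and \mu_j is the mean of cluster C_j. The partition is chosen to make the within-cluster sum of squared distances as small as possible.

```latex
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```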

K-means clustering is an unsupervised learning method. Unsupervised learning is a type of machine learning used to draw inferences from datasets consisting of input data without labeled responses.

The most common unsupervised learning method is cluster analysis, which is used in exploratory data analysis to find hidden patterns or groupings in data. Unsupervised learning uses no dependent variable.

Clustering means the grouping of a particular set of objects based on their characteristics, aggregating them according to their similarities.

PROC FASTCLUS is used in a variety of analytic, business intelligence, reporting, and data management situations.

PROC FASTCLUS MAXCLUSTERS=n | RADIUS=t <options>;
VAR variables;
ID variable;
FREQ variable;
WEIGHT variable;
BY variables;

The PROC FASTCLUS statement invokes the FASTCLUS procedure. Commonly used options include:

  • out = dataset specifies the output data set.

  • radius = t specifies the minimum distance criterion for selecting new seeds: an observation is considered as a new seed only if its distance to the previous seeds exceeds t; by default, t = 0.

  • maxclusters = n (abbreviated maxc = n) specifies the maximum number of clusters permitted; by default, n is 100.
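As a minimal sketch of how these options fit together, the call below sets all three. The data set names work.customers and work.clustered, the variables x1 and x2, and the option values are hypothetical placeholders, not taken from the example in this article.

```sas
/* Hypothetical data set and variable names, for illustration only */
proc fastclus data = work.customers  /* input data set (assumed to exist) */
              out = work.clustered   /* output data set with cluster assignments */
              maxclusters = 4        /* allow at most 4 clusters */
              radius = 10;           /* a new seed must be more than 10 units from earlier seeds */
   var x1 x2;                        /* quantitative variables used to compute distances */
run;
```

Because distances are computed in the units of the variables, a sensible radius value depends on the scale of the data.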

Let's understand k-means clustering with the help of an example. We will perform k-means clustering on insurance data that contains 100 observations and 5 variables (Premium_Paid, Age, Days_to_Renew, Claims_made, Income).

Income and Age variables are used to perform k-means clustering.

/* Cluster on Income and Age into at most 3 clusters; results go to the data set out */
proc fastclus data = libref.cluster out = out maxc = 3;
var Income Age;
run;

When you run this code, the output is generated and shown on the screen. Each observation is assigned to the cluster with the nearest seed, and the cluster summary reports the maximum distance from seed to observation for each cluster.
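One way to inspect the result is sketched here, assuming the step above ran and wrote the data set out. PROC FASTCLUS adds the variables CLUSTER and DISTANCE to the OUT= data set, so the clusters can be summarized with standard procedures. The choice of PROC FREQ and PROC MEANS is an illustration, not part of the original example.

```sas
/* Count the observations assigned to each cluster */
proc freq data = out;
   tables cluster;
run;

/* Compare the clustering variables, and the distance to the seed, across clusters */
proc means data = out n mean min max;
   class cluster;
   var Income Age distance;
run;
```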

The overall R-squared value is 0.89444 (> 0.70), so the clusters account for most of the variation in Income and Age, and the model is a good fit. In the first cluster, the maximum distance from the seed to an observation is 18750; the last cluster has the largest maximum distance.

Now, you know about k-means clustering with SAS.
