PROC FASTCLUS performs disjoint cluster analysis on the basis of distances computed from one or more quantitative variables.
The most commonly used cluster analysis procedure is PROC FASTCLUS, which performs k-means clustering. K-means clustering partitions n observations into k clusters so that each observation belongs to the cluster with the nearest mean.
K-means clustering is a form of unsupervised learning. Unsupervised learning is a class of machine learning methods that draw inferences from data sets consisting of input data without labeled responses.
The most common unsupervised learning method is cluster analysis, which is used in exploratory data analysis to find hidden patterns or groupings in data. Unsupervised learning uses no dependent variable.
Clustering means the grouping of a particular set of objects based on their characteristics, aggregating them according to their similarities.
PROC FASTCLUS is used in a variety of analytic, business intelligence, reporting, and data management situations.
PROC FASTCLUS <MAXCLUSTERS=n> <RADIUS=t> <options>;
   VAR variables;
   ID variables;
   FREQ variable;
   WEIGHT variable;
   BY variables;
The PROC FASTCLUS statement invokes the FASTCLUS procedure. At least one of MAXCLUSTERS= or RADIUS= must be specified.
OUT= names the output data set.
RADIUS=t sets the minimum distance criterion for selecting new seeds: an observation is considered as a new seed only if its distance to the previous seeds exceeds t. By default, t = 0.
MAXCLUSTERS=n specifies the maximum number of clusters permitted. By default, n = 100.
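The options above can be combined in one call. The following is a minimal sketch rather than a recommended setting: it assumes the SASHELP.IRIS sample data set that ships with SAS is available, and the output name work.iris_clus and the option values are chosen only for illustration.

```sas
/* Sketch: combine MAXCLUSTERS=, RADIUS= and OUT= in one call.        */
/* sashelp.iris is a SAS sample data set; work.iris_clus and the      */
/* option values below are illustrative assumptions.                  */
proc fastclus data=sashelp.iris
              maxclusters=3          /* allow at most 3 clusters       */
              radius=10              /* a new seed must lie more than  */
                                     /* 10 units from earlier seeds    */
              out=work.iris_clus;    /* output data set                */
   var PetalLength PetalWidth;       /* quantitative variables to use  */
run;
```

With RADIUS= left at its default of 0, any observation can qualify as a seed. A larger value forces the initial seeds to be further apart.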
Let's understand k-means clustering with the help of an example. We will perform k-means clustering on an insurance data set that contains 100 observations and 5 variables (Premium_Paid, Age, Days_to_Renew, Claims_made, Income).
The Income and Age variables are used to perform the clustering.
proc fastclus data=libref.cluster out=out maxclusters=3;
   var Income Age;
   title 'FASTCLUS ANALYSIS';
run;
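The OUT= data set holds the original variables plus two variables added by the procedure: CLUSTER (the assigned cluster number) and DISTANCE (the distance from the observation to its cluster seed). As a rough sketch, assuming the step above has run and created the data set named out, the clusters can be profiled like this:

```sas
/* Sketch: summarize each cluster in the OUT= data set created above. */
/* CLUSTER and DISTANCE are the default names PROC FASTCLUS assigns.  */
proc means data=out n mean min max maxdec=2;
   class cluster;                 /* one row of statistics per cluster */
   var Income Age distance;
run;
```

The N column shows how many of the 100 observations fall in each cluster, and the means of Income and Age describe a typical member of each cluster.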
When you run this code, the output is generated and shown on the screen. Each observation is assigned to the cluster with the nearest seed, and the cluster summary reports, for each cluster, the maximum distance from the seed to an observation.
The overall R-squared value for the model is 0.89444, which is greater than 0.70, so the model is a good fit. The maximum distance from the seed to an observation in the first cluster is 18750, and the last cluster has the largest such distance.
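Because Income and Age are measured on very different scales, a distance as large as 18750 is likely driven almost entirely by Income. One possible refinement, which is not part of the original example, is to standardize both variables before clustering. The sketch below assumes the same libref.cluster input. The data set names work.cluster_std and work.out_std are hypothetical.

```sas
/* Sketch: standardize Income and Age to mean 0 and standard          */
/* deviation 1 so that neither variable dominates the distances.      */
/* work.cluster_std and work.out_std are hypothetical names.          */
proc stdize data=libref.cluster out=work.cluster_std method=std;
   var Income Age;
run;

/* Rerun k-means on the standardized variables. */
proc fastclus data=work.cluster_std out=work.out_std maxclusters=3;
   var Income Age;
run;
```

The distances and R-squared from this run are on the standardized scale, so they are not directly comparable with the values reported above.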
Now, you know about k-means clustering with SAS.