How to Get a Frequency Table of a Categorical Variable as a Data Frame

Learn how to get a frequency table of a categorical variable as a Data Frame in R using table(), cut(), count(), and plyr — with clean, ready-to-use output.

Chaitanya Sagar

Oct. 30, 25 · Analysis


Categorical data is data that takes its values from a predefined set. Recording a person's age as “Child,” “Adult,” or “Senior” instead of as a number is one example. Before using categorical data, however, one should know the forms it can take.

First, categorical data may or may not have an order. Saying that the size of a box is small, medium, or large implies an order: small < medium < large. The same does not hold for, say, sports equipment, which can also be categorical data but is distinguished by names like dumbbell, grippers, or gloves; there is no inherent basis for ordering these items. Categories that can be ordered are known as “ordinal,” while those with no such ordering are “nominal.”
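In R, this distinction maps onto ordered and unordered factors. The sketch below uses made-up values (the vectors `box_size` and `equipment` are illustrative, not taken from any dataset) to show that comparisons are meaningful only in the ordinal case.

```r
# Ordinal: the levels have an explicit order, small < medium < large
box_size <- factor(c("small", "large", "medium", "small"),
                   levels = c("small", "medium", "large"),
                   ordered = TRUE)

# Nominal: the levels are just names, with no ordering
equipment <- factor(c("dumbbell", "gloves", "grippers", "gloves"))

is.ordered(box_size)       # TRUE
is.ordered(equipment)      # FALSE
box_size[1] < box_size[2]  # TRUE: "small" is below "large"
max(box_size)              # the largest size that was observed
```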

Analysts often convert numerical data to categorical data to simplify analysis. Besides using “Adult,” “Child,” or “Senior” instead of age as a number, there are also special cases, such as labelling equipment as “regular item” or “accessory.” In many problems, the output is also categorical: whether a customer will churn, whether a person will buy a product, or whether an item is profitable. All problems where the output is categorical are known as classification problems. R provides various ways to transform and handle categorical data.

A simple way to transform data into classes is to combine the split() and cut() functions in base R, or to use the cut2() function from the Hmisc package.
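As a small illustration of cut() on its own, the sketch below bins ages into the three labels mentioned earlier. The ages and the cut-offs at 18 and 65 are assumptions chosen only for this example.

```r
age <- c(4, 17, 18, 35, 64, 65, 80)  # hypothetical ages

# right = FALSE makes each interval closed on the left: [0,18), [18,65), [65,Inf)
age_group <- cut(age,
                 breaks = c(0, 18, 65, Inf),
                 labels = c("Child", "Adult", "Senior"),
                 right = FALSE)

data.frame(age, age_group)
```

With these cut-offs, 17 is labelled "Child" and 18 is labelled "Adult"; switching to the default `right = TRUE` would move each boundary value into the lower group.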

Let’s use the iris dataset to categorize data. This dataset ships with R, so it can be used directly; calling attach() is optional and only makes its columns accessible by name. The dataset consists of 150 observations across five features: sepal length, sepal width, petal length, petal width, and species.

    attach(iris)  # Optional: iris ships with R; this only exposes its columns by name

    x <- iris  # Store a copy of the dataset in x

    # Using the split function
    list1 <- split(x, cut(x$Sepal.Length, 3))  # A list of 3 data frames, split on Sepal.Length
    summary(list1)  # View the class ranges for list1
    #           Length Class      Mode
    # (4.3,5.5] 5      data.frame list
    # (5.5,6.7] 5      data.frame list
    # (6.7,7.9] 5      data.frame list

    # Using the Hmisc library
    library(Hmisc)
    list2 <- split(x, cut2(x$Sepal.Length, g = 3))  # A similar list, but with the left boundary included
    summary(list2)  # View the class ranges for list2
    #           Length Class      Mode
    # [4.3,5.5) 5      data.frame list
    # [5.5,6.4) 5      data.frame list
    # [6.4,7.9] 5      data.frame list

The first list, list1, divides the dataset into three groups based on sepal length, each covering an equal range. The second list, list2, also divides the dataset into three groups based on sepal length, but it tries to keep the number of values in each group equal. We can check this using the range function.

    # Range of Sepal.Length
    range(x$Sepal.Length)  # The output is 4.3 to 7.9
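One way to check both properties, sketched here in base R only, is to let sapply() report the size and the observed range of each group. The quantile-based split is a stand-in for cut2(), which needs the Hmisc package; its break points come from quantile() and are not guaranteed to match cut2's labels exactly.

```r
x <- iris

# Equal-width groups, as in list1
list1 <- split(x, cut(x$Sepal.Length, 3))
sapply(list1, nrow)                               # rows per group
sapply(list1, function(d) range(d$Sepal.Length))  # observed min and max per group

# Roughly equal-count groups, using tertile break points
breaks <- quantile(x$Sepal.Length, probs = c(0, 1/3, 2/3, 1))
list_q <- split(x, cut(x$Sepal.Length, breaks, include.lowest = TRUE))
sapply(list_q, nrow)
```

The equal-width split keeps the interval widths identical and lets the group sizes vary, while the quantile-based split does the opposite.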

We can see that list1 consists of three groups covering 4.3–5.5, 5.5–6.7, and 6.7–7.9, each spanning 1.2 units of the overall 4.3–7.9 range. The groups in list2 instead cover 4.3–5.5, 5.5–6.4, and 6.4–7.9. That is the one difference between the two outputs: list1 keeps the width of the three ranges equal, while list2 balances the number of values in each group. An alternative to the preceding code is to add the group range as another feature in the dataset.

    x$class <- cut(x$Sepal.Length, 3)  # Add the equal-width class label as a column instead of creating a list
    x$class2 <- cut2(x$Sepal.Length, g = 3)  # Add the equal-count class label; g must be named, because cut2's second positional argument is cuts

If the classes should be indexed as numbers 1, 2, 3… instead of their actual ranges, we can convert the factor to numeric. Indexes are also easier to work with than the range labels.

    x$group <- as.numeric(x$class)  # Store the index in a new column so the range labels in class stay available
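The conversion works because a factor stores each value as an integer position in its levels. A minimal sketch (the names `cls` and `idx` are illustrative) showing that the index can always be mapped back to the range label:

```r
cls <- cut(iris$Sepal.Length, 3)  # factor with three range labels
idx <- as.integer(cls)            # 1, 2, or 3: the position of each value's level

levels(cls)                       # the labels the indexes refer to
head(data.frame(label = cls, index = idx))

# Mapping the index back through levels() recovers the original labels
identical(levels(cls)[idx], as.character(cls))  # TRUE
```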

In our example, each row is now labelled 1, 2, or 3. Suppose we now want to find the number of values in each class. How many rows fall into class 1? Or class 2? We can use the table() function in R to get that count.

    class_length <- table(x$group)
    class_length  # The sizes are 59, 71, and 20, as shown in the output below
    #  1  2  3
    # 59 71 20

This is a good way to get a quick summary of the classes and their sizes, but that is where it ends: a table object is awkward to use in further computations or to merge back into our dataset. class_length needs to be converted to a Data Frame before it is useful. The catch is that converting a table to a Data Frame names the columns Var1 and Freq, because the table does not retain the original feature name.

    # Transforming the table to a Data Frame
    class_length_df <- as.data.frame(class_length)
    class_length_df  # The output is:
    #   Var1 Freq
    # 1    1   59
    # 2    2   71
    # 3    3   20

    # The variable is named Var1, so we rename it using the names() function
    names(class_length_df)[1] <- "group"  # Change the first variable, Var1, to group
    class_length_df
    #   group Freq
    # 1     1   59
    # 2     2   71
    # 3     3   20

In this case, with only a few variables, renaming is easy, but in a large dataset it is risky: renaming by position can accidentally rename another important feature.
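One way to sidestep the rename, sketched under the assumption that the counts come from table(): if the argument to table() is named, as.data.frame() uses that name for the column instead of Var1. The `responseName` argument of as.data.frame() for tables similarly controls the name of the count column.

```r
x <- iris
x$group <- as.numeric(cut(x$Sepal.Length, 3))

# Naming the argument carries the column name through to the data frame
class_length_df <- as.data.frame(table(group = x$group), responseName = "freq")
names(class_length_df)  # "group" "freq"
```

Because no column is renamed by position, nothing else in the Data Frame can be renamed by mistake.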

As is often the case in R, there is more than one way to do the same thing. All this hassle could have been avoided if there had been a function that generated our class sizes as a Data Frame to start with. The “plyr” package has the count() function, which accomplishes this task. Using it is as simple as passing the original Data Frame and the name of the variable we want to count.

    # Using the plyr library
    library(plyr)
    class_length2 <- count(x, "group")  # Using the count function
    class_length2  # The output is:
    #   group freq
    # 1     1   59
    # 2     2   71
    # 3     3   20

The same output, in fewer steps. Let’s verify our output.

    # Checking the data type of class_length2
    class(class_length2)  # Output is data.frame

The plyr package is very useful when it comes to categorical data. As we see, the count() function is really flexible and can generate the Data Frame we want. It is now easy to add the frequency of the categorical data to the original Data Frame x.
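A minimal sketch of that last step, using merge() from base R. To keep the example free of package dependencies, the counts are built with table() rather than plyr's count(); the column name `freq` is chosen to match count()'s output.

```r
x <- iris
x$class <- cut(x$Sepal.Length, 3)

# One row per class with its frequency
class_counts <- as.data.frame(table(class = x$class), responseName = "freq")

# Attach the frequency of each row's class to the original data
# (merge() sorts the result by the key column, so the row order changes)
x_with_freq <- merge(x, class_counts, by = "class")
head(x_with_freq)
```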

Comparison

The table() function is really useful as a quick summary and, with a little work, can produce an output similar to that given by the count() function. When we go a little further, towards N-way tables, the output of table() converted to a Data Frame looks much like that of count().

    # Using table() for a two-way frequency table
    two_way <- as.data.frame(table(subset(x, select = c("class", "class2"))))
    two_way
    #       class    class2 Freq
    # 1 (4.3,5.5] [4.3,5.5)   52
    # 2 (5.5,6.7] [4.3,5.5)    0
    # 3 (6.7,7.9] [4.3,5.5)    0
    # 4 (4.3,5.5] [5.5,6.4)    7
    # 5 (5.5,6.7] [5.5,6.4)   49
    # 6 (6.7,7.9] [5.5,6.4)    0
    # 7 (4.3,5.5] [6.4,7.9]    0
    # 8 (5.5,6.7] [6.4,7.9]   22
    # 9 (6.7,7.9] [6.4,7.9]   20

    # Using count() for the same two variables
    two_way_count <- count(x, c("class", "class2"))
    two_way_count
    #       class    class2 freq
    # 1 (4.3,5.5] [4.3,5.5)   52
    # 2 (4.3,5.5] [5.5,6.4)    7
    # 3 (5.5,6.7] [5.5,6.4)   49
    # 4 (5.5,6.7] [6.4,7.9]   22
    # 5 (6.7,7.9] [6.4,7.9]   20

The difference is still noticeable. While both outputs are similar, count() omits combinations that do not occur in the data, that is, those with a frequency of 0. Hence, count() produces cleaner output than table(), which lists every possible combination of the variables' levels. What if we want the N-way frequency table of the entire Data Frame? In this case, we can simply pass the entire Data Frame into the table() or count() function.

However, the table() function will be very slow in this case, as it will take time to calculate the frequencies of all possible combinations of features, whereas the count() function will only calculate and display the combinations where the frequency is non-zero.

    # For the entire dataset
    full1 <- count(x)  # Much faster
    full2 <- as.data.frame(table(x))
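The cost difference can be estimated before running either call. The sketch below, in base R, compares the number of cells table() would have to fill (the product of the number of distinct values in each column) with the number of distinct rows that actually occur. It also shows that dropping zero-frequency rows from a table()-derived Data Frame leaves only the observed combinations, which is what count() reports. The pairing of `class` and `Species` is chosen only for illustration.

```r
x <- iris
x$class <- cut(x$Sepal.Length, 3)

# Cells in the full N-way table versus combinations actually present
n_cells <- prod(sapply(x, function(col) length(unique(col))))
n_observed <- nrow(unique(x))
c(cells = n_cells, observed = n_observed)

# table() lists every class/Species combination, including empty ones
two_way <- as.data.frame(table(class = x$class, Species = x$Species))
nrow(two_way)  # 9 combinations in total (3 classes x 3 species)

# Keeping only the non-zero rows mimics the shape of count()'s output
two_way[two_way$Freq > 0, ]
```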

What if we want to display our data in a cross-tabulated format instead of displaying it as a list? We have the xtabs function for this purpose.

    cross_tab <- xtabs(~ class + class2, x)
    cross_tab
    #            class2
    # class       [4.3,5.5) [5.5,6.4) [6.4,7.9]
    #   (4.3,5.5]        52         7         0
    #   (5.5,6.7]         0        49        22
    #   (6.7,7.9]         0         0        20

However, the class of this object is an xtabs table.

    class(cross_tab)
    # [1] "xtabs" "table"

Converting the same to a Data Frame regenerates the same output as the table() function does.

    y <- as.data.frame(cross_tab)
    y
    #       class    class2 Freq
    # 1 (4.3,5.5] [4.3,5.5)   52
    # 2 (5.5,6.7] [4.3,5.5)    0
    # 3 (6.7,7.9] [4.3,5.5)    0
    # 4 (4.3,5.5] [5.5,6.4)    7
    # 5 (5.5,6.7] [5.5,6.4)   49
    # 6 (6.7,7.9] [5.5,6.4)    0
    # 7 (4.3,5.5] [6.4,7.9]    0
    # 8 (5.5,6.7] [6.4,7.9]   22
    # 9 (6.7,7.9] [6.4,7.9]   20

There is another difference when we use cross-tabulated output for N-way classification with N ≥ 3. Because a cross-tabulation can show only two features at a time, xtabs splits the data by the third variable and displays a separate cross-tabulated output for each of its values. The following illustrates this for class, class2, and Species.

    threeway_cross_tab <- xtabs(~ class + class2 + Species, x)
    threeway_cross_tab
    # , , Species = setosa
    #
    #            class2
    # class       [4.3,5.5) [5.5,6.4) [6.4,7.9]
    #   (4.3,5.5]        45         2         0
    #   (5.5,6.7]         0         3         0
    #   (6.7,7.9]         0         0         0
    #
    # , , Species = versicolor
    #
    #            class2
    # class       [4.3,5.5) [5.5,6.4) [6.4,7.9]
    #   (4.3,5.5]         6         5         0
    #   (5.5,6.7]         0        28         8
    #   (6.7,7.9]         0         0         3
    #
    # , , Species = virginica
    #
    #            class2
    # class       [4.3,5.5) [5.5,6.4) [6.4,7.9]
    #   (4.3,5.5]         1         0         0
    #   (5.5,6.7]         0        18        14
    #   (6.7,7.9]         0         0        17

The output becomes larger and harder to read as N increases in an N-way cross-tabulation. In this situation again, the count() function seamlessly produces a clean, easily visualizable output.

    threeway_cross_tab_df <- count(x, c("class", "class2", "Species"))
    threeway_cross_tab_df
    #        class    class2    Species freq
    # 1  (4.3,5.5] [4.3,5.5)     setosa   45
    # 2  (4.3,5.5] [4.3,5.5) versicolor    6
    # 3  (4.3,5.5] [4.3,5.5)  virginica    1
    # 4  (4.3,5.5] [5.5,6.4)     setosa    2
    # 5  (4.3,5.5] [5.5,6.4) versicolor    5
    # 6  (5.5,6.7] [5.5,6.4)     setosa    3
    # 7  (5.5,6.7] [5.5,6.4) versicolor   28
    # 8  (5.5,6.7] [5.5,6.4)  virginica   18
    # 9  (5.5,6.7] [6.4,7.9] versicolor    8
    # 10 (5.5,6.7] [6.4,7.9]  virginica   14
    # 11 (6.7,7.9] [6.4,7.9] versicolor    3
    # 12 (6.7,7.9] [6.4,7.9]  virginica   17

The same output is presented in a concise way by count(). The count() function in the plyr package is thus very useful when it comes to counting frequencies of categorical variables.

Authored by Chaitanya Sagar, Founder and CEO, Perceptive Analytics. A recognized thought leader in analytics and data science, he frequently writes on performance dashboards, AI integration, and decision intelligence.


Published at DZone with permission of Chaitanya Sagar. See the original article here.
