Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Java Data Analysis: Using Cross Tabs

DZone's Guide to

Java Data Analysis: Using Cross Tabs

Want to learn more about using data analysis in your code? Check out this tutorial on using cross tabs for data analysis in Java!

· Java Zone ·
Free Resource

Download Microservices for Java Developers: A hands-on introduction to frameworks and containers. Brought to you in partnership with Red Hat.

One of the most common data analysis operations is looking at how frequently certain kinds of observations appear in a data set.  You can use cross-tabulations, also known as contingency tables, to perform this task.

Here, we show you how to compute cross-tabs easily in Java using the Tablesaw open source (Apache 2) library. Tablesaw is a data-frame and visualization library that supports one and two dimensional cross-tabs. For this, the Table class contains the methods you need.

Example

In the example below, we show the observation counts for each combination. The dataset is using polling results from President George W. Bush, before and after 9/11/2001.  

// Preparation: load the data
Table table = Table.read().csv("../data/bush.csv");
// Add a column for the months from the date col so we can count by month
StringColumn month = table.dateColumn("date").month();
month.setName("month");
table.addColumns(month);

// perform the crossTab operation
Table counts = table.xTabCounts("month", "who");

// formatting 
// make table print as integers with no decimals instead of the raw doubles it holds
counts.columnsOfType(ColumnType.NUMBER)
                .forEach(x ->
                       ((NumberColumn)x).setPrintFormatter(NumberColumnFormatter.ints()));
// print
System.out.println(counts);


The formatted output is shown below.

                              Crosstab Counts: month x who                              
 [labels]   |  fox  |  gallup  |  newsweek  |  time.cnn  |  upenn  |  zogby  |  total  |
----------------------------------------------------------------------------------------
     APRIL  |    6  |      10  |         3  |         1  |      0  |      3  |     23  |
    AUGUST  |    3  |       8  |         2  |         1  |      0  |      2  |     16  |
  DECEMBER  |    4  |       9  |         4  |         3  |      2  |      5  |     27  |
  FEBRUARY  |    7  |       9  |         4  |         4  |      1  |      4  |     29  |
   JANUARY  |    7  |      13  |         6  |         3  |      5  |      8  |     42  |
      JULY  |    6  |       9  |         4  |         3  |      0  |      4  |     26  |
      JUNE  |    6  |      11  |         1  |         1  |      0  |      4  |     23  |
     MARCH  |    5  |      12  |         4  |         3  |      0  |      6  |     30  |
       MAY  |    4  |       9  |         5  |         3  |      0  |      1  |     22  |
  NOVEMBER  |    4  |       9  |         6  |         3  |      1  |      1  |     24  |
   OCTOBER  |    7  |      10  |         8  |         2  |      1  |      3  |     31  |
 SEPTEMBER  |    5  |      10  |         8  |         3  |      0  |      4  |     30  |
     Total  |   64  |     119  |        55  |        30  |     10  |     45  |    323  |


Of course, you can dispense with the formatting if you like. It is important to note the total column on the right, which shows that 23 polls were conducted in April, 16 polls in August, etc, across all pollsters. Similarly, the column totals at the bottom shows that Fox conducted 64 polls, Gallup 119, etc.

Single Variable Totals

You can get single variable counts using the  xTabCounts() method that takes only one column name argument .

Table whoCounts = table.xTabCounts("who");
// formatting and printing as above


This produces:

     Column: who      
 Category  |  Count  |
----------------------
   gallup  |    119  |
    zogby  |     45  |
 time.cnn  |     30  |
      fox  |     64  |
 newsweek  |     55  |
    upenn  |     10  |


Calculating Percents

You may want to see the percent of polls conducted by each pollster, rather than raw counts. The xTabPercents() method can be used for that.

Table whoPercents = table.xTabPercents("who");


Actually, percents are a misnomer. The results produced are the proportions in decimal format. To get percent-formatted output, we use a different  NumericColumnFormatter.

When you have two variables, you can display the percent that falls into each combination as shown below.

Table tablePercents = table.xTabTablePercents("month", "who");
tablePercents.columnsOfType(ColumnType.NUMBER)
                .forEach(x ->
                ((NumberColumn)x).setPrintFormatter(NumberColumnFormatter.percent(1)));


Because the percents are small, we updated the formatter to show a single fractional digit after the decimal point. The output can best be understood by looking at an example. Of all the polls in the dataset, 1.9 percent were conducted by Fox in April, 3.1 percent by Gallup in April, and 0.9 percent by Fox in August, as well as the others shown in the table below.

                               Crosstab Table Proportions:                                 
 [labels]   |   fox   |  gallup  | newsweek  |  time.cnn  |  upenn  |  zogby  |  total   |
------------------------------------------------------------------------------------------
     APRIL  |   1.9%  |    3.1%  |     0.9%  |      0.3%  |   0.0%  |   0.9%  |    7.1%  |
    AUGUST  |   0.9%  |    2.5%  |     0.6%  |      0.3%  |   0.0%  |   0.6%  |    5.0%  |
  DECEMBER  |   1.2%  |    2.8%  |     1.2%  |      0.9%  |   0.6%  |   1.5%  |    8.4%  |
  FEBRUARY  |   2.2%  |    2.8%  |     1.2%  |      1.2%  |   0.3%  |   1.2%  |    9.0%  |
   JANUARY  |   2.2%  |    4.0%  |     1.9%  |      0.9%  |   1.5%  |   2.5%  |   13.0%  |
      JULY  |   1.9%  |    2.8%  |     1.2%  |      0.9%  |   0.0%  |   1.2%  |    8.0%  |
      JUNE  |   1.9%  |    3.4%  |     0.3%  |      0.3%  |   0.0%  |   1.2%  |    7.1%  |
     MARCH  |   1.5%  |    3.7%  |     1.2%  |      0.9%  |   0.0%  |   1.9%  |    9.3%  |
       MAY  |   1.2%  |    2.8%  |     1.5%  |      0.9%  |   0.0%  |   0.3%  |    6.8%  |
  NOVEMBER  |   1.2%  |    2.8%  |     1.9%  |      0.9%  |   0.3%  |   0.3%  |    7.4%  |
   OCTOBER  |   2.2%  |    3.1%  |     2.5%  |      0.6%  |   0.3%  |   0.9%  |    9.6%  |
 SEPTEMBER  |   1.5%  |    3.1%  |     2.5%  |      0.9%  |   0.0%  |   1.2%  |    9.3%  |
     Total  |  19.8%  |   36.8%  |    17.0%  |      9.3%  |   3.1%  |  13.9%  |  100.0%  |


As you can see, this also gives you the total percents by month and pollster.

Column Percents and Row Percents

The final option is to show column percents or row percents. We'll start with column percents. You calculate them as shown below.

Table columnPercents = table.xTabColumnPercents("month", "who");


This produces the following table. If you look at the column for "fox," the values you see are the percentages for Fox alone. Nine percent of Fox's polls were conducted in April, five percent in August, etc. Looking across the columns, on the other hand, are not very intuitive (or useful, probably) until you get to the total, which shows the average across all pollsters by month.

                              Crosstab Column Proportions:                               
 [labels]   |  fox   |  gallup  |  newsweek  |  time.cnn  |  upenn  |  zogby  |  total  |
-----------------------------------------------------------------------------------------
     APRIL  |    9%  |      8%  |        5%  |        3%  |     0%  |     7%  |     7%  |
    AUGUST  |    5%  |      7%  |        4%  |        3%  |     0%  |     4%  |     5%  |
  DECEMBER  |    6%  |      8%  |        7%  |       10%  |    20%  |    11%  |     8%  |
  FEBRUARY  |   11%  |      8%  |        7%  |       13%  |    10%  |     9%  |     9%  |
   JANUARY  |   11%  |     11%  |       11%  |       10%  |    50%  |    18%  |    13%  |
      JULY  |    9%  |      8%  |        7%  |       10%  |     0%  |     9%  |     8%  |
      JUNE  |    9%  |      9%  |        2%  |        3%  |     0%  |     9%  |     7%  |
     MARCH  |    8%  |     10%  |        7%  |       10%  |     0%  |    13%  |     9%  |
       MAY  |    6%  |      8%  |        9%  |       10%  |     0%  |     2%  |     7%  |
  NOVEMBER  |    6%  |      8%  |       11%  |       10%  |    10%  |     2%  |     7%  |
   OCTOBER  |   11%  |      8%  |       15%  |        7%  |    10%  |     7%  |    10%  |
 SEPTEMBER  |    8%  |      8%  |       15%  |       10%  |     0%  |     9%  |     9%  |
     Total  |  100%  |    100%  |      100%  |      100%  |   100%  |   100%  |   100%  |


Here, row percents show the opposite viewpoint.

Table rowPercents = table.xTabRowPercents("month", "who");


Here, we see that, of all the polls conducted in April, Fox conducted 26 percent, Gallup conducted 43 percent, and The University of Pennsylvania conducted zero percent when rounded.

                               Crosstab Row Proportions:                                
 [labels]   |  fox  |  gallup  |  newsweek  |  time.cnn  |  upenn  |  zogby  |  total  |
----------------------------------------------------------------------------------------
     APRIL  |  26%  |     43%  |       13%  |        4%  |     0%  |    13%  |   100%  |
    AUGUST  |  19%  |     50%  |       12%  |        6%  |     0%  |    12%  |   100%  |
  DECEMBER  |  15%  |     33%  |       15%  |       11%  |     7%  |    19%  |   100%  |
  FEBRUARY  |  24%  |     31%  |       14%  |       14%  |     3%  |    14%  |   100%  |
   JANUARY  |  17%  |     31%  |       14%  |        7%  |    12%  |    19%  |   100%  |
      JULY  |  23%  |     35%  |       15%  |       12%  |     0%  |    15%  |   100%  |
      JUNE  |  26%  |     48%  |        4%  |        4%  |     0%  |    17%  |   100%  |
     MARCH  |  17%  |     40%  |       13%  |       10%  |     0%  |    20%  |   100%  |
       MAY  |  18%  |     41%  |       23%  |       14%  |     0%  |     5%  |   100%  |
  NOVEMBER  |  17%  |     38%  |       25%  |       12%  |     4%  |     4%  |   100%  |
   OCTOBER  |  23%  |     32%  |       26%  |        6%  |     3%  |    10%  |   100%  |
 SEPTEMBER  |  17%  |     33%  |       27%  |       10%  |     0%  |    13%  |   100%  |
     Total  |  20%  |     37%  |       17%  |        9%  |     3%  |    14%  |   100%  |


And, that's all there is to cross-tabs.

Tablesaw supports many other functions that make data analysis easy, including loading data, filtering, mapping, summarizing, cleaning, saving, and so on. Tablesaw's visualization support is based on the plot.ly library. 

Download Building Reactive Microservices in Java: Asynchronous and Event-Based Application Design. Brought to you in partnership with Red Hat

Topics:
java ,data analysis ,tablesaw ,cross tabs ,tutorial

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}