DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Because the DevOps movement has redefined engineering responsibilities, SREs now have to become stewards of observability strategy.

Apache Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors.

The software you build is only as secure as the code that powers it. Learn how malicious code creeps into your software supply chain.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

Related

  • Practical Generators in Go 1.23 for Database Pagination
  • Java and MongoDB Integration: A CRUD Tutorial [Video Tutorial]
  • Mastering Persistence: Why the Persistence Layer Is Crucial for Modern Java Applications
  • Hyperion Essbase Technical Functionality

Trending

  • Scaling Microservices With Docker and Kubernetes on Production
  • Chat With Your Knowledge Base: A Hands-On Java and LangChain4j Guide
  • Monolith: The Good, The Bad and The Ugly
  • Build a Simple REST API Using Python Flask and SQLite (With Tests)
  1. DZone
  2. Data Engineering
  3. Data
  4. Java Data Analysis: Using Cross Tabs

Java Data Analysis: Using Cross Tabs

Want to learn more about using data analysis in your code? Check out this tutorial on using cross tabs for data analysis in Java!

By 
Larry White user avatar
Larry White
·
Jul. 24, 18 · Tutorial
Likes (6)
Comment
Save
Tweet
Share
12.1K Views

Join the DZone community and get the full member experience.

Join For Free

One of the most common data analysis operations is looking at how frequently certain kinds of observations appear in a data set.  You can use cross-tabulations, also known as contingency tables, to perform this task.

Here, we show you how to compute cross-tabs easily in Java using the Tablesaw open source (Apache 2) library. Tablesaw is a data-frame and visualization library that supports one and two dimensional cross-tabs. For this, the Table class contains the methods you need.

Example

In the example below, we show the observation counts for each combination. The dataset is using polling results from President George W. Bush, before and after 9/11/2001.  

// Preparation: load the data
Table table = Table.read().csv("../data/bush.csv");
// Add a column for the months from the date col so we can count by month
StringColumn month = table.dateColumn("date").month();
month.setName("month");
table.addColumns(month);

// perform the crossTab operation
Table counts = table.xTabCounts("month", "who");

// formatting 
// make table print as integers with no decimals instead of the raw doubles it holds
counts.columnsOfType(ColumnType.NUMBER)
                .forEach(x ->
                       ((NumberColumn)x).setPrintFormatter(NumberColumnFormatter.ints()));
// print
System.out.println(counts);


The formatted output is shown below.

                              Crosstab Counts: month x who                              
 [labels]   |  fox  |  gallup  |  newsweek  |  time.cnn  |  upenn  |  zogby  |  total  |
----------------------------------------------------------------------------------------
     APRIL  |    6  |      10  |         3  |         1  |      0  |      3  |     23  |
    AUGUST  |    3  |       8  |         2  |         1  |      0  |      2  |     16  |
  DECEMBER  |    4  |       9  |         4  |         3  |      2  |      5  |     27  |
  FEBRUARY  |    7  |       9  |         4  |         4  |      1  |      4  |     29  |
   JANUARY  |    7  |      13  |         6  |         3  |      5  |      8  |     42  |
      JULY  |    6  |       9  |         4  |         3  |      0  |      4  |     26  |
      JUNE  |    6  |      11  |         1  |         1  |      0  |      4  |     23  |
     MARCH  |    5  |      12  |         4  |         3  |      0  |      6  |     30  |
       MAY  |    4  |       9  |         5  |         3  |      0  |      1  |     22  |
  NOVEMBER  |    4  |       9  |         6  |         3  |      1  |      1  |     24  |
   OCTOBER  |    7  |      10  |         8  |         2  |      1  |      3  |     31  |
 SEPTEMBER  |    5  |      10  |         8  |         3  |      0  |      4  |     30  |
     Total  |   64  |     119  |        55  |        30  |     10  |     45  |    323  |


Of course, you can dispense with the formatting if you like. It is important to note the total column on the right, which shows that 23 polls were conducted in April, 16 polls in August, etc, across all pollsters. Similarly, the column totals at the bottom shows that Fox conducted 64 polls, Gallup 119, etc.

Single Variable Totals

You can get single variable counts using the  xTabCounts() method that takes only one column name argument .

Table whoCounts = table.xTabCounts("who");
// formatting and printing as above


This produces:

     Column: who      
 Category  |  Count  |
----------------------
   gallup  |    119  |
    zogby  |     45  |
 time.cnn  |     30  |
      fox  |     64  |
 newsweek  |     55  |
    upenn  |     10  |


Calculating Percents

You may want to see the percent of polls conducted by each pollster, rather than raw counts. The xTabPercents() method can be used for that.

Table whoPercents = table.xTabPercents("who");


Actually, percents are a misnomer. The results produced are the proportions in decimal format. To get percent-formatted output, we use a different  NumericColumnFormatter.

When you have two variables, you can display the percent that falls into each combination as shown below.

Table tablePercents = table.xTabTablePercents("month", "who");
tablePercents.columnsOfType(ColumnType.NUMBER)
                .forEach(x ->
                ((NumberColumn)x).setPrintFormatter(NumberColumnFormatter.percent(1)));


Because the percents are small, we updated the formatter to show a single fractional digit after the decimal point. The output can best be understood by looking at an example. Of all the polls in the dataset, 1.9 percent were conducted by Fox in April, 3.1 percent by Gallup in April, and 0.9 percent by Fox in August, as well as the others shown in the table below.

                               Crosstab Table Proportions:                                 
 [labels]   |   fox   |  gallup  | newsweek  |  time.cnn  |  upenn  |  zogby  |  total   |
------------------------------------------------------------------------------------------
     APRIL  |   1.9%  |    3.1%  |     0.9%  |      0.3%  |   0.0%  |   0.9%  |    7.1%  |
    AUGUST  |   0.9%  |    2.5%  |     0.6%  |      0.3%  |   0.0%  |   0.6%  |    5.0%  |
  DECEMBER  |   1.2%  |    2.8%  |     1.2%  |      0.9%  |   0.6%  |   1.5%  |    8.4%  |
  FEBRUARY  |   2.2%  |    2.8%  |     1.2%  |      1.2%  |   0.3%  |   1.2%  |    9.0%  |
   JANUARY  |   2.2%  |    4.0%  |     1.9%  |      0.9%  |   1.5%  |   2.5%  |   13.0%  |
      JULY  |   1.9%  |    2.8%  |     1.2%  |      0.9%  |   0.0%  |   1.2%  |    8.0%  |
      JUNE  |   1.9%  |    3.4%  |     0.3%  |      0.3%  |   0.0%  |   1.2%  |    7.1%  |
     MARCH  |   1.5%  |    3.7%  |     1.2%  |      0.9%  |   0.0%  |   1.9%  |    9.3%  |
       MAY  |   1.2%  |    2.8%  |     1.5%  |      0.9%  |   0.0%  |   0.3%  |    6.8%  |
  NOVEMBER  |   1.2%  |    2.8%  |     1.9%  |      0.9%  |   0.3%  |   0.3%  |    7.4%  |
   OCTOBER  |   2.2%  |    3.1%  |     2.5%  |      0.6%  |   0.3%  |   0.9%  |    9.6%  |
 SEPTEMBER  |   1.5%  |    3.1%  |     2.5%  |      0.9%  |   0.0%  |   1.2%  |    9.3%  |
     Total  |  19.8%  |   36.8%  |    17.0%  |      9.3%  |   3.1%  |  13.9%  |  100.0%  |


As you can see, this also gives you the total percents by month and pollster.

Column Percents and Row Percents

The final option is to show column percents or row percents. We'll start with column percents. You calculate them as shown below.

Table columnPercents = table.xTabColumnPercents("month", "who");


This produces the following table. If you look at the column for "fox," the values you see are the percentages for Fox alone. Nine percent of Fox's polls were conducted in April, five percent in August, etc. Looking across the columns, on the other hand, are not very intuitive (or useful, probably) until you get to the total, which shows the average across all pollsters by month.

                              Crosstab Column Proportions:                               
 [labels]   |  fox   |  gallup  |  newsweek  |  time.cnn  |  upenn  |  zogby  |  total  |
-----------------------------------------------------------------------------------------
     APRIL  |    9%  |      8%  |        5%  |        3%  |     0%  |     7%  |     7%  |
    AUGUST  |    5%  |      7%  |        4%  |        3%  |     0%  |     4%  |     5%  |
  DECEMBER  |    6%  |      8%  |        7%  |       10%  |    20%  |    11%  |     8%  |
  FEBRUARY  |   11%  |      8%  |        7%  |       13%  |    10%  |     9%  |     9%  |
   JANUARY  |   11%  |     11%  |       11%  |       10%  |    50%  |    18%  |    13%  |
      JULY  |    9%  |      8%  |        7%  |       10%  |     0%  |     9%  |     8%  |
      JUNE  |    9%  |      9%  |        2%  |        3%  |     0%  |     9%  |     7%  |
     MARCH  |    8%  |     10%  |        7%  |       10%  |     0%  |    13%  |     9%  |
       MAY  |    6%  |      8%  |        9%  |       10%  |     0%  |     2%  |     7%  |
  NOVEMBER  |    6%  |      8%  |       11%  |       10%  |    10%  |     2%  |     7%  |
   OCTOBER  |   11%  |      8%  |       15%  |        7%  |    10%  |     7%  |    10%  |
 SEPTEMBER  |    8%  |      8%  |       15%  |       10%  |     0%  |     9%  |     9%  |
     Total  |  100%  |    100%  |      100%  |      100%  |   100%  |   100%  |   100%  |


Here, row percents show the opposite viewpoint.

Table rowPercents = table.xTabRowPercents("month", "who");


Here, we see that, of all the polls conducted in April, Fox conducted 26 percent, Gallup conducted 43 percent, and The University of Pennsylvania conducted zero percent when rounded.

                               Crosstab Row Proportions:                                
 [labels]   |  fox  |  gallup  |  newsweek  |  time.cnn  |  upenn  |  zogby  |  total  |
----------------------------------------------------------------------------------------
     APRIL  |  26%  |     43%  |       13%  |        4%  |     0%  |    13%  |   100%  |
    AUGUST  |  19%  |     50%  |       12%  |        6%  |     0%  |    12%  |   100%  |
  DECEMBER  |  15%  |     33%  |       15%  |       11%  |     7%  |    19%  |   100%  |
  FEBRUARY  |  24%  |     31%  |       14%  |       14%  |     3%  |    14%  |   100%  |
   JANUARY  |  17%  |     31%  |       14%  |        7%  |    12%  |    19%  |   100%  |
      JULY  |  23%  |     35%  |       15%  |       12%  |     0%  |    15%  |   100%  |
      JUNE  |  26%  |     48%  |        4%  |        4%  |     0%  |    17%  |   100%  |
     MARCH  |  17%  |     40%  |       13%  |       10%  |     0%  |    20%  |   100%  |
       MAY  |  18%  |     41%  |       23%  |       14%  |     0%  |     5%  |   100%  |
  NOVEMBER  |  17%  |     38%  |       25%  |       12%  |     4%  |     4%  |   100%  |
   OCTOBER  |  23%  |     32%  |       26%  |        6%  |     3%  |    10%  |   100%  |
 SEPTEMBER  |  17%  |     33%  |       27%  |       10%  |     0%  |    13%  |   100%  |
     Total  |  20%  |     37%  |       17%  |        9%  |     3%  |    14%  |   100%  |


And, that's all there is to cross-tabs.

Tablesaw supports many other functions that make data analysis easy, including loading data, filtering, mapping, summarizing, cleaning, saving, and so on. Tablesaw's visualization support is based on the plot.ly library. 

Data analysis Data (computing) Database Java (programming language)

Opinions expressed by DZone contributors are their own.

Related

  • Practical Generators in Go 1.23 for Database Pagination
  • Java and MongoDB Integration: A CRUD Tutorial [Video Tutorial]
  • Mastering Persistence: Why the Persistence Layer Is Crucial for Modern Java Applications
  • Hyperion Essbase Technical Functionality

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!