Data Science for Java Developers With Tablesaw

Tablesaw is like an open-source Java power tool for data manipulation with hooks for interactive visualization, analytics, and machine learning. Come learn all about it!

By Larry White · Aug. 20, 17 · Tutorial

Data science is one of the hottest areas in computing today. Most people learn data science using either Python or R. Both are excellent languages for crunching and analyzing data.

But many Java developers feel left behind. There are great Java libraries for machine learning, especially for jobs that require distributed computing, but there's no simple path for Java developers to learn and apply data science. By minimizing the number of things you need to learn, the open-source Tablesaw provides a gateway.

Think of Tablesaw as a Java power tool for data manipulation with hooks for interactive visualization, analytics, and machine learning. Used interactively or embedded in an application, its focus is to make data science as easy in Java as in R or Python. If you've done some data science, you may think of it as a data frame.

Tablesaw is easy to learn, but it's not a toy. Tables can be large, up to two billion rows. Performance is brisk: on my laptop, I can retrieve 500 records from a table of half a billion rows in two milliseconds. It is open source under the business-friendly Apache 2 license.

What Makes Tablesaw Beginner-Friendly?

  1. It builds on what you know: For Java developers who want to do data science, it's a huge advantage to not have to also learn a new language.
  2. It's easy to get started: Simply add Tablesaw as a Maven dependency for your project and you’re up and running. We’ll walk through an example below to show you how.
  3. It's not distributed: Unlike many machine learning libraries, Tablesaw is not a distributed system. This removes enormous complexity and makes machine learning accessible to those without deep engineering experience or support. 
  4. The code is clear: There's a fluent API so you’ll understand your code the next time you read it. 
  5. It provides fast feedback: Tablesaw is designed to be used interactively for exploratory analysis.

Introductory Example

Here, I’ll show you some of Tablesaw’s basic data manipulation features. Future posts will address visualization, machine learning, the Kotlin API and REPL, and the Tablesaw architecture. The code for this example can be found here. 

Up and Running

To begin, create a Java project and add the Tablesaw core library as a Maven dependency. The current dependency is:

<!-- https://mvnrepository.com/artifact/tech.tablesaw/tablesaw-core -->
<dependency>
    <groupId>tech.tablesaw</groupId>
    <artifactId>tablesaw-core</artifactId>
    <version>0.23.3</version>
</dependency>

Next, create a class with a main method like so:

public class Foo {
    public static void main(String[] args) {
        // rest of code goes here
    }
}

The rest of our code will go in this method.

The first thing to do is add a table. Tablesaw can load data from relational databases, but we will create our table from a flat text file: 

Table table1 = Table.read().csv("bush.csv");

Table objects can provide a lot of information:

  • table1.name();  returns bush.csv since the table name defaults to the file name.

  • table1.shape(); returns 323 rows X 3 cols.

  • table1.structure(); returns a table of column metadata:

Index Column Name Column Type 
0     date        LOCAL_DATE  
1     approval    SHORT_INT   
2     who         CATEGORY    

Note that we've inferred the column types from the data.

  • table1.first(3); returns a new table containing only the first three rows.

BushApproval.csv
date       approval who 
2004-02-04 53       fox 
2004-01-21 53       fox 
2004-01-07 58       fox 
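
To make the idea of inferring column types more concrete, here is a minimal plain-Java sketch. It does not use Tablesaw, and Tablesaw's actual inference logic may well differ. The class name TypeGuess and the labels it returns are hypothetical. The sketch tries to read every value in a column as an ISO date, then as an integer, and otherwise treats the column as categorical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.List;

// Hypothetical illustration only; not Tablesaw's implementation.
final class TypeGuess {

    // Returns a label for the first type that fits every value in the column.
    static String guess(List<String> values) {
        if (values.stream().allMatch(TypeGuess::isDate)) {
            return "LOCAL_DATE";
        }
        if (values.stream().allMatch(TypeGuess::isInteger)) {
            return "INTEGER";
        }
        return "CATEGORY";
    }

    private static boolean isDate(String s) {
        try {
            LocalDate.parse(s); // expects ISO format such as 2004-02-04
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    private static boolean isInteger(String s) {
        try {
            Integer.parseInt(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

Checking the most specific type first matters here: every value would also be acceptable as a category, so that has to be the fallback.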

Inevitably, we want to work with columns. Each column has a data type, and you'll usually want to retrieve it as that type rather than as a generic column, because typed columns support more operations. For example, to get the approval column, you can use:

NumberColumn approval = table1.numberColumn("approval");

Each column sub-type supports numerous operations. As a rule, operations on a column are applied to every element without explicit loops. Some call these “vector operations.” For example, operations like count(), min(), and contains() produce a single value for a column of data: 

double min = approval.min();

Other operations return a new column. The method dayOfYear() applied to a DateColumn returns a short integer column with each element the day of the year from 1 to 366.

Some column-returning operations take a scalar value as an argument, for example dateColumn.plusDays(4).

This adds four days to every element. Others take a second column as an argument. These process the two columns in order, applying each integer value from the argument to the corresponding element in the receiver.
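
The difference between operations that reduce a column to one value and operations that map it to a new column can be pictured in plain Java. The sketch below is only an illustration built on java.time and the standard collections; it is not the Tablesaw API, and the class name ColumnSketch is hypothetical.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java illustration of column-style operations; not the Tablesaw API.
final class ColumnSketch {

    // Reducing operation: one value for the whole column.
    static double min(double[] column) {
        double result = Double.POSITIVE_INFINITY;
        for (double v : column) {
            result = Math.min(result, v);
        }
        return result;
    }

    // Mapping operation: a new column holding each date's day of the year.
    static List<Integer> dayOfYear(List<LocalDate> dates) {
        return dates.stream().map(LocalDate::getDayOfYear).collect(Collectors.toList());
    }

    // Mapping operation with a scalar argument: every date shifted by the same number of days.
    static List<LocalDate> plusDays(List<LocalDate> dates, long days) {
        return dates.stream().map(d -> d.plusDays(days)).collect(Collectors.toList());
    }
}
```

In each case the caller writes no loop; the iteration lives inside the operation, which is the point of the column-oriented style.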

Boolean operations like isMonday() don’t return a boolean column directly, but a Selection instead. Selections can be used to filter tables by the values in their columns, so we’ll see them again: 

Selection selection = table1.dateColumn("date").isMonday();

You can, of course, get a boolean column if you want it. You simply pass the Selection and the original column length to a BooleanColumn constructor, along with a name for the new column: 

BooleanColumn mondays = new BooleanColumn("mondays", selection, 1000);
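
One way to picture why the column length is needed: a selection can be thought of as the set of row indices that matched, which by itself does not say how many rows there were. The plain-Java sketch below models that idea with java.util.BitSet. It is an assumption made for illustration, not a description of how Tablesaw stores a Selection, and the class name SelectionSketch is hypothetical.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.util.BitSet;
import java.util.List;

// Hypothetical stand-in for a selection: the indices of the rows that match.
final class SelectionSketch {

    // Marks the index of every date that falls on a Monday.
    static BitSet isMonday(List<LocalDate> dates) {
        BitSet selected = new BitSet(dates.size());
        for (int i = 0; i < dates.size(); i++) {
            if (dates.get(i).getDayOfWeek() == DayOfWeek.MONDAY) {
                selected.set(i);
            }
        }
        return selected;
    }

    // Expands the index set into one boolean per row. The length is supplied by the
    // caller because the index set does not record how many rows there were.
    static boolean[] toBooleans(BitSet selected, int length) {
        boolean[] result = new boolean[length];
        for (int i = selected.nextSetBit(0); i >= 0 && i < length; i = selected.nextSetBit(i + 1)) {
            result[i] = true;
        }
        return result;
    }
}
```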

There are hundreds of methods available for column manipulation, but let's turn now to tables. Operations exist for creating, describing, modifying, sorting, querying, and summarizing tables. Here we'll cover sorting, querying, and summarizing. 

Queries

Queries apply a selection to a table and return a new filtered table. The method where() is what you want.

Usually, you will pass the query as a Selection to where(). Queries can be easily created:  

NumberColumn approval = table1.numberColumn("approval");
Table highApproval = table1.where(approval.isGreaterThan(80));

This uses the same kind of Selection object we saw earlier with columns. Any column method that returns a Selection can be passed to a table's where() method, which lets you use column-specific logic to query a table:

DateColumn date = table1.dateColumn("date");
Table Q3 = table1.where(date.isInQ3()); 
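
For readers who want to see what such a query amounts to, here is a plain-Java sketch of filtering rows with a predicate. It is an illustration only and does not use Tablesaw. The names QuerySketch and Poll are hypothetical, and the quarter test relies on java.time's IsoFields.QUARTER_OF_YEAR.

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Plain-Java picture of a filtering query; not the Tablesaw API.
final class QuerySketch {

    // One row of the example data (hypothetical class).
    static final class Poll {
        final LocalDate date;
        final int approval;
        final String who;

        Poll(LocalDate date, int approval, String who) {
            this.date = date;
            this.approval = approval;
            this.who = who;
        }
    }

    // Returns a new list with the matching rows; the input is left unchanged.
    static List<Poll> where(List<Poll> rows, Predicate<Poll> condition) {
        return rows.stream().filter(condition).collect(Collectors.toList());
    }

    // True for dates in the third quarter (July through September).
    static boolean isInQ3(Poll p) {
        return p.date.get(IsoFields.QUARTER_OF_YEAR) == 3;
    }

    // True for approval ratings above 80.
    static boolean isHighApproval(Poll p) {
        return p.approval > 80;
    }
}
```

With this sketch, QuerySketch.where(rows, QuerySketch::isInQ3) plays the role of the Q3 query above.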

Sorting

The easiest way to sort a table is with sortOn():

table1.sortOn("who", "approval");

Here “who” and “approval” are column names, and the sort is ascending. To sort in descending order, use sortDescendingOn().

To sort in mixed order, you can prepend a minus sign to a column name to indicate a descending sort on that column. For example, table1.sortOn("who", "-approval"); sorts on “who” in ascending order, and on “approval” in descending order.

Finally, you can write your own sort logic as an IntComparator, giving you full control over the ordering. 
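
As a point of comparison, the mixed-order sort described above (ascending on “who”, descending on “approval”) can be expressed in plain Java with java.util.Comparator. This sketch is an illustration only; it uses neither Tablesaw nor IntComparator, and the names SortSketch and Row are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java picture of a mixed-order sort; not the Tablesaw API.
final class SortSketch {

    // Hypothetical row type with just the two sort keys.
    static final class Row {
        final String who;
        final int approval;

        Row(String who, int approval) {
            this.who = who;
            this.approval = approval;
        }
    }

    // Ascending on who, then descending on approval within each who.
    static final Comparator<Row> WHO_THEN_APPROVAL_DESC =
            Comparator.comparing((Row r) -> r.who)
                    .thenComparing(Comparator.comparingInt((Row r) -> r.approval).reversed());

    // Returns a sorted copy so the original order is preserved.
    static List<Row> sorted(List<Row> rows) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort(WHO_THEN_APPROVAL_DESC);
        return copy;
    }
}
```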

Summarizing

Now we’ll look at summarization, including pivot tables (cross tabs). If you simply want to calculate group statistics for a table, the summarize() method works nicely. A large number of statistics are available, including range, as shown below.

Table summary = table1.summarize("approval", range).by("who");


BushApproval.csv summary
who      Range [approval] 
fox      42.0             
gallup   41.0            
newsweek 40.0             
time.cnn 37.0             
upenn    10.0             
zogby    37.0             
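
The range statistic in this summary is the difference between the largest and smallest value in each group. A plain-Java sketch of that calculation follows. It is an illustration only, not the Tablesaw API, and the name SummarySketch and its parallel-array input format are assumptions made for the example.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java picture of a grouped range; not the Tablesaw API.
final class SummarySketch {

    // who[i] and approval[i] describe row i. Returns max - min of approval
    // for each group, keyed by group name in sorted order.
    static Map<String, Double> rangeBy(String[] who, int[] approval) {
        // For each group, track {min, max} seen so far.
        Map<String, int[]> minMax = new TreeMap<>();
        for (int i = 0; i < who.length; i++) {
            int v = approval[i];
            int[] mm = minMax.computeIfAbsent(who[i], k -> new int[] {v, v});
            mm[0] = Math.min(mm[0], v);
            mm[1] = Math.max(mm[1], v);
        }

        Map<String, Double> ranges = new TreeMap<>();
        minMax.forEach((name, mm) -> ranges.put(name, (double) (mm[1] - mm[0])));
        return ranges;
    }
}
```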

Cross tabs are useful for producing counts or frequencies of the number of observations in a combination of categories. First, let's get two categorical columns:

CategoryColumn who = table1.categoryColumn("who");
CategoryColumn month = date.month();
table1.addColumn(month);

Now, we can calculate the raw counts for each combination: 

Table xtab = CrossTab.xTabCount(table1, month, who);

Crosstab Counts: date month x who
          fox gallup newsweek time.cnn upenn zogby total 
APRIL     6   10     3        1        0     3     23    
AUGUST    3   8      2        1        0     2     16    
DECEMBER  4   9      4        3        2     5     27    
FEBRUARY  7   9      4        4        1     4     29    
JANUARY   7   13     6        3        5     8     42    
JULY      6   9      4        3        0     4     26    
JUNE      6   11     1        1        0     4     23    
MARCH     5   12     4        3        0     6     30    
MAY       4   9      5        3        0     1     22    
NOVEMBER  4   9      6        3        1     1     24    
OCTOBER   7   10     8        2        1     3     31    
SEPTEMBER 5   10     8        3        0     4     30    
Total     64  119    55       30       10    45    323   
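
Each count in the cross tab is the number of rows that share one combination of the two categories. The plain-Java sketch below shows that counting step. It is an illustration only, not the Tablesaw API; the name CrossTabSketch is hypothetical, and unlike the output above it leaves out combinations that never occur instead of reporting them as 0, and it does not add totals.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java picture of cross-tab counting; not the Tablesaw API.
final class CrossTabSketch {

    // rowLabels[i] and colLabels[i] are the two categories of observation i.
    // Returns, for each row label, a map from column label to the number of
    // observations with that pair. Pairs that never occur are absent.
    static Map<String, Map<String, Integer>> counts(String[] rowLabels, String[] colLabels) {
        Map<String, Map<String, Integer>> table = new TreeMap<>();
        for (int i = 0; i < rowLabels.length; i++) {
            table.computeIfAbsent(rowLabels[i], k -> new TreeMap<>())
                 .merge(colLabels[i], 1, Integer::sum);
        }
        return table;
    }
}
```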

If you prefer to see the relative frequency for each combination, pass your crosstab table to the tablePercents() method:

CrossTab.tablePercents(xtab);

and it will return a table showing the relative frequency of each cell:

Crosstab Table Proportions: 
          fox         gallup      newsweek     time.cnn     upenn        zogby        total       
APRIL     0.01857585  0.030959751 0.009287925  0.0030959751 0.0          0.009287925  0.071207434 
AUGUST    0.009287925 0.024767801 0.0061919503 0.0030959751 0.0          0.0061919503 0.049535602 
DECEMBER  0.012383901 0.027863776 0.012383901  0.009287925  0.0061919503 0.015479876  0.083591335 
FEBRUARY  0.021671826 0.027863776 0.012383901  0.012383901  0.0030959751 0.012383901  0.08978328  
JANUARY   0.021671826 0.04024768  0.01857585   0.009287925  0.015479876  0.024767801  0.13003096  
JULY      0.01857585  0.027863776 0.012383901  0.009287925  0.0          0.012383901  0.08049536  
JUNE      0.01857585  0.03405573  0.0030959751 0.0030959751 0.0          0.012383901  0.071207434 
MARCH     0.015479876 0.0371517   0.012383901  0.009287925  0.0          0.01857585   0.09287926  
MAY       0.012383901 0.027863776 0.015479876  0.009287925  0.0          0.0030959751 0.06811146  
NOVEMBER  0.012383901 0.027863776 0.01857585   0.009287925  0.0030959751 0.0030959751 0.0743034   
OCTOBER   0.021671826 0.030959751 0.024767801  0.0061919503 0.0030959751 0.009287925  0.095975235 
SEPTEMBER 0.015479876 0.030959751 0.024767801  0.009287925  0.0          0.012383901  0.09287926  
Total     0.19814241  0.36842105  0.17027864   0.09287926   0.030959751  0.13931888   1.0         

There are similar methods for getting the row-wise or column-wise frequencies. 
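
The table proportions are each cell's count divided by the grand total. For instance, the APRIL/fox cell holds 6 of the 323 observations, which is about 0.0186. A plain-Java sketch of the calculation follows. It is an illustration only, not the Tablesaw API; the name ProportionSketch is hypothetical, and the input is assumed to contain the raw counts without the total row and column.

```java
// Plain-Java picture of table proportions; not the Tablesaw API.
final class ProportionSketch {

    // Divides every cell by the grand total, so the cells of the result sum to 1
    // (or are all 0 when the table is empty).
    static double[][] tableProportions(int[][] counts) {
        long total = 0;
        for (int[] row : counts) {
            for (int cell : row) {
                total += cell;
            }
        }

        double[][] result = new double[counts.length][];
        for (int r = 0; r < counts.length; r++) {
            result[r] = new double[counts[r].length];
            for (int c = 0; c < counts[r].length; c++) {
                result[r][c] = total == 0 ? 0.0 : (double) counts[r][c] / total;
            }
        }
        return result;
    }
}
```

Row-wise or column-wise frequencies would follow the same pattern, dividing by the row or column total instead of the grand total.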

What's Next?

There is much more to Tablesaw, and I hope this encourages you to give it a try. As I mentioned, future posts will cover visualization, machine learning, and more. You can find the code on GitHub at https://github.com/jtablesaw/tablesaw.

Since you're a Java developer, consider taking a look at our contributor's page. Tablesaw is a work in progress. Help us make Java a great platform for data science. 
