
Looking for the Best Java Data Computation Layer Tool

This essay takes a deep dive into four types of data computation layer tools (class libraries), comparing their structured data computing capabilities and basic functionality.

By Jerry Zhang · May 20, 2021

In most cases, structured data is computed within the database in SQL. In a few other cases, where a database is absent or unavailable, Java programmers have to perform the computations in Java itself. Hardcoding those computations, however, involves a heavy workload. A more convenient alternative is a Java data computation layer tool (including library functions), which computes the data and returns the result set. Here we examine the structured data computation abilities of several Java data computation layer tools.
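To see why hardcoding is laborious, here is a minimal sketch of a hand-coded grouping and aggregation over the Orders.txt file used throughout this essay (its tab-separated layout is shown in the next section); every operation a database would supply has to be written out by hand:

Java
import java.io.BufferedReader;
import java.nio.file.*;
import java.util.*;

// A minimal sketch of hand-coded grouping & aggregation: total Amount per year
// from the tab-separated Orders.txt. Parsing, grouping, and aggregating, which
// SQL would provide, must all be written out manually.
public class HardcodedGroupBy {
    public static void main(String[] args) throws Exception {
        Map<String, Double> totalByYear = new TreeMap<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get("D:/data/Orders.txt"))) {
            String line = in.readLine();             // skip the header row
            while ((line = in.readLine()) != null) {
                String[] f = line.split("\t");       // OrderID, Client, SellerId, Amount, OrderDate
                String year = f[4].substring(0, 4);  // crude year extraction from yyyy-MM-dd
                totalByYear.merge(year, Double.parseDouble(f[3]), Double::sum);
            }
        }
        totalByYear.forEach((y, s) -> System.out.println(y + "\t" + s));
    }
}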

1. SQL Engine for Files 

This type of tool uses files, such as CSV and XLS, as physical database tables and offers a JDBC interface upward so that programmers can compute over the tables in SQL statements. There are many products of this type, such as CSVJDBC, XLSJDBC, CDATA Excel JDBC, and xlSQL, but none is mature enough to use conveniently. CSVJDBC is the most mature if we must pick one among them.

CSVJDBC is a free, open-source Java class library, which is naturally integration-friendly. Users just need to download a jar to integrate it with a Java program through the JDBC interface. Below is a part of the tab-separated text file d:\data\Orders.txt:

Plain Text
OrderID   Client   SellerId   Amount   OrderDate
26        TAS      1          2142.4   2009-08-05
33        DSGC     1          613.2    2009-08-14
84        GC       1          88.5     2009-10-16
133       HU       1          1419.8   2010-12-12
…

The following block of code reads in all records of the file and prints them out in the console:

Java
package csvjdbctest;

import java.sql.*;
import org.relique.jdbc.csv.CsvDriver;
import java.util.Properties;

public class Test1 {
    public static void main(String[] args) throws Exception {
        Class.forName("org.relique.jdbc.csv.CsvDriver");
        // Create a connection to the directory holding the text files
        String url = "jdbc:relique:csv:" + "D:\\data" + "?" + "separator=\t" + "&" + "fileExtension=.txt";
        Properties props = new Properties();
        // Columns can only be queried after their types are specified
        props.put("columnTypes", "Int,String,Int,Double,Date");
        Connection conn = DriverManager.getConnection(url, props);
        // Create a Statement object to execute the query with
        Statement stmt = conn.createStatement();
        // SQL: full-table query
        ResultSet results = stmt.executeQuery("SELECT * FROM Orders");
        // Dump the results to the console in the same CSV format
        CsvDriver.writeToCsv(results, System.out, true);
        // Clean up
        conn.close();
    }
}

The Java class library has rather good support for hot deployment because it implements computations in SQL and conveniently separates the SQL queries from the Java code. Users can modify a SQL statement directly without recompiling or restarting the application.
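For example, a sketch of this separation (the query file path is an assumption for illustration): the SQL text lives in an external file and is loaded at run time, so editing the file changes the query without touching the compiled code:

Java
import java.nio.file.*;
import java.sql.*;

// A minimal sketch: keep the SQL in an external text file (path assumed for
// illustration) so the query can change without recompiling the application.
public class ExternalSqlDemo {
    public static void main(String[] args) throws Exception {
        String sql = Files.readString(Paths.get("D:/data/orders-query.sql"));
        try (Connection conn = DriverManager.getConnection(
                "jdbc:relique:csv:D:\\data?separator=\t&fileExtension=.txt");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}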

Though CSVJDBC is excellent at Java integration and hot deployment, these are not the core abilities of a Java data computation layer tool. Structured data computation is at the heart of such a tool, and this is exactly where the JDBC driver falls short.

CSVJDBC supports only a few basic computations: conditional queries, sorting, and grouping with aggregation.

Java
// SQL: conditional query
ResultSet results = stmt.executeQuery("SELECT * FROM Orders where Amount>1000 and Amount<=3000 and Client like '%S%'");
// SQL: order by
results = stmt.executeQuery("SELECT * FROM Orders order by Client, Amount desc");
// SQL: group by
results = stmt.executeQuery("SELECT year(Orderdate) y, sum(Amount) s FROM Orders group by year(Orderdate)");

Set-oriented operations, subqueries, and join operations are also basic computations, but none of them is supported. Even the supported operations have a lot of problems. For instance, CSVJDBC requires that all file data be loaded into memory for sorting, grouping, and aggregation, so files had better not be too large.

SQL statements do not support debugging, and they perform poorly. Moreover, CSVJDBC supports only CSV files; database tables, Excel files, and JSON data must first be converted to CSV for processing, and the conversion is costly because it needs hardcoding or a third-party tool.
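As an illustration of that conversion cost, the following sketch converts an Excel workbook to CSV with Apache POI, one possible third-party library (the file paths and single-sheet layout are assumptions):

Java
import java.io.*;
import java.nio.file.*;
import org.apache.poi.ss.usermodel.*;

// A minimal sketch of an XLSX-to-CSV conversion with Apache POI, so the result
// can be fed to a CSV-only engine like CSVJDBC. Paths and the single-sheet
// layout are assumptions; values containing commas are not quoted here.
public class XlsxToCsv {
    public static void main(String[] args) throws Exception {
        DataFormatter fmt = new DataFormatter();  // renders cells as displayed text
        try (Workbook wb = WorkbookFactory.create(new File("D:/data/Orders.xlsx"));
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("D:/data/Orders.csv")))) {
            for (Row row : wb.getSheetAt(0)) {
                StringBuilder line = new StringBuilder();
                for (Cell cell : row) {
                    if (line.length() > 0) line.append(',');
                    line.append(fmt.formatCellValue(cell));
                }
                out.println(line);
            }
        }
    }
}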

2. dataFrame Generic Function Libraries

This type of tool simulates Python Pandas: it provides a generic data type similar to the dataFrame, docks with various data sources downward, and offers a functional interface upward. Tablesaw, Joinery, Morpheus, Datavec, Paleo, Guava, and many others belong to this type. With Pandas as the successful precursor, however, this type of Java data computation layer tool has failed to attract many followers and is generally underdeveloped. Tablesaw (whose latest version is 0.38.2) is the most developed among them.

As a free, open-source Java class library, Tablesaw requires only the deployment of the core jar and its dependent packages. The basic code written with it is simple, too. To read in and print out all records of Orders.txt, for example, we have the following code:

Java
package tablesawTest;

import tech.tablesaw.api.Table;
import tech.tablesaw.io.csv.CsvReadOptions;

public class TableTest {
    public static void main(String[] args) throws Exception {
        CsvReadOptions options_Orders = CsvReadOptions.builder("D:\\data\\Orders.txt").separator('\t').build();
        Table Orders = Table.read().usingOptions(options_Orders);
        System.out.println(Orders.print());
    }
}

Besides CSV files, Tablesaw also supports other types of sources, including RDBMS, Excel, JSON, and HTML; basically, it can satisfy daily analytic work. The functional style brings a satisfactory debugging experience, supporting breakpoints and step-over, step-into, and step-out actions, but it weakens hot deployment because any change to an algorithm requires recompilation.

Structured data computations are always our focus. Below are several basic computations:

Java
// Conditional query
Table query = Orders.where(
    Orders.stringColumn("Client").containsString("S").and(
        Orders.doubleColumn("Amount").isGreaterThan(1000).and(
            Orders.doubleColumn("Amount").isLessThanOrEqualTo(3000)
        )
    )
);
// Sorting
Table sort = Orders.sortOn("Client", "-Amount");
// Grouping & aggregation ("sum" is statically imported from tech.tablesaw.aggregate.AggregateFunctions)
Table summary = Orders.summarize("Amount", sum).by(Orders.dateColumn("OrderDate").year());
// Join
CsvReadOptions options_Employees = CsvReadOptions.builder("D:\\data\\Employees.txt").separator('\t').build();
Table Employees = Table.read().usingOptions(options_Employees);
Table joined = Orders.joinOn("SellerId").inner(Employees, true, "EId");
joined.retainColumns("OrderID", "Client", "SellerId", "Amount", "OrderDate", "Name", "Gender", "Dept");

For computations like sorting, grouping, and aggregation that involve only a few factors, the gap between Tablesaw and SQL is very small. When a computation involves many factors, such as the conditional query and the join above, Tablesaw produces far more complicated code than SQL does. The reason is that Java, not intrinsically intended for structured data computing, must trade complicated code for computing ability equal to SQL's. Fortunately, the language supports lambda syntax, which allows relatively intuitive code (though still not as intuitive as SQL). The conditional query can be rewritten as follows:

Java
// "and" here is statically imported from tech.tablesaw.api.QuerySupport
Table query2 = Orders.where(
    and(x -> x.stringColumn("Client").containsString("S"),
        and(
            x -> x.doubleColumn("Amount").isGreaterThan(1000),
            x -> x.doubleColumn("Amount").isLessThanOrEqualTo(3000)
        )
    )
);


3. Lightweight Databases

Lightweight databases are small, easy to deploy, and convenient to integrate. Typical products include SQLite, Derby, and HSQLDB. Let's take a special look at SQLite.

SQLite is free and open source. Only one jar is needed for integration and deployment. It is not a stand-alone system and usually runs within a host application, such as a Java program, through its API. When it works against a database file on external storage, it supports a relatively large data volume; when it runs in memory, it performs better but supports a smaller data volume. SQLite follows the JDBC standard. For example, to print out all records of the Orders table in the external database file ex1, we have the following code:

Java
package sqliteTest;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class Test {
    public static void main(String[] args) throws Exception {
        Connection connection = DriverManager.getConnection("jdbc:sqlite:/data/ex1");
        Statement statement = connection.createStatement();
        ResultSet results = statement.executeQuery("select * from Orders");
        printResult(results);
        if (connection != null) connection.close();
    }

    public static void printResult(ResultSet rs) throws Exception {
        int colCount = rs.getMetaData().getColumnCount();
        System.out.println();
        for (int i = 1; i <= colCount; i++) {
            System.out.print(rs.getMetaData().getColumnName(i) + "\t");
        }
        System.out.println();
        while (rs.next()) {
            for (int i = 1; i <= colCount; i++) {
                System.out.print(rs.getString(i) + "\t");
            }
            System.out.println();
        }
    }
}


SQLite is demanding about source data: data of any type must first be loaded into the database for computation (certain lightweight databases can directly use CSV/Excel files as database tables). There are two ways to load the CSV file Orders.txt. The first uses Java to read the CSV file, compose each row into an insert statement, and execute the statements; this requires a lot of code, as the sketch after the command list shows. The second is manual: download the official maintenance tool sqlite3.exe and execute the following commands at the command line:

sqlite3.exe ex1
.headers on
.separator "\t"
.import D:\\data\\Orders.txt Orders
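
For comparison, the first, hard-coded approach might look like the following sketch (the table definition and the tab-separated layout of Orders.txt are assumed; real code would also need type handling and error checks):

Java
import java.io.BufferedReader;
import java.nio.file.*;
import java.sql.*;

// A minimal sketch of loading Orders.txt into SQLite by hand: read the CSV,
// compose insert statements, and execute them in one batched transaction.
// Types, quoting, and error handling are simplified for illustration.
public class LoadOrders {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:/data/ex1");
             BufferedReader in = Files.newBufferedReader(Paths.get("D:/data/Orders.txt"))) {
            conn.createStatement().execute(
                "create table if not exists Orders(OrderID int, Client text, SellerId int, Amount real, OrderDate text)");
            PreparedStatement ps = conn.prepareStatement("insert into Orders values(?,?,?,?,?)");
            conn.setAutoCommit(false);              // batch the inserts inside one transaction
            String line = in.readLine();            // skip the header row
            while ((line = in.readLine()) != null) {
                String[] f = line.split("\t");
                ps.setInt(1, Integer.parseInt(f[0]));
                ps.setString(2, f[1]);
                ps.setInt(3, Integer.parseInt(f[2]));
                ps.setDouble(4, Double.parseDouble(f[3]));
                ps.setString(5, f[4]);
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();
        }
    }
}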


Though it loses points on source data support, SQLite wins them back on structured data computation: it handles all the basic algorithms conveniently and performs well.

Java
// Conditional query
results = statement.executeQuery("SELECT * FROM Orders where Amount>1000 and Amount<=3000 and Client like '%S%'");
// Sorting
results = statement.executeQuery("SELECT * FROM Orders order by Client, Amount desc");
// Grouping & aggregation
results = statement.executeQuery("SELECT strftime('%Y',OrderDate) y, sum(Amount) s FROM Orders group by strftime('%Y',OrderDate)");
// Join operation
results = statement.executeQuery("SELECT OrderID,Client,SellerId,Amount,OrderDate,Name,Gender,Dept from Orders inner join Employees on Orders.SellerId=Employees.EId");

In a word, SQLite has the intrinsic characteristics of a SQL engine: good hot-deployment support and a poor debugging experience.

4. Professional Structured Data Computation Languages

These languages are designed for computing structured data, aiming at high algorithm-expression efficiency and execution performance, with support for diverse data sources, convenient integration, hot deployment of algorithms, and algorithm debugging. There are not many of them; Scala, esProc, and linq4j are the most used. As linq4j is not yet mature, we look at only the other two.

Scala was designed as a general-purpose programming language, yet it has become famous for its structured data computation abilities: it wraps Spark's distributed computing ability as well as a local computing ability that requires no framework and no service connection. Scala runs on the JVM, so it can be easily integrated with a Java program. To read in and print out all records of Orders.txt, for instance, we first write the following TestScala.scala script:

Scala
package test

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DateType

object TestScala {
  def readCsv(): DataFrame = {
    // Only the jar is needed for a local execution, without configuring or starting Spark
    val spark = SparkSession.builder()
      .master("local")
      .appName("example")
      .getOrCreate()
    val Orders = spark.read.option("header", "true").option("sep", "\t")
      // Auto-parse fields into the right data types before performing computations
      .option("inferSchema", "true")
      .csv("D:/data/Orders.txt")
      // Cast to the date type for subsequent date handling
      .withColumn("OrderDate", col("OrderDate").cast(DateType))
    Orders
  }
}


Compile the above Scala script into an executable program (a Java class), and it can then be called from a piece of Java code, as shown below:

Java
package test;

import org.apache.spark.sql.Dataset;

public class HelloJava {
    public static void main(String[] args) {
        Dataset ds = TestScala.readCsv();
        ds.show();
    }
}


Scala code for handling basic structured data computations is fairly simple:

Scala
// asc, desc, year, and sum come from org.apache.spark.sql.functions
// Conditional query
val condition = Orders.where("Amount>1000 and Amount<=3000 and Client like '%S%'")
// Sorting
val orderBy = Orders.sort(asc("Client"), desc("Amount"))
// Grouping & aggregation
val groupBy = Orders.groupBy(year(Orders("OrderDate"))).agg(sum("Amount"))
// Join operations
val Employees = spark.read.option("header", "true").option("sep", "\t")
  .option("inferSchema", "true")
  .csv("D:/data/Employees.txt")
val join = Orders.join(Employees, Orders("SellerId") === Employees("EId"), "inner")
  .select("OrderID", "Client", "SellerId", "Amount", "OrderDate", "Name", "Gender", "Dept")
  // Record positions change after the join; sorting keeps the order consistent with the other tools' result sets
  .orderBy("SellerId")


Scala supports running the same code on a rich variety of data sources. It can be regarded as an improved Java, so it also boasts an excellent debugging experience. But, as a compiled language, Scala is hard to hot-deploy.

There is a tool that outperforms Scala: esProc, which is targeted at structured data computation. esProc offers a JDBC interface so it can be conveniently integrated into a Java program. For instance, below is the code for reading in and printing out all records of Orders.txt:

Java
package Test;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class test1 {
    public static void main(String[] args) throws Exception {
        Class.forName("com.esproc.jdbc.InternalDriver");
        Connection connection = DriverManager.getConnection("jdbc:esproc:local://");
        Statement statement = connection.createStatement();
        String str = "=file(\"D:/data/Orders.txt\").import@t()";
        ResultSet result = statement.executeQuery(str);
        printResult(result);
        if (connection != null) {
            connection.close();
        }
    }

    public static void printResult(ResultSet rs) throws Exception {
        int colCount = rs.getMetaData().getColumnCount();
        System.out.println();
        for (int i = 1; i <= colCount; i++) {
            System.out.print(rs.getMetaData().getColumnName(i) + "\t");
        }
        System.out.println();
        while (rs.next()) {
            for (int i = 1; i <= colCount; i++) {
                System.out.print(rs.getString(i) + "\t");
            }
            System.out.println();
        }
    }
}


The esProc code for handling basic structured data computations is simple and easy to understand:

Java
// Conditional query
str = "=T(\"D:/data/Orders.txt\").select(Amount>1000 && Amount<=3000 && like(Client,\"*S*\"))";
// Sorting
str = "=T(\"D:/data/Orders.txt\").sort(Client,-Amount)";
// Grouping & aggregation
str = "=T(\"D:/data/Orders.txt\").groups(year(OrderDate);sum(Amount))";
// Join operations
str = "=join(T(\"D:/data/Orders.txt\"):O,SellerId; T(\"D:/data/Employees.txt\"):E,EId).new(O.OrderID,O.Client,O.SellerId,O.Amount,O.OrderDate,E.Name,E.Gender,E.Dept)";


esProc also provides corresponding SQL syntax for skilled SQL programmers. The above algorithm of handling grouping and aggregation can be rewritten as this:

str="$SELECT year(OrderDate),sum(Amount) from Orders.txt group by year(OrderDate)"


An esProc algorithm can be stored as a script file, too, which further reduces the code coupling. Here is an example:

Duty.xlsx stores daily duty records. Usually, one person is on duty for several continuous workdays and then hands the job over to the next person. Based on this table, we want to get the detailed duty data for each person in turn. Below is the data structure:

The source table (Duty.xlsx):

[Image: the source table]

The processing result:

[Image: the processing result]

The following esProc script file (con_days.dfx) implements the above algorithm:

[Image: the con_days.dfx script]

Then the esProc script file can be called from the block of Java code below:

Java
…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection = DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
ResultSet result = statement.executeQuery("call con_days()");
…

Though the above algorithm contains complex computing logic that is hard to express even in Scala or a lightweight database, esProc implements it easily with its richer related functions and agile syntax.

Placing an algorithm outside of the main application also allows programmers to edit and debug code in a specialized IDE. This helps implement algorithms with extremely complex logic. Take the filtering of a running total as an example:

The database table sales stores customers' sales amounts; the main fields are client and amount. The task is to find the first N biggest customers whose cumulative sales amount makes up at least half of the total amount, and sort them in descending order by amount.
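To make the task concrete, here is a rough, hand-coded Java sketch of the same logic (the per-client totals are assumed to be already fetched into memory, and the sample values are placeholders); it hints at why expressing such logic in a specialized language is attractive:

Java
import java.util.*;

// A minimal sketch of the running-total filter in plain Java. One aggregated
// record per client is assumed; a real version would read from the database.
public class TopHalfCustomers {
    record Sale(String client, double amount) {}

    public static void main(String[] args) {
        List<Sale> sales = new ArrayList<>(List.of(        // placeholder data
            new Sale("A", 500), new Sale("B", 300), new Sale("C", 150), new Sale("D", 50)));
        // Sort in descending order by amount
        sales.sort(Comparator.comparingDouble(Sale::amount).reversed());
        double total = sales.stream().mapToDouble(Sale::amount).sum();
        // Accumulate until the running total reaches half of the grand total
        double running = 0;
        List<Sale> top = new ArrayList<>();
        for (Sale s : sales) {
            top.add(s);
            running += s.amount();
            if (running >= total / 2) break;
        }
        top.forEach(s -> System.out.println(s.client() + "\t" + s.amount()));
    }
}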

We can use the esProc script below to achieve the algorithm and then call it in Java:

[Image: the esProc script for the running-total filter]

esProc has the most powerful structured data computation capability of the tools examined here. Apart from this, it also excels in many other aspects, including code debugging, data source support, big data computing ability, and parallel processing support, which are explained in other essays.


Opinions expressed by DZone contributors are their own.
