
How to Boost Java's Computing Capability over Massive Data


Programmers usually resort to a database to perform massive data computation in Java. However, a database is unavailable or inconvenient in some scenarios, and in those cases Java's native capability is the key to achieving the goal. For example:

  • The data source is not a database but a text file, log, Excel spreadsheet, or XML file.
  • Cross-database and cross-data-source computing is hard to implement with a database alone.
  • A customized JavaBean data source needs to be implemented for a reporting tool; most reporting tools, such as BIRT, JasperReports, and Crystal Reports, support this kind of interface.

Java lacks structured data computing functions, and its massive data computing capability is not powerful, so such computations are difficult and inefficient to perform, and the code is lengthy and cumbersome. Consider one of the simplest grouping and summarizing tasks: retrieve the sales records from Excel and compute the sales volume for each salesperson. In SQL, this can be written in one statement: select salesman, sum(value) from sales group by salesman. By comparison, the equivalent Java code looks like this:

  // Requires: import java.util.ArrayList; import java.util.Collections; import java.util.Comparator;
  // Sort the sales records by salesman (and by ID within the same salesman)
  Comparator<salesRecord> comparator = new Comparator<salesRecord>() {
      public int compare(salesRecord s1, salesRecord s2) {
          if (!s1.salesman.equals(s2.salesman)) {
              return s1.salesman.compareTo(s2.salesman);
          } else {
              return s1.ID.compareTo(s2.ID);
          }
      }
  };
  Collections.sort(sales, comparator);

  // Walk the sorted list and accumulate the total value for each salesman
  ArrayList<resultRecord> result = new ArrayList<resultRecord>();
  salesRecord standard = sales.get(0);
  float sumValue = standard.value;
  for (int i = 1; i < sales.size(); i++) {
      salesRecord rd = sales.get(i);
      if (rd.salesman.equals(standard.salesman)) {
          sumValue = sumValue + rd.value;
      } else {
          // Salesman changed: emit the finished group and start a new one
          result.add(new resultRecord(standard.salesman, sumValue));
          standard = rd;
          sumValue = standard.value;
      }
  }
  result.add(new resultRecord(standard.salesman, sumValue));
  return result;

The routine steps, such as retrieving the file from Excel and persisting the data in an ArrayList, are omitted from the code above. As you can imagine, the complete code would be much longer and harder to understand.
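For reference, the omitted Excel-reading step might look roughly like the sketch below. It assumes Apache POI is used to read the workbook (the article does not name a library), that the sheet columns are ID, salesman, and value, and that salesRecord has a matching constructor; none of these details come from the article.

  // Sketch of the omitted step: reading the sales records from an Excel workbook.
  // Assumes Apache POI (poi-ooxml) is on the classpath and that salesRecord has
  // an (ID, salesman, value) constructor; neither is stated in the article.
  import java.io.File;
  import java.util.ArrayList;
  import org.apache.poi.ss.usermodel.*;

  public class SalesReader {
      public static ArrayList<salesRecord> readSales(String fileName) throws Exception {
          ArrayList<salesRecord> sales = new ArrayList<salesRecord>();
          try (Workbook workbook = WorkbookFactory.create(new File(fileName))) {
              Sheet sheet = workbook.getSheetAt(0);
              for (Row row : sheet) {
                  if (row.getRowNum() == 0) {
                      continue; // skip the header row
                  }
                  String id = row.getCell(0).getStringCellValue();       // assumes text ID cells
                  String salesman = row.getCell(1).getStringCellValue();
                  float value = (float) row.getCell(2).getNumericCellValue();
                  sales.add(new salesRecord(id, salesman, value));
              }
          }
          return sales;
      }
  }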

How can the massive data computing ability of Java be improved? This is a common problem for many Java programmers. Here are three solutions: use an embedded database, load the data into an external database for computing, or compute directly in esProc.

Using an embedded database

Common embedded databases include HSQLDB and Derby. They come with basic structured data computing capability out of the box and work well with Java. Programmers only need to read the text file, log, or other data into the embedded database to carry out the computation over massive data, as the sketch below shows.
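Here is a minimal sketch of this approach using an in-memory HSQLDB database. The database name, table layout, and sample rows are illustrative only; in practice the inserted rows would be parsed from the text file, log, or Excel source.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class EmbeddedDbGroupBy {
      public static void main(String[] args) throws Exception {
          // In-memory HSQLDB instance; "SA" with an empty password is the default account
          try (Connection conn = DriverManager.getConnection("jdbc:hsqldb:mem:salesdb", "SA", "");
               Statement st = conn.createStatement()) {
              st.execute("CREATE TABLE sales (id INT, salesman VARCHAR(50), value DECIMAL(12,2))");
              st.execute("INSERT INTO sales VALUES (1, 'Alice', 120.50)");
              st.execute("INSERT INTO sales VALUES (2, 'Bob', 80.00)");
              st.execute("INSERT INTO sales VALUES (3, 'Alice', 60.00)");
              // The computation itself becomes a single SQL statement
              try (ResultSet rs = st.executeQuery(
                      "SELECT salesman, SUM(value) AS total FROM sales GROUP BY salesman")) {
                  while (rs.next()) {
                      System.out.println(rs.getString("salesman") + ": " + rs.getBigDecimal("total"));
                  }
              }
          }
      }
  }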

Most embedded databases are open-source projects, so being free of charge is one of their advantages. Another advantage is that their computing syntax is still SQL, which keeps the learning cost low and makes them easy to use.

The drawbacks of an embedded database are quite obvious:

1. Its massive data computing capability is incomplete. Most embedded databases provide only the basic SQL algorithms and lack window functions and other extended functions, and their support for stored procedures is relatively poor. So an embedded database is fit only for basic, simple computing, not for complex massive data computing.
2. Performance is not great. An embedded database is fit only for small data volumes; its performance cannot compare with a commercial database, and the supported data volume is extremely limited.
3. The applicable scenarios are limited. Computations across databases or data sources demand high performance and sometimes require complex data conversion, and such scenarios are not a good fit for an embedded database.

Loading to an external database

For cross-database or cross-data-source computation, programmers can load data from the various sources into a single database using the loading functions provided by an external database. With this method, both the computation over non-database data sources and the cross-source computation can be achieved conveniently.

Generally speaking, the loading functions of a database can retrieve files directly, and some can read data from Excel. Databases are usually capable of cleaning and converting, so dirty and messy data can be arranged and made easy to use. They are also usually capable of scheduling and dispatching, so data from various databases and data sources can be extracted to the local database in advance. In this way, high performance for cross-database and cross-data-source computing can be achieved easily.
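As an illustration of the loading step, the following sketch cleans tab-delimited log lines and batch-inserts them into an assumed staging table over plain JDBC. The connection URL, credentials, file name, and table layout are all placeholders, not details from the article.

  import java.io.BufferedReader;
  import java.math.BigDecimal;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;

  public class LoadToStagingDb {
      public static void main(String[] args) throws Exception {
          try (Connection conn = DriverManager.getConnection(
                       "jdbc:postgresql://localhost:5432/staging", "etl_user", "etl_password");
               BufferedReader reader = Files.newBufferedReader(Paths.get("sales.log"))) {
              conn.setAutoCommit(false);
              try (PreparedStatement ps = conn.prepareStatement(
                      "INSERT INTO sales (id, salesman, value) VALUES (?, ?, ?)")) {
                  String line;
                  while ((line = reader.readLine()) != null) {
                      String[] fields = line.split("\t");
                      if (fields.length < 3) {
                          continue; // simple cleaning step: skip malformed records
                      }
                      ps.setInt(1, Integer.parseInt(fields[0]));
                      ps.setString(2, fields[1]);
                      ps.setBigDecimal(3, new BigDecimal(fields[2]));
                      ps.addBatch();
                  }
                  ps.executeBatch(); // load the whole file in one batch
              }
              conn.commit();
          }
      }
  }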

Loading to an external database also has drawbacks:

1. High cost. On one hand, preparing an additional database for temporary data storage and computing increases the cost. On the other hand, the large quantity of raw data makes loading it into the intermediate database quite time-consuming; the I/O is an order of magnitude slower than directly copying the files.
2. Great development workload. The script for cleaning and converting is generally written in a procedural language that lacks structured data computing capability, and programmers frequently encounter business logic that is strongly tied to the industry; simple cleaning and converting scripts are hard pressed to handle such situations.
3. Poor real-time behavior. To compute over real-time data, a trigger must be established in the source database and a new timestamp created in the source table. For the sake of safety and performance, such operations are often not allowed on a production database or any database dedicated to a business application, so real-time behavior and user experience have to be compromised.

Direct computing in esProc

esProc is a programming language for structured data, providing rich built-in objects and library functions for massive data computing, including ranking, sorting, link relative ratio (period-over-period), and year-on-year comparison. For these computations, esProc is more convenient than SQL. With its native engine for mixed computing, esProc can access various relational and NoSQL databases as well as text files, logs, and Excel files, so users can implement cross-database or cross-data-source computing without storing the data temporarily. The performance of esProc is also good; with multi-threaded computing, its performance is close to, or even better than, Oracle's. esProc provides a standard JDBC interface for external applications, so Java programs and reporting tools can integrate with it seamlessly and get the same computing result as when accessing a database.

Now, let's take a look at several esProc code snippets.

1.  Grouping and summarizing: sales.groups(salesman;sum(value))

2.  Top 10 best sellers in each department: products.group(department).(~.top(quantity;10))

3.  Computing across local file and Oracle database:

4.  Computing across Oracle and MSSQL databases:


5.  The computational result can be exported via JDBC for use by the report or Java application (a hedged Java-side sketch of such a call appears after this list).

6.  By combining various functions, we can take advantage of step-by-step computing to simplify the procedure for reaching the computing goal.
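To illustrate point 5 above, a Java-side call might look roughly like the following. The driver class name, JDBC URL, and the script name groupSales are assumptions based on esProc's documented JDBC integration, not code from the article.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class EsprocJdbcDemo {
      public static void main(String[] args) throws Exception {
          // Driver class, URL, and call syntax are assumptions; check the esProc docs for exact values
          Class.forName("com.esproc.jdbc.InternalDriver");
          try (Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
               Statement st = conn.createStatement();
               // "groupSales" is a hypothetical esProc script returning salesman/total rows
               ResultSet rs = st.executeQuery("call groupSales()")) {
              while (rs.next()) {
                  // Read the result just as if it came from an ordinary database
                  System.out.println(rs.getString(1) + ": " + rs.getDouble(2));
              }
          }
      }
  }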

The advantages of esProc are its rich library functions, superior support for combined and cross-database computing, sound performance, and convenience when used with reports, so it can significantly elevate Java's computing ability over massive data. Despite these advantages, esProc has its drawbacks. First, although esProc syntax is relatively easy to understand, it takes some time to grasp; after all, it is a different language from SQL and Java. Second, invoking esProc affects performance somewhat, because the Java main application and the invoked esProc run in the same process. Adopting esProc's parallel computing can minimize the impact on performance, but the code becomes slightly more complex.
