
Sorting Large CSV Files Using SQLite


Sorting large CSV files can be a pain. In this post, we take a look at how to ease that pain with help from SQLite and JDBC.


“When you’re at the end of your rope, tie a knot and hold on.”
― Franklin D. Roosevelt

Review

We are trying to sort a large CSV file. The file contains a couple of million rows – not large by “big data” standards, but large enough to cause problems when working with it.

In a previous article, we saw how Excel failed with a memory error, unable to load the CSV file.

An attempt to load the data into an ArrayList failed with an OutOfMemoryException.

Using SQLite

The first database option we tried for loading the CSV was H2.

Another option for a “desktop” database is SQLite — a very popular embedded database in the C/C++ world that is reputed to be very fast. Let’s use that to load our data and manipulate it in our application.

SQLite supports direct loading of a CSV file through its command-line shell, using the “.mode csv” and “.import” commands. However, it did not quite work right for me, so I am going to write code using JDBC to import and sort the data instead.

In my testing, SQLite’s CSV import also had trouble with quoted fields, multi-line fields, and other idiosyncrasies found in some CSV files.
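For reference, the direct import mentioned above looks like the following. This is a minimal sketch that assumes the sqlite3 command-line shell is installed; the file and table names are made up for illustration.

```shell
# Create a small CSV to import (illustrative data).
rm -f salaries.db
printf 'yearID,teamID,salary\n2016,LAD,33000000\n2016,BOS,30000000\n' > salaries.csv

# .mode csv makes the shell parse CSV; .import loads the file into a table.
# Because the table does not exist yet, the header row supplies the column names.
sqlite3 salaries.db ".mode csv" ".import salaries.csv sample" "select count(*) from sample;"
```

If the import chokes on your file’s quoting or line endings, the JDBC route below sidesteps the problem entirely.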

SQLite With JDBC

SQLite is a database library suitable for embedding within C/C++ programs, and there is no “official” JDBC driver for it. However, some nice folks have developed one, which we are going to use. It appears to be in active development (as of March 2017) as far as I can see.

Download the JDBC driver from here. Or use this POM entry.

<dependency>
  <groupId>org.xerial</groupId>
  <artifactId>sqlite-jdbc</artifactId>
  <version>3.16.1</version>
</dependency>
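With the driver on the classpath, opening a connection only needs a JDBC URL of the form `jdbc:sqlite:<file>`. A minimal sketch (the file name is illustrative; recent driver versions register themselves with `DriverManager`, so no explicit `Class.forName` call is needed):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class SqliteConnect {
    // Build the JDBC URL for a SQLite database file; ":memory:" gives an in-memory DB.
    static String jdbcUrl(String dbFile) {
        return "jdbc:sqlite:" + dbFile;
    }

    // Open a connection to the given database file, creating it if absent.
    static Connection connect(String dbFile) throws SQLException {
        return DriverManager.getConnection(jdbcUrl(dbFile));
    }
}
```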


Create Table From Header

Generally, the first line of a CSV file contains column headers. We are going to use this header line to create a table.

One nice thing about SQLite is that it uses dynamic typing, so you can skip declaring column types when creating a table. The following is a valid SQLite table definition:

create table sample(
  yearID,
  teamID,
  lgID,
  playerID,
  salary
)


With this advantage, the code for creating a table is simple.

StringBuilder sbuf = new StringBuilder();
sbuf.append("create table sample(\n");
for (int i = 0, n = fieldNames.size(); i < n; i++) {
    sbuf.append(" ").append(fieldNames.get(i));
    sbuf.append(i < n - 1 ? ",\n" : "\n");
}
sbuf.append(")");
con.createStatement().execute(sbuf.toString());


Transactions

In SQLite, by default, each statement is executed within the scope of its own transaction. That is, each statement can be assumed to be enclosed within a begin and commit block as follows:

begin transaction
sql statement ...
commit transaction


While this increases the reliability of the code and the data, it makes things very, very slow for us. So what we are going to do is commit only once every N statements.

if ( COMMIT_EVERY > 0 && uncommitted >= COMMIT_EVERY ) {
    con.createStatement().execute("commit transaction");
    con.createStatement().execute("begin transaction");
    uncommitted = 0;
}


With this change, I saw a 10x improvement in performance.

Inserting the Data

We use a PreparedStatement to insert the data into the table.

StringBuilder sbuf = new StringBuilder();
StringBuilder vbuf = new StringBuilder();
sbuf.append("insert into sample(");
vbuf.append("values(");
for (int i = 0, n = fieldNames.size(); i < n; i++) {
    sbuf.append(fieldNames.get(i));
    if ( i < n - 1 ) {
        sbuf.append(", ");
        vbuf.append("?, ");
    } else {
        sbuf.append(")\n");
        vbuf.append("?)\n");
    }
}
sbuf.append(vbuf);
return con.prepareStatement(sbuf.toString());
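Putting the pieces together, feeding rows into the prepared statement while committing every N statements might look like the following. This is a sketch, not the article’s exact code: the `COMMIT_EVERY` value and the `List<String[]>` row representation are assumptions.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class CsvLoader {
    static final int COMMIT_EVERY = 10_000; // assumed batch size

    // True when enough statements have accumulated to warrant a commit.
    static boolean commitDue(int uncommitted) {
        return COMMIT_EVERY > 0 && uncommitted >= COMMIT_EVERY;
    }

    // Insert every row, committing in batches to avoid per-statement transactions.
    static void insertAll(Connection con, PreparedStatement ins, List<String[]> rows)
            throws SQLException {
        con.createStatement().execute("begin transaction");
        int uncommitted = 0;
        for (String[] row : rows) {
            for (int i = 0; i < row.length; i++) {
                ins.setString(i + 1, row[i]); // JDBC parameters are 1-based
            }
            ins.executeUpdate();
            if (commitDue(++uncommitted)) {
                con.createStatement().execute("commit transaction");
                con.createStatement().execute("begin transaction");
                uncommitted = 0;
            }
        }
        con.createStatement().execute("commit transaction"); // flush the tail
    }
}
```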


Sort and Extract to CSV

After the data is inserted, we can sort it with the required ORDER BY clause and extract it to CSV using a ResultSet. Note that because the columns were created without types and the values were bound as strings, numeric columns such as yearID and salary must be CAST so that the sort is numeric rather than lexicographic.

String sortClause = "order by CAST(yearID as Integer) DESC, CAST(Salary as Integer) DESC, teamID";
String selectStmt = "select * from sample\n" + sortClause;
ResultSet rset = con.createStatement().executeQuery(selectStmt);


Main loop for extracting and exporting the data:

while (rset.next()) {
    rowNum++;
    for (int i = 0, n = fieldNames.size(); i < n; i++) {
        out.print(encodeValue(rset.getString(i + 1)));
        if ( i < n - 1 ) out.print(',');
        else out.println();
    }
}


Encoding a field value to CSV:

static private Pattern rxquote = Pattern.compile("\"");

static private String encodeValue(String value) {
    boolean needQuotes = value.indexOf(',') != -1 || value.indexOf('"') != -1
            || value.indexOf('\n') != -1 || value.indexOf('\r') != -1;
    Matcher m = rxquote.matcher(value);
    if (m.find()) {
        needQuotes = true;
        value = m.replaceAll("\"\"");
    }
    return needQuotes ? "\"" + value + "\"" : value;
}
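As a quick sanity check, here is the same encoding logic wrapped in a standalone class, with a few example inputs showing how commas and embedded quotes are handled:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsvEncode {
    static Pattern rxquote = Pattern.compile("\"");

    // Same logic as encodeValue above: double embedded quotes, and wrap the
    // value in quotes if it contains a comma, quote, or line break.
    static String encodeValue(String value) {
        boolean needQuotes = value.indexOf(',') != -1 || value.indexOf('"') != -1
                || value.indexOf('\n') != -1 || value.indexOf('\r') != -1;
        Matcher m = rxquote.matcher(value);
        if (m.find()) {
            needQuotes = true;
            value = m.replaceAll("\"\"");
        }
        return needQuotes ? "\"" + value + "\"" : value;
    }
}
```

For example, `LAD` passes through unchanged, `a,b` becomes `"a,b"`, and `say "hi"` becomes `"say ""hi"""`.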


And here is the nicely sorted output data. It is sorted by yearID, salary and teamID.

yearID,teamID,lgID,playerID,salary
2016,LAD,NL,kershcl01,33000000
2016,ARI,NL,greinza01,31799030
2016,BOS,AL,priceda01,30000000
2016,DET,AL,cabremi01,28000000
...


For comparison, the same data file that took 11 minutes with the H2 database took about 5 minutes with SQLite. Draw from that what you will, but keep in mind that there are probably too few data points to draw a firm conclusion. The only thing worse than generalizing from one case is generalizing from no cases at all.

Summary

And that completes our exercise in sorting a large CSV file. Using a database seems to be the easiest way out, and we have a choice between the H2 database and SQLite. There is not much to choose between the two.



Published at DZone with permission of

