DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

SBOMs are essential to circumventing software supply chain attacks, and they provide visibility into various software components.

Related

  • An Overview of Creating Databases With Python
  • Dynamic Sorting in MS SQL Server
  • Convert CSV Files to Excel in .NET Framework
  • How To Recover SQL Server FILESTREAM Enabled Database

Trending

  • A Keycloak Example: Building My First MCP Server Tools With Quarkus
  • Building an AI Nutrition Coach With OpenAI, Gradio, and gTTS
  • Indexed Views in SQL Server: A Production DBA's Complete Guide
  • 10 Predictions Shaping the Future of Web Data Extraction Services
  1. DZone
  2. Data Engineering
  3. Databases
  4. Sorting Large CSV Files Using SQLite

Sorting Large CSV Files Using SQLite

Sorting large CSV files can be a pain. In this post we take a look at how to ease that pain a little with a little help from SQLite and JDBC.

By 
Jay Sridhar user avatar
Jay Sridhar
·
Mar. 28, 17 · Tutorial
Likes (3)
Comment
Save
Tweet
Share
7.4K Views

Join the DZone community and get the full member experience.

Join For Free

“When you’re at the end of your rope, tie a knot and hold on.”
― Theodore Roosevelt

Review

We are trying to sort a large CSV file. The file contains a couple of million rows – not large by “big-data” standards, but large enough to face problems working with it.

In a previous article, we saw how Excel failed with a memory error, unable to load the CSV file.

An attempt to load the data into an ArrayList failed with an OutOfMemoryException.

Using SQLite

The first database option we tried for loading the CSV was H2.

Another option for a “desktop” database is SQLite — a very popular embedded database in the C/C++ world that is reputed to be very fast. Let’s use that to load our data and manipulate it in our application.

SQLite supports direct loading of a CSV file using the “mode” and “import” commands. However, it did not quite work right for me. So I am going to be writing code using JDBC to import and sort the data.

SQLite’s CSV import also cannot handle quoted fields, multi-line fields, and other idiosyncrasies found in some CSV files.

SQLite With JDBC

SQLite is a database library suitable for embedding within C/C++ programs. There is no “official” JDBC driver for it. However, some nice folks have developed one for it, which we are going to use. It seems to be in active development (as of Mar 2017) as far as I can see.

Download the JDBC driver from here. Or use this POM entry.

<dependency>
  <groupId>org.xerial</groupId>
  <artifactId>sqlite-jdbc</artifactId>
  <version>3.16.1</version>
</dependency>


Create Table From Header

Generally, the first line of a CSV file contains column headers. We are going to use this header line to create a table.

One nice thing about SQLite is that you can skip the column type when creating a table. SQLite uses dynamic typing and you can skip defining the type of a column when creating a table. The following is a valid SQLite table definition:

create table sample(
  yearID,
  teamID,
  lgID,
  playerID,
  salary
)


With this advantage, the code for creating a table is simple.

StringBuffer sbuf = new StringBuffer();
sbuf.append("create table sample(\n");
for (int i = 0, n = fieldNames.size() ; i < n ; i++) {
    String name = fieldNames.get(i);
    sbuf.append(" ").append(name);
    if ( i < n - 1 ) sbuf.append(",\n");
    else sbuf.append("\n");
}
sbuf.append(")");
con.createStatement().execute(sbuf.toString());


Transactions

In SQLite, by default, each statement is executed within the scope of a transaction. That is each statement can be assumed to be enclosed within a begin and commit block as follows:

begin transaction
sql statement ...
commit transaction


While this increases the reliability of the code and the data, it makes things very, very slow for us. So what we are going to do is to commit every N number of transactions.

if ( COMMIT_EVERY > 0 && uncommitted >= COMMIT_EVERY ) {
    con.createStatement().execute("commit transaction"); con.createStatement().execute("begin transaction"); uncommitted = 0; }


With this change, I saw a 10x improvement in performance.

Inserting the Data

We use a PreparedStatement to insert the data into the table.

StringBuffer sbuf = new StringBuffer();
StringBuffer vbuf = new StringBuffer();
sbuf.append("insert into sample(");
vbuf.append("values(");
for (int i = 0, n = fieldNames.size() ; i < n ; i++) {
    String name = fieldNames.get(i);
    sbuf.append(name);
    if ( i < n - 1 ) {
        sbuf.append(", ");
        vbuf.append("?, ");
    } else {
        vbuf.append("?)\n");
        sbuf.append(")\n");
    }
}
sbuf.append(vbuf.toString());
return con.prepareStatement(sbuf.toString());


Sort and Extract to CSV

After the data is inserted, we can sort the data with the required clause(s) and extract the data to CSV using a ResultSet.

String sortClause = "order by CAST(yearID as Integer) DESC, CAST(Salary as Integer) DESC, teamID";
String selectStmt = "select * from sample\n" + sortClause;
ResultSet rset = con.createStatement().executeQuery(selectStmt);


Main loop for extracting and exporting the data:

while (rset.next()) {
    rowNum++;
    for (int i = 0, n = fieldNames.size() ; i < n ; i++) {
    out.print(encodeValue(rset.getString(i+1)));
    if ( i < n - 1 ) out.print(',');
    else out.println();
    }
}


Encoding a field value to CSV:

static private Pattern rxquote = Pattern.compile("\"");

static private String encodeValue(String value) {
    boolean needQuotes = false;
    if ( value.indexOf(',') != -1 || value.indexOf('"') != -1 ||
     value.indexOf('\n') != -1 || value.indexOf('\r') != -1 )
    needQuotes = true;
    Matcher m = rxquote.matcher(value);
    if ( m.find() ) { needQuotes = true; value = m.replaceAll("\"\""); }
    if ( needQuotes ) return "\"" + value + "\"";
    else return value;
}


And here is the nicely sorted output data. It is sorted by yearID, salary and teamID.

yearID,teamID,lgID,playerID,salary
2016,LAD,NL,kershcl01,33000000
2016,ARI,NL,greinza01,31799030
2016,BOS,AL,priceda01,30000000
2016,DET,AL,cabremi01,28000000
...


For comparison purposes, the same data file which took 11 mins with an H2 database took about 5 mins with SQLite. Draw from that what you will, but keep in mind that there are probably too few data points to draw a concrete conclusion. The only thing worse than generalizing from one case is generalizing from no cases at all.

Summary

And that completes our exercise in sorting a large CSV file. Using a database seems to be the easiest way out. We have a choice of using H2 database or SQLite. Not much to choose between either,

SQLite CSV Data file Database Sorting

Published at DZone with permission of Jay Sridhar, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • An Overview of Creating Databases With Python
  • Dynamic Sorting in MS SQL Server
  • Convert CSV Files to Excel in .NET Framework
  • How To Recover SQL Server FILESTREAM Enabled Database

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • [email protected]

Let's be friends: