DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations
  1. DZone
  2. Data Engineering
  3. Data
  4. The Programmer's Way to Convert Excel to CSV (With UTF-8)

The Programmer's Way to Convert Excel to CSV (With UTF-8)

Excel provides a way to save spreadsheets as a CSV file, but it seems to fail at handling UTF-8 characters. See how a real programmer deals with the problem.

Jay Sridhar user avatar by
Jay Sridhar
CORE ·
Mar. 20, 17 · Tutorial
Like (21)
Save
Tweet
Share
93.29K Views

Join the DZone community and get the full member experience.

Join For Free

CSV stands for Comma-Separated-Values and is a very common format used for exchanging data between diverse applications. While the Excel Spreadsheet file format is complex (since it has to accommodate a lot more!), CSV is a simpler format representing just tabular data.

In this article, we show you a way of exporting the data from an Excel spreadsheet to CSV. We use the Apache POI library for the purpose.

But Why?

Excel directly provides for exporting CSV data using the Save As functionality in the File menu. While it gets the job done in most cases, it leaves something to be desired.

Specifically, if your spreadsheet contains Unicode data, you are out of luck. It appears that Excel uses Windows default character set Windows-1252 (or cp-1252) to perform the export. This character set is a very limited set and cannot represent characters from most foreign languages. This leaves your CSV output from Excel severely broken if it contains such characters.

The proper solution is to use Unicode (specifically UTF-8) for encoding the CSV file so all your data is preserved.

It is indeed a surprise that the Excel team has not yet figured out in 2017 that people may have Unicode data that need to be exported properly to CSV.

Maybe the Excel developers have not read this article by Joel Spolsky stressing the need for programmers to be aware of Unicode. Ironically, Joel was (until 1995?) the project manager for the Excel team. Joel penned the article in 2003, long after leaving the Excel group.

Using Apache POI

We use Apache POI to read the Excel spreadsheet. Building the program requires this dependency to be declared in pom.xml (assuming Maven for the build) as follows:

<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi-ooxml</artifactId>
  <version>${poi.version}</version>
</dependency>


Take 1 – The Naive Approach

The first cut of the approach to convert Excel spreadsheet data to CSV is shown in the block below.

Workbook wb = new XSSFWorkbook(new File(xlsxFile));
DataFormatter formatter = new DataFormatter();
PrintStream out = new PrintStream(new FileOutputStream(csvFile),
                                  true, "UTF-8");
for (Sheet sheet : wb) {
    for (Row row : sheet) {
        boolean firstCell = true;
        for (Cell cell : row) {
            if ( ! firstCell ) out.print(',');
            String text = formatter.formatCellValue(cell);
            out.print(text);
            firstCell = false;
        }
        out.println();
    }
}


Use the UTF-8 BOM

There are several problems with the code above. While the code above correctly outputs UTF-8 and encodes characters properly, Excel cannot load the generated CSV file. The reason is that Excel needs the Byte-Order-Marker to indicate that the file is encoded in UTF-8. With that modification, the code is now:

PrintStream out = new PrintStream(new FileOutputStream(csvFile),
                                  true, "UTF-8");
byte[] bom = {(byte)0xEF, (byte)0xBB, (byte)0xBF};
out.write(bom);
for (Sheet sheet : wb) {
...


Exporting Formulas

When Excel exports CSV, it evaluates all the formulas and writes out the data. In the code below, we have an option of exporting the formula itself (starting with a "=") to CSV, or evaluating the formula and exporting the result.

Workbook wb = new XSSFWorkbook(new File(xlsxFile));
FormulaEvaluator fe = wb.getCreationHelper().createFormulaEvaluator();
...
if ( fe != null ) cell = fe.evaluateInCell(cell);
String value = formatter.formatCellValue(cell);
if ( cell.getCellTypeEnum() == CellType.FORMULA ) value = "=" + value;
...


Escape Quotes and Commas

When exporting a field value, it is necessary to properly escape certain characters such as double quotes, commas, and line separators. This is handled as follows:

static private Pattern rxquote = Pattern.compile("\"");

static private String encodeValue(String value) {
    boolean needQuotes = false;
    if ( value.indexOf(',') != -1 || value.indexOf('"') != -1 ||
         value.indexOf('\n') != -1 || value.indexOf('\r') != -1 )
        needQuotes = true;
    Matcher m = rxquote.matcher(value);
    if ( m.find() ) needQuotes = true; value = m.replaceAll("\"\"");
    if ( needQuotes ) return "\"" + value + "\"";
    else return value;
}


Including Empty Rows and Cells

When a formula is exported as-is, it needs the cell references to remain intact. We achieve this by exporting empty rows and cells to CSV to maintain these references.

To ensure that empty rows and cells are also output to CSV, we use row and cell numbers explicitly since the for-each loop shown above skips empty rows and cells.

Sheet sheet = wb.getSheetAt(sheetNo);
for (int r = 0, rn = sheet.getLastRowNum() ; r <= rn ; r++) {
    Row row = sheet.getRow(r);
    if ( row == null ) { out.println(','); continue; }
    for (int c = 0, cn = row.getLastCellNum() ; c < cn ; c++) {
        Cell cell = row.getCell(c,Row.MissingCellPolicy.RETURN_BLANK_AS_NULL);
    }
}


Export Specified Sheet Only

If the spreadsheet contains multiple sheets, a sheet number must be explicitly specified for export.

Sheet sheet = wb.getSheetAt(sheetNo);


The Final Cut

The core exporting program segment with all these updates now looks like this.

Workbook wb = new XSSFWorkbook(new File(xlsxFile));
int sheetNo = Integer.parseInt(args[index++]);
FormulaEvaluator fe = null;
if ( index < args.length ) {
    fe = wb.getCreationHelper().createFormulaEvaluator();
}

DataFormatter formatter = new DataFormatter();
PrintStream out = new PrintStream(new FileOutputStream(csvFile),
                                  true, "UTF-8");
byte[] bom = {(byte)0xEF, (byte)0xBB, (byte)0xBF};
out.write(bom);
{
    Sheet sheet = wb.getSheetAt(sheetNo);
    for (int r = 0, rn = sheet.getLastRowNum() ; r <= rn ; r++) {
        Row row = sheet.getRow(r);
        if ( row == null ) { out.println(','); continue; }
        boolean firstCell = true;
        for (int c = 0, cn = row.getLastCellNum() ; c < cn ; c++) {
            Cell cell = row.getCell(c, Row.MissingCellPolicy.RETURN_BLANK_AS_NULL);
            if ( ! firstCell ) out.print(',');
            if ( cell != null ) {
                if ( fe != null ) cell = fe.evaluateInCell(cell);
                String value = formatter.formatCellValue(cell);
                if ( cell.getCellTypeEnum() == CellType.FORMULA ) {
                    value = "=" + value;
                }
                out.print(encodeValue(value));
            }
            firstCell = false;
        }
        out.println();
    }
}


Summary

Exporting data in Excel spreadsheet to CSV is quite simple; it is also provided directly in Excel. However exporting Unicode data within the spreadsheet is not done correctly by Excel. In this article, we covered some issues arising from this process including exporting formulas, maintaining cell references and proper escaping of data.

UTF-8 file IO Data (computing) Programmer (hardware) Convert (command) Apache POI

Published at DZone with permission of Jay Sridhar, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Steel Threads Are a Technique That Will Make You a Better Engineer
  • Container Security: Don't Let Your Guard Down
  • The Path From APIs to Containers
  • 11 Observability Tools You Should Know

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: