DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • How to Quarantine a Malicious File in Java
  • How To Compare DOCX Documents in Java
  • AI in Java: Building a ChatGPT Clone With Spring Boot and LangChain
  • Keep Your Application Secrets Secret

Trending

  • Solving the Mystery: Why Java RSS Grows in Docker on M1 Macs
  • The Cost of Knowing: When Observability Becomes the Outage
  • The 7 Pillars of Meeting Design: Transforming Expensive Conversations into Decision Assets
  • Genkit Middleware: Intercept, Extend, and Harden your Gen AI Pipelines
  1. DZone
  2. Coding
  3. Java
  4. Efficiently Reading Large Excel Files (Over 1 Million Rows) Using the Open-Source Sjxlsx Java API

Efficiently Reading Large Excel Files (Over 1 Million Rows) Using the Open-Source Sjxlsx Java API

The primary objective of this article is to prevent the "out of memory-Java heap issue" when reading large Excel files using the open-source "sjxlsx" library.

By 
Mahendran Chinnaiah user avatar
Mahendran Chinnaiah
·
Oct. 09, 25 · Analysis
Likes (7)
Comment
Save
Tweet
Share
6.7K Views

Join the DZone community and get the full member experience.

Join For Free

If you are a developer, regardless of the technologies you use, at some point, you will face the challenge of handling large Excel files.

Most commonly, you will see the "out of memory" error when handling these files.

Here is the requirement:

You get a workbook with 100 spreadsheets, and each of them has 1 million records with a minimum of 100 columns.

For your reference, here's how many rows each Excel version can support:

  • 1,048,576 rows – Excel 365, 2013, 2010, 2007
  • 65,536 rows – Excel 2003 and earlier versions

Cost and Hosting Server Limitations

  1. Third-party license purchase is not feasible. Open source is allowed.
  2. Deployment should be on average-tier cloud environments (4 GB disk/4 GB RAM or less) or on-premises Windows Server (16/32 GB RAM) already loaded with 10+ running applications.

How would you handle this situation?

Let's explore the current digital market. Are there any open-source solutions available to meet this requirement?

I can’t use “interop” when your cloud runs on "Linux OS," and it also causes intermittent hang issues in multi-threaded applications.  

Most of them will say that the popular library is the "Apache POI" streaming library.

Let’s proceed with a practical implementation using Apache POI.

To get the large dataset (Excel) file, we have N number of websites, such as awesome-public-datasets, Google Dataset Search, and World Bank Data. 

Personally, I frequently visit Kaggle. Let's download the container Dataset as a 1.9GB CSV file, and then save it as an Excel file (.XSLX), which became around a 600 MB XLSX file.

The sheet includes 87 columns and 1048576 rows.

Before we run the solution, let me capture my laptop's resource usage so that we can compare later.

Resource Usage

  • CPU: 3%
  • Memory: 54%

CPU: 3%, memory: 54%


IDE

I am using Eclipse IDE (v: 4.36.0).

Apache POI

I am using the latest Apache POI 5.2.5 and other dependency JARS available from Apache POI.

Source Code

Here, I am just attempting to read the sheet name from the workbook, not the rows.

Java
 
public static void main(String[] args) throws Exception {
		String filePath = "C:\\POC\\Containers_Dataset.xlsx";
		ReadExcelbyApachePOI(filePath);
	}

/*List out sheet name*/
	static void ReadExcelbyApachePOI(String filePath) throws Exception {
		try (OPCPackage opcPackage = OPCPackage.open(new File(filePath), PackageAccess.READ)) {
			XSSFWorkbook workbook = new XSSFWorkbook(opcPackage);
			XSSFReader xssfReader = new XSSFReader(opcPackage);
			StylesTable styles = xssfReader.getStylesTable();
			XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader
	                .getSheetsData();
			while (iter.hasNext()) {
				InputStream stream = iter.next();
	            String sheetName = iter.getSheetName();
	            System.out.println("Sheetname: " + sheetName);
			}
			
		} catch (IOException e) {
			e.printStackTrace();
		}
	}


Result

Encountering "Java heap space - Out of memory error."

Java
 
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space


Resource Usage

  • CPU: 58%
  • Memory: 94%

CPU: 58%, memory: 94%

Let us try with the sjxlsx open-source API.

It’s an open-source Java API Source Code; this project was first published on Google Code, and it seems unmaintained. In GitHub, it is available for anyone who can download and update the changes for their needs.

"Memory" and "speed" are primary goals of this API.

It provides two modes, which are "classic" and "stream."

  1. Classic – All records of the sheet will be loaded.
  2. Stream – Read one record at a time.

Microsoft XLSX uses XML+zip (OOXML) to store the data. So, to be fast, "sjxlsx" uses STAX for XML input and output.

Source Code

Java
 
public static void main(String[] args) throws Exception {
		String filePath = "C:\\POC\\Containers_Dataset.xlsx";
		SimpleXLSXWorkbook workbook = newWorkbook(filePath);
		testLoadALL(workbook);
	}
private static SimpleXLSXWorkbook newWorkbook(String filePath) {
		return new SimpleXLSXWorkbook(new File(filePath));
	}
	/*Read Each Row*/
	private static void printRow(int rowPos, com.incesoft.tools.excel.xlsx.Cell[] row) {
		int cellPos = 0;
		for (com.incesoft.tools.excel.xlsx.Cell cell : row) {
			System.out.println(com.incesoft.tools.excel.xlsx.Sheet.getCellId(rowPos, cellPos) + "=" + cell.getValue());
			cellPos++;
		}
	}
	
	/*Load & Read workbook
	 * false => Read each row
	 * true  => Load all rows
	 */
	public static void testLoadALL(SimpleXLSXWorkbook workbook) {
		com.incesoft.tools.excel.xlsx.Sheet sheetToRead = workbook.getSheet(0,false);
		SheetRowReader rowreader = sheetToRead.newReader();
		int rowPos = 0;
		while (rowreader != null) {
			com.incesoft.tools.excel.xlsx.Cell[] row = rowreader.readRow();
			printRow(rowPos, row);
			rowPos++;

		}

	}


Resource Usage

  • CPU: 3% (No changes)
  • RAM: 61% (7% usage: 1 GB usage)

CPU: 3%, RAM: 61%


Output

Java
 
BN1048519=40298
BO1048519=0
BP1048519=0
BQ1048519=0
BR1048519=0
BS1048519=610
BT1048519=0
BU1048519=1
BV1048519=64240
BW1048519=923
BX1048519=158
BY1048519=32
BZ1048519=0
CA1048519=0
CB1048519=0
CC1048519=0
CD1048519=0
CE1048519=0
CF1048519=0
CG1048519=0
CH1048519=10000206
CI1048519=0
A1048520=100.64.0.2-10.16.0.9-35919-8080-6
B1048520=100.64.0.2
C1048520=35919
D1048520=10.16.0.9
E1048520=8080
F1048520=6
G1048520=45266.83932053241
H1048520=41626
I1048520=6
J1048520=5
K1048520=515
L1048520=357
M1048520=515
N1048520=0
O1048520=85.8333333333333
P1048520=210.24786958888899
Q1048520=357
R1048520=0
S1048520=71.400000000000006
T1048520=159.65525359348399
U1048520=20948.445682986501
V1048520=264.25791572574798
W1048520=4162.5999999999904
X1048520=12728.124713056101
Y1048520=40374
Z1048520=9
AA1048520=41626
AB1048520=8325.2000000000007
AC1048520=17922.528077813098
AD1048520=40374
AE1048520=29
AF1048520=41594
AG1048520=10398.5
AH1048520=20011.5685292282
AI1048520=40406
AJ1048520=26
AK1048520=1
AL1048520=1
AM1048520=0
AN1048520=0
AO1048520=0
AP1048520=0
AQ1048520=200
AR1048520=168
AS1048520=144.14068130495301
AT1048520=120.11723442079401
AU1048520=0
AV1048520=515
AW1048520=79.272727272727295
AX1048520=179.87445116474399
AY1048520=32354.8181818181
AZ1048520=2
BA1048520=2
BB1048520=0
BC1048520=2
BD1048520=10
BE1048520=0
BF1048520=0
BG1048520=0
BH1048520=0.83333333333333304
BI1048520=79.272727272727195
BJ1048520=85.8333333333333
BK1048520=71.400000000000006
BL1048520=0
BM1048520=0
BN1048520=0
BO1048520=0
BP1048520=0
BQ1048520=0
BR1048520=0
BS1048520=46
BT1048520=0
BU1048520=32
BV1048520=64240
BW1048520=502
BX1048520=1
BY1048520=32
BZ1048520=0
CA1048520=0
CB1048520=0
CC1048520=0
CD1048520=0
CE1048520=0
CF1048520=0
CG1048520=0
CH1048520=41626
CI1048520=0
A1048521=100.64.0.2-10.16.0.9-9479-8080-6
B1048521=100.64.0.2
C1048521=9479
D1048521=10.16.0.9
E1048521=8080
F1048521=6
G1048521=45266.835683206016
H1048521=111205
I1048521=6
J1048521=5
K1048521=537
L1048521=357


Performance Results

Woo-hoo! I can read the records.

The winner is "sjxlsx." It has been proven that this library consumes less than 1 GB of memory, compared to the higher usage by Apache POI.

It is an excellent open-source Java API for reading large Excel datasets.

Additional Features

In addition to that, it supports writing Excel and offers rich data formatting in sheets.

Java
 
public static void main(String[] args) throws Exception {
		// WRITE - we take WRITE as a special kind of MODIFY
		 SimpleXLSXWorkbook workbook = newWorkbook();
        OutputStream output = ExcelOutput("write");
        WriteExcel(workbook, output);
        output.close();
	}
 private static SimpleXLSXWorkbook newWorkbook() {
         return new SimpleXLSXWorkbook(new File("/sample.xlsx"));
 }
	 private static OutputStream ExcelOutput(String suffix) throws Exception {
         return new BufferedOutputStream(new FileOutputStream("/sample_"
                         + suffix + ".xlsx"));
 }
	 
	public static void WriteExcel(SimpleXLSXWorkbook workbook,
            OutputStream outputStream) throws Exception {
		com.incesoft.tools.excel.xlsx.Sheet sheet = workbook.getSheet(0);
		WriteRecords(workbook, sheet, 0);
    workbook.commit(outputStream);
}
	 static public void WriteRecords(SimpleXLSXWorkbook wb, com.incesoft.tools.excel.xlsx.Sheet sheet,
             int rowOffset) {
     int columnCount = 10;
     int rowCount = 10;
     int offset = rowOffset;
     for (int r = offset; r < offset + rowCount; r++) {
             int modfiedRowLength = sheet.getModfiedRowLength();
             for (int c = 0; c < columnCount; c++) {
                     sheet.modify(modfiedRowLength, c, r + "," + c, null);
             }
     }
}


Writing a Richly Styled Row

Java
 
public static void WriteRichStyleRow(SimpleXLSXWorkbook wb, com.incesoft.tools.excel.xlsx.Sheet sheet)
              throws Exception {
      Font font = wb.createFont();
      font.setColor("FFFF0000");
      Fill fill = wb.createFill();
      fill.setFgColor("FF00FF00");
      CellStyle style = wb.createStyle(font, fill);

      RichText richText = wb.createRichText();
      richText.setText("test_text");
      Font font2 = wb.createFont();
      font2.setColor("FFFF0000");
      richText.applyFont(font2, 1, 2);
      sheet.modify(0, 0, (String) null, style);
      sheet.modify(1, 0, richText, null);
}


Summary

Ultimately, "sjxlsx" provides an efficient, lightweight way to read large Excel files without infrastructure headaches.

Thank you!

API Open source Java (programming language) Row (database)

Opinions expressed by DZone contributors are their own.

Related

  • How to Quarantine a Malicious File in Java
  • How To Compare DOCX Documents in Java
  • AI in Java: Building a ChatGPT Clone With Spring Boot and LangChain
  • Keep Your Application Secrets Secret

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook