{{announcement.body}}
{{announcement.title}}

Anatomy of an Excel File and Large Excel File Operation With the POI Library

DZone 's Guide to

Anatomy of an Excel File and Large Excel File Operation With the POI Library

The main goal of this article is to read large Excel files with the streaming API from the POI library.

· Big Data Zone ·
Free Resource

Introduction

Excel is a popular spreadsheet for different types of tasks related to business, research, analytics, and tabular tasks. The Apache Foundation released the POI library December 30, 2001. It is a pure Java library which supports Microsoft office formats. We can read small files easily with POI with some simple interfaces but we need some more advance support to read large files. The main goal of this article is to read large Excel files with the streaming API from the POI library. 

Architecture of an Excel OOXML-Based File (.xlsx file)

Excel files actually combine some XML files in a zip format (for .xlsx format). As we can see, an Excel file looks like this.

Excel file

We will rename the file extension to .zip from .xlsx. Now we can see that the extracted folder only contains some XML files which are organixed in a different folder.

Tree structure of the extracted Excel file

Here we can see all the files are written in an XML structure which we will read and write with our POI library.

So, in summary, we can say that the whole Excel file as workbook. This workbook contains some worksheets. SharedStrings.xml contains shared strings of the whole workbook and the styles contain the styles of each cell of every sheet.

POI Library

The POI library in Java is used to read and write Excel files in Java. Actually excel file has two types, One is Binary format and another is OOXML based. Apache POI contains different types of components to read and write Excel files.

  • HSSF (Horrible SpreadSheet Format) - Used to read and write Excel files from the 97-2003(.xls) format. All interfaces and implentetions ar eavailable in org.apache.poi.hssf.usermodel package in the POI library. 

  • XSSF (XML SpreadSheet Format) - Used to read and write Excel files from the 2007+ versions (.xlsx) of Excel files. All interfaces and implentetions are available in org.apache.poi.xssf.eventusermodel package in the POI library.

  • SXSSF (Streaming extension of XSSF) - Used to read .xlsx format Excel files with the stream concept and popular for reading large files efficiently. 

Basic Interfaces and Their Implementations

Workbook is an interface which is inherited from the Closeable interface and the Sheet Iterable. A Closeable gives support to the data stream so that it can be closed. The close method is invoked to release resources that the object is holding (such as open files).

public interface Workbook extends Closeable, Iterable<Sheet> 

Workbook has several implementations:

1. HSSFWorkbook

public final class HSSFWorkbook extends POIDocument implements Workbook

2. XSSFWorkbook 

public class XSSFWorkbook extends POIXMLDocument implements Workbook 

3. SXSSFWorkbook

public class SXSSFWorkbook implements Workbook 

Sheet is an interface which is used to write and read row, column, and cells properties.

public interface Sheet extends Iterable<Row>

Sheet has several implementations:

1. HSSFSheet

public final class HSSFSheet implements Sheet

2. XSSFSheet

public class XSSFSheet extends POIXMLDocumentPart implements Sheet

3. SXSSFSheet

public class SXSSFSheet implements Sheet

4. XSSFDialogsheet

public class XSSFDialogsheet extends XSSFSheet implements Sheet 

Row

public interface Row extends Iterable<Cell> 

 Row has also several implementation like XSSFRow, HSSFRow, SXSSFRow

Cell

public interface Cell 

Cell also has several implementations, like XSSFCell, HSSFCell, SXSSFCell.

So, in summery, we can say that workbooks contain sevaral sheets, sheets contain many rows, and every row contains some cells.

Exploring File Writting and Reading With XSSF Interface

Here I am presenting some sample code to write and read Excel files using the XSSF interface. You can find the code for the Resident and ExcelFileHeaderConstants files here:

1. Resident.java

2. ExcelFileHeaderConstants.java

  public static void main(String[] args) throws Exception{

        final String FILE_NAME = "Resident_INFO_File.xlsx";

        writeFile(FILE_NAME);
        System.out.println("Write task done!!!\n Now Reading...........");
        readFile(FILE_NAME);
    }

    private static void writeFile(String FILE_NAME) throws IOException {
        List<Resident> residentList = new ArrayList<>();
        for(int i = 0;i<10;i++)
        {
            Resident resident = new Resident();
            resident.setName("Name"+i);
            resident.setMobile("0142485824" + i);
            resident.setAddress("ABC" + i);
            resident.setEmail("count" + i + "@gmail.com");
            resident.setNationalId("8687678687687" + i);
            resident.setAge(i+30);
            residentList.add(resident);
        }

        List<String> residentHeaders = new ArrayList<>();
        residentHeaders.add(ExcelFileHeaderConstants.NAME);
        residentHeaders.add(ExcelFileHeaderConstants.ADDRESS);
        residentHeaders.add(ExcelFileHeaderConstants.MOBILE);
        residentHeaders.add(ExcelFileHeaderConstants.EMAIL);
        residentHeaders.add(ExcelFileHeaderConstants.AGE);
        residentHeaders.add(ExcelFileHeaderConstants.NATIONAL_ID);
        Workbook workbook = buildWorkbook(residentHeaders,"TEST_SHEET",residentList);
        FileOutputStream outputStream = new FileOutputStream(FILE_NAME);
        workbook.write(outputStream);
        outputStream.close();
        workbook.close();
    }

    private static Workbook buildWorkbook(List<String> headers, String sheetName, List<Resident> data) {
        //1. Workbook creation
        Workbook workbook = new XSSFWorkbook();
        Font font = workbook.createFont();
        font.setFontHeightInPoints((short) 10);
        font.setColor((short) Font.COLOR_NORMAL);
        XSSFCellStyle cellStyle = (XSSFCellStyle) workbook.createCellStyle();
        cellStyle.setFont(font);

        //2. Sheet creation
        Sheet sheet = workbook.createSheet();
        sheet.setColumnWidth((short) 0, (short) ((50 * 8) / ((double) 1 / 20)));
        sheet.setColumnWidth((short) 1, (short) ((50 * 8) / ((double) 1 / 20)));
        workbook.setSheetName(0,sheetName);

        Sheet refSheet = workbook.createSheet();
        refSheet.setColumnWidth((short) 0, (short) ((50 * 8) / ((double) 1 / 20)));
        refSheet.setColumnWidth((short) 1, (short) ((50 * 8) / ((double) 1 / 20)));
        workbook.setSheetName(1,"List_reference_hidden_sheet");
        // workbook.setSheetVisibility(1, SheetVisibility.VERY_HIDDEN);
        //Header creation

        String[] addresses = {"Delhi","Kolkata","Chennai","Asam","Udisha","Mumbai","Panjab","Shilong"};
        int count=0;
        Row headerRow = sheet.createRow(count);
        for (String header : headers) {
            Cell cell1 = headerRow.createCell(count++);
            cell1.setCellValue(header);
            cell1.setCellStyle(cellStyle);
        }

        Row headerRowRefSheet = refSheet.createRow(0);
        Cell rcell = headerRowRefSheet.createCell(0);
        rcell.setCellValue("Cities");
        rcell.setCellStyle(cellStyle);

        Row rrow = null;
        int rrownum=0;
        Cell celll =null;
        for (String address : addresses) {
            rrow = refSheet.createRow(rrownum++);
            celll = rrow.createCell(0);
            celll.setCellValue(address);
        }

        Name namedCell = workbook.createName();
        namedCell.setNameName("HiddenList");
        String reference = "List_reference_hidden_sheet!$A$2:$A$"+(addresses.length+1)+"";
        namedCell.setRefersToFormula(reference);

        int rownum = 1;
        Row row = null;
        Cell cell = null;
        count=0;
        for (Resident resident:data) {
            count=0;
            row = sheet.createRow(rownum++);
            cell = row.createCell(count++);
            cell.setCellValue(resident.getName());
            cell = row.createCell(count++);
            cell.setCellValue(resident.getAddress());
            cell = row.createCell(count++);
            cell.setCellValue(resident.getMobile());
            cell = row.createCell(count++);
            cell.setCellValue(resident.getEmail());
            cell = row.createCell(count++);
            cell.setCellValue(resident.getAge());
            cell = row.createCell(count++);
            cell.setCellValue(resident.getNationalId());
        }
        DataValidationHelper dvHelper = sheet.getDataValidationHelper();
        DataValidationConstraint dvConstraint = dvHelper.createFormulaListConstraint("HiddenList");
        CellRangeAddressList addressList = new CellRangeAddressList(1, addresses.length, count, count);
        DataValidation validation = dvHelper.createValidation(dvConstraint, addressList);
        sheet.addValidationData(validation);

        return workbook;
    }

    private static void readFile(String fileName) throws IOException {

            FileInputStream excelInputStream = new FileInputStream(new File(fileName));
            Workbook workbook = new XSSFWorkbook(excelInputStream);
            Sheet sheet = workbook.getSheetAt(0);
            Iterator < Row > rowItr = sheet.iterator();
            int rowNum = 0;
            while (rowItr.hasNext()) {
                Row row = rowItr.next();
                Iterator < Cell > cellItr = row.iterator();
                System.out.print(rowNum + ". ");
                while (cellItr.hasNext()) {
                    Cell cell = cellItr.next();
                    if (cell.getCellTypeEnum() == CellType.STRING) {
                        System.out.print(cell.getStringCellValue() + "\t");
                    } else if (cell.getCellTypeEnum() == CellType.NUMERIC) {
                        System.out.print(cell.getNumericCellValue() + "\t");
                    }
                }
                System.out.println();
                rowNum++;
            }
            workbook.close();
            excelInputStream.close();
    }

Yes this will work fine. But this not suitable for large files. Large files take up too much memory. So sometimes we also get an 'Out of Memory' error. Just update the number counter loops at line 12 in the previous code like this:

  /* Excel Max support 1048576 rows in .xlsx format*/
  for(int i = 0;i<1048570;i++)    {
  }

I am still getting the 'Out of Memory' error. My laptop configuration is Core i7 with 4GB of ram.

To overcome this error, we can use SXSSF which will flush rows after the defined row is created. So it keeps the memory clean.

Implementation of SXSSF-Based Interface to Read and Write Excel Files 

For streaming purposes, we have to change the type of workbook when creating workbook. 

private static Workbook buildWorkbook(List<String> headers, String sheetName, List<Resident> data) 
{    
  //1. Workbook creation    
  Workbook workbook = new SXSSFWorkbook(100);

  ...
}

So with this implementation we will never get an 'Out of Memory' error, because it always keeps 100 rows in memory. Others will be flushed imediately if the row number is greater than 100. Here low memory is being used. That is the main feature of SXSSF APIs. 

If we update this code with SXSSF then writting tasks will be completed successfully; but carefully look at the reading task to see if it's still in XSSF format. So its still getting error.

So lets go implement it with SXSSF event based memory efficient readding with SAX parser. Let's modify the read() method. Here the OPCPackage works as a container which will contain all parts of an Excel file like workbook, sheets, sharedStrings, etc.

private static void readFile(String fileName)
  throws IOException, OpenXML4JException, ParserConfigurationException, SAXException 
  {    
    File file = new File(fileName);  
    OPCPackage opcPackage = OPCPackage.open(file);   
    ParseExcelFile parsedFile = new ParseExcelFile(opcPackage,10,Resident.class);
    boolean isValid = parsedFile.process();   
    System.out.println("File is valid ? "+isValid);
  } 

ParseExcelwill initalize all event properties. The process method will read OPCPackage using XSSFReader. And then parse sharedStrings, styles and relationship data. Then it will iterate all sheet of workbook and call the processSheet method to process a prticular sheet. Then the processSheet method initializes some properties of SAXParser to parse sheets. When XMLReader calls the parse()  method, an event will be thrown to read the Excel sheet.

public class ParseExcelFile {

    private OPCPackage xlsxPackage;
    private int minColumns;
    private PrintStream output;
    private Class clazz;
    private SXSSFWorkbook outWorkbook;

    public ParseExcelFile(OPCPackage pkg, int minColumns, Class clazz) {
        this.xlsxPackage = pkg;
        this.minColumns = minColumns;
        this.clazz = clazz;
        this.outWorkbook = new SXSSFWorkbook();
    }

    public void processSheet(StylesTable styles,
                             ReadOnlySharedStringsTable strings,String sheetName, InputStream sheetInputStream)
            throws IOException, ParserConfigurationException, SAXException {
        SheetModel sheetModel = new SheetModel();
        sheetModel.setNumberOfColumns(0);
        sheetModel.setNumberOfRows(0);
        sheetModel.setWorkbook(outWorkbook);
        sheetModel.setCurSheetName(sheetName);
        sheetModel.setCurOutSheet(outWorkbook.createSheet(sheetName));
        long startTime = System.currentTimeMillis();
        InputSource sheetSource = new InputSource(sheetInputStream);
        SAXParserFactory saxFactory = SAXParserFactory.newInstance();
        SAXParser saxParser = saxFactory.newSAXParser();
        XMLReader sheetParser = saxParser.getXMLReader();
        ContentHandler handler = new XSSFSheetEventHandler(styles, strings,
                this.minColumns,  sheetModel);
        sheetParser.setContentHandler(handler);
        sheetParser.parse(sheetSource);
        sheetModel.setTimeToProcess(System.currentTimeMillis() - startTime);
        System.out.println("Number Of Rows : "+ sheetModel.getNumberOfRows());
        System.out.println("Time to process : "+ sheetModel.getTimeToProcess());
    }

    public boolean process() throws IOException, OpenXML4JException,
            ParserConfigurationException, SAXException {

        ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(
                this.xlsxPackage);
        XSSFReader xssfReader = new XSSFReader(this.xlsxPackage);

        StylesTable styles = xssfReader.getStylesTable();
        XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader
                .getSheetsData();
        while (iter.hasNext()) {
            InputStream stream = iter.next();
            String sheetName = iter.getSheetName();
            System.out.println("Processing Sheet : "+sheetName);
            processSheet(styles, strings,sheetName, stream);
            stream.close();
        }
        return true;
    }
}

And XSSFSheetEventHandler is here, which will handle all events during the reading of a sheet. Here, if one element of a sheet is going to start a read it will go to the startelement method and after completely reading it will go to the endelement method. For example, if events start to read a row it will go to the startelement method and after the last data cell of a row's reading is completed it will go the endelement method. The same applies for cells, columns, etc.

class XSSFSheetEventHandler extends DefaultHandler {
    private StylesTable stylesTable;
    private ReadOnlySharedStringsTable sharedStringsTable;
    private List<?> list = new ArrayList();
    private SheetModel sheetModel;
    private final int minColumnCount;
    private boolean vIsOpen;
    private String sqref;
    private XSSFDataTypes nextDataType;
    private short formatIndex;
    private String formatString;
    private final DataFormatter formatter;
    private int lastColumnNumber = -1;
    private StringBuffer value;
    private int dataType;

    public XSSFSheetEventHandler(StylesTable styles,
                            ReadOnlySharedStringsTable strings, int cols, SheetModel sheetModel) {
        this.stylesTable = styles;
        this.sharedStringsTable = strings;
        this.minColumnCount = cols;
        this.value = new StringBuffer();
        this.nextDataType = XSSFDataTypes.NUMBER;
        this.formatter = new DataFormatter();
        this.sheetModel = sheetModel;
    }

    public void startElement(String uri, String localName, String name,
                             Attributes attributes) throws SAXException {
        if ("inlineStr".equals(name) || "v".equals(name)) {
            vIsOpen = true;
            value.setLength(0);
        }
        else if ("c".equals(name)) {
            String r = attributes.getValue("r");
            this.nextDataType = XSSFDataTypes.NUMBER;
            this.formatIndex = -1;
            this.formatString = null;
            String cellType = attributes.getValue("t");
            String cellStyleStr = attributes.getValue("s");
            if ("b".equals(cellType))
                nextDataType = XSSFDataTypes.BOOL;
            else if ("e".equals(cellType))
                nextDataType = XSSFDataTypes.ERROR;
            else if ("inlineStr".equals(cellType))
                nextDataType = XSSFDataTypes.INLINESTR;
            else if ("s".equals(cellType))
                nextDataType = XSSFDataTypes.SSTINDEX;
            else if ("str".equals(cellType))
                nextDataType = XSSFDataTypes.FORMULA;
            else if (cellStyleStr != null) {
                int styleIndex = Integer.parseInt(cellStyleStr);
                XSSFCellStyle style = stylesTable.getStyleAt(styleIndex);
                this.formatIndex = style.getDataFormat();
                this.formatString = style.getDataFormatString();
                if (this.formatString == null)
                    this.formatString = BuiltinFormats
                            .getBuiltinFormat(this.formatIndex);
            }
        }else if ("dataValidation".equals(name)){

            String dvXmlType = attributes.getValue("type");
            sqref = attributes.getValue("sqref");
            if(dvXmlType.equals("list")){
                dataType = DataValidationConstraint.ValidationType.LIST;
                System.out.println("List Type data ");
            }
            System.out.println("At data validation start");

        }else if(name.startsWith("formula")){
            vIsOpen = true;
            value = new StringBuffer();
            System.out.println("At Formula start");
        }

    }

    public void endElement(String uri, String localName, String name)
            throws SAXException {
        String thisStr = null;

        if (vIsOpen) {
            if ("v".equals(name) || "is".equals(name)) {
                vIsOpen = false;
            } else if (name.startsWith("formula")) {
                char last = name.charAt(name.length() - 1);
                switch (last) {
                    case '1':
                        System.out.println(""+value.toString());
                        break;
                    case '2':
                        System.out.println(value.toString());
                        break;
                }
                vIsOpen = false;
                value = null;
            }
        }
        if ("v".equals(name)) {
            System.out.println("    "+value);
            switch (nextDataType) {
                case BOOL:
                    char first = value.charAt(0);
                    thisStr = first == '0' ? "FALSE" : "TRUE";
                    break;

                case ERROR:
                    thisStr = "\"ERROR:" + value.toString() + '"';
                    break;

                case FORMULA:
                    thisStr = '"' + value.toString() + '"';
                    break;

                case INLINESTR:
                    XSSFRichTextString rtsi = new XSSFRichTextString(value
                            .toString());
                    thisStr = '"' + rtsi.toString() + '"';
                    break;

                case SSTINDEX:
                    String sstIndex = value.toString();
                    try {
                        int idx = Integer.parseInt(sstIndex);
                        XSSFRichTextString rtss = new XSSFRichTextString(
                                sharedStringsTable.getEntryAt(idx));
                        thisStr = '"' + rtss.toString() + '"';
                    } catch (NumberFormatException ex) {

                    }
                    break;

                case NUMBER:
                    String n = value.toString();
                    if (this.formatString != null)
                        thisStr = formatter.formatRawCellContents(Double
                                        .parseDouble(n), this.formatIndex,
                                this.formatString);
                    else
                        thisStr = n;
                    break;

                default:
                    thisStr = "(TODO: Unexpected type: " + nextDataType + ")";
                    break;
            }
            if (lastColumnNumber == -1) {
                lastColumnNumber = 0;
            }

        } else if ("row".equals(name)) {
            sheetModel.setNumberOfRows(sheetModel.getNumberOfRows()+1);
            System.out.println();
        }

        else if ("dataValidation".equals(name)) {
            System.out.println("At datavalidation end");
        }
    }
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        if (vIsOpen)
            value.append(ch, start, length);
    }
}

I am usingSheetModel here to get a populated result from an event based reading of an Excel file. SheetModel's code is given here:

public class SheetModel {
    public long numberOfRows;
    public long numberOfColumns;
    public long timeToProcess;
    public SXSSFWorkbook workbook;
    public Sheet curOutSheet;
    public String curSheetName;

    public Sheet getCurOutSheet() {
        return curOutSheet;
    }

    public void setCurOutSheet(Sheet curOutSheet) {
        this.curOutSheet = curOutSheet;
    }



    public String getCurSheetName() {
        return curSheetName;
    }

    public void setCurSheetName(String curSheetName) {
        this.curSheetName = curSheetName;
    }



    public SXSSFWorkbook getWorkbook() {
        return workbook;
    }

    public void setWorkbook(SXSSFWorkbook workbook) {
        this.workbook = workbook;
    }



    public long getNumberOfRows() {
        return numberOfRows;
    }

    public void setNumberOfRows(long numberOfRows) {
        this.numberOfRows = numberOfRows;
    }

    public long getNumberOfColumns() {
        return numberOfColumns;
    }

    public void setNumberOfColumns(long numberOfColumns) {
        this.numberOfColumns = numberOfColumns;
    }

    public long getTimeToProcess() {
        return timeToProcess;
    }

    public void setTimeToProcess(long timeToProcess) {
        this.timeToProcess = timeToProcess;
    }
}

So with using SXSSF interfaces we can easily write and read large Excel files. This is a memory efficient procedure.

I will explain all properties of event properties in my next article. Then I will give the link here. All of my code from this this article (similar code) is available on GitHub at this link: PoiAdvanceExample.

Topics:
apache poi architecture, big data, data analysis and coding, excel, java tutorial

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}