Parsing OOXML Files With Ruby
In this post, we take a look at how one dev team tackled an issue they were having with parsing data in their application.
For testing ONLYOFFICE document editors, we wrote a docx, xlsx, and pptx parser in Ruby. It is free and available on both GitHub and RubyGems.org under the AGPL v3 license.
In this article, we will tell you how it is done and how it works.
No Existing Tool Suites
We could have used one of the existing tools, but several things stopped us:
- Most of them turned out to be abandoned by the developers.
- They are available as three separate libraries (for text documents, spreadsheets, and presentations) with different interfaces, which makes them extremely inconvenient to use.
- They support only basic functionality.
Parser for ONLYOFFICE QA
We needed a more powerful tool for testing the ONLYOFFICE editors because they:
- are being actively developed.
- allow for the application of complex formatting and styling to documents, spreadsheets, and presentations.
- have maximum compatibility with OOXML formats (docx, xlsx, pptx).
Parsing Complex Functionality
The ONLYOFFICE editors aim for maximum compatibility with MS Office formats, so the parser must have it as well. We developed it in accordance with the ECMA-376 standard, which comes in four parts and runs to around seven thousand pages.
As you might guess, we can't implement every little thing from the standard. But we have everything that is needed to test the advanced features of the ONLYOFFICE editors.
Apart from parsing basic features like paragraphs, tables, and shapes, our parser supports:
- Color schemes
- Paragraph and table styles
- Charts
- Columns
- Auto-shapes properties
- Lists
Why We Needed a Parser
After launching automated testing at ONLYOFFICE, we adopted a single concept of functional tests.
Let’s take a simple one:
1. Create a new document.
2. Type any text and apply a bold font to it.
3. Check that the font is bold.
ONLYOFFICE editors are written using HTML5 Canvas, so the text in our document is a picture. Verifying the weight of a font from a picture is not easy. Look, for example, at Arial Black. Would you be able to tell whether the text is bold or not?
That is why we've added an additional verification step to this scenario:
4. Download the file as docx and check that the text has the Bold parameter.
There are hundreds of such parameters. Yet none of the existing parsers support anything but parsing text, tables, and some other simple things. That's why we created our own library.
How the Parser Works
If you have ever tried to zip a .docx file, then you probably noticed that the compression ratio is very small. That’s because an OOXML file is just a set of archived XML files.
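One quick way to see this for yourself is to look at the first bytes of the file. The sketch below is only an illustration and not part of the parser: the helper name `zip_container?` is hypothetical, and it simply checks for the ZIP local file header signature (`PK\x03\x04`) that a non-empty ZIP archive starts with.

```ruby
# Hypothetical helper: true if the given bytes start with the ZIP
# local file header signature, "PK\x03\x04".
def zip_container?(bytes)
  bytes.byteslice(0, 4) == "PK\x03\x04".b
end

# Usage with a real file (the path is an assumption):
#   zip_container?(File.binread('document.docx', 4))
puts zip_container?("PK\x03\x04\x14\x00".b)   # prints "true"
puts zip_container?('<?xml version="1.0"?>')  # prints "false"
```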
For example, let's create a simple file with some text in the ONLYOFFICE document editor and download it as a docx.
Now we need to unpack the document as a zip archive to see its structure, which looks like this:
#tree
.
├── [Content_Types].xml
├── docProps
│   ├── app.xml
│   └── core.xml
├── _rels
└── word
    ├── document.xml
    ├── fontTable.xml
    ├── _rels
    │   └── document.xml.rels
    ├── settings.xml
    ├── styles.xml
    ├── theme
    │   ├── _rels
    │   │   └── theme1.xml.rels
    │   └── theme1.xml
    └── webSettings.xml
Let’s see what all those things are:
- [Content_Types].xml: the list of the MIME types of the parts in the document.
- app.xml: document metadata, app metadata, statistics.
- core.xml: latest modifications metadata.
- document.xml: Ohh, that's a bingo. This is the document content we were looking for!
- fontTable.xml: the document font table. Might be useful.
- document.xml.rels: the list of relationships between document.xml and the other parts it references; this list will come in handy for complex documents with pictures and graphics.
- settings.xml: as the name suggests, it contains various document settings like default zoom, number separators, etc.
- styles.xml, theme1.xml, and theme1.xml.rels: very detailed files with style and theme settings. The ability to recognize these is one of the key advantages of the product.
- webSettings.xml: the settings for the web version of the document. Not very popular functionality for docx.
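To give a feel for what a `.rels` part contains, here is a minimal sketch that maps relationship IDs to their targets. It is an illustration under assumptions, not the parser's actual code: it uses REXML (which ships with standard Ruby installs) so that it runs without extra gems, the method names are hypothetical, and the sample XML, including the `media/image1.png` target, is made up for the example.

```ruby
require 'rexml/document'

# Hypothetical, trimmed-down word/_rels/document.xml.rels content.
def sample_rels_xml
  <<~XML
    <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
      <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
      <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>
    </Relationships>
  XML
end

# Returns a Hash of relationship Id => Target for every Relationship element.
def relationship_targets(rels_xml)
  targets = {}
  REXML::Document.new(rels_xml).root.each_element do |rel|
    targets[rel.attributes['Id']] = rel.attributes['Target']
  end
  targets
end

p relationship_targets(sample_rels_xml)
```

When document.xml refers to a picture by an ID such as `rId2`, a lookup in this hash tells the parser which file inside the archive to open.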
So, if we are dealing with a simple docx like the one we created, we need document.xml only.
That’s simple XML. Luckily, it can easily be parsed using Ruby. We just take Nokogiri, get the DOM tree, and then check the OOXML standard (or use reverse engineering) to see where the parameter we need is hidden.
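As an illustration of that lookup, the sketch below checks the Bold parameter from the test scenario described earlier. It is not the parser's own code: it uses REXML instead of Nokogiri so that it runs without installing a gem, the method names and the sample XML are hypothetical, and it only looks at the bold flag set directly on a run (`w:rPr/w:b`), ignoring bold inherited from styles.

```ruby
require 'rexml/document'

# Namespace used by WordprocessingML elements (the "w:" prefix).
def word_namespace
  { 'w' => 'http://schemas.openxmlformats.org/wordprocessingml/2006/main' }
end

# Hypothetical, trimmed-down word/document.xml with one bold and one regular run.
def sample_document_xml
  <<~XML
    <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
      <w:body>
        <w:p>
          <w:r><w:rPr><w:b/></w:rPr><w:t>Bold text</w:t></w:r>
          <w:r><w:t>Regular text</w:t></w:r>
        </w:p>
      </w:body>
    </w:document>
  XML
end

# Returns [text, bold?] pairs for every run (w:r) in the document.
def runs_with_bold_flag(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//w:r', word_namespace).map do |run|
    text = REXML::XPath.match(run, 'w:t', word_namespace).map(&:text).join
    bold = REXML::XPath.first(run, 'w:rPr/w:b', word_namespace)
    # <w:b/> means bold; <w:b w:val="false"/> (or "0", "off") switches it off.
    is_bold = !bold.nil? && !%w[0 false off].include?(bold.attributes['w:val'])
    [text, is_bold]
  end
end

p runs_with_bold_flag(sample_document_xml)
# => [["Bold text", true], ["Regular text", false]]
```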
How the Parser Was Written
We started working on the tool with two mistakes which we corrected later. Here they are, and we hope that our experience will help you avoid such problems.
Big Files
So, we need to test three different editors for text documents, spreadsheets, and presentations. How can we organize the code for that purpose? It seems funny now, but, at first, we had four files (the fourth one for methods that are common to all three editors) that were 4,000 lines each. Fixing that took time. We carefully restructured the code, and the result is 200 files instead of four. Now it’s way easier to fix bugs.
No Tests
Us: have no tests for the parser (because we wrote it to test the editors, not to have one more thing to test).
Everything: crashes after we correct one typo while trying to fix a minor issue.
So, we had to create a `spec` folder, put two hundred files in there, and check a bunch of parameters in order to know that the commit we made before leaving work would not break the verification of that option in the third-level menu that no one remembers how to make work correctly.
We also had cool ideas. For example:
Using RuboCop
RuboCop is a Ruby static code analyzer and formatter, based on the community Ruby style guide. We love it as it helps us avoid mistakes, keep the code clean, and be sure that our latest commit didn’t make it worse (thanks to integration via overcommit).
Here’s what it looks like. Suppose that, after the hardest of days, you forget that local variable names in Ruby start with a lowercase letter (an uppercase first letter makes the name a constant) and try to commit something like this:
- path_to_zip_file = copy_file_and_rename_to_zip(path_to_file)
+ ZIP_file = copy_file_and_rename_to_zip(path_to_file)
An error will occur.
Analyze with RuboCop........................................[RuboCop] FAILED
Errors on modified lines:
ooxml_parser/lib/ooxml_parser/common_parser/parser.rb:8:7: E: dynamic constant assignment
You won’t be able to commit the code without additional manipulations. This is an excellent foolproof system.
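The message itself comes from Ruby's grammar: a name that starts with an uppercase letter is a constant, and a constant cannot be assigned inside a method body. A minimal sketch of that rule follows, assuming CRuby (it relies on `RubyVM::InstructionSequence`, which other Ruby implementations may not provide); `syntax_error_for` and `rename` are hypothetical names used only for this example.

```ruby
# Returns the parser's error message, or nil if the source compiles.
# The source is only compiled, never executed.
def syntax_error_for(source)
  RubyVM::InstructionSequence.compile(source)
  nil
rescue SyntaxError => e
  e.message
end

bad_source = <<~RUBY
  def rename(path_to_file)
    ZIP_file = path_to_file
    ZIP_file
  end
RUBY

good_source = <<~RUBY
  def rename(path_to_file)
    zip_file = path_to_file
    zip_file
  end
RUBY

puts syntax_error_for(bad_source)    # prints the parser's complaint
p syntax_error_for(good_source)      # prints nil
```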
Making Use of Our Document Base
By the time we created the parser, we already had a vast collection of huge and, in fact, quite strange docx, xlsx, and pptx files. We collected them during the early stages of ONLYOFFICE editors development to check the rendering of complex documents. Several years later we used them to test our parser. We detected a significant number of errors, and it took us several weeks to fix them. But it was worth it.
Now we have an awesome tool for parsing OOXML files. We use it for testing:
- ONLYOFFICE Developer Edition, document, spreadsheet, and presentation editors as an advanced component for other web apps.
- ONLYOFFICE Integration Edition online editors with connectors for popular business solutions – Nextcloud, ownCloud, SharePoint, Confluence, Alfresco.
- ONLYOFFICE Enterprise Edition - online editors integrated with a collaboration platform.
We hope that it will be as useful for your project as it is for ONLYOFFICE.