Parsing OOXML Files With Ruby

In this post, we take a look at how one dev team tackled an issue they were having with parsing data in their application.

by Nadya Knyazeva · Apr. 03, 19 · Tutorial

For testing ONLYOFFICE document editors, we wrote a docx, xlsx, and pptx parser in Ruby. It is free and available on both GitHub and RubyGems.org under the AGPL v3 license.

In this article, we will tell you how we built it and how it works.

No Existing Tool Suites

We could have used one of the existing tools. Here is what stopped us:

  • Most of them turned out to be abandoned by the developers.
  • They are available as three separate libraries (for text documents, spreadsheets, and presentations) with different interfaces, which makes them extremely inconvenient to use.
  • They support only basic functionality.

Parser for ONLYOFFICE QA

We needed a more powerful tool for testing ONLYOFFICE editors, because the editors:

  • are being actively developed.
  • allow for the application of complex formatting and styling to documents, spreadsheets, and presentations.
  • have maximum compatibility with OOXML formats (docx, xlsx, pptx).

Parsing Complex Functionality

The ONLYOFFICE editors aim for maximum compatibility with MS Office formats, so the parser must do the same. We developed it in accordance with the ECMA-376 standard, which comes in four parts and runs to around seven thousand pages.

As you can imagine, we can’t implement every little thing from the standard. But we have everything needed to test the advanced features of the ONLYOFFICE editors.

Apart from parsing basic features like paragraphs, tables, and shapes, our parser supports:

  • Color schemes
  • Paragraph and table styles
  • Charts
  • Columns
  • Auto-shape properties
  • Lists

Why We Needed a Parser

After launching automated testing at ONLYOFFICE, we settled on a single pattern for functional tests.

Let’s take a simple one:

1. Create a new document.

2. Type any text and apply a bold font to it.

3. Check that the font is bold.

ONLYOFFICE editors render documents with HTML5 Canvas, so the text in our document is effectively a picture. Judging the weight of a font from a picture is not easy. Take Arial Black, for example: can you tell whether the text is bold or not?


That is why we added one more verification step to this scenario:

4. Download the file as docx and check that the text has the Bold parameter.

There are hundreds of such parameters. Yet the existing parsers support little beyond text, tables, and a few other simple things. That's why we created our own library.

How the Parser Works

If you have ever tried to zip a .docx file, you probably noticed that it barely gets any smaller. That’s because an OOXML file is already a ZIP archive of XML files.
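
One quick way to check this is to look at the first bytes of the file. The following is a minimal sketch, assuming only that a non-empty ZIP archive starts with the local file header signature `PK\x03\x04`. The helper name `zip_archive?` and the sample data are made up for this illustration and are not part of our parser.

```ruby
require 'stringio'

# Signature of a ZIP local file header: "PK" followed by bytes 0x03 and 0x04.
ZIP_SIGNATURE = "PK\x03\x04".b

# Hypothetical helper: true if the IO object starts with the ZIP signature.
def zip_archive?(io)
  io.rewind
  io.read(4) == ZIP_SIGNATURE
end

# In-memory stand-ins so the sketch runs without a real file.
fake_docx = StringIO.new("PK\x03\x04rest of the archive".b)
plain_text = StringIO.new('just some text')

puts zip_archive?(fake_docx)  # true
puts zip_archive?(plain_text) # false
```

With a real document, the same check could be run as `File.open('example.docx', 'rb') { |f| zip_archive?(f) }`, where `example.docx` is a placeholder path.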

For example, let's create a simple file with some text in the ONLYOFFICE document editor and download it as a docx.

Now we need to extract the document as a zip to see its structure, which looks like this:

#tree
.       
├── [Content_Types].xml     
├── docProps                
│   ├── app.xml            
│   └── core.xml            
├── _rels                   
└── word
    ├── document.xml        
    ├── fontTable.xml       
    ├── _rels               
    │   └── document.xml.rels                  
    ├── settings.xml        
    ├── styles.xml          
    ├── theme               
    │   ├── _rels           
    │   │   └── theme1.xml.rels                
    │   └── theme1.xml      
    └── webSettings.xml 

Let’s see what all those things are:

[Content_Types].xml: the list of MIME types (content types) of the files in the archive.

app.xml: document metadata, app metadata, statistics.

core.xml: core metadata such as the author and the creation and latest modification dates.

document.xml: Ohh, that's a bingo. This is the document content we were looking for!

fontTable.xml: the document font table. Might be useful.

document.xml.rels: the list of relationships between document.xml and the other files it refers to (styles, fonts, images, and so on); this list comes in handy for complex documents with pictures and graphics.

settings.xml: as the name suggests, it contains various document settings like default zoom, number separators, etc.

styles.xml, theme1.xml, and theme1.xml.rels: very detailed files with style and theme settings. The ability to recognize these is one of the key advantages of the product.

webSettings.xml: the settings for the web version of the document. Not very popular functionality for docx.
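
To illustrate what a relationships file such as document.xml.rels holds, here is a small sketch that maps relationship ids to the files they point to. It is not taken from our parser: the XML is a hand-written sample, the helper name `relationship_targets` is hypothetical, and REXML (distributed with Ruby) is used so that the example runs without extra gems.

```ruby
require 'rexml/document'

# Hand-written, trimmed sample of word/_rels/document.xml.rels.
RELS_XML = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
    <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
    <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>
  </Relationships>
XML

# Hypothetical helper: returns a hash of relationship id => target path.
def relationship_targets(xml)
  doc = REXML::Document.new(xml)
  doc.root.elements.each_with_object({}) do |rel, targets|
    targets[rel.attributes['Id']] = rel.attributes['Target']
  end
end

relationship_targets(RELS_XML).each do |id, target|
  puts "#{id} -> #{target}"
end
```

Inside document.xml, a picture is referenced by such an id, so a lookup table like this is what lets a parser find the actual image file.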

So, if we are dealing with a simple docx like the one we created, we need document.xml only.

That’s simple XML. Luckily, it can easily be parsed using Ruby. We just take Nokogiri, get the DOM tree, and then check the OOXML standard (or reverse-engineer a sample file) to see where the parameter we need is hidden.
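
As an illustration of the idea, here is a minimal sketch that looks for the bold flag in each run of text. It is not the parser's actual code. Our parser uses Nokogiri; this sketch swaps in REXML, which is distributed with Ruby, so that it runs without extra gems. The XML is a hand-written, heavily trimmed stand-in for word/document.xml, and the helper name `runs_with_bold_flag` is hypothetical.

```ruby
require 'rexml/document'

# Hand-written stand-in for word/document.xml with one bold and one plain run.
DOCUMENT_XML = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:body>
      <w:p>
        <w:r><w:rPr><w:b/></w:rPr><w:t>Bold text</w:t></w:r>
        <w:r><w:t>Plain text</w:t></w:r>
      </w:p>
    </w:body>
  </w:document>
XML

W_NS = { 'w' => 'http://schemas.openxmlformats.org/wordprocessingml/2006/main' }.freeze

# Hypothetical helper: returns [text, bold?] for every run (w:r) in the XML.
def runs_with_bold_flag(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//w:r', W_NS).map do |run|
    text = REXML::XPath.match(run, 'w:t', W_NS).map(&:text).join
    bold = REXML::XPath.first(run, 'w:rPr/w:b', W_NS)
    # <w:b/> switches bold on; an explicit val of false, 0, or off switches it off.
    is_bold = !bold.nil? && !%w[false 0 off].include?(bold.attributes['w:val'])
    [text, is_bold]
  end
end

runs_with_bold_flag(DOCUMENT_XML).each do |text, is_bold|
  puts "#{text}: #{is_bold ? 'bold' : 'not bold'}"
end
```

This sketch only reads the formatting set directly on the run. Bold can also be inherited from a style in styles.xml, and handling that is where most of the real work lies.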

How the Parser Was Written

We made two mistakes when we started working on the tool, and corrected both later. Here they are; we hope our experience helps you avoid them.

Big Files

So, we need to test three different editors: for text documents, spreadsheets, and presentations. How should we organize the code for that? It seems funny now, but at first we had four files (the fourth one for methods common to all three editors) of 4,000 lines each. Fixing that took time. We carefully restructured the code, and the result is 200 files instead of four. Now it’s much easier to fix bugs.

No Tests

Us: have no tests for the parser (because we wrote it to test the editors, not to have one more thing to test).

Everything: crashes after we correct one typo while trying to fix a minor issue.

So, we had to create a `spec` folder and put two hundred files in there that check a bunch of parameters. Now we know that the commit we made before leaving work will not break the verification of that option in the third-level menu that no one remembers how to make work correctly.

We also had cool ideas. For example:

Using RuboCop

RuboCop is a Ruby static code analyzer and formatter, based on the community Ruby style guide. We love it as it helps us avoid mistakes, keep the code clean, and be sure that our latest commit didn’t make it worse (thanks to integration via overcommit).

Here’s what it looks like. If, after the hardest of days, you forget that local variable names in Ruby start with a lowercase letter (a capitalized name is a constant) and try to commit something like this:

- path_to_zip_file = copy_file_and_rename_to_zip(path_to_file)
+ ZIP_file = copy_file_and_rename_to_zip(path_to_file)

An error will occur.

Analyze with RuboCop........................................[RuboCop] FAILED

Errors on modified lines:

ooxml_parser/lib/ooxml_parser/common_parser/parser.rb:8:7: E: dynamic constant assignment

You won’t be able to commit the code without extra manual steps. This is an excellent foolproof system.
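
The message itself comes from Ruby: assigning to a capitalized name inside a method body is a constant assignment, and Ruby rejects it when parsing the file. The sketch below reproduces the error without RuboCop. It assumes the standard Ruby interpreter (MRI), because it relies on `RubyVM::InstructionSequence.compile`, which only parses and compiles the source without running it. The method name `open_as_zip` and the helper `syntax_error_for` are made up for this illustration.

```ruby
# Hypothetical method that uses a capitalized name for a local value.
SOURCE = <<~RUBY
  def open_as_zip(path_to_file)
    ZIP_file = path_to_file
  end
RUBY

# Returns the syntax error message for the source, or nil if it parses cleanly.
def syntax_error_for(source)
  RubyVM::InstructionSequence.compile(source)
  nil
rescue SyntaxError => e
  e.message
end

# The message mentions "dynamic constant assignment".
puts syntax_error_for(SOURCE)
```

Renaming `ZIP_file` to `zip_file` in the sample makes the helper return nil, which is the same fix the hook forces you to make before committing.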

Making Use of Our Document Base

By the time we created the parser, we already had a vast collection of huge and, in fact, quite strange docx, xlsx, and pptx files. We collected them during the early stages of developing the ONLYOFFICE editors to check the rendering of complex documents. Several years later, we used them to test our parser. We detected a significant number of errors, and it took us several weeks to fix them. But it was worth it.

Now we have an awesome tool for parsing OOXML files. We use it for testing:

  • ONLYOFFICE Developer Edition: document, spreadsheet, and presentation editors as an advanced component for other web apps.
  • ONLYOFFICE Integration Edition: online editors with connectors for popular business solutions (Nextcloud, ownCloud, SharePoint, Confluence, Alfresco).
  • ONLYOFFICE Enterprise Edition: online editors integrated with a collaboration platform.

We hope it will be as useful for your project as it is for ONLYOFFICE.
