
How To Compare DOCX Documents in Java

In this article, learn how to carry out DOCX comparisons programmatically by calling a specialized web API with Java code examples.

By Brian O'Neill · Jun. 21, 24 · Tutorial

If you’ve spent a lot of time creating and editing documents in the MS Word application, there’s a good chance you’ve heard of (and maybe even used) the DOCX comparison feature. This simple, manual comparison tool produces a three-pane view displaying the differences between two versions of a file. It’s a useful tool for summarizing the journey legal contracts (or other, similar documents that tend to start as templates) take when they undergo multiple rounds of collaborative edits.

As useful as manual DOCX document comparisons are, they’re still manual, which immediately makes them inefficient at scale. Thankfully, though, the open standard DOCX is based on, OpenXML, is designed to facilitate the automation of manual processes like this by making Office document file structure easily accessible to programmers. With the right developer tools, you can make programmatic DOCX comparisons at scale in your own applications.

In this article, you’ll learn how to carry out DOCX comparisons programmatically by calling a specialized web API with Java code examples. This will help you automate DOCX comparisons without the need to understand OpenXML formatting or write a ton of new code.  Before we get to our demonstration, however, we'll first briefly review OpenXML formatting, and we'll also learn about an open-source library that can be used to read and write Office files in Java.

Understanding OpenXML 

OpenXML formatting has been around for a long time now (since 2007), and it’s the standard all major Office documents are currently based on.

Thanks to OpenXML formatting, all Office files, including Word (DOCX), Excel (XLSX), PowerPoint (PPTX), and others, are structured as zip archives containing the document's content, metadata, and file specifications in XML format.

We can easily review this file structure for ourselves by renaming Office files as .zip files. On Windows, we can cd into the directory that contains our DOCX file and rename the file using the command below (replacing the example file name with our own file name):

Shell
 
ren "hello world.docx" "hello world.zip"


We can then open the .zip version of our DOCX file and poke around in our file archive.
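The same inspection can be done in code. The sketch below uses only the standard `java.util.zip` package to list the entries in a zip-based Office file. The class name `DocxArchiveLister` and the `sampleArchive` helper are hypothetical and exist only so the example runs without a real document; the two entry names are ones a typical DOCX archive contains.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, not part of any library.
final class DocxArchiveLister {

    // Returns the names of all entries in a zip-based Office file, in archive order.
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Builds a tiny stand-in archive so the sketch runs without a real DOCX file.
    static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            zip.putNextEntry(new ZipEntry("[Content_Types].xml"));
            zip.write("<Types/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<document/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // With a real file, pass Files.newInputStream(Paths.get("hello world.docx")) instead.
        for (String name : listEntries(new ByteArrayInputStream(sampleArchive()))) {
            System.out.println(name);
        }
    }
}
```

Because the file is read as a stream, there is no need to rename it to .zip first; the extension only matters to the operating system's file explorer.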

When we open DOCX files in our MS Word application, our files are unzipped, and we can then use various built-in application tools to manipulate our files’ contents.  

This open file structure makes it relatively straightforward to build applications that read and write DOCX files. It is, to use a well-known example, the reason why programs like Google Drive can upload and manipulate DOCX files in their own text editor applications. With a good understanding of OpenXML structure, we could build our own text editor applications to manipulate DOCX files if we wanted; it would just be a LOT of work. It wouldn’t be especially worth our time, either, given the number of applications and programming libraries that already exist for exactly that purpose.

Writing DOCX Comparisons in Java

While the OpenXML SDK is open source (hosted on GitHub for anyone to use), it’s written to be used with .NET languages like C#. If we were looking to automate DOCX comparisons with an open-source library in Java, we would need to use something like the Apache POI library to build our application instead.

Our process would roughly entail:

  1. Adding Apache POI dependencies to our pom.xml
  2. Importing the XWPF classes (the part of Apache POI designed for OpenXML Word documents)
  3. Writing some code to load and extract relevant content from our documents 

Step 3 is where things would start to get complicated - we would need to write a bunch of code to retrieve and compare paragraph elements from each document, and if we wanted to ensure consistent formatting across both of our documents (important for our resulting comparison document), we would need to break down our paragraphs into runs. We would then, of course, need to implement our own robust error handling before writing our DOCX comparison result to a new file.
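To give a rough sense of the comparison step, here is a minimal sketch that compares two documents paragraph by paragraph. It assumes the paragraph text has already been extracted into plain lists of strings; in a POI-based application those lists would likely be built by calling `getText()` on each `XWPFParagraph` returned by `XWPFDocument.getParagraphs()`. The `ParagraphComparer` class is hypothetical, the comparison is purely positional (an inserted paragraph shifts everything after it), and runs and formatting are ignored entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of Apache POI.
final class ParagraphComparer {

    // Compares two documents position by position and returns one line per difference:
    // "+" for an added paragraph, "-" for a removed one, "~" for a changed one.
    static List<String> diff(List<String> original, List<String> revised) {
        List<String> changes = new ArrayList<>();
        int max = Math.max(original.size(), revised.size());
        for (int i = 0; i < max; i++) {
            String before = i < original.size() ? original.get(i) : null;
            String after = i < revised.size() ? revised.get(i) : null;
            if (before == null) {
                changes.add("+ [" + (i + 1) + "] " + after);
            } else if (after == null) {
                changes.add("- [" + (i + 1) + "] " + before);
            } else if (!before.equals(after)) {
                changes.add("~ [" + (i + 1) + "] " + before + " -> " + after);
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        List<String> v1 = Arrays.asList(
                "Lorem ipsum dolor sit amet.",
                "Consectetur adipiscing elit.");
        List<String> v2 = Arrays.asList(
                "Lorem ipsum dolor sit amet.",
                "This line is now in English.",
                "A new closing paragraph.");
        diff(v1, v2).forEach(System.out::println);
    }
}
```

Even this simplified version hints at the remaining work: a production comparison would need a real diff algorithm to handle shifted paragraphs, run-level handling to preserve formatting, and code to write the marked-up result back to a DOCX file.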

Advantages of a Web API for DOCX Comparison

Writing our DOCX comparison from scratch would take time, and it would also put the burden of our file-processing operation squarely on our own server. That might not be a big deal for comparisons involving smaller-sized DOCX documents, but it would start to take a toll with larger-sized documents and larger-scale (higher volume) operations.

By calling a web API to handle our DOCX comparison instead, we’ll limit the amount of code we need to write, and we’ll offload the heavy lifting in our comparison workflow to an external server. That way, we can focus more of our hands-on coding efforts on building robust features in our application that handle the results of our DOCX comparisons in various ways.

Demonstration

Using the code examples below, we can call an API that simplifies the process of automating DOCX comparisons. Rather than writing a bunch of new code, we’ll just need to copy relevant examples, load our input files, and write the resulting comparison bytes to new DOCX files of their own.

To help demonstrate what the output of our programmatic comparison looks like, I’ve included a screenshot from a simple DOCX comparison result below. This document shows the comparison between two versions of a classic Lorem Ipsum passage – one containing all of the original Latin text, and the other containing a few lines of English text:

Screenshot from a simple DOCX comparison result

To structure our API call, we can begin by installing the client SDK. Let’s add a reference to the repository in our pom.xml:

XML
 
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>


And let’s add a reference to the dependency in our pom.xml:

XML
 
<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <version>v4.25</version>
    </dependency>
</dependencies>


After that, we can add the following imports to our controller:

Java
 
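// Cloudmersive client classes:
import com.cloudmersive.client.invoker.ApiClient;
import com.cloudmersive.client.invoker.ApiException;
import com.cloudmersive.client.invoker.Configuration;
import com.cloudmersive.client.invoker.auth.*;
import com.cloudmersive.client.CompareDocumentApi;
// Standard library classes for loading the input files and saving the result:
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;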


Now we can turn our attention to configuration.  We’ll need to supply a free Cloudmersive API key (this allows 800 API calls/month with no commitments) in the following configuration snippet:

Java
 
ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");


Next, we can use our final code examples below to create an instance of the API and call the DOCX comparison function:

Java
 
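CompareDocumentApi apiInstance = new CompareDocumentApi();
File inputFile1 = new File("/path/to/inputfile"); // File | First input file to perform the operation on.
File inputFile2 = new File("/path/to/inputfile"); // File | Second input file to perform the operation on (more than 2 can be supplied).
try {
    byte[] result = apiInstance.compareDocumentDocx(inputFile1, inputFile2);
    // Printing a byte[] only shows its object reference, so write the bytes to a new file instead.
    // The output path is a placeholder; this call needs java.nio.file.Files and java.nio.file.Paths.
    Files.write(Paths.get("/path/to/comparison-result.docx"), result);
} catch (ApiException e) {
    System.err.println("Exception when calling CompareDocumentApi#compareDocumentDocx");
    e.printStackTrace();
} catch (IOException e) {
    System.err.println("Could not write the comparison result to disk");
    e.printStackTrace();
}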


Now we can easily automate DOCX comparisons with a few lines of code.  If our input DOCX files contain any errors, the endpoint will try to auto-repair the files before making the comparison.
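Before saving the response, it can be worth a quick sanity check. The sketch below assumes the `byte[]` returned by the comparison call holds a complete DOCX file; since DOCX files are zip archives, the data should begin with the zip signature bytes `PK`. The `ComparisonResultWriter` class and its method names are hypothetical, and the check is only a cheap heuristic, not a full validation of the document.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of the Cloudmersive SDK.
final class ComparisonResultWriter {

    // DOCX files are zip archives, and zip data starts with the bytes 'P' 'K'.
    static boolean looksLikeZip(byte[] data) {
        return data != null && data.length >= 4 && data[0] == 'P' && data[1] == 'K';
    }

    // Writes the result to the target path, refusing data that is clearly not a zip archive
    // (for example, an error message returned in place of a document).
    static Path save(byte[] result, Path target) throws IOException {
        if (!looksLikeZip(result)) {
            throw new IllegalArgumentException("Result does not look like a DOCX (zip) file");
        }
        return Files.write(target, result);
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeResult = {'P', 'K', 3, 4}; // stand-in for the bytes returned by the API
        Path target = Files.createTempFile("comparison-result", ".docx");
        System.out.println("Saved to " + save(fakeResult, target));
    }
}
```

In a real application, `save` would be called with the `result` array from the comparison call and a path of our choosing.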

Conclusion

In this article, we learned about the MS Word DOCX Comparison tool and discussed how DOCX comparisons can be automated (thanks to OpenXML formatting). We then learned how to call a low-code DOCX comparison API with Java code examples.
