
How To Validate Three Common Document Types in Python

Learn about three API solutions capable of validating PDF, Excel, and MS Word documents for document validation services in file processing applications.

by Brian O'Neill · Jan. 16, 23 · Tutorial

Regardless of the industry we work in, whether technology, e-commerce, manufacturing, finance, or some Venn diagram of them all, we mostly rely on the same handful of standard document formats to share critical information among internal teams and with external organizations. These documents are almost always on the move, crossing our networks as fast as our servers allow. Many internal business automation workflows, for example, whether written in custom code or assembled on an enterprise automation platform, are designed to process standard PDF invoices step by step as they pass from one stakeholder to another. Similarly, customized reporting applications are typically used to access and process the Excel spreadsheets that an organization’s financial stakeholders (internal or external) rely on. All the while, these documents remain subject to strictly enforced data standards, and each application must consistently uphold them. That’s because any document, no matter how common its format, can violate an industry regulation, contain an undetected encoding error, or, at worst, hide a malicious security threat behind a benign façade.

As rapidly evolving business applications make our professional lives more efficient, business users place more and more trust in those applications to uphold high data standards on their behalf. As documents travel from one location to another, the applications they pass through are ultimately responsible for checking the integrity, security, and compliance of each document’s contents. If an invalid PDF file reaches its end destination, the application that processed it (and, by extension, the stakeholders responsible for creating, configuring, deploying, and maintaining that application) will have some difficult questions to answer. It’s important to know up front whether there are any issues in the documents our applications are processing. Without a way of doing that, we risk letting our own applications shoot us in the foot.

Thankfully, it’s straightforward (and standard practice) to solve this problem with layers of data validation APIs. Document validation APIs in particular are designed to fit within the architecture of a file processing application: they provide a quick feedback loop on each file they encounter, so the application runs smoothly when valid documents pass through and halts immediately when an invalid document is identified. Dozens of common document types require validation in a file processing application. Many of the most common, including PDF, Excel, and DOCX (the three this article highlights), are each compressed and encoded in their own distinct way, which makes it especially important to programmatically check that their contents are structured correctly and securely.
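
The halt-on-invalid behavior described above can be sketched as a small gate placed in front of a processing step. This is only an illustration under stated assumptions: `process_documents`, `InvalidDocumentError`, and the `is_valid` and `handle` callables are hypothetical names, not part of any API discussed in this article. In practice, `is_valid` would wrap a call to a validation service.

```python
from typing import Callable, Iterable, List


class InvalidDocumentError(Exception):
    """Raised when a document fails validation (hypothetical name)."""


def process_documents(
    paths: Iterable[str],
    is_valid: Callable[[str], bool],
    handle: Callable[[str], None],
) -> List[str]:
    """Validate each file before handing it on; stop at the first invalid one."""
    processed = []
    for path in paths:
        if not is_valid(path):
            # Halt immediately rather than let a bad file travel downstream.
            raise InvalidDocumentError("validation failed for %s" % path)
        handle(path)
        processed.append(path)
    return processed
```

Raising an exception is one possible design choice; a workflow that should keep running might instead move the rejected file to a quarantine location and continue.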

Document Validation APIs

The purpose of this article is to highlight three API solutions that validate three separate and exceedingly common document types within your document processing applications: PDF, Excel XLSX, and Microsoft Word DOCX. These APIs are all free to use, requiring a free-tier API key and only a few lines of code (provided below in Python) to call their services. While the process of validating each document type is unique, the response body each API returns is standardized, which makes it straightforward to identify whether errors were found in a document and, if so, which errors and warnings were reported. Below, I’ll quickly outline the fields supplied in each API’s response:

  • DocumentIsValid – A Boolean value indicating whether the document is valid based on its encoding.
  • PasswordProtected – A Boolean value indicating whether the document is password protected (which, if unexpected, can indicate an underlying security threat).
  • ErrorCount – An integer reflecting the number of errors detected within the document.
  • WarningCount – The number of warnings produced for the document, counted independently of the error count.
  • ErrorsAndWarnings – More detailed information about each issue identified within the document, including an error description, error path, error URI (uniform resource identifier, such as a URL or URN), and an IsError Boolean.
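
As a minimal sketch of how an application might read these fields, the helper below summarizes a response that has already been converted to a plain dictionary. The dictionary keys mirror the field names listed above. The `Description` key inside each `ErrorsAndWarnings` entry is an assumption, as is the function name `summarize_validation`. The Python SDK returns a model object and may expose these fields under different attribute names, so adapt the keys to what you actually receive.

```python
from typing import Any, Dict, List


def summarize_validation(result: Dict[str, Any]) -> List[str]:
    """Turn a validation response (as a dict) into human-readable lines."""
    lines = []
    if result.get("DocumentIsValid"):
        lines.append("document is valid")
    else:
        lines.append(
            "document is INVALID: {} error(s), {} warning(s)".format(
                result.get("ErrorCount", 0), result.get("WarningCount", 0)
            )
        )
    if result.get("PasswordProtected"):
        lines.append("document is password protected")
    for item in result.get("ErrorsAndWarnings") or []:
        # "IsError" is named in the article; "Description" is an assumed key.
        kind = "error" if item.get("IsError") else "warning"
        lines.append("{}: {}".format(kind, item.get("Description", "(no description)")))
    return lines
```

A caller could log these lines, or check `DocumentIsValid` and `PasswordProtected` directly to decide whether a file may continue through the workflow.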

Demonstration

To use any of the three APIs referred to above, the first step is to install the Python SDK with the pip command below:

 
pip install cloudmersive-convert-api-client


With installation complete, we can turn our attention to the functions that call each API.

To call the PDF validation API, we can use the following code:

Python
 
import cloudmersive_convert_api_client
from cloudmersive_convert_api_client.rest import ApiException
from pprint import pprint

# Configure API key authorization: Apikey
configuration = cloudmersive_convert_api_client.Configuration()
configuration.api_key['Apikey'] = 'YOUR_API_KEY'

# Create an instance of the API class
api_instance = cloudmersive_convert_api_client.ValidateDocumentApi(
    cloudmersive_convert_api_client.ApiClient(configuration)
)
input_file = '/path/to/inputfile'  # Input file to perform the operation on

try:
    # Validate a PDF document file and print the validation result
    api_response = api_instance.validate_document_pdf_validation(input_file)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling ValidateDocumentApi->validate_document_pdf_validation: %s\n" % e)


To call the Microsoft Excel XLSX validation API, we can use the following code instead:

Python
 
import cloudmersive_convert_api_client
from cloudmersive_convert_api_client.rest import ApiException
from pprint import pprint

# Configure API key authorization: Apikey
configuration = cloudmersive_convert_api_client.Configuration()
configuration.api_key['Apikey'] = 'YOUR_API_KEY'

# Create an instance of the API class
api_instance = cloudmersive_convert_api_client.ValidateDocumentApi(
    cloudmersive_convert_api_client.ApiClient(configuration)
)
input_file = '/path/to/inputfile'  # Input file to perform the operation on

try:
    # Validate an Excel document (XLSX) and print the validation result
    api_response = api_instance.validate_document_xlsx_validation(input_file)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling ValidateDocumentApi->validate_document_xlsx_validation: %s\n" % e)


And finally, to call the Microsoft Word DOCX validation API, we can use the final code snippet supplied below:

Python
 
import cloudmersive_convert_api_client
from cloudmersive_convert_api_client.rest import ApiException
from pprint import pprint

# Configure API key authorization: Apikey
configuration = cloudmersive_convert_api_client.Configuration()
configuration.api_key['Apikey'] = 'YOUR_API_KEY'

# Create an instance of the API class
api_instance = cloudmersive_convert_api_client.ValidateDocumentApi(
    cloudmersive_convert_api_client.ApiClient(configuration)
)
input_file = '/path/to/inputfile'  # Input file to perform the operation on

try:
    # Validate a Word document (DOCX) and print the validation result
    api_response = api_instance.validate_document_docx_validation(input_file)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling ValidateDocumentApi->validate_document_docx_validation: %s\n" % e)
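
Because the three snippets differ only in the method they call, an application that handles all three document types might pick the method from the file extension. The sketch below shows one way to do that. The helper name `validate_by_extension` and the `VALIDATORS` mapping are hypothetical; the method names are the ones used in the snippets above, and `api_instance` is assumed to be a configured `ValidateDocumentApi` object created as shown there.

```python
import os

# Method names are taken from the three snippets above.
VALIDATORS = {
    ".pdf": "validate_document_pdf_validation",
    ".xlsx": "validate_document_xlsx_validation",
    ".docx": "validate_document_docx_validation",
}


def validate_by_extension(api_instance, path):
    """Call the validation method that matches the file's extension."""
    ext = os.path.splitext(path)[1].lower()
    method_name = VALIDATORS.get(ext)
    if method_name is None:
        raise ValueError("no validator configured for %r" % ext)
    # Look up the method on the API object and pass it the file path.
    return getattr(api_instance, method_name)(path)
```

Note that a file extension says nothing about a file's real contents; the validation call itself is what determines whether the document is structured as its extension claims.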


Please note that while these APIs provide some basic security benefits during validation (e.g., identifying unexpected password protection on a file, a common method for sneaking malicious files through a network, since the password can be supplied to an unsuspecting downstream user at a later date), they are not full-fledged security APIs of the kind that specifically hunt for viruses, malware, and other malicious content hidden within a file. Any document, especially one that originated outside your internal network, should always be thoroughly vetted by dedicated security services (e.g., services equipped with virus and malware signatures) before entering or leaving your file storage systems.
