Extracting Clean Excel Tables From PDFs Using Python + Docling

Extract tables from PDFs to fully formatted, analysis-ready Excel files with pdf-tables-to-excel, supporting OCR, complex layouts, and numeric parsing.

By Sanjay Krishnegowda · Dec. 25, 25 · Analysis

PDFs remain one of the most widely used formats for distributing structured reports: financial statements, regulatory filings, research documents, fund fact sheets, and more. Yet despite their structured appearance, the tables inside them are not directly machine-readable. Extracting those tables reliably is notoriously error-prone and often requires hours of manual cleanup.

This is especially true in finance and enterprise environments where analysts rely on Excel for modeling and reporting.

To address this challenge, I built an open-source Python package:

pdf-tables-to-excel

A tool designed to detect, extract, and export clean, analysis-ready Excel tables from any PDF — powered by Docling’s state-of-the-art document parsing.

Install it in seconds:

Shell
 
pip install pdf-tables-to-excel


This article walks through the motivation, engineering decisions, architecture, and practical workflows behind the tool.

Why I Built This (The Real Motivation)

I’ve spent years working with technical users in financial services — quant teams, credit analysts, portfolio researchers, operations, and data engineering groups. Across all of them, I repeatedly observed one universal pain point:

Extracting tables from PDFs feels like manual data entry.

Even when using open-source libraries like Camelot, Tabula, pdfplumber, or PyPDF2, most tools stop at returning a Pandas DataFrame. Then analysts still need to:

  • Fix column alignments
  • Convert text percentages to real numeric Excel percentages
  • Handle negative values shown as (1.3%)
  • Unmerge headers
  • Manually build and style Excel sheets
  • Split multi-table PDFs into separate DataFrames and sheets
  • Apply proper borders
  • Auto-fit column widths
  • Preserve header formatting
  • Handle OCR for scanned PDFs

Every team reinvented the same 200 lines of conversion code.

Some tools extract the data, but none produces an Excel file that a person can use immediately, without additional cleanup. Additionally:

  • Many finance documents contain complex layouts.
  • Merged headers and percentage columns fail silently.
  • Currency formats lose precision.
  • Most libraries misdetect table boundaries.
  • Scanned PDFs require unreliable OCR workflows.

This motivated me to build a unified tool, where table extraction + Excel formatting are bundled into a single operation.

Why Docling? The Accuracy Advantage

Out of all PDF parsing libraries available today, Docling stands out in terms of:

Layout-Awareness

Docling understands the document structure, not just the text.

Table Geometry Detection

It identifies rows, columns, spans, merges, and alignment.

OCR Integration

Scanned PDFs are handled via RapidOCR or EasyOCR.

High Accuracy on Complex Documents

This includes financial tables with:

  • multi-line headers
  • merged spans
  • nested table regions
  • footnotes and annotations
  • alternating row patterns

Robust Against Inconsistent Table Borders

Docling detects tables even when:

  • Borders are missing
  • Cells are visually misaligned
  • Fonts vary
  • Whitespace is inconsistent

This means the extracted DataFrames are significantly cleaner than what most legacy tools produce. Internally, the pipeline looks like this:

PDF →
  Docling Layout Analyzer →
    Table Structure Detection →
      TableItem → Pandas DataFrame →
        Excel Formatting Engine →
          Styled Workbook (.xlsx)


And that’s where this package shines.
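The first stages of that pipeline can be sketched directly against Docling's public API. This is a minimal illustration, not the package's internal code: the function name pdf_tables_as_dataframes is hypothetical, and the calls (DocumentConverter, result.document.tables, export_to_dataframe) follow Docling's documented table-export usage, so check them against the Docling version you have installed.

```python
from pathlib import Path


def pdf_tables_as_dataframes(pdf_path):
    """Return one pandas DataFrame per table that Docling detects in a PDF.

    Hypothetical helper for illustration; not part of pdf-tables-to-excel.
    """
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(path)

    # Imported lazily: Docling loads its layout models on first use,
    # which is slow, so only pay that cost when a real file is given.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(path)

    # result.document.tables holds the detected TableItem objects;
    # each one can be exported to a pandas DataFrame.
    return [table.export_to_dataframe() for table in result.document.tables]
```

Everything after this point in the pipeline (numeric cleanup and Excel styling) operates on the returned DataFrames.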

Why Extracting Tables from PDFs Is Harder Than It Looks

Although PDFs appear structured, they are fundamentally graphic layout containers, not semantic documents. A “table” inside a PDF is often just text placed at aligned coordinates. There is no real concept of:

  • Rows
  • Columns
  • Spans
  • Cell types
  • Numeric formats
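To see what "text placed at aligned coordinates" means in practice, here is a deliberately naive sketch that rebuilds rows from positioned words. The (text, x, y) tuples and the function name are assumptions made for illustration; this is not how Docling works, only the kind of coordinate guessing a simple extractor is reduced to.

```python
def group_words_into_rows(words, y_tolerance=2.0):
    """Group (text, x, y) tuples into rows of text, ordered left to right.

    Words whose y coordinate is within y_tolerance of the first word in
    the current row are assumed to sit on the same table row.
    """
    rows = []  # each entry: (row_y, [(x, text), ...])
    for text, x, y in sorted(words, key=lambda w: w[2]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order the cells by their x position.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Words as a PDF might store them: unordered, with slightly uneven baselines.
words = [
    ("2024", 300.0, 100.4),
    ("Item", 50.0, 100.0),
    ("2023", 200.0, 99.8),
    ("135", 300.0, 119.7),
    ("Revenue", 50.0, 120.1),
    ("120", 200.0, 120.0),
]
table = group_words_into_rows(words)
print(table)
```

The grouping depends entirely on the tolerance value and knows nothing about spans or cell types. Nothing in the file itself says where a row ends or which cells belong together.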

This is why most extractors fail when dealing with:

1. Merged Headers

Financial tables frequently contain two or three header rows representing categories and subcategories. Traditional extractors flatten them incorrectly, losing context.

2. Parentheses for Negative Numbers

Accountants often express negatives as (123) instead of -123. OCR and text-based extractors usually treat this as text.

3. Lack of Borders

Some PDFs omit table lines for readability, which makes line-based geometric detection unreliable.

4. Complex Cell Spanning

A single header may span 4–5 columns; most tools misalign these structures.

5. Scanned PDFs

OCR introduces noise, misread digits, and extra whitespace.

By integrating Docling and adding post-processing layers for numeric parsing, this library removes a majority of these obstacles, producing consistently structured DataFrames that convert cleanly into Excel.
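To make the merged-header problem (item 1) concrete, the sketch below combines stacked header rows into a single label per column. It assumes the header has already been extracted as equal-length lists of strings in which an empty cell means "covered by the merged cell to the left". The function name and that input convention are illustrative assumptions, not the library's API.

```python
def flatten_header_rows(header_rows, sep=" / "):
    """Combine stacked header rows into one label per column.

    In every row except the last, an empty cell is treated as the
    continuation of a merged cell, so the previous label is carried forward.
    """
    filled_rows = []
    for row in header_rows[:-1]:
        filled, current = [], ""
        for cell in row:
            if cell and cell.strip():
                current = cell.strip()
            filled.append(current)
        filled_rows.append(filled)
    # The last row holds the leaf labels and is not forward-filled.
    filled_rows.append([(cell or "").strip() for cell in header_rows[-1]])

    labels = []
    for column in zip(*filled_rows):
        labels.append(sep.join(part for part in column if part))
    return labels


header = [
    ["", "Returns", "", "Risk", ""],
    ["Fund", "1Y", "3Y", "Volatility", "Max Drawdown"],
]
print(flatten_header_rows(header))
```

Keeping the category in each label ("Returns / 1Y" rather than just "1Y") is what preserves the context that naive flattening loses.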

How this library differs from existing tools:

| Feature | Typical PDF table extractors | pdf-tables-to-excel |
|---|---|---|
| Table detection | Often inconsistent | Docling-based, high accuracy |
| Output | Pandas DataFrame only | Fully formatted Excel file |
| Multi-table support | Manual | Automatic, one sheet per table |
| Borders & formatting | No | Yes (clean, minimal Excel formatting) |
| Auto column width | No | Yes |
| Numeric parsing | Limited or none | Currency, percentages, negatives |
| CLI support | Rare | Yes (pdf2styledexcel) |
| OCR support | Optional, unreliable | Built-in via Docling's OCR layers |
| Finance-ready? | No | Yes |


This isn’t “just another wrapper.” It’s opinionated software created specifically to solve an end-to-end workflow problem.

Technical Architecture

1. Table Extraction Engine (Docling)

Docling outputs TableItem objects, each containing:

  • cell geometry
  • spans
  • text content
  • header blocks
  • row alignments
  • confidence scores

These are converted into Pandas DataFrames with robust normalization.

2. Normalization Layer

This is where the tool outperforms general-purpose extractors.

  • Converts 21.8% → 0.218
  • Converts (4.3%) → -0.043
  • Converts $1,234 → 1234.0
  • Detects negatives in parentheses
  • Handles thousands separators
  • Cleans whitespace
  • Supports missing values gracefully

This ensures Excel receives real numeric values, not text strings.
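The conversions listed above can be approximated with a small parser. This is a sketch of the idea under simple assumptions (US-style separators, a leading currency symbol, parentheses for negatives), and parse_financial_number is a hypothetical name rather than the package's actual function. Decimal is used for the percentage division so that "21.8%" becomes exactly the float 0.218 instead of picking up binary rounding noise.

```python
import re
from decimal import Decimal

_PLAIN_NUMBER = re.compile(r"[+-]?\d+(\.\d+)?")


def parse_financial_number(raw):
    """Convert strings such as '21.8%', '(4.3%)' or '$1,234' to floats.

    Returns None for blank cells, and returns the input unchanged when it
    does not look numeric, so text labels are never silently coerced.
    """
    if raw is None:
        return None
    text = str(raw).strip()
    if text == "":
        return None

    # Accounting style: parentheses mark a negative value.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1].strip()

    is_percent = text.endswith("%")
    if is_percent:
        text = text[:-1].strip()

    # Drop a leading currency symbol and any thousands separators.
    text = text.lstrip("$€£").replace(",", "")

    if not _PLAIN_NUMBER.fullmatch(text):
        return raw

    value = Decimal(text)
    if is_percent:
        value = value / 100
    if negative:
        value = -value
    return float(value)


for sample in ["21.8%", "(4.3%)", "$1,234", "Total", ""]:
    print(repr(sample), "->", repr(parse_financial_number(sample)))
```

Percent values are stored as fractions, so the spreadsheet cell still needs a percent number format to display 0.218 as 21.8%.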

3. Excel Formatting Engine

Built using XlsxWriter, the tool:

  • Creates one sheet per table
  • Applies bold header styling
  • Auto-resizes columns
  • Adds thin borders to actual table area
  • Freezes header row
  • Supports two naming modes:
    • sequential → Table 1, Table 2, ...
    • by_page → Page 1 Table 1

The goal is to deliver Excel files that analysts actually want to use.
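The two naming modes can be sketched as follows. The function name, its input (a list of source page numbers, one per table), and the choice to restart table numbering on each page in by_page mode are assumptions for illustration; the article only specifies the resulting name patterns. The 31-character cap reflects Excel's limit on worksheet names.

```python
MAX_SHEET_NAME_LENGTH = 31  # Excel rejects longer worksheet names


def make_sheet_names(table_pages, mode="sequential"):
    """Return one worksheet name per table.

    table_pages: 1-based source page of each table, in document order.
    mode: "sequential" gives "Table 1", "Table 2", ...
          "by_page" gives "Page 3 Table 1", counting tables per page.
    """
    names = []
    tables_seen_on_page = {}
    for index, page in enumerate(table_pages, start=1):
        if mode == "sequential":
            name = f"Table {index}"
        elif mode == "by_page":
            tables_seen_on_page[page] = tables_seen_on_page.get(page, 0) + 1
            name = f"Page {page} Table {tables_seen_on_page[page]}"
        else:
            raise ValueError(f"unknown sheet naming mode: {mode!r}")
        names.append(name[:MAX_SHEET_NAME_LENGTH])
    return names


pages = [1, 1, 3]  # two tables on page 1, one on page 3
print(make_sheet_names(pages, "sequential"))
print(make_sheet_names(pages, "by_page"))
```

Page-based names make it easier to trace a sheet back to its place in the source PDF, while sequential names stay short for documents with many tables.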

Design Principles Behind the Library

When designing pdf-tables-to-excel, I followed three core principles:

1. Zero Manual Cleanup

Tools that return DataFrames still leave analysts to fix formatting. This library makes the Excel output the final product, with:

  • Clean numeric types
  • Percent/currency conversion
  • Auto column widths
  • Proper borders
  • Consistent sheet naming
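Of the cleanup steps listed above, auto column width is the easiest to illustrate. The sketch below estimates a width per column from the longest cell, clamped to a sensible range. The function name and the default limits are assumptions; in XlsxWriter the resulting numbers would typically be passed to worksheet.set_column(first_col, last_col, width).

```python
def estimate_column_widths(header, rows, min_width=8, max_width=50, padding=2):
    """Estimate a character width per column from the longest cell text."""
    widths = [len(str(label)) for label in header]
    for row in rows:
        for i, cell in enumerate(row):
            if i >= len(widths):
                break  # ignore cells beyond the header's column count
            text = "" if cell is None else str(cell)
            widths[i] = max(widths[i], len(text))
    # Add breathing room, then clamp so no column is tiny or enormous.
    return [min(max(w + padding, min_width), max_width) for w in widths]


header = ["Fund", "1Y Return"]
rows = [
    ["Global Equity Income Fund", 0.218],
    ["Bond", None],
]
print(estimate_column_widths(header, rows))
```

This measures the raw string form of each value, so a cell stored as 0.218 but displayed as 21.8% may need a slightly different width than the estimate suggests.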

2. Predictable Behavior

Given the same PDF, the tool should always produce the same output. Many extractors return different results if whitespace changes even slightly. Relying on Docling's structured layout model makes extraction far more stable.

3. Batteries Included

Analysts should not need to write helper scripts for each report. Everything — from table detection to Excel styling — is packaged into a single function call. This keeps the API clean while still allowing advanced customization through optional parameters.

Extended Code Example

Basic Usage

Python
 
from pdf_tables_to_excel import convert_pdf_to_excel

convert_pdf_to_excel(
    input_pdf="Annual_Report.pdf",
    output_xlsx="extracted_tables.xlsx",
    sheet_naming="by_page",
    include_empty=False,
)


Example workflow:

Python
 
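# Assumes extract_tables is exported at the package top level.
from pdf_tables_to_excel import extract_tables

tables = extract_tables("Earnings_Release.pdf")
for t in tables:
    print(t.source_page, t.df.head())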


Real-World Use Cases

1. Financial Services

Extract tables from:

  • Earnings releases
  • Trustee reports
  • Loan servicing tapes
  • Regulatory disclosures (10-K, 10-Q)
  • Fund factsheets

2. Data Science / ML Pipelines

Convert PDF datasets into structured inputs for feature engineering and modeling.

3. Enterprise Automation

Integrate in:

  • ETL workflows
  • RPA pipelines
  • Document intelligence systems