Extracting Clean Excel Tables From PDFs Using Python + Docling

Extract tables from PDFs to fully formatted, analysis-ready Excel files with pdf-tables-to-excel, supporting OCR, complex layouts, and numeric parsing.

Dec. 25, 25 · Analysis

PDFs remain one of the most widely used formats for distributing structured reports: financial statements, regulatory filings, research documents, fund fact sheets, and more. Yet despite their structured appearance, PDFs do not store tables in a machine-readable form. Extracting those tables reliably is error-prone and often requires hours of manual cleanup.

This is especially true in finance and enterprise environments where analysts rely on Excel for modeling and reporting.

To address this challenge, I built an open-source Python package:

pdf-tables-to-excel

A tool designed to detect, extract, and export clean, analysis-ready Excel tables from any PDF — powered by Docling’s state-of-the-art document parsing.

Install it in seconds:

    Shell

    pip install pdf-tables-to-excel

This article walks through the motivation, engineering decisions, architecture, and practical workflows behind the tool.

Why I Built This (The Real Motivation)

I’ve spent years working with technical users in financial services — quant teams, credit analysts, portfolio researchers, operations, and data engineering groups. Across all of them, I repeatedly observed one universal pain point:

Extracting tables from PDFs feels like manual data entry.

Even when using open-source libraries like Camelot, Tabula, pdfplumber, or PyPDF2, most tools stop at returning a Pandas DataFrame. Then analysts still need to:

Fix column alignments
Convert text percentages to real numeric Excel percentages
Handle negative values shown as (1.3%)
Unmerge headers
Manually build and style Excel sheets
Split multi-table PDFs into separate DataFrames and sheets
Apply proper borders
Auto-fit column widths
Preserve header formatting
Handle OCR for scanned PDFs

Every team reinvented the same 200 lines of conversion code.

Some tools extract data, but none produce an Excel file that a human can use immediately, without additional cleanup. Additionally:

Many finance documents contain complex layouts.
Merged headers and percentage columns fail silently.
Currency formats lose precision.
Most libraries misdetect table boundaries.
Scanned PDFs require unreliable OCR workflows.

This motivated me to build a unified tool in which table extraction and Excel formatting are bundled into a single operation.

Why Docling? The Accuracy Advantage

Out of all PDF parsing libraries available today, Docling stands out in terms of:

Layout-Awareness

Docling understands the document structure, not just the text.

Table Geometry Detection

It identifies rows, columns, spans, merges, and alignment.

OCR Integration

Scanned PDFs are handled via RapidOCR or EasyOCR.

High Accuracy on Complex Documents

This includes financial tables with:

multi-line headers
merged spans
nested table regions
footnotes and annotations
alternating row patterns

Robust Against Inconsistent Table Borders

Docling detects tables even when:

Borders are missing
Cells are visually misaligned
Fonts vary
Whitespace is inconsistent

This means the extracted DataFrames are significantly cleaner than what most legacy tools produce. Internally, the pipeline looks like this:

    PDF →
      Docling Layout Analyzer →
        Table Structure Detection →
          TableItem → Pandas DataFrame →
            Excel Formatting Engine →
              Styled Workbook (.xlsx)

And that’s where this package shines.
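
The pipeline above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the package's actual internals: it relies on Docling's `DocumentConverter` and `TableItem.export_to_dataframe()` plus pandas' XlsxWriter engine, the function name `pdf_tables_to_workbook` is hypothetical, and no styling or numeric parsing is applied.

```python
from pathlib import Path


def pdf_tables_to_workbook(pdf_path: str, xlsx_path: str) -> int:
    """Write each table Docling finds to its own sheet; return the table count.

    Hypothetical helper for illustration only.
    """
    # Fail fast on a bad path before paying for the heavy imports below.
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(pdf_path)

    # Imported lazily: Docling loads layout and table models on first use.
    import pandas as pd
    from docling.document_converter import DocumentConverter

    # PDF -> layout analysis -> table structure detection
    result = DocumentConverter().convert(pdf_path)
    tables = result.document.tables  # list of TableItem objects
    if not tables:
        return 0

    with pd.ExcelWriter(xlsx_path, engine="xlsxwriter") as writer:
        for index, table in enumerate(tables, start=1):
            # TableItem -> DataFrame (recent Docling versions may also
            # accept the parent document as an argument here).
            df = table.export_to_dataframe()
            df.to_excel(writer, sheet_name=f"Table {index}", index=False)
    return len(tables)
```

Everything after `export_to_dataframe()` in this sketch is plain pandas output. The normalization and formatting stages described later are what turn it into a styled workbook.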

Why Extracting Tables from PDFs Is Harder Than It Looks

Although PDFs appear structured, they are fundamentally graphic layout containers, not semantic documents. A “table” inside a PDF is often just text placed at aligned coordinates. There is no real concept of:

Rows
Columns
Spans
Cell types
Numeric formats

This is why most extractors fail when dealing with:

1. Merged Headers

Financial tables frequently contain two or three header rows representing categories and subcategories. Traditional extractors flatten them incorrectly, losing context.
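
To make the flattening problem concrete, here is a small sketch of one way to combine stacked header rows into a single label per column without losing the parent category. It assumes a merged cell arrives as a label followed by empty strings, which is an assumption about the input and not a description of Docling's output. The name `flatten_headers` is hypothetical.

```python
def flatten_headers(header_rows: list[list[str]], sep: str = " | ") -> list[str]:
    """Combine stacked header rows into one label per column.

    In every row except the last, an empty cell is treated as the
    continuation of a merged cell and inherits the label to its left.
    """
    if not header_rows:
        return []
    width = len(header_rows[0])
    if any(len(row) != width for row in header_rows):
        raise ValueError("all header rows must have the same number of cells")

    filled = []
    for depth, row in enumerate(header_rows):
        is_leaf = depth == len(header_rows) - 1
        carried = ""
        filled_row = []
        for cell in row:
            text = " ".join(cell.split())  # collapse stray whitespace
            if text:
                carried = text
            elif not is_leaf:
                text = carried  # continuation of a merged parent header
            filled_row.append(text)
        filled.append(filled_row)

    labels = []
    for column in zip(*filled):
        parts = []
        for part in column:
            # Skip blanks and avoid repeating the same label twice in a row.
            if part and (not parts or parts[-1] != part):
                parts.append(part)
        labels.append(sep.join(parts))
    return labels


headers = flatten_headers([
    ["", "2024", "", "2023", ""],
    ["Metric", "Revenue", "Margin", "Revenue", "Margin"],
])
```

With the two rows above, each leaf label keeps its year, so the two "Revenue" columns stay distinguishable after flattening.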

2. Parentheses for Negative Numbers

Accountants often express negatives as (123) instead of -123. OCR and text-based extractors usually treat this as text.

3. Lack of Borders

Some PDFs remove table lines for better readability, making geometric detection unreliable.
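
The sketch below illustrates why borderless tables are fragile for purely geometric approaches. It infers columns only from the horizontal gaps between text fragments. This is a deliberately naive illustration, not how Docling works. The fragment tuple layout `(x0, x1, y, text)`, with `y` measured from the top of the page, and all function names are assumptions made for the example.

```python
from bisect import bisect_right

Fragment = tuple[float, float, float, str]  # x0, x1, y (from top), text


def column_bands(fragments: list[Fragment], min_gap: float = 8.0) -> list[tuple[float, float]]:
    """Merge horizontal extents; a gap of at least min_gap starts a new column."""
    bands: list[list[float]] = []
    for x0, x1 in sorted((f[0], f[1]) for f in fragments):
        if bands and x0 - bands[-1][1] < min_gap:
            bands[-1][1] = max(bands[-1][1], x1)
        else:
            bands.append([x0, x1])
    return [(start, end) for start, end in bands]


def to_rows(fragments: list[Fragment], min_gap: float = 8.0, row_tolerance: float = 2.0) -> list[list[str]]:
    """Group fragments into rows by y, then place each one in its column band."""
    starts = [start for start, _end in column_bands(fragments, min_gap)]
    rows: list[list[str]] = []
    row_y = None
    for x0, _x1, y, text in sorted(fragments, key=lambda f: (f[2], f[0])):
        if row_y is None or y - row_y > row_tolerance:
            rows.append([""] * len(starts))
            row_y = y
        column = bisect_right(starts, x0) - 1
        rows[-1][column] = f"{rows[-1][column]} {text}".strip()
    return rows


fragments = [
    (50, 90, 100, "Fund"), (200, 240, 100, "Return"), (300, 330, 100, "Fee"),
    (50, 110, 115, "Alpha Growth"), (205, 240, 115, "(1.3%)"), (302, 330, 115, "0.5%"),
    (50, 100, 130, "Beta Income"), (210, 240, 130, "4.2%"), (305, 330, 130, "0.3%"),
]
grid = to_rows(fragments)
```

This works on the tidy input above, but a single wide header fragment that bridges two columns would fuse their bands into one. That is the kind of failure that pushes table extraction toward layout-aware models.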

4. Complex Cell Spanning

A single header may span 4–5 columns; most tools misalign these structures.

5. Scanned PDFs

OCR introduces noise, misread digits, and extra whitespace.

By integrating Docling and adding post-processing layers for numeric parsing, this library removes most of these obstacles and produces consistently structured DataFrames that convert cleanly into Excel.

How this library differs from existing tools:

Feature	Typical PDF table extractors	`pdf-tables-to-excel`
Table detection	Often inconsistent	Docling-based, high accuracy
Output	Pandas DataFrame only	Fully formatted Excel file
Multi-table support	Manual	Automatic, one sheet per table
Borders & formatting	No	Yes (clean, minimal Excel formatting)
Auto column width	No	Yes
Numeric parsing	Limited or none	Currency, percentages, negatives
CLI support	Rare	Yes (`pdf2styledexcel`)
OCR support	Optional, unreliable	Built-in via Docling’s OCR layers
Finance-ready?	❌	✔

This isn’t “just another wrapper.” It’s opinionated software created specifically to solve an end-to-end workflow problem.

Technical Architecture

1. Table Extraction Engine (Docling)

Docling outputs TableItem objects, each containing:

cell geometry
spans
text content
header blocks
row alignments
confidence scores

These are converted into Pandas DataFrames with robust normalization.

2. Normalization Layer

This is where the tool outperforms general-purpose extractors.

Converts 21.8% → 0.218
Converts (4.3%) → -0.043
Converts $1,234 → 1234.0
Detects negatives in parentheses
Handles thousands separators
Cleans whitespace
Supports missing values gracefully

This ensures Excel receives real numeric values, not text strings.
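
The conversions listed above can be sketched as a single cell parser. This is an illustrative sketch, not the package's implementation: the name `parse_cell`, the set of missing-value markers, and the assumption of US-style numbers (comma as thousands separator, dot as decimal point) are all choices made for the example.

```python
import re

_MISSING = {"", "-", "n/a", "na"}  # assumed markers for an empty cell
_NUMBER = re.compile(r"^[+-]?(?:\d+\.?\d*|\.\d+)$")


def parse_cell(raw):
    """Return a float for numeric-looking text, None for blanks, else the text."""
    if raw is None:
        return None
    original = " ".join(str(raw).split())  # collapse runs of whitespace
    if original.lower() in _MISSING:
        return None

    text = original
    # Accounting style: (123) means -123.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    percent = text.endswith("%")
    if percent:
        text = text[:-1]
    # Drop currency symbols, thousands separators, and inner spaces.
    text = re.sub(r"[$€£,\s]", "", text)

    if not _NUMBER.match(text):
        return original  # not a number, e.g. "Revenue" or "(unaudited)"

    value = float(text)
    if negative:
        value = -abs(value)
    return value / 100 if percent else value
```

For example, `parse_cell("$1,234")` returns `1234.0` and `parse_cell("(123)")` returns `-123.0`, while labels such as `"Revenue"` pass through unchanged. Percentages are divided by 100 so that an Excel percent format can be applied to the stored fraction.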

3. Excel Formatting Engine

Built using XlsxWriter, the tool:

Creates one sheet per table
Applies bold header styling
Auto-resizes columns
Adds thin borders to actual table area
Freezes header row
Supports two naming modes:
- sequential → Table 1, Table 2, ...
- by_page → Page 1 Table 1

The goal is to deliver Excel files that analysts actually want to use.
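
A rough sketch of these formatting steps with XlsxWriter follows. It is an assumption-laden illustration, not the package's code: the helper names are hypothetical, tables are passed as plain lists of rows with the header first, and table numbering in `by_page` mode is kept sequential across pages, which may differ from the package's behavior.

```python
def sheet_title(mode: str, table_no: int, page_no: int = 1) -> str:
    """Build a sheet name for the two naming modes described above."""
    if mode == "sequential":
        return f"Table {table_no}"
    if mode == "by_page":
        return f"Page {page_no} Table {table_no}"
    raise ValueError(f"unknown naming mode: {mode!r}")


def column_widths(rows, padding: int = 2, max_width: int = 60) -> list[int]:
    """Approximate auto-fit: longest rendered value per column plus padding."""
    widths = []
    for column in zip(*rows):
        longest = max(len(str(value)) for value in column)
        widths.append(min(longest + padding, max_width))
    return widths


def write_workbook(path: str, tables, mode: str = "sequential") -> None:
    """tables: list of (page_no, rows) pairs, where rows[0] is the header row."""
    import xlsxwriter  # imported lazily so the helpers above have no dependencies

    workbook = xlsxwriter.Workbook(path)
    header_fmt = workbook.add_format({"bold": True, "border": 1})
    cell_fmt = workbook.add_format({"border": 1})  # thin border

    for table_no, (page_no, rows) in enumerate(tables, start=1):
        sheet = workbook.add_worksheet(sheet_title(mode, table_no, page_no))
        for r, row in enumerate(rows):
            # Borders are applied only to cells that belong to the table.
            sheet.write_row(r, 0, row, header_fmt if r == 0 else cell_fmt)
        for c, width in enumerate(column_widths(rows)):
            sheet.set_column(c, c, width)
        sheet.freeze_panes(1, 0)  # keep the header row visible while scrolling
    workbook.close()
```

Excel itself has no auto-fit setting that can be stored in the file, so widths are estimated from the text length before writing.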

Design Principles Behind the Library

When designing pdf-tables-to-excel, I followed three core principles:

1. Zero Manual Cleanup

Tools that return DataFrames still leave analysts to fix formatting. This library makes the Excel output the final product, with:

Clean numeric types
Percent/currency conversion
Auto column widths
Proper borders
Consistent sheet naming

2. Predictable Behavior

Given the same PDF, the tool should always produce the same output. Many extractors produce different results if whitespace changes even slightly. Building on Docling's structured layout model makes extraction far more stable.

3. Batteries Included

Analysts should not need to write helper scripts for each report. Everything — from table detection to Excel styling — is packaged into a single function call. This keeps the API clean while still allowing advanced customization through optional parameters.

Extended Code Example

Basic Usage

    Python

    from pdf_tables_to_excel import convert_pdf_to_excel

    convert_pdf_to_excel(
        input_pdf="Annual_Report.pdf",
        output_xlsx="extracted_tables.xlsx",
        sheet_naming="by_page",
        include_empty=False,
    )

Example workflow:

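    Python

    # Assumed import path: extract_tables is taken to be exported by the same package.
    from pdf_tables_to_excel import extract_tables

    # Inspect each detected table before exporting it.
    tables = extract_tables("Earnings_Release.pdf")
    for t in tables:
        print(t.source_page, t.df.head())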

Real-World Use Cases

1. Financial Services

Extract tables from:

Earnings releases
Trustee reports
Loan servicing tapes
Regulatory disclosures (10-K, 10-Q)
Fund factsheets

2. Data Science / ML Pipelines

Convert PDF datasets into structured inputs for feature engineering and modeling.

3. Enterprise Automation

Integrate in:

ETL workflows
RPA pipelines
Document intelligence systems
