Extracting Clean Excel Tables From PDFs Using Python + Docling
Extract tables from PDFs to fully formatted, analysis-ready Excel files with pdf-tables-to-excel, supporting OCR, complex layouts, and numeric parsing.
PDFs remain the most widely used format for distributing structured reports: financial statements, regulatory filings, research documents, fund fact sheets, and more. Yet despite their structured appearance, PDFs carry no machine-readable table structure. Extracting tables reliably is famously error-prone and often requires hours of manual cleanup.
This is especially true in finance and enterprise environments where analysts rely on Excel for modeling and reporting.
To address this challenge, I built an open-source Python package:
pdf-tables-to-excel
A tool designed to detect, extract, and export clean, analysis-ready Excel tables from any PDF — powered by Docling’s state-of-the-art document parsing.
Install it in seconds:
pip install pdf-tables-to-excel
This article walks through the motivation, engineering decisions, architecture, and practical workflows behind the tool.
Why I Built This (The Real Motivation)
I’ve spent years working with technical users in financial services — quant teams, credit analysts, portfolio researchers, operations, and data engineering groups. Across all of them, I repeatedly observed one universal pain point:
Extracting tables from PDFs feels like manual data entry.
Open-source libraries such as Camelot, Tabula, pdfplumber, and PyPDF2 stop at returning a Pandas DataFrame or raw text. Analysts then still need to:
- Fix column alignments
- Convert text percentages to real numeric Excel percentages
- Handle negative values shown as (1.3%)
- Unmerge headers
- Manually build and style Excel sheets
- Split multi-table PDFs into separate DataFrames and sheets
- Apply proper borders
- Auto-fit column widths
- Preserve header formatting
- Handle OCR for scanned PDFs
Every team reinvented the same 200 lines of conversion code.
Some tools extract data, but none produce an Excel file that a human can use immediately without additional cleanup. Additionally:
- Many finance documents contain complex layouts.
- Merged headers and percentage columns fail silently.
- Currency formats lose precision.
- Most libraries misdetect table boundaries.
- Scanned PDFs require unreliable OCR workflows.
This motivated me to build a unified tool, where table extraction + Excel formatting are bundled into a single operation.
Why Docling? The Accuracy Advantage
Out of all PDF parsing libraries available today, Docling stands out in terms of:
Layout-Awareness
Docling understands the document structure, not just the text.
Table Geometry Detection
It identifies rows, columns, spans, merges, and alignment.
OCR Integration
Scanned PDFs are handled via RapidOCR or EasyOCR.
High Accuracy on Complex Documents
This includes financial tables with:
- multi-line headers
- merged spans
- nested table regions
- footnotes and annotations
- alternating row patterns
Robust Against Inconsistent Table Borders
Docling detects tables even when:
- Borders are missing
- Cells are visually misaligned
- Fonts vary
- Whitespace is inconsistent
This means the extracted DataFrames are significantly cleaner than what most legacy tools produce. Internally, the pipeline looks like this:
PDF →
Docling Layout Analyzer →
Table Structure Detection →
TableItem → Pandas DataFrame →
Excel Formatting Engine →
Styled Workbook (.xlsx)
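The first half of this pipeline can be sketched with Docling's public API. This is a minimal sketch under assumptions, not the package's internal code: `pdf_to_dataframes` is a hypothetical helper name, and it assumes `docling` and `pandas` are installed. Some Docling releases also accept the parent document as a `doc` argument to `export_to_dataframe`, so check the version you have.

```python
from pathlib import Path


def pdf_to_dataframes(pdf_path):
    """Return one pandas DataFrame per table Docling detects in the PDF.

    Hypothetical helper for illustration; not part of pdf-tables-to-excel.
    """
    # Imported lazily so this module loads even where Docling is missing.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(Path(pdf_path))

    # result.document.tables holds the detected TableItem objects.
    return [table.export_to_dataframe() for table in result.document.tables]
```

Everything after this point in the pipeline (numeric cleanup and Excel styling) operates on the returned DataFrames.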
And that’s where this package shines.
Why Extracting Tables from PDFs Is Harder Than It Looks
Although PDFs appear structured, they are fundamentally graphic layout containers, not semantic documents. A “table” inside a PDF is often just text placed at aligned coordinates. There is no real concept of:
- Rows
- Columns
- Spans
- Cell types
- Numeric formats
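To make this concrete, here is a small sketch of the work an extractor has to do just to recover rows: group positioned words by their vertical coordinate, then order each group from left to right. The `(x, y, text)` tuples, the `tolerance` value, and the helper name are made up for illustration; real extractors work with much richer geometry.

```python
def group_words_into_rows(words, tolerance=2.0):
    """Cluster (x, y, text) words into rows by similar y, sorted by x."""
    rows = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(words, key=lambda w: w[1]):
        if rows and abs(y - rows[-1][0]) <= tolerance:
            # Close enough to the current row's baseline: same row.
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Toy input: y grows downward here, and baselines are slightly uneven.
words = [
    (200.0, 100.4, "2023"),
    (50.0, 100.0, "Revenue"),
    (50.0, 115.0, "Costs"),
    (200.0, 115.3, "(1.3%)"),
]
print(group_words_into_rows(words))
# [['Revenue', '2023'], ['Costs', '(1.3%)']]
```

Even this toy version depends on a hand-picked tolerance, which hints at why coordinate-only approaches become fragile on real documents.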
This is why most extractors fail when dealing with:
1. Merged Headers
Financial tables frequently contain two or three header rows representing categories and subcategories. Traditional extractors flatten them incorrectly, losing context.
2. Parentheses for Negative Numbers
Accountants often express negatives as (123) instead of -123. OCR and text-based extractors usually treat this as text.
3. Lack of Borders
Some PDFs remove table lines for better readability, making geometric detection unreliable.
4. Complex Cell Spanning
A single header may span 4–5 columns; most tools misalign these structures.
5. Scanned PDFs
OCR introduces noise, misread digits, and extra whitespace.
By integrating Docling and adding post-processing layers for numeric parsing, this library removes most of these obstacles, producing consistently structured DataFrames that convert cleanly into Excel.
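As an illustration of the merged-header problem in item 1, one common way to keep category context is to forward-fill spanned cells and then join the header rows column by column. This is a generic sketch with a hypothetical `flatten_headers` helper, not the package's actual implementation. It assumes a merged cell shows up as one filled cell followed by empty ones, so a genuinely blank header cell would also be filled.

```python
def flatten_headers(header_rows, sep=" | "):
    """Combine multi-row headers into one label per column.

    Empty cells are treated as the continuation of a merged cell to
    their left and are forward-filled before the rows are joined.
    """
    filled = []
    for row in header_rows:
        current = ""
        filled_row = []
        for cell in row:
            cell = cell.strip()
            if cell:
                current = cell
            filled_row.append(current)
        filled.append(filled_row)

    labels = []
    for column in zip(*filled):
        parts = []
        for part in column:
            # Skip blanks and consecutive duplicates.
            if part and (not parts or parts[-1] != part):
                parts.append(part)
        labels.append(sep.join(parts))
    return labels


headers = [
    ["", "2023", "", "2022", ""],
    ["Segment", "Revenue", "Margin", "Revenue", "Margin"],
]
print(flatten_headers(headers))
# ['Segment', '2023 | Revenue', '2023 | Margin', '2022 | Revenue', '2022 | Margin']
```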
How this library differs from existing tools:
| Feature | Typical PDF table extractors | pdf-tables-to-excel |
|---|---|---|
| Table detection | Often inconsistent | Docling-based, high accuracy |
| Output | Pandas DataFrame only | Fully formatted Excel file |
| Multi-table support | Manual | Automatic, one sheet per table |
| Borders & formatting | No | Yes (clean, minimal Excel formatting) |
| Auto column width | No | Yes |
| Numeric parsing | Limited or none | Currency, percentages, negatives |
| CLI support | Rare | Yes (pdf2styledexcel) |
| OCR support | Optional, unreliable | Built-in via Docling’s OCR layers |
| Finance-ready? | ❌ | ✔ |
This isn’t “just another wrapper.” It’s opinionated software created specifically to solve an end-to-end workflow problem.
Technical Architecture
1. Table Extraction Engine (Docling)
Docling outputs TableItem objects, each containing:
- cell geometry
- spans
- text content
- header blocks
- row alignments
- confidence scores
These are converted into Pandas DataFrames with robust normalization.
2. Normalization Layer
This is where the tool outperforms general-purpose extractors.
- Converts 21.8% → 0.218
- Converts (4.3%) → -0.043
- Converts $1,234 → 1234.0
- Detects negatives in parentheses
- Handles thousands separators
- Cleans whitespace
- Supports missing values gracefully
This ensures Excel receives real numeric values, not text strings.
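The conversions listed above can be approximated in a few lines. The following is a minimal sketch, not the library's own parser: `parse_numeric` is a hypothetical name, the set of blank markers is an assumption, and it assumes US-style formatting where the comma is a thousands separator.

```python
import re


def parse_numeric(text):
    """Convert strings such as '21.8%', '(4.3%)' or '$1,234' to floats.

    Returns None for blank cells and the original string when the cell
    does not look numeric.
    """
    if text is None:
        return None
    cleaned = text.strip()
    if cleaned in {"", "-", "N/A"}:
        return None

    # Accounting style: parentheses mean a negative value.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1].strip()

    percent = cleaned.endswith("%")
    if percent:
        cleaned = cleaned[:-1].strip()

    # Drop currency symbols, thousands separators, and inner whitespace.
    cleaned = re.sub(r"[$€£,\s]", "", cleaned)

    if not re.fullmatch(r"[-+]?\d*\.?\d+", cleaned):
        return text  # leave non-numeric cells untouched

    value = float(cleaned)
    if percent:
        value /= 100
    return -value if negative else value


for raw in ["21.8%", "(4.3%)", "$1,234", "Total", ""]:
    print(repr(raw), "->", repr(parse_numeric(raw)))
```

Returning the original string for non-numeric cells keeps labels such as "Total" intact, so the function can be applied to every cell of a table.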
3. Excel Formatting Engine
Built using XlsxWriter, the tool:
- Creates one sheet per table
- Applies bold header styling
- Auto-resizes columns
- Adds thin borders to actual table area
- Freezes header row
- Supports two naming modes:
  - sequential → Table 1, Table 2, ...
  - by_page → Page 1 Table 1
The goal is to deliver Excel files that analysts actually want to use.
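Two of the steps above, sheet naming and column sizing, are easy to sketch without any Excel library. The helpers below are hypothetical and only illustrate the idea: the per-page counter used for `by_page` is an assumption about how that mode numbers tables, and the padding and width cap are arbitrary choices.

```python
def sheet_names(table_pages, mode="sequential"):
    """Build one sheet name per table.

    table_pages: the 1-based source page of each table, in document order.
    """
    names = []
    per_page = {}
    for index, page in enumerate(table_pages, start=1):
        if mode == "sequential":
            name = f"Table {index}"
        elif mode == "by_page":
            per_page[page] = per_page.get(page, 0) + 1
            name = f"Page {page} Table {per_page[page]}"
        else:
            raise ValueError(f"unknown naming mode: {mode!r}")
        names.append(name[:31])  # Excel caps sheet names at 31 characters
    return names


def column_widths(rows, padding=2, max_width=60):
    """Estimate a width per column from the longest cell text."""
    widths = []
    for row in rows:
        for i, cell in enumerate(row):
            length = len(str(cell)) + padding
            if i == len(widths):
                widths.append(length)
            else:
                widths[i] = max(widths[i], length)
    return [min(w, max_width) for w in widths]


print(sheet_names([1, 1, 3], mode="by_page"))
# ['Page 1 Table 1', 'Page 1 Table 2', 'Page 3 Table 1']
print(column_widths([["Segment", "Revenue"], ["Europe", 1234.5]]))
# [9, 9]
```

With XlsxWriter, widths like these can be applied through `worksheet.set_column(first_col, last_col, width)`, and the header row can be pinned with `worksheet.freeze_panes(1, 0)`.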
Design Principles Behind the Library
When designing pdf-tables-to-excel, I followed three core principles:
1. Zero Manual Cleanup
Tools that return DataFrames still leave analysts to fix formatting. This library makes the Excel output the final product, with:
- Clean numeric types
- Percent/currency conversion
- Auto column widths
- Proper borders
- Consistent sheet naming
2. Predictable Behavior
Given the same PDF, the output should always be the same. Many extractors produce different results if whitespace changes even slightly. By using Docling’s structured layout model, extraction becomes far more stable.
3. Batteries Included
Analysts should not need to write helper scripts for each report. Everything — from table detection to Excel styling — is packaged into a single function call. This keeps the API clean while still allowing advanced customization through optional parameters.
Extended Code Example
Basic Usage
from pdf_tables_to_excel import convert_pdf_to_excel

convert_pdf_to_excel(
    input_pdf="Annual_Report.pdf",
    output_xlsx="extracted_tables.xlsx",
    sheet_naming="by_page",
    include_empty=False,
)
Example workflow:
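# Assumes extract_tables is importable from the package's top level.
from pdf_tables_to_excel import extract_tables

tables = extract_tables("Earnings_Release.pdf")
for t in tables:
    print(t.source_page, t.df.head())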
Real-World Use Cases
1. Financial Services
Extract tables from:
- Earnings releases
- Trustee reports
- Loan servicing tapes
- Regulatory disclosures (10-K, 10-Q)
- Fund factsheets
2. Data Science / ML Pipelines
Convert PDF datasets into structured inputs for feature engineering and modeling.
3. Enterprise Automation
Integrate in:
- ETL workflows
- RPA pipelines
- Document intelligence systems
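For automation scenarios like these, a thin batch wrapper is often enough. The sketch below is an assumption about how one might wire it up rather than a feature of the package: `batch_convert` is a hypothetical helper, and the converter is passed in as a callable that accepts the `input_pdf` and `output_xlsx` keyword arguments shown in Basic Usage.

```python
from pathlib import Path


def batch_convert(input_dir, output_dir, convert):
    """Convert every PDF in input_dir and return (written, failed) lists.

    convert: a callable taking input_pdf and output_xlsx keyword
    arguments, for example convert_pdf_to_excel.
    """
    out_root = Path(output_dir)
    out_root.mkdir(parents=True, exist_ok=True)

    written, failed = [], []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        target = out_root / (pdf.stem + ".xlsx")
        try:
            convert(input_pdf=str(pdf), output_xlsx=str(target))
            written.append(target)
        except Exception as exc:  # keep the batch going on bad files
            failed.append((pdf, exc))
    return written, failed
```

A call such as `batch_convert("reports", "excel", convert_pdf_to_excel)` would then produce one workbook per PDF and report the files that failed, which is usually what an ETL job needs to log.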