TrapRange: a Method to Extract Table Content in PDF Files

This article introduces TrapRange, a method for detecting and extracting table content from PDF files.

By Tho Luong · Updated Jun. 25, 20 · Tutorial

Introduction

Tables are among the most important structures in a document. Data exported from enterprise systems, in particular, is usually in tabular form.

Several file formats are commonly used to store tabular content, including CSV, plain text, and PDF. The first two are straightforward to process: open the file, loop through the lines, and split each line into cells on the appropriate separator. Many libraries already do this.

PDF is a different story because the format has no dedicated construct for tabular content, nothing like the table, tr, and td tags in HTML. PDF is a complicated format in which text, fonts, and styling can be mixed with images, audio, and video. Below is my proposed solution for extracting data from PDF files with high-density tabular content.

How to Detect a Table

After some investigation, I realized that:

  • Column: The text in the cells of one column lies within a rectangular region that does not overlap the region of any other column. For example, in the following image, the red rectangle and the blue rectangle are separate regions.
  • Row: Words on the same horizontal line belong to the same row. This condition is sufficient but not necessary, because a cell may span several lines. For example, the fourth cell in the yellow rectangle has two lines: the phrases "FK to this customer's record in" and "Ledgers table" are not on the same horizontal line, yet they belong to the same row. In my solution, I simply assume that each cell contains a single line, so different lines of a cell are treated as different rows. The content of the yellow rectangle therefore yields two rows: 1. {"Ledger_ID", "|", "Sales Ledger Account", "FK to this customer's record in"} 2. {NULL, NULL, NULL, "Ledgers table"}

[Figure: recognizing a table]

PDFBox API

TrapRange is built on PDFBox, the best PDF library I know of so far. To extract text from a PDF file, I rely on four PDFBox classes: PDDocument, PDPage, and TextPosition, described below, plus PDFTextStripper, which delivers the TextPosition objects. The method names below are those of the PDFBox 1.x API.

  • PDDocument: holds the entire PDF file. To load a PDF file, we use the method PDDocument.load(stream: InputStream).
  • PDPage: represents one page of the PDF document. We can retrieve a specific page by passing its index to document.getDocumentCatalog().getAllPages().get(pageIdx: int).
  • TextPosition: represents an individual word or character in the document. We can collect all TextPosition objects of a PDPage by overriding the method processTextPosition(text: TextPosition) in the class PDFTextStripper. A TextPosition object has the methods getX(), getY(), getWidth(), and getHeight(), which return its bounds on the page, and the method getCharacter(), which returns its content.

In my work, I process text chunks directly through TextPosition objects. Each text chunk in the PDF file yields a text element with the following attributes:

  • x: horizontal distance from the left border of the page
  • y: vertical distance from the top border of the page
  • maxX: equals x + width of the text chunk
  • maxY: equals y + height of the text chunk

[Figure: TextPosition rectangle]
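
The four attributes above can be captured in a small value class. The following is only a minimal sketch: the class name TextChunk and its constructor are hypothetical and are not taken from PDFBox or from the TrapRange source.

```java
/** Hypothetical model of one text chunk; field names mirror the attributes listed above. */
final class TextChunk {
    final String text;
    final double x;      // horizontal distance from the left border of the page
    final double y;      // vertical distance from the top border of the page
    final double width;
    final double height;

    TextChunk(String text, double x, double y, double width, double height) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    /** Right edge of the chunk: x + width. */
    double maxX() {
        return x + width;
    }

    /** Bottom edge of the chunk: y + height. */
    double maxY() {
        return y + height;
    }
}
```

The pair (x, maxX) is the chunk's projection onto the horizontal axis, and (y, maxY) is its projection onto the vertical axis.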

Trap Ranges

The most important step is identifying the bounds of each row and column. Once we know the bounds of a row or column, we can retrieve all the text in it, and from there we can easily extract the whole table and put it into a structured model. We call these bounds trap-ranges. A TrapRange has two attributes:

  • lowerBound: the lower endpoint of the range
  • upperBound: the upper endpoint of the range

To calculate the trap-ranges, we loop through all text chunks of the page, project the range of each chunk onto the horizontal and vertical axes, and join the projections together. When the loop finishes, the joined ranges are the trap-ranges, which we then use to identify the cells of the table.

[Figure: example of joining ranges]
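
One way to picture the join operation is as inserting a new range into a sorted list of disjoint ranges and absorbing every range it overlaps. The sketch below is an illustration under that assumption, not the TrapRange implementation: the names Range and RangeJoiner are hypothetical, and treating ranges that merely touch as overlapping is a choice made here for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical range type with the two attributes described above. */
final class Range {
    final double lowerBound;
    final double upperBound;

    Range(double lowerBound, double upperBound) {
        this.lowerBound = lowerBound;
        this.upperBound = upperBound;
    }
}

final class RangeJoiner {
    /**
     * Joins one range into a list that is sorted by lower bound and pairwise disjoint.
     * Returns a new list with the same properties.
     */
    static List<Range> join(List<Range> ranges, Range added) {
        List<Range> result = new ArrayList<>();
        double lower = added.lowerBound;
        double upper = added.upperBound;
        boolean placed = false;
        for (Range r : ranges) {
            if (r.upperBound < lower) {
                result.add(r); // lies entirely before the new range
            } else if (r.lowerBound > upper) {
                if (!placed) {
                    result.add(new Range(lower, upper)); // emit the merged range once
                    placed = true;
                }
                result.add(r); // lies entirely after the new range
            } else {
                // overlaps (or touches) the new range: absorb it
                lower = Math.min(lower, r.lowerBound);
                upper = Math.max(upper, r.upperBound);
            }
        }
        if (!placed) {
            result.add(new Range(lower, upper));
        }
        return result;
    }
}
```

For example, joining (10, 20) and (30, 40) gives two ranges, and then joining (18, 32) bridges them into the single range (10, 40). This is also why a chunk that straddles two columns causes those columns to be merged.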

Algorithm 1: calculating trap-ranges for each PDF page:

    columnTrapRanges <-- []
    rowTrapRanges <-- []
    for each text in page
    begin
         columnTrapRanges <-- join(columnTrapRanges, {text.x, text.x + text.width} )
         rowTrapRanges <-- join(rowTrapRanges, {text.y, text.y + text.height} )
    end
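
Algorithm 1 can be sketched in Java as follows. Note that this sketch swaps the incremental join for a batch variant (collect all projections, sort them by lower bound, then sweep and merge), which should produce the same disjoint ranges. The class name TrapRanges and the representation of a chunk as a {x, y, width, height} array are assumptions made for this illustration only.

```java
import java.util.ArrayList;
import java.util.List;

final class TrapRanges {
    /** Merges intervals given as {lower, upper} into sorted, disjoint ranges. */
    static List<double[]> merge(List<double[]> intervals) {
        List<double[]> sorted = new ArrayList<>(intervals);
        sorted.sort((a, b) -> Double.compare(a[0], b[0]));
        List<double[]> merged = new ArrayList<>();
        for (double[] r : sorted) {
            double[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && r[0] <= last[1]) {
                last[1] = Math.max(last[1], r[1]); // overlaps the previous range: extend it
            } else {
                merged.add(new double[] {r[0], r[1]}); // starts a new range
            }
        }
        return merged;
    }

    /** Projects each chunk {x, y, width, height} onto the horizontal axis and merges. */
    static List<double[]> columnRanges(List<double[]> chunks) {
        List<double[]> projections = new ArrayList<>();
        for (double[] c : chunks) {
            projections.add(new double[] {c[0], c[0] + c[2]});
        }
        return merge(projections);
    }

    /** Projects each chunk {x, y, width, height} onto the vertical axis and merges. */
    static List<double[]> rowRanges(List<double[]> chunks) {
        List<double[]> projections = new ArrayList<>();
        for (double[] c : chunks) {
            projections.add(new double[] {c[1], c[1] + c[3]});
        }
        return merge(projections);
    }
}
```

Sorting first costs O(n log n) for n chunks and keeps the merge step a single pass.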



After calculating trap-ranges for the table, we loop through all texts again and classify them into correct cells of the table.

Algorithm 2: classifying text chunks into correct cells:

    table <-- new Table()
    for each text in page
    begin
         rowIdx <-- in rowTrapRanges, get index of the range that contains this text
         columnIdx <-- in columnTrapRanges, get index of the range that contains this text
         table.addText(text, rowIdx, columnIdx)
    end
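
Algorithm 2 might look like this in Java. This is a sketch under several assumptions: the class name CellClassifier is hypothetical, ranges are {lower, upper} arrays, chunk boxes are {x, y, width, height} arrays, the table is a plain String grid rather than the Table class of the library, and chunks that land in the same cell are concatenated in the order they arrive.

```java
final class CellClassifier {
    /** Returns the index of the first range that contains [lower, upper], or -1 if none does. */
    static int indexOf(double[][] ranges, double lower, double upper) {
        for (int i = 0; i < ranges.length; i++) {
            if (ranges[i][0] <= lower && upper <= ranges[i][1]) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Places each text into the grid cell whose row and column ranges contain its box.
     * boxes[i] is {x, y, width, height} for texts[i]. Cells with no text stay null.
     */
    static String[][] classify(double[][] rowRanges, double[][] columnRanges,
                               String[] texts, double[][] boxes) {
        String[][] table = new String[rowRanges.length][columnRanges.length];
        for (int i = 0; i < texts.length; i++) {
            double[] b = boxes[i];
            int rowIdx = indexOf(rowRanges, b[1], b[1] + b[3]);
            int columnIdx = indexOf(columnRanges, b[0], b[0] + b[2]);
            if (rowIdx < 0 || columnIdx < 0) {
                continue; // the chunk falls outside every known range
            }
            String existing = table[rowIdx][columnIdx];
            table[rowIdx][columnIdx] = existing == null ? texts[i] : existing + texts[i];
        }
        return table;
    }
}
```

A linear scan keeps the sketch short; because trap-ranges are sorted and disjoint, a binary search would also work for pages with many rows.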


Design and Implementation

[Figure: TrapRange class diagram]

The diagram above shows the main classes:

  • TrapRangeBuilder: its build() method calculates and returns the ranges.
  • Table, TableRow, and TableCell: the table data structure.
  • PDFTableExtractor: the most important class. It contains the methods that configure the extraction and extract table data from PDF files, and it follows the builder pattern. Its main methods are:
      • setSource: sets the source PDF file. There are three overloads: setSource(InputStream), setSource(File), and setSource(String).
      • addPage: selects a page to process. By default, all pages are processed.
      • exceptPage: skips a page.
      • exceptLine: skips a noisy line. All text on the excluded lines is ignored.
      • extract: runs the extraction and returns the result.

Example

    PDFTableExtractor extractor = new PDFTableExtractor();
    List<Table> tables = extractor.setSource("table.PDF")
        .addPage(0)
        .addPage(1)
        .exceptLine(0)  // the first line in each page
        .exceptLine(1)  // the second line in each page
        .exceptLine(-1) // the last line in each page
        .extract();
    String html = tables.get(0).toHtml();  // table in HTML format
    String csv = tables.get(0).toString(); // table in CSV format, using a semicolon as the delimiter
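
The example passes both non-negative and negative indices to exceptLine, with -1 meaning the last line of each page. One plausible way to interpret such indices, shown purely as an illustration, is to resolve them against the number of lines found on the page. The class LineFilter and its resolve method are hypothetical and are not part of the TrapRange API.

```java
import java.util.Set;
import java.util.TreeSet;

final class LineFilter {
    /**
     * Converts possibly negative line indices into absolute indices for a page
     * with lineCount lines. Index -1 maps to the last line, -2 to the one before it.
     * Indices that fall outside the page are dropped.
     */
    static Set<Integer> resolve(int[] exceptLines, int lineCount) {
        Set<Integer> resolved = new TreeSet<>();
        for (int idx : exceptLines) {
            int absolute = idx < 0 ? lineCount + idx : idx;
            if (absolute >= 0 && absolute < lineCount) {
                resolved.add(absolute);
            }
        }
        return resolved;
    }
}
```

Resolving per page matters because pages can have different line counts, so the same -1 may refer to a different absolute line on each page.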


Below are some sample results (to reproduce them, check out the project and run the test file TestExtractor.java):

  • Sample 1: Source: sample-1.pdf, result: sample-1.html
  • Sample 2: Source: sample-2.pdf, result: sample-2.html
  • Sample 3: Source: sample-3.pdf, result: sample-3.html
  • Sample 4: Source: sample-4.pdf, result: sample-4.html
  • Sample 5: Source: sample-5.pdf, result: sample-5.html

Evaluation

In my experiments, I used PDF files with a high density of table content. The results show that my implementation detects tabular content better than other open-source tools (pdftotext, pdftohtml, and pdf2table). The method does not work well on documents that contain multiple tables or too much noisy data. In addition, if cells in a row overlap, the columns of those cells are merged.

Conclusion

TrapRange works best on PDF files with a high density of table data. For documents that contain multiple tables or too much noisy data, TrapRange is not a good choice. The method can also be implemented in other programming languages, either by replacing PDFBox with a corresponding PDF library or by using the command-line tool pdftohtml to extract text chunks and feeding them to Algorithms 1 and 2.

Visit and fork my project at: https://github.com/thoqbk/traprange

References

  1. http://en.wikipedia.org/wiki/Portable_Document_Format
  2. http://pdfbox.apache.org
  3. http://ieg.ifs.tuwien.ac.at/pub/yildiz_iicai_2005.pdf
  4. http://www.foolabs.com/xpdf/
  5. http://ieg.ifs.tuwien.ac.at/projects/pdf2table/

Published at DZone with permission of Tho Luong. See the original article here.
