Extracting Tables from PDFs in Javascript with PDF.js

By Gary Sieling · Dec. 26, 13 · Interview

A common and difficult problem in acquiring data is extracting tables from a PDF. Previously, I described how to extract the text from a PDF with PDF.js, a PDF rendering library made by Mozilla Labs.

The rendering process requires an HTML canvas object, and then draws each object (character, line, rectangle, etc.) on it. The easiest way to get a list of these is to intercept all the calls PDF.js makes to drawing functions on the canvas object. (See "Self Modifying JavaScripts" for a similar technique.) The "replace" function below adds a wrapper closure to each function, which logs the call.

// Wrap a single method on the context so that every call is logged
// before being passed through to the original implementation.
function replace(ctx, key) {
  var val = ctx[key];
  if (typeof val === "function") {
    ctx[key] = function() {
      var args = Array.prototype.slice.call(arguments);
      console.log("called " + key + "(" + args.join(",") + ")");
      return val.apply(ctx, args);
    };
  }
}

// Wrap every function exposed by the canvas 2D context.
for (var k in context) {
  replace(context, k);
}

var renderContext = {
  canvasContext: context,
  viewport: viewport
};

page.render(renderContext);

This lets us see a series of calls:

called transform(1,0,0,1,150.42,539.67)
called translate(0,0)
called scale(1,-1)
called scale(0.752625,0.752625)
called measureText(c)
called save()
called scale(0.9701818181818181,1)
called fillText(c,0,0)
called restore()
called restore()
called save()
called transform(1,0,0,1,150.42,539.6
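As a variation on the wrapper above, the calls can be collected in an array instead of printed, which makes them easier to post-process. This is only a minimal sketch: `wrapAll`, `calls`, and the stand-in `fakeContext` object are hypothetical names used for illustration, and a real run would pass in the 2D canvas context that is handed to PDF.js.

```javascript
// Collect every method call made on an object instead of logging it.
function wrapAll(target, calls) {
  for (var key in target) {
    (function (name, original) {
      if (typeof original !== "function") {
        return; // leave plain properties untouched
      }
      target[name] = function () {
        var args = Array.prototype.slice.call(arguments);
        calls.push({ name: name, args: args });
        return original.apply(target, args);
      };
    })(key, target[key]);
  }
  return calls;
}

// Hypothetical stand-in for a canvas context (not a real canvas).
var fakeContext = {
  font: "10px serif",
  save: function () {},
  restore: function () {},
  scale: function (x, y) {},
  fillText: function (text, x, y) {}
};

var calls = wrapAll(fakeContext, []);
fakeContext.save();
fakeContext.scale(1, -1);
fakeContext.fillText("C", 0, 0);
fakeContext.restore();

// Print the collected calls in the same format as the log above.
console.log(calls.map(function (c) {
  return "called " + c.name + "(" + c.args.join(",") + ")";
}).join("\n"));
```

Keeping the name and arguments as structured data means later steps can filter by method name rather than parsing log lines.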

We can easily retrieve the text by noting the first argument to each "fillText" call:

"congregations ranked by growth and decline in membership and worship attendance, 2006 to 2011philadelphia presbytery - table 16net
membership changenet worship changepercent changepercent changeworship
 2006worship 2011membership 2006membership 2011abington, abington-
143(74)-13.18%(57)0(15)0.00%(22)numberrank3003001,085942anchor,
wrightstown0(23)0.00%(27)-12(25)-21.43%(52)numberrank56449797arch
street, philadelphia-117(71)-68.42%(117)27(5)90.00%
(2)numberrank305717154aston, aston3(21)3.53%(22)-5(19)-9.43%
(31)numberrank53488588beaconno reportboth yearsno reportboth
yearsnumberrankbensalem, bensalem-23(39)-13.94%(62)-28(36)-28.57%
(64)numberrank9870165142berean, philadelphia106(4)44.92%(4)no
reportboth yearsnumberrank00236342bethany collegiate, havertown-
188(76)-42.44%(110)43(3)21.29%(7)numberrank202245443255bethel,
philadelphia-13(33)-13.68%(60)-27(35)-35.06%
(71)numberrank77509582bethesda, philadelphia9(18)5.56%(18)no reportboth
yearsnumberrank1150162171beverly hills, upper darby-3(26)-3.03%
(32)-11(24)-20.00%(48)numberrank55449996bridesburg,
philadelphia0(23)0.00%(27)no reportboth yearsnumberrank004444bristol,
bristolno reportboth yearsno reportboth yearsnumberrankpage 1 of
10report prepared by research services, presbyterian church (u.s.a.)1-
800-728-7228, ext #204006-oct-12"

Notably, this doesn't track line endings, and not all the characters are recorded in the expected order (the first line is rendered after the second).
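Concatenating the first argument of each fillText call might look like the sketch below. It assumes the calls have been captured as objects with `name` and `args` fields; that shape, the `textFromCalls` helper, and the sample entries are all made up for illustration.

```javascript
// Made-up call log in a {name, args} shape (an assumption, not PDF.js output).
var callLog = [
  { name: "save", args: [] },
  { name: "fillText", args: ["P", 0, 0] },
  { name: "restore", args: [] },
  { name: "fillText", args: ["D", 0, 0] },
  { name: "fillText", args: ["F", 0, 0] }
];

// Keep only fillText calls and join their first argument, in call order.
function textFromCalls(log) {
  return log
    .filter(function (call) { return call.name === "fillText"; })
    .map(function (call) { return call.args[0]; })
    .join("");
}

var text = textFromCalls(callLog);
console.log(text);
```

Because the text is joined in call order with no position information, it inherits both problems described above: no line endings and rendering-order surprises.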

The calls to transform, translate, and scale control where text is placed. The fillText method also takes an (x, y) pair of parameters that shifts individual letters within and between words. The exact position is a combination of successive operations, which are modeled as a stack of matrix operations.

Thankfully, PDF.js tracks the output of these operations as it renders, so we don't have to recalculate it.
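Although the recalculation is not needed in practice, a small sketch may make the "stack of matrix operations" concrete. It stores each affine transform as six numbers [a, b, c, d, e, f], the same order the canvas transform() call takes, and replays the first few calls from the log shown earlier. `MatrixTracker` and its methods are hypothetical names and are not part of PDF.js.

```javascript
// Multiply two affine transforms stored as [a, b, c, d, e, f].
function multiply(m, n) {
  return [
    m[0] * n[0] + m[2] * n[1],
    m[1] * n[0] + m[3] * n[1],
    m[0] * n[2] + m[2] * n[3],
    m[1] * n[2] + m[3] * n[3],
    m[0] * n[4] + m[2] * n[5] + m[4],
    m[1] * n[4] + m[3] * n[5] + m[5]
  ];
}

// Hypothetical helper that mirrors the canvas transform state.
function MatrixTracker() {
  this.current = [1, 0, 0, 1, 0, 0]; // identity
  this.stack = [];
}
MatrixTracker.prototype.save = function () {
  this.stack.push(this.current.slice());
};
MatrixTracker.prototype.restore = function () {
  if (this.stack.length > 0) {
    this.current = this.stack.pop();
  }
};
MatrixTracker.prototype.transform = function (a, b, c, d, e, f) {
  this.current = multiply(this.current, [a, b, c, d, e, f]);
};
MatrixTracker.prototype.translate = function (x, y) {
  this.transform(1, 0, 0, 1, x, y);
};
MatrixTracker.prototype.scale = function (x, y) {
  this.transform(x, 0, 0, y, 0, 0);
};
// Map a point from the current user space to canvas space.
MatrixTracker.prototype.applyToPoint = function (x, y) {
  var m = this.current;
  return { x: m[0] * x + m[2] * y + m[4], y: m[1] * x + m[3] * y + m[5] };
};

// Replay the first calls from the log above.
var tracker = new MatrixTracker();
tracker.transform(1, 0, 0, 1, 150.42, 539.67);
tracker.translate(0, 0);
tracker.scale(1, -1);
tracker.scale(0.752625, 0.752625);

var origin = tracker.applyToPoint(0, 0);
console.log(tracker.current, origin);
```

The translation ends up in positions 4 and 5 of the matrix, so text drawn at (0, 0) lands at those coordinates regardless of the scale factors.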

Thus, we can make a method that records the letters and their real positions. This method takes the internal context object, the type of state transition, and the arguments to the transition. It is then called from the wrapper closure created by the "replace" function listed above.

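var chars = [];
var cur = {};

function record(ctx, state, args) {
  if (state === 'fillText') {
    var c = args[0];
    cur.c = c;
    // Elements 4 and 5 of the tracked matrix hold the current translation;
    // add the offsets passed to fillText to get the drawn position.
    cur.x = ctx._transformMatrix[4] + args[1];
    cur.y = ctx._transformMatrix[5] + args[2];

    chars.push(cur);
    cur = {};
  }
}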
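The wiring between the wrapper and the recording function is not spelled out, so here is one way it might look. This is a sketch under assumptions: `stubContext` is a hypothetical stand-in object, and its `_transformMatrix` values are filled in by hand, whereas the code above expects PDF.js to maintain that property on the real context.

```javascript
var chars = [];

// Keep each drawn string together with its computed position.
function record(ctx, state, args) {
  if (state === "fillText") {
    chars.push({
      c: args[0],
      x: ctx._transformMatrix[4] + args[1],
      y: ctx._transformMatrix[5] + args[2]
    });
  }
}

// Wrap one method so that record() sees the call before it is forwarded.
function replace(ctx, key) {
  var val = ctx[key];
  if (typeof val === "function") {
    ctx[key] = function () {
      var args = Array.prototype.slice.call(arguments);
      record(ctx, key, args);
      return val.apply(ctx, args);
    };
  }
}

// Hypothetical stand-in context with a hand-written matrix (assumption).
var stubContext = {
  _transformMatrix: [1, 0, 0, 1, 150.42, 539.67],
  fillText: function (text, x, y) {},
  save: function () {}
};

for (var k in stubContext) {
  replace(stubContext, k);
}

stubContext.save();
stubContext.fillText("C", 0, 0);
stubContext.fillText("o", 5, 0);
console.log(chars);
```

Only fillText calls produce entries; other calls such as save pass through the wrapper without being recorded.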

These results can be sorted by position (x and y). The sort function arranges letters by position: if two letters are shifted up or down by only a small amount, they are considered to be on the same line.

chars.sort(
  function(a, b) {
    var dx = b.x - a.x;
    var dy = b.y - a.y;

    // Nearly equal y values count as the same line, so order by x;
    // otherwise order by y.
    if (Math.abs(dy) < 0.5) {
      return dx * -1;
    } else {
      return dy * -1;
    }
  }
);
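A comparator that treats nearby y values as equal is not guaranteed to give a consistent ordering for every input, because "close enough" is not transitive. As an alternative technique (not the one used in this article), the marks can be grouped into lines first and then sorted by x within each line. The `groupIntoLines` helper and the sample marks below are hypothetical.

```javascript
// Group marks into lines: a mark joins the current line when its y is
// within `tolerance` of the first mark on that line.
function groupIntoLines(marks, tolerance) {
  var byY = marks.slice().sort(function (a, b) { return a.y - b.y; });
  var lines = [];
  byY.forEach(function (mark) {
    var line = lines[lines.length - 1];
    if (line && Math.abs(mark.y - line[0].y) < tolerance) {
      line.push(mark);
    } else {
      lines.push([mark]);
    }
  });
  // Within each line, order the marks from left to right.
  lines.forEach(function (line) {
    line.sort(function (a, b) { return a.x - b.x; });
  });
  return lines;
}

// Made-up marks: "a" and "b" share a line, "c" is on another one.
var sample = [
  { c: "b", x: 20, y: 10.2 },
  { c: "c", x: 5, y: 30 },
  { c: "a", x: 10, y: 10 }
];

var lines = groupIntoLines(sample, 0.5).map(function (line) {
  return line.map(function (m) { return m.c; }).join("");
});
console.log(lines);
```

The grouping uses the same 0.5 tolerance as the comparator above and, like it, orders lines by ascending y.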

This presents several difficulties: it doesn't detect right-to-left text, and it's becoming clear that we're going to have a hard time knowing when we're in a table and when we aren't.

To separate rows and columns, we define a function which transforms the array of letters and positions into CSV-style output. It tracks from letter to letter: if it sees a "large" change in y, it starts a new line, and if it sees a "large" change in x, it treats it as a new column.

The real challenge is defining "large", which for my test PDF was around 15 and 20, for dx and dy respectively.

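function getText(marks, ex, ey, v) {
  // Position of the previous mark; start at the first one.
  var x = marks[0].x;
  var y = marks[0].y;

  var txt = '';
  for (var i = 0; i < marks.length; i++) {
    var c = marks[i];
    var dx = c.x - x;
    var dy = c.y - y;

    // A large vertical jump closes the current row and opens a new one.
    if (Math.abs(dy) > ey) {
      txt += "\"\n\"";
      if (marks[i+1]) {
        // line feed - start from position of next line
        x = marks[i+1].x;
      }
    }

    // A large horizontal jump starts a new column.
    if (Math.abs(dx) > ex) {
      txt += "\",\"";
    }

    // Verbose mode prints the deltas between consecutive marks.
    if (v) {
      console.log(dx + ", " + dy);
    }

    txt += c.c;

    x = c.x;
    y = c.y;
  }

  return txt;
}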
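The verbose flag prints each delta, which is one way to look for suitable thresholds by eye. A possible alternative, sketched here, is to gather the deltas and sort them so that the jump between "within a word" and "between columns" stands out. `collectDeltas` and the sample marks are hypothetical and are not part of the original source.

```javascript
// Collect the absolute x and y jumps between consecutive marks,
// sorted ascending, so that natural break points are easier to spot.
function collectDeltas(marks) {
  var dxs = [];
  var dys = [];
  for (var i = 1; i < marks.length; i++) {
    dxs.push(Math.abs(marks[i].x - marks[i - 1].x));
    dys.push(Math.abs(marks[i].y - marks[i - 1].y));
  }
  var ascending = function (a, b) { return a - b; };
  return { dx: dxs.sort(ascending), dy: dys.sort(ascending) };
}

// Made-up marks: two close letters, a column gap, then a new line.
var sampleMarks = [
  { c: "a", x: 0, y: 100 },
  { c: "b", x: 6, y: 100 },
  { c: "c", x: 40, y: 100 },
  { c: "d", x: 0, y: 125 }
];

var deltas = collectDeltas(sampleMarks);
console.log(deltas.dx);
console.log(deltas.dy);
```

For this made-up input, a horizontal threshold between 6 and 34 and a vertical threshold below 25 would separate the columns and lines; real documents will need their own values.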

This algorithm doesn't handle newlines within rows, and oddly, the columns don't come out in the right order, although they appear to be consistently out of order. Lines with large spaces (e.g. around an em-dash) are detected as having multiple columns, but this can be cleaned up later.

Sample output is shown below, and the final source is available on GitHub.

congregations ranked by growth and decline in m","embership and w","orship attendance, 2006 to 2011"
"","philadelphia presbytery"," - table 16"
"","net ","membership ","change"
"","net worship ","change","percent ","change","percent ","change","worship"," 2006","worship"," 2011","membership"," 2006","membership"," 2011"
"","abington, abington","-143","(74)","-13.18%(57)","0","(15)","0.00%(22)","number","rank","300","300","1,085","942"
"","anchor, wrightstown","0","(23)","0.00%(27)","-12","(25)","-21.43%(52)","number","rank","56","44","97","97"
"","arch street, philadelphia","-117","(71)","-68.42%","(117)","27(5)","90.00%(2)","number","rank","30","57","171","54"
"","aston, aston","3","(21)","3.53%(22)","-5","(19)","-9.43%(31)","number","rank","53","48","85","88"
"","beacon","no report","both years","no report","both years","number","rank"
"","bensalem, bensalem","-23","(39)","-13.94%(62)","-28","(36)","-28.57%(64)","number","rank","98","70","165","142"
"","berean, philadelphia","106(4)","44.92%(4)","no report","both years","number","rank","0","0","236","342"
"","bethany collegiate, havertown","-188","(76)","-42.44%","(110)","43(3)","21.29%(7)","number","rank","202","245","443","255"
"","bethel, philadelphia","-13","(33)","-13.68%(60)","-27","(35)","-35.06%(71)","number","rank","77","50","95","82"
"","bethesda, philadelphia","9","(18)","5.56%(18)","no report","both years","number","rank","115","0","162","171"
"","beverly hills, upper darby","-3","(26)","-3.03%(32)","-11","(24)","-20.00%(48)","number","rank","55","44","99","96"
"","bridesburg, philadelphia","0","(23)","0.00%(27)","no report","both years","number","rank","0","0","44","44"
"","bristol, bristol","no report","both years","no report","both years","number","rank"
"","page 1 of 10","report prepared by research services, presbyterian church (u.s.a.)","1-800-728-7228, ext #2040","06-oct-12"
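As one possible cleanup step, the CSV-style string can be split into arrays of cells, dropping the empty cell that the row break leaves at the start of each row. This sketch assumes the string has no leading or trailing quote of its own and that cells never contain the `","` separator; `toRows` and the sample string are hypothetical.

```javascript
// Split CSV-style text into rows of cells.
function toRows(txt) {
  // Add the outer quotes so that every line has the form "a","b","c".
  return ('"' + txt + '"').split("\n").map(function (line) {
    var cells = line.slice(1, -1).split('","');
    // Rows after the first start with an empty cell; drop it.
    if (cells.length > 1 && cells[0] === "") {
      cells.shift();
    }
    return cells;
  });
}

// Made-up input in the same shape as the output above.
var sampleTxt = 'name","count"\n"","anchor, wrightstown","0","(23)';
var rows = toRows(sampleTxt);
console.log(rows);
```

Splitting on the full `","` separator rather than on a bare comma keeps cells such as "anchor, wrightstown" intact.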

Published at DZone with permission of Gary Sieling, DZone MVB. See the original article here.
