How to Extract Content From a PDF

Need to grab text or images from a PDF programmatically on your Mac? Then read on, because this post has a solution for you.

I've occasionally needed to extract text and/or images from a PDF. I've found a couple of easy, free ways to do this on macOS.

There's commercial software, such as Adobe Acrobat, that will extract images from a PDF, of course, but there's an easier way: The Unarchiver, a free application that treats a PDF file as if it were a zip file and extracts everything into a folder. Just install the app, right-click a PDF file, and choose Open With > The Unarchiver.

Related pro tip: if you want to extract all the images from a Keynote presentation, you can simply unzip the presentation with the command-line unzip utility. It'll expand into a folder that contains all the images and other assets (or you can right-click the file and open it with the Archive Utility app).
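Here's a minimal sketch of that unzip step. It assumes the presentation is a single zip-format file (not a package folder), and the function name extract_assets and the list of image extensions are my own placeholders, not anything Keynote defines:

```shell
#!/usr/bin/env bash
# extract_assets is a hypothetical helper name used for illustration only.
# It unpacks a zip-format document (such as a single-file .key presentation)
# into a folder and lists the image files found inside.
extract_assets() {
  local archive="$1" dest="$2"
  mkdir -p "$dest"
  # -q: quiet, -o: overwrite without prompting, -d: target directory
  unzip -q -o "$archive" -d "$dest" || return 1
  # List common image types anywhere in the unpacked tree.
  find "$dest" -type f \( -iname '*.png' -o -iname '*.jpg' \
    -o -iname '*.jpeg' -o -iname '*.tif' -o -iname '*.tiff' \)
}
```

You'd call it as something like extract_assets Talk.key Talk_assets, where both names are examples. If unzip complains that the file isn't an archive, the presentation is probably stored in a different layout and this approach won't apply as written.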

Mission accomplished, but you'll probably end up with a bunch of .tiff files where you'd rather have compact .jpg or compressed .png files. If you're a command-line user and you have ImageMagick installed, you can convert them all at once using Bash parameter expansion, like this:

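# Convert every .tiff under the current directory to .jpg (requires ImageMagick).
find . -name '*.tiff' | while IFS= read -r line; do
  # ${line%%tiff} strips the trailing "tiff"; appending "jpg" swaps the extension.
  convert "$line" "${line%%tiff}jpg"
done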
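If the expansion in that loop looks cryptic, here's a small sketch that isolates it so you can try it without ImageMagick. The swap_ext helper and the sample path are made up for illustration, and the helper assumes the file name actually has an extension:

```shell
#!/usr/bin/env bash
# swap_ext is a hypothetical helper, not part of the loop above.
# It prints PATH with its final extension replaced by NEW_EXT.
swap_ext() {
  local path="$1" new_ext="$2"
  # ${path%.*} removes the shortest trailing match of ".*", i.e. the last extension.
  printf '%s\n' "${path%.*}.${new_ext}"
}

line='./scans/page 1.tiff'
echo "${line%%tiff}jpg"   # the form used in the loop: prints ./scans/page 1.jpg
swap_ext "$line" png      # generalized to any extension: prints ./scans/page 1.png
```

Quoting "$line" matters here: without the quotes, the space in the file name would split it into two arguments.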

That'll do the trick for the images. For the text, you can just open the PDF in the Mac's default PDF viewer, the Preview app. Use Cmd-A to select all of the text and other content, Cmd-C to copy it, and then simply paste it into any plain-text destination. If you don't have a favorite text editor such as Atom or Sublime Text, you can use the Mac's default TextEdit app. Just use Format > Make Plain Text to switch it to plain text mode.
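If you'd rather do the paste step in the terminal, macOS ships a pbpaste command that writes the clipboard to standard output. The sketch below is one way to save that as a tidy plain-text file; save_plain_text is a hypothetical helper name, and trimming trailing whitespace is my own choice, not something the copy step requires:

```shell
#!/usr/bin/env bash
# save_plain_text is a hypothetical helper used for illustration only.
# It reads text on standard input, trims trailing whitespace from each line
# (including stray carriage returns), and writes the result to the given file.
save_plain_text() {
  local out="$1"
  sed -e 's/[[:space:]]*$//' > "$out"
}
```

After copying from Preview, you'd run something like pbpaste | save_plain_text extracted.txt, where the file name is just an example.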

Topics:
shell, bash, integration, extract text

Published at DZone with permission of
