Extracting PDF Text with Scala

This example uses Apache Tika's PDFParser to extract the text content of a PDF for use in other systems. It also demonstrates some basic differences from Java: triple-quoted multi-line strings (hooray!), which need no backslash escaping, wildcard imports, primitive arrays, and what implementing an interface looks like. The big downside is that the Eclipse Scala plugin doesn't seem to be able to fill in interface methods on an object, so each method has to be typed out by hand.
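Two of the differences mentioned above, triple-quoted strings and primitive arrays, can be seen in isolation in the short sketch below. The object name `LiteralDemo` and the sample values are invented for illustration; only the folder path is taken from the example in this article.

```scala
object LiteralDemo {
  // Ordinary literal: every backslash has to be escaped.
  val escaped: String = "\\\\nas\\Files\\Data\\pacer2\\"

  // Triple-quoted literal: backslashes are kept as written.
  val raw: String = """\\nas\Files\Data\pacer2\"""

  // A triple-quoted literal may also span lines; stripMargin removes
  // the leading whitespace up to and including each '|'.
  val multi: String =
    """first line
      |second line""".stripMargin

  // Array[Char] is Java's char[], so Java constructors accept it directly.
  val chars: Array[Char] = "hello".toCharArray
  val middle: String = new String(chars, 1, 3)

  def main(args: Array[String]): Unit = {
    println(escaped == raw)                 // both spell the same path
    println(multi.split("\\r?\\n").length)  // number of lines in multi
    println(middle)                         // slice of the char array
  }
}
```

The `(array, offset, count)` constructor used for `middle` is the same one a SAX `characters` callback needs in order to read only its own slice of the buffer.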

import java.io._
 
import org.apache.tika.parser.pdf._
import org.apache.tika.metadata._
import org.apache.tika.parser._
import org.xml.sax._
 
object pdfHandler extends ContentHandler {
	// Print only the slice of the buffer that belongs to this event;
	// the rest of ch may hold unrelated or stale characters.
	def characters(ch: Array[Char], start: Int, length: Int): Unit = {
		println(new String(ch, start, length))
	}
 
	// The remaining ContentHandler callbacks are intentionally no-ops.
	def endDocument(): Unit = {}
 
	def endElement(uri: String, localName: String, qName: String): Unit = {}
 
	def endPrefixMapping(prefix: String): Unit = {}
 
	def ignorableWhitespace(ch: Array[Char], start: Int, length: Int): Unit = {}
 
	def processingInstruction(target: String, data: String): Unit = {}
 
	def setDocumentLocator(locator: Locator): Unit = {}
 
	def skippedEntity(name: String): Unit = {}
 
	def startDocument(): Unit = {}
 
	def startElement(uri: String, localName: String, qName: String, atts: Attributes): Unit = {}
 
	def startPrefixMapping(prefix: String, uri: String): Unit = {}
}
 
object pdf extends App {
	// Triple-quoted strings keep backslashes literal, so the UNC path needs no escaping.
	val folder = """\\nas\Files\Data\pacer2\"""
	val subfolder = """\00\00\gov.uscourts.rid.6064\"""
	val file = """gov.uscourts.rid.6064.20.0.pdf"""
 
	val parser: PDFParser = new PDFParser()
 
	val stream: InputStream = new FileInputStream(folder + subfolder + file)
	val handler: ContentHandler = pdfHandler
	val metadata: Metadata = new Metadata()
	val context: ParseContext = new ParseContext()
 
	// Close the stream even if parsing throws.
	try {
		parser.parse(stream, handler, metadata, context)
	} finally {
		stream.close()
	}
}

Output:

UNITED STATES DISTRICT COURT
FOR THE DISTRICT OF RHODE ISLAND
...
It is hereby agreed by and between the parties that the above-captioned matter be
dismissed, with prejudice, no interest, no costs.
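
Since the stated goal is to feed the text into other systems, it may be more convenient to collect it than to print it. The following is a minimal sketch, not part of the original example: `TextCollector` and `TextCollectorDemo` are hypothetical names, and the sketch assumes that extending the JDK's `org.xml.sax.helpers.DefaultHandler` (which supplies no-op versions of every `ContentHandler` method) is acceptable in place of implementing the interface by hand.

```scala
import org.xml.sax.helpers.DefaultHandler

// Hypothetical handler that accumulates text instead of printing it.
// DefaultHandler already implements ContentHandler with empty methods,
// so only the callback of interest is overridden.
class TextCollector extends DefaultHandler {
  private val buffer = new java.lang.StringBuilder

  // Append only the slice [start, start + length) reported by the parser.
  override def characters(ch: Array[Char], start: Int, length: Int): Unit = {
    buffer.append(ch, start, length)
  }

  // Everything collected so far.
  def text: String = buffer.toString
}

object TextCollectorDemo {
  def main(args: Array[String]): Unit = {
    val collector = new TextCollector
    // Simulate a parser event that covers only part of a larger buffer.
    val chars = "xxhelloyy".toCharArray
    collector.characters(chars, 2, 5)
    println(collector.text)
  }
}
```

An instance of such a class could be passed to `parse` wherever a `ContentHandler` is expected, and `text` read once parsing finishes. Tika also ships `org.apache.tika.sax.BodyContentHandler`, which may serve the same purpose without a custom class.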


Published at DZone with permission of Gary Sieling, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
