Apache Tika
This short tutorial is all about extracting data with the Apache Tika library. Read below to find out how to extract your own data.
Overview
Apache Tika is a library that lets you extract data from PDF, XLS, PPT, and many other file formats. Apache Tika uses the decorator pattern, so you can easily adapt it to your needs.
Adding Dependency
XML
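<dependency>
    <!-- tika-parsers pulls in tika-core plus the format-specific parsers (PDF, Office, ...). -->
    <!-- With tika-core alone, AutoDetectParser has no parser for these formats and returns no text. -->
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.26</version>
</dependency>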
Apache Tika in Action
Java
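import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class DzoneApacheTikaExample {
    public static void main(String[] args) throws TikaException, IOException, SAXException {
        DzoneApacheTikaExample example = new DzoneApacheTikaExample();
        String value = example.readFromPdf("test-reading.pdf");
        System.out.println("Data read from PDF: ");
        System.out.println(value);
    }

    public String readFromPdf(String readFromFileName) throws IOException, TikaException, SAXException {
        // Load the file from the classpath; try-with-resources closes the stream when parsing ends
        try (InputStream stream = getClass().getClassLoader().getResourceAsStream(readFromFileName)) {
            if (stream == null) {
                throw new FileNotFoundException("Resource not found on classpath: " + readFromFileName);
            }
            Parser parser = new AutoDetectParser(); // detects the file type and picks a matching parser
            ContentHandler handler = new BodyContentHandler(); // in-memory buffer, 100,000 characters by default
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();
            parser.parse(stream, handler, metadata, context);
            return handler.toString();
        }
    }
}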
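BodyContentHandler: Collects the text inside the body tag of the XHTML output that Tika produces while parsing.
Metadata: Stores all metadata collected during parsing, such as the detected content type.
ParseContext: Passes context information to the Apache Tika parser.
ContentHandler: The SAX interface that receives the parsed output. Tika's handlers follow the decorator pattern, so you can wrap one handler in another to build more complex processing.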
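To make the decorator idea concrete, here is a minimal sketch that uses only the SAX classes shipped with the JDK, so it runs without Tika on the classpath. The class names HandlerDecoratorSketch, CollectingHandler, and UpperCaseHandler are hypothetical and are not part of Tika. They only illustrate how one handler can wrap another and change what the inner handler receives.

```java
import java.util.Locale;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerDecoratorSketch {

    /** Innermost handler: collects every character event into a buffer. */
    static class CollectingHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Decorator (hypothetical): upper-cases character data, then forwards it to the wrapped handler. */
    static class UpperCaseHandler extends DefaultHandler {
        private final ContentHandler delegate;

        UpperCaseHandler(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            char[] upper = new String(ch, start, length).toUpperCase(Locale.ROOT).toCharArray();
            delegate.characters(upper, 0, upper.length);
        }
    }

    /** Sends the input through the decorated chain and returns what the inner handler collected. */
    public static String run(String input) {
        CollectingHandler collector = new CollectingHandler();
        ContentHandler handler = new UpperCaseHandler(collector);
        try {
            char[] chars = input.toCharArray();
            handler.characters(chars, 0, chars.length);
        } catch (SAXException e) {
            throw new IllegalStateException("Handler chain failed", e);
        }
        return collector.toString();
    }

    public static void main(String[] args) {
        System.out.println(run("hello tika")); // prints HELLO TIKA
    }
}
```

In a real Tika program the parser, rather than the run method, would emit the character events, and the outer handler would typically be one of Tika's own decorators such as BodyContentHandler.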
For example, let's write parsed data into a file:
Java
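import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class DzoneApacheTikaExample {
    public static void main(String[] args) throws TikaException, IOException, SAXException {
        DzoneApacheTikaExample example = new DzoneApacheTikaExample();
        example.writeParsedDataToFile("capitals.xlsx", "/Users/ali_zhagparov/Desktop/excel-content.txt");
    }

    public void writeParsedDataToFile(String readFromFileName, String writeToFileName) throws IOException, TikaException, SAXException {
        // FileOutputStream creates or overwrites the file; try-with-resources closes both streams
        try (InputStream stream = getClass().getClassLoader().getResourceAsStream(readFromFileName);
             OutputStream fileOutputStream = new FileOutputStream(writeToFileName, false)) {
            Parser parser = new AutoDetectParser();
            // Decorator chain: keep only the body content, then stream it to the file
            ContentHandler handler = new BodyContentHandler(new WriteOutContentHandler(fileOutputStream));
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();
            parser.parse(stream, handler, metadata, context);
        }
    }
}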
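The file-writing chain above streams text to disk as the parser emits it instead of collecting it in memory first. The following JDK-only sketch shows that idea in isolation, under a few assumptions: WriterHandler is a hypothetical stand-in and not Tika's WriteOutContentHandler, the text chunks are fed in by hand rather than by a parser, and the output goes to a temporary file that is deleted afterwards.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WriterHandlerSketch {

    /** Hypothetical stand-in for a writer-backed handler (not a Tika class). */
    static class WriterHandler extends DefaultHandler {
        private final Writer writer;

        WriterHandler(Writer writer) {
            this.writer = writer;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                writer.write(ch, start, length); // pass each chunk on to the writer right away
            } catch (IOException e) {
                throw new SAXException("Could not write character data", e);
            }
        }

        @Override
        public void endDocument() throws SAXException {
            try {
                writer.flush(); // push buffered characters out once the document ends
            } catch (IOException e) {
                throw new SAXException("Could not flush the writer", e);
            }
        }
    }

    /** Simulates a parser emitting text chunks, then reads the temporary file back. */
    public static String writeAndReadBack(String... chunks) {
        try {
            Path target = Files.createTempFile("handler-sketch", ".txt");
            try (Writer writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
                WriterHandler handler = new WriterHandler(writer);
                handler.startDocument();
                for (String chunk : chunks) {
                    handler.characters(chunk.toCharArray(), 0, chunk.length());
                }
                handler.endDocument();
            }
            String content = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
            Files.delete(target);
            return content;
        } catch (IOException | SAXException e) {
            throw new IllegalStateException("Sketch failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeAndReadBack("Astana, ", "Paris")); // prints Astana, Paris
    }
}
```

Because the handler never holds the whole document, this style suits large files better than reading everything into a string and writing it out afterwards.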