Creating HTML From PDF, Excel, or Word With Apache NiFi and Apache Tika

Let's look into extracting text or HTML from PDF, Excel, and Word documents via Apache NiFi.

This version has been tested with HDF 3.1 and Apache NiFi 1.5. The processor uses Apache Tika 1.17. It is an unsupported, open-source community processor that I wrote.

A user asked about HTML output. It turned out to be easy to support, so I added an option for it.

Apache NiFi Flow

You must download or build the nifi-extracttextprocessor NAR and place it in your NiFi lib directory, then restart NiFi so it loads the NAR. After that, you can add the processor to your flow.

Select HTML or text:

Here's the autogenerated documentation:

You can see we set the output mime.type attribute to text/html.

Apache NiFi example flow to read a file and convert to HTML:

Source and JUnit in Eclipse:

Example output HTML:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="pdf:PDFVersion" content="1.3"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="xmp:CreatorTool" content="Rave (http://www.nevrona.com/rave)"/>
<meta name="access_permission:modify_annotations" content="true"/>
<meta name="access_permission:can_print_degraded" content="true"/>
<meta name="meta:creation-date" content="2006-03-01T07:28:26Z"/>
<meta name="created" content="Wed Mar 01 02:28:26 EST 2006"/>
<meta name="access_permission:extract_for_accessibility" content="true"/>
<meta name="access_permission:assemble_document" content="true"/>
<meta name="xmpTPg:NPages" content="2"/>
<meta name="Creation-Date" content="2006-03-01T07:28:26Z"/>
<meta name="dcterms:created" content="2006-03-01T07:28:26Z"/>
<meta name="dc:format" content="application/pdf; version=1.3"/>
<meta name="access_permission:extract_content" content="true"/>
<meta name="access_permission:can_print" content="true"/>
<meta name="pdf:docinfo:creator_tool" content="Rave (http://www.nevrona.com/rave)"/>
<meta name="access_permission:fill_in_form" content="true"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="producer" content="Nevrona Designs"/>
<meta name="access_permission:can_modify" content="true"/>
<meta name="pdf:docinfo:producer" content="Nevrona Designs"/>
<meta name="pdf:docinfo:created" content="2006-03-01T07:28:26Z"/>
<meta name="Content-Type" content="application/pdf"/>
<title>
</title>
</head>
<body>
<div class="page">
<p/>
<p> A Simple PDF File
This is a small demonstration .pdf file -
</p>
<p> just for use in the Virtual Mechanics tutorials. More text. And more
text. And more text. And more text. And more text.
</p>
<p> And more text. And more text. And more text. And more text. And more
text. And more text. Boring, zzzzz. And more text. And more text. And
more text. And more text. And more text. And more text. And more text.
And more text. And more text.
</p>
<p> And more text. And more text. And more text. And more text. And more
text. And more text. And more text. Even more. Continued on page 2 ...</p>
<p/>
</div>
<div class="page">
<p/>
<p> Simple PDF File 2
...continued from page 1. Yet more text. And more text. And more text.
And more text. And more text. And more text. And more text. And more
text. Oh, how boring typing this stuff. But not as boring as watching
paint dry. And more text. And more text. And more text. And more text.
Boring. More, a little more text. The end, and just as well. </p>
<p/>
</div>
</body>
</html>
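
Because the output above is well-formed XHTML, a downstream step can read it with a standard XML parser. Here is a minimal sketch of that idea using only the JDK. It is not part of the processor: the class name `TikaXhtmlReader` and its methods are hypothetical, and it assumes the output has no DOCTYPE declaration and marks each page with `<div class="page">`, as in the sample above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for reading the XHTML shown above; not part of the processor.
public class TikaXhtmlReader {

    /** Parses an XHTML string into a DOM document. */
    public static Document parse(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Assumption: the output carries no DOCTYPE, so reject one outright
        // rather than risk resolving external entities.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    /** Collects meta tags by name. A name can repeat (X-Parsed-By does), so values are lists. */
    public static Map<String, List<String>> metaTags(Document doc) {
        Map<String, List<String>> meta = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagName("meta");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element element = (Element) nodes.item(i);
            meta.computeIfAbsent(element.getAttribute("name"), key -> new ArrayList<>())
                .add(element.getAttribute("content"));
        }
        return meta;
    }

    /** Returns the text of each div with class "page", with whitespace collapsed. */
    public static List<String> pageTexts(Document doc) {
        List<String> pages = new ArrayList<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if ("page".equals(div.getAttribute("class"))) {
                pages.add(div.getTextContent().replaceAll("\\s+", " ").trim());
            }
        }
        return pages;
    }

    public static void main(String[] args) throws Exception {
        // A shortened stand-in for the processor output shown above.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
            + "<meta name=\"Content-Type\" content=\"application/pdf\"/>"
            + "<title></title></head><body>"
            + "<div class=\"page\"><p/><p> A Simple PDF File</p></div>"
            + "</body></html>";
        Document doc = parse(xhtml);
        System.out.println(metaTags(doc));
        System.out.println(pageTexts(doc));
    }
}
```

In a real flow, the string would come from the content of the FlowFile the processor emits. How you read that content depends on where this code runs, so it is left out here.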

The source code can be found here. Here's the NAR release.
