DZone

Creating HTML From PDF, Excel, or Word With Apache NiFi and Apache Tika

Let's look into extracting text or HTML from PDF, Excel, and Word documents via Apache NiFi.

By Tim Spann (DZone Core) · Mar. 14, 18 · Tutorial

This version has been tested with HDF 3.1 and Apache NiFi 1.5. The processor uses Apache Tika 1.17. It is an unsupported, open-source community processor that I wrote.

A user asked about HTML output. I took a look, it turned out to be easy, and so I added an option for it.

Apache NiFi Flow

You must download or build the nifi-extracttextprocessor NAR and put it in your NiFi lib directory. Then you can add the processor to your flow.

Select HTML or text:

Here's the autogenerated documentation:

You can see that we set the output mime.type attribute to text/html.
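As a rough illustration of that choice, here is a minimal sketch of mapping an output option to the mime.type value. The class, enum, and method names are hypothetical and are not taken from the processor's source. Using text/plain for the text option is also an assumption.

```java
// Hypothetical sketch: none of these names come from the actual processor.
class OutputMimeType {

    // The two output choices described in the article.
    enum OutputFormat { HTML, TEXT }

    // Interpret a user-supplied option value; anything other than "html"
    // (ignoring case and surrounding whitespace) falls back to plain text.
    static OutputFormat parse(String option) {
        if (option != null && option.trim().equalsIgnoreCase("html")) {
            return OutputFormat.HTML;
        }
        return OutputFormat.TEXT;
    }

    // Value to place in the mime.type attribute for the chosen format.
    // text/plain for TEXT is an assumption made for this sketch.
    static String mimeTypeFor(OutputFormat format) {
        switch (format) {
            case HTML:
                return "text/html";
            default:
                return "text/plain";
        }
    }

    public static void main(String[] args) {
        System.out.println(mimeTypeFor(parse("HTML")));
        System.out.println(mimeTypeFor(parse("text")));
    }
}
```

Setting the MIME type on the outgoing flow file lets downstream processors decide how to handle the content without inspecting it.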

Apache NiFi example flow to read a file and convert to HTML:

Source and JUnit in Eclipse:

Example output HTML:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="pdf:PDFVersion" content="1.3"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="xmp:CreatorTool" content="Rave (http://www.nevrona.com/rave)"/>
<meta name="access_permission:modify_annotations" content="true"/>
<meta name="access_permission:can_print_degraded" content="true"/>
<meta name="meta:creation-date" content="2006-03-01T07:28:26Z"/>
<meta name="created" content="Wed Mar 01 02:28:26 EST 2006"/>
<meta name="access_permission:extract_for_accessibility" content="true"/>
<meta name="access_permission:assemble_document" content="true"/>
<meta name="xmpTPg:NPages" content="2"/>
<meta name="Creation-Date" content="2006-03-01T07:28:26Z"/>
<meta name="dcterms:created" content="2006-03-01T07:28:26Z"/>
<meta name="dc:format" content="application/pdf; version=1.3"/>
<meta name="access_permission:extract_content" content="true"/>
<meta name="access_permission:can_print" content="true"/>
<meta name="pdf:docinfo:creator_tool" content="Rave (http://www.nevrona.com/rave)"/>
<meta name="access_permission:fill_in_form" content="true"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="producer" content="Nevrona Designs"/>
<meta name="access_permission:can_modify" content="true"/>
<meta name="pdf:docinfo:producer" content="Nevrona Designs"/>
<meta name="pdf:docinfo:created" content="2006-03-01T07:28:26Z"/>
<meta name="Content-Type" content="application/pdf"/>
<title>
</title>
</head>
<body>
<div class="page">
<p/>
<p> A Simple PDF File
This is a small demonstration .pdf file -
</p>
<p> just for use in the Virtual Mechanics tutorials. More text. And more
text. And more text. And more text. And more text.
</p>
<p> And more text. And more text. And more text. And more text. And more
text. And more text. Boring, zzzzz. And more text. And more text. And
more text. And more text. And more text. And more text. And more text.
And more text. And more text.
</p>
<p> And more text. And more text. And more text. And more text. And more
text. And more text. And more text. Even more. Continued on page 2 ...</p>
<p/>
</div>
<div class="page">
<p/>
<p> Simple PDF File 2
...continued from page 1. Yet more text. And more text. And more text.
And more text. And more text. And more text. And more text. And more
text. Oh, how boring typing this stuff. But not as boring as watching
paint dry. And more text. And more text. And more text. And more text.
Boring. More, a little more text. The end, and just as well. </p>
<p/>
</div>
</body>
</html>
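
Because the output above is well-formed XHTML, a downstream consumer can read it with a standard XML parser. The sketch below uses only the JDK's built-in DOM parser to collect the meta tags and count the page divs. The class and method names are hypothetical, and the sample string is a trimmed copy of the output above, not a complete document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for reading XHTML like the example output above.
class TikaXhtmlReader {

    // Parse an XHTML string into a DOM document.
    static Document parse(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations, since the input may come from untrusted files.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    // Collect meta name/content pairs. A name can repeat (X-Parsed-By does),
    // so each name maps to a list of values in document order.
    static Map<String, List<String>> metaTags(Document doc) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        NodeList metas = doc.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            result.computeIfAbsent(meta.getAttribute("name"), k -> new ArrayList<>())
                  .add(meta.getAttribute("content"));
        }
        return result;
    }

    // Count the <div class="page"> elements, one per page in the output.
    static int pageCount(Document doc) {
        NodeList divs = doc.getElementsByTagName("div");
        int pages = 0;
        for (int i = 0; i < divs.getLength(); i++) {
            if ("page".equals(((Element) divs.item(i)).getAttribute("class"))) {
                pages++;
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Trimmed copy of the example output, for illustration only.
        String sample =
              "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
            + "<meta name=\"X-Parsed-By\" content=\"org.apache.tika.parser.DefaultParser\"/>"
            + "<meta name=\"X-Parsed-By\" content=\"org.apache.tika.parser.pdf.PDFParser\"/>"
            + "<meta name=\"xmpTPg:NPages\" content=\"2\"/>"
            + "<title></title></head><body>"
            + "<div class=\"page\"><p>A Simple PDF File</p></div>"
            + "<div class=\"page\"><p>Simple PDF File 2</p></div>"
            + "</body></html>";
        Document doc = parse(sample);
        System.out.println("Pages: " + pageCount(doc));
        System.out.println("Parsers: " + metaTags(doc).get("X-Parsed-By"));
    }
}
```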

The source code can be found here. Here's the NAR release.
