Converting PDF to HTML Using PDFBox

by James Sugrue · Apr. 07, 10 · Tutorial

Over the past few days, while working on another project, I needed to convert PDF documents into HTML. I did the usual searches for tools, but as I'm sure you'll have noticed, the available tools don't produce great results. Being a software developer, I decided to see if I could program it myself. My requirements were quite simple: extract the text from the document as HTML, and extract the images at the same time.

My first port of call was iText, as it was a library I was already familiar with. iText is great for creating documents, and I was able to get some text out, but image extraction wasn't working out for me. The following code snippet is what I used to get the images from the PDFs in iText, based on a post on the iText mailing list. When I ran it, none of the generated images were right: most were just the box outlines/borders of the images in the PDF. I presume I was doing something wrong.

import java.awt.Graphics2D;
import java.awt.Image;
import java.awt.Toolkit;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Package names assume iText 2.x; iText 5 uses com.itextpdf.text.pdf instead.
import com.lowagie.text.pdf.PRStream;
import com.lowagie.text.pdf.PdfName;
import com.lowagie.text.pdf.PdfObject;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.PdfStream;

public class ITextImageExtraction {

    public static void main(String[] args) {
        try {
            PdfReader reader = new PdfReader(new FileInputStream(new File("C:\\test.pdf")));

            // Walk every object in the cross-reference table, looking for image streams.
            for (int i = 0; i < reader.getXrefSize(); i++) {
                PdfObject pdfobj = reader.getPdfObject(i);
                if (pdfobj == null || !pdfobj.isStream()) {
                    continue; // not a stream
                }

                PdfStream stream = (PdfStream) pdfobj;
                PdfObject pdfsubtype = stream.get(PdfName.SUBTYPE);
                if (pdfsubtype == null
                        || !pdfsubtype.toString().equals(PdfName.IMAGE.toString())) {
                    continue; // not an image stream
                }

                // Now you have a PDF stream object with an image,
                // but you don't know anything about the image format.
                // You'll have to get that info from the stream dictionary.
                byte[] img = PdfReader.getStreamBytesRaw((PRStream) stream);
                System.out.println("----img ------");
                System.out.println("height:" + stream.get(PdfName.HEIGHT));
                System.out.println("width:" + stream.get(PdfName.WIDTH));
                int height = Integer.parseInt(stream.get(PdfName.HEIGHT).toString());
                int width = Integer.parseInt(stream.get(PdfName.WIDTH).toString());
                System.out.println("bitspercomponent:" + stream.get(PdfName.BITSPERCOMPONENT));

                // Try making a java.awt.Image from the raw byte array.
                Image image = Toolkit.getDefaultToolkit().createImage(img);
                BufferedImage bi = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
                Graphics2D g2 = bi.createGraphics();
                // Note: nothing is ever drawn into bi, so the PNG written below is blank.
                // The raw bytes are also still filter-encoded, so createImage() can only
                // decode them when they happen to be in a format AWT understands.
                // This is probably why the output looked wrong.
                g2.dispose();
                ImageIO.write(bi, "PNG", new File("C:\\images\\" + i + ".png"));
            }
            reader.close();
        } catch (IOException e) {
            // Also covers FileNotFoundException.
            e.printStackTrace();
        } catch (Exception e) {
            // For example a missing or non-numeric height/width entry.
            e.printStackTrace();
        }
    }
}

As I was low on time, I moved on to PDFBox, which looked like it had already considered my use cases. I got the latest source code from SVN and tried the org.apache.pdfbox.ExtractText class first. It lets you pass a -html flag to get HTML instead of the default plain-text output. I ran into an exception straight away; after some debugging I found that my download was missing the resources/glyphlist.txt file. I found a copy on the Adobe site, and after that the utility ran.

One other thing to note while using these utilities is that you'll need to have ICU4J, iText and the Apache Commons Logging libraries on your build path. 
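A quick way to see which of those libraries is missing is to probe the classpath before running the utilities. The following is only a sketch: the class names used as probes (com.ibm.icu.text.Normalizer, com.lowagie.text.Document, org.apache.commons.logging.LogFactory) are my assumptions about one representative class per library, and the iText name in particular depends on the iText version you have.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ClasspathCheck {
    private ClasspathCheck() {}

    /** Returns true if the named class can be located, without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Probe class names are assumptions; adjust them to the versions you use.
        Map<String, String> required = new LinkedHashMap<>();
        required.put("ICU4J", "com.ibm.icu.text.Normalizer");
        required.put("iText", "com.lowagie.text.Document");
        required.put("Apache Commons Logging", "org.apache.commons.logging.LogFactory");

        for (Map.Entry<String, String> entry : required.entrySet()) {
            String status = isOnClasspath(entry.getValue()) ? "found" : "MISSING";
            System.out.println(entry.getKey() + ": " + status);
        }
    }
}
```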

The good news was that the utility got all the text out and put it into an HTML format. The generated HTML wasn't pretty, though: every line it read was terminated with a <br/>. Admittedly, that is an easy thing to change.
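As a rough illustration of how that clean-up might look, here is a small post-processing sketch that uses only the standard library. It assumes the input is a fragment of body text in which each line ends with <br/> and an empty line (two <br/> tags in a row) marks a paragraph break; that assumption about the output shape is mine, and the BrToParagraphs class is hypothetical, not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

final class BrToParagraphs {
    private BrToParagraphs() {}

    /** Joins consecutive <br/>-terminated lines into <p> elements. */
    static String convert(String html) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        // Split on <br>, <br/> or <br /> (any case); -1 keeps trailing empty pieces.
        for (String line : html.split("(?i)<br\\s*/?>", -1)) {
            String text = line.trim();
            if (text.isEmpty()) {
                // An empty line closes the paragraph collected so far.
                flush(current, paragraphs);
            } else {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(text);
            }
        }
        flush(current, paragraphs);
        return String.join("\n", paragraphs);
    }

    private static void flush(StringBuilder current, List<String> paragraphs) {
        if (current.length() > 0) {
            paragraphs.add("<p>" + current + "</p>");
            current.setLength(0);
        }
    }
}
```

For example, BrToParagraphs.convert("First line<br/>second line<br/><br/>Next para<br/>") yields two paragraphs, the first containing "First line second line".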

Moving on to image extraction, I tried out org.apache.pdfbox.ExtractImages. This class worked perfectly, saving all the images in the PDF as JPEG files. I made one alteration, to PDXObjectImage.write2file, so that the images are written to a particular folder.

The PDFBox utilities really impressed me, as I wasn't sure it was possible to get this information out of a PDF so easily. All the pieces are there for a single utility that would generate better HTML for you along with the images. As far as I know, no solution exists to do all of this in Java (if I'm wrong, please let me know in the comments section). Have any readers tried to achieve this using iText, PDFBox or any other Java library?
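To make the idea of a single utility a little more concrete, here is a minimal sketch of the final assembly step only, using just the standard library. It assumes the text has already been extracted into a list of plain-text paragraphs and the images have already been saved to disk. The HtmlPageSketch class, its method names and the page layout are all hypothetical; nothing here calls PDFBox or iText.

```java
import java.util.List;

final class HtmlPageSketch {
    private HtmlPageSketch() {}

    /** Escapes the characters that are special in HTML text and attributes. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Builds one HTML page from extracted paragraphs and saved image paths. */
    static String build(String title, List<String> paragraphs, List<String> imagePaths) {
        StringBuilder sb = new StringBuilder();
        sb.append("<html>\n<head><title>").append(escape(title)).append("</title></head>\n<body>\n");
        for (String paragraph : paragraphs) {
            sb.append("<p>").append(escape(paragraph)).append("</p>\n");
        }
        // Images are simply appended after the text; placing them where they
        // appeared in the PDF would need position data this sketch does not have.
        for (String path : imagePaths) {
            sb.append("<img src=\"").append(escape(path)).append("\" alt=\"\"/>\n");
        }
        sb.append("</body>\n</html>\n");
        return sb.toString();
    }
}
```

Appending the images at the end is the simplest choice; a real converter would have to decide where each image belongs relative to the surrounding text.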
