Three Ways To Separate Plain Text From HTML Using Java

Learn about three API solutions that can be employed to convert an HTML document to text, convert an HTML string to text, and remove HTML from a text string.

By Brian O'Neill
Nov. 15, 22 · Tutorial

Every webpage we visit presents a wide variety of multimedia content, all of it assembled and displayed using HyperText Markup Language (HTML). HTML is a markup language that many developers are familiar with, composed of elements that, when interpreted by a browser, typically form a coherent, organized, and intentional display. This markup provides the framework for how we view images, videos, bodies of writing, hyperlinks, data entry fields, and nearly anything else on a web page, and all of it is available for anyone to view with a simple right-click in most browsers.

Given the sheer volume of formatting elements in any complex HTML string, the actual subject of the code (the text contents and file paths buried within those strings) can be difficult to access independently of the formatting. If, for example, we want to review web copy and then edit or manipulate that text in a meaningful way, we'll have a hard time copying and pasting it directly from the displayed web page. We'll end up with a mess of inconsistently formatted text riddled with hyperlinks, logos, disjointed tabs and spaces, and more. This isn't to say that it can't be done. We can, of course, copy small snippets of text from any web page and reformat them to resemble their original form in rich text editors like Microsoft Word. The issue is that this "point and click" approach chews up valuable time, and if we need to scale up to multiple websites and many thousands of characters of text, doing so manually becomes impractical.

Rather than spend time and energy trying to grab the text we want with deft clicks and drags, we're much better off separating it from the HTML code with an API service built for that purpose. We can accomplish this separation through a few methods which, while similar on the surface, accommodate slightly different use cases. These methods include the following:

  1. Converting an HTML file to plain text
  2. Converting an HTML string to plain text
  3. Removing HTML from a text string

The first and second methods are essentially the same operation with two different scenarios in mind. In the former, our HTML code is available in file form (a file that opens directly as a browser page when we click on it). In the latter, our HTML is available as a text string (for example, HTML we copied by right-clicking in our web browser). The third method, while technically accomplishing the same goal as the first two, envisions a more security-focused use case: it helps us identify HTML and Cross-Site Scripting attacks (a form of cyber threat in which a malicious actor places executable scripts into trusted app or website code) in a given text string, without assuming we have a fully formed HTML string to work with.
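To make the goal concrete, the sketch below shows the kind of input and output involved in separating text from markup. It is only an illustration: the `TagStripper` class and its regular expression are hypothetical, they are not how the APIs in this article work, and a pattern this crude will mishandle scripts, comments, and character entities.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only: a crude tag stripper, not the API's algorithm.
final class TagStripper {
    // Matches anything that looks like a tag: '<', a run of non-'>' characters, then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Removes every tag-like span and leaves the remaining characters untouched.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        System.out.println(stripTags(html)); // prints "Hello world"
    }
}
```

The gap between this toy and real-world HTML (nested scripts, malformed tags, encoded characters) is the reason to hand the job to a purpose-built service.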

In the remainder of this article, I will demonstrate three simple API solutions which can be used to separate HTML code from plain text contents for any of the three slightly different scenarios listed above. These APIs are all free to use and are available via the Cloudmersive Document Conversion API endpoint, with a single free-tier Cloudmersive API key to authenticate each service (the free tier allows 800 API calls per month with no additional commitments). Below, I've provided ready-to-run Java code examples to help you structure your call to each of these three APIs, with additional notes on input requests and output responses where they are needed.

Before we call each individual API, I will first provide instructions for installing the API client with Maven or Gradle (these are the same for each API). To install with Maven, our first task is to add a reference to the JitPack repository in pom.xml:

 
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>


After that, we just need to add a reference to the dependency in pom.xml:

 
<dependencies>
<dependency>
    <groupId>com.github.Cloudmersive</groupId>
    <artifactId>Cloudmersive.APIClient.Java</artifactId>
    <version>v4.25</version>
</dependency>
</dependencies>


To install with Gradle instead, we’ll need to add the below snippet to our root build.gradle (at the end of repositories):

 
allprojects {
	repositories {
		...
		maven { url 'https://jitpack.io' }
	}
}


After that, we’ll need to add the below dependency in build.gradle:

 
dependencies {
        implementation 'com.github.Cloudmersive:Cloudmersive.APIClient.Java:v4.25'
}


With installation complete, we can move on and begin structuring our API calls. The first API I will demonstrate converts an HTML document (file) to plain text. This API requires a file path, supplied through the inputFile variable (indicated by the code comment in the example below). We just need to copy the code below into our file and configure our inputs accordingly:

 
// Import classes:
//import java.io.File;
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.ConvertDocumentApi;
//import com.cloudmersive.client.model.TextConversionResult; // model package assumed from the client's layout

ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");

ConvertDocumentApi apiInstance = new ConvertDocumentApi();
File inputFile = new File("/path/to/inputfile"); // File | Input file to perform the operation on.
try {
    TextConversionResult result = apiInstance.convertDocumentHtmlToTxt(inputFile);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ConvertDocumentApi#convertDocumentHtmlToTxt");
    e.printStackTrace();
}
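If no HTML file is at hand for a first test, one option is to generate a small one. The sketch below writes a sample document to a temporary file using only the Java standard library; the `SampleHtmlFile` class and its `create` method are hypothetical helpers, not part of the Cloudmersive client. The returned `File` could then stand in for the `inputFile` variable in the example above.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: writes a small HTML document to a temporary file.
final class SampleHtmlFile {
    // Creates a temp file ending in .html, writes the markup as UTF-8, and returns it.
    static File create(String html) throws IOException {
        Path path = Files.createTempFile("sample-", ".html");
        Files.write(path, html.getBytes(StandardCharsets.UTF_8));
        File file = path.toFile();
        file.deleteOnExit(); // clean up when the JVM exits
        return file;
    }

    public static void main(String[] args) throws IOException {
        File inputFile = create("<html><body><h1>Title</h1><p>Body text.</p></body></html>");
        System.out.println("Sample HTML written to " + inputFile.getAbsolutePath());
    }
}
```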


The second API I will demonstrate converts an HTML string to text. This API's request takes a single HTML string, which can be supplied in the following format:

 
<?xml version="1.0" encoding="UTF-8"?>
<HtmlToTextRequest>
	<Html>string</Html>
</HtmlToTextRequest>
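One detail worth noting about this format: HTML placed inside an XML element has to be escaped, otherwise its tags would be read as part of the XML itself. The sketch below shows one way to build the request body by hand; the `HtmlToTextRequestXml` class and its methods are hypothetical. The Java client used in this article takes a request object instead, so this only matters when assembling the XML body directly.

```java
// Hypothetical helper for building the XML request body in the format shown above.
final class HtmlToTextRequestXml {
    // Escapes the characters that are significant inside XML text content.
    static String escapeXml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Wraps the escaped HTML in the HtmlToTextRequest/Html elements.
    static String build(String html) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
             + "<HtmlToTextRequest>\n"
             + "\t<Html>" + escapeXml(html) + "</Html>\n"
             + "</HtmlToTextRequest>";
    }

    public static void main(String[] args) {
        System.out.println(build("<p>Fish & chips</p>"));
    }
}
```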


With our HTML string properly formatted, we can pass it through the code example below, and we're all done with this method:

 
// Import classes:
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.ConvertWebApi;
//import com.cloudmersive.client.model.*; // request/response models; package assumed from the client's layout

ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");

ConvertWebApi apiInstance = new ConvertWebApi();
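HtmlToTextRequest input = new HtmlToTextRequest(); // HtmlToTextRequest | HTML to Text request parameters
input.setHtml("<p>Hello, <b>world</b>!</p>"); // Sample HTML; setter name assumed from the Html field in the request format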
try {
    HtmlToTextResponse result = apiInstance.convertWebHtmlToTxt_0(input);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ConvertWebApi#convertWebHtmlToTxt_0");
    e.printStackTrace();
}


The final API I will demonstrate removes HTML from a string of text. It serves the same goal as the previous two operations, but it does not assume the input is a fully formed HTML string. To call this API, we will need to format our input request like the example below:

 
<?xml version="1.0" encoding="UTF-8"?>
<RemoveHtmlFromTextRequest>
	<TextContainingHtml>string</TextContainingHtml>
</RemoveHtmlFromTextRequest>


Once our request is formatted, we can pass that request through the final code example below, which will return the cleaned text as a plain TextContentResult string:

 
// Import classes:
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.EditTextApi;
//import com.cloudmersive.client.model.*; // request/response models; package assumed from the client's layout

ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");

EditTextApi apiInstance = new EditTextApi();
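RemoveHtmlFromTextRequest request = new RemoveHtmlFromTextRequest(); // RemoveHtmlFromTextRequest | Input request
request.setTextContainingHtml("Hello <script>alert(1)</script> world"); // Sample input; setter name assumed from the TextContainingHtml field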
try {
    RemoveHtmlFromTextResponse result = apiInstance.editTextRemoveHtml(request);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling EditTextApi#editTextRemoveHtml");
    e.printStackTrace();
}
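All three examples above place the API key directly in the source as "YOUR API KEY". A common alternative is to read the key from the environment so it never lands in version control. The sketch below is one way to do that with the standard library alone; the `ApiKeyLookup` class and the `CLOUDMERSIVE_API_KEY` variable name are assumptions made for illustration, not something the Cloudmersive client defines.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: resolves the API key from environment variables.
final class ApiKeyLookup {
    // Assumed variable name; pick whatever fits your deployment.
    static final String ENV_NAME = "CLOUDMERSIVE_API_KEY";

    // Returns the trimmed key if present and non-blank, otherwise an empty Optional.
    static Optional<String> resolveApiKey(Map<String, String> env) {
        String value = env.get(ENV_NAME);
        if (value == null || value.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(value.trim());
    }

    public static void main(String[] args) {
        Optional<String> key = resolveApiKey(System.getenv());
        System.out.println(key.isPresent()
                ? "API key found"
                : "Set the " + ENV_NAME + " environment variable first");
    }
}
```

The resolved value could then be passed to `Apikey.setApiKey(...)` in place of the hard-coded string.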


With these three solutions in your back pocket, you'll be able to separate HTML from plain text at scale across three distinct use cases. Each API returns its plain text result as a simple string, which can be easily reviewed and edited within any rich or plain text editor.
