Converting Data from RTF to DITA Format with Java

One of my tasks some years ago was to convert the content of RTF[ref] files into the DITA[ref] format with a converter program written in Java. In this article I present the library and the conversion functions that were used in the process. The article also serves as a reminder of the techniques involved.

Completing the conversion task took a few steps. The conversion:

  • from RTF into HTML.

  • from HTML into XML.

  • from XML via XSLT into DITA.

RTF into HTML

The Java SDK has a function that converts RTF into HTML. Unexpectedly, it lives in the Swing libraries, which are normally used for creating and manipulating GUI elements. I found this solution on some forums after searching the net. The code could look like this:

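import java.io.*;
import javax.swing.JEditorPane;
import javax.swing.text.BadLocationException;
import javax.swing.text.EditorKit;

public static String rtfToHtml(Reader rtf) throws IOException {
   JEditorPane p = new JEditorPane();
   p.setContentType("text/rtf");
   EditorKit kitRtf = p.getEditorKitForContentType("text/rtf");
   try {
      // parse the RTF input into the pane's document model
      kitRtf.read(rtf, p.getDocument(), 0);
      // serialize the same document model as HTML
      EditorKit kitHtml = p.getEditorKitForContentType("text/html");
      Writer writer = new StringWriter();
      kitHtml.write(writer, p.getDocument(), 0, p.getDocument().getLength());
      return writer.toString();
   } catch (BadLocationException e) {
      // report the failure to the caller instead of silently returning null
      throw new IOException("RTF to HTML conversion failed", e);
   }
}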
 
 // content as string
 String rtfText = ...;
 String htmlText = rtfToHtml(new StringReader(rtfText));

 // content from file (the reader is closed automatically)
 try (Reader reader = new FileReader(new File("myfile.rtf"))) {
    String htmlFromFile = rtfToHtml(reader);
 }

The following references were used for the code in this part of the article: [ref] and [ref].
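As a variation on the function above, the two editor kits can also be used directly, without creating a JEditorPane. The following is only a minimal sketch under that assumption, not the code of the original converter; the class name RtfToHtmlSketch and the method name convert are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringWriter;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper class, named for illustration only.
class RtfToHtmlSketch {

    // Reads RTF into a styled document and writes that document back as HTML.
    static String convert(Reader rtf) throws IOException, BadLocationException {
        RTFEditorKit rtfKit = new RTFEditorKit();
        // The RTF kit creates a styled document that can hold the parsed content.
        Document doc = rtfKit.createDefaultDocument();
        rtfKit.read(rtf, doc, 0);

        // The HTML kit can serialize any styled document, not only HTML documents.
        StringWriter html = new StringWriter();
        new HTMLEditorKit().write(html, doc, 0, doc.getLength());
        return html.toString();
    }
}
```

A call such as RtfToHtmlSketch.convert(new StringReader(rtfText)) should return the HTML as a string. The markup produced this way is fairly minimal and is not guaranteed to be well-formed XML, which is why a clean-up step is still needed afterwards.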

After this conversion the content is in HTML format. Because the last conversion step, the XSLT transformation, needs a well-formed XML file as input, an intermediate step was added for that task. One approach is to use jTidy[ref], the Java implementation of the "tidy" tool[ref].
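To see why plain HTML is not good enough as XSLT input, it can help to run it through an XML parser. The sketch below uses the standard JAXP parser to test whether a piece of markup is well-formed; the class name WellFormedCheck and the sample strings are assumptions made for illustration and were not part of the original tool.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named for illustration only.
class WellFormedCheck {

    // Returns true if the markup parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String markup)
            throws ParserConfigurationException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Any well-formedness violation ends up here.
            return false;
        }
    }
}
```

An unclosed tag such as <br>, which is legal in HTML, makes the check fail, while the XHTML form <br/> passes. Closing that gap is the job of the clean-up step.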

HTML into XML

The "tidy" tool is commonly used to validate HTML content or to convert HTML into XHTML. It is free to use and licensed under the Apache License (version 2.0). In a Java program, the use of jTidy can look like this[ref]:

import org.w3c.tidy.Tidy;

Tidy tidy = new Tidy(); // obtain a new Tidy instance
tidy.setXHTML(true); // set desired config options using tidy setters
... // (equivalent to command line options)
tidy.setInputEncoding("utf-8");
tidy.setXmlOut(true); // emit well-formed XML
tidy.setShowWarnings(true);
tidy.setQuiet(true);
tidy.parse(inputStream, System.out); // run tidy, providing an input and output stream

Code example: jTidy parameters for XHTML output.

After this step we have XHTML content that can be transformed via XSLT.

XML via XSLT into DITA

The idea for the last conversion step comes from an article on the IBM site that explains the conversion process via the h2d.xslt file [ref][ref]. The h2d.xslt file can be downloaded from the DITA/OASIS[ref] site and was created for XML transformation purposes. To add custom transformation rules you can modify this XSLT file or use a new one. The code for this step is an ordinary XSLT transformation and looks like the one described in one of my earlier articles about Java and XML [ref] and [ref].
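As a rough illustration of such a transformation, the sketch below applies a stylesheet to XHTML with the standard javax.xml.transform API. The tiny inline stylesheet is a made-up stand-in and not h2d.xslt; it only maps the title and the paragraphs into a DITA-like topic. The class name, the constant DEMO_XSLT and the topic id are assumptions for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, named for illustration only.
class XhtmlToDitaSketch {

    // Toy stylesheet (NOT h2d.xslt): XHTML title and paragraphs become a topic.
    static final String DEMO_XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:h='http://www.w3.org/1999/xhtml'"
      + " exclude-result-prefixes='h'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/h:html'>"
      + "<topic id='demo'>"
      + "<title><xsl:value-of select='h:head/h:title'/></title>"
      + "<body><xsl:apply-templates select='h:body/h:p'/></body>"
      + "</topic>"
      + "</xsl:template>"
      + "<xsl:template match='h:p'>"
      + "<p><xsl:value-of select='.'/></p>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the given stylesheet to the given XML and returns the result as text.
    static String transform(String xml, String xslt) throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer =
            factory.newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }
}
```

In a real project the stylesheet would be loaded from a file, for example with new StreamSource(new File(...)), and the rules would be far more extensive. Note that XHTML elements live in the XHTML namespace, so the match patterns need a namespace prefix, as the h: prefix does here.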

Because the structure of the XML files can differ from project to project, I won't go into more detail about the XSLT rules here. This part of the conversion depends on the project and its structure and cannot be described in general terms. You will probably also need intermediate conversion steps until you get the input/output structure you want for further processing.

Other references in which this topic plays a main part are [ref] and [ref].

After the last step we get DITA files, which can be opened in a DITA-friendly editor and used for further processing or output such as PDF, HTML, Microsoft® Word files and so on. As a very last step you may need to bind the links of the DITA files to the ditamap file, either manually or programmatically.
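One way to read "programmatically" here is to generate the ditamap from the list of produced topic files. The sketch below does that with plain string building. It assumes the simplest case, a flat map with one topicref per file, and the class name, method names and file names are invented for illustration.

```java
import java.util.List;

// Hypothetical helper class, named for illustration only.
class DitaMapSketch {

    // Escapes characters that must not appear verbatim in XML text or attributes.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Builds a flat DITA map that references each topic file once, in list order.
    static String buildMap(String title, List<String> topicFiles) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<!DOCTYPE map PUBLIC \"-//OASIS//DTD DITA Map//EN\" \"map.dtd\">\n");
        sb.append("<map>\n");
        sb.append("  <title>").append(escapeXml(title)).append("</title>\n");
        for (String file : topicFiles) {
            sb.append("  <topicref href=\"").append(escapeXml(file)).append("\"/>\n");
        }
        sb.append("</map>\n");
        return sb.toString();
    }
}
```

Calling buildMap("User Guide", Arrays.asList("intro.dita", "setup.dita")) yields a map with two topicref entries. A nested structure would need nested topicref elements, which this sketch leaves out.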

Summary

This is a short and admittedly rough description, meant as a reminder, of a conversion tool that I wrote some years ago for internal purposes. It describes how some external tools were used in the conversion workflow.

This article describes one possible way to convert data from RTF into the DITA format. The solution is split into a few conversion steps, each converting the content from one format into another: RTF → HTML → XHTML → DITA.

To build this kind of conversion, which takes advantage of XML transformation, you will need the following skills:

  • a good understanding of the Java programming language.
  • a very good understanding of XSLT.

If you don't like or don't want to use Java, you can replace the Java part with another programming language of your choice.

