Crimes Against XML

Integration is an important part of any B2B application, and XML is a common (IMHO good) way of passing data between systems. Generally one side has to do the integration dirty work. Normally this is the smaller company, which for one reason or another is where I'm always working. DB schemas can be a pain to change, but XML schemas, particularly ones that are exposed externally, are almost impossible to alter. In my time I have seen schemas from hell that have made me want to rip my hair out and have ruined what could have been a very nice day. Here is a list of what I consider XML crimes.

Not Providing an XML Schema
Whenever you're working with XML you should always have a schema. It's completely inappropriate and amateurish to distribute XML externally and not provide a schema (preferably XSD). I'm surprised how often I see people working with XML without a schema. How else are you meant to document and validate your XML? Not to mention code completion and auto-generation.
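
As a rough illustration of what a schema buys you, the sketch below validates a document against an XSD using the JDK's javax.xml.validation API. The schema itself, the class name ArtistSchemaCheck and the artist/name/country layout are assumptions made up for this example, not a schema from any real system.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class ArtistSchemaCheck {

    // Hypothetical XSD: an <artist> with a required <name> and <country>
    // and an optional "type" attribute.
    static final String ARTIST_XSD =
          "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "  <xs:element name=\"artist\">"
        + "    <xs:complexType>"
        + "      <xs:sequence>"
        + "        <xs:element name=\"name\" type=\"xs:string\"/>"
        + "        <xs:element name=\"country\" type=\"xs:string\"/>"
        + "      </xs:sequence>"
        + "      <xs:attribute name=\"type\" type=\"xs:string\"/>"
        + "    </xs:complexType>"
        + "  </xs:element>"
        + "</xs:schema>";

    // Returns true when the document is well formed and conforms to ARTIST_XSD.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema =
                factory.newSchema(new StreamSource(new StringReader(ARTIST_XSD)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Raised for malformed XML and for schema violations alike.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With something like this in place, a partner can check a document before sending it, and the receiver can reject bad input at the door instead of deep inside the parsing code.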

Having Elements That Only Contain Attributes
Attributes are intended to describe the properties of the element; basically they're for metadata [1]. Sometimes people get confused and put all the data in attributes.
e.g.

<artist name="Arcade Fire" country="Canada" type="Band"/>

A much better way to present this data would be:

<artist type="band">
<name>Arcade Fire</name>
<country>Canada</country>
</artist>

Why is this better? Firstly, it's the way it was intended; secondly, it's easier to parse; and finally, it's more extensible. A key component of interoperability is consistency, and having some values in attributes and others in elements breaks this rule.

Incorrect Character Encoding
Encoding seems to be one of those things most developers don't understand. Unless you have a very good reason otherwise, all XML should be encoded as UTF-8. You should always add an XML declaration at the top of your documents.
e.g.

<?xml version="1.0" encoding="UTF-8"?>

For better or worse, ASCII characters tend to have the same code points in commonly used encoding schemes. This means encoding problems generally don't present until a system goes live; commonly '`' is the culprit on Windows platforms. Just because you don't have any foreign/weird characters, or you just don't care about internationalisation, does not mean you have amnesty for this crime.

Another important point is that adding encoding="UTF-8" to the declaration doesn't mean the document is actually encoded as UTF-8. Before Java 18, Java by default produced text files in the encoding scheme of your OS (typically windows-1252 on Western Windows installs), and Java is not the only accomplice in this crime.
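
A minimal sketch of the difference, assuming nothing beyond the JDK: both methods below emit the same declaration, but only one of them produces bytes that match it. The class and method names (XmlEncodingDemo, toUtf8, mislabelled) are hypothetical, and ISO-8859-1 simply stands in for whatever non-UTF-8 platform default you might be bitten by.

```java
import java.nio.charset.StandardCharsets;

final class XmlEncodingDemo {

    static final String DECLARATION =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";

    // Correct: the charset is named explicitly, so the bytes agree with the
    // declaration whatever the platform default happens to be.
    static byte[] toUtf8(String rootElement) {
        return (DECLARATION + rootElement).getBytes(StandardCharsets.UTF_8);
    }

    // The crime: the declaration still claims UTF-8, but the bytes are
    // produced in a different single-byte encoding.
    static byte[] mislabelled(String rootElement) {
        return (DECLARATION + rootElement).getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u00e9 is the e-acute in "Regine": two bytes in UTF-8, one in ISO-8859-1.
        String member = "<member>R\u00e9gine Chassagne</member>";
        System.out.println("UTF-8 bytes:       " + toUtf8(member).length);
        System.out.println("Mislabelled bytes: " + mislabelled(member).length);
    }
}
```

The two outputs are byte-for-byte identical for pure ASCII content, which is exactly why the problem hides until real data arrives. When writing to disk, the same idea applies: pass the charset explicitly, for example with Files.newBufferedWriter(path, StandardCharsets.UTF_8), rather than relying on the default.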

Elements Containing Delimited Fields
Best described with an example:

<artist>
<name>Arcade Fire</name>
<members>Win Butler, Régine Chassagne, Richard Reed Parry, William Butler, Tim Kingsbury, Sarah Neufeld, Jeremy Gara</members>
</artist>

One of the good things about XML is that it can be used to map most data structures, such as lists in lists. Delimiting fields within elements shows the designer didn't think enough about future needs of the schema or doesn't understand XML. The example below is much easier to generate and parse.

<artist>
<name>Arcade Fire</name>
<members>
<member>Win Butler</member>
<member>Régine Chassagne</member>
<member>Richard Reed Parry</member>
<member>William Butler</member>
<member>Tim Kingsbury</member>
<member>Sarah Neufeld</member>
<member>Jeremy Gara</member>
</members>
</artist>

No additional logic is required to convert this structure into a useful object model. Finally, what's going to happen if one of your fields contains the character that is being used for delimiting?
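
To illustrate the point about not needing additional logic, here is a minimal sketch that reads the wrapped member list with the JDK's DOM parser. The class and method names (MemberReader, members) are hypothetical; the element names follow the example above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class MemberReader {

    // Collects the text of every <member> element; no splitting, trimming
    // or delimiter escaping is needed because the parser does the work.
    static List<String> members(String artistXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(artistXml)));
            NodeList nodes = doc.getElementsByTagName("member");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getTextContent());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse artist XML", e);
        }
    }
}
```

A value that happens to contain a comma, such as a name written as "Butler, William", comes through intact here, whereas the delimited version would silently split it in two.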

Not Wrapping Repeated Elements
e.g.

<artist>
<name>Arcade Fire</name>
<member>Win Butler</member>
<member>Régine Chassagne</member>
<member>Richard Reed Parry</member>
<member>William Butler</member>
<member>Tim Kingsbury</member>
<member>Sarah Neufeld</member>
<member>Jeremy Gara</member>
<album>Funeral</album>
<album>Neon Bible</album>
</artist>

Once again a nightmare to parse and it's not exactly human readable. XML in this structure will also result in an ambiguous schema, which in turn plays havoc with auto-code generators (e.g. JAXB). This is much better:

<artist>
<name>Arcade Fire</name>
<members>
<member>Win Butler</member>
<member>Régine Chassagne</member>
<member>Richard Reed Parry</member>
<member>William Butler</member>
<member>Tim Kingsbury</member>
<member>Sarah Neufeld</member>
<member>Jeremy Gara</member>
</members>
<albums>
<album>Funeral</album>
<album>Neon Bible</album>
</albums>
</artist>

Repeating Element Names in Different Contexts
Consistency is something all developers should strive for every day, and in something as self-contained as an XML schema it really shouldn't be that hard. One place XML schemas often fall down is in the names that are chosen for attributes and elements. In the example below the element named "artist" appears in two places, each of which has its own meaning and structure.
e.g.

<album>
<artist>
<name>Arcade Fire</name>
<members>
<member>Win Butler</member>
<member>Régine Chassagne</member>
<member>Richard Reed Parry</member>
<member>William Butler</member>
<member>Tim Kingsbury</member>
<member>Sarah Neufeld</member>
<member>Jeremy Gara</member>
</members>
</artist>
<track>
<name>Intervention</name>
<artist>Arcade Fire</artist>
</track>
</album>

A better structure would be:

<album>
<artist>
<name>Arcade Fire</name>
<members>
<member>Win Butler</member>
<member>Régine Chassagne</member>
<member>Richard Reed Parry</member>
<member>William Butler</member>
<member>Tim Kingsbury</member>
<member>Sarah Neufeld</member>
<member>Jeremy Gara</member>
</members>
</artist>
<track>
<name>Intervention</name>
<track-artist>Arcade Fire</track-artist>
</track>
</album>

A logical and consistent layout will always make the parser simpler and the XML easier for humans to read. The opposite of this crime is just as bad: not using the same element or attribute name for fields that are clearly the same, e.g. the "name" and "track-name" elements in the example below.

<album>
<artist>
<name>Arcade Fire</name>
</artist>
<track>
<track-name>Intervention</track-name>
</track>
</album>

A much more logical structure is to use the same "name" element in both locations.

<album>
<artist>
<name>Arcade Fire</name>
</artist>
<track>
<name>Intervention</name>
</track>
</album>

Abbreviating Element/Attribute Names
Yes, XML is very verbose and files can become very big, but abbreviating element or attribute names to save space is not a good idea.

<a>
<n>Arcade Fire</n>
</a>

Firstly, if you're concerned about size, why are you using XML? Secondly, XML contains a lot of repeated text, so it compresses very well. This XML is definitely not human readable, and any code that works with it will be more difficult to maintain.
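
The claim that XML compresses well is easy to check for yourself. The sketch below builds a made-up catalogue of artists with full-length element names and gzips it with java.util.zip. The class name, the sample data and the size of the catalogue are all assumptions for illustration, and real ratios will depend on your data.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

final class XmlCompressionDemo {

    // Builds a repetitive document with descriptive, unabbreviated names.
    static String sampleCatalogue(int artists) {
        StringBuilder xml = new StringBuilder("<artists>");
        for (int i = 0; i < artists; i++) {
            xml.append("<artist><name>Artist ").append(i)
               .append("</name><country>Canada</country></artist>");
        }
        return xml.append("</artists>").toString();
    }

    // Gzips a byte array in memory and returns the compressed bytes.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = sampleCatalogue(1000).getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        System.out.println("raw bytes:     " + raw.length);
        System.out.println("gzipped bytes: " + packed.length);
    }
}
```

Because the repeated tag names are what the compressor removes most readily, shortening them by hand saves far less on the wire than it costs in readability.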

Key Value Lookup
The worst XML schema I have ever seen. I think an example is enough of a description:

<root>
<entry>
<key>Artist</key>
<value>Arcade Fire</value>
</entry>
<entry>
<key>Country</key>
<value>Canada</value>
</entry>
</root>

This really flies in the face of everything that XML represents; you may as well just have a plain text file.

Conclusion

The two rules to stick to when designing an XML schema are:

  1. Ensure the XML is machine readable
  2. Ensure the XML is human readable

Please, if you're ever designing an XML schema, remember that both machines and humans need to be able to read the XML easily. Think about it a little and don't go with whatever's easiest right now, otherwise others will have to live with your crimes.
