Crimes Against XML
Integration is an important part of any B2B application, and XML is a common (IMHO good) way of passing data between systems. Generally one side has to do the integration dirty work. Normally this is the smaller company, which for one reason or another is where I always seem to be working. DB schemas can be a pain to change, but XML schemas, particularly ones that are exposed externally, are almost impossible to alter. In my time I have seen schemas from hell that made me want to rip my hair out and ruined what could have been a very nice day. Here is a list of what I consider XML crimes.
Not Providing an XML Schema
Whenever you're working with XML you should always have a schema. It's completely inappropriate and amateurish to distribute XML externally without providing one (preferably an XSD). I'm surprised how often I see people working with XML without a schema. How else are you meant to document and validate your XML? Not to mention code completion and code generation.
Having Elements That Only Contain Attributes
Attributes are intended to describe the properties of an element; basically they're for metadata [1]. Sometimes people get confused and put all the data in attributes.
e.g.
<artist name="Arcade Fire" country="Canada" type="Band"/>
A much better way to present this data would be:
<artist type="band">
  <name>Arcade Fire</name>
  <country>Canada</country>
</artist>
Why is this better? Firstly, it's the way XML was intended to be used; secondly, it's easier to parse; and finally, it's more extensible. A key component of interoperability is consistency, and having some values in attributes and others in elements breaks this rule.
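As a rough illustration of the extensibility point, here is a sketch using Python's standard xml.etree.ElementTree module. The <aliases> element and its content are hypothetical additions made up for this example; they are not part of the schema above. The idea is that an element can later grow nested children without disturbing existing readers, whereas an attribute can only ever hold a single flat string.

```python
import xml.etree.ElementTree as ET

# Element-based layout from the example above, extended with a hypothetical
# <aliases> list to show that elements can grow nested structure later.
ELEMENT_FORM = """
<artist type="band">
  <name>Arcade Fire</name>
  <country>Canada</country>
  <aliases>
    <alias>Example Alias</alias>
  </aliases>
</artist>
"""


def read_artist(xml_text):
    """Return the fields an older consumer cares about; unknown children are ignored."""
    root = ET.fromstring(xml_text)
    return {
        "type": root.get("type"),             # metadata stays in an attribute
        "name": root.findtext("name"),        # data lives in child elements
        "country": root.findtext("country"),
    }


artist = read_artist(ELEMENT_FORM)
print(artist)
```

A reader written before <aliases> existed keeps working, because it only asks for the children it knows about.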
Incorrect Character Encoding
Encoding seems to be one of those things most developers don't understand. Unless you have a very good reason otherwise, all XML should be encoded as UTF-8. You should always add an XML declaration at the top of your documents.
e.g.
<?xml version="1.0" encoding="UTF-8"?>
For better or worse, ASCII characters have the same code points in most commonly used encoding schemes. This means encoding problems generally don't present until a system goes live; commonly '`' is the culprit on Windows platforms. Just because you don't have any foreign/weird characters, or you just don't care about internationalisation, does not mean you have amnesty for this crime.
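Another important point is that adding encoding="UTF-8" to the declaration doesn't mean the document is actually encoded as UTF-8. Older Java versions (before Java 18 made UTF-8 the default) write text files in the encoding scheme of your OS unless told otherwise (typically Windows-1252 on a Western-locale Windows machine), and Java is not the only accomplice in this crime.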
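The mismatch between the declared and the actual encoding can be reproduced in a few lines. This is only a sketch in Python (standard library only), and the helper name parses_cleanly is made up for the example. It shows that an ASCII-only document is byte-for-byte identical in Windows-1252 and UTF-8, which is why the problem hides during testing, while a single accented character is enough to make a document that claims to be UTF-8 unparseable.

```python
import xml.etree.ElementTree as ET

DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>'

# "\u00e9" is the accented e in "Regine"; written as an escape so the
# example does not depend on the encoding of this source file.
ACCENTED = DECLARATION + "<member>R\u00e9gine Chassagne</member>"
ASCII_ONLY = DECLARATION + "<member>Win Butler</member>"


def parses_cleanly(data: bytes) -> bool:
    """Return True if the bytes parse as XML under their declared encoding."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


good = ACCENTED.encode("utf-8")    # bytes match the declaration
bad = ACCENTED.encode("cp1252")    # declaration says UTF-8, bytes are Windows-1252

print(parses_cleanly(good))
print(parses_cleanly(bad))
print(ASCII_ONLY.encode("cp1252") == ASCII_ONLY.encode("utf-8"))
```

The fix on the producing side is to name the encoding explicitly whenever bytes are written, rather than relying on a platform default.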
Elements Containing Delimited Fields
Best described with an example:
<artist>
  <name>Arcade Fire</name>
  <members>Win Butler, Régine Chassagne, Richard Reed Parry, William Butler, Tim Kingsbury, Sarah Neufeld, Jeremy Gara</members>
</artist>
One of the good things about XML is that it can map most data structures, such as lists within lists. Delimiting fields within an element shows that the designer either didn't think enough about the future needs of the schema or doesn't understand XML. The example below is much easier to generate and parse.
<artist>
  <name>Arcade Fire</name>
  <members>
    <member>Win Butler</member>
    <member>Régine Chassagne</member>
    <member>Richard Reed Parry</member>
    <member>William Butler</member>
    <member>Tim Kingsbury</member>
    <member>Sarah Neufeld</member>
    <member>Jeremy Gara</member>
  </members>
</artist>
No additional logic is required to convert this structure into a useful object model. Finally, what happens if one of your fields contains the character that is being used as the delimiter?
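The delimiter question is easy to demonstrate. The sketch below (Python, standard library only) uses made-up data in which one member is written as "Surname, Given name"; the function names are hypothetical. Splitting the delimited field silently produces the wrong list, while the nested layout returns each value intact.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: one member is written "Surname, Given name", so the
# delimiter also appears inside a value.
DELIMITED = "<artist><members>Butler, Win, Sarah Neufeld</members></artist>"
NESTED = (
    "<artist><members>"
    "<member>Butler, Win</member>"
    "<member>Sarah Neufeld</member>"
    "</members></artist>"
)


def members_from_delimited(xml_text):
    """Extra logic the consumer must write, and it still guesses wrong."""
    text = ET.fromstring(xml_text).findtext("members")
    return [part.strip() for part in text.split(",")]


def members_from_nested(xml_text):
    """One element per value: nothing to split, nothing to escape."""
    return [m.text for m in ET.fromstring(xml_text).findall("members/member")]


print(members_from_delimited(DELIMITED))
print(members_from_nested(NESTED))
```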
Not Wrapping Repeated Elements
e.g.
<artist>
  <name>Arcade Fire</name>
  <member>Win Butler</member>
  <member>Régine Chassagne</member>
  <member>Richard Reed Parry</member>
  <member>William Butler</member>
  <member>Tim Kingsbury</member>
  <member>Sarah Neufeld</member>
  <member>Jeremy Gara</member>
  <album>Funeral</album>
  <album>Neon Bible</album>
</artist>
Once again this is a nightmare to parse, and it's not exactly human readable. XML in this structure will also result in an ambiguous schema, which in turn plays havoc with code generators (e.g. JAXB). This is much better:
<artist>
  <name>Arcade Fire</name>
  <members>
    <member>Win Butler</member>
    <member>Régine Chassagne</member>
    <member>Richard Reed Parry</member>
    <member>William Butler</member>
    <member>Tim Kingsbury</member>
    <member>Sarah Neufeld</member>
    <member>Jeremy Gara</member>
  </members>
  <albums>
    <album>Funeral</album>
    <album>Neon Bible</album>
  </albums>
</artist>
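To give a feel for how directly the wrapped layout maps onto an object model, here is a minimal hand-written sketch in Python. The Artist class and parse_artist function are assumptions made for illustration; a code generator such as JAXB would produce something comparable in Java. Each wrapper element becomes one list field.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Artist:
    """Hypothetical object model: one list field per wrapper element."""
    name: str
    members: List[str] = field(default_factory=list)
    albums: List[str] = field(default_factory=list)


def parse_artist(xml_text: str) -> Artist:
    root = ET.fromstring(xml_text)
    return Artist(
        name=root.findtext("name", default=""),
        members=[m.text for m in root.findall("members/member")],
        albums=[a.text for a in root.findall("albums/album")],
    )


SAMPLE = """
<artist>
  <name>Arcade Fire</name>
  <members>
    <member>Win Butler</member>
    <member>Sarah Neufeld</member>
  </members>
  <albums>
    <album>Funeral</album>
    <album>Neon Bible</album>
  </albums>
</artist>
"""

artist = parse_artist(SAMPLE)
print(artist)
```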
Repeating Element Names in Different Contexts
Consistency is something all developers should strive for every day, and in something as self-contained as an XML schema it really shouldn't be that hard. One place XML schemas often fall down is in the names chosen for attributes and elements. In the example below the element named "artist" appears in two places, each with its own meaning and structure.
e.g.
<album>
  <artist>
    <name>Arcade Fire</name>
    <members>
      <member>Win Butler</member>
      <member>Régine Chassagne</member>
      <member>Richard Reed Parry</member>
      <member>William Butler</member>
      <member>Tim Kingsbury</member>
      <member>Sarah Neufeld</member>
      <member>Jeremy Gara</member>
    </members>
  </artist>
  <track>
    <name>Intervention</name>
    <artist>Arcade Fire</artist>
  </track>
</album>
A better structure would be:
<album>
  <artist>
    <name>Arcade Fire</name>
    <members>
      <member>Win Butler</member>
      <member>Régine Chassagne</member>
      <member>Richard Reed Parry</member>
      <member>William Butler</member>
      <member>Tim Kingsbury</member>
      <member>Sarah Neufeld</member>
      <member>Jeremy Gara</member>
    </members>
  </artist>
  <track>
    <name>Intervention</name>
    <track-artist>Arcade Fire</track-artist>
  </track>
</album>
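One way to see the cost of reusing a name, sketched with Python's ElementTree (the artist_shapes helper is made up for this example): a document-wide search for "artist" in the first layout returns two differently shaped things, and every consumer has to work out which is which. With the renamed element the same search returns only the structured artist.

```python
import xml.etree.ElementTree as ET

# Shortened versions of the two album layouts above.
AMBIGUOUS = """
<album>
  <artist>
    <name>Arcade Fire</name>
  </artist>
  <track>
    <name>Intervention</name>
    <artist>Arcade Fire</artist>
  </track>
</album>
"""

RENAMED = AMBIGUOUS.replace(
    "<artist>Arcade Fire</artist>",
    "<track-artist>Arcade Fire</track-artist>",
)


def artist_shapes(xml_text):
    """Classify every <artist> element as 'structured' (has children) or 'text'."""
    root = ET.fromstring(xml_text)
    return ["structured" if len(el) else "text" for el in root.iter("artist")]


print(artist_shapes(AMBIGUOUS))
print(artist_shapes(RENAMED))
```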
A logical and consistent layout will always make the parser simpler and the XML easier for humans to read. The opposite of this crime is just as bad: not using the same element or attribute name for fields that are clearly the same, e.g. the "name" and "track-name" elements in the example below.
<album>
  <artist>
    <name>Arcade Fire</name>
  </artist>
  <track>
    <track-name>Intervention</track-name>
  </track>
</album>
A much more logical structure is to use the same "name" element in both locations.
<album>
  <artist>
    <name>Arcade Fire</name>
  </artist>
  <track>
    <name>Intervention</name>
  </track>
</album>
Abbreviating Element/Attribute Names
Yes, XML is very verbose and files can become very big, but abbreviating element or attribute names to save space is not a good idea.
<a>
  <n>Arcade Fire</n>
</a>
Firstly, if you're concerned about size, why are you using XML? Secondly, XML contains a lot of repeated text, so it compresses very well. This XML is definitely not human readable, and any code that works with it will be more difficult to maintain.
Key Value Lookup
This is the worst XML schema I have ever seen. I think an example is enough of a description:
<root>
  <entry>
    <key>Artist</key>
    <value>Arcade Fire</value>
  </entry>
  <entry>
    <key>Country</key>
    <value>Canada</value>
  </entry>
</root>
This really flies in the face of everything that XML represents; you may as well just have a plain text file.
The two rules to stick to when designing an XML schema are:
- Ensure the XML is machine readable
- Ensure the XML is human readable
Please, if you're ever designing an XML schema, remember that both machines and humans need to be able to read the XML easily. Think about it a little and don't just go with whatever's easiest right now, otherwise others will have to live with your crimes.