
U.S. Code Available in XML Format


I saw today that the U.S. Code is available online now for download in a structured format. Ideas for apps, anyone?

It’s worth noting that this was available in some form already, e.g. through Cornell.

To give you a taste of what is there, I extracted a few interesting sections. The first is some metadata: dc:creator is set to USCConverter 1.0, which suggests that they built a custom tool to do the conversion.

  <dc:title>Title 50</dc:title>
  <dc:creator>USCConverter 1.0</dc:creator>
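
As a rough sketch of how these fields could be read programmatically, the snippet below parses a small stand-in document with Python's standard library. It assumes the dc: prefix is bound to the usual Dublin Core namespace, and the `<meta>` wrapper is a placeholder rather than the confirmed layout of the real files.

```python
import xml.etree.ElementTree as ET

# Assumption: dc: is bound to the standard Dublin Core elements namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Stand-in document; the <meta> wrapper is hypothetical, not the real layout.
SAMPLE = f"""
<meta xmlns:dc="{DC_NS}">
  <dc:title>Title 50</dc:title>
  <dc:creator>USCConverter 1.0</dc:creator>
</meta>
"""


def read_metadata(xml_text):
    """Return the Dublin Core fields found directly under the root element."""
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        # ElementTree spells qualified names as "{namespace}localname".
        if child.tag.startswith("{" + DC_NS + "}"):
            local_name = child.tag.split("}", 1)[1]
            fields[local_name] = (child.text or "").strip()
    return fields


metadata = read_metadata(SAMPLE)
print(metadata)
```

On a real download, the namespace URI and the location of the metadata block would need to be checked against the file itself.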

Some of the headings are split up in a somewhat odd fashion, with the number and the text in separate elements.

<num value="50">Title 50—</num><heading>WAR AND NATIONAL DEFENSE</heading>
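
If you want the heading back as a single string, something like the following would do it. This is a minimal sketch: the `<wrapper>` root is made up so that the fragment parses on its own, and it assumes both `num` and `heading` are present.

```python
import xml.etree.ElementTree as ET

# The fragment from the excerpt, inside a hypothetical <wrapper> root so that
# it forms a single well-formed document.
SAMPLE = (
    '<wrapper><num value="50">Title 50—</num>'
    "<heading>WAR AND NATIONAL DEFENSE</heading></wrapper>"
)


def full_heading(element):
    """Return (number, label) built from the <num> and <heading> children."""
    num = element.find("num")
    heading = element.find("heading")
    number = num.get("value", "")
    # The <num> text already ends with the separator, so plain concatenation
    # reproduces the heading as it would be printed.
    label = (num.text or "") + (heading.text or "")
    return number, label


number, label = full_heading(ET.fromstring(SAMPLE))
print(number, label)
```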

What I found most interesting is the ease with which you can extract citations. They seem to have used a GUID (or similar) structure for IDs. These have the nice property of making it easy to generate unique identifiers without cross-referencing other locations. The downside is that they are not “natural” keys, meaning it looks like you can’t infer anything about where you are in the document from the ID alone.

An open question to investigate: how much text would have to change over time in a section to trigger a new ID? This also points out the lack of historical information; you have to manually get old versions of this text. Even armed with this citation information, you need access to court records, as these clarify, amend, or remove sections of law. I’d also be interested to know how and when the text in these documents is updated if a court strikes down a law. Right now, the only way I know of to get access to court records easily is through PACER (or RECAP, which has an archive of a small fraction of them).

<subsection class="indent0" id="idd035386a-f63e-11e2-8470-abc29ba29c4d">
<num value="e">(e)</num>
<heading> Crediting of amounts collected</heading>
<p style="-uslm-lc:I11" class="indent0">Amounts collected under this
section shall be credited to the account or accounts from which costs
associated with such amounts have been or will be incurred, to reimburse
or offset the direct costs of the program referred to in subsection (a).</p>
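
One way to test the GUID guess is to strip the leading "id" and hand the rest to Python's `uuid` module. The sketch below does that for the ID in the excerpt. It parses as a version-1 UUID, a scheme that encodes a generation time rather than anything about position in the document. Whether every ID in the files follows this pattern, and whether the converter actually generated them this way, is an assumption.

```python
import uuid
from datetime import datetime, timedelta

# The id attribute from the excerpt.
SECTION_ID = "idd035386a-f63e-11e2-8470-abc29ba29c4d"


def parse_section_id(raw_id):
    """Parse an id of the form "id" + UUID; raise ValueError otherwise."""
    if not raw_id.startswith("id"):
        raise ValueError(f"unexpected id format: {raw_id!r}")
    return uuid.UUID(raw_id[2:])


def embedded_timestamp(parsed_id):
    """Decode the time carried by a version-1 UUID, or return None."""
    if parsed_id.version != 1:
        return None
    # UUID.time counts 100-nanosecond ticks since 1582-10-15.
    return datetime(1582, 10, 15) + timedelta(microseconds=parsed_id.time // 10)


parsed = parse_section_id(SECTION_ID)
print(parsed.version, embedded_timestamp(parsed))
```

If the decoded times are meaningful, they might hint at when an ID was minted, but they would not answer the question of how much a section has to change before it gets a new one.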

Here is another selection. The interesting bit to me is that they’ve included all sorts of presentational formatting information, such as inline styles. You could probably use this to train an NLP algorithm to extract some form of context, but it would be a lot of work.

<tr style=" -uslm-lc:II01; ">
<td style=" text-align:left; vertical-align:top; 
border-right:1px solid black; padding-right:2pt;">
<p style=" text-align:left; text-indent: -1em; padding-left:1em;">
401 note
(<a href="/us/act/1947-07-26/ch343">Act July 26, 1947, ch. 343</a>,
title III, § 310,
<a href="/us/stat/61/509">61 Stat. 509</a>)
<td style=" text-align:left; vertical-align:top; border-left:1px 
solid black; padding-left: 2pt;">
<p style=" text-align:left; text-indent: -1em; padding-left:1em;">3077</p>
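
To illustrate how easily the citations come out, here is a small sketch that collects the link targets and link text from the first cell of the row above. It uses the standard library's tolerant `html.parser` only because the excerpt leaves tags unclosed. The full files are presumably well-formed XML, in which case an XML parser would be the more natural choice.

```python
from html.parser import HTMLParser

# Part of the table row from the excerpt; tags are left unclosed as in the original.
SAMPLE = """
<tr style=" -uslm-lc:II01; ">
<td style=" text-align:left; vertical-align:top;
border-right:1px solid black; padding-right:2pt;">
<p style=" text-align:left; text-indent: -1em; padding-left:1em;">
401 note
(<a href="/us/act/1947-07-26/ch343">Act July 26, 1947, ch. 343</a>,
title III, § 310,
<a href="/us/stat/61/509">61 Stat. 509</a>)
"""


class CitationCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.citations = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.citations.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_citations(markup):
    collector = CitationCollector()
    collector.feed(markup)
    collector.close()
    return collector.citations


citations = extract_citations(SAMPLE)
for href, text in citations:
    print(href, "->", text)
```

The paths look like stable identifiers for acts and Statutes at Large pages, though how they resolve to actual documents is not something the excerpt shows.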



Published at DZone with permission of Gary Sieling, DZone MVB. See the original article here.

