I saw today that the U.S. Code is available online now for download in a structured format. Ideas for apps, anyone?
It’s worth noting that this was available in some form already, e.g. through Cornell.
To give you a taste of what is there, I extracted a few interesting sections. The first part is some metadata about the document. Note that dc:creator is set to USCConverter 1.0, which suggests that they built a custom tool to do this.
<meta> <dc:title>Title 50</dc:title> <dc:type>USCTitle</dc:type> <docNumber>50</docNumber> <docPublicationName>Online@113-21</docPublicationName> <dc:publisher>OLRC</dc:publisher> <dcterms:created>2013-07-26T05:19:57</dcterms:created> <dc:creator>USCConverter 1.0</dc:creator> </meta>
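As a quick illustration, here is a minimal Python sketch of reading that metadata block. The excerpt above doesn't show namespace declarations, so I've assumed the standard Dublin Core URIs for the dc and dcterms prefixes and added them to the root element; the real file may declare them differently.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Assumption: the standard Dublin Core namespace URIs. The excerpt does not
# show the declarations, so these are added here to make it parse on its own.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

META = """
<meta xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:title>Title 50</dc:title>
  <dc:type>USCTitle</dc:type>
  <docNumber>50</docNumber>
  <docPublicationName>Online@113-21</docPublicationName>
  <dc:publisher>OLRC</dc:publisher>
  <dcterms:created>2013-07-26T05:19:57</dcterms:created>
  <dc:creator>USCConverter 1.0</dc:creator>
</meta>
"""

meta = ET.fromstring(META.strip())

# Pull out a few fields; the timestamp is plain ISO 8601 without a zone.
summary = {
    "title": meta.findtext("dc:title", namespaces=NS),
    "creator": meta.findtext("dc:creator", namespaces=NS),
    "doc_number": meta.findtext("docNumber"),
    "created": datetime.fromisoformat(
        meta.findtext("dcterms:created", namespaces=NS)
    ),
}
print(summary)
```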
Some of the headings are split up in a somewhat odd fashion, with the number and the heading text in separate elements.
<num value="50">Title 50—</num><heading>WAR AND NATIONAL DEFENSE</heading>
What I found most interesting is the ease with which you can extract citations. They seem to have used a GUID (or similar) structure for IDs. These have the nice property of making it easy to generate unique identifiers, without cross-referencing other locations. The downside is that they are not “natural” keys, meaning it looks like you can’t infer anything about where you are.
An open question to investigate: how much text would have to change over time in a section to trigger a new ID? This also points out the lack of historical information; you have to manually get old values for this text. Even armed with this citation information, you need access to court records, as these clarify, amend, or remove sections of law. I’d also be interested to know how and when the text in these documents is updated if a court strikes down a law. Right now, the only way I know of to get access to court records easily is through PACER (or RECAP, which has an archive of a small fraction).
<subsection class="indent0" id="idd035386a-f63e-11e2-8470-abc29ba29c4d" identifier="/us/usc/t50/s3618/e"> <num value="e">(e)</num> <heading> Crediting of amounts collected</heading> <content> <p style="-uslm-lc:I11" class="indent0">Amounts collected under this section shall be credited to the account or accounts from which costs associated with such amounts have been or will be incurred, to reimburse or offset the direct costs of the program referred to in subsection (a).</p>
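Here is a rough Python sketch of pulling the citation details out of that excerpt. Besides the GUID-style id, the element carries an identifier attribute that looks like a path (title, section, subsection); reading it that way is my assumption from this one example. I've added closing tags and shortened the paragraph text so the snippet parses on its own, and the describe helper is just a name I made up. The full file likely declares XML namespaces, in which case the tag names would need to be qualified.

```python
import xml.etree.ElementTree as ET

# Excerpt from above, with closing tags added and the paragraph text
# shortened so that it parses as a standalone document.
SNIPPET = """
<subsection class="indent0" id="idd035386a-f63e-11e2-8470-abc29ba29c4d"
            identifier="/us/usc/t50/s3618/e">
  <num value="e">(e)</num>
  <heading> Crediting of amounts collected</heading>
  <content>
    <p class="indent0">Amounts collected under this section shall be credited</p>
  </content>
</subsection>
"""

def describe(element):
    """Return the opaque id, the path-like identifier, and the heading."""
    identifier = element.get("identifier", "")
    return {
        "id": element.get("id"),
        "identifier": identifier,
        # Split "/us/usc/t50/s3618/e" into its non-empty parts.
        "path": [part for part in identifier.split("/") if part],
        "heading": element.findtext("heading", default="").strip(),
    }

info = describe(ET.fromstring(SNIPPET.strip()))
print(info["path"])
print(info["heading"])
```

If the identifier really is stable across releases, it would make a friendlier key than the id for linking citations, though that would need checking against more than one version of the file.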
Another selection – the interesting bit to me here is they’ve included all sorts of random formatting information. You could probably use this to train an NLP algorithm to extract some form of context, but it would be a lot of work.
<tr style=" -uslm-lc:II01; "> <td style=" text-align:left; vertical-align:top; border-right:1px solid black; padding-right:2pt;"> <p style=" text-align:left; text-indent: -1em; padding-left:1em;"> 401 note (<a href="/us/act/1947-07-26/ch343">Act July 26, 1947, ch. 343</a>, title III, § 310, <a href="/us/stat/61/509">61 Stat. 509</a>) </p> </td> <td style=" text-align:left; vertical-align:top; border-left:1px solid black; padding-left: 2pt;"> <p style=" text-align:left; text-indent: -1em; padding-left:1em;">3077</p> </td> </tr>
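Short of training anything, you can get a fair amount out of a row like this by ignoring the style attributes and keeping only the text and links. The sketch below does that in Python; the row is the one above with the style attributes trimmed for readability, and row_cells and row_links are hypothetical helper names of my own.

```python
import xml.etree.ElementTree as ET

# The table row from above, with the inline styles shortened.
ROW = """
<tr style=" -uslm-lc:II01; ">
  <td style=" text-align:left; vertical-align:top;">
    <p style=" text-align:left;"> 401 note (<a href="/us/act/1947-07-26/ch343">Act July 26, 1947, ch. 343</a>, title III, § 310, <a href="/us/stat/61/509">61 Stat. 509</a>) </p>
  </td>
  <td style=" text-align:left; vertical-align:top;">
    <p style=" text-align:left;">3077</p>
  </td>
</tr>
"""

def row_cells(row):
    """Return each cell's text with whitespace collapsed, styles ignored."""
    return [" ".join("".join(td.itertext()).split()) for td in row.findall("td")]

def row_links(row):
    """Return (href, link text) pairs for every anchor in the row."""
    return [(a.get("href"), a.text) for a in row.iter("a")]

row = ET.fromstring(ROW.strip())
cells = row_cells(row)
links = row_links(row)
print(cells)
print(links)
```

The hrefs appear to follow the same path-like scheme as the identifier attributes elsewhere in the file, so they may be usable as cross-reference keys without any text parsing.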