
Counting Citations in US Law (in XML)


The U.S. Congress recently released a series of XML documents containing U.S. laws. The structure of these documents allows us to find which sections of the law are most commonly cited. Examining which citations occur most frequently allows us to see what Congress has spent the most time thinking about.

Citations occur for many reasons: as justification for an addition or omission in subsequent laws, or as clarifications, amendments, or repeals. As we might expect, the most commonly cited sections involve the IRS (Income Taxes, specifically), Social Security, and Military Procurement.

To arrive at this result, we must first see how the U.S. Code is laid out. The laws are divided into a hierarchy of units, which allows anything from an entire title to an individual sentence to be cited. Each unit has an “id” and an “identifier”. The identifier (for example “/us/usc/t2/s2102/b”) is used as a citation reference within the XML documents, and has a different form from the citations used by the legal community, which look like “25 USC Chapter 21 § 1901”.

The XML hierarchy defines seventeen different levels which can be cited: ‘title’, ‘subtitle’, ‘chapter’, ‘subchapter’, ‘part’, ‘subpart’, ‘division’, ‘subdivision’, ‘article’, ‘subarticle’, ‘section’, ‘subsection’, ‘paragraph’, ‘subparagraph’, ‘clause’, ‘subclause’, and ‘item’.

We can use a simple XPath expression to retrieve one of these, like section:

§ 104. Federal Highway Administration

(a) The Federal Highway Administration is an administration in the Department of Transportation.
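
As a minimal sketch of that kind of lookup, the snippet below runs a namespace-qualified path against a small hand-written fragment. The fragment is an assumption for illustration: it borrows the USLM namespace and the “num” and “heading” elements used in this article, but the `content` element, the identifiers, and the enclosing title are simplified stand-ins, not copied from the real files.

```python
import xml.etree.ElementTree as ET

# Namespace used by the U.S. Code XML files, in ElementTree's {uri}tag form.
NS = '{http://xml.house.gov/schemas/uslm/1.0}'

# Hand-written, simplified fragment (hypothetical; real files carry more detail).
SAMPLE = """<title xmlns="http://xml.house.gov/schemas/uslm/1.0" identifier="/us/usc/t49">
  <num>Title 49</num><heading>TRANSPORTATION</heading>
  <section identifier="/us/usc/t49/s104">
    <num>§ 104.</num><heading>Federal Highway Administration</heading>
    <subsection identifier="/us/usc/t49/s104/a">
      <num>(a)</num>
      <content>The Federal Highway Administration is an administration in the Department of Transportation.</content>
    </subsection>
  </section>
</title>"""

root = ET.fromstring(SAMPLE)

# './/' searches all descendants; the tag must carry the namespace prefix.
sections = root.findall('.//' + NS + 'section')
first = sections[0]

# "num" and "heading" are direct children of the section element.
label = first.find(NS + 'num').text + ' ' + first.find(NS + 'heading').text
print(first.get('identifier'), '|', label)
```

Without the `{…}` prefix the path matches nothing, because every element in the document lives in the USLM namespace.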

A portion of the human-readable citation is contained in “num”. To retrieve a citation that a lawyer would recognize, we need to look at “num” for the parent elements as well.

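# ElementTree has shipped in the standard library since Python 2.5.
import xml.etree.ElementTree as ET
import os

dir = "G:\\us_code\\xml_uscAll@113-21"

def getParent(parent_map, elt, idx):
  # Walk idx levels up from elt and return that ancestor's "num heading" text.
  try:
    parent = elt
    for i in range(idx):
      parent = parent_map.get(parent)

    # Parentheses let the expression span several lines.
    return (
      parent.findall('{http://xml.house.gov/schemas/uslm/1.0}num')[0].text +
      ' ' +
      parent.findall('{http://xml.house.gov/schemas/uslm/1.0}heading')[0].text)
  except (AttributeError, IndexError, TypeError):
    # No such ancestor, or it has no usable num/heading child.
    return "--No Heading--"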

Once we find the parent, we need to traverse all the way up the tree:

def getTree(parent_map, t):
  tree = []
  parent = ""
  idx = 0
  while (parent != "--No Heading--"):
    parent = getParent(parent_map, t, idx)
    tree.append(parent)
    idx += 1
  return tree
 
usc26.xml: Title 26— Subtitle A— CHAPTER 1—
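
To see the upward walk in isolation, here is a sketch that runs on an inline fragment rather than the downloaded files. It is a variant, not the functions above: `build_parent_map`, `label`, and `citation_path` are hypothetical names, the fragment is hand-written and simplified, and levels without a “num” or “heading” are skipped instead of ending the walk.

```python
import xml.etree.ElementTree as ET

NS = '{http://xml.house.gov/schemas/uslm/1.0}'

# Hand-written, simplified fragment (hypothetical stand-in for a real file).
SAMPLE = """<title xmlns="http://xml.house.gov/schemas/uslm/1.0">
  <num>Title 2</num><heading>THE CONGRESS</heading>
  <chapter>
    <num>CHAPTER 30</num><heading>OPERATION AND MAINTENANCE OF CAPITOL COMPLEX</heading>
    <section identifier="/us/usc/t2/s2102">
      <num>§ 2102.</num><heading>Duties of Commission</heading>
      <subsection identifier="/us/usc/t2/s2102/b">
        <num>(b)</num><heading>Issuance and publication of regulations</heading>
      </subsection>
    </section>
  </chapter>
</title>"""


def build_parent_map(root):
    """ElementTree stores no parent pointers, so map each child to its parent."""
    return {child: parent for parent in root.iter() for child in parent}


def label(elt):
    """Return 'num heading' for an element, or None if it has neither."""
    parts = []
    for tag in ('num', 'heading'):
        child = elt.find(NS + tag)
        if child is not None and child.text:
            parts.append(child.text.strip())
    return ' '.join(parts) if parts else None


def citation_path(parent_map, elt):
    """Collect labels from elt up to the root, innermost first."""
    path = []
    while elt is not None:
        text = label(elt)
        if text:
            path.append(text)
        elt = parent_map.get(elt)  # None once we step past the root
    return path


root = ET.fromstring(SAMPLE)
parent_map = build_parent_map(root)
subsection = root.find('.//' + NS + 'subsection')
path = citation_path(parent_map, subsection)
for line in path:
    print(line)
```

The walk ends when the parent lookup returns nothing, which happens only at the root, so a level that lacks a heading does not cut the path short.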

This forms the basis for a function which builds a citation index: a list of every XML node that can be used in a citation, along with its human-readable citation and name. This takes some time, so if you reproduce this effort, you may want to save the results to a file.

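dir = "G:\\us_code\\xml_uscAll@113-21"
urls = {}

def findElements(xpath, urls):
  # Index every element matching xpath, keyed by its 'identifier' attribute.
  for root, dirs, files in os.walk(dir):
    for f in files:
      if f.endswith('.xml'):
        # Join with root so files in subdirectories resolve correctly.
        tree = ET.parse(os.path.join(root, f))
        # ElementTree keeps no parent pointers, so build a child -> parent map.
        parent_map = dict((c, p) for p in tree.iter() for c in p)
        sections = tree.findall(xpath)
        for t in sections:
          urls[t.attrib.get('identifier')] = \
            (t.attrib.get('id'),
            getTree(parent_map, t),
            f)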
 
refs = {}
refTypes = ['title', 'subtitle', 'chapter', \
  'subchapter', 'part', 'subpart', 'division', \
  'subdivision', 'article', 'subarticle', 'section', \
  'subsection', 'paragraph', 'subparagraph', 'clause', \
  'subclause', 'item']
 
for ref in refTypes:
  findElements('.//{http://xml.house.gov/schemas/uslm/1.0}' + ref, refs)
 
refs.items()[20]
('/us/usc/t2/s2102/b',
 ('id8a923648-f59b-11e2-8dfe-b6d89e949a2c',
  ['(b)  Issuance and publication of regulations',
   u'\xa7\u202f2102.  Duties of Commission',
   u'Part B\u2014 Senate Commission on Art',
   u'SUBCHAPTER V\u2014 HISTORICAL PRESERVATION AND FINE ARTS',
   u'CHAPTER 30\u2014 OPERATION AND MAINTENANCE OF CAPITOL COMPLEX',
   u'Title 2\u2014 THE CONGRESS',
   '--No Heading--'],
  'usc02.xml'))
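
Since building the index is slow, it can be cached on disk. The following is one possible sketch in Python 3, assuming the index has the shape shown above (identifier mapped to an id, a list of headings, and a file name). The function names `save_index` and `load_index` and the file name `citation_index.json` are hypothetical, and the sample entry is shortened from the output above.

```python
import json
import os
import tempfile


def save_index(index, path):
    """Write the {identifier: (id, headings, filename)} index as JSON."""
    with open(path, 'w', encoding='utf-8') as fh:
        # ensure_ascii=False keeps characters such as the section sign readable.
        json.dump(index, fh, ensure_ascii=False)


def load_index(path):
    """Read the index back; JSON has no tuples, so rebuild them."""
    with open(path, encoding='utf-8') as fh:
        return {key: tuple(value) for key, value in json.load(fh).items()}


# One entry, shortened from the article's sample output.
sample = {
    '/us/usc/t2/s2102/b': (
        'id8a923648-f59b-11e2-8dfe-b6d89e949a2c',
        ['(b)  Issuance and publication of regulations', '--No Heading--'],
        'usc02.xml'),
}

cache_path = os.path.join(tempfile.mkdtemp(), 'citation_index.json')
save_index(sample, cache_path)
restored = load_index(cache_path)
print(restored == sample)
```

JSON is used here because the index holds only strings and lists; `pickle` would also work but is less portable.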

Now that we know how to look up a citation, we need to find the actual citations. Like HTML, the U.S. Code documents use an “a href=” tag to reference a node, as well as “ref href=”. The same XPath technique used above allows us to find the refs:

hrefs = {}
titles = {}
refpath = './/{http://xml.house.gov/schemas/uslm/1.0}ref'
for root, dirs, files in os.walk(dir):
  for f in files:
    if f.endswith('.xml'):
      tree = ET.parse(os.path.join(root, f))
      # Map each href to "<file> <link text>". The dict is keyed by href,
      # so a target cited many times is stored once.
      h = {t.attrib.get('href'): f + ' ' + (t.text or '')
          for t in tree.findall(refpath)}
      # Merge in place instead of rebuilding the whole dict for every file.
      hrefs.update(h)
 
 
hrefs.items()[0]
Out[55]:
('/us/pl/109/280/s601/a/3',
 u'usc29.xml Pub. L. 109\u2013280, title VI, \xa7\u202f601(a)(3)')
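
Because `hrefs` is keyed by href, each distinct target appears in it once, however many times it is cited. If the aim is to weight a target by how often it is referenced, one option is to count every “ref” element with a `Counter`. The sketch below runs on a hand-written fragment; the `content` wrapper, the link text, and the `/us/usc/t42/s1395` href are assumptions for illustration, and `count_refs` is a hypothetical helper.

```python
from collections import Counter
import xml.etree.ElementTree as ET

NS = '{http://xml.house.gov/schemas/uslm/1.0}'

# Hand-written fragment with one target cited twice (hypothetical data).
SAMPLE = """<section xmlns="http://xml.house.gov/schemas/uslm/1.0">
  <content>See <ref href="/us/usc/t42/s1395">section 1395 of title 42</ref>
  and <ref href="/us/pl/109/280/s601/a/3">Pub. L. 109-280</ref>; see also
  <ref href="/us/usc/t42/s1395">section 1395</ref>.</content>
</section>"""


def count_refs(root):
    """Count every ref occurrence, skipping refs without an href attribute."""
    return Counter(
        ref.get('href') for ref in root.iter(NS + 'ref') if ref.get('href'))


counts = count_refs(ET.fromstring(SAMPLE))
for href, n in counts.most_common():
    print(n, href)
```

Counters from several files can be combined with `+` or `update`, which adds counts for a shared href instead of overwriting them.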

We have everything we need to find which sections are commonly cited; we just need to combine the two. Most of the complexity here is in dealing with missing entries (e.g. because a citation can point anywhere in the hierarchy).

from collections import Counter
 
def countCitations(urls, hrefs):
  titles = Counter()
  subtitles = Counter()
  chapters = Counter()
  not_found = []
  for key in hrefs.keys():
    found = urls.get(key)

    # Defaults for hrefs missing from the index or sitting high in the hierarchy.
    title = "None"
    subtitle = "None"
    chapter = "None"
    file = "None"

    if found is not None:
      (id, history, file) = found
      # history[-1] is the "--No Heading--" sentinel, so the top-level
      # heading (normally the title) is at [-2], the next level at [-3], etc.
      if len(history) >= 2:
        title = history[-2]
        if len(history) >= 3:
          subtitle = history[-3]
          if len(history) >= 4:
            chapter = history[-4]
    else:
      not_found.append(key)

    # Each distinct href adds one to the counter at every roll-up level.
    titles[file + ": " + title] += 1
    subtitles[file + ": " + title + " - " + subtitle] += 1
    chapters[file + ": " + title + " - " + subtitle + " - " + chapter] += 1
  return (titles, subtitles, chapters, not_found)
 
(t, s, c, none) = countCitations(refs, hrefs)

This returns results that are rolled up to titles, subtitles, and chapters. Note how, as we drill down, each level clarifies what was most important in the level above it. Within “The Public Health and Welfare,” we see that Social Security is important, and within “Armed Forces,” we see that “General Military Law – Personnel” is important.
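
The listings that follow are the largest entries of each counter. One way such a listing could be printed is with `Counter.most_common`, sketched here; the keys are abbreviated, the counts are copied from the title-level results in this article, and `top_lines` is a hypothetical helper.

```python
from collections import Counter


def top_lines(counter, n=10):
    """Format the n largest entries as 'name: count' lines."""
    return ['%s: %d' % (name, count) for name, count in counter.most_common(n)]


# Abbreviated keys; counts taken from the title-level results in the article.
titles = Counter({
    'usc10.xml: Title 10': 2078,
    'None: None': 359662,
    'usc42.xml: Title 42': 6679,
})

lines = top_lines(titles, 2)
for line in lines:
    print(line)
```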

None: None: 359662
usc42.xml: Title 42— THE PUBLIC HEALTH AND WELFARE: 6679
usc10.xml: Title 10— ARMED FORCES: 2078
usc16.xml: Title 16— CONSERVATION: 2068
usc42.xml: None: 1965
usc15.xml: Title 15— COMMERCE AND TRADE: 1796
usc07.xml: Title 7— AGRICULTURE: 1689
usc22.xml: Title 22— FOREIGN RELATIONS AND INTERCOURSE: 1684
usc20.xml: Title 20— EDUCATION: 1660
usc26.xml: Title 26— INTERNAL REVENUE CODE: 1610

None: None - None: 359662
usc42.xml: None - None: 1965
usc42.xml: Title 42— THE PUBLIC HEALTH AND WELFARE - CHAPTER 7— SOCIAL SECURITY: 1573
usc10.xml: Title 10— ARMED FORCES - Subtitle A— General Military Law: 1490
usc42.xml: Title 42— THE PUBLIC HEALTH AND WELFARE - CHAPTER 6A— PUBLIC HEALTH SERVICE: 1220
usc26.xml: Title 26— INTERNAL REVENUE CODE - Subtitle A— Income Taxes: 841
usc05.xml: None - None: 736
usc10.xml: None - None: 639
usc16.xml: Title 16— CONSERVATION - CHAPTER 1— NATIONAL PARKS, MILITARY PARKS, MONUMENTS, AND SEASHORES: 616
usc20.xml: Title 20— EDUCATION - CHAPTER 70— STRENGTHENING AND IMPROVEMENT OF ELEMENTARY AND SECONDARY SCHOOLS: 531

None: None - None - None: 359662
usc42.xml: None - None - None: 1965
usc26.xml: Title 26— INTERNAL REVENUE CODE - Subtitle A— Income Taxes - CHAPTER 1— NORMAL TAXES AND SURTAXES: 817
usc10.xml: Title 10— ARMED FORCES - Subtitle A— General Military Law - PART II— PERSONNEL: 738
usc05.xml: None - None - None: 736
usc42.xml: Title 42— THE PUBLIC HEALTH AND WELFARE - CHAPTER 7— SOCIAL SECURITY - SUBCHAPTER XVIII— HEALTH INSURANCE FOR AGED AND DISABLED: 663
usc10.xml: None - None - None: 639
usc38.xml: None - None - None: 497
usc10.xml: Title 10— ARMED FORCES - Subtitle A— General Military Law - PART IV— SERVICE, SUPPLY, AND PROCUREMENT: 496
usc15.xml: None - None - None: 428
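
Many hrefs land in the “None” rows because they match no identifier in the index. One possible way to reduce that, sketched below and not tested against the real data, is to fall back to the nearest enclosing identifier by trimming path components from the right. `resolve` is a hypothetical helper, and the `/b/1` href is invented for the example; the other two strings come from the output shown earlier.

```python
def resolve(index, href):
    """Return href or its nearest indexed ancestor, or None if nothing matches."""
    parts = href.split('/')
    # parts[0] is the empty string before the leading slash; stop once only it is left.
    while len(parts) > 1:
        candidate = '/'.join(parts)
        if candidate in index:
            return candidate
        parts.pop()  # drop the most specific component and retry
    return None


# A one-entry stand-in for the citation index built earlier.
index = {'/us/usc/t2/s2102/b': 'placeholder entry'}

exact = resolve(index, '/us/usc/t2/s2102/b')
nearest = resolve(index, '/us/usc/t2/s2102/b/1')    # hypothetical deeper href
missing = resolve(index, '/us/pl/109/280/s601/a/3')  # a Public Law, not in the index
print(exact, nearest, missing)
```

References to Public Laws (hrefs beginning with `/us/pl/`) would still go unresolved, since those documents are outside the U.S. Code files indexed here.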

Future work in this area will involve cleaning up the results to remove some of the “None” entries, building a visualization of the results, and training a tagger to recognize the human-readable versions of citations in court documents. In the long run, I hope these developments help make legal information more accessible to everyone, rather than being locked up in expensive databases.


Published at DZone with permission of Gary Sieling, DZone MVB. See the original article here.
