Extracting Regulatory Citations from Textual Content: A Comparison of Regular Expression, spaCy, and a Combination of Both Approaches

This article explores three different approaches to extracting regulatory citations from textual content that can be found in a legal document of an Enforcement Action.

By lokesh vijay kumar · Feb. 21, 23 · Tutorial

Regulatory citations play a crucial role in legal and compliance work because they identify the specific regulations or laws that govern an action or behavior. Extracting them from text is non-trivial: citations appear in many formats and are often written in ways that are hard to identify automatically. In this post, we explore three approaches to extracting regulatory citations from the text of an enforcement action document: regular expressions, the spaCy NLP library, and a combination of both.

Approach 1: Regular Expressions

Regular expressions are a powerful tool for pattern matching and text manipulation. They can be used to extract specific strings of text that match a particular pattern, which makes them a natural choice for extracting regulatory citations from textual content.

The following code provides an example of how to use a regular expression to extract regulatory citations from a piece of text:

Python
 
import re

text = "The Electronic Fund Transfer Act (EFTA), 15 U.S.C. § 1693 et seq., is a federal law that governs electronic fund transfers."
# Regular expression pattern for regulatory citations:
# a title number, "U.S.C." or "C.F.R.", a section (§, §§) or part (pt.) marker, and a number
pattern = re.compile(r"\b\d{1,2}\s(?:U\.S\.C\.|C\.F\.R\.)\s(?:§§?|pt\.)\s\d+")
# Extract regulatory citations from the text
regulatory_citations = pattern.findall(text)
# Print the extracted regulatory citations
print("Regulatory Citations:", regulatory_citations)

Output: Regulatory Citations: ['15 U.S.C. § 1693']


In this example, the pattern matches a title number of one or two digits, a space, the abbreviation "U.S.C." or "C.F.R.", a section marker (§ or §§) or part marker (pt.), and the section or part number. The findall method returns every non-overlapping match in the text as a list, which is stored in regulatory_citations. The sample sentence contains a single citation, so the list has one element; the trailing "et seq." is not captured by this pattern.
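Citations rarely come in a single shape, so it can help to capture the parts of each citation separately. The following is a minimal sketch, assuming only three shapes (a single U.S.C. section, a comma-separated list of sections after §§, and a C.F.R. part); the names CITATION and parse_citations are hypothetical and chosen for illustration, and real documents will contain variants this pattern misses.

```python
import re

# Assumed shapes: "15 U.S.C. § 1693", "12 U.S.C. §§ 5531, 5536", "12 C.F.R. pt. 1005".
CITATION = re.compile(
    r"""
    \b(?P<title>\d{1,2})\s            # title number, e.g. 15
    (?P<code>U\.S\.C\.|C\.F\.R\.)\s   # code abbreviation
    (?P<marker>§§?|pt\.)\s            # section or part marker
    (?P<numbers>\d+(?:,\s\d+)*)       # one number or a comma-separated list
    """,
    re.VERBOSE,
)


def parse_citations(text):
    """Return one dict per citation found in text (hypothetical helper)."""
    results = []
    for m in CITATION.finditer(text):
        results.append({
            "title": int(m.group("title")),
            "code": m.group("code"),
            "numbers": [n.strip() for n in m.group("numbers").split(",")],
        })
    return results


sample = (
    "violations of 15 U.S.C. § 1693 et seq., 12 C.F.R. pt. 1005, "
    "and 12 U.S.C. §§ 5531, 5536."
)
for citation in parse_citations(sample):
    print(citation)
```

Named groups keep the downstream code readable: each field is looked up by name rather than by position, so the pattern can grow new alternatives without renumbering.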

Approach 2: spaCy

spaCy is a popular Python library for natural language processing. It provides a number of tools for text processing, including named entity recognition, part-of-speech tagging, and sentence segmentation. These tools can be used to extract specific types of information from text, including regulatory citations.

The following code provides an example of how to use spaCy to extract regulatory citations from a piece of text:

Python
 
import spacy 
nlp = spacy.load("en_core_web_sm") 
text = "The Electronic Fund Transfer Act (EFTA), 15 U.S.C. § 1693 et seq., is a federal law that governs electronic fund transfers." 
# Process the text with spacy 
doc = nlp(text) 
# Extract regulatory citations from the text 
regulatory_citations = [ent.text for ent in doc.ents if ent.label_ == "LAW"] 
# Print the extracted regulatory citations 
print("Regulatory Citations:", regulatory_citations)

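Output: Regulatory Citations: ['Electronic Fund Transfer Act (EFTA)', '15 U.S.C. § 1693 et seq.']

The entities returned depend on the installed model and its version, so the exact spans may differ. The en_core_web_sm model is a general-purpose English model and is not trained specifically on legal citations.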


Approach 3: Combination of Both Approaches

In some cases, the regular expression-based approach may miss regulatory citations whose format the pattern does not anticipate, while the spaCy-based approach may produce false positives. A combination that leverages the strengths of both methods can give a more precise result. It works as follows:

  1. First, we use the spaCy-based approach to identify potential citations in the text. This approach can handle a variety of citation formats and variations, so it's a good starting point.
  2. Then, we use regular expressions to refine the results and extract specific information about the citations, such as the title number and the section number.

Here's an example of how this approach can be implemented in Python using both the spaCy library and the re (regular expression) module:

Python
 
import re
import spacy

nlp = spacy.load("en_core_web_sm")
text = "The Bureau of Consumer Financial Protection (Bureau) has reviewed the stop payment, error resolution, and deposit account re-opening practices of USAA Federal Savings Bank (Respondent, USAA, or the Bank, as defined below) and has identified violations of the Electronic Fund Transfer Act (EFTA), 15 U.S.C. § 1693 et seq., Regulation E, 12 C.F.R. pt. 1005, and the Consumer Financial Protection Act of 2010 (CFPA), 12 U.S.C. §§ 5531, 5536."
# Use spaCy to identify potential citations
doc = nlp(text)
for ent in doc.ents:
    if ent.label_ == "LAW":
        print(ent.text)
# Use a regular expression to extract specific information about the citations:
# title number, code (U.S.C. or C.F.R.), and section or part number(s)
reg_ex = re.compile(r"(\d{1,2})\s(U\.S\.C\.|C\.F\.R\.)\s(?:§§?|pt\.)\s(\d{1,4}(?:,\s\d{1,4})*)")
for match in reg_ex.finditer(text):
    print(match.group(0))


This code prints the spaCy entities first, followed by the regular expression matches. The entity lines depend on the installed model version and may differ; the last three lines come from the regular expression:

 
Electronic Fund Transfer Act (EFTA) 
15 U.S.C. § 1693 et seq. 
Regulation E 
12 C.F.R. pt. 1005 
Consumer Financial Protection Act of 2010 (CFPA) 
12 U.S.C. §§ 5531, 5536 
15 U.S.C. § 1693
12 C.F.R. pt. 1005
12 U.S.C. §§ 5531, 5536


This example can be refined further to clean the noise in the output.
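Most of the noise consists of duplicates and of fragments that repeat part of a longer citation. The following is a minimal sketch of one way to clean it, assuming the spaCy entity strings have already been collected into a plain list; the helper names clean_citations and extract, and the sample entity list, are hypothetical and used only for illustration.

```python
import re

CITATION = re.compile(
    r"\b\d{1,2}\s(?:U\.S\.C\.|C\.F\.R\.)\s(?:§§?|pt\.)\s\d+(?:,\s\d+)*"
)


def clean_citations(candidates):
    """Drop duplicates and any candidate that is a fragment of a longer one."""
    unique = []
    for candidate in candidates:
        candidate = " ".join(candidate.split())  # normalise whitespace
        if candidate and candidate not in unique:
            unique.append(candidate)
    # Keep a candidate only if no other candidate strictly contains it
    return [
        c for c in unique
        if not any(c != other and c in other for other in unique)
    ]


def extract(text, entity_texts):
    """Merge entity strings (e.g. from doc.ents) with regex matches, then clean."""
    regex_hits = [m.group(0) for m in CITATION.finditer(text)]
    return clean_citations(list(entity_texts) + regex_hits)


text = (
    "violations of the Electronic Fund Transfer Act (EFTA), "
    "15 U.S.C. § 1693 et seq., Regulation E, 12 C.F.R. pt. 1005"
)
# Hypothetical entity strings standing in for [ent.text for ent in doc.ents]
entity_texts = [
    "Electronic Fund Transfer Act (EFTA)",
    "15 U.S.C. § 1693 et seq.",
    "15 U.S.C.",
]
for citation in extract(text, entity_texts):
    print(citation)
```

Substring containment is a deliberately simple rule: it removes "15 U.S.C." when "15 U.S.C. § 1693 et seq." is present, but it would also drop a short citation that happens to appear inside an unrelated longer string, so a production version would more likely compare character offsets.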

Combining both approaches can give a more precise result than either method alone, because it pairs spaCy's ability to handle varied citation formats with the precise information extraction of regular expressions.
