Loading US Lobbying Data Into Neo4j
Let's see how to load US lobbying data into Neo4j.
In the United States, the money spent on politics is immense and growing, especially since the Supreme Court struck down limits on federal campaign donations. With the 2020 federal elections still more than a year away, candidates have already raised more than $800M. An alternative way to influence federal politics is through independent, or non-coordinated, spending, also known as soft money, in which political expenditures are not coordinated with candidates but advocate for or against candidates or issues. Dark money is political spending by non-profit organizations whose donors can remain anonymous.
An example of political spending occurring outside of election campaigns is lobbying. While the exact definition of lobbying differs — e.g., federal vs. state, state vs. state — in general, lobbying attempts to influence a politician, legislature, or public official to see a specific view on an issue, law, policy, regulation, etc. Lobbying has been present since the founding of the United States and is protected by the First Amendment as freedom of speech and right to petition. In 2018, almost $3.5B was spent lobbying the federal government.
Prior to 1996, there was little transparency about lobbying the federal government, but the Lobbying Disclosure Act of 1995 increased accountability by defining lobbying in law and, more importantly, requiring that lobbying efforts be documented through filings made to the US Senate and House of Representatives.
Filings Raw Data
On the Senate site, the filing data is available as zipped XML documents by year and quarter since 1999. A full XML file contains approximately 1,000 filings, though some contain fewer (likely depending on the internal system into which filings are uploaded).
<Filing ID="403EFFC2-F7B6-4FB0-AA2F-2584CC25FF3E" Year="2019" Received="2019-04-04T16:39:08.047" Amount="" Type="FIRST QUARTER REPORT" Period="1st Quarter (Jan 1 - Mar 31)">
  <Registrant RegistrantID="1810" RegistrantName="AMERICAN BIRD CONSERVANCY" GeneralDescription="" Address="4301 Connecticut Ave NW #451 Washington, DC 20008" RegistrantCountry="USA" RegistrantPPBCountry="USA" />
  <Client ClientName="AMERICAN BIRD CONSERVANCY" GeneralDescription="" ClientID="12" SelfFiler="TRUE" ContactFullname="STEVE HOLMER" IsStateOrLocalGov="TRUE" ClientCountry="USA" ClientPPBCountry="USA" ClientState="DISTRICT OF COLUMBIA" ClientPPBState="DISTRICT OF COLUMBIA" />
  <Lobbyists>
    <Lobbyist LobbyistName="HOLMER, STEVE" LobbyistCoveredGovPositionIndicator="NOT COVERED" OfficialPosition="" ActivityInformation="A" />
    <Lobbyist LobbyistName="Cipolletti, Jennifer L." LobbyistCoveredGovPositionIndicator="NOT COVERED" OfficialPosition="" ActivityInformation="A" />
  </Lobbyists>
  <GovernmentEntities>
    <GovernmentEntity GovEntityName="Bureau of Land Management (BLM)" />
    <GovernmentEntity GovEntityName="SENATE" />
    <GovernmentEntity GovEntityName="Office of Management &amp; Budget (OMB)" />
    <GovernmentEntity GovEntityName="U.S. Forest Service" />
    <GovernmentEntity GovEntityName="Agriculture, Dept of (USDA)" />
    <GovernmentEntity GovEntityName="U.S. Fish &amp; Wildlife Service (USFWS)" />
    <GovernmentEntity GovEntityName="HOUSE OF REPRESENTATIVES" />
  </GovernmentEntities>
  <Issues>
    <Issue Code="ENVIRONMENT/SUPERFUND" SpecificIssue="Saving Americas Pollinators Act, H.R. 1337, and H.R.230, The Ban Toxic Pesticides Act of 2019" />
    <Issue Code="AGRICULTURE" SpecificIssue="Farm Bill of 2018 implementation issues, rule-makings, and appropriations" />
    <Issue Code="NATURAL RESOURCES" SpecificIssue="Interior Appropriations" />
    <Issue Code="ANIMALS" SpecificIssue="Bird Safe Buildings Act Albatross and Petrel Conservation Act Interior Appropriations for US Fish and Wildlife Service" />
    <Issue Code="REAL ESTATE/LAND USE/CONSERVATION" SpecificIssue="Greater Sage-Grouse Endangered Species Act Exemption" />
  </Issues>
  <ConvictionDisclosure>
    <Conviction ConvictionReported="NO" />
  </ConvictionDisclosure>
</Filing>
Using relational database terminology, the top-level <Filing> node is an associative entity, which pulls together all the related information about the filing — i.e., unique identifier, period represented, dollar amount spent for effort, date of filing — and the child nodes are the specifics.
- Client: Special interest groups — e.g., corporations, non-profits, industries, national and international governments — advocating for/against legislation or regulations under consideration by the federal government.
- Lobbyist: A professional hired by the client to present the client's position and persuade the federal government to adopt it with regard to proposed legislation and regulations.
- Registrant: The organization performing lobbying activities on behalf of the client, registered with the US government. Clients may lobby on their own behalf as both client and registrant or may hire firms who specialize in lobbying and hire lobbyists.
- Government Entity: A department, regulatory agency, commission, or branch of government being lobbied. Multiple entities are usually associated with a single filing; by far the most lobbied entities are the Senate and the House of Representatives.
- Issue: Filings are assigned to general categories to simplify reporting, e.g., Education, Transportation, Natural Resources. Each filing contains a detailed description of the lobbying effort.
The basic process for loading the filings data is fairly straightforward:
- Unzip the XML documents (files) from the downloaded zip file
- Deserialize the XML file
- For each filing, find or create the supporting Neo4j nodes (the registrants, lobbyists, clients, issues, and government entities), and then create a new filing node with the appropriate relationships
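The first step can be sketched with the JDK's own `java.util.zip` classes. This is a minimal illustration, not the project's actual code: the class name `FilingArchiveReader` and the choice to hold each document in memory as a byte array are assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: pulls every XML document out of a downloaded zip file. */
class FilingArchiveReader {

    /**
     * Returns the XML entries of the archive keyed by entry name, in archive order.
     * Directories and entries that do not end in ".xml" are skipped.
     */
    static Map<String, byte[]> readXmlEntries(InputStream zipped) {
        Map<String, byte[]> documents = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(zipped)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && entry.getName().toLowerCase().endsWith(".xml")) {
                    // Reading from a ZipInputStream stops at the end of the current entry.
                    documents.put(entry.getName(), zip.readAllBytes());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return documents;
    }
}
```

A caller would open the downloaded file with something like `Files.newInputStream(path)` and pass each returned document on to the deserialization step. Holding whole documents in memory seems reasonable at roughly 1,000 filings per file, but a streaming variant may be preferable for larger archives.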
The documents are well-formed, so deserializing the XML with JAXB is fairly straightforward. I manually created an XSD representing the XML data and generated annotated POJOs from it for iterating through the filings.
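The generated JAXB classes are not reproduced here, so the following sketch swaps in the JDK's built-in StAX streaming parser to show the same idea of iterating through filings. It is not the JAXB approach described above. The element and attribute names come from the sample filing; `FilingSummary` and `FilingXmlReader` are hypothetical names, and only a handful of attributes are captured.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical holder for the parts of a filing used in this sketch. */
class FilingSummary {
    String id;
    String registrantName;
    String clientName;
    final List<String> lobbyists = new ArrayList<>();
    final List<String> governmentEntities = new ArrayList<>();
    final List<String> issueCodes = new ArrayList<>();
}

/** Hypothetical reader: collects one FilingSummary per <Filing> element. */
class FilingXmlReader {

    static List<FilingSummary> read(Reader xml) {
        List<FilingSummary> filings = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            FilingSummary current = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("Filing")) {
                        current = new FilingSummary();
                        current.id = reader.getAttributeValue(null, "ID");
                    } else if (current != null) {
                        // Child elements carry their data in attributes, as in the sample.
                        switch (name) {
                            case "Registrant":
                                current.registrantName = reader.getAttributeValue(null, "RegistrantName");
                                break;
                            case "Client":
                                current.clientName = reader.getAttributeValue(null, "ClientName");
                                break;
                            case "Lobbyist":
                                current.lobbyists.add(reader.getAttributeValue(null, "LobbyistName"));
                                break;
                            case "GovernmentEntity":
                                current.governmentEntities.add(reader.getAttributeValue(null, "GovEntityName"));
                                break;
                            case "Issue":
                                current.issueCodes.add(reader.getAttributeValue(null, "Code"));
                                break;
                            default:
                                break; // containers such as <Lobbyists> need no handling
                        }
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals("Filing")) {
                    filings.add(current);
                    current = null;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed filings XML", e);
        }
        return filings;
    }
}
```

Compared with JAXB, a streaming reader trades the convenience of generated classes for lower memory use, since only one filing is held at a time.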
I'm using the Neo4j Object Graph Mapping library (OGM) to load the data into Neo4j. Neo4j OGM is an annotation-driven persistence library, similar in concept to Java Persistence API implementations (e.g., Hibernate), in which objects represent nodes and relationships. Neo4j OGM also supports sessions, transactions, and programmatic queries.
After different attempts, here's how I ended up modeling the data:
- Schema Definition. (Filing) nodes far outnumber other node types and are the concept around which all other information is related. As I'll demonstrate in a subsequent article, all useful queries include the [FILED] relationship, making visualizations difficult.
- Performance. The constant querying of Neo4j for existing nodes (clients, registrants, lobbyists, issues, and government entities) slows the load process as more filings are loaded. (Lobbyist) nodes are the next largest by volume. It could be that caching or better database indices are required; this is definitely an area to investigate.
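As an illustration of the caching idea raised above, here is a minimal sketch of an in-memory find-or-create cache. It is an assumption-laden example rather than a tested fix: `NodeCache` is a hypothetical class, the two functions stand in for whatever database lookup and node creation the loader performs, and it assumes a single-threaded load whose node keys (for example, lobbyist names) fit in memory.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical cache: remembers nodes already found or created during a load. */
class NodeCache<K, N> {
    private final Map<K, N> nodes = new HashMap<>();
    private final Function<K, N> findInDatabase; // returns null when no node exists
    private final Function<K, N> create;         // builds a new node for the key
    private int databaseLookups = 0;

    NodeCache(Function<K, N> findInDatabase, Function<K, N> create) {
        this.findInDatabase = findInDatabase;
        this.create = create;
    }

    /** Returns the cached node, else the stored one, else a newly created one. */
    N findOrCreate(K key) {
        N cached = nodes.get(key);
        if (cached != null) {
            return cached;
        }
        databaseLookups++; // only cache misses reach the database
        N node = findInDatabase.apply(key);
        if (node == null) {
            node = create.apply(key);
        }
        nodes.put(key, node);
        return node;
    }

    /** Number of times the database lookup function was invoked. */
    int databaseLookups() {
        return databaseLookups;
    }
}
```

With a cache like this, each distinct key costs at most one database query per load run, however many filings reference it. Whether that is enough, or whether an index on the lookup property matters more, would need to be measured.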
My next article will demonstrate different ways to query the database for interesting facts.
The project source can be found on GitHub.