Over a million developers have joined DZone.

My Neo4j Summer Road Trip to the World of Healthcare (Part 2)

In this, the second part of this series, see how we can analyze the data we originally imported.

· Database Zone

Sign up for the Couchbase Community Newsletter to stay ahead of the curve on the latest NoSQL news, events, and webinars. Brought to you in partnership with Coucbase.

Welcome back to my Neo4j summer adventure. In my previous post, I gathered all the available data and explored how to model the data into a healthcare graph. Starting with this post, I will be focusing on loading the data into the healthcare graph.

As a Neo4j newbie, before starting ETL, I researched on methods people have been using to transform XML data into Neo4j graph data. Most of them converted the XML files to CSV first then loaded the data into Neo4j. While I was teaching myself Cypher, I discovered that APOC allows me to extract information from XML and load them directly into a graph. However, there are few blogs out there that document this procedure, so why don't I try the new way – it won't be a real adventure without some fun explorations will it?

In this week's blog, I want to show you how I load XML files into a graph using APOC. This week, I will be working with lobbying disclosures and contributions data, and by the end of this blog, you will see some fun queries I created to gain interesting insights into how the healthcare system is influenced by the lobbying system.

Now let's begin our adventure for this week!

1. Getting Ready

Download the data into a directory. In this project, I am working on XML data from 2013. The contributions contain 87.5MB of data and disclosures contain 894.9MB data. You can download the same data here:

Download the latest APOC:

  • Download the binary jar here.
  • Place the file into your $NEO4J_HOME/plugins folder.
  • Install Python driver py2neo:

     $ pip install py2neo 

    2. Data Integration

    Now we are ready to go. Though Neo4j is schemaless, having a clear structure of the graph is helpful to determine where to go. It's more like a map or compass, and this is especially true when I need to traverse an XML tree structure to access the child elements. 

    Now let's take a look at the map of where we will be going for this week: 

    Part 2 of using Neo4j to graph the healthcare industry. This week: XML and lobbying disclosures

    Nodes :Issue, :Disclosure, and :Client will be extracted from disclosure XML files, and nodes :Legislator, :Committee, :Contribution, and :Contributor will be extracted from contribution XML files. Both the disclosure and contribution XML data contain information about nodes  :LobbyFirm and :Lobbyist, and I will use a MERGE statement to create :LobbyFirm and :Lobbyist to prevent duplicates.

    Now let me show you how I processed disclosure XML using APOC. (You can find the whole ETL Python code here.)

    A. Accessing Child Elements of XML in APOC

    Let me start off by showing you the structure of the disclosure XML files. 

    The XML file structure of lobbying disclosures

    APOC allows me to access the child elements of <LOBBYINGDISCLOSURE2>. Here is the Cypher statement to extract the properties of :LobbyFirm (in orange):

    CALL apoc.load.xml('file:///2013_1stQuarter_XML/300529228.xml') 
    YIELD value
    WITH [attr in value._children 
    WHERE attr._type in ['organizationName', 'address1', 'city', 'state', 'zip', 'country', 'houseID'] | [attr._type, attr._text]] as pairs 
    CALL apoc.map.fromPairs(pairs) 
    YIELD value as properties
    RETURN properties

    The query returns this:

    An APOC Cypher query on lobbying disclosure data

    The way of calling APOC to extract properties for other nodes is very similar; you can find every single detail of my Python code here. In this project, when creating nodes :Issue and :Lobbyist, I have to deal with more complicated parent-child structures (as you can see from the XML map above, <Lobbyists> and <issueAreaCode> are siblings, and <Lobbyists> has children <Lobbyist>; I maintained this structure in the healthcare graph).

    If you are facing a similar problem, the collect() function will be helpful. I used it to aggregate properties (labeled in yellow and blue) into a list, then access the desired properties by indexing.

    Now let's run the query from the Python driver, I used py2neo in my project:

    query = '''
        CALL apoc.load.xml({file})
        YIELD value
        WITH [attr in value._children
        WHERE attr._type in ['organizationName', 'firstName', 'lastName', 'address1’, 'city', 'state', 'zip', 'country', 'houseID'] | [attr._type,          attr._text]] as pairs
        CALL apoc.map.fromPairs(pairs)
        YIELD value as properties
        RETURN properties
    properties = g.run(query, file=’file:///2013_1stQuarter_XML/300529228.xml’).evaluate()
    print(‘type of properties:', type(properties))


    {'city': 'Austin', 'organizationName': 'Tuggey Fernandez LLP', 'country': 'USA', 'firstName': None, 'houseID': '416750001', 'state': 'TX', 'address1': '611 South Congress Avenue, Suite 340', 'zip': '78704', 'lastName': None, 'address2': None}
    type of properties: 

    Running the Cypher query will return a cursor object. In this case, I know there is only one value, Properties, being returned, so I could call the evaluate() method, which returns the value of the cursor object. As we can see, evaluate() turns the cursor object into a dictionary, which is very easy to work with in Python.

    Knowing how to extract information using APOC and understanding the return value, I next define a Python function that cleans the data and returns a dictionary of properties of :LobbyFirm. Cypher supports some powerful string processing functions which can also be used to clean the data.

    One more thing to notice here is that I only extract properties if the data is valid, NULL value properties should not be stored in Neo4j.

    def get_LobbyFirm_property(file):
        :param file: the xml file path to be parsed
        :return: a dict of properties of LobbyFirm
        query = '''
            CALL apoc.load.xml({file})
            YIElD value
            WITH [attr in value._children
            WHERE attr._type in ['organizationName', 'firstName', 'lastName', 'address1',
            'address2', 'city', 'state', 'zip', 'country',
            'houseID'] | [attr._type, attr._text]] as pairs
            CALL apoc.map.fromPairs(pairs)
            YIELD value as properties
            RETURN properties
        pre_property = g.run(query, file=file).evaluate()
        property = {}
        # name
        if pre_property['organizationName']== None and pre_property['firstName'] != None and pre_property['lastName'] != None :
            property['name'] = str(pre_property['firstName'] + ' ' + pre_property['lastName'])
        elif pre_property['organizationName'] != None:
            property['name'] = pre_property['organizationName']
        if pre_property['address1']!= None and pre_property['address2']!= None:
            property['address'] = str(pre_property['address1'] + ' ' + pre_property['address2'])
        elif pre_property['address1']!= None and pre_property['address2']== None:
            property['address'] = pre_property['address1']
        if pre_property['city'] != None:
            property['city'] = pre_property['city']
        if pre_property['state'] != None:
            property['state'] = pre_property['state']
        if pre_property['country'] == None:
            property['country'] = 'USA'
            property['country'] = pre_property['country']
        # zip
        if pre_property['zip'] != None:
            property['zip'] = pre_property['zip']
        # houseOrgId
        if pre_property['houseID'] != None:
            property['houseOrgId'] = pre_property['houseID'][:5]
        return property

    B. Use MERGE and CREATE Statements to Load Data into Neo4j

    def create_LobbyFirm_node(properties):
        :param properties: a dict of properties of the node
        :return: node internal id
        query = '''
            MERGE (lbf: LobbyFirm {houseOrgId:{houseOrgId}})
            ON CREATE SET lbf = {properties}
            RETURN id(lf)
        index = '''
        CREATE INDEX ON: LobbyFirm(houseOrgId)
        id = g.run(query, houseOrgId = properties['houseOrgId'], properties=properties).evaluate()
        return id

    I decide to create the :LobbyFirm node by merging on houseOrgId, which is a unique 5-digit number for each lobbying firm. The MERGE statement prevents duplicates in the graph. It's a good practice to merge on only one property of the node. When merging on more than one property, only nodes that match ALL the values will be returned; otherwise, a duplicate will be created.

    For example, MERGE (lbf: LobbyFirm {houseOrgId: "12345", firmName: "ABCD"}) is like saying "Find me the node labeled :LobbyFirm  AND houseOrgId is 12345 AND firmName is ABCD. If no property is matched, create a new node with houseOrgId is 12345 and firmName is ABCD." 

    In this case, there may be more than one node being created that has the same houseOrgId. Here is a great blog post that cleared up my confusion, such as when to use MERGE  vs. CREATE.

    C. Create Relations Using Internal Node ID

    I have 72,002 disclosure files to be processed. As my Python code loops through each disclosure file, it needs to create relations among these nodes. A relationship is generated only when the two nodes are created within the same iteration. The graph created at each iteration looks like this: 

    A graph data model for lobbying disclosure data

    Notice in the previous code where I created the :LobbyFirm I also returned the ID of the node. This internal ID allows me to identify the new nodes created at that iteration, and thus, I am able to generate relations for these nodes.

    lf_dc_rel = g.run(
        '''MATCH (dc:Disclosure) WHERE id(dc) = {dc_id}
        MATCH (lf:LobbyFirm) WHERE id(lf) = {lf_id}
        CREATE (lf)-[r:FILED]->(dc)
        ''', dc_id = dc_id, lf_id = lf_id

    Here dc_id and lf_id are passed as parameters, each of them represents the ID of :Disclosure node and :LobbyFirm node.

    There are some limitations when using internal node id to identify nodes. You need to be careful especially when you delete an existing node. The id for the deleted node will be reused when creating a new node.

    In this case, you can use a plugin called UUID which "assigns UUIDs to newly created nodes and relationships in the graph and makes sure nobody can (accidentally or intentionally) change or delete them.

    3. Visualize the Healthcare Graph in Neo4j

    Each year, corporations spend billions of dollars to gain access to government decision-makers, and healthcare organizations are no exception. One of the purposes of my project is to connect these organizations with the legislators by modeling the lobbying system.

    Now that I have all of the lobbying data loaded into Neo4j, I would love to find out how the healthcare industry (or any other group) is influenced by the lobbying system.

    First, let's figure out the general lobbying issues in 2013:

    MATCH (n: Issue) RETURN distinct(n.issueAreaCode) ORDER BY n.issueAreaCode

    The query returns 79 unique issue area codes in the disclosures. You can refer to the general lobbying issue code to find out what these issues are. Here are the top 10 general lobbying issues in 2013:

    MATCH (n:Issue) RETURN n.issueAreaCode, count(n) as num order by num DESC LIMIT 10

    The top ten issues in healthcare lobbying

    HCR (Health Issues) and MMM (Medicare/Medicaid) are the two areas that I am most interested in, and we can see there were 9988 HCR issues and 5016 MMM issues being lobbied in 2013.

    Just out of personal curiosity, I also wanted to know how many issues being lobbied are related to gun control in 2013, and here is a screenshot for my discovery: 

    Results of a Cypher query on gun control and healthcare

    Second, find me the lobbying firms and lobbyists who lobby for Medicare and Medicaid issues:

    MATCH (lf:LobbyFirm)<-[:WORKS_AT]-(lob: Lobbyist)-[:LOBBIES]->(iss: Issue {issueAreaCode:'MMM'})
    RETURN lf.houseOrgId as Frim_ID, lob.firstName as First_Name, lob.lastName as Last_Name, iss.issueAreaCode as Issue, iss.description as Description LIMIT 8

    Healthcare lobbying issues related to Medicare and Medicaid

    Next, tell me who the clients are that signed disclosures with lobby firms for Medicare and Medicaid issues.

    MATCH (cl:Client)-[:SIGNED]->(dc:Disclosure)-[:HAS]->(iss:Issue{issueAreaCode: "MMM"})
    WITH cl, dc, iss
    MATCH (lf:LobbyFirm)-[:FILED]->(dc), (lob:Lobbyist)-[:LOBBIES]->(iss)
    RETURN distinct(cl.clientName) as Client, lf.houseOrgId as Firm_ID, lob.firstName as First_Name, lob.lastName as Last_Name LIMIT 25

    Healthcare lobbying disclosures filed for Medicare and Medicaid

    To visualize the result in a graph:

    A graph visualization of healthcare lobbying disclosures filed for Medicare and Medicaid

    We can see there are five clients who signed a disclosure with lobby firm No. 31603 for Medicare-related issues. All of the relevant issues are lobbied by Marshall.

    Now, let's find out — for these lobbyists and lobby firms who are involved in lobbying Medicare and Medicaid issues — how much they contributed to government leaders and who received these contributions?

    MATCH (lf:LobbyFirm)<-[:WORKS_AT]-(lob: Lobbyist)-[:LOBBIES]->(iss: Issue {issueAreaCode:'MMM'})
    WITH lob, lf
    MATCH (lob)-[:FILED]->(cb:Contribution)-[:MADE_TO]->(com:Committee)-[:FUNDS]->(leg:Legislator)
    OPTIONAL MATCH (lf)-[:FILED]->(cb)-[:MADE_TO]->(com)-[:FUNDS]->(leg)
    RETURN lf.city as City, lf.houseOrgId as Firm_ID, lf.name as Firm_Name, 
    lob.firstName as FirstName, lob.lastName as LastName, cb.amount as Amount, cb.date as Date, leg.name as Legislator LIMIT 50

    Lobbyist contributions related to Medicare and Medicaid for both who and how much they contributed

    What does the result look like in our healthcare graph?

    A graph visualization of healthcare lobbying contributions related to Medicare and Medicaid

    It is much easier to read the results as a graph in Neo4j!

    Finally, how are healthcare organizations connected to legislators?

    MATCH (cl:Client{clientName:'Pharmaceutical Research and Manufacturers of America (PhRMA)'})-[:SIGNED]->(dc:Disclosure)-[:HAS]->(iss:Issue{issueAreaCode:'MMM'})<-[:LOBBIES]-(lob:Lobbyist)-[:WORKS_AT]->(lf:LobbyFirm)
    WITH cl,dc,iss,lob,lf
    MATCH (lob)-[:FILED]->(cb:Contribution)-[:MADE_TO]->(com:Committee)-[:FUNDS]->(leg:Legislator)
    OPTIONAL MATCH (lf)-[:FILED]->(cb)-[:MADE_TO]->(com)-[:FUNDS]->(leg)
    RETURN cl,dc,iss,lob,lf,cb,com,leg LIMIT 300

    MATCH (cl:Client{clientName:'Pharmaceutical Research and Manufacturers of America (PhRMA)'})-[:SIGNED]->(dc:Disclosure)-[:HAS]->(iss:Issue{issueAreaCode:'MMM'})<-[:LOBBIES]-(lob:Lobbyist)-[:WORKS_AT]->(lf:LobbyFirm)
    WITH cl,dc,iss,lob,lf
    MATCH (lob)-[:FILED]->(cb:Contribution)-[:MADE_TO]->(com:Committee)-[:FUNDS]->(leg:Legislator)
    OPTIONAL MATCH (lf)-[:FILED]->(cb)-[:MADE_TO]->(com)-[:FUNDS]->(leg)
    RETURN cl,dc,iss,lob,lf,cb,com,leg LIMIT 300

    A graph of connections between healthcare lobbyists

    This looks amazingly interesting. Let's take a closer look at the graph:

    A closer look at the graph of connections between healthcare lobbyists

    From the graph, I can tell that in 2013, the lobbyist Drew Goesl lobbied a Medicare issue for Pharmaceutical Research and Manufacturers of America (PhRMA), which specifically focuses on "Legislative issues related to access to pharmaceuticals, including Medicare Part D, and Children's Health Insurance Program (CHIP), rebates in Medicaid and for dual-eligibles; comparative effectiveness; 340B Drug Program; Medicare Part B prescription drug reimbursement, and related provisions."

    During the same year, the lobbyist Drew Goesl made contributions to several committees who fund legislators including James Lee Witt, Patrick Murphy, William Lewis Owens, Mike McIntryre, Mark Pryor, Corey Booker, John Larson, Linda Forrester, Edward Perlmutter, James Matheson, Joseph Crowley, Harry Reird, Susan DelBene, Scott Peters, and Edward J. Markey.

    Due to the data limitation, I cannot draw a conclusion that PhRMA and the legislators mentioned above have direct connections. However, the healthcare graph is helpful for the public to trace and integrate information just like this.

    You also might have noticed there is a bug in my model: I have tons of duplicated nodes for the same legislator. This is because the data is not consistent. The real world data is not as friendly and tidy as might be the case in an academic scenario.


    In the next few blog posts, I will demonstrate how to process strings and how to match nodes when you have messy and limited data sources. Next week, I will start to work on provider prescription data and will show you some tricks I used to load the large CSV files that I downloaded from the FDA and CMS websites. I hope you enjoyed the second post in this series — stay tuned for more excitements to come! 

    The Getting Started with NoSQL Guide will get you hands-on with NoSQL in minutes with no coding needed. Brought to you in partnership with Couchbase.

    neo4j ,analysis

    Published at DZone with permission of Yaqi Shi, DZone MVB. See the original article here.

    Opinions expressed by DZone contributors are their own.

    The best of DZone straight to your inbox.

    Please provide a valid email address.

    Thanks for subscribing!

    Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

    {{ parent.title || parent.header.title}}

    {{ parent.tldr }}

    {{ parent.urlSource.name }}