Over a million developers have joined DZone.

My Neo4j Summer Road Trip to the World of Healthcare (Part 1)

New to graph databases? Looking to get a stronger handle on how to model with them? In the first part of this series, take a look at how you can use Neo4j in the world of healthcare.

· Database Zone

Build fast, scale big with MongoDB Atlas, a hosted service for the leading NoSQL database. Try it now! Brought to you in partnership with MongoDB.

Nothing excites me more than going on an adventurous road trip to explore the unknown. What's even better is this adventure doesn't involve getting sweaty and exhausted to drive long hours in the middle of nowhere.

Yes, I'm talking about a digital road trip. This summer, I will rock the world of healthcare using Neo4j to see how each group of stakeholders connects to each other. You are welcome to join me if you are wondering what interesting discoveries I will find and curious how to use a graph database to uncover your own adventurous world. Now, let's go!

No one can deny that among all the major industries in the US, healthcare is one of the most complicated. It involves a wide range of stakeholders, from providers to drug manufacturers to legislators as well as patients. Coming from a medical background, I am extremely interested in discovering how each party plays their roles and influence the whole industry. 

My journey of exploring the healthcare industry started with a whiteboard. One a sunny, laidback afternoon, my friends and I were discussing who plays a role in the healthcare industry, and we started to draw each party with their relationships on a whiteboard. The graph started like this:

A Data Model for Healthcare

After 10 minutes of discussion, the graph grew into something look like this:

An Extended Data Model of the Healthcare Industry

Now, this looks amazing! The left part of the graph demonstrates the procedures of a clinical encounter between a patient and a provider, and the right part of the graph represents the relationships among different stakeholders. 

My initial curiosity lies in the right side of the graph, and I am pretty sure there are a lot of public datasets that I can use. By creating a graph database, I am hoping to answer interesting questions such as:

  1. Tracking prescription drug abuse history by states, hospitals, and providers.
  2. Tracking Medicare Part D costs by drugs, providers, and states.
  3. Looking for lobbying history, specifically for individual drug manufacturers and lobbyist information.

The next question is, "How am I going to model such a highly interconnected system to answer these questions in a relational database?" While I'm sure I could do it and later query the database by joining all the tables together, I would have to write a 1-2 page long query (!) for each piece of information that I'm curious about.

Well, that doesn't sound fun, and I don't want to spend my summer in writing long and complicated SQL queries. But wait, I could totally represent the system in a graph database just like how I drew it all out on a whiteboard!

I spent the next few days teaching myself how to do data modeling for a graph database using this awesome online tutorial. Check it out if you are interested, but I am going to show you what I did to model the data in a graph database.

First and Foremost, Draw Out the Domain

Searching for the right data is like looking for a piece of a leaf in the wild jungle. By first drawing out the general graph of the system that I am trying to model, it gives me the idea of which types of trees I need to look for, so to speak.

In my project, I quickly identified data related to drugs, drug manufacturers, providers, and lobbyists that are essential for me to construct the system. 

Second, Document What Is Available

Knowing the targets, I quickly started my journey of data collection. There is no magic about how to quickly locate the targets other than using Google. 

After a few days of data collection, along with further studying about each stakeholder of the healthcare industry (Wikipedia always answers my questions), I found a few major sites where I can download the data that I need: 

Provider PrescriptionsCSVhttps://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Part-D-Prescriber.html
Provider Enumeration SystemCSVhttp://download.cms.gov/nppes/NPI_Files.html
FDA Drug CodesCSVhttp://www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm
Drug ManufacturersCSVhttp://www.fda.gov/drugs/informationondrugs/ucm135778.htm
Lobbying DisclosuresXMLhttp://disclosures.house.gov/ld/ldsearch.aspx
Lobbying ContributionsXMLhttp://disclosures.house.gov/lc/lcsearch.aspx
Congress, Legislators & BillsOtherhttps://www.govtrack.us/developers

Modeling data in a graph database requires a thorough understanding of both the context as well as the character of the data. Having the data sources helps me to better define the domain of my model:

  • A healthcare provider prescribes drugs.
  • A drug manufacturer produces drugs.
  • The drug manufacturer hires a lobbying firm, which files a disclosure.
  • The lobbying disclosure relates to one or several issues.
  • The lobbying firm makes contributions to a Congress representative's campaign.

Third, Answer Two Questions to Refine Your Model

Having a definition of the domain is still not sufficient to identify the nodes and relationships in a graph model. It's worth taking a moment to look in detail of the data that I found. 

The general idea of the mapping data "Introduction to Graph Databases" online training course to demonstrate this data modeling process:

RDBMS Tables Represented in a Data Model

An RDBMS JOIN Table Represented in a Data Model

A Hybrid Data Model of an RDBMS and Graph Database

RDBMS JOIN Tables Are Equal to Relationships in a Graph Database

A Graph Database Data Model

When looking at the details of your datasets, answering these two questions is helpful to convert RDBMS tables into a graph data model:

  1. Do the two tables contain columns that share the same values?
    Answering this question is helpful to link two nodes together. The question is easy to answer if you could find the key columns in both tables. But what if the two tables don't share a foreign key? Then it's good to document that columns in the tables represent the same object.
    For example: Provider prescription data doesn't contain FDA drug codes (which can be used as a foreign key). However, both provider prescription data and FDA drug code data contains columns of brand_name and generic_name, thus I'm able to link these two tables together bybrand_name and generic_name.
  2. Which columns (i.e., properties) can be extracted as an independent node?
    Remember that columns are usually treated as properties? But sometimes you might want to create a node for some columns, and doing so can be beneficial when querying your data.
    For example: In the lobbying disclosure data, a lobbying firm can be treated as a property of the Disclosure node, but it can also be extracted as an independent node which has a relationFILED pointing to Disclosure and another relation LOCATED_IN to the State node.
    A good reason for treating the lobbying firm as a node rather than a property is that the lobbying firm contains a lot of information such as: namehouseID, and location, which can be stored as properties for that node. Another reason, perhaps more beneficial, is that it allows you to create a relationship between lobbying firm and state, thus offering the option to query "which lobbying firms and drug manufacturers are located in the same state."

Last, Create the Graph Schema

Although the graph database is schemaless, it's good to have a clear idea of how your data will be stored in a graph before you create the database, and solid schema documentation will be really helpful for later development. Here is the graph I created after studying both the context and the characters of the healthcare dataset: 

Part 1 of Using Neo4j to Graph the Healthcare Industry

I created this graph in only a few minutes using the Arrows tool (created by Alistair Jones).

I hope you found the first week of our Neo4j healthcare road trip exciting! Starting next week, I will show you how to ETL the data into Neo4j. As you have seen, there are two types of data in my project: CSV and XML, I will start to extract XML data by calling APOC which stands for "Awesome Procedures On Cypher" and load the data into Neo4j by using the py2neo package in Python.

I've also linked the project's GitHub repository here where you can find the detailed schema documentation, current code development, and some other fun stuff. 

Now it's easier than ever to get started with MongoDB, the database that allows startups and enterprises alike to rapidly build planet-scale apps. Introducing MongoDB Atlas, the official hosted service for the database on AWS. Try it now! Brought to you in partnership with MongoDB.


Published at DZone with permission of Yaqi Shi, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}