Nothing excites me more than going on an adventurous road trip to explore the unknown. What's even better is this adventure doesn't involve getting sweaty and exhausted to drive long hours in the middle of nowhere.
Yes, I'm talking about a digital road trip. This summer, I will rock the world of healthcare using Neo4j to see how each group of stakeholders connects to each other. You are welcome to join me if you are wondering what interesting discoveries I will find and curious how to use a graph database to uncover your own adventurous world. Now, let's go!
No one can deny that among all the major industries in the US, healthcare is one of the most complicated. It involves a wide range of stakeholders, from providers to drug manufacturers to legislators as well as patients. Coming from a medical background, I am extremely interested in discovering how each party plays their roles and influence the whole industry.
My journey of exploring the healthcare industry started with a whiteboard. One a sunny, laidback afternoon, my friends and I were discussing who plays a role in the healthcare industry, and we started to draw each party with their relationships on a whiteboard. The graph started like this:
After 10 minutes of discussion, the graph grew into something look like this:
Now, this looks amazing! The left part of the graph demonstrates the procedures of a clinical encounter between a patient and a provider, and the right part of the graph represents the relationships among different stakeholders.
My initial curiosity lies in the right side of the graph, and I am pretty sure there are a lot of public datasets that I can use. By creating a graph database, I am hoping to answer interesting questions such as:
- Tracking prescription drug abuse history by states, hospitals, and providers.
- Tracking Medicare Part D costs by drugs, providers, and states.
- Looking for lobbying history, specifically for individual drug manufacturers and lobbyist information.
The next question is, "How am I going to model such a highly interconnected system to answer these questions in a relational database?" While I'm sure I could do it and later query the database by joining all the tables together, I would have to write a 1-2 page long query (!) for each piece of information that I'm curious about.
Well, that doesn't sound fun, and I don't want to spend my summer in writing long and complicated SQL queries. But wait, I could totally represent the system in a graph database just like how I drew it all out on a whiteboard!
I spent the next few days teaching myself how to do data modeling for a graph database using this awesome online tutorial. Check it out if you are interested, but I am going to show you what I did to model the data in a graph database.
First and Foremost, Draw Out the Domain
Searching for the right data is like looking for a piece of a leaf in the wild jungle. By first drawing out the general graph of the system that I am trying to model, it gives me the idea of which types of trees I need to look for, so to speak.
In my project, I quickly identified data related to drugs, drug manufacturers, providers, and lobbyists that are essential for me to construct the system.
Second, Document What Is Available
Knowing the targets, I quickly started my journey of data collection. There is no magic about how to quickly locate the targets other than using Google.
After a few days of data collection, along with further studying about each stakeholder of the healthcare industry (Wikipedia always answers my questions), I found a few major sites where I can download the data that I need:
|Provider Enumeration System||CSV||http://download.cms.gov/nppes/NPI_Files.html|
|FDA Drug Codes||CSV||http://www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm|
|Congress, Legislators & Bills||Other||https://www.govtrack.us/developers|
Modeling data in a graph database requires a thorough understanding of both the context as well as the character of the data. Having the data sources helps me to better define the domain of my model:
- A healthcare provider prescribes drugs.
- A drug manufacturer produces drugs.
- The drug manufacturer hires a lobbying firm, which files a disclosure.
- The lobbying disclosure relates to one or several issues.
- The lobbying firm makes contributions to a Congress representative's campaign.
Third, Answer Two Questions to Refine Your Model
Having a definition of the domain is still not sufficient to identify the nodes and relationships in a graph model. It's worth taking a moment to look in detail of the data that I found.
The general idea of the mapping data "Introduction to Graph Databases" online training course to demonstrate this data modeling process:
When looking at the details of your datasets, answering these two questions is helpful to convert RDBMS tables into a graph data model:
- Do the two tables contain columns that share the same values?
Answering this question is helpful to link two nodes together. The question is easy to answer if you could find the key columns in both tables. But what if the two tables don't share a foreign key? Then it's good to document that columns in the tables represent the same object.
For example: Provider prescription data doesn't contain FDA drug codes (which can be used as a foreign key). However, both provider prescription data and FDA drug code data contains columns of
generic_name, thus I'm able to link these two tables together by
- Which columns (i.e., properties) can be extracted as an independent node?
Remember that columns are usually treated as properties? But sometimes you might want to create a node for some columns, and doing so can be beneficial when querying your data.
For example: In the lobbying disclosure data, a lobbying firm can be treated as a property of the
Disclosurenode, but it can also be extracted as an independent node which has a relation
Disclosureand another relation
A good reason for treating the lobbying firm as a node rather than a property is that the lobbying firm contains a lot of information such as:
location, which can be stored as properties for that node. Another reason, perhaps more beneficial, is that it allows you to create a relationship between lobbying firm and state, thus offering the option to query "which lobbying firms and drug manufacturers are located in the same state."
Last, Create the Graph Schema
Although the graph database is schemaless, it's good to have a clear idea of how your data will be stored in a graph before you create the database, and solid schema documentation will be really helpful for later development. Here is the graph I created after studying both the context and the characters of the healthcare dataset:
I hope you found the first week of our Neo4j healthcare road trip exciting! Starting next week, I will show you how to ETL the data into Neo4j. As you have seen, there are two types of data in my project: CSV and XML, I will start to extract XML data by calling APOC which stands for "Awesome Procedures On Cypher" and load the data into Neo4j by using the py2neo package in Python.
I've also linked the project's GitHub repository here where you can find the detailed schema documentation, current code development, and some other fun stuff.