DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
  1. DZone
  2. Data Engineering
  3. Data
  4. JSON to Insights: Analyzing U.S. Healthcare Data

JSON to Insights: Analyzing U.S. Healthcare Data

Let's analyze a public dataset from the U.S. healthcare system to see what we can learn.

Keshav Murthy user avatar by
Keshav Murthy
CORE ·
Dec. 13, 18 · Tutorial
Like (3)
Save
Tweet
Share
7.27K Views

Join the DZone community and get the full member experience.

Join For Free

"Nothing is certain except for death and taxes."

This isn't a dataset made with a bed of roses or manicured green grass. A bit more serious. Let's see if we can quickly learn anything here. The dataset is the following.

"name" : "NCHS - Leading Causes of Death: United States",
"attribution" : "National Center for Health Statistics",

The public dataset is available at https://data.cdc.gov/api/views/bi63-dtpu/rows.json?accessType=DOWNLOAD

Step 1: Download the file into a local file (e.g. health.json). Upload this file to one of the nodes in the Couchbase cluster.

/opt/couchbase/bin/cbimport json -c couchbase://127.0.0.1 -u Administrator -p password -b cause -d file://health.json -g cause:0 -f sample


> CREATE PRIMARY INDEX ON cause;

Step 2: Import the data into a bucket called cause. After you create the bucket, create the primary index. You'll need this for querying.

Step 3. Inspect the structure of the data.

All of the data is provided in a SINGLE JSON document. Because of this, INFER doesn't help. You'll have to inspect and understand the structure manually. This data in typical government dataset with a lot of data in simple arrays with the meaning of each entity given at in the metadata.

Simple array:

<strong>select data from cause ;</strong>

This simply contains an array of data without the schema. For the public datasets, the schema is in the meta field.

Let's transform the structure into simple JSON key-value pairs so we can handle these bit more effectively. You can learn more about how this magic happened in this article.


WITH cs AS (
  SELECT
    meta.`view`.columns [*].fieldName f,
    data
  FROM
    cause
)
SELECT
  o
FROM
  cs UNNEST cs.data AS d1 
LET o = OBJECT p :d1 [ARRAY_POSITION(cs.f, p)] FOR p IN cs.f END;

Image title

Task 1: Find out the cause for most deaths in a state, by year.

The common table expression (CTE) in the WITH clause (csdata) transforms the complex json data into flat JSON. You can do this dynamically or do this once and INSERT back into a bucket, as I've discussed in the article on New York baby names. In this article, I use CTEs.



WITH csdata as (
  WITH cs AS (
    SELECT
      meta.`view`.columns [*].fieldName f,
      data
    FROM
      cause
  )
  SELECT
    o
  FROM
    cs UNNEST cs.data AS d1 LET o = OBJECT p :d1 [ARRAY_POSITION(cs.f, p)] FOR p IN cs.f END
)
SELECT
  c.o.state,
  c.o.year,
  c.o.cause_name,
  COUNT(c.o.cause_name),
  SUM(TONUMBER(c.o.deaths)) totdeaths
FROM
  csdata as c
WHERE
  c.o.state &lt;&gt; "United States"
  and c.o.cause_name &lt;&gt; "All causes"
GROUP BY
  c.o.state,
  c.o.year,
  c.o.cause_name
ORDER BY
  totdeaths DESC,
  c.o.state,
  c.o.year


In this case, all the deaths in California come on top, mainly due to its population.

Task 2: Find out leading causes of death in each state for the year 2016.

Query 2: Use the resultset from the previous query and then use the FIRST_VALUE() window function to determine the top cause. Partitioning by state (in the OVER BY clause) will give you the partitions by state and ORDER BY dx.totdeaths within the OVER BY clause will give you the top cause in every state.


WITH csdata as (
  WITH cs AS (
    SELECT
      meta.`view`.columns [*].fieldName f,
      data
    FROM
      cause
  )
  SELECT
    o
  FROM
    cs UNNEST cs.data AS d1 LET o = OBJECT p :d1 [ARRAY_POSITION(cs.f, p)] FOR p IN cs.f END
),
d2 as(
SELECT
  c.o.state,
  c.o.year,
  c.o.cause_name,
  SUM(TONUMBER(c.o.deaths)) totdeaths
FROM
  csdata as c
WHERE
  c.o.state &lt;&gt; "United States"
  and c.o.cause_name &lt;&gt; "All causes"
  and c.o.year = "2016"
GROUP BY
  c.o.state,
  c.o.year,
  c.o.cause_name),
d3 as (
SELECT dx.state, dx.cause_name, dx.totdeaths,
  FIRST_VALUE(dx.cause_name) OVER(PARTITION BY dx.state ORDER BY dx.totdeaths DESC) topreason,
  FIRST_VALUE(dx.totdeaths) OVER(PARTITION BY dx.state ORDER BY dx.totdeaths DESC) topcount
FROM d2 dx)
SELECT d3
FROM d3
WHERE d3.topcount = d3.totdeaths
order by d3.state

Task 3: Find out how the top reason has changed by the year, from 1999 to 2016, by state.

Query 3: Simply generate the report for all the years (1999-2016) and then determine the top reason and finally get the highest reason by grouping by state, year and getting MAX(topcount) for the topreason cause.

WITH csdata as (
  WITH cs AS (
    SELECT
      meta.`view`.columns [*].fieldName f,
      data
    FROM
      cause
  )
  SELECT
    o
  FROM
    cs UNNEST cs.data AS d1 LET o = OBJECT p :d1 [ARRAY_POSITION(cs.f, p)] FOR p IN cs.f END
),
d2 as(
SELECT
  c.o.state,
  c.o.year,
  c.o.cause_name,
  SUM(TONUMBER(c.o.deaths)) totdeaths
FROM
  csdata as c
WHERE
  c.o.state &lt;&gt; "United States" 
  and c.o.cause_name &lt;&gt; "All causes"
GROUP BY
  c.o.state,
  c.o.year,
  c.o.cause_name),
d3 as (
SELECT dx.state, dx.year,
  FIRST_VALUE(dx.cause_name) OVER(PARTITION BY dx.state, dx.year ORDER BY dx.totdeaths DESC ) topreason,
  FIRST_VALUE(dx.totdeaths) OVER(PARTITION BY dx.state, dx.year ORDER BY dx.totdeaths DESC) topcount
FROM d2 dx)
SELECT d3.state , d3.year , d3.topreason, max(d3.topcount) topcount
FROM d3
GROUP BY d3.state, d3.year, d3.topreason
order by d3.state, d3.year

Here's the partial result.



Visualizing this gives us the following histogram.

Image title

JSON Data (computing) Database

Published at DZone with permission of Keshav Murthy, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Utilize OpenAI API to Extract Information From PDF Files
  • Ultimate Guide to FaceIO
  • PHP vs React
  • AWS Cloud Migration: Best Practices and Pitfalls to Avoid

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: