DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
  1. DZone
  2. Data Engineering
  3. Data
  4. JSON to Insights: Fast and Easy

JSON to Insights: Fast and Easy

Every company, including yours, has stored a lot of data in JSON — the result of surveys, campaigns, and forums.

Sitaram Vemulapalli user avatar by
Sitaram Vemulapalli
·
Keshav Murthy user avatar by
Keshav Murthy
CORE ·
Dec. 07, 18 · Tutorial
Like (7)
Save
Tweet
Share
7.47K Views

Join the DZone community and get the full member experience.

Join For Free
"The answer my friend is hiding in JSON" — Bob Dylan

There are a lot of JSON datasets and then is JSON datasets. Every company, including yours, has stored a lot of data in JSON — the result of surveys, campaigns, and forums.

There are many ways to get skin the JSON. You can write Python program for every report, visualization you want to do. Or, you can use (SQL for JSON) to generate the right algorithm for you to analyze JSON data. In this article, we show you how to use N1QL to extract insights quickly. We also use two features coming up in the next release: Common Table Expression (CTE) and Window Functions.

Goal: Use public JSON dataset for US Open golf scores to create a simple leaderboard, ranking, etc.

Three Things You'll Do as Part of This:

  1. Ingest the data into Couchbase easily.
  2. Start getting the value of this JSON data immediately.
  3. Shape the JSON to generate useful reports using new features quickly.

Source Data: https://github.com/jackschultz/usopen

Queries in this post are also available at: https://github.com/keshavmr/usopen-golf-queries

Data repo structure: This GitHub repo https://github.com/jackschultz/usopen contains US Open golf -2018 data. For each hole, it as a separate document for each day.

Each document has this structure. This is the document for hole 1 on day 1. The filed Ps has the list of players, each with a unique ID.

Each player's playing statistics is following that, stroke by stroke. The players are matched to scores using the field unique ID for the player.

Start Getting Insights:

Before you start querying, create a primary index on the bucket.

CREATE PRIMARY INDEX ON usopen;

Task 1: Create a report of player scores by round and the final total.

After playing with JSON from bottom-up, we came up with this query. The explanation is after the query.

Tabular Results (In Tabular form, from the Couchbase query workbench)

WITH d AS (
  SELECT
    pl.hnum AS holedn,
    pl.ps.Nat AS country,
    (pl.ps.FN || " " || pl.ps.LN) AS name,
    pl.ps.ID AS ID,
    array_length(hps.Sks) AS score,
    hpl.hole AS `hole`,
    hpl.day AS `day`
  FROM
    (
      SELECT
        meta(usopen).id AS hnum,
        ps
      FROM
        usopen USE keys "holes:1:1" unnest Ps AS ps
    ) pl
    INNER JOIN (
      SELECT
        TONUMBER(split(meta(usopen).id, ":") [1]) AS `hole`,
        TONUMBER(split(meta(usopen).id, ":") [2]) AS `day`,
        hps
      FROM
        usopen unnest Rs AS rs UNNEST rs.Hs AS hs UNNEST hs.HPs AS hps
    ) hpl ON (pl.ps.ID = hps.ID)
)
SELECT
  d.name,
  SUM(
    CASE WHEN d.day = 1 THEN d.score ELSE 0 END
  ) R1,
  SUM(
    CASE WHEN d.day = 2 THEN d.score ELSE 0 END
  ) R2,
  SUM(
    CASE WHEN d.day = 3 THEN d.score ELSE 0 END
  ) R3,
  SUM(
    CASE WHEN d.day = 4 THEN d.score ELSE 0 END
  ) R4,
  SUM(d.score) T
FROM
  d
GROUP BY
  d.name
ORDER BY
  d.name


Let's look at the query block by block.

Look at the WITH d clause. The statement untangles the JSON from PER-day-PER-hole-shot-by-shot data to simple scalar values.

  • Holedn is the document key — hole-day-number
  • Country is the player's nationality
  • ID is the player's unique ID.

Hole and day are obvious and the score is the player's score for that hole.

In the FROM clause of the SELECT statement, pl is the full list of players taken from the document for the first day, first hole (holes:1:1).

Rs is the players' result, shot by shot, hole by hole. First, we unnest that array couple of times to project details on each hole and score for that hole, determined by array_length(hps.Sks).

Once we have the hole-by-hole score, it's easy to write the final query to aggregate by the player and by day.

**The WITH clause is the common table expression (CTE) feature in the upcoming Mad-Hatter release. The old way to do this in Couchbase 5.5 or below is using the LET clause. Post the question in Couchbase forum if you need help here).

Task 2: Now, create the full leaderboard and add the information. The golfers who got cut won't play third or the fourth round. We use this information to determine the players who got cut.

Query 2. Take the previous query and name it as a common table dx and then add the following expression to determine that cut.

Here's the full query:

WITH dy AS (
  SELECT
    pl.hnum AS holedn,
    pl.ps.Nat AS country,(pl.ps.FN || " " || pl.ps.LN) AS name,
    pl.ps.ID AS ID,
    array_length(hps.Sks) AS score,
    hpl.hole AS `hole`,
    hpl.day AS `day`
  FROM
    (
      SELECT
        meta(usopen).id AS hnum,
        ps
      FROM
        usopen USE keys "holes:1:1" unnest Ps AS ps
    ) pl
    INNER JOIN (
      SELECT
        TONUMBER(split(meta(usopen).id, ":") [1]) AS `hole`,
        TONUMBER(split(meta(usopen).id, ":") [2]) AS `day`,
        hps
      FROM
        usopen unnest Rs AS rs unnest rs.Hs AS hs unnest hs.HPs AS hps
    ) hpl ON (pl.ps.ID = hps.ID)
),
dx AS (
  SELECT
    d.name,
    sum(
      CASE WHEN d.day = 1 THEN d.score ELSE 0 END
    ) R1,
    sum(
      CASE WHEN d.day = 2 THEN d.score ELSE 0 END
    ) R2,
    sum(
      CASE WHEN d.day = 3 THEN d.score ELSE 0 END
    ) R3,
    sum(
      CASE WHEN d.day = 4 THEN d.score ELSE 0 END
    ) R4,
    sum(d.score) T
  FROM
    dy AS d
  GROUP BY
    d.name
  ORDER BY
    d.name
)
SELECT
  d2.name,
  d2.R1,
  d2.R2,
  d2.R3,
  d2.R4,
  d2.T,(
    CASE WHEN (
      d2.R1 = 0
      OR d2.R2 = 0
      OR d2.R3 = 0
      OR d2.R4 = 0
    ) THEN "CUT" ELSE MISSING END
  ) AS CUT
FROM
  dx AS d2
ORDER BY
  CUT ASC,
  d2.T ASC

Task 3: Determine the winners.

We need to rank the players based on the total score to determine who won the tournament. The rankings are skipped over if there are ties in the scores. Doing this in SQL without window functions is expensive. Here, we write the query using the RANK() window function. Window functions are a feature in N1QL in the upcoming release (Mad-Hatter)

Notice the ranks 4, 8, 9, 10, 11 missing because of the tie scores!

Task 4: Now, let's find out how each player fared after round1, round2, round3 compared to the final round. Using the window functions, it becomes as easy making the marshmallows covered with chocolate disappear.

Query 4: Use the same RANK() function, by ORDER BY the score of each day (day1, day1+day2, day1+day2+day3) instead of just the final score.

WITH dy AS (
  SELECT
    pl.hnum AS holedn,
    pl.ps.Nat AS country,(pl.ps.FN || " " || pl.ps.LN) AS name,
    pl.ps.ID AS ID,
    array_length(hps.Sks) AS score,
    hpl.hole AS `hole`,
    hpl.day AS `day`
  FROM
    (
      SELECT
        meta(usopen).id AS hnum,
        ps
      FROM
        usopen USE keys "holes:1:1" unnest Ps AS ps
    ) pl
    INNER JOIN (
      SELECT
        TONUMBER(split(meta(usopen).id, ":") [1]) AS `hole`,
        TONUMBER(split(meta(usopen).id, ":") [2]) AS `day`,
        hps
      FROM
        usopen unnest Rs AS rs unnest rs.Hs AS hs unnest hs.HPs AS hps
    ) hpl ON (pl.ps.ID = hps.ID)
),
dx AS (
  SELECT
    d.name,
    sum(
      CASE WHEN d.day = 1 THEN d.score ELSE 0 END
    ) R1,
    sum(
      CASE WHEN d.day = 2 THEN d.score ELSE 0 END
    ) R2,
    sum(
      CASE WHEN d.day = 3 THEN d.score ELSE 0 END
    ) R3,
    sum(
      CASE WHEN d.day = 4 THEN d.score ELSE 0 END
    ) R4,
    sum(d.score) T
  FROM
    dy AS d
  GROUP BY
    d.name
  ORDER BY
    d.name
)
SELECT
  d2.name,
  d2.R1,
  d2.R2,
  d2.R3,
  d2.R4,
  d2.T,
  DENSE_RANK() OVER (
    ORDER BY
      d2.T + CUT
  ) AS rankMoney,
  RANK() OVER (
    ORDER BY
      d2.T + CUT
  ) AS rankFinal,
  RANK() OVER (
    ORDER BY
      d2.R1
  ) AS round1rank,
  RANK() OVER (
    ORDER BY
      d2.R1 + d2.R2
  ) AS round2rank,
  RANK() OVER (
    ORDER BY
      d2.R1 + d2.R2 + d2.R3 + CUT
  ) AS round3rank
FROM
  dx AS d2 LET CUT = (
    CASE WHEN (
      d2.R1 = 0
      OR d2.R2 = 0
      OR d2.R3 = 0
      OR d2.R4 = 0
    ) THEN 1000 ELSE 0 END
  )
ORDER BY
  rankFinal,
  round1rank,
  round2rank,
  round3rank


Now you can see how the players moved up or down each day.

Task 5: Create the full scorecard for the leader using the basic shot-by-shot statistics.

Query 5: Brooks Koepka is the final winner of the US open. Let's get his scores, hole by hole and get the cumulative scores for him by round. Notice how the simple SUM() and the COUNT() aggregate works as a window function with the OVER() clause.

This first partitions the score by day and then by hole - specified by PARTITION BY clause, in the order of the holes, 1-18. The SUM then adds up the scores so far.

This SUM() function simply adds up the score from day 1, hole 1 to day 4, hole 18 - this is specified by the ORDER BY d3.day, d3.hole within the OVER() clause. The field ToTScore shows the total shorts for the tournament by Koepka at each hole.

WITH dy AS (
  SELECT
    pl.hnum AS holedn,
    pl.ps.Nat AS country,(pl.ps.FN || " " || pl.ps.LN) AS name,
    pl.ps.ID AS ID,
    array_length(hps.Sks) AS score,
    hpl.hole AS `hole`,
    hpl.day AS `day`,
    hpl.Par AS Par
  FROM
    (
      SELECT
        meta(usopen).id AS hnum,
        ps
      FROM
        usopen USE keys "holes:1:1" unnest Ps AS ps
      WHERE
        ps.LN = "Koepka"
    ) pl
    INNER JOIN (
      SELECT
        TONUMBER(split(meta(usopen).id, ":") [1]) AS `hole`,
        TONUMBER(split(meta(usopen).id, ":") [2]) AS `day`,
        hs.Par,
        hps
      FROM
        usopen unnest Rs AS rs unnest rs.Hs AS hs unnest hs.HPs AS hps
    ) hpl ON (pl.ps.ID = hps.ID)
),
dx AS (
  SELECT
    d.name,
    d.day,
    d.score,
    d.hole,
    d.Par
  FROM
    dy AS d
  ORDER BY
    d.name
),
dz AS (
  SELECT
    d2.day,
    d2.hole,
    d2.score,
    SUM(d2.score) OVER (
      PARTITION BY d2.day
      ORDER BY
        d2.hole
    ) hst,
    d2.Par,
    SUM(d2.Par) OVER (
      PARTITION BY d2.day
      ORDER BY
        d2.hole
    ) hpr
  FROM
    dx AS d2 LET CUT = (
      CASE WHEN (
        d2.R1 = 0
        OR d2.R2 = 0
        OR d2.R3 = 0
        OR d2.R4 = 0
      ) THEN 1000 ELSE 0 END
    )
  ORDER BY
    d2.day,
    d2.hole
)
SELECT
  d3.Par,
  d3.day,
  d3.hole,
  d3.hst,
  d3.score,(d3.hst - d3.hpr) ToPar,
  sum(d3.score) OVER (
    ORDER BY
      d3.day,
      d3.hole
  ) ToTScore,
  count(1) OVER (
    ORDER BY
      d3.day,
      d3.hole
  ) HoleNum
FROM
  dz AS d3

JSON Database Insight (email client) Data (computing)

Published at DZone with permission of Sitaram Vemulapalli. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • The Real Democratization of AI, and Why It Has to Be Closely Monitored
  • Kotlin Is More Fun Than Java And This Is a Big Deal
  • Implementing Infinite Scroll in jOOQ
  • DevSecOps Benefits and Challenges

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: