JSON to Insights: Fast and Easy
Every company, including yours, has stored a lot of data in JSON — the result of surveys, campaigns, and forums.
Join the DZone community and get the full member experience.
Join For Free"The answer my friend is hiding in JSON" — Bob Dylan
There are a lot of JSON datasets and then is JSON datasets. Every company, including yours, has stored a lot of data in JSON — the result of surveys, campaigns, and forums.
There are many ways to get skin the JSON. You can write Python program for every report, visualization you want to do. Or, you can use (SQL for JSON) to generate the right algorithm for you to analyze JSON data. In this article, we show you how to use N1QL to extract insights quickly. We also use two features coming up in the next release: Common Table Expression (CTE) and Window Functions.
Goal: Use public JSON dataset for US Open golf scores to create a simple leaderboard, ranking, etc.
Three Things You'll Do as Part of This:
- Ingest the data into Couchbase easily.
- Start getting the value of this JSON data immediately.
- Shape the JSON to generate useful reports using new features quickly.
Source Data: https://github.com/jackschultz/usopen
Queries in this post are also available at: https://github.com/keshavmr/usopen-golf-queries
Data repo structure: This GitHub repo https://github.com/jackschultz/usopen contains US Open golf -2018 data. For each hole, it as a separate document for each day.
Each document has this structure. This is the document for hole 1 on day 1. The filed Ps has the list of players, each with a unique ID.
Each player's playing statistics is following that, stroke by stroke. The players are matched to scores using the field unique ID for the player.
Start Getting Insights:
Before you start querying, create a primary index on the bucket.
CREATE PRIMARY INDEX ON usopen;
Task 1: Create a report of player scores by round and the final total.
After playing with JSON from bottom-up, we came up with this query. The explanation is after the query.
Tabular Results (In Tabular form, from the Couchbase query workbench)
WITH d AS (
SELECT
pl.hnum AS holedn,
pl.ps.Nat AS country,
(pl.ps.FN || " " || pl.ps.LN) AS name,
pl.ps.ID AS ID,
array_length(hps.Sks) AS score,
hpl.hole AS `hole`,
hpl.day AS `day`
FROM
(
SELECT
meta(usopen).id AS hnum,
ps
FROM
usopen USE keys "holes:1:1" unnest Ps AS ps
) pl
INNER JOIN (
SELECT
TONUMBER(split(meta(usopen).id, ":") [1]) AS `hole`,
TONUMBER(split(meta(usopen).id, ":") [2]) AS `day`,
hps
FROM
usopen unnest Rs AS rs UNNEST rs.Hs AS hs UNNEST hs.HPs AS hps
) hpl ON (pl.ps.ID = hps.ID)
)
SELECT
d.name,
SUM(
CASE WHEN d.day = 1 THEN d.score ELSE 0 END
) R1,
SUM(
CASE WHEN d.day = 2 THEN d.score ELSE 0 END
) R2,
SUM(
CASE WHEN d.day = 3 THEN d.score ELSE 0 END
) R3,
SUM(
CASE WHEN d.day = 4 THEN d.score ELSE 0 END
) R4,
SUM(d.score) T
FROM
d
GROUP BY
d.name
ORDER BY
d.name
Let's look at the query block by block.
Look at the WITH d clause. The statement untangles the JSON from PER-day-PER-hole-shot-by-shot data to simple scalar values.
- Holedn is the document key — hole-day-number
- Country is the player's nationality
- ID is the player's unique ID.
Hole and day are obvious and the score is the player's score for that hole.
In the FROM clause of the SELECT statement, pl is the full list of players taken from the document for the first day, first hole (holes:1:1).
Rs is the players' result, shot by shot, hole by hole. First, we unnest that array couple of times to project details on each hole and score for that hole, determined by array_length(hps.Sks).
Once we have the hole-by-hole score, it's easy to write the final query to aggregate by the player and by day.
**The WITH clause is the common table expression (CTE) feature in the upcoming Mad-Hatter release. The old way to do this in Couchbase 5.5 or below is using the LET clause. Post the question in Couchbase forum if you need help here).
Task 2: Now, create the full leaderboard and add the information. The golfers who got cut won't play third or the fourth round. We use this information to determine the players who got cut.
Query 2. Take the previous query and name it as a common table dx and then add the following expression to determine that cut.
Here's the full query:
WITH dy AS (
SELECT
pl.hnum AS holedn,
pl.ps.Nat AS country,(pl.ps.FN || " " || pl.ps.LN) AS name,
pl.ps.ID AS ID,
array_length(hps.Sks) AS score,
hpl.hole AS `hole`,
hpl.day AS `day`
FROM
(
SELECT
meta(usopen).id AS hnum,
ps
FROM
usopen USE keys "holes:1:1" unnest Ps AS ps
) pl
INNER JOIN (
SELECT
TONUMBER(split(meta(usopen).id, ":") [1]) AS `hole`,
TONUMBER(split(meta(usopen).id, ":") [2]) AS `day`,
hps
FROM
usopen unnest Rs AS rs unnest rs.Hs AS hs unnest hs.HPs AS hps
) hpl ON (pl.ps.ID = hps.ID)
),
dx AS (
SELECT
d.name,
sum(
CASE WHEN d.day = 1 THEN d.score ELSE 0 END
) R1,
sum(
CASE WHEN d.day = 2 THEN d.score ELSE 0 END
) R2,
sum(
CASE WHEN d.day = 3 THEN d.score ELSE 0 END
) R3,
sum(
CASE WHEN d.day = 4 THEN d.score ELSE 0 END
) R4,
sum(d.score) T
FROM
dy AS d
GROUP BY
d.name
ORDER BY
d.name
)
SELECT
d2.name,
d2.R1,
d2.R2,
d2.R3,
d2.R4,
d2.T,(
CASE WHEN (
d2.R1 = 0
OR d2.R2 = 0
OR d2.R3 = 0
OR d2.R4 = 0
) THEN "CUT" ELSE MISSING END
) AS CUT
FROM
dx AS d2
ORDER BY
CUT ASC,
d2.T ASC
Task 3: Determine the winners.
We need to rank the players based on the total score to determine who won the tournament. The rankings are skipped over if there are ties in the scores. Doing this in SQL without window functions is expensive. Here, we write the query using the RANK() window function. Window functions are a feature in N1QL in the upcoming release (Mad-Hatter)
Notice the ranks 4, 8, 9, 10, 11 missing because of the tie scores!
Task 4: Now, let's find out how each player fared after round1, round2, round3 compared to the final round. Using the window functions, it becomes as easy making the marshmallows covered with chocolate disappear.
Query 4: Use the same RANK() function, by ORDER BY the score of each day (day1, day1+day2, day1+day2+day3) instead of just the final score.
WITH dy AS (
SELECT
pl.hnum AS holedn,
pl.ps.Nat AS country,(pl.ps.FN || " " || pl.ps.LN) AS name,
pl.ps.ID AS ID,
array_length(hps.Sks) AS score,
hpl.hole AS `hole`,
hpl.day AS `day`
FROM
(
SELECT
meta(usopen).id AS hnum,
ps
FROM
usopen USE keys "holes:1:1" unnest Ps AS ps
) pl
INNER JOIN (
SELECT
TONUMBER(split(meta(usopen).id, ":") [1]) AS `hole`,
TONUMBER(split(meta(usopen).id, ":") [2]) AS `day`,
hps
FROM
usopen unnest Rs AS rs unnest rs.Hs AS hs unnest hs.HPs AS hps
) hpl ON (pl.ps.ID = hps.ID)
),
dx AS (
SELECT
d.name,
sum(
CASE WHEN d.day = 1 THEN d.score ELSE 0 END
) R1,
sum(
CASE WHEN d.day = 2 THEN d.score ELSE 0 END
) R2,
sum(
CASE WHEN d.day = 3 THEN d.score ELSE 0 END
) R3,
sum(
CASE WHEN d.day = 4 THEN d.score ELSE 0 END
) R4,
sum(d.score) T
FROM
dy AS d
GROUP BY
d.name
ORDER BY
d.name
)
SELECT
d2.name,
d2.R1,
d2.R2,
d2.R3,
d2.R4,
d2.T,
DENSE_RANK() OVER (
ORDER BY
d2.T + CUT
) AS rankMoney,
RANK() OVER (
ORDER BY
d2.T + CUT
) AS rankFinal,
RANK() OVER (
ORDER BY
d2.R1
) AS round1rank,
RANK() OVER (
ORDER BY
d2.R1 + d2.R2
) AS round2rank,
RANK() OVER (
ORDER BY
d2.R1 + d2.R2 + d2.R3 + CUT
) AS round3rank
FROM
dx AS d2 LET CUT = (
CASE WHEN (
d2.R1 = 0
OR d2.R2 = 0
OR d2.R3 = 0
OR d2.R4 = 0
) THEN 1000 ELSE 0 END
)
ORDER BY
rankFinal,
round1rank,
round2rank,
round3rank
Now you can see how the players moved up or down each day.
Task 5: Create the full scorecard for the leader using the basic shot-by-shot statistics.
Query 5: Brooks Koepka is the final winner of the US open. Let's get his scores, hole by hole and get the cumulative scores for him by round. Notice how the simple SUM() and the COUNT() aggregate works as a window function with the OVER() clause.
This first partitions the score by day and then by hole - specified by PARTITION BY clause, in the order of the holes, 1-18. The SUM then adds up the scores so far.
This SUM() function simply adds up the score from day 1, hole 1 to day 4, hole 18 - this is specified by the ORDER BY d3.day, d3.hole within the OVER() clause. The field ToTScore shows the total shorts for the tournament by Koepka at each hole.
WITH dy AS (
SELECT
pl.hnum AS holedn,
pl.ps.Nat AS country,(pl.ps.FN || " " || pl.ps.LN) AS name,
pl.ps.ID AS ID,
array_length(hps.Sks) AS score,
hpl.hole AS `hole`,
hpl.day AS `day`,
hpl.Par AS Par
FROM
(
SELECT
meta(usopen).id AS hnum,
ps
FROM
usopen USE keys "holes:1:1" unnest Ps AS ps
WHERE
ps.LN = "Koepka"
) pl
INNER JOIN (
SELECT
TONUMBER(split(meta(usopen).id, ":") [1]) AS `hole`,
TONUMBER(split(meta(usopen).id, ":") [2]) AS `day`,
hs.Par,
hps
FROM
usopen unnest Rs AS rs unnest rs.Hs AS hs unnest hs.HPs AS hps
) hpl ON (pl.ps.ID = hps.ID)
),
dx AS (
SELECT
d.name,
d.day,
d.score,
d.hole,
d.Par
FROM
dy AS d
ORDER BY
d.name
),
dz AS (
SELECT
d2.day,
d2.hole,
d2.score,
SUM(d2.score) OVER (
PARTITION BY d2.day
ORDER BY
d2.hole
) hst,
d2.Par,
SUM(d2.Par) OVER (
PARTITION BY d2.day
ORDER BY
d2.hole
) hpr
FROM
dx AS d2 LET CUT = (
CASE WHEN (
d2.R1 = 0
OR d2.R2 = 0
OR d2.R3 = 0
OR d2.R4 = 0
) THEN 1000 ELSE 0 END
)
ORDER BY
d2.day,
d2.hole
)
SELECT
d3.Par,
d3.day,
d3.hole,
d3.hst,
d3.score,(d3.hst - d3.hpr) ToPar,
sum(d3.score) OVER (
ORDER BY
d3.day,
d3.hole
) ToTScore,
count(1) OVER (
ORDER BY
d3.day,
d3.hole
) HoleNum
FROM
dz AS d3
Published at DZone with permission of Sitaram Vemulapalli. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments