What's in a New York Name? Unlock data.gov Using N1QL
The techniques shown here will help you analyze JSON datasets in data.gov. Keshav Murthy uses the example of the most popular baby names in NYC.
Data.gov, started in 2009, has about 189,000 datasets. Data is published in XML, CSV, JSON, HTML, PDF, and other formats. Data.gov aims to improve public access to high-value, machine-readable datasets generated by the Executive Branch of the Federal Government. Much of this data comes from the Socrata database, and Socrata APIs are provided to retrieve just the subset of the data that you need. Data is valuable, but insights are more valuable. Instead of working with a trickle of data, let’s load all of it and analyze it.
We start this series using a dataset on a simple and seemingly inconsequential decision parents make: baby names. Obviously, parents take this decision quite seriously. If New York is the melting pot, let’s see what baby names those parents choose.
The techniques shown here will help you analyze not just this dataset, but all of the JSON datasets in data.gov. Let’s try to understand all of Obama’s open data secrets!
The Most Popular Baby Names by Sex and Mother's Ethnic Group, New York City dataset has collected baby names in New York City from 2011 to 2014. You can see the raw data at this link.
The data is in a single JSON document with two fields: meta and data. The meta field describes the structure of the document and of the data (i.e., the metadata), while the data field contains the data itself. This structure comes from Socrata.
Here’s the structure:
{
"meta" : {
"view" : {
"id" : "25th-nujf",
"name" : "Most Popular Baby Names by Sex and Mother's Ethnic Group, New York City",
"attribution" : "Department of Health and Mental Hygiene (DOHMH)",
"averageRating" : 0,
"category" : "Health",
"createdAt" : 1382724894,
"description" : "The most popular baby names by sex and mother's ethnicity in New York City.",
"displayType" : "table",
…
"columns" : [ {
"id" : -1,
"name" : "sid",
"dataTypeName" : "meta_data",
"fieldName" : ":sid",
"position" : 0,
"renderTypeName" : "meta_data",
"format" : { }
}, {
"id" : -1,
"name" : "id",
"dataTypeName" : "meta_data",
"fieldName" : ":id",
"position" : 0,
"renderTypeName" : "meta_data",
"format" : { }
}
...
]
"data" : [ [ 1, "EB6FAA1B-EE35-4D55-B07B-8E663565CCDF", 1, 1386853125, "399231", 1386853125, "399231", "{\n}", "2011", "FEMALE", "HISPANIC", "GERALDINE", "13", "75" ]
, [ 2, "2DBBA431-D26F-40A1-9375-AF7C16FF2987", 2, 1386853125, "399231", 1386853125, "399231", "{\n}", "2011", "FEMALE", "HISPANIC", "GIA", "21", "67" ]
, [ 3, "54318692-0577-4B21-80C8-9CAEFCEDA8BA", 3, 1386853125, "399231", 1386853125, "399231", "{\n}", "2011", "FEMALE", "HISPANIC", "GIANNA", "49", "42" ]
...
]
}
The rest of the article is divided into two sections:
Transforming the data.gov JSON dataset to be easily accessible.
Querying the New York names dataset with N1QL.
1. Transforming data.gov JSON
While the data is JSON, the whole dataset is stored as a single document. Let’s see how fast we can get insights.
We use Couchbase, a distributed NoSQL database, and its built-in N1QL query language, designed to be SQL for JSON. You need to download and install Couchbase 4.5 to follow along with the examples.
Step 1
Create two buckets, datagov and nynames; just 100 megabytes each is sufficient for this dataset. Then create a primary index on both.
CREATE PRIMARY INDEX ON datagov;
CREATE PRIMARY INDEX ON nynames;
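If you prefer the command line to the web console, the two buckets can be created with couchbase-cli before running the CREATE PRIMARY INDEX statements above. This is a sketch; the host and administrator credentials are assumptions to adjust for your installation.
$ couchbase-cli bucket-create -c localhost:8091 -u Administrator -p password \
    --bucket datagov --bucket-type couchbase --bucket-ramsize 100
$ couchbase-cli bucket-create -c localhost:8091 -u Administrator -p password \
    --bucket nynames --bucket-type couchbase --bucket-ramsize 100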
Step 2
Download the JSON document from this link.
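If you want to script the download, the Socrata export endpoint for the dataset id shown in the metadata above (25th-nujf) should look something like the following; the exact URL is an assumption based on the usual rows.json pattern, so verify it against the dataset page:
$ curl -o rows.json "https://data.cityofnewyork.us/api/views/25th-nujf/rows.json"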
Step 3
Create a file (insert.sql) containing the INSERT statement and append the document. You need to surround the source data with the INSERT wrapper shown below.
INSERT INTO datagov VALUES("nyname01", {
"meta" : {
"view" : {
"id" : "25th-nujf",
"name" : "Most Popular Baby Names by Sex and Mother's Ethnic Group, New York City",
"attribution" : "Department of Health and Mental Hygiene (DOHMH)",
"averageRating" : 0,
"category" : "Health",
"createdAt" : 1382724894,
"description" : "The most popular baby names by sex and mother's ethnicity in New York City.",
"displayType" : "table",
"downloadCount" : 1782,
"indexUpdatedAt" : 1465427458,
"newBackend" : false,
"numberOfComments" : 0,
"oid" : 6785037,
"publicationAppendEnabled"
….
});
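One way to build insert.sql without editing the large downloaded file by hand is to wrap it with a couple of shell commands. A sketch, assuming the download was saved as rows.json:
$ echo 'INSERT INTO datagov VALUES("nyname01", ' > insert.sql
$ cat rows.json >> insert.sql
$ echo ');' >> insert.sql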
Step 4
Simply execute the statement via cbq.
$ cbq < insert.sql
Step 5
Now you have the whole dataset as a single, well-formed JSON document in a Couchbase bucket. Because the data itself is stored as arrays of arrays of scalar values, querying it directly can be quite complex.
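As a quick sanity check, the following should return the key nyname01 that we chose in the INSERT:
SELECT META(d).id FROM datagov d;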
Step 6
Let’s simplify this by transforming the data into flattened JSON.
First, let’s list the full set of data fields.
SELECT meta.`view`.columns[*].fieldName
FROM datagov;
[
{
"fieldName": [
":sid",
":id",
":position",
":created_at",
":created_meta",
":updated_at",
":updated_meta",
":meta",
"brth_yr",
"gndr",
"ethcty",
"nm",
"cnt",
"rnk"
]
}
]
To transform the data, you can use the query below.
INSERT INTO nynames (KEY UUID(), VALUE kname)
SELECT {":sid":d[0],
":id":d[1],
":position":d[2],
":created_at":d[3],
":created_meta":d[4],
":updated_at":d[5],
":updated_meta":d[6],
":meta":d[7],"brth_yr":d[8],
"brth_yr":d[9],
"ethcty":d[10],
"nm":d[11],
"cnt":d[12],
"rnk":d[13]} kname
FROM (SELECT d FROM datagov UNNEST data d) as u1;
{
"results": [],
"metrics": {
"elapsedTime": "3.578776104s",
"executionTime": "3.578736949s",
"resultCount": 0,
"resultSize": 0,
"mutationCount": 13962
}
}
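The mutationCount of 13,962 above is the number of flattened documents created. A quick check that they all landed in nynames (this should return the same number):
SELECT COUNT(*) AS docs FROM nynames;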
Better method: I’ve saved the best for last. The query above works fine. However, it’s the old way: it requires you to hand-code the field names. There has to be a better, generalized way, and with N1QL, there is.
INSERT INTO nynames
(
KEY UUID(),
value o
)
SELECT o
FROM (
SELECT meta.`view`.columns[*].fieldName f,
data
FROM datagov) d
UNNEST data d1
LET o = OBJECT p:d1[ARRAY_POSITION(d.f, p)] FOR p IN d.f END ;
{
"results": [],
"metrics": {
"elapsedTime": "3.765045363s",
"executionTime": "3.765005394s",
"resultCount": 0,
"resultSize": 0,
"mutationCount": 13962
}
}
Let’s first understand this query before we can appreciate it.
The INSERT INTO clause generates a UUID() key for each document o projected by the SELECT and inserts it into the bucket nynames.
In the SELECT, a subquery projects the column list we saw before along with the entire dataset. The main SELECT then unnests the data to flatten it. Now you have three objects to handle:
d.f: List of field names.
d.data: Full data of baby names, arrays of arrays.
d1: An array for each baby name and its statistics.
Let’s look at the expression:
LET o = OBJECT p:d1[ARRAY_POSITION(d.f, p)] FOR p IN d.f END ;
This creates a JSON document of name:value pairs: the variable p iterates through the field names in d.f, and the expression d1[ARRAY_POSITION(d.f, p)] picks the corresponding value from the flattened data array d1.
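To see the OBJECT construct on its own, here is a minimal standalone illustration that pairs the field names nm, cnt, and rnk with the values from the first data row shown earlier:
SELECT OBJECT p : ["GERALDINE", "13", "75"][ARRAY_POSITION(["nm", "cnt", "rnk"], p)]
       FOR p IN ["nm", "cnt", "rnk"] END AS example;
This returns {"example": {"nm": "GERALDINE", "cnt": "13", "rnk": "75"}}, which is exactly the name:value pairing the transformation query builds for every row.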
Many thanks to Sitaram Vemulapalli for writing this beautiful query.
2. Querying the New York Names Dataset
As you can see, the original JSON structure isn’t conducive to reporting or query processing, so we flattened it. N1QL makes it easy to transform the arrays of scalars into easily manipulated JSON documents; see Section 1 above for how to transform the data into flattened JSON like the documents below.
Once you do the transformation, you have all the name data in a bucket called nynames. Now, let’s ask some common questions and see what we understand!
Q1: Let’s get some sample data. Since we used UUID() for the primary keys and this query scans the primary index, the documents you get back are effectively random.
SELECT *
FROM nynames
LIMIT 2;
[
{
"nynames": {
":created_at": 1465395604,
":created_meta": "399231",
":id": "D3F1A794-A634-4DA4-AA66-96987BB21E92",
":meta": null,
":position": 13240,
":sid": 13240,
":updated_at": 1465395604,
":updated_meta": "399231",
"brth_yr": "2014",
"cnt": "27",
"ethcty": "BLACK NON HISPANIC",
"gndr": "MALE",
"nm": "Joseph",
"rnk": "45"
}
},
{
"nynames": {
":created_at": 1386853125,
":created_meta": "399231",
":id": "9AC42CB0-2A84-4497-A907-DF7BF645C953",
":meta": "{\n}",
":position": 374,
":sid": 374,
":updated_at": 1386853125,
":updated_meta": "399231",
"brth_yr": "2011",
"cnt": "22",
"ethcty": "WHITE NON HISPANIC",
"gndr": "FEMALE",
"nm": "JENNA",
"rnk": "69"
}
}
]
Q2. What’s the total count of names and babies in the New York dataset?
This data covers 2,811 distinct names. Let’s see which name is the least popular and which is the most popular, at least in this dataset.
SELECT COUNT(DISTINCT nm ) as namecount,
SUM(TONUMBER(cnt)) as babycount
FROM nynames;
[
{
"babycount": 482131,
"namecount": 2811
}
]
SELECT nm, SUM(TONUMBER(cnt)) as nc
FROM nynames
GROUP BY nm
ORDER BY SUM(TONUMBER(cnt)) ASC
LIMIT 1;
[
{
"nc": 10,
"nm": "Kameron"
}
]
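To get the most popular name instead, just flip the sort order; as the top-10 list below confirms, that’s JAYDEN:
SELECT nm, SUM(TONUMBER(cnt)) as nc
FROM nynames
GROUP BY nm
ORDER BY SUM(TONUMBER(cnt)) DESC
LIMIT 1;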
Q3. Let’s find the top 10 most popular names in New York.
SELECT nm as name, SUM(TONUMBER(cnt)) as namecount
FROM nynames
GROUP BY nm
ORDER BY SUM(TONUMBER(cnt)) DESC
LIMIT 10;
[
{
"name": "JAYDEN",
"namecount": 4040
},
{
"name": "JACOB",
"namecount": 3437
},
{
"name": "ETHAN",
"namecount": 3403
},
{
"name": "DANIEL",
"namecount": 3009
},
{
"name": "MATTHEW",
"namecount": 2984
},
{
"name": "MICHAEL",
"namecount": 2972
},
{
"name": "ISABELLA",
"namecount": 2929
},
{
"name": "SOPHIA",
"namecount": 2809
},
{
"name": "DAVID",
"namecount": 2764
},
{
"name": "AIDEN",
"namecount": 2698
}
]
So, boys’ names repeat much more often than girls’ names!
Q4. What’s the total count of babies by gender?
SELECT gndr as gender, SUM(TONUMBER(cnt)) as babycount
FROM nynames
GROUP BY gndr;
[
{
"babycount": 272048,
"gender": "MALE"
},
{
"babycount": 210083,
"gender": "FEMALE"
}
]
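If you want the split as percentages rather than raw counts, you can divide by the overall total of 482,131 from Q2. A minimal sketch:
SELECT gndr AS gender,
       ROUND(100 * SUM(TONUMBER(cnt)) / 482131, 1) AS pct
FROM nynames
GROUP BY gndr;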
Q5. What are the top three ethnicity and birth-year groups in terms of number of children?
SELECT ethcty AS Ethnicity,
brth_yr AS BirthYear,
SUM(tonumber(cnt)) AS NameCount
FROM nynames
GROUP BY ethcty, brth_yr
ORDER BY SUM(TONUMBER(cnt)) DESC
LIMIT 3;
[
{
"BirthYear": "2011",
"Ethnicity": "WHITE NON HISPANIC",
"NameCount": 99512
},
{
"BirthYear": "2011",
"Ethnicity": "HISPANIC",
"NameCount": 97768
},
{
"BirthYear": "2011",
"Ethnicity": "BLACK NON HISPANIC",
"NameCount": 42164
}
]
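Since all three of the top groups above come from 2011, it can be more useful to compare ethnic groups within a single year. A sketch that restricts the same query to 2014:
SELECT ethcty AS Ethnicity,
       SUM(TONUMBER(cnt)) AS NameCount
FROM nynames
WHERE brth_yr = "2014"
GROUP BY ethcty
ORDER BY SUM(TONUMBER(cnt)) DESC
LIMIT 3;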
Q6. Now, let’s see the most popular starting letter.
SELECT SUBSTR(LOWER(`nm`), 0, 1) alphabet,
SUM(TONUMBER(cnt)) ncount
FROM nynames
GROUP BY substr(lower(`nm`), 0, 1)
ORDER BY sum(tonumber(cnt)) DESC
LIMIT 10;
[
{
"alphabet": "a",
"ncount": 73562
},
{
"alphabet": "j",
"ncount": 61599
},
{
"alphabet": "m",
"ncount": 48090
},
{
"alphabet": "s",
"ncount": 36204
},
{
"alphabet": "e",
"ncount": 35552
},
{
"alphabet": "c",
"ncount": 27191
},
{
"alphabet": "l",
"ncount": 26865
},
{
"alphabet": "d",
"ncount": 21026
},
{
"alphabet": "n",
"ncount": 17195
},
{
"alphabet": "k",
"ncount": 16632
}
]
Conclusion
Data.gov has tons of data. This data unlocks how the US government has worked in the past eight years. You can question the data to understand what worked and what didn’t.
N1QL provides the key link from raw data to a queryable form. Notice that the transformation query never uses dataset-specific field names, which means it can be reused for every JSON dataset in this format on data.gov. That is the power of N1QL, SQL for JSON: an expressive, powerful, and complete language for querying, transforming, and manipulating JSON data.
Go ahead. Try this on a data set you love on data.gov.
Related Information:
Couchbase: http://www.couchbase.com
Data.gov: http://www.data.gov
JSON datasets in Data.gov