Aggregations With Couchbase Server vs SQL Server
I am super excited to report that we have built N1QL in Couchbase Server 4 to tackle the issue of "variety" posed by big data. Read on to learn more.
Join the DZone community and get the full member experience.
Join For FreeSQL has been around for a seriously long time. It is a very intuitive and efficient way to process structured data and has been the choice for databases for many years. With the world of big data, however, data has velocity, variety, and volume. SQL can tackle two of the "v"s ( “velocity” and “volume”) fine with optimizations. In fact, many new dialects of SQL (N1QL, Spark, U-SQL, Impala, Drill, and more) are doing exactly that. However, “variety” is a different ballgame! Big data is complex in the new world; it has unpredictable, constantly evolving, jagged scheme, sparse values, and deeply nested structures. For that, SQL needs to be extended!
Well, I am super excited to report that we have built N1QL in Couchbase Server 4 to tackle exactly these issues! N1QL can easly traverse over the complex structures in JSON (the de-facto serialization standard for complex data). Using N1QL, you can work with not only NULLs but attributes that are MISSING in various shapes of JSON you process. Or you can use operators like ANY/EVERY to query over arrays embedded within JSON document. Or you can use commands like UNNEST & NEST that can flatten or unflatten nested arrays. There are too many of these powerful extensions to count here, so I won't do that. Instead, I am going to show you one hidden gem that was originally posted here by Gerald. This gem is very useful with aggregates if you are using N1QL vs a relational database like SQL Server.
One of the big advantages of N1QL is its ability to understand the array types. Aggregation functions like MAX() are not revolutionary, but with the additions of nesting and arrays, something as simple as MAX() can be super powerful. Okay, so... imagine trying to find attributes of the product that has the MAX price. In SQL Server that is a fairly simple query to write using TSQL.
SELECT productID, name FROM t1
WHERE price = (SELECT MAX(price) FROM t1)
GO
That is good. Here is the output—the product with the highest price is product named "c" with ID 3.
productID name
----------- -----------------------------------------------------------------
3 c
Here is what the execution plan looks like. Basically, the execution plan scans for the MAX price value. Once you have the value, it is a nested loop join to search for the other attributes of the product in the table like productID and name.
Stmt Text
---------------------------------------------------------------------------------------------
select productID, name from t1 where price = (select max(price) from t1)
|--Nested Loops(Inner Join, WHERE:([Expr1004]=[test].[dbo].[t1].[price]))
|--Stream Aggregate(DEFINE:([Expr1004]=MAX([test].[dbo].[t1].[price])))
| |--Clustered Index Scan(OBJECT:([test].[dbo].[t1].[PK__t1__2D10D14A7FD17868]))
|--Clustered Index Scan(OBJECT:([test].[dbo].[t1].[PK__t1__2D10D14A7FD17868]))
However, N1QL has a great advantage here. Since it can process MAX() with arrays, you can return all the attributes of the document without an additional scan.
SELECT MAX([price, {productID, name]) FROM test;
Or you can simply return the full document using the following query:
SELECT MAX([price, test]) FROM test;
Let's take a look at the execution plan for the N1QL query. Here comes the spoiler: you see a single FETCH operation that can perform the MAX and the projection is done without a second FETCH.
cbq> explain select MAX([price, test]) from test;
{
"requestID": "b735ce5f-700c-4740-a065-6d4ba681129f",
"signature": "json",
"results": [
{
"#operator": "Sequence",
"~children": [
{
"#operator": "PrimaryScan",
"index": "#primary",
"keyspace": "test",
"namespace": "default",
"using": "gsi"
},
{
"#operator": "Parallel",
"~child": {
"#operator": "Sequence",
"~children": [
{
"#operator": "Fetch",
"keyspace": "test",
"namespace": "default"
},
{
"#operator": "InitialGroup",
"aggregates": [
"max([(`test`.`price`), `test`])"
],
"group_keys": []
}
]
}
},
{
"#operator": "IntermediateGroup",
"aggregates": [
"max([(`test`.`price`), `test`])"
],
"group_keys": []
},
{
"#operator": "FinalGroup",
"aggregates": [
"max([(`test`.`price`), `test`])"
],
"group_keys": []
},
{
"#operator": "Parallel",
"~child": {
"#operator": "Sequence",
"~children": [
{
"#operator": "InitialProject",
"result_terms": [
{
"expr": "max([(`test`.`price`), `test`])"
}
]
},
{
"#operator": "FinalProject"
}
]
}
}
]
}
],
"status": "success",
"metrics": {
"elapsedTime": "1.8224ms",
"executionTime": "1.7614ms",
"resultCount": 1,
"resultSize": 2347
}
}
Even though this is a pretty cool trick and a great performance gain, we've only just scratched the surface of what N1QL is capable of. There is a ton more to discover with N1QL. You can get started with Couchbase Server and N1QL here with the getting started guide.
By the way, if you'd like to try this out with SQL Server 2016 and Couchbase 4, here are the scripts to see this in action for yourself.
SQL Server Script
Note: Run this in a database called "test"
create table t1(productID int primary key, price int, name varchar(128));
go
insert into t1(productID,price,name) values(1,10,'a');
insert into t1(productID,price,name) values(2,9,'b');
insert into t1(productID,price,name) values(3,12,'c');
insert into t1(productID,price,name) values(4,11,'d');
insert into t1(productID,price,name) values(5,1,'e');
go
set statistics profile on
go
select max(price),productID,name from t1
go
--Msg 8120, Level 16, State 1, Line 10
--Column 't1.productID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
select productID, name from t1 where price = (select max(price) from t1)
go
Couchbase Server Script
Note: Create a bucket called "test"
Insert into test(key,value) values("1",{productID:1, price:10, name:'a'});
insert into test(key,value) values("1",{productID:1, price:10, name:'a'});
insert into test(key,value) values("2",{"productID":2, "price":9, "name":"b"});
insert into test(key,value) values("3",{"productID":3, "price":12, "name":"c"});
insert into test(key,value) values("4",{"productID":4, "price":11, "name":"d"});
insert into test(key,value) values("5",{"productID":5, "price":1, "name":"e"});
create primary index on test;
select max(price, productID, name) from t1;
select max([price, test]) from test;
explain select max([price, test]) from test;
Published at DZone with permission of Cihan B., DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments