YCSB-JSON: Implementation for Couchbase and MongoDB
Learn how to implement the YCSB-JSON performance benchmarking for Couchbase and MongoDB databases.
YCSB is a great benchmarking tool, built to be easily extended by any driver that supports and implements basic operations such as insert, read, update, delete, and scan. The plain synthetic data introduced by YCSB fits this paradigm perfectly.
But when it comes to JSON databases, queries become far more sophisticated: querying arrays and nested objects, running joins and aggregations. On one hand, the YCSB-JSON extension should be able to use all JSON operations supported by a database. On the other hand, the implementation in YCSB should be generic enough that any other DB driver can extend it, no matter what level of JSON querying it supports.
YCSB-JSON is designed to better emulate realistic end-user scenarios. It is designed to work on any JSON data: real datasets, pseudo-realistic data, or fully synthetic data. One of the requirements for the tool is that there should be no hardcoded values in query predicates. A user can only control the data cardinality during the dataset generation process.
Fig 1. YCSB-JSON implementation at a glance.
Data Model
The data model we chose for this benchmark is well described in this article. The dataset is generated with the fakeit tool and loaded into a database (Couchbase, MongoDB) by external scripts. While the model is defined and fixed, the values are randomly generated. The data is therefore randomly generated, but it is not synthetic.
Data Management
For each operation in the workload, the queries are fixed, but the bound values for each parameterized predicate are non-deterministic. So the following data management flow was chosen:
- Generate documents with fakeit.
- Load generated data to a database with an external script.
- Run the YCSB load phase. During this phase, YCSB reads a random subset of the generated documents and stores their values in its internal cache.
- During the run phase, YCSB will use the values from its cache while binding and executing queries against the database.
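The load and run phases above can be pictured with a small cache sketch. This is only an illustration under assumptions: the class name `SampledValueCache` and its methods are hypothetical and are not taken from the YCSB-JSON code. It shows the idea that values sampled during the load phase are later picked at random and bound into queries during the run phase.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration; not the YCSB-JSON cache implementation.
class SampledValueCache {
  private final Map<String, List<String>> valuesByField = new HashMap<>();
  private final Random random;

  SampledValueCache(long seed) {
    this.random = new Random(seed);
  }

  // Load phase: remember a value seen in one of the sampled documents.
  void remember(String field, String value) {
    valuesByField.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
  }

  // Run phase: pick a cached value to bind into a query predicate.
  String pick(String field) {
    List<String> values = valuesByField.get(field);
    if (values == null || values.isEmpty()) {
      throw new IllegalStateException("no cached values for field: " + field);
    }
    return values.get(random.nextInt(values.size()));
  }
}
```

Because every bound value comes from a document that actually exists in the database, queries built this way should match real data without any hardcoded literals.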
Predicates Generator
YCSB uses generators when operating on data. YCSB-JSON introduces its own generator, mapped to a particular data model. The mapping and the model exist only within the generator namespace. The generator output is a set of generic predicates (field-value pairs) for a particular query. This makes it possible to modify the model and extend the tool with other queries without modifying the rest of the YCSB core code.
Predicates generator: Generator.java
Example #1: Pagination Query
One of the YCSB-JSON operations, the pagination query, can be represented by the following statement:
SELECT * FROM <bucket> WHERE address.zip = <value> OFFSET <num> LIMIT <num>
The query predicate is a field within an object. With Couchbase N1QL, the field can be accessed simply as "address.zip". Another database might not be as flexible, so the YCSB-JSON generator creates two predicates: the parent predicate (address) and the child/nested predicate (zip).
The child predicate has a value randomly picked from a list of sample values for this particular field.
The function below generates the SoeQueryPredicate object, where name is "address" and nested predicate is another SoeQueryPredicate object with name "zip" and value <value>:
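A minimal sketch of such a parent/child predicate follows. The `PredicateSketch` class is a simplified stand-in: its fields and constructor are assumptions made for illustration and do not reproduce the real SoeQueryPredicate API or the code in Generator.java.

```java
import java.util.List;
import java.util.Random;

// Simplified stand-in for SoeQueryPredicate. The field and constructor names
// are assumptions for illustration, not the real class's API.
final class PredicateSketch {
  final String name;
  final String value;            // null when the predicate only wraps a child
  final PredicateSketch nested;  // null for a leaf predicate

  PredicateSketch(String name, String value, PredicateSketch nested) {
    this.name = name;
    this.value = value;
    this.nested = nested;
  }
}

final class PagePredicateSketch {
  // Builds address -> zip, binding zip to a value picked at random
  // from the list of sample values for that field.
  static PredicateSketch build(List<String> sampleZips, Random random) {
    if (sampleZips.isEmpty()) {
      throw new IllegalArgumentException("no sample zip values");
    }
    String zip = sampleZips.get(random.nextInt(sampleZips.size()));
    PredicateSketch child = new PredicateSketch("zip", zip, null);
    return new PredicateSketch("address", null, child);
  }
}
```

Keeping the parent and child as separate objects lets a driver decide for itself how to address the nested field, for example as a dotted path or as a nested document.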
Example #2: Report Query
Predicates for more complex queries are generated in the same way. The only difference is that when a query uses multiple predicates, a predicate sequence (an array of predicates) is generated instead of a single predicate. Here is a Report query:
SELECT o2.month, c2.address.zip, SUM(o2.sale_price) FROM <bucket> c2
INNER JOIN orders o2 ON KEYS c2.order_list
WHERE c2.address.zip = <value> AND o2.month = <value>
GROUP BY o2.month, c2.address.zip ORDER BY SUM(o2.sale_price)
The function below generates a sequence of predicates: the "month" predicate, the "address" predicate with a nested "zip" predicate, the "sale_price" predicate, etc.:
Other query generators can be found here.
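As a rough sketch of what a predicate sequence for the Report query could look like, the example below builds an ordered list. The class names are hypothetical stand-ins rather than the SoeQueryPredicate API, and only the predicates named in the text are included. The assumption that "sale_price" carries a name but no bound value follows from the query, where that field is only aggregated.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in predicate type; names are assumptions, not the SoeQueryPredicate API.
final class ReportPredicateSketch {
  final String name;
  final String value;                  // null when no value is bound
  final ReportPredicateSketch nested;  // null for a leaf predicate

  ReportPredicateSketch(String name, String value, ReportPredicateSketch nested) {
    this.name = name;
    this.value = value;
    this.nested = nested;
  }
}

final class ReportSequenceSketch {
  // The order is fixed so that a driver can address predicates by position.
  static List<ReportPredicateSketch> build(String month, String zip) {
    List<ReportPredicateSketch> sequence = new ArrayList<>();
    sequence.add(new ReportPredicateSketch("month", month, null));
    sequence.add(new ReportPredicateSketch("address", null,
        new ReportPredicateSketch("zip", zip, null)));
    // sale_price is only aggregated, so it has a name but no bound value.
    sequence.add(new ReportPredicateSketch("sale_price", null, null));
    return sequence;
  }
}
```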
New Operations
The YCSB code needs to be updated with new operations.
Extending YCSB CoreWorkload with new operations: SoeWorkload.java
Implementation of YCSB-JSON Operations for Couchbase and MongoDB
The DB driver function for a YCSB-JSON operation takes an additional parameter, a generator object. It is passed by the Workload class and has a particular predicate sequence prebuilt.
Because predicate structures and sequences are well defined by the generator, a DB driver can access names and values directly and construct the query using its native query language or other access methods. Below are examples of implementing the Page and Report queries.
Page query, generating query statement for Couchbase:
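A minimal sketch of building such a statement is shown here; it is not the code from Couchbase2Client.java. The class and method names are hypothetical, and the use of positional parameters (`$1`, `$2`, `$3`) for the zip value, offset, and limit is an assumption. The parent and child names would come from the predicate prepared by the generator.

```java
// Hypothetical helper; not the Couchbase2Client implementation.
final class CouchbasePageStatementSketch {
  // parentName and childName are read from the parent and nested predicates.
  // The clause order mirrors the pagination statement shown earlier.
  static String build(String bucket, String parentName, String childName) {
    return "SELECT * FROM `" + bucket + "` WHERE "
        + parentName + "." + childName
        + " = $1 OFFSET $2 LIMIT $3";
  }
}
```

Since the statement text depends only on field names, it stays the same across executions; only the bound values change from one operation to the next.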
For MongoDB:
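For MongoDB, a nested field can be addressed with dot notation in the filter key. The sketch below builds the filter as a plain map so that it stays free of driver dependencies; the class name is hypothetical and this is not the code from MongoDbClient.java. With the MongoDB Java driver, such a map could be wrapped in an `org.bson.Document` and passed to `find(...)`, with `skip(...)` and `limit(...)` applied for pagination.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical helper; not the MongoDbClient implementation.
final class MongoPageFilterSketch {
  // Produces a filter equivalent to { "address.zip": <value> }.
  // parentName and childName are read from the parent and nested predicates.
  static Map<String, Object> filter(String parentName, String childName, String value) {
    return Collections.<String, Object>singletonMap(parentName + "." + childName, value);
  }
}
```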
Report query, Couchbase:
MongoDB:
All Couchbase implementations: Couchbase2Client.java
All MongoDB implementations: MongoDbClient.java
Next Steps
Implement a fakeit-like generator in YCSB to simplify the generation of data and query predicates.