No one doubts that indexing and querying with big data is challenging. Big data comes at you fast, with high velocity, variety, and volume! 100Ks of updates/sec and TBs of data to scan—you cannot do this in real-time unless you have solid indexing! Imagine these apps:
- The travel app that is pricing and recording all the flights and hotels you've looked at!
- The viral online game that has to display the accurate scoreboard for top players!
- The fraud detection app that needs to look at your recent activity to decide if the active credit-card transaction is a legitimate one!
These are use cases with queries that need to deal with high ingest of data but cannot compromise on milliseconds in the response time! If you cannot render the travel-itineraries, the score-boards, or respond to a fraud in real time, all bets are off! Okay! this sounds impossible and you ask: "How do you index and query this types of data in real time?"
Global Index vs. Local Index
Distributed systems offer 2 types of indexing models:
- Local indexes: In the cluster, each node indexes the data it locally holds. This optimizes for indexing fast. However as the data ingest increases, index maintenance locally competes with the incoming workload, and as the cluster gets larger (more nodes) the scatter gather hits the query latency. Imagine this query: "Find the top 10 most active users for month of Aug"
#SQL would look something like this SELECT customer_name, total_logins.jan_2015 FROM customer_bucket WHERE type=“customer_profile” ORDER BY total_logins.jan_2015 DESC LIMIT 10; #index for the query would look something like this INDEX ON Customer_bucket(customer_name, total_logins.jan_2015) WHERE type=“customer_profile”;
Here are the steps for executing the query on a cluster with a local index:
- No one node knows the answer! So, Scatter is required to figure out "TOP 10" on each node locally using the local index.
- Gather gets the "TOP 10" back to the coordinating node.
- The final step is to re-sort and figure out the real TOP 10 active users, combining the results from all nodes and sending the results back to the client.
Let's assume this was done over 100 nodes and you added your 101st node! Nothing gets faster upon executing this query. Every node still does the same work including the new node. In fact, the 101st node hurts the latency of the query!
By the way, many NoSQL databases like Couchbase Server or MongoDB do local indexing. For details on local indexing, see the Couchbase Server map-reduce views here.
- Global Indexes: The index is independently partitioned and placed away from the data on the nodes. It can be challenging to keep up with mutations, as indexing the data will require a network access but works fantastically for queries. Imagine the same query above. The index now sits on a node or two (maybe partitioned by continents as in the example below).
#index for the query would look something like this INDEX ON Customer_bucket(customer_name, total_logins.jan_2015) WHERE type=“customer_profile” AND continent="Europe"; INDEX ON Customer_bucket(customer_name, total_logins.jan_2015) WHERE type=“customer_profile” AND continent="America"; INDEX ON Customer_bucket(customer_name, total_logins.jan_2015) WHERE type=“customer_profile” AND continent="Asia";
Here are the steps for executing the query on a cluster with a global index:
- Now we have a node with the global index that knows the answer! So, no scatter required here! We simply retrieve the top login count from the index.
- The final step is to send the results back to the client.
Unlike those 100 nodes in the previous example, your 101st node can now do real work! Query latencies are much faster!
Global indexes are rarely seen in distributed databases (NoSQL or otherwise). Neither MongoDB nor Cassandra comes with global indexes. However, you can see global indexes in Couchbase Server under the name global secondary indexes. Couchbase Server GSIs can also be deployed independently to a separate zone within the cluster using the index service. That means data service nodes that are doing the core data operations (INSERT/UPDATE/DELETE) don't have to compete with the indexing that goes on in the other part of the cluster. This deployment topology is called MDS (multi-dimensional scaling) and you can find out more about it here.
In the world of distributed databases, it is important to have options for indexing. Otherwise, querying can be unpredictable in latency, and big data can be impossible to query in real-time! You can check out Couchbase Server for these added indexing options.