Indexing Big Data: Global vs. Local Indexes in Distributed Databases

In the world of distributed databases, it is important to have options for indexing. Otherwise, query latency can be unpredictable, and big data can be impossible to query in real time! Read on for tips on how to best index your database.

No one doubts that indexing and querying big data is challenging. Big data comes at you fast, with high velocity, variety, and volume! With hundreds of thousands of updates per second and terabytes of data to scan, you cannot do this in real time unless you have solid indexing! Imagine these apps:

  • The travel app that is pricing and recording all the flights and hotels you've looked at!
  • The viral online game that has to display the accurate scoreboard for top players!
  • The fraud detection app that needs to look at your recent activity to decide if the active credit-card transaction is a legitimate one!

These are use cases with queries that must cope with high data ingest but cannot compromise on millisecond response times! If you cannot render the travel itineraries or the scoreboards, or respond to fraud in real time, all bets are off! This sounds impossible, so you ask: "How do you index and query this type of data in real time?"

Global Index vs. Local Index

Distributed systems offer two types of indexing models:

  • Local indexes: Each node in the cluster indexes the data it holds locally. This optimizes for fast indexing. However, as data ingest increases, local index maintenance competes with the incoming workload, and as the cluster gets larger (more nodes), the scatter-gather hurts query latency. Imagine this query: "Find the top 10 most active users for January 2015."
/* The SQL would look something like this */
SELECT customer_name, total_logins.jan_2015
FROM customer_bucket
WHERE type="customer_profile"
ORDER BY total_logins.jan_2015 DESC
LIMIT 10;

/* The index for the query would look something like this (the index name is illustrative) */
CREATE INDEX top_logins_jan_2015 ON customer_bucket(customer_name, total_logins.jan_2015)
WHERE type="customer_profile";

Here are the steps for executing the query on a cluster with a local index: 

  1. No single node knows the answer! So a scatter is required: each node works out its own "TOP 10" using its local index.
  2. The gather brings each node's "TOP 10" back to the coordinating node.
  3. The final step is to combine the results from all nodes, re-sort them to find the real TOP 10 active users, and send the result back to the client.
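
The three steps above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: each "node" is an in-memory dictionary of customer name to login count, and the names `local_top_k` and `scatter_gather_top_k` are hypothetical, not part of any database API.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical stand-in for a cluster: each node holds its own slice of
# customer profiles as {customer_name: login_count}.
Node = Dict[str, int]


def local_top_k(node: Node, k: int) -> List[Tuple[str, int]]:
    """Scatter step: a node answers from the data it holds locally."""
    return heapq.nlargest(k, node.items(), key=lambda item: item[1])


def scatter_gather_top_k(nodes: List[Node], k: int) -> List[Tuple[str, int]]:
    """Gather step: the coordinator collects every node's candidates and re-sorts."""
    candidates: List[Tuple[str, int]] = []
    for node in nodes:  # every node is contacted, however many there are
        candidates.extend(local_top_k(node, k))
    return heapq.nlargest(k, candidates, key=lambda item: item[1])


# Made-up sample data for three nodes.
nodes = [
    {"ann": 40, "bob": 12},
    {"cid": 55, "dee": 7},
    {"eve": 31, "fay": 48},
]
top3 = scatter_gather_top_k(nodes, 3)
print(top3)
```

Note that the loop touches every node on every query, so the work done by the coordinator grows with the size of the cluster.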

Let's assume this was done over 100 nodes and you add a 101st node! The query does not get any faster. Every node, including the new one, still does the same work. In fact, the 101st node hurts the latency of the query, because the coordinating node now has to wait for one more response!
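
One rough way to see why an extra node can hurt: a scatter-gather query is only as fast as its slowest node. The sketch below assumes, purely for illustration, that each node is independently "slow" on a given request with a fixed probability `p_slow`. Both that model and the function name are assumptions, not measurements from any real cluster.

```python
def prob_query_hits_slow_node(num_nodes: int, p_slow: float) -> float:
    """Chance that at least one of num_nodes independent nodes responds slowly."""
    return 1.0 - (1.0 - p_slow) ** num_nodes


# Compare a single node, 100 nodes, and 101 nodes at an assumed 1% slow rate.
for n in (1, 100, 101):
    print(n, round(prob_query_hits_slow_node(n, 0.01), 3))
```

Under this model the chance of waiting on a slow node only goes up as nodes are added, which matches the intuition that the 101st node cannot make the query faster.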

By the way, many NoSQL databases, such as Couchbase Server and MongoDB, do local indexing. For details on local indexing, see the Couchbase Server documentation on map-reduce views.

  • Global indexes: The index is partitioned independently of the data and does not have to live on the nodes that hold the data it covers. Keeping up with mutations can be challenging, because indexing the data requires a network hop, but this model works fantastically for queries. Imagine the same query as above. The index now sits on a node or two (maybe partitioned by continent, as in the example below).
/* The indexes for the query would look something like this (the index names are illustrative) */
CREATE INDEX top_logins_europe ON customer_bucket(customer_name, total_logins.jan_2015)
WHERE type="customer_profile"
AND continent="Europe";
CREATE INDEX top_logins_america ON customer_bucket(customer_name, total_logins.jan_2015)
WHERE type="customer_profile"
AND continent="America";
CREATE INDEX top_logins_asia ON customer_bucket(customer_name, total_logins.jan_2015)
WHERE type="customer_profile"
AND continent="Asia";

Here are the steps for executing the query on a cluster with a global index: 

  1. Now we have a node with the global index that knows the answer! So no scatter is required here! We simply retrieve the top 10 login counts from the index.
  2. The final step is to send the results back to the client.
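
For contrast with the scatter-gather flow, here is a minimal sketch of the global-index idea. It assumes a single in-memory sorted list standing in for the index node. `GlobalLoginIndex`, `apply_mutation`, and `top_k` are hypothetical names chosen for this example and do not reflect how Couchbase Server implements its indexes.

```python
import bisect
from typing import Dict, List, Tuple


class GlobalLoginIndex:
    """Hypothetical index node: one sorted structure covering all data nodes."""

    def __init__(self) -> None:
        self._entries: List[Tuple[int, str]] = []  # kept sorted by (logins, name)
        self._current: Dict[str, int] = {}  # name -> latest login count

    def apply_mutation(self, customer_name: str, total_logins: int) -> None:
        """Write path: every update from any data node must reach the index
        (a network hop in a real cluster)."""
        old = self._current.get(customer_name)
        if old is not None:
            # Remove the stale entry before inserting the new one.
            position = bisect.bisect_left(self._entries, (old, customer_name))
            del self._entries[position]
        bisect.insort(self._entries, (total_logins, customer_name))
        self._current[customer_name] = total_logins

    def top_k(self, k: int) -> List[Tuple[str, int]]:
        """Query path: read the tail of the sorted index; no data node is contacted."""
        if k <= 0:
            return []
        return [(name, logins) for logins, name in reversed(self._entries[-k:])]


# Made-up sample mutations, including one update to an existing customer.
index = GlobalLoginIndex()
for name, logins in [("ann", 40), ("cid", 55), ("fay", 48), ("bob", 12)]:
    index.apply_mutation(name, logins)
index.apply_mutation("bob", 60)
leaders = index.top_k(3)
print(leaders)
```

The trade-off described above shows up directly: `apply_mutation` does extra work on every write, while `top_k` answers without visiting any of the nodes that hold the data.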

Unlike those 100 nodes in the previous example, your 101st node can now do real work! Query latencies are much faster! 

Global indexes are rarely seen in distributed databases (NoSQL or otherwise). Neither MongoDB nor Cassandra comes with global indexes. However, Couchbase Server offers them under the name global secondary indexes (GSIs). Couchbase Server GSIs can also be deployed independently to a separate zone within the cluster using the index service. That means the data service nodes doing the core data operations (INSERT/UPDATE/DELETE) don't have to compete with the indexing that goes on in another part of the cluster. This deployment topology is called MDS (multi-dimensional scaling); the Couchbase Server documentation covers it in more detail.

In the world of distributed databases, it is important to have options for indexing. Otherwise, query latency can be unpredictable, and big data can be impossible to query in real time! You can check out Couchbase Server for these added indexing options.

Published at DZone with permission of Cihan B., DZone MVB. See the original article here.
