DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

The software you build is only as secure as the code that powers it. Learn how malicious code creeps into your software supply chain.

Apache Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workloads.

Related

  • How Trustworthy Is Big Data?
  • Enhancing Avro With Semantic Metadata Using Logical Types
  • A Deep Dive into Apache Doris Indexes
  • The Evolution of Database Architectures: Navigating Big Data, Cloud, and AI Integration

Trending

  • Four Essential Tips for Building a Robust REST API in Java
  • Mastering Advanced Traffic Management in Multi-Cloud Kubernetes: Scaling With Multiple Istio Ingress Gateways
  • Navigating the LLM Landscape: A Comparative Analysis of Leading Large Language Models
  • Ensuring Configuration Consistency Across Global Data Centers
  1. DZone
  2. Data Engineering
  3. Big Data
  4. Indexing Big Data: Global vs. Local Indexes in Distributed Databases

Indexing Big Data: Global vs. Local Indexes in Distributed Databases

In the world of distributed databases, it is important to have options for indexing. Otherwise, querying can be unpredictable in latency, and big data can be impossible to query in real-time! Read on for tips on how to best index your database.

By 
Cihan B. user avatar
Cihan B.
·
Updated May. 16, 16 · Analysis
Likes (8)
Comment
Save
Tweet
Share
15.5K Views

Join the DZone community and get the full member experience.

Join For Free

no one doubts that indexing and querying with big data is challenging. big data comes at you fast, with high velocity, variety, and volume! 100ks of updates/sec and tbs of data to scan—you cannot do this in real-time unless you have solid indexing! imagine these apps:

  • the travel app that is pricing and recording all the flights and hotels you've looked at!
  • the viral online game that has to display the accurate scoreboard for top players!
  • the fraud detection app that needs to look at your recent activity to decide if the active credit-card transaction is a legitimate one!

these are use cases with queries that need to deal with high ingest of data but cannot compromise on milliseconds in the response time! if you cannot render the  travel-itineraries, the score-boards,  or  respond to a fraud in real time  , all bets are off! okay! this sounds impossible and you ask:  "how do you index and query this types of data in real time?" 

global index vs. local index

distributed systems offer 2 types of indexing models:

  • local indexes: in the cluster, each node indexes the data it locally holds. this optimizes for indexing fast. however as the data ingest increases, index maintenance locally competes with the incoming workload, and as the cluster gets larger (more nodes) the scatter gather hits the query latency. imagine this query: "  find the top 10 most active users for month of aug  "
#sql would look something like this
select customer_name, total_logins.jan_2015 
from customer_bucket 
where type=“customer_profile” 
order by total_logins.jan_2015 desc 
limit 10;

#index for the query would look something like this
index on customer_bucket(customer_name, total_logins.jan_2015) 
where type=“customer_profile”;

here are the steps for executing the query on a cluster with a local index:

  1. no one node knows the answer! so, scatter is required to figure out "top 10" on each node locally using the local index.
  2. gather gets the "top 10" back to the coordinating node.
  3. the final step is to re-sort and figure out the real top 10 active users, combining the results from all nodes and sending the results back to the client.

let's assume this was done over 100 nodes and you added your 101st node! nothing gets faster upon executing this query. every node still does the same work including the new node. in fact, the 101st node hurts the latency of the query!

by the way, many nosql databases like couchbase server or mongodb do local indexing. for details on local indexing, see the couchbase server map-reduce views  here  .

image title

  • global indexes: the index is independently partitioned and placed away from the data on the nodes. it can be challenging to keep up with mutations, as indexing the data will require a network access but works fantastically for queries. imagine the same query above. the index now sits on a node or two (maybe partitioned by continents as in the example below).
#index for the query would look something like this
index on customer_bucket(customer_name, total_logins.jan_2015) 
where type=“customer_profile”
and continent="europe";
index on customer_bucket(customer_name, total_logins.jan_2015) 
where type=“customer_profile”
and continent="america";
index on customer_bucket(customer_name, total_logins.jan_2015) 
where type=“customer_profile”
and continent="asia";

here are the steps for executing the query on a cluster with a global index:

  1. now we have a node with the global index that knows the answer! so, no scatter required here! we simply retrieve the top login count from the index.
  2. the final step is to send the results back to the client.

unlike those 100 nodes in the previous example, your 101st node can now do real work! query latencies are much faster!

image title

global indexes are rarely seen in distributed databases (nosql or otherwise). neither mongodb nor cassandra comes with global indexes. however, you can see global indexes in couchbase server under the name  global secondary indexes  . couchbase server gsis can also be deployed independently to a separate zone within the cluster using the index service. that means data service nodes that are doing the core data operations (insert/update/delete) don't have to compete with the indexing that goes on in the other part of the cluster. this deployment topology is called mds (multi-dimensional scaling) and you can find out more about it  here  .

in the world of distributed databases, it is important to have options for indexing. otherwise, querying can be unpredictable in latency, and big data can be impossible to query in real-time! you can check out couchbase server for these added indexing options.

Big data Database

Published at DZone with permission of Cihan B., DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • How Trustworthy Is Big Data?
  • Enhancing Avro With Semantic Metadata Using Logical Types
  • A Deep Dive into Apache Doris Indexes
  • The Evolution of Database Architectures: Navigating Big Data, Cloud, and AI Integration

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: