How to Build a Google Search Autocomplete

Build Google search autocomplete functionality.

By Pulkit Kedia · Oct. 11, 19 · Tutorial


Google autocomplete functionality

Whenever you start typing your search on Google, you get a list of recommendations, and the more letters you type, the more accurate the recommendations get. If you're like me, you've always wondered how this works — is the inverted index being stored, or is it something else?

The data structure that would be apt here is a Trie.

You may also like: Implementing Low-Level Trie: Solving With C++.

System Requirements

Considering the scale of Google, the factors we need to keep in mind are latency, consistency, and availability. Latency should be very low, with the suggestions updating on every letter you type. Next, the system needs to be available all the time; consistency, however, can be compromised here. Every query you type changes the frequency of the stored query, which in turn changes the suggestions; a slight delay in reflecting that is acceptable, so eventual consistency works.

The Concept:

Example of a Trie data structure

Approach #1

A trie represents a word as a tree, with each letter as a node, the next letter as its child node, and so on. Google also stores each word/sentence in the form of a trie. In the example here, the parent node is “h,” its child node is “a,” then “r,” and so on. Each node can have up to 26 child nodes, and each node can also store the search frequency of the string it represents.

For example, node “h” stores the search frequency of “h,” its child node “a” stores the search frequency of “ha,” and so on. Now, suppose we want to show the top N recommendations: say you typed “h,” and the suggestions should show “harry potter” or “harry styles.” To get them, we would have to sort every query below the parent node, at every level, by frequency. That would mean scanning terabytes of data, and since low latency is our goal, this scanning approach would not work.
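For reference, a minimal sketch of this per-letter node and its frequency counter might look like the following (Java, with illustrative names; it assumes lowercase a–z queries only, matching the 26-child assumption above):

```java
// Minimal sketch of the Approach #1 node: 26 child slots plus a search-frequency counter.
class TrieNode {
    TrieNode[] children = new TrieNode[26]; // one slot per lowercase letter, 'a'..'z'
    long frequency = 0;                     // how many times the query ending at this node was searched
}

class Trie {
    private final TrieNode root = new TrieNode();

    // Record one completed search for the given (lowercase a-z) query.
    void recordSearch(String query) {
        TrieNode node = root;
        for (char c : query.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) {
                node.children[i] = new TrieNode();
            }
            node = node.children[i];
        }
        node.frequency++;
    }
}
```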

Approach #2

To make this approach more efficient, we can store more data at each node along with the search frequency. Let's store, at each node, the top N queries from the subtree below it. This means the node “h” would have queries like “harry potter,” “harley davidson,” etc. stored. If you traverse down the tree to “harl” (i.e., you type “harl”), the node “l” would have queries like “harley davidson,” “harley quinn,” etc.
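A sketch of what such a node and its read path could look like, in the same illustrative Java style as above (topQueries and suggest are hypothetical names, not a fixed API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the Approach #2 node: besides the frequency, each node keeps the
// top N complete queries from the subtree below it, so a read is just a walk
// down the prefix plus a list lookup.
class SuggestNode {
    SuggestNode[] children = new SuggestNode[26];
    long frequency = 0;                           // searches for the exact query ending here
    List<String> topQueries = new ArrayList<>();  // precomputed top-N suggestions for this prefix
}

class SuggestTrie {
    final SuggestNode root = new SuggestNode();

    // Return the precomputed suggestions for whatever the user has typed so far.
    List<String> suggest(String prefix) {
        SuggestNode node = root;
        for (char c : prefix.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) {
                return List.of(); // no data for this prefix yet
            }
            node = node.children[i];
        }
        return node.topQueries;
    }
}
```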

This approach is better than the previous one because reads are quite efficient. Whenever a node's frequency gets updated, we traverse back from that node to its parent until we reach the root. For every ancestor, we check whether the current query is already part of its top N; if it is, we replace the corresponding frequency with the updated one. If not, we check whether the current query's frequency is now high enough to belong in the top N and, if so, add it.

Though this approach works, it does affect read efficiency: we need to lock a node each time we write or update it so the user won't read stale values. If we accept eventual consistency, this is not much of an issue; the user might see stale values for a while, but the data eventually becomes consistent. Still, we will look at an extension of this approach.
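Here is a hedged sketch of that write path, building on the SuggestNode type from the previous sketch. Remembering the path during the downward walk stands in for "traversing back to the root"; N, the frequency map, and all names are illustrative, and a production version would also need the locking discussed above:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the Approach #2 write path: bump the query's frequency, then
// refresh the top-N list of every prefix node on the path to that query.
class SuggestTrieWriter {
    private static final int N = 5;                               // suggestions kept per node
    private final SuggestNode root = new SuggestNode();
    private final Map<String, Long> queryFrequency = new HashMap<>();

    void recordSearch(String query) {
        queryFrequency.merge(query, 1L, Long::sum);

        // Walk down, remembering every prefix node on the path.
        List<SuggestNode> path = new ArrayList<>();
        SuggestNode node = root;
        for (char c : query.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) {
                node.children[i] = new SuggestNode();
            }
            node = node.children[i];
            path.add(node);
        }
        node.frequency = queryFrequency.get(query);

        // Refresh each prefix node's top-N list using the updated frequency.
        for (SuggestNode prefixNode : path) {
            if (!prefixNode.topQueries.contains(query)) {
                prefixNode.topQueries.add(query);
            }
            prefixNode.topQueries.sort(
                    Comparator.comparingLong((String q) -> queryFrequency.get(q)).reversed());
            if (prefixNode.topQueries.size() > N) {
                prefixNode.topQueries.remove(N); // drop the entry that fell out of the top N
            }
        }
    }
}
```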

Approach #3

As an extension of the previous approach, we can aggregate updates offline. We keep a hashmap from each query to its frequency, and only once a query's accumulated frequency reaches a set threshold do we push that update out to the trie servers, instead of writing through on every search.
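One possible shape for that offline buffer is sketched below; TrieUpdateClient is a hypothetical stand-in for the call to the trie servers, and the threshold value is arbitrary:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the Approach #3 offline buffer: searches first land in an
// in-memory map, and only when a query's buffered count crosses a threshold
// do we push one batched update to the trie servers.
class OfflineFrequencyBuffer {
    private static final long FLUSH_THRESHOLD = 1_000;           // illustrative value
    private final Map<String, Long> pendingCounts = new ConcurrentHashMap<>();
    private final TrieUpdateClient trieClient;                   // hypothetical RPC client

    OfflineFrequencyBuffer(TrieUpdateClient trieClient) {
        this.trieClient = trieClient;
    }

    void recordSearch(String query) {
        long buffered = pendingCounts.merge(query, 1L, Long::sum);
        if (buffered >= FLUSH_THRESHOLD) {
            pendingCounts.remove(query);
            trieClient.incrementFrequency(query, buffered); // one batched write instead of many
        }
    }
}

// Hypothetical interface standing in for the update call to the trie servers.
interface TrieUpdateClient {
    void incrementFrequency(String query, long delta);
}
```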

Scaling

Now, there wouldn't be just one big server storing all of these petabytes of data, and we can't keep scaling vertically forever; there is a better approach. We can distribute the data (sharding) by prefix across various servers. For example, prefixes like "a," "aa," "aab," etc. would go to server #1, and so on. We could use a load balancer to keep the map of prefix to server number.
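A minimal sketch of the prefix-to-shard map the load balancer could keep (the class name and the longest-prefix routing rule are illustrative):

```java
import java.util.TreeMap;

// Sketch of prefix-based sharding: the load balancer keeps a map from prefixes
// to server IDs and routes each typed prefix to the shard that owns it.
class PrefixRouter {
    private final TreeMap<String, Integer> prefixToShard = new TreeMap<>();

    void assign(String prefix, int shardId) {
        prefixToShard.put(prefix, shardId);
    }

    // Find the shard owning the longest assigned prefix of what the user typed.
    int routeTo(String typedPrefix) {
        for (int len = typedPrefix.length(); len > 0; len--) {
            Integer shard = prefixToShard.get(typedPrefix.substring(0, len));
            if (shard != null) {
                return shard;
            }
        }
        throw new IllegalStateException("no shard assigned for " + typedPrefix);
    }
}
```

For example, after assign("a", 1) and assign("x", 2), a call to routeTo("aab") resolves to shard 1, because "a" is the longest assigned prefix of "aab."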

But consider this: servers holding prefixes like "x," "xa," "yy," etc. would see far less traffic than the server for "a." So, there can be a threshold check on each server; if the load surpasses that threshold, the data can be redistributed among the other shards.

If you are concerned about a single point of failure, there can be many servers acting as load balancers, so if one load balancer goes down, it is replaced by another. You can use ZooKeeper to continuously health-check the load balancers and act accordingly.
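As one possible shape for that health check, each load balancer could register an ephemeral znode: if its ZooKeeper session dies, the znode disappears and a standby can take over. The connection string, paths, and class name below are illustrative:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch of load-balancer liveness tracking with ZooKeeper ephemeral nodes.
public class LoadBalancerRegistry implements Watcher {
    private ZooKeeper zk;

    public void register(String lbId) throws Exception {
        zk = new ZooKeeper("zk1.example.com:2181", 5000, this);
        // Ensure the parent path exists (persistent node survives restarts).
        if (zk.exists("/load-balancers", false) == null) {
            zk.create("/load-balancers", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        // Ephemeral node: removed automatically when this load balancer's session ends,
        // which signals the failure to whoever is watching the path.
        zk.create("/load-balancers/" + lbId, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    @Override
    public void process(WatchedEvent event) {
        // Session and node events arrive here; a real implementation would react to them.
    }
}
```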


Further Reading

  • ZooKeeper: A Real World Example of How to Use It.
  • Apache Solr: Getting Optimal Search Results.
  • Kafka Producer and Consumer Examples Using Java.