DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

The software you build is only as secure as the code that powers it. Learn how malicious code creeps into your software supply chain.

Apache Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workloads.

Related

  • The Role of Data Governance in Data Strategy: Part II
  • Raft in Tarantool: How It Works and How to Use It
  • It's 2025: How Do You Choose Between Doris and ClickHouse?
  • Exploring the New Boolean Data Type in Oracle 23c AI

Trending

  • Endpoint Security Controls: Designing a Secure Endpoint Architecture, Part 2
  • Designing a Java Connector for Software Integrations
  • Unit Testing Large Codebases: Principles, Practices, and C++ Examples
  • Modern Test Automation With AI (LLM) and Playwright MCP
  1. DZone
  2. Data Engineering
  3. Databases
  4. How to Create a Search Engine and Algorithm With ClickHouse and Snowflake

How to Create a Search Engine and Algorithm With ClickHouse and Snowflake

Explore a step-by-step guide to developing a search engine and algorithm using Clickhouse, an open-source data warehousing solution.

By 
Akash Mukherjee user avatar
Akash Mukherjee
·
Oct. 03, 24 · Tutorial
Likes (2)
Comment
Save
Tweet
Share
4.2K Views

Join the DZone community and get the full member experience.

Join For Free

ClickHouse is an open-source data warehousing solution that is architected as a columnar database management system. This makes it extremely powerful to work with massive datasets, especially ones that are long as they can be aggregated, ordered, or computed with low latency. When working with the same data type, it's very efficient for fast scanning and filtering of the data. This makes it a great use case for implementing a search engine.

A lot of applications use Elasticsearch as their search engine solution. However, such an implementation can be expensive both in terms of cost and time. Copying the data over to Elasticsearch can also cause lags because data is being migrated to another data store. Also, setting up the Elasticsearch cluster, configuring the nodes and defining and fine-tuning indexes can take more programmatic work, which may not be justified for all projects. 

Fortunately, we can create an alternative search engine solution using a data warehousing solution such as ClickHouse (or Snowflake) that the company is already using for analytical purposes. Not only does ClickHouse support capabilities such JOINing, UNIONing data and performing statistical functions like STDDEV, but it also goes above and beyond by offering fuzzy text matching algorithms such as multiFuzzyMatchAnyIndex that does an advanced distance calculation across a haystack. Finally, ClickHouse has a more cost-effective storage model and is open-source.

In this tutorial, we will learn how to index, score, and match search queries to return results that make sense for the user.

Prerequisite

First, we need a database to work with. We will start with a movies database which contains 3 different kinds of entities: 1) movies, 2) celebrities, and 3) production houses. Below are the scripts to create a database with those 3 tables.

Movies Table

SQL
 
CREATE OR REPLACE TABLE movies AS
SELECT 1 as id, 'John Wick' as movie_name, 'Action movie centered around a hitman' as movie_description, 9 as imbdb_rating

UNION ALL

SELECT  2 as id, 'Midnight in Paris' as movie_name, 'Romantic movie with historical nostalgia' as movie_description, 8 as imdb_rating

UNION ALL

SELECT 3 as id, 'Foxcatcher' as movie_name, 'Sports movie inspired by true events' as movie_description, 7.0 as imdb_rating

UNION ALL

SELECT 4 as id, 'Bull' as movie_name, 'Mystery and revenge drama' as movie_description, 6.5 as imdb_rating


Movies Table

Celebrities Table

SQL
 
CREATE OR REPLACE TABLE celebrities AS

SELECT 1 as id, 'John Wick' as celebrity_name, 'Some actor from Nebraska' as bio, 1500 as instagram_followers
UNION ALL
SELECT  2 as id, 'Owen Wilson' as celebrity_name, 'Romantic movie with historical nostalgia' as bio, 40700 as instagram_followers
UNION ALL
SELECT 3 as id, 'Sandra Bullock' as celebrity_name, 'Sports movie inspired by true events' as bio, 2400000 as instagram_followers
UNION ALL
SELECT 4 as id, 'Robert Downey Jr.' as celebrity_name, 'Popular for his role as Iron Man' as bio, 5810000 as instagram_followers


Celebrities Table

Production Houses Table

SQL
 
CREATE OR REPLACE TABLE production_houses AS

SELECT 1 as id, '20th Century Fox' as production_house, 6095 as num_movies

UNION ALL

SELECT  2 as id, 'Paramount Pictures' as production_house, 12715 as num_movies

UNION ALL

SELECT 3 as id, 'DreamWorks Pictures' as production_house, 158 as num_movies


Production Houses Table

Architecture

We need to create a system that can search across all the movies, celebrities, and production houses when we query by a search keyword(s) and return to us the best fitting results order in what makes most sense.

Unified Entities Table

Tutorial

Indexing

As a first step, we will take all the disparate tables from the database and standardize them in a unified_entities table by UNIONing them together.

SQL
 
CREATE OR REPLACE TABLE unified_entities AS
SELECT 'movie' as entity_type, id as entity_id, movie_name as entity_name, movie_description as entity_description, imbdb_rating as entity_metric

FROM movies

UNION ALL 

SELECT 'celebrity' as entity_type, id as entity_id, celebrity_name as entity_name, bio as entity_description, instagram_followers as entity_metric

FROM celebrities

UNION ALL 

SELECT 'production house' as entity_type, id as entity_id, production_house as entity_name, '' as entity_description, num_movies as entity_metric

FROM production_houses


Standardized tables

Scoring

Next, we want to make sure we create an algorithm that compares apples to apples. If there's an actor named John Wick and a movie named John Wick, we want to know which one to rank first. By simply comparing them against each other, we may not know which is bigger because we are comparing apples to oranges. The metric available for movies in our database is imdb_rating, while the metric available for celebrities in our database is instagram_followers.

Using a z-score calculation, we will be able to calculate how John Wick as a movie ranks amongst other movies, and also how John Wick as a celebrity ranks amongst other available celebrities. This same example can be used for a word like "Fox" to compare if the movie "Foxcatcher" is more popular than "20th Century Fox" or not.

SQL
 
CREATE OR REPLACE TABLE unified_entities_scored

SELECT

    entity_type,

    entity_id,

    entity_name,

    entity_metric,

    (entity_metric - AVG(entity_metric) OVER (PARTITION BY entity_type))

    / STDDEV_POP(entity_metric) OVER (PARTITION BY entity_type) AS entity_z_score

FROM unified_entities

WHERE 1=1


unified_entities_scored

Fuzzy Text Matching

Finally, once we have unified the entities and scored them uniformly, the next step is to compare the search keyword(s) entered by a user to the name being compared to. 

For fuzzy text matching, we ended up using ClickHouse's function multiFuzzyMatchAnyIndex.

SQL
 
SELECT

    entity_name,

    entity_type,

    entity_metric,

    entity_z_score

FROM unified_entities_scored

WHERE multiFuzzyMatchAnyIndex(entity_name, 1, ['(?i)john', '(?i)wick']) > 0

ORDER BY entity_z_score DESC;


As you would have seen, we also ended up ranking the search results by the z-scores we calculated for each entity (within their entity type).

Below, we can see the search results returned are not only correct but are ranked in the right order with John Wick, the movie, getting a higher score than John Wick, the celebrity.

Correct John Wick search results

We can try a similar search for the keyword "Fox."

SQL
 
SELECT

    entity_name,

    entity_type,

    entity_metric,

    entity_z_score

FROM unified_table_scored

WHERE multiFuzzyMatchAnyIndex(entity_name, 1, ['(?i)fox']) > 0

ORDER BY entity_z_score DESC;


This tells us that 20th Century Fox is a better-ranked search result because it is more prominent as a production house than Foxcatcher's prominence as a movie.

 20th Century Fox is a better ranked search result

multiFuzzyMatchAnyIndex() is a ClickHouse-specific function. Hence, if we were doing this in Snowflake, everything so far stays the same. However, in Snowflake, we will have to change the query to as below:

SQL
 
SELECT

    entity_name,

    entity_type,

    entity_metric,

    entity_z_score

FROM unified_table_scored

WHERE LOWER(entity_name) ILIKE '%john %wick%'

ORDER BY entity_z_score DESC;


Further Sophistication

As demonstrated, this search algorithm gets us pretty solid search outcomes. However, if we wanted to further improve our search, we need a use-case of searching by synonyms such as "RDJ" instead of Robert Downey Jr. or NYC instead of New York.

For us to be able to do that, we can start by first creating a synonyms table:

Synonyms Table

SQL
 
CREATE OR REPLACE TABLE entity_synonyms AS

SELECT 'celebrity' as entity_type, 4 as entity_id, 'RDJ' as synonym

UNION ALL

SELECT 'production house' as entity_type, 1 as entity_id, '20th Century Studios' as synonym


Synonyms Table

Merge Synonyms to Unified Entities

Now, it's time to JOIN the entity_synonyms to the unified_entities we created and make the unified_entities table a longer table. When we UNION these tables, we shall just create a new  column called search_string that can take the value of entity_name for entity records and the value of synonym for the synonym records.

SQL
 
CREATE OR REPLACE TABLE unified_entities AS
WITH unified_entities_v1 as (
    SELECT 'movie' as entity_type, id as entity_id, movie_name as entity_name, movie_description as entity_description, imbdb_rating as entity_metric
    FROM movies
    UNION ALL 
    SELECT 'celebrity' as entity_type, id as entity_id, celebrity_name as entity_name, bio as entity_description, instagram_followers as entity_metric
    FROM celebrities
    UNION ALL 
    SELECT 'production house' as entity_type, id as entity_id, production_house as entity_name, '' as entity_description, num_movies as entity_metric
    FROM production_houses
)
SELECT u.entity_type, u.entity_id, u.entity_name, u.entity_name as search_string, u.entity_description, u.entity_metric
FROM unified_entities_v1 u

UNION ALL

SELECT u.entity_type, u.entity_id, u.entity_name,  s.synonym as search_string, u.entity_description, u.entity_metric
FROM unified_entities_v1 u
INNER JOIN entity_synonyms s ON u.entity_type = s.entity_type AND u.entity_id = s.entity_id


Merge Synonyms to Unified Entities

Search Query

We can try searching by "RDJ" and here's what we will get below:

SQL
 
SELECT
    entity_id,
    entity_name,
    entity_type,
    entity_metric,
    entity_z_score
FROM unified_entities_scored
WHERE multiFuzzyMatchAnyIndex(search_string, 1, ['(?i)RDJ']) > 0
ORDER BY entity_z_score DESC;


Search Query

In this example, we used the search_string column for fuzzy text matching. However, we used the entity_name and entity_id columns for displaying the records returned. This is done for the most optimal user experience.

As we can see, the search result returns the same result for Robert Downey, Jr. despite searching by the synonym "RDJ", which is our intended outcome.

Summary

This article showed a full tutorial on how to create a cross-entity search engine in ClickHouse from scratch. We took an example of a movie database and demonstrated the key steps involved such as indexing, scoring, and text matching. This implementation can be easily replicated for any other domain such as e-commerce or fintech.

AI ClickHouse Database Algorithm Search engine (computing)

Opinions expressed by DZone contributors are their own.

Related

  • The Role of Data Governance in Data Strategy: Part II
  • Raft in Tarantool: How It Works and How to Use It
  • It's 2025: How Do You Choose Between Doris and ClickHouse?
  • Exploring the New Boolean Data Type in Oracle 23c AI

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!