Searching Through Text: Part 1, Full-Text Search in Under 200 Lines of Code


In this article, take a look at how to do full-text search in under 200 lines of code.


Full-Text Search

Full-text search is a really interesting topic that I have been dipping my toes into again and again over the years. It is a rich area of research, and there have been quite a few papers, books, and articles about the topic. I have read the code of a bunch of full-text search projects, and I have been using Lucene for a while.

I thought that I would write some code to play with full-text search and see where that takes me. This is a side project, and I hope it will be an interesting one. The first thing that I need to do is to define the scope of work:

  • Be able to (eventually) do full-text search queries
  • Compare and contrast different persistence strategies for this
  • Be able to work with multiple fields

What I don't care about: the analysis process, actually implementing complex queries (though I do want to have the foundation for them), etc.

Given that I want to work with real data, I went and got the Enron dataset. That is over 517,000 emails from Enron, totaling more than 2.2 GB. This is one of the more commonly used test datasets for full-text search, so that is helpful. The first thing that we need to do is to get the data into a shape that we can do something with.

The Enron dataset is basically a set of MIME-encoded files, so I've used MimeKit to speed up the parsing process. Here are the relevant bits of the code I'm using to extract the data for the system:

static void Analyze(string dir)
{
    var separators = new char[] { ' ', '\t', ',', '!', '\r', '(', ')', '?', '-', '"', '\n', '/' };
    var trim = new char[] { '.' };

    var blockingCollection = new BlockingCollection<string>(2048);
    var tasks = new List<Task>();
    var dics = new ConcurrentQueue<Dictionary<string, HashSet<string>>>();
    for (int i = 0; i < 16; i++)
    {
        var task = Task.Run(() =>
        {
            while (blockingCollection.IsCompleted == false)
            {
                // TryTake with a timeout avoids racing with CompleteAdding()
                if (blockingCollection.TryTake(out var fileName, 250) == false)
                    continue;

                using var stream = File.OpenRead(fileName);

                var parser = new MimeParser(stream, MimeFormat.Entity);
                while (parser.IsEndOfStream == false)
                {
                    var entity = parser.ParseMessage();

                    var dic = new Dictionary<string, HashSet<string>>
                    {
                        ["Id"] = new HashSet<string> { entity.MessageId.ToLower() },
                        ["Date"] = new HashSet<string> { entity.Date.ToString("r") },
                        ["From"] = entity.From.Select(x => x.ToString().ToLower()).ToHashSet(),
                        ["To"] = entity.To.Select(x => x.ToString().ToLower()).ToHashSet(),
                        ["Body"] = entity.GetTextBody(TextFormat.Plain)
                            .Split(separators, StringSplitOptions.RemoveEmptyEntries)
                            .Select(x => x.Trim(trim).ToLower())
                            .Where(x =>
                            {
                                if (x.Length > 3)
                                    return true;
                                if (x.Length == 0)
                                    return false;
                                return char.IsDigit(x[0]);
                            })
                            .ToHashSet()
                    };
                    dics.Enqueue(dic);
                }
            }
        });
        tasks.Add(task);
    }

    var sp = Stopwatch.StartNew();

    tasks.Add(Task.Run(() =>
    {
        foreach (var file in Directory.EnumerateFiles(dir, "*", SearchOption.AllDirectories))
        {
            blockingCollection.Add(file);
        }
        blockingCollection.CompleteAdding();
    }));

    var final = Task.WhenAll(tasks.ToArray());
    final.Wait();

    // do stuff with it.
}

As you can see, this is hardly a sophisticated approach. We are spawning a bunch of threads, processing all half a million emails in parallel, selecting a few key fields, and doing some very basic text processing.

The idea is that we want to get to the point where we have enough information to do full-text search, but without going through the real pipeline that this would take.

Here is an example of the output of one of those dictionaries:

{
  "Id": [ … ],
  "Date": [
    "Sat, 03 Nov 2001 01:24:54 GMT"
  ],
  "From": [ … ],
  "To": [ … ],
  "Body": [ … ]
}

As you can see, this is bare bones (I forgot to index the Subject, for example), but on my laptop (an 8-core Intel Core i7-6820HQ @ 2.70 GHz with 16 GB of RAM), we can index this amount of data in under a minute and a half.

So far, so good, but this doesn't actually get us anywhere. We need to construct an inverted index so we can ask questions about the data and be able to find things out. We are already about halfway there, which is encouraging. Let's see how far we can stretch the "simplest thing that could possibly work," shall we?

Here are the key data structures:

var docs = new ConcurrentDictionary<long, DocRef>();
long index = 0;

var fields = new ConcurrentDictionary<string, ConcurrentQueue<long>>[]
{
    new ConcurrentDictionary<string, ConcurrentQueue<long>>(), // Body
    new ConcurrentDictionary<string, ConcurrentQueue<long>>(), // Subject
    new ConcurrentDictionary<string, ConcurrentQueue<long>>(), // Date
    new ConcurrentDictionary<string, ConcurrentQueue<long>>(), // From
    new ConcurrentDictionary<string, ConcurrentQueue<long>>(), // To
};

Basically, we have an array of fields, each of which holds a dictionary mapping each term to the list of documents that contain that term.
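To make that concrete, here is a minimal sketch of how one of the parsed dictionaries could be folded into these structures. The field order (Body, Subject, Date, From, To) matches the comments above, but the `AddToIndex` helper name and its exact shape are my assumptions, not the article's actual code:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Field order mirrors the array above: Body, Subject, Date, From, To.
var fieldNames = new[] { "Body", "Subject", "Date", "From", "To" };
var fields = new ConcurrentDictionary<string, ConcurrentQueue<long>>[fieldNames.Length];
for (int i = 0; i < fields.Length; i++)
    fields[i] = new ConcurrentDictionary<string, ConcurrentQueue<long>>();

// Fold one parsed email (field name -> set of terms) into the inverted index.
void AddToIndex(Dictionary<string, HashSet<string>> dic, long docId)
{
    for (int i = 0; i < fieldNames.Length; i++)
    {
        if (dic.TryGetValue(fieldNames[i], out var terms) == false)
            continue;
        foreach (var term in terms)
        {
            // GetOrAdd gives us a single postings queue per term, even
            // when multiple indexing threads race on the same term.
            fields[i].GetOrAdd(term, _ => new ConcurrentQueue<long>()).Enqueue(docId);
        }
    }
}

// Index a tiny example document.
AddToIndex(new Dictionary<string, HashSet<string>>
{
    ["Body"] = new HashSet<string> { "xml", "search" },
    ["From"] = new HashSet<string> { "someone@example.com" },
}, 1);

Console.WriteLine(fields[0].ContainsKey("xml")); // True
```

The design choice here is that writes are lock-free at the term level: `ConcurrentQueue` absorbs concurrent appends, so sixteen indexing threads never block each other on a hot term.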

For the full code for this stage, look at the following link; it's less than 150 lines of code.

Indexing the full Enron dataset now takes 1 minute and 17 seconds and consumes 2.5 GB of managed memory.

The key is that with this in place, if I want to search for documents that contain the term "xml", for example, I can do it quite cheaply. Here is how I can "search" over half a million documents to get all those that contain that term:
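The original snippet did not survive extraction, so here is a minimal sketch of what such a lookup could look like, assuming the `fields` array above with index 0 holding the Body terms (the stand-in data below is mine, not the article's):

```csharp
using System;
using System.Collections.Concurrent;

// Stand-in for fields[0] (the Body field) from the index above,
// populated with two fake document ids for the term "xml".
var body = new ConcurrentDictionary<string, ConcurrentQueue<long>>();
body.GetOrAdd("xml", _ => new ConcurrentQueue<long>()).Enqueue(17);
body.GetOrAdd("xml", _ => new ConcurrentQueue<long>()).Enqueue(42);

// The "search" itself is a single dictionary lookup: the posting list
// already holds every matching document id, no scanning required.
if (body.TryGetValue("xml", out var postings))
{
    foreach (var docId in postings)
        Console.WriteLine(docId); // 17, then 42
}
```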

As you can imagine, this is actually quite fast.

That is enough for now; next, I want to start actually exploring persistence options.

The final code bits are here. I ended up implementing stop words as well, and the whole thing is a really cool way to show off how you can do full-text search in under 200 lines of code.
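The article doesn't show the stop-word code itself, so here is a small sketch of how such filtering could slot into the term pipeline; the word list below is purely illustrative, not the author's actual list:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative stop-word list; the list in the final code may differ.
var stopWords = new HashSet<string> { "the", "and", "for", "that", "with" };

// Drop very common terms before they ever reach the inverted index,
// which trims the largest (and least useful) posting lists.
IEnumerable<string> RemoveStopWords(IEnumerable<string> terms) =>
    terms.Where(t => stopWords.Contains(t) == false);

var filtered = RemoveStopWords(new[] { "the", "enron", "and", "emails" }).ToList();
Console.WriteLine(string.Join(",", filtered)); // enron,emails
```

Filtering at tokenization time (rather than at query time) keeps the index smaller, at the cost of never being able to search for a stop word.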

Further Reading

Introducing Full-Text Search Capability via the Query Interface

Building a Full-Text Search Test Framework


Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.
