{{announcement.body}}
{{announcement.title}}

Searching Through Text: Part 1, Full-Text Search in Under 200 Lines of Code

DZone 's Guide to

Searching Through Text: Part 1, Full-Text Search in Under 200 Lines of Code

In this article, take a look at how to do full-text search in under 200 lines of code.

· Database Zone ·
Free Resource

Reading glasses on open book

Full-Text Search

Full-text search is a really interesting topic that I have been dipping my toes into again and again over the years. It is a rich area of research, and there has been quite a few papers, books, and articles about the topic. I read a bunch of projects for doing full-text search, and I have been using Lucene for a while.

I thought that I would write some code to play with full text search and see where that takes me. This is a side project, and I hope it will be an interesting one. The first thing that I need to do is to define the scope of work:

  • Be able to (eventually) do full text search queries
  • Compare and contrast different persistence strategies for this
  • Be able to work with multiple fields
You might also be interested in:  Postgres Full-Text Search

What I don't care about: Analysis process, actually implementing complex queries (I do want to have the foundation for them), etc.

Given that I want to work with real data, I went and got the Enron dataset. That is over 517,000 emails from Enron totaling more than 2.2 GB. This is one of the more commonly used test datasets for full-text search, so that is helpful. The first thing that we need to do is to get the data into a shape that we can do something about it.

Enron is basically a set of MIME-encoded files, so I've used MimeKit to speed the parsing process. Here is the code of the algorithm I'm using for getting the relevant data for the system. Here are the relevant bits:

static void Analyze(string dir)
{
    var spearators = new char[] { ' ', '\t', ',', '!', '\r', '(', ')', '?', '-', '"', '\n', '/' };
    var trim = new char[] { '.', };

    var blockingCollection = new BlockingCollection<string>(2048);
    var tasks = new List<Task>();
    var dics = new ConcurrentQueue<Dictionary<string, HashSet<string>>>();
    for (int i = 0; i < 16; i++)
    {
        var task = Task.Run(() =>
        {
            while (blockingCollection.IsCompleted == false)
            {
                using var stream = File.OpenRead(blockingCollection.Take());

                var parser = new MimeParser(stream, MimeFormat.Entity);
                while (parser.IsEndOfStream == false)
                {
                    var entity = parser.ParseMessage();

                    var dic = new Dictionary<string, HashSet<string>>
                    {
                        ["Id"] = new HashSet<string> { entity.MessageId.ToLower() },
                        ["Date"] = new HashSet<string> { entity.Date.ToString("r") },
                        ["From"] = entity.From.Select(x => x.ToString().ToLower()).ToHashSet(),
                        ["To"] = entity.To.Select(x => x.ToString().ToLower()).ToHashSet(),
                        ["Body"] = entity.GetTextBody(TextFormat.Plain)
                            .Split(spearators, StringSplitOptions.RemoveEmptyEntries)
                            .Select(x => x.Trim(trim).ToLower())
                            .Where(x =>
                            {
                                if (x.Length > 3)
                                    return true;
                                if (x.Length == 0)
                                    return false;
                                return char.IsDigit(x[0]);
                            })
                            .ToHashSet()
                    };

                    dics.Enqueue(dic);
                }
            }
        });
        tasks.Add(task);
    }

    var so = Stopwatch.StartNew();

    tasks.Add(Task.Run(() =>
    {
        foreach (var file in Directory.EnumerateFiles(dir, "*", SearchOption.AllDirectories))
        {
            blockingCollection.Add(file);
        }
        blockingCollection.CompleteAdding();
    }));

    var final = Task.WhenAll(tasks.ToArray());

    // do stuff with it.

}


As you can see, this is hardly a sophisticated approach. We are spawning a bunch of threads, processing all half a million emails in parallel, selecting a few key fields, and doing some very basic text processing.

The idea is that we want to get to the point where we have enough information to do full-text search, but without going through the real pipeline that this would take.

Here is an example of the output of one of those dictionaries:

{
  "Id": [
    "18739556.1075840017471.javamail.evans@thyme"
  ],
  "Date": [
    "Sat, 03 Nov 2001 01:24:54 GMT"
  ],
  "From": [
    "tim.belden@enron.com"
  ],
  "To": [
    "mike.swerzbin@enron.com","matt.motley@enron.com","robert.badeer@enron.com",
    "sean.crandall@enron.com","diana.scholtes@enron.com","chris.mallory@enron.com",
    "phillip.platter@enron.com","tom.alonso@enron.com","mark.fischer@enron.com",
    "holden.salisbury@enron.com"
  ],
  "Body": [
    "alan","original","02","11:59","12:16","13th","14th;","20","2001","20th","7","7th","9","9th","adress",
    "advancing","agenda","alan","alvarez","anything","appears","asked","available","away","back-up","because",
    "before","belden","between","caps","comments","commission","commission's","comnes","decisions","definitive",
    "doesn't","educated","ferc","following","friday","from:","going","guess","hear","held","here's","highly",
    "however","into","issue","issues","know","last","like:","likely","look","looks","meantime","meeting","meetings",
    "meets","message","more","next","november","order","orders","ours","pending","posts","preparing",
    "pressure","price","ray;","right","rule","said","scheduled","sean","sent:","significant","since","soon","still",
    "subject:","submittal","surprise","that","there","therefore","they","this","tim;","time","timing",
    "unlikely","until","what","when","will","winter","wish","with","wood","would","written"
  ]
}


As you can see, this is bare bones (I forgot to index the Subject, for example), but on my laptop (8 cores Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz) with 16 GB of RAM, we can index this amount of data in under a minute and a half.

So far, so good, but this doesn't actually get us anywhere, we need to construct an inverted index, so we can ask questions about the data and be able to find stuff out. We are already about half way there, which is encouraging. Let's see how far we can stretch the "simplest thing that could possibly work"... shall we?

Here is the key data structures:

var docs = new ConcurrentDictionary<long, DocRef>();
long index = 0;

var fields = new ConcurrentDictionary<string, ConcurrentQueue<long>>[]
{
    new ConcurrentDictionary<string, ConcurrentQueue<long>>(), // Body
    new ConcurrentDictionary<string, ConcurrentQueue<long>>(), // Subject
    new ConcurrentDictionary<string, ConcurrentQueue<long>>(), // Date
    new ConcurrentDictionary<string, ConcurrentQueue<long>>(), // From
    new ConcurrentDictionary<string, ConcurrentQueue<long>>(), // To 
};


Basically, we have an array of fields, each of which holds a dictionary from each of the terms and a list of documents for the terms.

For the full code for this stage, look at the following link, it's less than 150 lines of code.

Indexing the full Enron dataset now takes 1 minute, 17 seconds, and takes 2.5 GB in managed memory.

The key is that with this in place, if I want to search for documents that contains the term: "XML", for example, I can do this quite cheaply. Here is how I can "search" over half a million documents to get all those that have the term HTML in them:

As you can imagine, this is actually quite fast.

That is enough for now, I want to start actually exploring persistence options now.

The final code bits are here. I ended up implementing stop words as well, so this is a really cool way to show off how you can do full-text search in under 200 lines of code.

Further Reading

Introducing Full-Text Search Capability via the Query Interface

Building a Full-Text Search Test Framework

Topics:
database ,full-text search ,lucene ,enron dataset ,mimekit ,tutorial

Published at DZone with permission of Oren Eini , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}