Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Why You Should Use FlashText Instead of RegEx for Data Analysis

DZone's Guide to

Why You Should Use FlashText Instead of RegEx for Data Analysis

RedEx is pretty useful for going through a lot of data quickly and performing search and replace. But is FlashText better? We take a look in this article.

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

If you have done any text/data analysis, you might already be familiar with Regular Expressions (RegEx). RegEx evolved as a necessary tool for text editing. If you are still using RegEx to deal with text processing, then you may have some problems to deal with. Why? When it comes to large-sized texts, the low efficiency of RegEx can make data analysis unacceptably slow.

In this article, we will discuss how you can use FlashText, a Python library that is 100 times faster than RegEx to perform data analysis.

RegEx vs. FlashText

Before you proceed with your analysis, you need to clean your source data, even for the simplest text. This often includes searching and replacing keywords. For example, search the corpus for the keyword "Python," or replace all instances of "python" with "Python."

RegEx is an ideal tool if you have to search and replace a few hundred keywords. However, many of these tasks involve Natural Language Processing (NLP). You might come across tens of thousands of such operations. RegEx would require several days to complete such a requirement.

Of course, you may think that you can solve the problem by parallelizing your processes; however, in practice, this solution does not make too much of a difference.

Is there any other way to attack this problem?

The creator of FlashText also faced the same problem back then. After some research that did not return any results, he decided to write a new algorithm.

Before understanding the algorithm behind it, let us take a look at the comparison chart showing the speed of FlashText in a search vs. Regular Expressions in a search.

1

You can observe from the above chart that RegEx’s processing time increases on a nearly linear basis as the number of keywords increases. However, on the other hand, the increase in keywords does not affect FlashText.

Moving on, let us look at another chart on keyword replacement.

2

Likewise, when the number of keywords increases, the processing time of FlashText almost remains the same. So, what is FlashText? I have explained it in the following section.

The Smarter and Faster Way of Data Cleansing - FlashText

As the name suggests, FlashText is one of the fastest ways to execute search and replace keywords. It is an open source Python library on GitHub.

When using FlashText, begin by providing a list of keywords. FlashText uses this list to build an internal Trie dictionary. You then send it a string of text depending on whether you want to search or replace.

For performing replacements, it will create a new string with the replacement keyword. To carry out a search, it will return a list of keywords in the string. These tasks will only iterate through your string once.

Why Is FlashText So Fast?

To truly understand the reason behind FlashText’s speed, let us consider an example. Take a sentence that is comprised of three words: "I like Python." Assume that you have a corpus of four words {Python, Java, J2EE, and Ruby}.

If for every word in the corpus, you select it out and see if it appears in the sentence, you need to iterate the string four times.

3

For n words in the corpus, we need n iterations. And each step (is it in the sentence?) will take its own time. This is the logic behind RegEx matching.

There is also an alternative method that contradicts the first method. That is, for each word in the sentence, we can see if it exists in the corpus.

4

For m words in the sentence, you have m cycles. In the situation, the time spent only depends on the number of words in the sentence. You can quickly perform this step (is it in the corpus?) using a dictionary.

The FlashText algorithm uses the second method. Moreover, the Aho-Corasick algorithm and the Trie data structure inspired this algorithm.

How Does FlashText Work?

First, it creates a Trie data structure from the corpus. Like the following graph:

5

Start and EOT (End of Term) indicate the word boundaries. It can be space, comma, or line return. It can match keywords when it has boundaries on both sides. This way it can prevent cases such as 'apple' matching for 'pineapple.'

Use the string "I like Python," and search for it character by character.

6

7

As this algorithm searches character by character, when searching for "I", you can easily jump over some values because "l" (as in like) is not immediately after it. This mechanism can let you skip all the non-existent words.

The FlashText algorithm only inspects every character in the string "I like Python." Even if your dictionary contains one million keywords, it does not really affect its operation.

When Do You Need to Use FlashText?

I would suggest you use FlashText whenever the number of keywords is greater than 500.

8

In terms of searching, if the number of keywords is greater than 500, FlashText will perform better than RegEx.

Additionally, RegEx can search special characters such as "^, $, *, d" but FlashText does not support them.

You cannot match partial words (for example "worddvec"), but it can match full words too ("word2vec").

Take a look at the basic usage of FlashText. Give it a try. You will observe that it is much faster than RegEx.

Below is some Python code which will help you use FlashText.

Code: Search for keywords using FlashText.

9

Code: Search for keywords using FlashText.

10

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

Topics:
python ,data analysis ,big data ,regex ,flashtext

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}