Speeding Up Pattern Searches With the Boyer-Moore Algorithm From C++17
Take a look at using C++17 for pattern searches using the Boyer-Moore algorithm, giving you more control and an even better performance boost.
Join the DZone community and get the full member experience.
Join For FreeWith C++17, you can now use more sophisticated algorithms for pattern searches! Now, you'll have more control and a promising performance boost for many use cases.
See what the options are.
The article comes from a blog - bfilipek.com - weekly articles about modern C++.
Intro
The naive approach of finding a pattern in a string is O(nm)
(where n
is the length of the whole string, m
is the length of the pattern). There are much better alternatives. For instance Boyer-Moore with the linear complexity.
The algorithm is, for example, used in grep — see this reference — why GNU grep is fast.
I'm not an expert in describing algorithms, so here's an excellent introduction of Boyer-Moore:
C++17 updated std::search
algorithm in two ways:
- You can now use execution policy to run the default version of the algorithm but in a parallel way.
- You can provide a Searcher object that handles the search.
For now, we have three searchers:
default_searcher
boyer_moore_searcher
boyer_moore_horspool_searcher
Preprocessing
Both of the algorithms Boyer Moore and Boyer Moore Horspool use some knowledge about the pattern string so that they can skip fruitless comparisons. In order to be "smarter," each algorithm does a preprocessing that analyses the input pattern. The complexity of the preprocessing usually depends on the size of the alphabet of the string.
Horspool is a simplified version of Boyer-Moore (with only bad character rule) and uses smaller internal tables. The average complexity is linear, but the worst case might be O(mn)
.
In Boost
You might be familiar with the searching algorithms if you use boost libraries. In the version 1.50 (2012, June) there was a new set of algorithms added: see boost Version 1.50.0.
In the library there are three searchers objects:
How to Use
C++17 provides a new overload for std::search
:
template<class ForwardIterator, class Searcher>
ForwardIterator search( ForwardIterator first, ForwardIterator last,
const Searcher& searcher );
Each searcher usually takes two input iterators — beginning and end of a pattern, and then a binary predicate—- usually it's an equality operator. They might also use other parameters, for example, a hashing function.
All in all, you can use it like this:
std::string testString = "Hello Super World";
std::string needle = "Super";
auto it = search(testString.begin(), testString.end(),
boyer_moore_searcher(needle.begin(), needle.end()));
if (it == testString.end())
cout << "The string " << needle << " not found\n";
Some Basic Tests
I wrote some basic tests that show some nice performance boost for the new algorithms in MSVC.
Source code: github.com/fenbf/articles/cpp17/searchers/searchers.cpp
How the test works:
- The app loads a file, like a book sample - 500kb of text.
- The whole file content is stored in one input string.
- A pattern is selected — N last letters of the input string.
- The app uses several algorithms and runs each search
ITER
times.
For example:
The std::string::find
version:
RunAndMeasure("string::find", [&]() {
for (size_t i = 0; i < ITERS; ++i)
{
std::size_t found = testString.find(needle);
if (found == std::string::npos)
std::cout << "The string " << needle << " not found\n";
}
});
The boyer_moore_horspool
version:
RunAndMeasure("boyer_moore_horspool_searcher", [&]() {
for (size_t i = 0; i < ITERS; ++i)
{
auto it = std::search(testString.begin(), testString.end(),
std::boyer_moore_horspool_searcher(
needle.begin(), needle.end()));
if (it == testString.end())
std::cout << "The string " << needle << " not found\n";
}
});
Here are the results (i7 4720HQ, Win 10, MSVC 2017 15.8, Release 64 bit)
The pattern is composed of 10000 letters from the end of the input text.
.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 10000
string length: 547412
test iterations: 1000
pattern length: 10000
string::find: 693.449 ms
default searcher: 1102.25 ms
boyer_moore_searcher: 133.558 ms
boyer_moore_horspool_searcher: 37.0234 ms
The pattern is now the last 100 letters of the input text:
.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 200
string length: 547412
test iterations: 1000
pattern length: 200
string::find: 158.612 ms
default searcher: 467.518 ms
boyer_moore_searcher: 58.8752 ms
boyer_moore_horspool_searcher: 56.7017 ms
The sample results need some more investigation. For example for short patterns, the string::find
version is usually faster. Also, horspool
algorithm was faster than the boyer_moore
option in most cases.
More performance tests are worth a separate blog post, with much better result analysis. So I'll leave this for your experiments now.
Using Other Containers
The important fact about std::search
is that it's a generic algorithm! So you can use it not only for strings!
Here's a sample code for searching a pattern of numbers in a vector of integers.
std::vector<int> testVector(1000000);
std::iota(testVector.begin(), testVector.end(), 0);
std::vector vecNeedle(testVector.end() - 1000, testVector.end());
auto it = std::search(testVector.begin(), testVector.end(),
std::boyer_moore_horspool_searcher(
vecNeedle.begin(), vecNeedle.end()));
if (it == testVector.end())
std::cout << "The pattern " << needle << " not found\n";
Summary
The article just briefly shows new capabilities that you get in C++17. It's important to know that the new algorithms will not always be faster than std::string::find
(for strings).
Have you used new string searchers? What are your use cases?
I showed some results from MSVC, can you run the code on GCC or Clang?
More from the Author:
Bartek recently published a book - "C++17 In Detail"- rather than reading the papers and C++ specification drafts, you can use this book to learn the new Standard in an efficient and practical way.
Published at DZone with permission of Bartłomiej Filipek, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments