DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
  1. DZone
  2. Data Engineering
  3. AI/ML
  4. Speeding Up Pattern Searches With the Boyer-Moore Algorithm From C++17

Speeding Up Pattern Searches With the Boyer-Moore Algorithm From C++17

Take a look at using C++17 for pattern searches using the Boyer-Moore algorithm, giving you more control and an even better performance boost.

Bartłomiej Filipek user avatar by
Bartłomiej Filipek
·
Aug. 23, 18 · Tutorial
Like (2)
Save
Tweet
Share
5.21K Views

Join the DZone community and get the full member experience.

Join For Free

With C++17, you can now use more sophisticated algorithms for pattern searches! Now, you'll have more control and a promising performance boost for many use cases.

See what the options are.

The article comes from a blog - bfilipek.com - weekly articles about modern C++.

Intro

The naive approach of finding a pattern in a string is O(nm) (where n is the length of the whole string, m is the length of the pattern). There are much better alternatives. For instance Boyer-Moore with the linear complexity.

The algorithm is, for example, used in grep — see this reference — why GNU grep is fast.

I'm not an expert in describing algorithms, so here's an excellent introduction of Boyer-Moore:


C++17 updated std::search algorithm in two ways:

  • You can now use execution policy to run the default version of the algorithm but in a parallel way.
  • You can provide a Searcher object that handles the search.

For now, we have three searchers:

  • default_searcher
  • boyer_moore_searcher
  • boyer_moore_horspool_searcher

Preprocessing

Both of the algorithms Boyer Moore and Boyer Moore Horspool use some knowledge about the pattern string so that they can skip fruitless comparisons. In order to be "smarter," each algorithm does a preprocessing that analyses the input pattern. The complexity of the preprocessing usually depends on the size of the alphabet of the string.

Horspool is a simplified version of Boyer-Moore (with only bad character rule) and uses smaller internal tables. The average complexity is linear, but the worst case might be O(mn).

In Boost

You might be familiar with the searching algorithms if you use boost libraries. In the version 1.50 (2012, June) there was a new set of algorithms added: see boost Version 1.50.0.

In the library there are three searchers objects:

  • Boyer-Moore Search
  • Boyer-Moore-Horspool Search
  • Knuth-Morris-Pratt Search

How to Use

C++17 provides a new overload for std::search:

template<class ForwardIterator, class Searcher>
ForwardIterator search( ForwardIterator first, ForwardIterator last,
                        const Searcher& searcher );

Each searcher usually takes two input iterators — beginning and end of a pattern, and then a binary predicate—- usually it's an equality operator. They might also use other parameters, for example, a hashing function.

All in all, you can use it like this:

std::string testString = "Hello Super World";
std::string needle = "Super";
auto it = search(testString.begin(), testString.end(),
                    boyer_moore_searcher(needle.begin(), needle.end()));

if (it == testString.end())
    cout << "The string " << needle << " not found\n";


Some Basic Tests

I wrote some basic tests that show some nice performance boost for the new algorithms in MSVC.

Source code: github.com/fenbf/articles/cpp17/searchers/searchers.cpp

How the test works:

  • The app loads a file, like a book sample - 500kb of text.
  • The whole file content is stored in one input string.
  • A pattern is selected — N last letters of the input string.
  • The app uses several algorithms and runs each search ITER times.

For example:

The std::string::find version:

RunAndMeasure("string::find", [&]() {
    for (size_t i = 0; i < ITERS; ++i)
    {
        std::size_t found = testString.find(needle);
        if (found == std::string::npos)
            std::cout << "The string " << needle << " not found\n";
    }
});

The boyer_moore_horspool version:

RunAndMeasure("boyer_moore_horspool_searcher", [&]() {
    for (size_t i = 0; i < ITERS; ++i)
    {
        auto it = std::search(testString.begin(), testString.end(),
            std::boyer_moore_horspool_searcher(
                needle.begin(), needle.end()));
        if (it == testString.end())
            std::cout << "The string " << needle << " not found\n";
    }
});

Here are the results (i7 4720HQ, Win 10, MSVC 2017 15.8, Release 64 bit)

The pattern is composed of 10000 letters from the end of the input text.

.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 10000
string length: 547412
test iterations: 1000
pattern length: 10000
string::find: 693.449 ms
default searcher: 1102.25 ms
boyer_moore_searcher: 133.558 ms
boyer_moore_horspool_searcher: 37.0234 ms

The pattern is now the last 100 letters of the input text:

.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 200
string length: 547412
test iterations: 1000
pattern length: 200
string::find: 158.612 ms
default searcher: 467.518 ms
boyer_moore_searcher: 58.8752 ms
boyer_moore_horspool_searcher: 56.7017 ms

The sample results need some more investigation. For example for short patterns, the string::find version is usually faster. Also, horspool algorithm was faster than the boyer_moore option in most cases.

More performance tests are worth a separate blog post, with much better result analysis. So I'll leave this for your experiments now.

Using Other Containers

The important fact about std::search is that it's a generic algorithm! So you can use it not only for strings!

Here's a sample code for searching a pattern of numbers in a vector of integers.

std::vector<int> testVector(1000000);
std::iota(testVector.begin(), testVector.end(), 0);
std::vector vecNeedle(testVector.end() - 1000, testVector.end());

auto it = std::search(testVector.begin(), testVector.end(),
        std::boyer_moore_horspool_searcher(
                vecNeedle.begin(), vecNeedle.end()));

if (it == testVector.end())
        std::cout << "The pattern " << needle << " not found\n";


Summary

The article just briefly shows new capabilities that you get in C++17. It's important to know that the new algorithms will not always be faster than std::string::find (for strings).

Have you used new string searchers? What are your use cases?

I showed some results from MSVC, can you run the code on GCC or Clang?

More from the Author:

Bartek recently published a book - "C++17 In Detail"- rather than reading the papers and C++ specification drafts, you can use this book to learn the new Standard in an efficient and practical way.

Algorithm c++

Published at DZone with permission of Bartłomiej Filipek, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Cloud Native London Meetup: 3 Pitfalls Everyone Should Avoid With Cloud Data
  • Handling Automatic ID Generation in PostgreSQL With Node.js and Sequelize
  • A Beginner's Guide to Back-End Development
  • Top 10 Secure Coding Practices Every Developer Should Know

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: