
Parallel STL and Filesystem: Files Word Count Example


Take a look at another example of parallel algorithms and the standard filesystem to find the word count of all text files in a directory.


Last week you might have read about a few examples of parallel algorithms. Today, I have one more application that combines the ideas from the previous post.

We'll use parallel algorithms and the standard filesystem to count words in all text files in a given directory.

The Case

In my previous post, there were two examples: one that iterates over a directory and counts file sizes, and another that counts words in a string. What would happen if we joined those two samples?

We can also play with execution policies and test if std::execution::par gives a performance advantage over the sequential version.

The General Idea

The application does the following:

  • Gets the input parameters from the command line: the directory, a parallel flag (1 or 0), and an optional third argument (print sizes).
  • Finds all TXT files in the given directory (recursively).
  • Works on the selected files and counts the words in each file.
  • Presents the sum of all words at the end and, optionally (if the third command-line argument is passed), the list of paths with their corresponding word counts.
  • Uses the parallel argument to decide whether the app runs with the sequential or the parallel execution policy.
  • Prints some timings for the steps.

The pseudo code:
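The original snippet didn't survive extraction; the flow described above might be sketched like this (names are illustrative, not the author's exact code):

```
params  = ReadInputFromCommandLine();
paths   = FindFiles(params);           // gather all TXT files (may run in parallel)
results = CountWords(paths, params);   // count words per file (may run in parallel)
Present(results);                      // total word count + optional per-file listing
```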

Please notice that while each step might use parallelism to execute its internal tasks, there are "sync points" between the major steps. In my initial implementation, FindFiles must finish before CountWords can start. Such an approach might not be the best, but it was easier to start with.

Gathering All Text Files

The sequential version is relatively simple:

The above code iterates over the directory and appends a path to the output once it verifies that the entry is a text file.

For the parallel version, I had one obstacle:

In MSVC (VS 2017 15.7.4), std::copy_if has no parallel implementation for such a directory iterator (copy_if requires random access iterators), so I had to write my custom version.

I'm using a two-step approach: first I collect all the paths, and then I filter out the entries that are not TXT files.

The code uses a mutex when it pushes one more element into the output vector. This is probably not the best approach from a performance perspective.

Counting Words

When we have all the paths, we can iterate over them and count the words in each file.

To hold the results, I'm using a separate vector: std::vector<FileAndWordCount> filesWithWordCount.

The core code:
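The core listing didn't survive extraction; here is a hedged reconstruction. FileAndWordCount, GetFileContents, and CountWords follow the description in the text, but their exact shapes are my assumptions; the policy is again a template parameter so seq and par can be compared:

```cpp
#include <algorithm>
#include <execution>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

struct FileAndWordCount
{
    fs::path    path;
    std::size_t wordCount = 0;
};

// Read the whole file into one string (fine for small files).
std::string GetFileContents(const fs::path& p)
{
    std::ifstream file(p);
    std::stringstream buffer;
    buffer << file.rdbuf();
    return buffer.str();
}

// Count whitespace-separated words in a string.
std::size_t CountWords(const std::string& text)
{
    std::istringstream stream(text);
    return std::distance(std::istream_iterator<std::string>(stream),
                         std::istream_iterator<std::string>());
}

// Each task may run in parallel: read the file, then count its words.
template <typename Policy>
void CountWordsInFiles(Policy policy, std::vector<FileAndWordCount>& files)
{
    std::for_each(policy, files.begin(), files.end(),
        [](FileAndWordCount& f) {
            f.wordCount = CountWords(GetFileContents(f.path));
        });
}
```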

Each task might be run in parallel: the code reads all the text from a file into one string and then runs CountWords on that string. It uses the same algorithm as in the last post.

Warning: this might be another spot for refactoring. Why not use std::vector<FileAndWordCount> from the beginning and avoid wasting time transforming std::vector<path> into std::vector<FileAndWordCount>?

Performance Results

While I know that the code is not written in an optimal way, I still see a performance boost compared to the sequential version.

One invocation over small files (10...15 KB each):

For 68 files (60 of which are TXT), I got 1.5 ms for the PAR version and 6.8 ms for the SEQ version!

And another test, reading 40 books from Project Gutenberg:

This time the whole directory contains around 10MB of text files.

And I got 17 ms for the PAR version and 29 ms for SEQ.

Your results might be different! I'm using a quad-core i7 laptop with an SSD.

Summary

Thanks to the ease of use of the parallel STL and filesystem, I could quickly assemble an application that performs the word-counting task efficiently. As you can see, I didn't spend much time polishing the code and the design, yet for small utilities that might be good enough. What's more, all the code comes from the STL alone, without any third-party code!

You can find all the code in my repo:

github/fenbf/ParSTLTests

And the file with this example is:

I'm curious: what are your ideas for the same use case? How would you improve the code?

There are several points where we could improve the code:

  • Find an optimal way of counting words in a file: load its contents at once as a string (not suitable for larger files) or read it a chunk at a time.
  • Instead of collecting the paths, filtering them, and then starting the whole process, I could work on those files in parallel (without any sync point).
  • Compare it with an OS-specific version, like using WinAPI to read the files and distribute the tasks.
  • Add error handling.

I'm happy to see your ideas and modifications!

