Moving from a sequential to a parallel implementation of an algorithm usually means that something changes, and sometimes it means a completely different approach.
We talk about algorithms in terms of their “order” of growth (typically written in Big O notation). This describes how the running time behaves as the input size changes, without worrying too much about how long each individual step actually takes.
For example, if we have data stored in an unsorted list structure, and we need to find out if a particular value is in the list, we must check each item in the list until we find the item or reach the end. In Big O notation we call this O(n), indicating that as the length n of the list grows, we should expect the time it takes to search it to increase in a linear way.
Note that we don’t care how long it takes to step through and look at each element, and we don’t care that an early match is very fast. We only care about the general relationship between the size of the list and the time it takes to run. In this case, if the list gets twice as long, the average run time will get about twice as long.
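To make the linear search concrete, here is a minimal sketch in Go (the function name `contains` is my own, not from any particular library):

```go
package main

import "fmt"

// contains reports whether target appears in the list,
// checking each element in turn: O(n) time.
func contains(list []int, target int) bool {
	for _, v := range list {
		if v == target {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(contains([]int{4, 8, 15, 16}, 15)) // true
	fmt.Println(contains([]int{4, 8, 15, 16}, 42)) // false
}
```

An early match returns quickly, but on average (and in the worst case) the run time grows with the length of the list.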
Similarly, if we had an unsorted list, and we were searching for duplicated elements, we would call this O(n^2), because we are going to have to do n searches through the list, each of which we already said is O(n). Regular math works here: n × O(n) = O(n^2). Again, the details don’t matter; we just care that if the list gets three times as long, average run time will be about nine times as long.
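The O(n^2) duplicate search can be sketched as a pair of nested loops. This is my own illustration, not code from any library:

```go
package main

import "fmt"

// duplicates returns the values that appear more than once.
// For each element we scan the earlier part of the list,
// so the total number of comparisons grows as O(n^2).
func duplicates(list []int) []int {
	var dups []int
	for i := 0; i < len(list); i++ {
		for j := 0; j < i; j++ {
			if list[i] == list[j] {
				dups = append(dups, list[i])
				break
			}
		}
	}
	return dups
}

func main() {
	fmt.Println(duplicates([]int{1, 2, 1, 3, 2})) // [1 2]
}
```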
Work and Depth
When we move from sequential to parallel, we still think about Big O, but also about doing multiple things at the same time. For example, in searching an unordered list, while we have to step through the whole list, every single comparison is independent of every other, so if we had that many processors we could do them all at once.
As a result, instead of having a single Big O value, we use the terms "work" and "depth." Work is what we saw earlier: how the total run time grows as the input size grows. Depth also uses Big O notation, but it uses it to express the longest chain of steps that must run one after another, which tells us how easy the algorithm is to run in parallel.
We use the term "depth" because we are thinking in terms of "divide and conquer." We expect to have a recursive function that hands off smaller and smaller pieces of the problem to new versions of itself. The flatter (shallower) the recursion, the better, because it means we can spread out across multiple processes more easily. In our search of an unordered list, the depth is O(1), or "constant time." No matter how many extra items there are in the list, we can, in theory, break it up into that number of pieces.
In our unsorted list duplicate search, we compare each item in the list with every other item. This is no problem, for we just create n^2 separate tasks, each with a different "left" and "right" index for the comparison, and we can do all the comparisons in one step. So the depth is still O(1).
At this point, alarm bells will be ringing about how feasible this is, but I'm not quite ready to let the real world intrude yet.
Putting work and depth together, we can define "available parallelism" (where bigger is better):
Available Parallelism = Work / Depth
With our search through an unsorted list, the work was O(n) and the depth was O(1), giving an available parallelism of O(n). This means that as the size of the input increases, the amount of work increases linearly, but our ability to do it in parallel also increases linearly. So as long as we can keep adding processors, the problem will take about the same amount of time (ignoring for a moment the overhead of splitting the work).
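Here is a sketch of what exploiting that parallelism might look like in Go, using one goroutine per chunk rather than one per element (the chunking scheme and function name are my own choices, not from the article):

```go
package main

import (
	"fmt"
	"sync"
)

// parallelContains splits the list into roughly equal chunks and
// searches each chunk in its own goroutine. Every comparison is
// independent, so in theory we could use one goroutine per element.
func parallelContains(list []int, target int, chunks int) bool {
	found := make(chan bool, chunks) // buffered so no goroutine blocks
	var wg sync.WaitGroup
	size := (len(list) + chunks - 1) / chunks
	for start := 0; start < len(list); start += size {
		end := start + size
		if end > len(list) {
			end = len(list)
		}
		wg.Add(1)
		go func(part []int) {
			defer wg.Done()
			for _, v := range part {
				if v == target {
					found <- true
					return
				}
			}
		}(list[start:end])
	}
	wg.Wait()
	close(found)
	// A receive on the closed channel yields true if any chunk
	// reported a match, or the zero value (false) otherwise.
	return <-found
}

func main() {
	fmt.Println(parallelContains([]int{4, 8, 15, 16, 23, 42}, 23, 3)) // true
}
```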
In a marginally more realistic example, let's say that instead of just identifying duplicates, we wanted to count the number of duplicates for each duplicate we find. Now, instead of just comparing each item in the list to every other item, we also need to keep track of how many matches we've found. So we can't split up the comparisons completely. Let's take a simple approach. We will split up the "left" side of the comparison, then just iterate over the list. This way we count the number of matches in parallel for each item in the list. Of course, this is a very poor approach, because we are finding the same duplicates many times, which is a lot of wasted work.
For this example, while the work is still O(n^2), the depth is now O(n). This means our available parallelism is O(n). This is still quite good, because we still see linear speedup from adding more processors.
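That left-split approach might look like this in Go: one goroutine per "left" element, each iterating over the whole list (the names here are my own):

```go
package main

import (
	"fmt"
	"sync"
)

// duplicateCounts returns, for each position, how many times that
// element appears elsewhere in the list. Each "left" index gets its
// own goroutine; the inner scan over the list is sequential, giving
// O(n) depth. The same duplicates are counted repeatedly: wasted work.
func duplicateCounts(list []int) []int {
	counts := make([]int, len(list))
	var wg sync.WaitGroup
	for i := range list {
		wg.Add(1)
		go func(i int) {
			defer wg.Done() // each goroutine writes only counts[i]
			for j, v := range list {
				if j != i && v == list[i] {
					counts[i]++
				}
			}
		}(i)
	}
	wg.Wait()
	return counts
}

func main() {
	fmt.Println(duplicateCounts([]int{1, 2, 1, 1, 2})) // [2 1 2 2 1]
}
```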
Of course, it would be nice to avoid that wasted work. Those experienced with map and reduce may have noticed that a map can emit a value for each item, then a reducer can add them up. In fact, this is Hadoop’s WordCount example. The work in this case is O(n), and if the reducer is written correctly the depth is O(log n). Our available parallelism is O(n / log n), which is slightly less than linear.
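One way to picture the map/reduce version: map each item to a count of one, then merge partial counts pairwise. The tree-shaped recursion below runs sequentially, but it shows where the O(log n) depth comes from, since the two halves at each level are independent. This is my own sketch, not Hadoop's actual WordCount code:

```go
package main

import "fmt"

// merge combines two frequency maps. Reducers combining partial
// maps pairwise form a tree of depth O(log n).
func merge(a, b map[int]int) map[int]int {
	out := map[int]int{}
	for k, v := range a {
		out[k] += v
	}
	for k, v := range b {
		out[k] += v
	}
	return out
}

// count performs a tree-shaped reduction over the list.
// The "map" step is the base case: each item becomes {item: 1}.
func count(list []int) map[int]int {
	if len(list) == 1 {
		return map[int]int{list[0]: 1}
	}
	mid := len(list) / 2
	// The two recursive calls are independent and could run in parallel.
	return merge(count(list[:mid]), count(list[mid:]))
}

func main() {
	fmt.Println(count([]int{1, 2, 1, 1, 2})) // map[1:3 2:2]
}
```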
Note that while the work is much worse in the first example, because of all the wasted comparisons, it has slightly better available parallelism than the map/reduce example, because it fully preserves the independence of all the comparisons. That is not enough reason to choose it, but it does illustrate an important rule in parallel programming, which is that sometimes it is necessary to waste work in order to improve parallelism.
The Real World Will Not Stop Hassling Me
So far, making a good parallel algorithm has meant trying to increase our available parallelism, because then we can just throw more hardware at the problem to get it to run faster. Unfortunately, while that can be true, it isn't the full story.
First, servers and electricity cost money. There is some limit on buying more hardware or spawning more cloud instances. At that point, no matter what the theoretical speedup of our algorithm is, we won't see any actual advantages, because we'll just be queuing up more tasks than we have cores to run them on.
Second, Big O notation hides a lot of important differences between algorithms. There's a cost in creating a thread or even a Goroutine. In most real-world implementations, tuning means we spawn many fewer parallel tasks than the theoretical maximum. For example, Hadoop lets you carefully configure split size (amount of data given to each worker) and block size (amount of data stored separately on disk). Our duplicate search with n^2 tasks was absurd; the overhead of creating each task is going to be many times greater than the time it takes to do a single comparison of two items.
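In practice we might cap the number of tasks at the core count instead of spawning n^2 of them. A Go sketch of that tuning (the interleaved index scheme and names are mine):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// countDuplicatePairs counts pairs (i, j) with i < j and equal values.
// Rather than one task per comparison, each worker takes a stripe of
// "left" indices, so task-creation overhead is paid once per worker
// instead of once per comparison.
func countDuplicatePairs(list []int) int {
	workers := runtime.NumCPU() // a typical tuning choice
	if workers > len(list) {
		workers = len(list)
	}
	partial := make([]int, workers) // one slot per worker: no contention
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			// This worker handles left indices w, w+workers, w+2*workers, ...
			for i := w; i < len(list); i += workers {
				for j := i + 1; j < len(list); j++ {
					if list[i] == list[j] {
						partial[w]++
					}
				}
			}
		}(w)
	}
	wg.Wait()
	total := 0
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	fmt.Println(countDuplicatePairs([]int{1, 2, 1, 1, 2})) // 4
}
```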
Third, as we saw above, to get higher available parallelism we sometimes have to do extra work, not just incur extra overhead. Sometimes that extra work is justified by the speedup we get; sometimes it is not.
This is a pretty basic discussion of how parallel algorithms are analyzed and compared to each other. If you'd like to see how parallel code might work in practice, I have a GitHub repository that runs a Net Present Value simulator using Java fork/join’s RecursiveTask that might be of interest.