A Simulation Using Sequential and Parallel Collections in Scala 2.13
A Simulation Using Sequential and Parallel Collections in Scala 2.13
Satellite systems track fighter jets. In comparing data, would a single- or multi-threaded algorithm perform better?
Join the DZone community and get the full member experience.Join For Free
Software to be build is to compare data from independent satellite systems. As fighter jets traverse airspace, each system determines its identity, recording the name of the jet type or an equivalent letter or number abbreviation. For data collected over time, the simulation software must ascertain whether systems agree on identification and if so, whether contrasting the data by single or multiple processors would be more efficient.
This article begins with a proof of concept in Scala 2.12. A small program with function taking generic collection type parameters verifies the basic strategy. The actual software is written in 2.13, in which generic collection types – those inheriting the GenTraversable hierarchy – have been deprecated . The parallel package is also gone but can be brought back .
You will see in the actual software simulation system the following: a rewrite of the ‘corresponds’ function defined in the collection hierarchy; custom methods for timing both sequential and parallel algorithms; and basic statistical readouts of results.
For this article you need background in the Scala collections – at least the sequential collections.
The 2.12 program and full system code are available here: https://github.com/zironsys/sequential_parallel
The Central Tactic: Scala 2.12
As mentioned, as of version 2.13 generic collection types have been deprecated and support for multi-threaded processing over collections removed. We can, though, use these resources in version 2.12 to devise the core strategy.
This proof of concept program is to achieve three goals. The first is to fashion a data structure for holding known correct associations of jet identifiers between satellite systems. Set ‘correctSet’ does that.
The set is hardwired for associating full name and character abbreviations; actual software will be generic to allow random pairings between string, character, and numeric types.
Via transformations on correctSet, our second goal is realized: creating structures of mock-up data for comparison. Converting and unzipping the set of Tuple2 returns two arrays simulating distinct systems’ output. The ‘.par’ method yields ParArray objects whose elements may be horizontally partitioned and operated upon in parallel by threads on multiple processors.
Finally we need a collection-based function to do item-by-item comparison and that can work with either type of array. Local function ‘corresponds’ takes parameters of type GenSeq, a GenTraversable-derived trait that can take sequential or parallel sequences (or mixed). The ‘corresponds’ function it invokes is from class Array and identical in meaning to that in ParArray; this is its signature:
def corresponds[B](that: GenSeq[B])(p: (T, B) ⇒ Boolean): Boolean
Corresponds returns true when the sequences are of the same length and all elements in corresponding position relate. The results of testing with sequences built from correctSet are trivially true. Of more interest is usage of predicate parameter 'p.' Each paring (a, b) is tried in correctSet’s ‘apply’ method, which tests for set membership. This predicate will be used in the actual software in version 2.13. The software includes its own implementation of corresponds adapted to purpose.
Set correctSet thus is critically involved in all goals: maintaining correct pairings, building simulation data, and testing for correspondence. In actual software it will be known as the ‘verification set.’
Populating Simulation Rows: The Verification Set
The small satellite software and testing/simulation software are in separate packages. As the focus of this article is the former, you’ll see none of the latter save for this method contained in object TestTools from the download. I’ve included it for its importance in understanding how data is created for all simulation testing – using the verification set as was established.
The first parameter to function ‘buildTestData’ controls whether thousands or millions of elements of data will be created. The ‘TestData’ return, type defined as shown, is the array of Tuple2 associations between simulation systems being compared – the actual test data. It is parameterized to allow any system to represent jet identifications as strings, characters, or numbers.
Important parameter ‘systemPairs’ is the set of known correct associations between two given systems from which all initially correct test data is fashioned: the verification set. Its type is imported from the 'CorrespondsCheck' companion object in the satellite library – see below:
type SystemPairs[A,B] = Set[(A,B)]
Testing is done over sets of (string, character) and (integer, string) pairs; variable correctSet in Figure 1 is an example of the former and will be used in testing.
Though data made here is always correct, simulation runs may change elements to test failure cases.
Solution Space: CorrespondsCheck
The software is written in version 2.13.2. As was stated, information on adding back and using revised package ‘parallel’ is in .
Unlike other resources used in concurrent programming that emphasize correctness such as atomic variables and futures, those in the parallel package are strictly about performance. As for accurate results, in 2.13 (I assume) for a collection’s operations that can be parallelized, given it runs correctly on one processor, it must run correctly on multiple processors.
Class CorrespondsCheck in the satellite library defines a single method that executes the corresponds function over arrays and then parallel arrays and returns both run times.
Resources needed by the class are in its companion object.
Here is the companion object in its entirety:
You saw type SystemPairs used by test code in the previous section.
The time functions at bottom are from Prokopec ; I modified them for use here. All sequential and parallel executions are done through them.
By-name parameter ‘body,’ used in both, is the corresponds function to run and the ‘resultDummy’ a zero-value meant only to initialize the result of execution. As you will see type T is Long for sequential runs and Boolean for parallel.
Function ‘timed’ is invoked through ‘warmTimed,’ which executes the corresponds function ‘iterations’ number of times before returning the Tuple2 result from timed. The 200-default value is recommended for larger projects to get the JVM into steady state – i.e. without intervening JIT compilation and garbage collection that can skew the timed results. If you run the code I suggest you use a value such as 20 or less or it may take some time to complete.
The Long values in upper case at top of object are explained anon.
Here is the class in its entirety:
The class is parameterized to allow any jet identification system to use the various types – String and so on. Its ‘verificationSet’ constructor parameter is also as you expect: for verifying input.
The one public function returns an Either disjoint union, whose purpose I’ll make clear here. Let’s start with its nested function.
In Scala standard library 2.13 Array has two versions of the corresponds function, one implemented in trait SeqOps and another in trait IterableOnceOps. Aside from parameter type they are identical in meaning and available to Array via implicit conversion.
Both corresponds functions return Boolean, but we’ll need a numeric return for more complete error information. Nested function ‘sequentialCheck’, adapted from the ‘_Ops’ traits, does this.
Its code and comments should be self-explanatory; I do, though, draw your attention to this snippet:
The loop containing this comparison test is essentially an iterative form of the single line of declarative code from the local corresponds function from Figure 1; it accomplishes the same set membership test:
arrayA.corresponds(arrayB)((a, b) => correctSet((a, b)))
Outer function correspondsCheck invokes the function wrapped in the warmTimed method. If failure is found in this single-threaded invocation – the ‘result’ of type T in the Tuple2 – the multi-threaded ParArray version is not called and correspondsCheck fills out the Left class branch of Either with ARRAY_LENGTH_FAILURE or the index where the predicate in sequentialCheck first failed. By assumption, the ParArray call if reached must succeed and Left must never be populated with UNKNOWN_PARALLEL_ERROR.
On success the Right class is filled with (String, String) for the respective times of the sequential and parallel executions in milliseconds. The decimal format object in the companion class is used by the timed function for nice String readout.
Results can vary by machine. I’m testing on a computer with eight core and hyperthreading on Ubuntu 18.04. The warmTimed for the readout below is default: 200 iterations before invoking the timed method.
The first four tests use the verification set as depicted in Figure 1, and the fifth, tuples of (Int, String). Test one uses a sample size of 1,000 elements while all others, 50,000,000.
At the small size, test one shows that using multiple cores is clearly inefficient. In testing over the large size, however, gains with parallel algorithms are considerable as seen in tests two and five.
Tests three and four are the error cases for which I’ve altered the correct arrays. The third test in particular shows how the satellite system accurately identified the index of the erroneous element.
You saw how a small proof of concept program could aid in defining and realizing actual software goals. The program as written in Scala version 2.12, the last to support the GenTraversable hierarchy, demonstrated storing of correct jet identification associations, creating simulation data, and testing data for correctness in single- and multi-threaded types.
Although the testing system was not presented, I did include discussion of its function that generically creates correct simulation data of any size.
In the actual software, based on the goals and POC, all resources were contained in a single class and companion object. The latter included functions for determining execution times and warmed times. The class utilized a custom nested function that extends the corresponds function in the sequential collection Scala library to give it more capability.
Throughout you saw the central role played by the verification set.
In contrasting single and multiprocessor algorithms, we safely conclude that performance depends partly on the size of data. Still, more to be considered. There always is.
 Scala Standard Library 2.13.3, “package collection.” [Online]. Available: https://www.scala-lang.org/api/current/scala/collection/index.html. [Section ‘Deprecated Value Members’].
 Scaladex, “scala / scala-parallel-collections.” [Online]. Available: https://www.scala-lang.org/api/current/scala/collection/index.html.
 A. Prokopec, Learning Concurrent Programming in Scala. Birmingham, UK: Packt Publishing, 2014. [Chapter 5: Data-Parallel Collections]
Opinions expressed by DZone contributors are their own.