Bridge the Gap of Zip Operation
Bridge the Gap of Zip Operation
In this article, this author takes a look at the alternatives the zip operation that Java does not provide, and argues against StackOverflow's top pick.
Join the DZone community and get the full member experience.Join For Free
Despite Java not providing a zip operation, you don't need either 30 additional lines to implement it, nor a third party library. Simply compose a zipline through the existing Stream API.
Java, from its 8th version onward, provides an easy way to query sequences of elements through its Stream Interface, which provides several operations out of the box. This set of operations is quite versatile but, as can be expected, it does not cover all the operations a programmer may require. One such operation is zip, as we can observe in one of the most visited posts about Java Streams in Stackoverfow: Zipping streams using JDK8 with lambda (java.util.stream.Streams.zip). Even 7 years later, we are now on Java 14 release and there is no zip operation for Streams yet.
Nevertheless, from all workarounds provided to the absence of zip, all of them incur at least in one of the following drawbacks:
4. Dependency of a third-party library.
In fact, the most voted answer based on Java’s Stream extensibility, not only incurs in 35 lines of code to implement an auxiliary
zip() method (1. Verbosity), but it also breaks the fluent chaining of a stream pipeline (2. Non-fluent) and moreover, it is still far from being the most performant alternative (3. Underperforming).
Here we analyze other alternative approaches to the Java’s Stream extensibility (section 2.1), namely the use of a third-party library to suppress the absence of zip (such as Guava or Protonpack in section 2.2) or to entirely replace Stream by another type of Sequence (such as jOOλ, Jayield, StreamEx and Vavr in section 2.3). And, we also include in our comparison a fourth alternative denoted as zipline (in section 2.4) that combines a
map() operation with the use of an auxiliary iterator to achieve the desired zip processing.
Next, we will give further details about all these solutions and a performance analysis built with JMH support to benchmark the queries
find() (mentioned in the Stackoverflow question) that traverse two sequences through a
zip() operation provided by each alternative. Pursuing unbiased results we have also developed another benchmark, which tests more complex queries with a realistic data source (namely the Last.fm Web API).
No solution is best in all possible scenarios, so we have summarized in this article the behavior of each approach for each of the four mentioned drawbacks. Put it simply, from our results, we can observe that the most voted answer on Stackoverflow gathers less advantages than any other alternative and there are other viable choices better suited to the constraints ruling each development environment.
Zip, also known as Convolution, is a function that maps a Tuple of sequences into a sequence of tuples. As an exercise, let's imagine that zip is supported by Stream and that we have one stream with song names and another with the same song's artist's names, we could use zip to pair them up like so:
Our console output would be:
Zip is one of the operations that isn't supported out of the box by Java, but that doesn't mean we cannot implement it ourselves. Java also provides a way for us to introduce new custom operations in a stream pipeline through the implementation of a stream source.
In this article, we're going to present some alternatives on how to implement the zip operation in Java. We are going to compare their features, measuring them against the four overheads listed above, as well as compatibility with Java's Stream. After that, we'll benchmark different queries using the different alternatives proposed and see how they fare against each other.
2. Java Zip Alternatives
There are multiple ways to set up a zip operation in Java, namely:
- Java's Stream extensibility
- Interleave a third-party utility such as
- Replace Stream by other kind of Sequence such as
- Combine existing Java Stream operations -- zipline
2.1 Java's Stream Extensibility
In order to program and use custom Stream operations, Java provides the StreamSupport class, which provides methods to create streams out of a Spliterator. So, with the zip operation in mind, we implemented the solution proposed by siki over at stackoverflow.
Simply put, this solution obtains the iterators from the stream sources and combine their items into a new
Iterator that is further converted to a
Spliterator and then into a new
Stream, as denoted in the following sketch:
Thus the Iterator resulting from the combination of
other is according to the following listing:
zippedIterator, siki's proposal obtains the final
Stream from the following composition:
Going back to the song's example, here's how we would use our newly defined zip operation:
This is the approach that Java currently provides. It has the advantage of not requiring any further libraries, but on the other hand it is quite verbose as we can see by the definition of the zip method proposed by siki over at StackOverflow that incurs in almost 35 lines of code. Moreover, when we are using our new custom operation we lost the fluent pipeline that stream operations provide, nesting our streams inside the method instead. Although the resulting pipeline is formed by
forEach, the method calls are written in different order where
zip() call appears at the head.
2.2. Interleaving with Third Party Utility Classes
Rather than developing our own zip function, we may use a third party utility library, making use of their operations interleaved with the Stream's method calls. This approach is provided by Protonpack with the StreamUtils class and by Guava with the Streams. Regarding the example of using our newly zip operation of previous section, we may replace the call to our own implementation of
zip with the use of
StreamUtils.zip(...) depending on whether we are using Guava's or Protonpack's utility class. This approach is much less verbose as we do not have to define the zip operation ourselves, but we do have to add a new dependency to our project if we want to use it.
2.3 Replacing Stream by Other Sequences
A Stream has the same properties as any other Sequence's pipeline. This pattern has been described before by Martin Fowler in his well-known article from 2015, "Collection Pipeline." A Sequence's pipeline is composed of a source, zero or more intermediate operations, and a terminal operation. Sources can be anything that provides access to a sequence of elements, like arrays or generator functions. Intermediate operations transform sequences into other sequences, such as filtering elements (
e.g. filter(Predicate<T>)), transforming elements through a mapping function (
map(Function<T,R>)), truncating the number of accessible elements (e.g.
limit(int)), or using one of many other stream producing operations. Terminal operations are also known as value or side-effect-producing operations that ask for elements to operate, consuming the sequence. These operations include aggregations such as
collect(), searching with operations like
anyMatch(), and iteration with
With that said there are some alternatives to Java's Stream's implementation of a Sequence, such as StreamEx, Query by Jayield, Seq by jOOλ or Stream by Vavr. Now let's take a look at how our example from "Java's Stream extensibility", behaves using Seq by jOOλ:
Stream by a third party implementation results in a similar usage to that one presented in our example, regardless the change of the sequence type, such as
Seq in the case of the jOOλ library. This approach is similar in many ways to the one presented in "2.2 Interleaving with third party utility classes", with the added benefit of keeping the fluent pipeline of operations. Here the methods calls such as
foreach() are chained through a sequence of method calls where each call acts on the result of the previous calls.
2.4 Combine existing Java Stream operations -- zipline
One approach that does not require the usage of third party utility libraries nor breaks the fluent pipeline of operations when it comes to zipping two Sequences, is simply the combination of the Stream operation map with an old fashioned iterator.
Considering that we want to merge two sequences:
other (instances of
Stream), then we can achieve our goal (denoted as the resulting zipline) with the following combination of native
Regarding our use case of zipping artists with songs, then using the zipline approach we may achieve our goal in the following listing:
Zipline shares the fluency advantage of using a third-party sequences alternative, but with the additional benefit of not requiring a further dependency. Moreover, zipline is concise and avoids all the verbosity of implementing a custom
3. Feature Comparison
In this section, we present a table that gives an overview of the different approaches regarding the drawbacks described at the beginning of this article as well as their compatibility with Java Stream's. Concerning the performance, we include four benchmark results that present the speedup relative to the performance observed with the Java's Stream extensibility (the accepted answer proposal of the original Stackoverflow question about zip).
These four results deal with foure different query pipelines, namely the
find() operations mentioned in the Stackoverflow question about zip, and also 2 more complex pipelines processing data from Last.fm Web API, one including a
distinct() and other a
filter() over a zip. For these two latter results we present the speedup in percentage.
We will explain these operations in more detail in section "4. Domain model and queries for Benchmarking". This table groups different approaches by the same color that is used in performance results charts of section "Performance Evaluation" and according to the 4 choices presented in Section 2 - Java Zip Alternatives.
|Verboseless||3r Party Free||Fluent||Stream
(* Although Seq being compatible with
Stream it does not support parallel processing.)
Comparing the approaches, we can see that the only one incurring in Verbosity is that one provided by the "2.1 Java's Stream extensibility", whereas all other approaches are Verboseless. Yet, all those alternatives depend of a Third Party library, with the exception of the Zipline idiom that is built over the existing Java Stream API.
On the other hand, replacing the
Stream type by other Sequence implementation or even using the Zipline idiom we can maintain fluent pipelines.
The most performant approaches are Query by Jayield and Zipline as we'll confirm further ahead in our performance benchmark results. Although jOOλ and StreamEx present worse performance than competition they provide a full set of query operations and they are fully compatible with the Java
Stream type. Finally, although Vavr is the most popular among these Github repositories it hasn't any advantage over the alternatives for the analyzed features and benchmarks.
4. Domain Model and Queries for Benchmarking
To test how each of these approaches would perform we decided to devise a couple of queries and benchmark them.
4.1 every and find
Every is an operation that, based on a user-defined predicate, tests if all the elements of a sequence match between corresponding positions. So for instance, if we have:
every to return true, every element of each sequence must match in the same index. The
every feature can be achieved through a pipeline of
allMatch operations, such as:
find between two sequences is an operation that, based on a user defined predicate, finds two elements that match, returning one of them in the process. So if we have:
find to return an element, two elements of each sequence must match in the same index. Here's what an implementation of
find would look like with the support of a
As we can see,
find is similar to
every in the sense that both
zip sequences using a predicate, and if we have a match on the last element of a sequence it runs through the entire sequence as
every would. With this in mind, we decided that benchmarking
find with sequences with many elements and matching it only in the last element would not add much value to this analysis. So we devised a benchmark in which the match index would vary from the first index up until the last and analysed sequences with 1000 elements only. On the other hand, we have benchmarked the
every with sequences of 100 thousand elements.
4.2. Artists and Tracks
Looking for more complex pipelines with realistic data, we resort to publicly available Web APIs, namely REST Countries and Last.fm. We retrieved from the REST Countries a list of 250 countries which we then used to query the Last.fm API in order to retrieve both the top Artists as well as the top Tracks by country, leaving us with a total of 7500 entries for each.
Our domain model can be summarized into the following definition of these classes: Country, Language, Artist, and Track.
Both our queries come from the same basis. We first query all the countries, filter those that don't speak English and from there we retrieve two Sequences: one with the pairing of the Country and it's top Tracks other with the Country and it's top Artists.
From here on out our queries diverge so we'll explain separately.
Distinct Top Artist and Top Track by Country
This query consists of zipping both Sequences described above into a Trio with the Country, it's top Artist and it's Top Track, then selecting the distinct entries by Artist.
Artists Who Are in A Country's Top Ten Who Also Have Tracks in The Same Country's Top Ten.
As the previous query, this one starts by zipping both sequences, but this time, for each Country, we take the top ten Artists and top ten Track artist's names and zip them into a Trio. After, the top ten artists are filtered by their presence in the top ten Track artist's name, returning a Pair with the Country and the resulting Sequence.
For the actual implementations of these queries you can always check our Github repository. You'll find that some of these methods and properties don't exist (first(), second(), country, etc...), but to make the concept clearer and easier to understand we used these properties in this article.
5. Performance Evaluation
To avoid IO operations during benchmark execution we have previously collected all data into resource files and we load all that data into in-memory data structures on benchmark bootstrap. Thus the streams sources are performed from memory and avoid any IO.
To compare the performance of the multiple approaches described above we benchmarked the two queries explained above with jmh. We ran these benchmarks both in a local machine as well as in GitHub, with GitHub actions, which presented consistent behavior and results with the performance tests collected locally.
Our local machine presents the following specs:
-i 10 -wi 10 -f 3 -r 15 -w 20 --jvmArgs "-Xms2G -Xmx2G"
These options correspond to the following configuration:
-i 10- run 10 measurement iterations.
-wi 10- run 10 warmup iterations.
-f 3- fork each benchmark for 3 times.
-r 15- spend at least 15 seconds ate each measurement iteration.
-w 20- spend at least 20 seconds ate each warmup iteration.
--jvmArgs "-Xms2G -Xmx2G"- set the initial and the maximum Java heap size to 2 Gb.
Local Machine results
As we can observe from the results presented in these charts, the solution based on the accepted answer of Stackoverflow (labeled as "stream") is not the most performant option. Moreover, it has other drawbacks beyond performance already mentioned in previous sections.
The use of the third-party library Jayield outperforms all the other choices for both operations
find() mentioned by the original poster in StackOverflow's question about zipping Java streams. Although Jayield is also the most performant in Artists and Tracks benchmark, it presents a more subtle speedup over the competition.
Yet, if adding a dependency to an auxiliary library such as Jayield is not an option, then zipline will be the best alternative. It is simpler and more effective to chain a
map() combined with the use of an auxiliary iterator, as denoted by zipline idiom, than implementing a
Spliterator based zip, which incurs in a lot of verbosity. Furthermore, that
zip() cannot be chained fluently in a stream pipeline, whereas the zipline keeps the fluent nature of the query pipeline. Finally, zipline is is the second most performant solution in most part of the analysed benchmarks.
StreamEx is also a good option that provides a full set of operations out of the box to build a wide variety of query pipelines. It has the big advantage of being fully compatible with Java stream classes. Yet, it presents a worse performance in comparison with Jayield.
Finally, the Guava and Protonpack options present similar performance to the
Spliterator based zip. This is expected because these solutions are also based in the implementation of the
Spliterator interface and thus they incur in the same overheads. So, the only advantage of these libraries is avoiding the verbosity of implementing the
zip() from the scratch. Yet, both share the same drawbacks regarding the loss of the pipeline fluency and performance.
Java Streams are a masterpiece of software engineering, allowing querying and traversal of sequences, sequentially or in parallel. Nevertheless, it does not contain, nor could it, every single operation that will be needed for every use case. That is the case of the
In this article, we highlight other concerns regarding the choice of a workaround to suppress the absence of the
zip() operation in Java streams. We argue that the most voted and accepted answer on StackOverflow is not the best choice and incurs in 3 drawbacks: verbosity, loss of pipeline fluency and underperformance.
We also proved that the 2nd an 3rd most voted answers of Stackoverflow, based on the use of third-party libraries, such as Guava and Protonpack also incur in similar overheads and are not the best choices.
We brought to our analysis other options, namely zipline idiom and Jayield library and we also have built a zip benchmark based on a realistic data source to fairly compare all alternatives. This benchmark is available on Github and the results are publicly visible in Github Actions.
We hope that all these results and evaluated characteristics can serve as a basis to help developers make their choice to suppress the absence of the zip operation.
Opinions expressed by DZone contributors are their own.