Java 8 vs. Scala, Part II: the Streams API

A continuation of a series on using Java vs. Scala, and using the Streams API to collect and manipulate data.

Hussachai Puripunpinyo

Nov. 17, 15 · Analysis

This is part 2 of the article.

Stream vs. Collection?

Let’s start with a short and incomplete description by me :). A collection is a finite set of data, while a stream is a sequence of values that can be either finite or infinite. Yep, that’s it.

The Streams API is a new API that comes with Java 8 for manipulating collections and streaming data. Stream operations don’t mutate their source, while many Collections API methods do. For example, Collections.sort(list) sorts the list instance that you pass as an argument, whereas list.stream().sorted() returns a new stream that yields the elements in sorted order and leaves the original list unchanged. You can read more about the Streams API here.
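
To make the difference concrete, here is a minimal sketch. The class and method names (SortComparison, sortedCopy, sortInPlace) are made up for illustration; only Collections.sort, Stream.sorted, and Collectors.toList come from the JDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

final class SortComparison {
    // Streams: collects the sorted elements into a new list; the source is untouched.
    static List<Integer> sortedCopy(List<Integer> source) {
        return source.stream().sorted().collect(Collectors.toList());
    }

    // Collections: sorts the list that was passed in and returns that same instance.
    static List<Integer> sortInPlace(List<Integer> source) {
        Collections.sort(source);
        return source;
    }

    public static void main(String[] args) {
        List<Integer> numbers = new ArrayList<>(Arrays.asList(3, 1, 2));
        System.out.println(sortedCopy(numbers)); // [1, 2, 3]
        System.out.println(numbers);             // [3, 1, 2]
        sortInPlace(numbers);
        System.out.println(numbers);             // [1, 2, 3]
    }
}
```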

The following comparison between collections and streams is copied from the Java 8 documentation. I highly recommend reading the full version here.

Streams differ from collections in several ways:

1. No storage. A stream is not a data structure that stores elements; instead, it conveys elements from a source through a pipeline of computational operations.
2. Functional in nature. An operation on a stream produces a result, but does not modify its source.
3. Laziness-seeking. Many stream operations, such as filtering, mapping, or duplicate removal, can be implemented lazily, exposing opportunities for optimization.
4. Possibly unbounded. While collections have a finite size, streams need not.
5. Consumable. The elements of a stream are only visited once during the life of a stream.

Java and Scala both have a pretty easy way to compute values in a collection concurrently. In Java, you just call either parallelStream()* or stream().parallel() instead of stream() on the Collection. In Scala, you call par before calling other functions. It’s quite tempting to add parallelism and expect the program to run faster. Sadly, it often runs slower, because splitting the work and merging the results has a cost of its own. Parallelism is a feature that is very easy to use in the wrong way. Check this interesting article out: Java Parallel Streams Are Bad for Your Health!

* From the Javadoc, parallelStream() returns a possibly parallel Stream with the collection as its source. So, it may return a sequential stream. What??? (someone did some research on why this API exists)
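
As a small sketch of the Java side, the snippet below switches a stream to parallel mode with stream().parallel() and checks that the result matches the sequential one. The class and method names are hypothetical, and the example says nothing about speed: on a list this small, the parallel version is unlikely to be faster.

```java
import java.util.Arrays;
import java.util.List;

final class ParallelSketch {
    // Sums the values with an ordinary sequential stream.
    static int sequentialSum(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    // Sums the same values with a parallel stream; the result must match.
    static int parallelSum(List<Integer> values) {
        return values.stream().parallel().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(values.stream().isParallel());                 // false
        System.out.println(values.stream().parallel().isParallel());      // true
        System.out.println(sequentialSum(values) == parallelSum(values)); // true
    }
}
```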

Java’s Streams API computes elements lazily. This means that intermediate operations do not perform any processing until a terminal operation is executed. Lazy processing leaves room for optimization. For example, if we have filtering, mapping, and summing in the pipeline, these operations can be fused into a single pass over the data, which reduces the amount of intermediate state. Laziness also allows processing only the data that is necessary. In contrast, Scala collections are strict, meaning that every element is processed eagerly. …Hmmm…would that mean that in our benchmark, Java’s Streams API will have an advantage over Scala then? If we compare Java’s Streams API with Scala’s strict Collections API, the answer is yes. But you have many options in Scala. You can convert a collection to a Stream easily by calling toStream. Scala also has another concept called View, which is a non-strict collection like Stream. What is a non-strict collection anyway? It is a collection that is computed lazily, a.k.a. a lazy collection.
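
Here is a rough Java sketch of both points: nothing runs before a terminal operation, and laziness makes it possible to work with an unbounded source. The class name, method names, and the counter are illustrative assumptions, not part of any API.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class LazinessSketch {
    // Builds a pipeline without a terminal operation and reports how many
    // elements the filter has inspected so far.
    static int inspectedBeforeTerminalOp(List<Integer> source) {
        AtomicInteger inspected = new AtomicInteger();
        Stream<Integer> pipeline = source.stream()
                .filter(n -> { inspected.incrementAndGet(); return n % 2 == 0; })
                .map(n -> n * n);
        // The pipeline is only described here; no element has been processed yet.
        return inspected.get();
    }

    // Takes the first `count` even squares from an infinite stream 1, 2, 3, ...
    static List<Integer> firstEvenSquares(int count) {
        return Stream.iterate(1, n -> n + 1)
                .filter(n -> n % 2 == 0)
                .map(n -> n * n)
                .limit(count)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(inspectedBeforeTerminalOp(java.util.Arrays.asList(1, 2, 3, 4))); // 0
        System.out.println(firstEvenSquares(3)); // [4, 16, 36]
    }
}
```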

Let’s take a quick look at Scala’s Stream and View features.

Scala’s Stream

Scala’s Stream is slightly different from Java’s Stream. In Scala, you don’t have to call a terminal operation to get a result, because the Stream itself is the result. Stream is an abstract class that extends AbstractSeq and mixes in the LinearSeq and GenericTraversableTemplate traits. So, you can treat a Stream like a Seq.

If you are familiar with Java but not Scala, you can think of Seq as the List interface in Java. (Scala’s List is not an interface, but that’s a topic for a different article :) ).

What we have to know here is that the elements of a Stream are computed lazily, and because of this, a Stream can represent infinite data. Stream is expected to have the same performance as List if all elements in the collection are computed. Once computed, the values are cached. Stream has a function called force. It forces evaluation of the whole stream and then returns the result. Be careful not to call this function on an infinite stream. The same goes for other operations that force the API to process the whole stream, such as size, toList, foreach, etc. Those operations are implicit terminal operations in Scala’s Stream.

Let’s implement the Fibonacci sequence with Scala’s Stream.

def fibFrom(a: Int, b: Int): Stream[Int] = a #:: fibFrom(b, a + b)
val fib1 = fibFrom(0, 1) //0 1 1 2 3 5 8 …
val fib5 = fibFrom(0, 5) //0 5 5 10 15 …
//fib1.force //Don't do this: it tries to evaluate the infinite stream and soon ends in an OutOfMemoryError.
//fib1.size //Don't do this either, for the same reason as above.
fib1.take(10) //Do this. It takes the first 10 elements from the infinite Stream.
fib1.take(20).foreach(println(_)) //Prints the first 20 numbers

The :: notation is commonly used as a function name for prepending an element to a collection. #:: does the same for a Stream, except that the right-hand side is evaluated lazily (you have more freedom to name a function in Scala than in Java).
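
For comparison, a similar infinite Fibonacci sequence can be sketched in Java with Stream.iterate. This is only one possible way to write it; the class name, the method name, and the choice of a two-element long array to carry the state are assumptions made for this example.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class FibonacciSketch {
    // Each state is the pair {current, next}; iterate() produces pairs forever,
    // so limit() is what keeps the pipeline finite.
    static List<Long> firstFibonacci(int count) {
        return Stream.iterate(new long[] {0L, 1L},
                              pair -> new long[] {pair[1], pair[0] + pair[1]})
                .limit(count)
                .map(pair -> pair[0])
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(firstFibonacci(10)); // [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
    }
}
```

Unlike Scala’s Stream, the Java stream does not cache computed values, and it cannot be traversed a second time.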

Scala’s View

Once again, a Scala collection is strict, while a view is non-strict. A view is a collection that represents a base collection, but all of its transformers are implemented lazily. You can convert a strict collection to a view by calling the view function, and you can convert it back by calling the force function. A view does not cache the result; it applies the transformation every time you fetch an element. It is similar to a database view, where the view is a virtual collection.

Let’s create a data set that we are going to work on.

public class Pet {
    public static enum Type {
        CAT, DOG
    }
    public static enum Color {
        BLACK, WHITE, BROWN, GREEN
    }
    private String name;
    private Type type;
    private LocalDate birthdate;
    private Color color;
    private int weight;
    ...
}

Assume that we have a collection of pets and we are going to play with this collection throughout this article.

Filter

Requirement: We want to keep only the chubby pets in the collection. Any pet that weighs more than 50 lbs is considered chubby. We also want only pets that were born before January 1, 2013. The following code snippets show how to achieve this filter job in different ways.

Java Approach 1: Traditional style

//Before Java 8
List<Pet> tmpList = new ArrayList<>();
for(Pet pet: pets){
    if(pet.getBirthdate().isBefore(LocalDate.of(2013, Month.JANUARY, 1))
            && pet.getWeight() > 50){
        tmpList.add(pet);
    }
}

This is the way we usually do it in imperative languages. You create a temporary collection, iterate through every element, and store each one that satisfies the predicate(s) in the temporary collection. It’s quite verbose, but it always gets the job done and its performance is amazing too. Spoiler: the traditional approach is faster than the Streams API approach. Don’t worry about the performance, though, because more elegant code outweighs a slight gain in performance.

Java Approach 2: Streams API

//Java 8 - Stream
//toList is statically imported from java.util.stream.Collectors
pets.stream()
    .filter(pet -> pet.getBirthdate().isBefore(LocalDate.of(2013, Month.JANUARY, 1)))
    .filter(pet -> pet.getWeight() > 50)
    .collect(toList());

In the code above, we used the Streams API to filter elements in the collection. I intentionally called the filter method twice to show that the Streams API is designed like the Builder pattern. In the Builder pattern, you chain various methods together before you invoke the build method, which constructs the result object. In the Streams API, the counterpart of the build method is called a terminal operation, and any operation that is not terminal is an intermediate operation. A terminal operation differs from the build function in the Builder pattern because you cannot call a terminal operation more than once on the same stream. There are a bunch of terminal operations that you can use: collect, count, min, max, iterator, toArray. Those operations produce values, whereas some terminal operations, like forEach, consume values. Which approach do you think is more readable, the traditional one or the Streams API one?
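
The one-shot nature of a terminal operation is easy to check. In the sketch below (class and method names are hypothetical), the second count() on the same stream is rejected with an IllegalStateException.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

final class SingleUseSketch {
    // Returns true if calling a second terminal operation on the same stream fails.
    static boolean secondTerminalOpFails(List<String> source) {
        Stream<String> stream = source.stream().filter(s -> !s.isEmpty());
        stream.count(); // first terminal operation: consumes the stream
        try {
            stream.count(); // second terminal operation on the same stream
            return false;
        } catch (IllegalStateException expected) {
            // The stream has already been operated upon.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(secondTerminalOpFails(Arrays.asList("a", "", "b"))); // true
    }
}
```

If you need another result, build a fresh stream from the source collection.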

Java Approach 3: Collections API

//Java 8 - Collection
pets.removeIf(pet -> !(pet.getBirthdate().isBefore(LocalDate.of(2013, Month.JANUARY, 1))
                    && pet.getWeight() > 50));
//Equivalent statement after applying De Morgan's law (use one or the other).
pets.removeIf(pet -> pet.getBirthdate().toEpochDay() >= LocalDate.of(2013, Month.JANUARY, 1).toEpochDay()
                || pet.getWeight() <= 50);

This approach is the shortest. However, it modifies the original collection, while the previous ones do not. The removeIf function takes a Predicate<T> (a functional interface) as an argument. Predicate is a behavioral parameter, and it has only one abstract method, named test, that takes an object and returns a boolean. Notice that we have to flip the logic, either by putting ! in front of the expression or by applying De Morgan’s law, in which case the code looks like the second statement.
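
Another way to flip the logic is to compose predicates and call negate(), which is a default method on Predicate. The sketch below is self-contained, so it uses a trimmed-down, hypothetical Pet with only the fields this example needs, plus made-up sample data. It also works on a copy so that the caller’s list is not modified.

```java
import java.time.LocalDate;
import java.time.Month;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class RemoveIfSketch {
    // Hypothetical, trimmed-down stand-in for the article's Pet class.
    static final class Pet {
        final String name;
        final LocalDate birthdate;
        final int weight;

        Pet(String name, LocalDate birthdate, int weight) {
            this.name = name;
            this.birthdate = birthdate;
            this.weight = weight;
        }
    }

    static final LocalDate CUTOFF = LocalDate.of(2013, Month.JANUARY, 1);

    // Made-up sample data for the example.
    static List<Pet> samplePets() {
        return Arrays.asList(
                new Pet("Rex", LocalDate.of(2012, Month.MAY, 1), 60),
                new Pet("Tom", LocalDate.of(2014, Month.MARCH, 10), 70),
                new Pet("Kitty", LocalDate.of(2011, Month.JANUARY, 1), 10));
    }

    // Keeps pets that are both born before the cutoff and chubby, and returns their names.
    static List<String> chubbyOldPetNames(List<Pet> pets) {
        Predicate<Pet> bornBeforeCutoff = pet -> pet.birthdate.isBefore(CUTOFF);
        Predicate<Pet> chubby = pet -> pet.weight > 50;

        List<Pet> copy = new ArrayList<>(pets); // removeIf mutates, so work on a copy
        copy.removeIf(bornBeforeCutoff.and(chubby).negate());
        return copy.stream().map(pet -> pet.name).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(chubbyOldPetNames(samplePets())); // [Rex]
    }
}
```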

Scala Approach: Collection, View, and Stream

//Scala - strict collection
pets.filter { pet => pet.getBirthdate.isBefore(LocalDate.of(2013, Month.JANUARY, 1))}
.filter { pet => pet.getWeight > 50 } //List[Pet]
//Scala - non-strict collection
pets.view.filter { pet => pet.getBirthdate.isBefore(LocalDate.of(2013, Month.JANUARY, 1))}
.filter { pet => pet.getWeight > 50 } //SeqView[Pet]
//Scala - stream
pets.toStream.filter { pet => pet.getBirthdate.isBefore(LocalDate.of(2013, Month.JANUARY, 1))}
.filter { pet => pet.getWeight > 50 } //Stream[Pet]

The solutions in Scala are pretty similar to Java’s Streams API. Looking at each one, you just call the view function to turn the strict collection into a non-strict one, or the toStream function to turn the strict collection into a stream.

I think you already got the idea. So, I will show you just the code and keep my mouth shut :D.

Grouping

Groups elements in a collection by one of the attributes of the element. The result will be a Map<K, List<T>>, where T is the element type and K is the type of the grouping attribute.

Requirement: Group pets by their type, e.g. dog, cat, etc.

//Java approach
Map<Pet.Type, List<Pet>> result = pets.stream().collect(groupingBy(Pet::getType));

//Scala approach
val result = pets.groupBy(_.getType)

Note: groupingBy is a static helper method in java.util.stream.Collectors.
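
For readers who want to run the Java version, here is a self-contained sketch. The Pet class is a trimmed-down, hypothetical stand-in with only a name and a type, and the sample data is made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class GroupingSketch {
    enum Type { CAT, DOG }

    // Hypothetical, trimmed-down stand-in for the article's Pet class.
    static final class Pet {
        private final String name;
        private final Type type;

        Pet(String name, Type type) {
            this.name = name;
            this.type = type;
        }

        String getName() { return name; }
        Type getType() { return type; }
    }

    // Made-up sample data for the example.
    static List<Pet> samplePets() {
        return Arrays.asList(
                new Pet("Rex", Type.DOG),
                new Pet("Tom", Type.CAT),
                new Pet("Handsome", Type.DOG));
    }

    // The key is the pet's type; each value is the list of pets with that type.
    static Map<Type, List<Pet>> byType(List<Pet> pets) {
        return pets.stream().collect(Collectors.groupingBy(Pet::getType));
    }

    public static void main(String[] args) {
        Map<Type, List<Pet>> grouped = byType(samplePets());
        System.out.println(grouped.get(Type.DOG).size()); // 2
        System.out.println(grouped.get(Type.CAT).size()); // 1
    }
}
```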

Sorting

Sorts elements in a collection by one or more attributes of the element. The result can be any kind of collection that preserves the order of its elements, depending on how you collect it.

Requirement: We want to sort pets by type, name, and color, in that order.

//Java approach
pets.stream().sorted(comparing(Pet::getType)
    .thenComparing(Pet::getName)  
    .thenComparing(Pet::getColor))
    .collect(toList());

//Scala approach
pets.sortBy{ p => (p.getType, p.getName, p.getColor) }
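
The Java comparator chain can be tried out with the self-contained sketch below. The Pet class is again a hypothetical, trimmed-down stand-in, and the sample data is made up. Note that enums compare by declaration order, so CAT sorts before DOG and BLACK before WHITE.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class SortingSketch {
    enum Type { CAT, DOG }
    enum Color { BLACK, WHITE, BROWN, GREEN }

    // Hypothetical, trimmed-down stand-in for the article's Pet class.
    static final class Pet {
        private final Type type;
        private final String name;
        private final Color color;

        Pet(Type type, String name, Color color) {
            this.type = type;
            this.name = name;
            this.color = color;
        }

        Type getType() { return type; }
        String getName() { return name; }
        Color getColor() { return color; }

        @Override
        public String toString() { return type + " " + name + " " + color; }
    }

    // Made-up sample data for the example.
    static List<Pet> samplePets() {
        return Arrays.asList(
                new Pet(Type.DOG, "Rex", Color.WHITE),
                new Pet(Type.CAT, "Tom", Color.GREEN),
                new Pet(Type.DOG, "Rex", Color.BLACK),
                new Pet(Type.CAT, "Kitty", Color.BROWN));
    }

    // Sorts by type first, then by name, then by color, and describes each pet.
    static List<String> sortedDescriptions(List<Pet> pets) {
        return pets.stream()
                .sorted(Comparator.comparing(Pet::getType)
                        .thenComparing(Pet::getName)
                        .thenComparing(Pet::getColor))
                .map(Pet::toString)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Prints: CAT Kitty BROWN, CAT Tom GREEN, DOG Rex BLACK, DOG Rex WHITE (one per line)
        sortedDescriptions(samplePets()).forEach(System.out::println);
    }
}
```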

Mapping

Applies the given function to each element in a collection. The result can be of any type, depending on the given function.

Requirement: We want to convert each Pet to a String in this format: "%s - name: %s, color: %s"

//Java approach
pets.stream().map(p ->
        String.format("%s - name: %s, color: %s",
            p.getType(), p.getName(), p.getColor())
    ).collect(toList());

//Scala approach
pets.map{ p => s"${p.getType} - name: ${p.getName}, color: ${p.getColor}"}

Finding First

Finds the first element that matches the given predicate.

Requirement: We want to find the pet that is named “Handsome”. We don’t care how many handsome pets there are in the collection. We just want the first one.

//Java approach
pets.stream()
    .filter(p -> p.getName().equals("Handsome"))
    .findFirst();

//Scala approach
pets.find { p => p.getName == "Handsome" }

This is a tricky one. Did you notice that in the Scala approach, I used the find function instead of filter? If you use filter instead of find, it will compute all elements in the collection because a Scala collection is strict. However, you don’t have to worry about using filter in Java’s Streams API: the stream is lazy and findFirst is a short-circuiting operation, so the pipeline stops as soon as it finds a match instead of processing the whole collection. This is when a lazy collection shines!
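
You can observe this behavior in Java by counting how often the predicate runs. The sketch below uses plain names instead of Pet objects to stay short; the class name, the method name, and the counter are assumptions for illustration, and the stream is sequential.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class FindFirstSketch {
    // Returns how many names the filter inspected before findFirst() stopped.
    static int inspectedUntilFirstMatch(List<String> names, String wanted) {
        AtomicInteger inspected = new AtomicInteger();
        names.stream()
                .filter(name -> { inspected.incrementAndGet(); return name.equals(wanted); })
                .findFirst();
        return inspected.get();
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Kitty", "Handsome", "Rex", "Handsome");
        // Only "Kitty" and the first "Handsome" are inspected.
        System.out.println(inspectedUntilFirstMatch(names, "Handsome")); // 2
    }
}
```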

Let’s see more examples of lazy collections in the Scala code below. We pass a predicate that always returns true to the filter function and get the second element of the result. What are we going to see as the output?

pets.filter { x => println(x.getName); true }.apply(1) // (1)

pets.toStream.filter { x => println(x.getName); true }.apply(1) // (2)

In the code above, the first statement prints the names of all pets in the collection, while the second statement prints only the first two pets’ names. That’s the advantage of a lazy collection: it computes only what is needed.

pets.view.filter { x => println(x.getName); true }.apply(1) // (3)

From the code above, are we going to see the same result as (2)? The answer is No. The result will be the same as (1). Could you tell me why?

After comparing the Java and Scala approaches on a few common operations (filter, group, sort, map, and find), it’s obvious that the Scala approach is shorter than the Java one. But which approach do you like? Which one do you think is more readable?

In the next part of this article, we are going to see which one is faster. Hopefully, it will be less arguable next time. Stay tuned!
