Java Stream API: 3 Things Every Developer Should Know About
The Java Stream API simplifies collection processing with lazy evaluation, parallelism, and functional programming. Use it to write cleaner, more efficient, and more scalable code.
Time flies! I remember when Java 8 was the reference point, and everybody talked about it as something new and revolutionary. Frankly speaking, it was new and revolutionary. But now, projects using Java 8 might be labeled "legacy." Even if Java 8 itself has become legacy, the features introduced in that version are still relevant. So let’s talk today about one of them: the Stream API.
In case you don’t know, Java Stream API is a powerful tool that allows programmers to write Java code in a functional programming style. Introduced long ago, it makes working with collections easier by enabling filtering, transformation, and aggregation.
Even though Stream API is widely used, I still see a lot of gaps in deep knowledge of this technology. In this article, I will explore three aspects of the Stream API that I find essential to understand:
- Lazy evaluation. This helps us optimize the execution of operation chains.
- Parallel streams. We will take a deep look at adding parallelism in data processing by leveraging multi-core processors.
- Lambda variable scope. We will find out how to correctly pass variables into Lambdas when using Streams.
I hope that, by the end of this article, you will understand these concepts a little better.
1. Lazy Evaluation in Java Stream API
Lazy evaluation is the core concept to understand if you want to work effectively with streams. But before we go deeper into lazy evaluation, let’s explore the two main types of operations in a stream pipeline: intermediate and terminal operations.
- Intermediate Operations. Intermediate operations transform the input stream into another stream without producing a final result. For example, filter(), map(), and flatMap() are all intermediate operations because they take an input stream and return another stream as a result. These operations do not consume the elements of the input stream themselves; they only describe a new stream containing the desired elements.
- Terminal Operations. Terminal operations consume elements from the stream and either return a result or produce a side effect. For instance, forEach(), findFirst(), and collect() are all terminal operations. forEach() and collect() process every element, while findFirst() is short-circuiting and may stop as soon as it has found a result.
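To make the distinction concrete, here is a minimal sketch. The class name OperationKindsExample and the helper upperCaseLongWords are made up for illustration; the point is only that the intermediate calls hand back a Stream, while the terminal call hands back a real result.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class OperationKindsExample {

    // Hypothetical helper: keeps words longer than three characters and upper-cases them.
    static List<String> upperCaseLongWords(List<String> words) {
        // Intermediate operations: each call returns another Stream, nothing runs yet.
        Stream<String> pipeline = words.stream()
                .filter(w -> w.length() > 3)
                .map(String::toUpperCase);

        // Terminal operation: consumes the stream and produces a List.
        return pipeline.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("map", "filter", "flatMap", "of");
        System.out.println(upperCaseLongWords(words)); // prints [FILTER, FLATMAP]
    }
}
```

Holding the pipeline in a local variable is done here only to show its type; in everyday code the whole chain is usually written as a single expression.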
What Is Lazy Evaluation and How Does It Work?
In the Stream API, lazy evaluation means that the intermediate operations in a stream are not executed until we call a terminal operation. We can therefore define a stream and its operations anywhere in the code, and they run only when a terminal operation is called.
When we invoke a terminal operation, the stream starts processing the data one element at a time, applying all the intermediate operations in sequence. This approach helps us optimize performance by avoiding unnecessary computations.
Let’s look at a practical example to see how laziness impacts execution:
import java.util.stream.Stream;
public class LazyEvaluationExample {
    public static void main(String[] args) {
        Stream<Integer> stream = Stream.of(1, 2, 3, 4, 5)
                .filter(num -> {
                    System.out.println("Filtering: " + num);
                    return num % 2 == 0;
                })
                .map(num -> {
                    System.out.println("Mapping: " + num);
                    return num * 2;
                });
        System.out.println("Stream pipeline defined, no execution yet.");
        // Terminal operation triggers execution
        stream.forEach(System.out::println);
    }
}
Output:
Stream pipeline defined, no execution yet.
Filtering: 1
Filtering: 2
Mapping: 2
4
Filtering: 3
Filtering: 4
Mapping: 4
8
Filtering: 5
Let’s try to understand this output. The filter() and map() operations are lazy: even though we wrote the code, it was not invoked until the terminal forEach() operation was called. That explains why we see "Stream pipeline defined, no execution yet." first.
Only when the forEach() terminal operation is called does the stream start processing elements one by one. Notice that the log lines are interleaved: each element travels through filter(), map(), and forEach() before the next element is touched, and the elements rejected by filter() (1, 3, and 5) never reach map().
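Laziness pays off most visibly with a short-circuiting terminal operation. The sketch below is my own variation on the example above, not part of the original code: the class ShortCircuitExample and the inspected list are illustrative names, and the side effect inside filter() is there purely to record what was evaluated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

class ShortCircuitExample {

    // Returns the first even number doubled and records every element
    // that filter() actually looked at in the supplied list.
    static Optional<Integer> firstEvenDoubled(List<Integer> inspected) {
        return Stream.of(1, 2, 3, 4, 5)
                .filter(num -> {
                    inspected.add(num); // side effect for demonstration only
                    return num % 2 == 0;
                })
                .map(num -> num * 2)
                .findFirst(); // short-circuiting terminal operation
    }

    public static void main(String[] args) {
        List<Integer> inspected = new ArrayList<>();
        System.out.println(firstEvenDoubled(inspected)); // prints Optional[4]
        System.out.println(inspected);                   // prints [1, 2]
    }
}
```

Because this sequential stream pulls elements one at a time, the pipeline stops as soon as findFirst() has its answer, and 3, 4, and 5 are never filtered or mapped.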
2. Parallel Streams
One of the most useful and powerful features of Java Stream API is the support of parallel streams. Parallelism is nothing but a capability to process two or more actions concurrently by leveraging multiple CPU cores. When it applies to Stream API, it means that we are allowed to process intermediate or terminal operations for more than one element in the stream at the same time.
This functionality can significantly improve performance for computationally intensive tasks, but you need a solid understanding of how it works to get the best results.
What Are Parallel Streams?
A parallel stream is a stream that divides its elements into several chunks and then processes them in parallel on different threads. In contrast to a sequential stream, which processes elements one by one, parallel streams use a ForkJoinPool under the hood (by default, the common pool) to achieve parallelism.
Creating a parallel stream is straightforward. You may use one of two approaches:
- For existing collections, you can use the parallelStream() method, or
- You can make an existing stream parallel just by calling the parallel() method on it.
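Both approaches can be sketched in a few lines. The class name ParallelCreationExample and the sumOfSquares helper are assumptions made for this illustration; isParallel() is used only to confirm which mode each stream is in.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

class ParallelCreationExample {

    // Hypothetical helper: sums the squares of the numbers using a parallel stream.
    static long sumOfSquares(List<Integer> numbers) {
        return numbers.parallelStream()
                .mapToLong(n -> (long) n * n)
                .sum();
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);

        // Approach 1: ask the collection for a parallel stream directly.
        Stream<Integer> fromCollection = numbers.parallelStream();
        System.out.println(fromCollection.isParallel()); // prints true

        // Approach 2: switch an existing sequential stream to parallel mode.
        Stream<Integer> switched = Stream.of(1, 2, 3, 4, 5).parallel();
        System.out.println(switched.isParallel()); // prints true

        System.out.println(sumOfSquares(numbers)); // prints 55
    }
}
```

The sum is the same in either mode because addition is associative, which is exactly the property a reduction needs to be safe in a parallel stream.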
When to Use Parallel Streams
Parallel streams can boost performance in specific scenarios, but they’re not always the right choice. Here are some key considerations:
1. Suitable Scenarios
- Large data sets. Parallelism works best when processing a large amount of data.
- CPU-intensive tasks. It is ideal for computations that heavily utilize the CPU, such as mathematical operations or data transformations.
2. Avoid Parallel Streams When
- IO-bound tasks. If your task involves reading from or writing to disk or the network, parallel streams might not be the right choice for you.
- Small data sets. Keep in mind that, under the hood, the Java Virtual Machine still has to split the work, manage threads, switch contexts between them, and combine the partial results. So, when working with small data sets, this overhead may outweigh the performance benefits, and we will see it a little later in the next example.
Explanation of the Performance Comparison Code
The code provided below will help us compare the performance of sequential and parallel streams in Java by summing numbers in a range.
We are going to run two tests:
- First: With a range from 1 to 1,000,000
- Second: With a range from 1 to 100,000,000
Our main goal is to compare the processing time for both sequential and parallel streams, which helps us understand the pros and cons of using parallel streams from the performance side.
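import java.util.stream.LongStream;
public class ParallelStreamComparison {
    public static void main(String[] args) {
        int rangeLimit = 1_000_000; // set to 100_000_000 for the second test
        // Sum the range with a sequential stream
        long start = System.currentTimeMillis();
        LongStream.rangeClosed(1, rangeLimit)
                .reduce(0L, Long::sum);
        long end = System.currentTimeMillis();
        System.out.println("Sequential Stream Time: " + (end - start) + " ms");
        // Sum the same range with a parallel stream
        start = System.currentTimeMillis();
        LongStream.rangeClosed(1, rangeLimit)
                .parallel()
                .reduce(0L, Long::sum);
        end = System.currentTimeMillis();
        System.out.println("Parallel Stream Time: " + (end - start) + " ms");
    }
}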
First, we created two streams: sequential and parallel. The parallel one was created by calling the .parallel() method on an existing stream. Both contain the numbers from 1 to 1,000,000, which we generated using the LongStream.rangeClosed() method.
Second, for both streams, we executed the .reduce(0L, Long::sum) method, which sums all the elements in the input stream. As reduce is a terminal operation, the stream is processed as soon as we call this method.
We also measure the time each operation takes. The timestamps are stored in the variables start and end using System.currentTimeMillis(), and the difference is printed in milliseconds using System.out.println.
So, let’s execute our code twice, updating the rangeLimit variable. For the first execution, set the value to 1,000,000 as in the code. For the second execution, set this value to 100,000,000.
For the range from 1 to 1,000,000:
Sequential Stream Time: 9 ms
Parallel Stream Time: 12 ms
We can see that, in this case, the parallel stream takes slightly longer than the sequential stream. This is a great example showing that managing multiple threads may cause a performance penalty for smaller data sets like the one we used in our example.
As a next step in our testing, let’s increase the range to 100_000_000, and we will see the following results:
Sequential Stream Time: 57 ms
Parallel Stream Time: 12 ms
Finally, we can see the benefit of parallel streams. Here, the parallel stream outperforms the sequential stream by a significant margin. The larger data set was able to benefit from parallelism by leveraging multiple CPU cores for a faster computation process.
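A word of caution about the numbers: a single System.currentTimeMillis() measurement includes JIT warm-up and other noise, so the exact values will differ between machines and runs. The sketch below is one slightly more careful way to time the same two sums. The class WarmedUpTiming, the bestMillis helper, and the warm-up and run counts are my own assumptions, and a dedicated harness such as JMH remains the better tool for real benchmarks.

```java
import java.util.function.LongSupplier;
import java.util.stream.LongStream;

class WarmedUpTiming {

    static long sequentialSum(long limit) {
        return LongStream.rangeClosed(1, limit).reduce(0L, Long::sum);
    }

    static long parallelSum(long limit) {
        return LongStream.rangeClosed(1, limit).parallel().reduce(0L, Long::sum);
    }

    // Runs the task a few times untimed, then returns the fastest
    // of several timed runs, in milliseconds.
    static double bestMillis(LongSupplier task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.getAsLong();
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.getAsLong();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best / 1_000_000.0;
    }

    public static void main(String[] args) {
        long limit = 100_000_000L;
        System.out.println("Sequential: " + bestMillis(() -> sequentialSum(limit), 3, 5) + " ms");
        System.out.println("Parallel:   " + bestMillis(() -> parallelSum(limit), 3, 5) + " ms");
    }
}
```

Taking the best of several runs after a warm-up reduces the influence of one-off costs, but it is still only a rough comparison, not a rigorous benchmark.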
Important Consideration: Handling Large Data Sets
We should remember one point: long values in Java have a maximum of 2^63-1 (about 9.22 × 10^18), and Long::sum wraps around silently if a result exceeds it. For the ranges in our example this is not a problem: the sum of 1 to 100,000,000 is 5,000,000,050,000,000, which fits comfortably. The sum of 1 to n only overflows a long once n reaches 2^32 (about 4.29 billion).
If you do sum ranges or values that large, the result will be incorrect even though the timing comparison is still meaningful. For accurate summing in that case, you may need a type with a larger range, such as BigInteger.
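To make the overflow concern concrete, here is a minimal sketch of two safer ways to sum. The class OverflowExample and the helpers checkedSum and bigSum are names I made up for illustration. Math.addExact throws an ArithmeticException instead of wrapping, while BigInteger has no fixed upper bound.

```java
import java.math.BigInteger;
import java.util.stream.LongStream;

class OverflowExample {

    // Fails fast with ArithmeticException instead of wrapping around silently.
    static long checkedSum(long limit) {
        return LongStream.rangeClosed(1, limit).reduce(0L, Math::addExact);
    }

    // Arbitrary-precision sum: slower, but it cannot overflow.
    static BigInteger bigSum(long limit) {
        return LongStream.rangeClosed(1, limit)
                .mapToObj(BigInteger::valueOf)
                .reduce(BigInteger.ZERO, BigInteger::add);
    }

    public static void main(String[] args) {
        // The sum for the larger range in this article still fits in a long.
        System.out.println(checkedSum(100_000_000L)); // prints 5000000050000000

        // Plain long addition wraps around without any error.
        System.out.println(Long.MAX_VALUE + 1 == Long.MIN_VALUE); // prints true

        System.out.println(bigSum(1_000)); // prints 500500
    }
}
```

The checked version costs very little and is usually enough; BigInteger is worth its overhead only when the values genuinely need more than 64 bits.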
3. Variable Scope in Lambdas
Let’s talk a little bit about lambdas. Lambda expressions are widely involved when using Stream API. Frankly speaking, I think there is quite an impressive number of developers who are using lambdas only with streams. So, I think it is reasonable to discuss some Lambda-related points in this article as well.
We should be aware that lambdas have a specific way of interacting with variables, and for us, as Java developers, it is crucial to understand how lambdas capture and use variables.
Let’s explore how variable scope works in lambda expressions and how it differs from traditional methods.
Capturing Variables in Lambdas
Let’s imagine you have a variable initialized outside the scope of lambda, somewhere in the surrounding scope and you are planning to use this variable inside your lambda function. Can you do it?
It depends. We can use variables captured from the surrounding scope only if they are final or effectively final. But what does "effectively final" mean?
To keep it short, a variable is considered effectively final if its value is never changed after it is initialized. So, to use your variable in a lambda, you have two options:
- Initialize your variable as usual and make sure its value is never changed after initialization, or
- Just make this variable final by adding the final keyword to its declaration.
int factor = 2; // never reassigned, so effectively final
List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
numbers.stream()
        .map(n -> n * factor)
        .forEach(System.out::println);
We can see in the above example that the factor variable is effectively final because we never update it after initialization, so it can be used in our lambda. If you try to reassign factor after its initialization, the code no longer compiles: javac reports that local variables referenced from a lambda expression must be final or effectively final.
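When a variable really does have to be reassigned before the lambda uses it, a common workaround is to copy its current value into a fresh variable that is never changed. The sketch below shows the idea; the class LambdaCaptureExample, the scale method, and its parameters are hypothetical names chosen for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class LambdaCaptureExample {

    static List<Integer> scale(List<Integer> numbers, int baseFactor, boolean doubleIt) {
        int factor = baseFactor;
        if (doubleIt) {
            factor = factor * 2; // reassigned, so factor is no longer effectively final
        }

        // Using factor directly in the lambda below would not compile.
        // A copy that is assigned exactly once can be captured safely.
        final int finalFactor = factor;
        return numbers.stream()
                .map(n -> n * finalFactor)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3);
        System.out.println(scale(numbers, 2, true));  // prints [4, 8, 12]
        System.out.println(scale(numbers, 2, false)); // prints [2, 4, 6]
    }
}
```

The lambda captures the value of finalFactor at the moment it is created, which is why Java insists that the captured variable cannot change afterwards.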
Conclusion
The Stream API is a powerful yet simple-to-understand set of tools for processing sequences of elements. When used correctly, it helps reduce a lot of unnecessary code, makes programs more readable, and can improve application performance. But as I mentioned, it is crucial to use it correctly to get the best results in terms of both performance and code cleanliness.
Opinions expressed by DZone contributors are their own.