DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Because the DevOps movement has redefined engineering responsibilities, SREs now have to become stewards of observability strategy.

Apache Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors.

The software you build is only as secure as the code that powers it. Learn how malicious code creeps into your software supply chain.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

Related

  • Introducing Graph Concepts in Java With Eclipse JNoSQL, Part 3: Understanding Janus
  • Introducing Graph Concepts in Java With Eclipse JNoSQL, Part 2: Understanding Neo4j
  • How to Introduce a New API Quickly Using Micronaut
  • Introducing Graph Concepts in Java With Eclipse JNoSQL

Trending

  • Using Python Libraries in Java
  • Bridging UI, DevOps, and AI: A Full-Stack Engineer’s Approach to Resilient Systems
  • What’s Got Me Interested in OpenTelemetry—And Pursuing Certification
  • Developers Beware: Slopsquatting and Vibe Coding Can Increase Risk of AI-Powered Attacks
  1. DZone
  2. Coding
  3. Java
  4. Grouping, Sampling and Batching - Custom Collectors in Java 8

Grouping, Sampling and Batching - Custom Collectors in Java 8

By 
Tomasz Nurkiewicz user avatar
Tomasz Nurkiewicz
DZone Core CORE ·
Jul. 18, 14 · Tutorial
Likes (2)
Comment
Save
Tweet
Share
21.5K Views

Join the DZone community and get the full member experience.

Join For Free

Continuing from the first article, this time we will write some more useful custom collectors: for grouping by given criteria, sampling input, batching and sliding over with fixed size window.

Grouping (counting occurrences, histogram)

Imagine you have a collection of some items and you want to calculate how many times each item (with respect to equals()) appears in this collection. This can be achieved using CollectionUtils.getCardinalityMap() from Apache Commons Collections. This method takes an Iterable<T> and returns Map<T, Integer>, counting how many times each item appeared in the collection. However sometimes instead of usingequals() we would like to group by an arbitrary attribute of input T. For example say we have a list of Person objects and we would like to compute the number of males vs. females (i.e. Map<Sex, Integer>) or maybe an age distribution. There is a built-in collector Collectors.groupingBy(Function<T, K> classifier) - however it returns a map from key to all items mapped to that key. See:

import static java.util.stream.Collectors.groupingBy;

//...

final List<Person> people = //...
final Map<Sex, List<Person>> bySex = people
        .stream()
        .collect(groupingBy(Person::getSex));

It's valuable, but in our case unnecessarily builds two List<Person>. I only want to know the number of people. There is no such collector built-in, but we can compose it in a fairly simple manner:

import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;

//...

final Map<Sex, Long> bySex = people
        .stream()
        .collect(
                groupingBy(Person::getSex, HashMap::new, counting()));

This overloaded version of groupingBy() takes three parameters. First one is the key (classifier) function, as previously. Second argument creates a new map, we'll see shortly why it's useful. counting() is a nested collector that takes all people with same sex and combines them together - in our case simply counting them as they arrive. Being able to choose map implementation is useful e.g. when building age histogram. We would like to know how many people we have at given age - but age values should be sorted:

final TreeMap<Integer, Long> byAge = people
    .stream()
    .collect(
            groupingBy(Person::getAge, TreeMap::new, counting()));

byAge
        .forEach((age, count) ->
                System.out.println(age + ":\t" + count));

We ended up with a TreeMap from age (sorted) to count of people having that age.

Sampling, batching and sliding window

IterableLike.sliding() method in Scala allows to view a collection through a sliding fixed-size window. This window starts at the beginning and in each iteration moves by given number of items. Such functionality, missing in Java 8, allows several useful operators like computing moving average, splitting big collection into batches (compare with Lists.partition() in Guava) or sampling every n-th element. We will implement collector for Java 8 providing similar behaviour. Let's start from unit tests, which should describe briefly what we want to achieve:

import static com.nurkiewicz.CustomCollectors.sliding

@Unroll
class CustomCollectorsSpec extends Specification {

    def "Sliding window of #input with size #size and step of 1 is #output"() {
        expect:
        input.stream().collect(sliding(size)) == output

        where:
        input  | size | output
        []     | 5    | []
        [1]    | 1    | [[1]]
        [1, 2] | 1    | [[1], [2]]
        [1, 2] | 2    | [[1, 2]]
        [1, 2] | 3    | [[1, 2]]
        1..3   | 3    | [[1, 2, 3]]
        1..4   | 2    | [[1, 2], [2, 3], [3, 4]]
        1..4   | 3    | [[1, 2, 3], [2, 3, 4]]
        1..7   | 3    | [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6], [5, 6, 7]]
        1..7   | 6    | [1..6, 2..7]
    }

    def "Sliding window of #input with size #size and no overlapping is #output"() {
        expect:
        input.stream().collect(sliding(size, size)) == output

        where:
        input | size | output
        []    | 5    | []
        1..3  | 2    | [[1, 2], [3]]
        1..4  | 4    | [1..4]
        1..4  | 5    | [1..4]
        1..7  | 3    | [1..3, 4..6, [7]]
        1..6  | 2    | [[1, 2], [3, 4], [5, 6]]
    }

    def "Sliding window of #input with size #size and some overlapping is #output"() {
        expect:
        input.stream().collect(sliding(size, 2)) == output

        where:
        input | size | output
        []    | 5    | []
        1..4  | 5    | [[1, 2, 3, 4]]
        1..7  | 3    | [1..3, 3..5, 5..7]
        1..6  | 4    | [1..4, 3..6]
        1..9  | 4    | [1..4, 3..6, 5..8, 7..9]
        1..10 | 4    | [1..4, 3..6, 5..8, 7..10]
        1..11 | 4    | [1..4, 3..6, 5..8, 7..10, 9..11]
    }

    def "Sliding window of #input with size #size and gap of #gap is #output"() {
        expect:
        input.stream().collect(sliding(size, size + gap)) == output

        where:
        input | size | gap | output
        []    | 5    | 1   | []
        1..9  | 4    | 2   | [1..4, 7..9]
        1..10 | 4    | 2   | [1..4, 7..10]
        1..11 | 4    | 2   | [1..4, 7..10]
        1..12 | 4    | 2   | [1..4, 7..10]
        1..13 | 4    | 2   | [1..4, 7..10, [13]]
        1..13 | 5    | 1   | [1..5, 7..11, [13]]
        1..12 | 5    | 3   | [1..5, 9..12]
        1..13 | 5    | 3   | [1..5, 9..13]
    }

    def "Sampling #input taking every #nth th element is #output"() {
        expect:
        input.stream().collect(sliding(1, nth)) == output

        where:
        input  | nth | output
        []     | 1   | []
        []     | 5   | []
        1..3   | 5   | [[1]]
        1..6   | 2   | [[1], [3], [5]]
        1..10  | 5   | [[1], [6]]
        1..100 | 30  | [[1], [31], [61], [91]]
    }
}

Using data driven tests in Spock I managed to write almost 40 test cases in no-time, succinctly describing all requirements. I hope these are clear for you, even if you haven't seen this syntax before. I already assumed existence of handy factory methods:

public class CustomCollectors {

    public static <T> Collector<T, ?, List<List<T>>> sliding(int size) {
        return new SlidingCollector<>(size, 1);
    }

    public static <T> Collector<T, ?, List<List<T>>> sliding(int size, int step) {
        return new SlidingCollector<>(size, step);
    }

}

The fact that collectors receive items one after another makes are job harder. Of course first collecting the whole list and sliding over it would have been easier, but sort of wasteful. Let's build result iteratively. I am not even pretending this task can be parallelized in general, so I'll leave combiner() unimplemented:

public class SlidingCollector<T> implements Collector<T, List<List<T>>, List<List<T>>> {

    private final int size;
    private final int step;
    private final int window;
    private final Queue<T> buffer = new ArrayDeque<>();
    private int totalIn = 0;

    public SlidingCollector(int size, int step) {
        this.size = size;
        this.step = step;
        this.window = max(size, step);
    }

    @Override
    public Supplier<List<List<T>>> supplier() {
        return ArrayList::new;
    }

    @Override
    public BiConsumer<List<List<T>>, T> accumulator() {
        return (lists, t) -> {
            buffer.offer(t);
            ++totalIn;
            if (buffer.size() == window) {
                dumpCurrent(lists);
                shiftBy(step);
            }
        };
    }

    @Override
    public Function<List<List<T>>, List<List<T>>> finisher() {
        return lists -> {
            if (!buffer.isEmpty()) {
                final int totalOut = estimateTotalOut();
                if (totalOut > lists.size()) {
                    dumpCurrent(lists);
                }
            }
            return lists;
        };
    }

    private int estimateTotalOut() {
        return max(0, (totalIn + step - size - 1) / step) + 1;
    }

    private void dumpCurrent(List<List<T>> lists) {
        final List<T> batch = buffer.stream().limit(size).collect(toList());
        lists.add(batch);
    }

    private void shiftBy(int by) {
        for (int i = 0; i < by; i++) {
            buffer.remove();
        }
    }

    @Override
    public BinaryOperator<List<List<T>>> combiner() {
        return (l1, l2) -> {
            throw new UnsupportedOperationException("Combining not possible");
        };
    }

    @Override
    public Set<Characteristics> characteristics() {
        return EnumSet.noneOf(Characteristics.class);
    }

}

I spent quite some time writing this implementation, especially correct finisher() so don't be frightened. The crucial part is a buffer that collects items until it can form one sliding window. Then "oldest" items are discarded and window slides forward by step. I am not particularly happy with this implementation, but tests are passing. sliding(N)(synonym to sliding(N, 1)) will allow calculating moving average of N items.sliding(N, N) splits input into batches of size N. sliding(1, N) takes every N-th element (samples). I hope you'll find this collector useful, enjoy!

Java (programming language)

Published at DZone with permission of Tomasz Nurkiewicz, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Introducing Graph Concepts in Java With Eclipse JNoSQL, Part 3: Understanding Janus
  • Introducing Graph Concepts in Java With Eclipse JNoSQL, Part 2: Understanding Neo4j
  • How to Introduce a New API Quickly Using Micronaut
  • Introducing Graph Concepts in Java With Eclipse JNoSQL

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!