Split a File as a Stream

Curious about when you would actually use the splitAsStream method? Here's a use case of splitting a file into chunks that can be processed as streams.

By Peter Verhas · Nov. 30, 17 · Tutorial

I recently discussed how the splitAsStream method of the Pattern class (new, @since 1.8) works on character sequences: it reads only as much as the stream needs, instead of running ahead with pattern matching, creating all the possible elements up front, and returning them as a stream. This lazy behavior is the true nature of streams, and it is what high-performance applications need.
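To make the laziness visible, here is a minimal sketch. The class names LazySplitDemo and TrackingCharSequence and the helper repeatTokens are hypothetical, made up for this illustration. The wrapper records the highest index that was ever requested, so we can compare it with the total length after pulling a single element from the stream.

```java
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the JDK or of the article's code.
class LazySplitDemo {

    /** Wraps a CharSequence and remembers the highest index read so far. */
    static final class TrackingCharSequence implements CharSequence {
        private final CharSequence delegate;
        private int highestIndexRead = -1;

        TrackingCharSequence(CharSequence delegate) {
            this.delegate = delegate;
        }

        @Override
        public int length() {
            return delegate.length();
        }

        @Override
        public char charAt(int index) {
            highestIndexRead = Math.max(highestIndexRead, index);
            return delegate.charAt(index);
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            // end is exclusive, so the last character touched is end - 1
            highestIndexRead = Math.max(highestIndexRead, end - 1);
            return delegate.subSequence(start, end);
        }

        @Override
        public String toString() {
            return delegate.toString();
        }

        int highestIndexRead() {
            return highestIndexRead;
        }
    }

    /** Builds "token0,token1,...," with the given number of tokens. */
    static String repeatTokens(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append("token").append(i).append(',');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TrackingCharSequence input = new TrackingCharSequence(repeatTokens(1000));
        // findFirst() is short-circuiting: it pulls only one element
        Optional<String> first = Pattern.compile(",").splitAsStream(input).findFirst();
        System.out.println(first.orElse("<none>"));
        System.out.println("highest index read: " + input.highestIndexRead()
                + ", total length: " + input.length());
    }
}
```

If splitAsStream behaves as described, the highest index read should stay close to the first comma, far below the total length. A call to split on the same pattern would have to scan the whole input before returning anything.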

In this article, I will show a practical application of splitAsStream, where it really makes sense to process the stream and not just split the whole string into an array and work on that.

The application, as you may have guessed from the title of the article, is splitting up a file along some tokens. A file can be represented as a CharSequence as long as it is not longer than 2GB. The limit comes from the fact that the length of a CharSequence is an int value, which is a signed 32-bit number in Java, while the length of a file is a long, which is 64-bit. Since reading from a file is much slower than reading from a string that is already in memory, it makes sense to use the laziness of stream handling. All we need is a character sequence implementation that is backed by a file. If we have that, we can write a program like the following:

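public static void main(String[] args) throws FileNotFoundException {
    // split on any one of: comma, period, hyphen, semicolon
    Pattern p = Pattern.compile("[,\\.\\-;]");
    // wrap the file as a character sequence
    final CharSequence splitIt =
        new FileAsCharSequence(
               new File("path_to_source\\SplitFileAsStream.java"));
    // the terminal operation pulls the tokens one by one
    p.splitAsStream(splitIt).forEach(System.out::println);
}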


Creating the character sequence does not read any part of the file, because nothing is needed yet. The content is read only as the stream asks for elements. This assumes that the implementation of FileAsCharSequence does not read the file greedily. The class FileAsCharSequence can be implemented like this:

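package com.epam.training.regex;

import java.io.*;
import java.nio.charset.StandardCharsets;

/**
 * A CharSequence backed by a file that is read lazily, on demand.
 * It assumes a single-byte encoding (for example ASCII), so that the
 * byte offset in the file is the same as the character index.
 * The underlying stream is never closed in this simple version.
 */
public class FileAsCharSequence implements CharSequence {
    private final int length;
    private final StringBuilder buffer = new StringBuilder();
    private final InputStream input;

    public FileAsCharSequence(File file) throws FileNotFoundException {
        if (file.length() > (long) Integer.MAX_VALUE) {
            throw new IllegalArgumentException("File is too long to handle as character sequence");
        }
        this.length = (int) file.length();
        this.input = new FileInputStream(file);
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        // the buffer has to hold index + 1 characters to serve position index
        ensureFilled(index + 1);
        return buffer.charAt(index);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        // end is exclusive, so end characters are enough; asking for end + 1
        // would read past the end of the file when end == length()
        ensureFilled(end);
        return buffer.subSequence(start, end);
    }

    /** Reads from the file until the buffer holds at least size characters. */
    private void ensureFilled(int size) {
        if (buffer.length() < size) {
            buffer.ensureCapacity(size);
            final byte[] bytes = new byte[size - buffer.length()];
            try {
                int filled = 0;
                // a single read() may return fewer bytes than requested
                while (filled < bytes.length) {
                    int n = input.read(bytes, filled, bytes.length - filled);
                    if (n < 0) {
                        throw new IllegalArgumentException("File ended unexpectedly");
                    }
                    filled += n;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            // ISO-8859-1 maps every byte to exactly one char, which keeps the
            // character index equal to the byte offset; multi-byte UTF-8 content
            // would need a real decoder and a different length calculation
            buffer.append(new String(bytes, StandardCharsets.ISO_8859_1));
        }
    }
}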


This implementation reads only as many bytes from the file as the most recent call to charAt or subSequence needs.

If you are interested, you can improve this code so that it keeps in memory only the characters that are still needed and discards those that were already returned to the stream. A good hint for deciding which ones are no longer needed: splitAsStream never touches a character with an index smaller than the first (start) argument of the most recent call to subSequence.

However, if the implementation throws characters away and fails when anyone tries to access a character that was already discarded, it no longer truly implements the CharSequence interface. It may still work well with splitAsStream, as long as the implementation of splitAsStream does not change and start needing characters it has already passed. (I am not sure, but this may also happen when a complex regular expression is used as the splitting pattern.)
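As a rough sketch of that exercise, and not a finished implementation, the class below keeps only a sliding window of characters in memory. Everything in it is an assumption made for illustration: the name WindowedCharSequence, the maxWindow counter, and the choice to read from a Reader (so that the example runs without a file). It also assumes a simple splitting pattern without lookbehind or similar constructs, and it fails on purpose when a discarded character is requested, so it does not honor the full CharSequence contract.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch: a forward-only CharSequence that forgets characters
// located before the start of the most recent subSequence call.
class WindowedCharSequence implements CharSequence {
    private final Reader reader;
    private final int length;                 // total length, known up front
    private final StringBuilder window = new StringBuilder();
    private int offset = 0;                   // index of window.charAt(0) in the full text
    private int maxWindow = 0;                // largest window seen, for illustration only

    WindowedCharSequence(Reader reader, int length) {
        this.reader = reader;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        requireNotDiscarded(index);
        fill(index + 1);
        return window.charAt(index - offset);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        requireNotDiscarded(start);
        fill(end);
        // copy the result first: a view into the window would be invalidated below
        String result = window.substring(start - offset, end - offset);
        window.delete(0, start - offset);     // forget everything before start
        offset = start;
        return result;
    }

    int maxWindow() {
        return maxWindow;
    }

    // Deliberately breaks the CharSequence contract for discarded positions.
    private void requireNotDiscarded(int index) {
        if (index < offset) {
            throw new IllegalStateException("character " + index + " was already discarded");
        }
    }

    // Reads until the window covers the first `size` characters of the full text.
    private void fill(int size) {
        try {
            while (offset + window.length() < size) {
                int c = reader.read();
                if (c < 0) {
                    throw new IndexOutOfBoundsException("input ended before index " + (size - 1));
                }
                window.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        maxWindow = Math.max(maxWindow, window.length());
    }

    public static void main(String[] args) {
        String text = "alpha,beta;gamma.delta-epsilon";
        WindowedCharSequence input =
                new WindowedCharSequence(new StringReader(text), text.length());
        List<String> tokens = Pattern.compile("[,.\\-;]")
                .splitAsStream(input)
                .collect(Collectors.toList());
        System.out.println(tokens);
        System.out.println("largest window: " + input.maxWindow() + " of " + text.length());
    }
}
```

With this approach the window holds roughly the previous token plus the one being scanned, instead of the whole input. For a real file, the Reader would need to be buffered, and the total length in characters would have to be known or computed in advance.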

Happy coding!




Published at DZone with permission of Peter Verhas. See the original article here.
