DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Amazon OpenSearch Vector Search Explained for RAG Systems
  • Production-Grade RAG: Why Vector Search Isn't Enough (and How Hybrid Search Fills the Gaps)
  • Beyond Manual Annotation: Engineering Self-Correcting Pseudo-Labeling Pipelines
  • How to Save Money Using Custom LLMs for Specific Tasks

Trending

  • AWS Kiro: The Agentic IDE That Makes Specs the Unit of Work
  • Monitoring Spring Boot Applications with Prometheus and Grafana
  • Code Quality Had 5 Pillars. AI Broke 3 and Created 2 We Can’t Measure
  • Real-Time AI Inference at Scale Using Cloud Run, GPUs, and Vertex AI
  1. DZone
  2. Data Engineering
  3. Data
  4. A Disk-Backed ArrayList

A Disk-Backed ArrayList

Learn more about disk-backed ArrayLists and Java performance.

By 
Bozhidar Bozhanov user avatar
Bozhidar Bozhanov
·
Updated Jan. 02, 20 · Analysis
Likes (1)
Comment
Save
Tweet
Share
13.6K Views

Join the DZone community and get the full member experience.

Join For Free

Learn more about disk-backed ArrayLists and Java performance.


A Java disk-based ArrayList can sometimes occur when your list becomes too big to fit in memory and you have to do something to avoid running out of memory.

The proper way to do that is streaming — instead of fitting everything in memory, you should stream data from the source and discard the entries that are already processed.


You may also like: Performance Evaluation of Java ArrayLists


However, there are cases when code that's outside of your control requires a List and you can't use streaming. These cases are rather rare, but in case you hit them, you have to find a workaround. One is to re-implement the code to work with streaming, but depending on the way the library is written, it may not be possible. So the other option is to use a disk-backed list — one that works as a list but underneath stores and loads elements from disk.

Searching for existing solutions results in several 3+ years old repos like this one and this one and this one.

And then there's MapDB, which is great and supported. It's mostly about maps, but it does support a List as well, as shown here.

And finally, you have the option to implement something simpler yourself, in case you need just iteration and almost nothing else. I've done it here - DiskBackedArrayList.java. It doesn't support many things (not all methods are overridden to throw an exception, but they should). But most importantly, it doesn't support random adding and random getting, and also toArray(). It's purely "fill the collection" and then "iterate the collection". It relies on ObjectOutputStream, which is not terribly efficient, but is simple to use.

Note that I've allowed a short in-memory prependList in case small amounts of data need to be prepended to the list.

The list gets filled in memory until a specified threshold and then gets flushed to disk, clearing the memory which starts getting filled again. This too can be more efficient — with background flushing in another thread that doesn't interfere with adding elements to the list, but optimizations complicate things, and in this case, the total running time was not an issue.

Most importantly, the iterator() method is overridden to return a custom iterator that first streams the prepended list, then reads everything from disk and finally iterates over the latest batch, which is still in memory. And finally, the clear() method should be called in the end in order to close the underlying stream.

An output stream could be opened and closed on each flush, but ObjectOutputStream can't be used in append mode due to some implementation-specific about writing headers first.

So basically, we hide the streaming approach underneath a List interface — it's still streaming elements and discarding them when not needed. Ideally, this should be done at the source of the data (e.g. a database, message queue, etc.) rather than using the disk as overflow space, but there are cases where using the disk is fine. This implementation is a starting point, as it's not tested in production, but illustrates that you can adapt existing classes to use different data access patterns if needed.

Further Reading

How to Create an ArrayList in Java

Performance Evaluation of Java ArrayLists

Data structure

Published at DZone with permission of Bozhidar Bozhanov. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Amazon OpenSearch Vector Search Explained for RAG Systems
  • Production-Grade RAG: Why Vector Search Isn't Enough (and How Hybrid Search Fills the Gaps)
  • Beyond Manual Annotation: Engineering Self-Correcting Pseudo-Labeling Pipelines
  • How to Save Money Using Custom LLMs for Specific Tasks

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook