
Put your fat Collections on a diet!

by Nikita Salnikov-Tarnovski · Nov. 02, 11

Background

Every Java program uses some sort of data structure, be it a trivial array, a Fibonacci heap, or even something more exotic that only Google search knows about. In most cases developers do not write their own implementations of these structures but use either the ones provided by the Java core APIs or some third-party library, such as Apache Collections or Google Guava. In my 10+ years of Java development not a day has passed without me using some data structure from the Java Collection API. These Lists, Sets and Maps are so natural to me that I don't hesitate for a second before writing

Map<Integer, String> map = new HashMap<Integer, String>();

And everything was fine until recently…

One of the classes inside our Plumbr Java agent needs to store a bunch of integers as one of its fields. The semi-formal requirements are as follows:

  • We need a data structure for storing integers.
  • No duplicates.
  • Order is unimportant.
  • We need to add new elements to this structure.
  • We need to look up whether some element exists in this structure.
  • The number of distinct elements is limited to a couple of hundred at most.
  • Memory consumption is more important than speed.
  • Nevertheless, performance must be decent, so memory-mapped files, databases, etc. are out of the question.

The natural choice for these requirements, at least considering my experience so far, is java.util.HashSet<Integer>. So, without thinking twice, I gave it a try. That was a disaster!

Experiment

Well, in order to illustrate my point, we need some way to measure the memory usage of different data structures. For this blog post I used the following procedure:

  1. Write a Java class with a main method, which holds the needed data structure as a local variable.
  2. Add an infinite loop to the end of the main method so that the thread does not die too quickly.
  3. Using Eclipse Memory Analyzer, take a memory dump and find the size of the retained heap for the local variable of interest.
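Put together, a minimal harness for this procedure might look like the following sketch (the class and method names are my own invention, not from the original experiment):

```java
import java.util.HashSet;
import java.util.Set;

// Throwaway harness for the measurement procedure above: build the
// structure inside main, then block forever so a heap dump can be taken
// with Eclipse Memory Analyzer and the retained heap of the local
// variable inspected.
public class RetainedHeapExperiment {

    // Builds the structure under measurement.
    static Set<Integer> buildSet(int count) {
        Set<Integer> set = new HashSet<Integer>();
        for (int i = 0; i < count; i++) {
            set.add(i);
        }
        return set;
    }

    public static void main(String[] args) throws InterruptedException {
        Set<Integer> set = buildSet(10);
        // Keep the thread, and thus the local variable, alive while
        // the heap dump is being taken.
        while (true) {
            Thread.sleep(1000);
        }
    }
}
```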

As a baseline I used the fact that in Java an int takes 4 bytes. So, for COUNT integers we need 4*COUNT bytes. Then we can calculate the overhead of a given data structure as follows:

Overhead = structure's retained heap / (4 * COUNT)

Please note that Java distinguishes between primitives and objects, and the Collection API operates only on objects. This means the total overhead consists of the overhead of the collection data structure itself plus the overhead of using Integer objects instead of primitives.
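As a quick illustration, this wrapping is invisible in the source code: the compiler autoboxes on the way in and unboxes on the way out. A small sketch (class and method names are mine):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrates the boxing the Collection API forces on primitives: every
// int added to a Set<Integer> is wrapped into an Integer object, which
// carries an object header and alignment padding on top of the 4 bytes
// of actual payload.
public class BoxingDemo {

    static int roundTrip(int value) {
        Set<Integer> set = new HashSet<Integer>();
        set.add(value);               // compiles to set.add(Integer.valueOf(value))
        return set.iterator().next(); // auto-unboxed via Integer.intValue()
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42));
    }
}
```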

Results


So, having settled that, let us measure how big java.util.HashSet really is. In order to do that I used the following code:

Set<Integer> set = new HashSet<Integer>();
int COUNT = 10;
for (int i = 0; i < COUNT; i++) {
  set.add(i);
}


Let us look at the results:
COUNT       Retained heap   Overhead
10          720             18
100         6 960           17.4
1 000       85 424          21.36
1 000 000   88 774 256      22.19

Wow! Just wow! Storing integers in a java.util.HashSet takes about 20(!) times as much memory as the information we are actually storing. This is a HUGE overhead, in my opinion. Taking into account our need to work in really constrained memory conditions, that was unacceptable. We had to find some other way. We started by reviewing our requirements: what do we really need? It turns out that a plain old Java array suits our requirements just fine. Changing my code to this:

int COUNT = 10;
Integer[] array = new Integer[COUNT];
for (int i = 0; i < COUNT; i++) {
    array[i] = i;
} 
yielded the following results:
COUNT       Retained heap   Overhead
10          344             8.6
100         3 224           8.06
1 000       32 024          8.006
1 000 000   32 000 024      8.000006

So far so good. This reduced the overhead by a factor of 2-3 compared to the HashSet. But it is still too large for my taste. So I just replaced Integer[] with int[]:
int COUNT = 10; 
int[] array = new int[COUNT];
for (int i = 0; i < COUNT; i++) {
    array[i] = i;
}

and got:
COUNT       Retained heap   Overhead
10          64              1.6
100         424             1.06
1 000       4 024           1.006
1 000 000   4 000 024       1.000006

Now, that's much better. We have a constant overhead of 24 bytes per array. And implementing the needed operations, such as element addition and duplicate elimination, is as easy as pie.
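For completeness, here is a minimal sketch of how those operations might look on top of a plain int[]. This is my own illustration, not the actual Plumbr code; with only a couple of hundred elements, a linear scan is cheap enough:

```java
import java.util.Arrays;

// A minimal set of ints backed by a plain int[]. Hypothetical sketch,
// not the real Plumbr implementation; suitable when the element count
// stays in the low hundreds, so linear scans are affordable.
public class IntArraySet {

    private int[] elements = new int[16]; // grows on demand
    private int size = 0;

    // Linear search; fine for a couple of hundred elements.
    public boolean contains(int value) {
        for (int i = 0; i < size; i++) {
            if (elements[i] == value) {
                return true;
            }
        }
        return false;
    }

    // Adds the value unless it is already present (duplicate elimination).
    public boolean add(int value) {
        if (contains(value)) {
            return false;
        }
        if (size == elements.length) {
            elements = Arrays.copyOf(elements, size * 2);
        }
        elements[size++] = value;
        return true;
    }

    public int size() {
        return size;
    }
}
```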

Conclusion


The first conclusion is quite obvious: an array of primitives is the most compact data structure. That is not surprising :) What was a really big surprise for me was the magnitude of the overhead that the Java Collection API carries. I hope to keep that in mind next time I choose a structure for my data.

Of course, the Java Collection API will not become deprecated as a result of this post :) And I will certainly use it again and again, as it provides a very easy-to-use API. But in those rare cases when every byte really matters, it is much better to be aware of the overhead discovered here and design your piece of software accordingly.

Another case where this overhead may be important is storing a large amount of data in, for example, a HashSet. With a couple of million elements, the overhead in absolute numbers grows to a couple of hundred megabytes. So the overhead alone requires a significant increase in your heap size.

Opinions expressed by DZone contributors are their own.
