Large, Interconnected, In-Memory Models

Can you easily get machines with enough RAM to store stupendous amounts of data in memory?

Oren Eini · Feb. 07, 19 · Opinion

I got into an interesting discussion about Event Sourcing in the comments of a previous post, and it was interesting enough to warrant a post of its own.

Basically (I’m paraphrasing, and maybe not too accurately), Harry is suggesting a potential solution: hold the model computed from all the events directly in memory. The idea is that you can pretty easily get machines with enough RAM to store stupendous amounts of data in memory. That will give you all the benefits of a rich domain model without any persistence constraints. It is also likely to be faster than any other solution.

And to a point, I agree. It is likely to be faster, but that isn’t enough to make this into a good solution for most problems. Let me point out a few cases where this fails to be a good answer.

If the only way you have to build your model is to replay your events, then that is going to be a problem when the server restarts. Assume a reasonably sized data model of 128GB or so, and assume that building it requires something like 0.5 TB of raw events; we are going to be in a world of hurt. Even assuming no I/O bottlenecks, I believe that it would be fair to state that you can process the events at a rate of 50 MB/sec. That gives us just under 3 hours (0.5 TB at 50 MB/sec is roughly 10,000 seconds) to replay all the events from scratch. You can try to play games here: read in parallel, replay events on different streams independently, etc. But it is still going to take time.
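
To make that startup cost concrete, here is a minimal sketch of the replay step; the Event, EventStore, and AggregateState types are hypothetical stand-ins, not anything from the original discussion:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

// Minimal sketch with hypothetical types (Event, EventStore, AggregateState):
// rebuilding the in-memory model by replaying the full event log on startup.
// The replay cost grows with the size of the log, not the size of the final model.
public class ReplayOnStartup {

    // Hypothetical event: the aggregate it belongs to plus an opaque payload.
    record Event(String aggregateId, byte[] payload) {}

    // Hypothetical event store that exposes the whole log as a stream.
    interface EventStore {
        Stream<Event> readAll();
    }

    // Hypothetical aggregate state, built purely by folding events.
    record AggregateState(long version) {
        static AggregateState apply(AggregateState current, Event e) {
            long next = (current == null) ? 1 : current.version() + 1;
            return new AggregateState(next);
        }
    }

    private final Map<String, AggregateState> model = new ConcurrentHashMap<>();

    // Sequential replay: nothing can be served until every event has been applied.
    public void rebuild(EventStore store) {
        store.readAll().forEach(e ->
            model.compute(e.aggregateId(), (id, state) -> AggregateState.apply(state, e)));
    }
}
```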

And enough time that this isn’t a good technology to have without a good backup strategy, which means that you need to have at least a few of these machines and ensure that you have some failover between them. But even ignoring that, and assuming that you can indeed replay all your states from the events store, you are going to run into other problems with this kind of model.

Put simply, if you have a model that is tens or hundreds of GB in size, there are two options for its internal structure. On the one hand, you may have a model where each item stands on its own with no relations to other items. Or if there are any relations to other items, they are well scoped to a particular root. Call it the Root Aggregate model, with no references between aggregates. You can make something like that work because you have good isolation between the different items in memory, so you can access one of them without impacting another. If you need to modify it, you can lock it for the duration, etc.
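As a rough illustration of why the isolated model is workable, here is a minimal sketch; the Order type and the map-per-aggregate layout are my assumptions, not the post's:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Minimal sketch (hypothetical Order type): in the Root Aggregate model each
// item stands alone, so a change can lock just that one entry while every
// other aggregate stays readable and writable.
public class IsolatedAggregates {

    record Order(String id, long version, String status) {}

    private final Map<String, Order> ordersById = new ConcurrentHashMap<>();

    // ConcurrentHashMap only locks the entry being computed, which is exactly
    // the isolation that "no references between aggregates" buys you.
    public Order update(String orderId, UnaryOperator<Order> change) {
        return ordersById.computeIfPresent(orderId, (id, current) -> change.apply(current));
    }

    public Order get(String orderId) {
        return ordersById.get(orderId);
    }
}
```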

However, if your model is interconnected — so you may traverse between one Root Aggregate to another — you are going to be faced with a much harder problem.

In particular, because there are no hard breaks between the items in memory, you cannot safely/easily mutate a single item without worrying about access from another item to it. You could make everything single-threaded, but that is a waste of a lot of horsepower, obviously.

Another problem with in-memory models is that they don’t do such a good job of allowing you to rollback operations. If you run your code mutating objects and hit an exception, what is the current state of your data?

You can resolve that. For example, you can decide that you have only immutable data in memory and replace that atomically. That…works, but it requires a lot of discipline and makes it complex to program against.
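Here is a minimal sketch of that "immutable data, replaced atomically" approach; the Model type is a hypothetical stand-in:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Minimal sketch of an immutable model swapped atomically. Readers always see
// a complete, consistent snapshot; a failed update never publishes its
// half-built state.
public class ImmutableModelSwap {

    // Hypothetical immutable model: defensive copy on construction, copy-on-write on change.
    record Model(Map<String, String> aggregates) {
        Model { aggregates = Map.copyOf(aggregates); }

        Model with(String id, String value) {
            Map<String, String> next = new HashMap<>(aggregates);
            next.put(id, value);
            return new Model(next);
        }
    }

    private final AtomicReference<Model> current = new AtomicReference<>(new Model(Map.of()));

    // If change.apply() throws, the published model is untouched: rollback for free.
    public Model apply(UnaryOperator<Model> change) {
        return current.updateAndGet(change);
    }

    public Model snapshot() {
        return current.get();
    }
}
```

The cost is exactly the discipline mentioned above: every write has to rebuild (or structurally share) the parts of the model it touches instead of mutating them in place.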

Off the top of my head, you are going to be facing problems around atomicity, consistency, and isolation of operations. We aren’t worried about durability because this is purely an in-memory solution, but if we were to add that, we would have ACID, and that does ring a bell.

The in-memory solution sounds good, and it is usually very easy to start with, but it suffers from major issues when used in practice. To start with, how do you look at the data in production? That is something that you do surprisingly often to figure out what is going on “behind the scenes.” So you need some way to peek into what is going on. If your data is in-memory only and you haven’t thought about how to expose it to the outside, your only option is to attach a debugger, which is…unfortunate. Given the reluctance to restart the server (startup time is high), you’ll usually find that you have to provide some scripting that you can run in the process to make changes, inspect things, etc.
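
One cheap way to get that visibility is a read-only diagnostics endpoint. Here is a minimal sketch using the JDK's built-in HttpServer; the endpoint path and the summary format are assumptions for illustration:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch: a read-only diagnostics endpoint so you can peek at the
// in-memory model in production without attaching a debugger.
public class ModelInspector {

    private final Map<String, Object> model = new ConcurrentHashMap<>();

    public void start() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
        server.createContext("/debug/model", exchange -> {
            // Report only cheap summary figures; streaming hundreds of GB is not an option.
            byte[] body = ("aggregates=" + model.size()).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }
}
```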

Versioning is also a major player here. Sooner or later, you’ll probably put the data inside a memory-mapped file to allow for (much) faster restarts, but then you have to worry about the structure of the data and how it is modified over time.
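
Here is a sketch of what that versioning concern looks like in practice, assuming a simple layout where the first four bytes of the mapped file hold a format version; the layout and version constant are illustrative assumptions:

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal sketch (assumed on-disk layout): memory-mapping the model for fast
// restarts, with an explicit format version in the header so an old snapshot
// is detected and migrated instead of being silently misread.
public class MappedModelFile {

    static final int CURRENT_FORMAT_VERSION = 2; // bumped whenever the layout changes

    public static MappedByteBuffer open(Path file, long sizeInBytes) throws Exception {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {

            MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_WRITE, 0, sizeInBytes);

            int storedVersion = map.getInt(0);      // first 4 bytes: format version
            if (storedVersion != 0 && storedVersion != CURRENT_FORMAT_VERSION) {
                throw new IllegalStateException(
                        "model file is format v" + storedVersion + "; migration required");
            }
            map.putInt(0, CURRENT_FORMAT_VERSION);  // stamp new/compatible files

            // The mapping stays valid after the channel is closed.
            return map;
        }
    }
}
```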

None of the issues I have raised are super hard to figure out or fix, but in conjunction? They turn out to be a pretty big set of additional tasks that you have to do just to be in the same place you were before you started to put everything in memory to make things easier.

In some cases, this is perfectly acceptable. For high-frequency trading, for example, you would have an in-memory model to make decisions as fast as possible as well as a persistent model to query on the side. But for most cases, that is usually out of scope. It is interesting to write such a system though.


Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
