Introducing Infinispan: An Interview With Manik Surtani
James Sugrue: Could you introduce yourself to our readers please?
Manik Surtani: Sure. My name is Manik Surtani. I'm a software hacker, a concurrency and distributed computing enthusiast, and a general-purpose open source proponent. I should also say I'm the founder and project lead of Infinispan. I'm probably better known as the project lead of JBoss Cache, which I've been working on since back in 2006. Infinispan is very new, but JBoss Cache has been around for a while now.
Sugrue: So you've got a lot of experience with JBoss Cache. What exactly is Infinispan?
Surtani: Well, Infinispan is an open source, highly scalable data grid platform. In quick summary, it's an easy-to-use data structure where you can store Java objects and retrieve them later on. Its state is held primarily in memory, which means it's very fast to access. And it uses cutting-edge techniques to ensure thread safety with minimal synchronization and minimal use of approaches such as locks, which translates to very high performance on multicore or multi-CPU systems.
And Infinispan is distributed as well - that's actually one of the more interesting features. Distribution is optional; it still works great as a local, in-VM cache. But with distribution, data is accessible from anywhere in the data grid. You can address the memory of the entire grid from any instance. For example, say you configure your distribution behavior to maintain one replica of each object for redundancy, and you have a grid of, say, 100 blades, each with a two-gigabyte heap. You can now collectively store 100 gigabytes of state in memory in just one data structure.
So even if your local JVM is only configured with a two-gigabyte heap, you can now address 100 gigabytes of state across the entire data grid. Even when Infinispan has to retrieve data from across the grid for you, it's still significantly faster than going to a disk, particularly under high load with a lot of concurrent access, as everything happens in memory.
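The 100-gigabyte figure follows from simple arithmetic: 100 nodes with 2 GB each give 200 GB of raw heap, and keeping one extra copy of every object halves what is usable. A minimal sketch of that calculation follows. The class and method names are hypothetical and not part of any Infinispan API, and the calculation assumes the whole heap is available for cached state, which a real JVM would not allow.

```java
/** Illustrative arithmetic only; names are hypothetical, not an Infinispan API. */
final class GridCapacity {

    /**
     * Usable state in gigabytes when every object is stored once plus
     * {@code replicas} additional copies, assuming the entire heap holds data.
     */
    static long usableGigabytes(int nodes, long heapGbPerNode, int replicas) {
        if (nodes <= 0 || heapGbPerNode <= 0 || replicas < 0) {
            throw new IllegalArgumentException("nodes and heap must be positive, replicas non-negative");
        }
        long rawGb = (long) nodes * heapGbPerNode; // total heap across the grid
        return rawGb / (replicas + 1);             // each object occupies (replicas + 1) slots
    }

    public static void main(String[] args) {
        // 100 blades, 2 GB heap each, one replica per object, as in the interview.
        System.out.println(usableGigabytes(100, 2, 1) + " GB usable");
    }
}
```

With no replicas the same grid would hold the full 200 GB, at the cost of losing data whenever a node fails.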
Sugrue: So you were saying that you can use Infinispan on its own, and that you can use it across distributed grids. So as someone who isn't too familiar with grid programming, does that mean that the same program that I write for a standalone can easily work across a distributed environment?
Surtani: It means that it can, yes, but in general I would not advise people to do that, because there are always extra concerns, and you always want to tune an application to work on a grid. So you try to minimize the stuff that will potentially be going across the wire, and things like that. Whereas you have no such concerns at all in the standalone case.
Sugrue: What are the motivations behind the project?
Surtani: In terms of motivation, basically, I've always felt the community has been crying out for a completely open source player in the data grid space. So far the only really good ones have been commercial, and usually very expensive. And the few open source offerings really haven't been up to scratch, JBoss Cache included. JBoss Cache was never intended to be a data grid, but people have been trying to use it as one. And that's when I realized people actually do need one.
Sugrue: So JBoss Cache wasn't being used quite as it was meant to be used. What is JBoss Cache's primary reason for existence?
Surtani: JBoss Cache is actually a clustering tool, because it can help you cluster an application. There's a fine line between a cluster and a grid. When people talk about clusters, they're usually talking about five to 10 nodes, maybe up to 20 or 30 nodes, and that's really what JBoss Cache focuses on. Whereas with a grid, you are talking hundreds, maybe even thousands, of nodes. And when you talk about that difference in scale, the fundamental way you architect that data grid is going to be different.
Sugrue: So while we're talking about the data grid, you say, it's across hundreds or thousands of nodes? How many nodes would a typical data grid have in your experience?
Surtani: In theory, yes. It should begin to scale up to that. For a typical data grid, I would say hundreds. It would definitely be in the hundreds.
Sugrue: When we speak of a node in the data grid, does it mean a PC on its own or a server, or does it mean something else?
Surtani: Well specifically, I talk about a JVM. So you're talking about one JVM per virtual server and these are typically virtual appliances, especially when we talk about clouds and cloud computing. It's essentially virtual images, and you can have one JVM per virtual image.
Sugrue: How is grid computing different from cloud computing, or is there a difference?
Surtani: Yes, actually. I mean, a lot of these terms are pretty ambiguous, and people tend to use them interchangeably and very often wrongly. So what I'm going to say is my interpretation of these terms, because I don't believe there's a real right or wrong. When I say a data grid, I essentially refer to a distributed data structure. Something that's fast, has low latency, is accessible from anywhere, and is very highly scalable. And it grows as you add more nodes to the grid.
Now the cloud, on the other hand, I feel tends to refer to on-demand provisioning of computing resources. It's lower down, lower level. A cloud really has to do with computing resources: it tends to take an operating system, disk, and networking stack into account. And like I said earlier, the current trend is to use virtual appliances for these. So you can bring them up as and when you need them, on demand. You can shut them down on demand, things like that.
There are some common characteristics. Both data grids and clouds tend to be homogeneous and preconfigured. They both tend to need to be able to do self-discovery, which is important because as you bring new nodes up, they need to be able to find the grid or the rest of the cloud. And data grids tend to be used on top of cloud appliances, and often in conjunction with compute grids as well.
Sugrue: It all sounds very low level. I'm just wondering how did you get interested in this stuff? Perhaps it's not as glamorous as other areas of computing?
Surtani: Well, I've always been interested in distributed computing in general. So whether it was taking a job and breaking it up, and having it processed concurrently by multiple systems at the same time, and things like that. And that's a classic compute grid. Then, in a sense, the way I specifically got into data grids was really due to JBoss Cache. It was watching people try to make it work as a data grid and thinking, you know what, I can actually make this work if I just change a few things.
Sugrue: It's interesting to find how you got to where you are. So when is the right time to use a grid?
Surtani: So if you are using a compute grid, for example, you almost certainly want a data grid. If you've got a scheduling system that takes jobs in and farms them out across a large number of CPUs, depending on who's available to process things, you don't want all of these CPUs in the compute grid hitting the same database or something like that. Because that will almost certainly defeat any purpose, or any gain that you would have had from using a compute grid. So you definitely want to be using a data grid there.
But there are many other reasons to use a data grid as well. For example, if you wish to cloud-enable your application. If you want your application to be able to have new instances spawned on demand, on a cloud, and things like that, then you tend to need to be able to share state between them. Because there's no point in an application instance coming up not knowing what it's doing, or where the others are, or what they are up to. In which case, if you need to share state, a data grid is definitely the easy way to do it.
People also use data grids as replacements for databases, or sometimes as complements to databases, because they offer much lower latency and they're much faster to access.
Sugrue: When would you say absolutely not to use a grid?
Surtani: I've seen people who tend to try and use a data grid, or any shared data structure, as their messaging system. And that's definitely wrong. A messaging system has a completely different access pattern and solves a whole different problem. And that's what JBoss Messaging is there for; JBoss Messaging is an excellent distributed JMS server, for example. But trying to use a data grid to pass messages around is definitely the wrong thing to do. And that's a very common thing I've seen.
Sugrue: So I believe you're involved in JSR-107. How does Infinispan differ from that specification?
Surtani: Yes, I'm involved with JSR-107. Infinispan does actually implement JSR-107, or rather it will implement it when the JSR finally completes. Until then, we are tracking the latest developments on the JSR. We've got the latest snapshots of the interfaces. We also implement all the optional bits of the JSR, which include JTA compatibility and clustering, and so on and so forth.
Sugrue: Is this your first time being involved in a JSR? How are you finding that whole process?
Surtani: Yes, it is. It's interesting, let's put it that way.
Sugrue: How are multi-core processing, parallel programming and cloud computing changing the way that we think about scalability and availability in our Java applications?
Surtani: Well, I have written about this a fair deal. Concurrent hardware threads are very popular now, with Sun Niagara and things like that. This means that any form of mutual exclusion you might use in your software, which might include Java synchronized blocks or locks, severely limits the performance gains that all this really cool modern hardware has to offer. You'll always be hampered by the sequential processes in your system. And as the system scales up, this problem becomes exponentially worse. So you might think you've solved the problem by adding more nodes into a grid. You're actually causing more of a problem if you have sequential stuff going on somewhere in that grid.
The classic example is a single shared database in a cluster. Imagine that cluster growing: you keep putting more nodes in there, more web servers, more application servers, and that database is a bottleneck that is going to get worse and worse. Most of the cluster is going to be idle, waiting for the database to catch up.
So we need to actually minimize any code or any process that may cause any sort of serial processing in the entire system. More often than not, that is the database or a file system. That's just the way they are. That's just the way they were designed. They were designed for a different era; they weren't designed for a massively multi-core or massively parallel world. Using in-memory data grids instead of a disk or a database is one way to do this, because memory and networks are inherently parallel and support a lot of concurrent threads, and good data grid software can make all the difference.
Sugrue: What are the next steps for Infinispan? What's coming up in the future releases?
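The cost of serial sections described in this answer is commonly quantified with Amdahl's law, which the interview does not name but which captures the same idea: if a fraction s of the work must run sequentially, the best possible speedup on n workers is 1 / (s + (1 - s) / n). The sketch below is only an illustration of that formula; the class name and the 10% serial fraction are assumptions chosen for the example, not measurements of any real system.

```java
/** Amdahl's law as a small calculator; the serial fraction used in main is an assumed example value. */
final class AmdahlSpeedup {

    /**
     * Upper bound on speedup when {@code serialFraction} of the work cannot be
     * parallelized and the rest is spread evenly over {@code workers}.
     */
    static double speedup(double serialFraction, int workers) {
        if (serialFraction < 0.0 || serialFraction > 1.0) {
            throw new IllegalArgumentException("serialFraction must be between 0 and 1");
        }
        if (workers < 1) {
            throw new IllegalArgumentException("workers must be at least 1");
        }
        return 1.0 / (serialFraction + (1.0 - serialFraction) / workers);
    }

    public static void main(String[] args) {
        double serialFraction = 0.10; // assumption: 10% of the work waits on a shared resource
        for (int workers : new int[] {1, 10, 100, 1000}) {
            System.out.printf("%4d workers -> %.2fx speedup%n",
                    workers, speedup(serialFraction, workers));
        }
    }
}
```

With a 10% serial portion, 100 workers give a speedup of roughly 9.2x, and no number of workers can push it past 10x, which is why removing the serialized resource matters more than adding nodes.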
Surtani: We've got some very cool stuff coming up in the future releases.
But first I am trying to nail down and finalize the first public release, the first stable release of Infinispan, which will include full distribution based on a consistent hash algorithm, which is quite interesting.
As the next step, we are planning a memcached-compatible server module, which means that anybody using a memcached client can make use of it. Memcached is very popular, and it's got a lot of open source clients written in many different languages, not just Java: there are clients in C and C# and Python and Ruby. So this means that anybody writing software on any of these platforms can actually use Infinispan; it's not just a Java platform anymore. And that's going to be interesting, and I'm very keen to see how that goes.
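The interview does not describe Infinispan's actual algorithm, so what follows is only a generic sketch of consistent hashing, the family of techniques mentioned here. Nodes are placed at several points on a hash ring, and a key belongs to the first node point at or after the key's hash, wrapping around at the end. The class name, the use of virtual nodes, and the FNV-1a hash are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

/** Generic consistent-hash ring; a sketch, not Infinispan's implementation. */
final class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        if (virtualNodes <= 0) {
            throw new IllegalArgumentException("virtualNodes must be positive");
        }
        this.virtualNodes = virtualNodes;
    }

    /** 32-bit FNV-1a hash: simple and deterministic, chosen only for illustration. */
    static int hash(String s) {
        int h = 0x811c9dc5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    /** Places the node at several ring positions; an already occupied position keeps its owner. */
    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.putIfAbsent(hash(node + "#" + i), node);
        }
    }

    /** Removes only the ring positions that this node actually owns. */
    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "#" + i), node);
        }
    }

    /** Owner is the first node position at or after the key's hash, wrapping to the start. */
    String ownerOf(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes in the ring");
        }
        Map.Entry<Integer, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(64);
        ring.addNode("node-a");
        ring.addNode("node-b");
        ring.addNode("node-c");
        System.out.println("user:42 lives on " + ring.ownerOf("user:42"));
    }
}
```

The property that makes this attractive for a grid is that removing a node only reassigns the keys that node owned; every other key stays where it was, so membership changes move a limited share of the data rather than reshuffling everything.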
Apart from that we're also planning a searchable interface, to be able to search for stuff across the grid. We hope to also have a JTA interface at some point.
Sugrue: It sounds like there's a lot on the horizon. And when is this first stable release planned for? Do you have a date in mind?
Surtani: Well, we don't have a hard date. We're currently very, very active in development. We're cutting alphas literally every two or three weeks. I am expecting the first beta in about two weeks. Based on feedback on that, we will determine when the actual release goes out. It should be by mid-June or the summer.
Sugrue: Are you seeing a lot of commercial interest in your project or a lot of people using it at the moment?
Surtani: At the moment, no. It still is in alpha right now and we'd be surprised if anybody tried to use it in any production capacity, but we have seen a lot of interest in people downloading it and trying it out, providing feedback, playing around with it which bodes well.
Sugrue: And do you need to be using JBoss to use Infinispan?
Surtani: Not at all. Not at all. Like I was saying, in the future you won't even need to use Java to use Infinispan.