Data-as-a-Service: The data fabric for clouds
Never before have data storage and retrieval been so much at the forefront of what's new and exciting in technology, and the field is long overdue for a major update. Consider: how much have filesystems or relational databases really changed over the last two decades?
Web-2.0: that much overused, overloaded and overhyped term is also over here! And everywhere. Traditionally considered the darling of teenage PHP hackers sitting in their bedrooms, Web-2.0 has spawned a whole class of massively distributed, socially spread applications which impose unprecedented load on data storage systems. With AJAX and other such rich client technologies, expectations of very snappy UIs have imposed speed and low-latency constraints on data access. Moreover, with the proliferation of rich media - video, high-res photographs - sheer data volumes have exploded, and will continue to do so. And finally, with rich mobile, connected platforms - think smart phones, tablets, netbooks - expectations of data availability previously seen only in highly specialist systems, like those of telcos and certain military functions, are now prevalent across a much wider class of applications.
What's interesting is that such styles of interaction are no longer restricted to kids stalking their best friends' girlfriends on Facebook. Because such interaction can be so sticky and viral, it is leaking into the mainstream: it gives users a sense of ownership, of being "involved". Take SalesForce's Chatter features, for example. Or the fact that most corporate intranets have social features such as being able to share bookmarks or "moods". Even a financial services company I use to invest in mutual funds offers rich, visual mechanisms to compare funds, and then "share" my opinions with "friends". Then there are news and media sites, where possibly more interesting information can be gleaned from user comments and exchanges than from the news items themselves.
Back to technology. Where has this left us? Massive volume, high spikes in usage, demands for low-latency access, and 24x7 high availability of our data storage systems: all of them characteristics that trouble traditional RDBMSs. So the world turns to a technology called the in-memory data grid.
In-memory Data Grids
In-memory data grids aren't new; they have been around for a while. They have been employed in large-scale or mission-critical applications to take load off RDBMSs as applications grow to cope with an explosion of data and an even greater explosion of expectations and requirements. You may call such use of data grids - as a distributed cache - a pacemaker attached to the ageing RDBMS.
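To make the "distributed cache in front of the RDBMS" idea concrete, here is a minimal sketch of the usual cache-aside read path. It is an illustration under loud assumptions, not any particular product's API: the class `CacheAsideStore` is hypothetical, a plain dictionary stands in for the data grid, and another dictionary stands in for the slow database.

```python
from typing import Callable, Dict, Optional


class CacheAsideStore:
    """Hypothetical cache-aside wrapper; a dict stands in for the data grid."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]]) -> None:
        self._load_from_db = load_from_db  # assumed slow lookup against the RDBMS
        self._grid: Dict[str, str] = {}    # stand-in for a distributed in-memory grid
        self.db_reads = 0                  # counts how often the database is hit

    def get(self, key: str) -> Optional[str]:
        # Serve from memory when possible.
        if key in self._grid:
            return self._grid[key]
        # Cache miss: fall back to the database and remember the answer.
        self.db_reads += 1
        value = self._load_from_db(key)
        if value is not None:
            self._grid[key] = value
        return value

    def invalidate(self, key: str) -> None:
        # Call after the database row changes so the next read is fresh.
        self._grid.pop(key, None)


# Usage: a dict plays the part of the database.
database = {"user:1": "alice"}
store = CacheAsideStore(database.get)
first = store.get("user:1")   # miss, goes to the database
second = store.get("user:1")  # hit, served from memory
```

Only the first read reaches the database. Every miss and every write still does, though, which is why the RDBMS remains the bottleneck as load grows.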
But even this isn't enough as the RDBMS still becomes the ultimate bottleneck, particularly in truly elastic cloud-style deployments where virtual nodes are brought up - and down - on demand, to most efficiently cope with load at any given moment. And this is where NoSQL comes in.
The world of NoSQL
NoSQL is currently as undefined and unstructured a term as the data organisation style it promotes. The whole ethos behind NoSQL is unstructured, elastic data. Highly available, massively scalable, and above all, distributed. As dictated by Eric Brewer's CAP theorem, consistency is weakened to gain such high availability in most NoSQL systems, resulting in what is termed Eventual Consistency. You may have heard of better-known implementations such as Google's BigTable and Apache Hadoop over the last few years, and these have definitely raised the bar in practical scalability and set the standard for NoSQL. Increasingly, however, developers are getting comfortable experimenting with bleeding-edge solutions emerging from the current Cambrian explosion of NoSQL projects.
But is that enough? NoSQL is useful in many cases, but most disk-based NoSQL systems focus on massive volume and throughput rather than on fast, low-latency access. There is a subspecies where NoSQL databases have been crossed with in-memory data grids to provide fast, low-latency access to in-memory data. Such systems include Amazon's Dynamo and the open source projects Voldemort and Infinispan.
Data clouds: DaaS
Taking things even further, let's look at that other occasionally overhyped and overloaded term, cloud computing. Clouds have been popularised by the massive economies of scale, ease of use and high degree of hardware utilisation realised by infrastructure (IaaS) and platform (PaaS) services, whether public or private. And this approach could be applied to data storage as well. Imagine a service that you connect to in a platform-independent manner, to store and retrieve data, not too dissimilar from a traditional RDBMS connection. Except that you don't need to know or care about what sort of system is used to store your data. And you have guarantees of low latency, high scalability and availability.
This virtualisation of storage systems would be necessary in any cloud deployment to complement elastic, on-demand infrastructure and middleware, and would be backed by technology exhibiting the desirable characteristics of elasticity, low latency, distribution, etc. This would typically involve a mix of NoSQL solutions.
Any such solution would need to provide multi-tenancy, metering and health monitoring capabilities, even if it were only to be used for an internal, private cloud. Multi-tenancy would be necessary to isolate data from different applications, perhaps using some form of namespacing. Metering would be crucial to appropriately apportion the shared cost of running the service, and would entail keeping track of CPU cycles consumed, disk and memory space consumed, and possibly even bandwidth.
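As a rough illustration of how namespacing and metering might fit together, here is a small in-process sketch. Everything in it is an assumption made for the example: `TenantDataService` is a hypothetical name, tenants are isolated simply by prefixing each key with a tenant identifier, and the meter counts only requests, bytes stored and bytes transferred (CPU accounting is left out).

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple


class TenantDataService:
    """Hypothetical multi-tenant key/value service with per-tenant metering."""

    def __init__(self) -> None:
        # Keys are (tenant, key) pairs, so tenants can never see each other's data.
        self._data: Dict[Tuple[str, str], bytes] = {}
        self._usage: Dict[str, Dict[str, int]] = defaultdict(
            lambda: {"requests": 0, "bytes_stored": 0, "bytes_transferred": 0}
        )

    def put(self, tenant: str, key: str, value: bytes) -> None:
        meter = self._usage[tenant]
        previous = self._data.get((tenant, key))
        if previous is not None:
            meter["bytes_stored"] -= len(previous)  # overwriting frees the old value
        self._data[(tenant, key)] = value
        meter["bytes_stored"] += len(value)
        meter["bytes_transferred"] += len(value)    # inbound bandwidth
        meter["requests"] += 1

    def get(self, tenant: str, key: str) -> Optional[bytes]:
        meter = self._usage[tenant]
        meter["requests"] += 1
        value = self._data.get((tenant, key))
        if value is not None:
            meter["bytes_transferred"] += len(value)  # outbound bandwidth
        return value

    def usage(self, tenant: str) -> Dict[str, int]:
        # Return a copy so callers cannot tamper with the meter.
        return dict(self._usage[tenant])


# Usage: two applications share the service and reuse the same key.
service = TenantDataService()
service.put("crm", "greeting", b"hello")
service.put("billing", "greeting", b"bonjour")
crm_value = service.get("crm", "greeting")
missing = service.get("crm", "invoice")
crm_usage = service.usage("crm")
```

The two tenants never see each other's `greeting`, and the per-tenant counters give the operator something to bill against. A real service would enforce isolation and collect metrics at the infrastructure level rather than in application code.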
Amazon offers such a service, called SimpleDB. Google's AppEngine provides its DataStore API around BigTable. But apart from these two proprietary offerings, the current landscape is bleak. Budding cloud service providers - or vendors providing software to cloud service providers - will need to think hard about such data services and ensure they are part of their offering.
What does this mean for application developers?
Thankfully most people don't directly interact with databases as much as they used to, and ORM tools like Hibernate, an implementation of the JPA standard, have added a layer of abstraction above RDBMSs. I expect plugins for such ORM tools to ease the transition from RDBMSs to elastic cloud storage services. The fledgling SimpleJPA project offering JPA capabilities atop Amazon SimpleDB is one, and Infinispan's upcoming JPA interface is another. And while I expect more to come as DaaS becomes more prevalent, I don't expect any of these to be a silver bullet allowing truly transparent migration. I believe developers will still want to take a long, hard look at their application data and understand why they store things the way they do. Certain RDBMS-specific assumptions are often implicit in the data models we design, simply because RDBMS technology is so well ingrained in the way we think about data. These assumptions can often hinder a move to non-RDBMS storage, or make it less efficient.