The Future of Data Protection
At Datos IO, we brainstorm about future trends in data protection and how protection has to evolve from its cornerstone of guarding against user errors and logical corruptions to embrace the future of computing. We brainstorm with our instincts, our customers, and our technical advisors, and we keep a close watch on where the market is headed.
We identified enterprise infrastructure trends very early in our startup's life, more than two years ago, and we have kept them in mind ever since. We designed our architecture not only for our initial beachhead products but also for the changes that are permeating the enterprise: the rise of cloud computing and of complex applications dealing with multi-faceted data. In this post, I will describe two interesting trends that are influencing our vision and how we translate those trends into our product. I will specifically tackle these items:
- The emergence of multi-cloud as an approach to combat vendor lock-in.
- The rise of polyglot persistence to capture multi-structured data and their varying manipulation methods.
Multi-cloud: The Danger of Data Lock-In
Enterprises rarely think about cloud portability before transferring large amounts of data to a cloud service provider such as Amazon AWS, Microsoft Azure, or Google Cloud Platform. They are more likely to ask about usage by other players in the industry (“the herd effect”), the reliability of the cloud service, and the durability of data, among other things. Enterprises are starting to realize that as they store increasing amounts of data in the cloud, it is easy to import data but hard to get it out. While sophisticated end users may be able to solve the problem, most users run the risk of locking their databases, the “crown jewels”, into the cloud provider.
A common solution to the cloud portability issue is to develop an abstraction layer. An abstraction layer between an application and the cloud data service minimizes the amount of code restructuring required should an enterprise wish to migrate to a new cloud service. One example is the Amazon S3 API: Google Cloud Storage offers an S3-compatible API to make it easy for enterprises to migrate their applications. The open source community is also working to make migration seamless. For example, the Python Boto library supports both the Amazon S3 service and the Google Cloud Storage service under the same abstraction layer.
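To make the idea of an abstraction layer concrete, here is a minimal sketch. It deliberately avoids any real provider SDK: `ObjectStore`, `InMemoryStore`, `save_report`, and `load_report` are hypothetical names invented for illustration, and the in-memory backend merely stands in for a class that would wrap a provider's client library.

```python
from typing import Dict, Protocol, Tuple


class ObjectStore(Protocol):
    """Hypothetical minimal interface that application code depends on."""

    def put(self, bucket: str, key: str, data: bytes) -> None: ...

    def get(self, bucket: str, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in backend; a real backend would wrap a provider SDK instead."""

    def __init__(self, provider: str) -> None:
        self.provider = provider
        self._objects: Dict[Tuple[str, str], bytes] = {}

    def put(self, bucket: str, key: str, data: bytes) -> None:
        self._objects[(bucket, key)] = data

    def get(self, bucket: str, key: str) -> bytes:
        return self._objects[(bucket, key)]


def save_report(store: ObjectStore, text: str) -> None:
    # Application logic only sees the interface, never the provider.
    store.put("reports", "latest.txt", text.encode("utf-8"))


def load_report(store: ObjectStore) -> str:
    return store.get("reports", "latest.txt").decode("utf-8")


if __name__ == "__main__":
    for backend in (InMemoryStore("provider-a"), InMemoryStore("provider-b")):
        save_report(backend, "quarterly numbers")
        print(backend.provider, load_report(backend))
```

Because `save_report` and `load_report` depend only on the interface, switching providers means swapping the backend object rather than rewriting application code.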
However, while an abstraction layer solves the application migration issue, what about the “data” itself? How easy would it be to migrate swathes of data from one cloud service provider to another? Let us do some simple mathematics on this subject. For example, let us assume that you would like to migrate 10 TB of bucket data from Amazon S3 to Google Cloud Storage. How long do you think it will take? The answer may surprise you: almost a month by reasonable estimates! The sheer operational cost of managing the data transfer over a month would dwarf the storage and network bandwidth charges accrued with both cloud providers. It might be cheaper to transfer the contents of an Amazon S3 bucket onto a local hard drive, ship the drive to Google, and ask them to upload the drive contents into a Google Cloud Storage bucket.
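The arithmetic behind such an estimate can be sketched as follows. The post does not state the throughput it assumed, so the 30 Mbit/s figure below is purely an illustrative assumption (as is the use of decimal terabytes); at that sustained end-to-end rate, 10 TB takes roughly 31 days.

```python
def transfer_days(size_tb: float, throughput_mbps: float) -> float:
    """Days needed to move size_tb terabytes at a sustained throughput.

    Assumes decimal units: 1 TB = 10**12 bytes, 1 Mbit/s = 10**6 bits/s.
    """
    total_bits = size_tb * 1e12 * 8
    seconds = total_bits / (throughput_mbps * 1e6)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # Assumed rates for illustration only; real throughput depends on
    # object sizes, API rate limits, parallelism, and the network path.
    for mbps in (30, 100, 1000):
        print(f"{mbps:>5} Mbit/s -> {transfer_days(10, mbps):.1f} days")
```

The estimate is linear in both size and rate, so a transfer that sustains ten times the throughput finishes in a tenth of the time.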
How would someone solve the multi-cloud data lock-in problem? Some simple, albeit very inefficient, options come to mind: Would you run copies of applications on multiple clouds and keep the data in sync? Would you maintain one cloud as the primary and other clouds as passive secondaries? Would you trust one cloud for tier-0 applications and another cloud for tier-2 or tier-3 applications? One of the key issues is data transfer time. At Datos IO, we use intelligent change capture techniques to minimize the amount of data sent over the WAN links that connect multiple cloud providers. For example, you can take a version of an Apache Cassandra database running in AWS, store the version in Google Cloud Storage, and restore the version to an Apache Cassandra database running in Microsoft Azure. This is the future of data protection: not only does it allow you to protect against user errors and logical corruptions, but it also gives you protection across clouds to provide insurance against lock-in to a single cloud service provider.
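As a generic illustration of why change capture reduces WAN traffic, the sketch below compares per-chunk hashes of two versions and ships only the chunks that differ. This is not Datos IO's actual technique, which the post does not describe; the function names and the tiny fixed chunk size are assumptions chosen for readability.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4  # unrealistically small, purely for illustration


def chunk_digests(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """SHA-256 digest of each fixed-size chunk of data."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def changed_chunks(old: bytes, new: bytes,
                   chunk_size: int = CHUNK_SIZE) -> Dict[int, bytes]:
    """Map chunk index -> chunk bytes for chunks of new that differ from old."""
    old_digests = chunk_digests(old, chunk_size)
    delta: Dict[int, bytes] = {}
    for index, start in enumerate(range(0, len(new), chunk_size)):
        chunk = new[start:start + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if index >= len(old_digests) or old_digests[index] != digest:
            delta[index] = chunk
    return delta


def apply_delta(old: bytes, delta: Dict[int, bytes], new_length: int,
                chunk_size: int = CHUNK_SIZE) -> bytes:
    """Rebuild the new version on the receiving side from old plus delta."""
    chunks = []
    for index, start in enumerate(range(0, new_length, chunk_size)):
        chunks.append(delta.get(index, old[start:start + chunk_size]))
    return b"".join(chunks)


if __name__ == "__main__":
    old = b"aaaabbbbcccc"
    new = b"aaaaXXXXccccdd"
    delta = changed_chunks(old, new)
    sent = sum(len(c) for c in delta.values())
    print(f"sent {sent} of {len(new)} bytes")
    assert apply_delta(old, delta, len(new)) == new
```

Fixed-offset chunking is the simplest variant and handles in-place updates and appends; an insertion in the middle of the data shifts every later chunk, which is one reason real systems use more sophisticated schemes.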
Application Recovery Management: The Rise of Polyglot Persistence
Enterprises are using a variety of data stores to meet the needs of a varied set of applications and their access patterns. Relational data stores might be the right choice for multi-faceted normalized data, but data stores today have to cater to widely varying access patterns. For example, a relational database cannot keep up with the high volume of a social media stream, nor can a NoSQL data store perform joins efficiently on a normalized data set. This approach is referred to as “polyglot” persistence, where composite applications are spread over multiple data stores for varying types of data and methods of data manipulation. In earlier days, data architects used to choose a relational data store such as Oracle and then map the application to this data store. Today, the landscape has changed significantly: data architects first classify the data types and expected manipulation methods and then choose the appropriate data store to fit the needs.
Let us use the example where an enterprise needs to identify all the customers in the past year, their purchase characteristics, the customer acquisition method, the social media comments on the post-purchase experience, and any support interactions. Remember that in today’s information age, the processing has to be real-time: it would be very imprudent for a business to react to a negative social media reference months after it happened. As one can guess, multiple sources of data with varying degrees of structure need to be collected and analyzed to meet these basic requirements of the enterprise. This type of problem cannot be solved easily or cost-effectively with one type of database technology. Even though some of the basic information is transactional and probably in a relational data store, the other information is non-relational and will require a few different types of persistence engines: document stores for static images and text collateral, spatial stores for geo-locating mobile customer interactions, and graph stores for deriving relations between the multi-faceted data described above.
But has data protection kept up? Remember that the data is highly correlated, with complex interdependencies between the data stores. A logical error in one data store is likely to propagate to others, so recovery has to span the entire spectrum of polyglot persistence. Today’s data protection is highly siloed, with recovery limited to a single type of data store. This is clearly not enough! What one has to do is synchronize versioning across multiple data stores according to the needs of the application; the result is an application-consistent polyglot version. The key advantage of such a version is that it allows us to recover an entire application rather than sub-parts of it. We at Datos IO refer to this as application recovery management and are working hard to realize this vision in the near future.
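One way to picture an application-consistent polyglot version is as a manifest that ties together one snapshot per data store under a single label, so that recovery rolls every store back together. The sketch below is a toy model under stated assumptions, not Datos IO's design: `ToyStore` and `PolyglotVersioner` are hypothetical names, the stores are in-memory dictionaries, and it assumes no writes occur while a version is being created (a real system would need to coordinate or quiesce writers).

```python
import copy
from typing import Any, Dict, Iterable


class ToyStore:
    """In-memory stand-in for a relational, document, or graph store."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, Any] = {}
        self._snapshots: Dict[int, Dict[str, Any]] = {}

    def snapshot(self) -> int:
        snap_id = len(self._snapshots) + 1
        self._snapshots[snap_id] = copy.deepcopy(self.data)
        return snap_id

    def restore(self, snap_id: int) -> None:
        self.data = copy.deepcopy(self._snapshots[snap_id])


class PolyglotVersioner:
    """Groups per-store snapshots under one application-level version label."""

    def __init__(self, stores: Iterable[ToyStore]) -> None:
        self.stores = {store.name: store for store in stores}
        self.manifests: Dict[str, Dict[str, int]] = {}

    def create_version(self, label: str) -> Dict[str, int]:
        # Assumes the application is quiesced while snapshots are taken.
        manifest = {name: s.snapshot() for name, s in self.stores.items()}
        self.manifests[label] = manifest
        return manifest

    def recover(self, label: str) -> None:
        # Restore every store to the snapshot recorded for this version.
        for name, snap_id in self.manifests[label].items():
            self.stores[name].restore(snap_id)


if __name__ == "__main__":
    relational, graph = ToyStore("relational"), ToyStore("graph")
    versioner = PolyglotVersioner([relational, graph])
    relational.data["order:1"] = "paid"
    graph.data["customer:1"] = ["order:1"]
    versioner.create_version("v1")
    relational.data["order:1"] = "corrupted"   # logical error...
    graph.data["customer:1"] = []              # ...propagates to another store
    versioner.recover("v1")
    print(relational.data, graph.data)
```

Recovering by label rather than store by store is the point of the sketch: the manifest keeps the interdependent stores from being restored to mismatched points in time.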
It is time to rethink and reinvent data protection. The world of enterprise IT is experiencing a massive change as cloud becomes the de facto infrastructure of choice and multi-faceted big data becomes the hallmark of the next generation of applications. Data protection must adapt to these changing environments and deliver the next generation of services to the enterprise. We at Datos IO are proud to be in the vanguard, delivering the next generation of data protection technologies.