Applications are one of the key drivers of innovation and growth in enterprises. They also influence the entire infrastructure stack, from database to compute, storage, data management, and data protection. The next-generation applications that address the world of social and mobile are high-volume, real-time, and high-ingestion in nature. As shared earlier, most if not all next-generation applications are deployed on scale-out and cloud databases such as Apache Cassandra, AWS DynamoDB, and MongoDB. As enterprises look to shorten development cycles via continuous development, they face several challenges in developing and testing such web-scale applications on data-centric infrastructure. These challenges are exacerbated by multiple release cycles and the business drivers for continuous integration and deployment. In this blog we will look at some of those challenges and the opportunities they present.
Application Development on Cassandra
Application teams rely on non-production or staging environments for development and testing of new features. However, application developers need these environments to contain the latest data from production. A lack of up-to-date data hinders the development and testing of new features. The most striking challenges in “refreshing” the data come from two fundamental limitations of distributed databases:
- Lack of native support for cluster-consistent versions
  - Distributed scale-out databases are designed to be eventually consistent. There is no native support for capturing a cluster-consistent snapshot of the database.
- Limited capability to version the database and restore it into a destination cluster that may have an unlike topology
  - Native tools cannot naturally handle cluster topology changes or node failures.
So, what do application developers do today? They end up maintaining standalone replicas (leading to increased infrastructure costs), working with outdated data sets (leading to suboptimal test quality), and relying on an army of operations staff to manage the flow of data across their environments and release cycles (leading to longer production cycles and delayed releases).
Test Refresh Cycle
A typical data refresh of a Cassandra cluster involves the following steps:
- Take a ‘snapshot’ on all nodes of the cluster
- Move all the files to the target cluster
  - Because of replication, roughly 3x the logical data is moved per environment
- Insert data into the target environment
  - Due to replication, almost 3x the data is loaded into the cluster
- Run repair on every target environment, which typically takes multiple hours to days
  - The target environment is typically not available to applications until the repair has finished.
As shared above, three times the actual data is moved, ingested, and repaired for every environment that is refreshed. Meanwhile, developers and testers are waiting for their environment to be refreshed! Not only does this put unnecessary strain on the infrastructure, it also lengthens the refresh cycle considerably. Operations teams have to go through this tedious, error-prone, and time-consuming process for every non-production environment, for every refresh cycle, for every release!
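To make the amplification concrete, here is a minimal sketch of the arithmetic behind a refresh cycle. It is an illustrative model only, not a measurement: the function name and the example inputs (1 TB of logical data, a replication factor of 3, 4 environments) are assumptions chosen for the illustration.

```python
def refresh_volume_tb(logical_tb: float, replication_factor: int, environments: int) -> dict:
    """Estimate the data handled in one refresh cycle (illustrative model only)."""
    # Node-level snapshots capture every replica, so each environment
    # receives roughly logical size x replication factor.
    per_env = logical_tb * replication_factor
    return {
        "per_environment_tb": per_env,
        "all_environments_tb": per_env * environments,
        "repair_runs": environments,  # one repair per refreshed environment
    }


if __name__ == "__main__":
    # Hypothetical inputs: 1 TB logical data, replication factor 3, 4 environments.
    print(refresh_volume_tb(1.0, 3, 4))
```

The point of the model is that both the data volume and the number of repair runs grow linearly with the number of environments, so every new test/dev environment adds the full cost again.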
Case Study: Online Retail Customer
A retail customer we spoke to has a 12-node Cassandra cluster in production with 10 downstream application development environments. Their current cluster is about 1 TB in size. They execute two releases per year and would like to refresh their test environments once a quarter. However, given that a refresh cycle takes multiple days to complete for every environment, they are limited to only two refreshes per year.
Even though their production database is only 1 TB in size, they end up moving almost 10 TB of data through their infrastructure. After the data is loaded into the test clusters, it has to go through a time-consuming repair process. Each repair takes hours and is the long pole in the refresh cycle. Note that they run repair on the same data 10 times in every refresh cycle. This wastes time and resources that developers and testers could otherwise use.
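As a rough sketch, the yearly cost for this customer can be tallied from the figures above. The constant and function names are hypothetical, and the calculation assumes every environment receives the full 1 TB on each refresh.

```python
CLUSTER_TB = 1      # production cluster size, from the case study
ENVIRONMENTS = 10   # downstream development/test environments


def yearly_cost(refreshes_per_year: int) -> tuple:
    """Return (TB moved per year, repair runs per year)."""
    tb_per_cycle = CLUSTER_TB * ENVIRONMENTS  # about 10 TB per refresh cycle
    repairs_per_cycle = ENVIRONMENTS          # one repair per environment
    return tb_per_cycle * refreshes_per_year, repairs_per_cycle * refreshes_per_year


current = yearly_cost(2)  # the two refreshes they manage today
desired = yearly_cost(4)  # the quarterly refresh they would like

if __name__ == "__main__":
    print("current:", current)
    print("desired:", desired)
```

Under these assumptions, moving from two refreshes to four doubles both the data moved and the number of repairs, which is why the multi-day repair step keeps the quarterly target out of reach.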
Handling Data Growth
One of the key characteristics of next-generation Big Data applications is the ability to sustain rapid data growth. Figure 1 shows the amount of time taken to refresh a single environment as the cluster size grows at a modest 0.5% daily (1.5x in 1 year). A typical refresh that consumes around 25 hours per environment today will grow to around 80 hours. How would you handle this for 12 environments twice a year? This clearly shows that the traditional approach of a node-level snapshot followed by a repair does not scale. One of the biggest roadblocks to expediting release cycles is the inability to scale non-production environments. This limitation makes it prohibitive to add more test/dev environments, despite strong business demand.
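The question above can be put into numbers with a short sketch. It uses the 25-hour and 80-hour figures quoted in the text and assumes, purely for illustration, that environments are refreshed one after another rather than in parallel; the function name is hypothetical.

```python
def yearly_refresh_hours(hours_per_env: int, environments: int, refreshes_per_year: int) -> int:
    """Total refresh hours per year, assuming environments are refreshed serially."""
    return hours_per_env * environments * refreshes_per_year


today = yearly_refresh_hours(25, 12, 2)         # refresh time quoted for today
after_growth = yearly_refresh_hours(80, 12, 2)  # refresh time quoted after growth

if __name__ == "__main__":
    print(f"today: {today} hours ({today / 24:.0f} days)")
    print(f"after growth: {after_growth} hours ({after_growth / 24:.0f} days)")
```

Under this serial assumption, refresh work goes from 600 hours (25 days) a year to 1,920 hours (80 days) a year, with no change in the number of environments or releases.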
Application teams that want to embrace DevOps for their Big Data applications are struggling with long release cycles. In large part, they are slowed down by the long time it takes to refresh their non-production databases. A typical environment refresh takes multiple days, taxing the infrastructure and wasting precious developer time. The problem is exacerbated by the rapid growth of data consumed by these applications – making the test/dev data refresh cycle even more arduous.
How do we leverage the performance and scalability advantages of scale-out databases while facilitating enterprise-grade data protection and data flow through the various silos of application development? This brings forth new application recovery requirements: any-point-in-time versions of production databases, reliable recovery that is repair-free and supports clusters with unlike topology (for example, developer environments), and avoidance of bloated storage costs. All of this works towards providing a runnable database “as of” minutes or hours ago, a concept we call “reify.” A fundamentally new approach is needed to address this problem. At Datos IO we are working on industry-first data protection products and solutions that allow enterprises to mitigate these issues and onboard newer applications at any scale and at a rapid pace. Contact us to learn more about our early access program, or leave comments and questions here. We would be glad to share our thoughts.