Our expert panel included: Andi Mann, CTO at Splunk; John-Daniel Trask, CEO and co-founder of Raygun.com; Pete Grant, solutions architect at CloudBI; and our very own Sam Fell and Anders Wallgren.
During the episode, the panelists discussed the interaction of DevOps and Big Data, some unique challenges Big Data poses for delivery pipelines, and best practices and patterns for DevOps processes for Big Data applications. Continue reading for their insights!
Why DevOps for Big Data?
Mann notes how the DevOps culture of continuous learning fits with Big Data: “You need collaboration between developers who are creating algorithms and trying to understand what’s happening in the code. Then, business leaders need to understand what is going on from a business perspective, and operations need to run efficient, fast queries and receive responses. With all these teams working together collaboratively, iterating and asking questions — that’s DevOps.”
Trask, coming from a developer background, highlights the scale of databases and the subsequent need to automate: “Everything’s getting bigger. We’re talking terabytes or petabytes of data. Immediately you’re already having to look at some of the automation play to make it work.”
In addition to discussing how DevOps enables teams instead of individuals, Grant explores the regulatory and compliance aspects of DevOps and Big Data: “You need tight treatment of the sensitive data that you might have in Big Data solutions. Being able to audit everything that everybody did to the data, that’s a big benefit to Big Data for DevOps.”
Fell notes pipelines as another parallel between DevOps and Big Data: “Think about the way that you deploy applications through the pipeline, the value stream. Data has to go through a value stream also. I guess you could just dump it into the Big Data swamp, but oftentimes there’s work that’s done as that data is ingested to make sure that it’s fit for use by whatever data scientist or purpose you’d like to apply to it.”
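Fell’s point about work done at ingestion can be sketched as a simple quality gate. This is a hypothetical illustration, not anything from the episode: the field names and the `ingest` helper are invented for the example, and real pipelines would use a proper schema or data-quality framework.

```python
# Hypothetical sketch of an ingestion "quality gate": records are checked
# before entering the data lake so downstream users get fit-for-use data.
REQUIRED_FIELDS = {"id", "timestamp", "value"}

def validate_record(record: dict) -> bool:
    """Accept only records that carry all required fields, none of them null."""
    return REQUIRED_FIELDS <= record.keys() and all(
        record[f] is not None for f in REQUIRED_FIELDS
    )

def ingest(records):
    """Split a batch into accepted records and quarantined ones for review."""
    accepted = [r for r in records if validate_record(r)]
    quarantined = [r for r in records if not validate_record(r)]
    return accepted, quarantined
```

The quarantine list is the pipeline analogue of a failed build: bad data is stopped and surfaced at the ingestion stage instead of being discovered by a data scientist downstream.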
The self-service aspects of DevOps also create an interesting comparison, says Wallgren: “The whole notion of self-service works really well in an environment where you have sensitivity issues that you need to deal with. You can prevent mistakes by funneling people down the right road, depending on the data that they want to put out there.”
DevOps and Big Data: Challenges and Unique Considerations
Data analysis itself can be a challenge, according to Trask: “If you have a smaller team and you don’t necessarily have a dedicated data scientist, how do you sit down and start building some of these models? Just in terms of talent on the team, that can sometimes be an issue.”
Mann calls out the volume of data and the inability to move it around: “Testing on a small data set can give you radically different results. Many of the tropes of Agile development become a lot more difficult simply because of that data gravity around a huge data set. This is where DevOps really helps, because you can work in incrementally increasing sizes of data sets to see if the results are correct, then keep iterating on those results.”
The scale of the data requires massive distributed computing in order to answer questions in real time, notes Wallgren: “Nowadays, with massive distributed computing and machine learning you can do these computations more in real time than in the past when you had to pre-compute all the answers. We’ve evolved to figure out ways that we can index data and ways that we can access the data where we have more flexibility in how we ask the questions.”
Grant highlights the importance of knowing what you’re optimizing for: “If you want to make sure any batch process finishes in a certain amount of time, you have to consider, ‘well, if I have more performance then I’ll get there better.’ Sometimes there can be challenges around that. Being able to control the cost of your performance testing can help you solve that problem.”
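Grant’s trade-off between batch deadlines and cost can be made concrete with back-of-envelope arithmetic. The throughput and price numbers below are invented for illustration; the point is that “more performance” has a cost you can compute before you provision.

```python
# Hypothetical sizing sketch: how many nodes does a batch job need to
# finish inside its window, and what does that run cost?
def nodes_needed(total_records, records_per_node_hour, window_hours):
    """Smallest node count that processes the batch within the window."""
    per_node_capacity = records_per_node_hour * window_hours
    return -(-total_records // per_node_capacity)  # ceiling division

def run_cost(nodes, window_hours, price_per_node_hour):
    """Infrastructure cost of one batch run at that node count."""
    return nodes * window_hours * price_per_node_hour
```

Running the numbers both ways — tighter window versus cheaper run — is exactly the “know what you’re optimizing for” decision Grant describes.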
Patterns for DevOps for Big Data
One thing we always talk about is cycle time, says Wallgren: “The way that a lot of organizations today fall down on DevOps is manual processes, manual testing, or rework due to problems. In Big Data, you’ll never finish if you don’t get a little bit more efficient in your cycle times and lead times. The lesson is a hard lesson to learn with Big Data because it’s so expensive when you screw up and have to start over. If you’re going to be a successful company, you’d better learn that lesson pretty quickly or you’re not going to be around for very long.”
“If you’re trying to keep the costs as low as possible for your infrastructure then it’s not going to be sized so that there can be an infinite number of users running whatever queries that they want onto that infrastructure,” says Grant. “Use some DevOps techniques to figure out how much capacity you need to serve that set of users, or that expected amount of use of your data.”
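One way to sketch Grant’s capacity question is with Little’s law (L = λW): expected users and query rates give you the number of queries in flight at peak, which is what the cluster must actually serve. The figures in the test are hypothetical; the formula is standard queueing arithmetic.

```python
# Hypothetical capacity estimate: size for the expected query load,
# not for an unlimited number of ad-hoc users.
def peak_concurrent_queries(users, queries_per_user_hour, avg_query_seconds):
    """Expected queries in flight at peak, via Little's law (L = arrival
    rate * average time in system)."""
    arrival_rate_per_sec = users * queries_per_user_hour / 3600
    return arrival_rate_per_sec * avg_query_seconds
```

A cluster sized for that concurrency (plus headroom) keeps costs bounded while still meeting the expected use of the data, per Grant’s suggestion.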
Start off with the basics when it comes to DevOps patterns, says Trask: “From the purely DevOps part, get that automation pipeline in. Get code built reliably and deployed in a consistent manner. Frankly, it seems like most people are there now, but if you’re not doing that then it’s the lowest hanging fruit for improving everybody’s life.”
Mann highlights collaboration as a pattern and also expands on machine learning: “If the data set’s vast, it’s going to be pretty much impossible for a human being to be able to see patterns, especially when a pattern is a blip on your radar. You need to use machines that understand how to find those patterns, and that’s something that is not normal in a lot of our endeavors, but is absolutely critical when it comes to Big Data.”
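A minimal illustration of the machine-assisted pattern-finding Mann describes is a simple statistical outlier test: flag points far from the mean of a stream too large for a human to scan. Real Big Data systems use far richer models; the z-score threshold here is just an assumed, illustrative detector.

```python
from statistics import mean, stdev

# Hypothetical "blip" detector: flag values more than `threshold`
# standard deviations from the mean of the series.
def find_blips(values, threshold=3.0):
    """Return indices of statistical outliers in the series."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]
```

Even this toy detector finds a single anomalous reading buried in a flat series — the kind of blip Mann notes a human reviewer would miss at scale.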
Read the original post here.