Introduction to DataOps
Every new tech term deserves an introduction. Here's your entry point to DataOps.
The term DataOps is currently gaining a lot of traction, and the solutions emerging around it have matured significantly. Let's discuss what DataOps is all about.
I can start by citing the first sentence of the Wikipedia page, which reads: "DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics." In practice, this means that DataOps makes it possible to quickly meet the data analytics needs of the business with reliable figures. Quickly, because the tooling is industrialized and facilitates collaborative work, and also because we can be confident in the quality of the figures reported.
So, what are the necessary profiles and their associated roles? Here is the list:
Data Engineer: A kind of extended data architect. In addition to knowing how to manage a SQL database, they must be able to work with big data technologies and manage data ingestion flows. They are the ones who make the data available to the other actors in the DataOps chain.
Data Analyst: They are responsible for the subject of the report or visualization. They must therefore be able to build the visuals, clean the data, program, and produce statistics and machine learning modules, for example to estimate future figures.
Data Scientist: This is still an evolving role but, to put it simply, they are an expert in the business domain who also has skills in statistics, machine learning, and mathematics, which they use to extract intelligence from data.
DataOps Engineer: Their role is to provide a unified platform shared by all stakeholders, and to orchestrate the data pipeline and the automated data quality controls.
The idea, in the end, is to have two pipelines: a continuous data ingestion pipeline and a pipeline for new developments, which meet at data production. Ideally, therefore, a unified platform is needed to handle all this and to bring people together around the same tool. Tools such as DataKitchen or Saagie exist to monitor the data production chain. This chain, where the typical steps of data access, transformation, modeling, and visualization and reporting are performed, must be traceable from start to finish, and must also offer a unified view of the non-regression tests.
The tests to implement are the usual ones, to which we add "statistical process control" (SPC) tests. These tests detect whether the reported metrics stay within their normal range. If you measure stock consumption in a factory, you do not normally expect it to increase by 50% in one month. SPC is a rather broad subject that would deserve a book-length treatment, so I'll simply point you to its Wikipedia page as a starting point.
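To make the statistical process control idea a little more concrete, here is a minimal Python sketch. It is only one simple way to express such a test: it assumes a "mean plus or minus three sample standard deviations" rule, which is a simplification of what real SPC charts do, and the function names and the consumption figures are made up for illustration.

```python
from statistics import mean, stdev


def control_limits(history, sigmas=3.0):
    """Return (lower, upper) limits: mean +/- `sigmas` sample standard deviations."""
    center = mean(history)
    spread = stdev(history)  # sample standard deviation; needs at least 2 points
    return center - sigmas * spread, center + sigmas * spread


def is_in_control(value, history, sigmas=3.0):
    """True when `value` falls inside the control limits derived from `history`."""
    lower, upper = control_limits(history, sigmas)
    return lower <= value <= upper


# Made-up monthly stock consumption figures, for illustration only.
past_months = [100, 104, 98, 101, 97, 103, 99, 102]

print(is_in_control(101, past_months))  # an ordinary month -> True
print(is_in_control(150, past_months))  # roughly a 50% jump -> False
```

A check like this could run after each ingestion and flag the data before it reaches a report, rather than letting a business user discover the anomaly.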
In terms of capabilities, everyone also needs a personal sandbox, with the added requirement that the sandbox contain a fresh local dataset. And, of course, all of this should be under version management! This allows you to properly manage the whole big data ecosystem that you will orchestrate, from the retrieval of the data to its final delivery to business people.
All this in order to set up a DataOps process, where the steps are as follows:
Sandbox management: As you already know from DevOps, the goal is to have an isolated dev environment. But here you have data to manage too.
Develop: As in DevOps, you develop your features.
Orchestrate: You orchestrate all the binaries, all the code, and all the data to be manipulated.
Test: Then you test the whole thing.
Deploy: Like in DevOps!
Monitor: Still like in DevOps!
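As a rough illustration of how the Orchestrate, Test, and Deploy steps above can fit together, here is a small Python sketch. Everything in it is hypothetical (the function names, the row format, and the checks); it does not reflect how DataKitchen, Saagie, or any other platform works. It only shows the principle that data is deployed when, and only when, the data tests pass.

```python
def orchestrate(steps, rows):
    """Run each step in order, feeding the output of one into the next."""
    for step in steps:
        rows = step(rows)
    return rows


def failed_checks(rows, checks):
    """Return the names of the checks that do not pass on `rows`."""
    return [name for name, check in checks.items() if not check(rows)]


def release(raw_rows, steps, checks):
    """Orchestrate, test, and 'deploy' only when every check passes."""
    rows = orchestrate(steps, raw_rows)
    failures = failed_checks(rows, checks)
    if failures:
        return {"deployed": False, "failures": failures, "rows": []}
    return {"deployed": True, "failures": [], "rows": rows}


# Hypothetical transformation steps.
def drop_missing_quantities(rows):
    return [r for r in rows if r["qty"] is not None]


def add_total(rows):
    return [{**r, "total": r["qty"] * r["unit_price"]} for r in rows]


# Hypothetical data tests.
checks = {
    "not_empty": lambda rows: len(rows) > 0,
    "totals_non_negative": lambda rows: all(r["total"] >= 0 for r in rows),
}

raw = [
    {"sku": "A", "qty": 3, "unit_price": 2.0},
    {"sku": "B", "qty": None, "unit_price": 5.0},
    {"sku": "C", "qty": 1, "unit_price": 4.0},
]

outcome = release(raw, [drop_missing_quantities, add_total], checks)
print(outcome["deployed"], [r["total"] for r in outcome["rows"]])  # True [6.0, 4.0]
```

In a real setup, the Monitor step would keep running the same kind of checks on production data after deployment.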
I hope this article was useful as an introduction to DataOps, and will help you in your DataOps adoption! You'll find very interesting articles on two DataOps vendors' web sites:
Opinions expressed by DZone contributors are their own.