Effective Data Engineering in the Cloud World
Cloud has changed the game for data engineers, from the way data is stored to how the infrastructure is built.
The cloud has changed the dynamics of data engineering, as well as the behavior of data engineers, in many ways. This is primarily because, on premises, a data engineer dealt only with databases and some parts of the Hadoop stack.
In the cloud, things are a bit different. Data engineers suddenly need to think differently and more broadly. Instead of being purely focused on data infrastructure, you are now almost a full-stack engineer (leaving out the final end application, perhaps). Skills in compute, containers, storage, data movement, performance, and networking are increasingly needed across the broader stack. Here are some design concepts and data stack elements to keep in mind.
1. The Disaggregated Data Stack — Pick a Compute, a Catalog, a Buffer Pool, a Storage.
Historically, databases were tightly integrated, with all core components built together. Hadoop changed that by co-locating compute and storage across a distributed system rather than in one or a few boxes. The cloud changed things again. Today, the stack is fully disaggregated, with each core element of the database management system being its own layer. Pick each component wisely.
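As a rough illustration of what "each element is its own layer" means in practice, the sketch below models the four layers named in the heading as independent fields, so one can be swapped without touching the others. The class name `DataStack` and the component strings (including `hive-metastore`) are illustrative assumptions, not recommendations from this article.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DataStack:
    """Hypothetical description of a disaggregated stack: one field per layer."""
    compute: str
    catalog: str
    buffer_pool: str
    storage: str


# Example choices only; every value here is an assumption for illustration.
stack = DataStack(
    compute="presto",
    catalog="hive-metastore",
    buffer_pool="alluxio",
    storage="s3-compatible-object-store",
)

# Because the layers are independent, replacing storage leaves the rest untouched.
on_prem_stack = replace(stack, storage="on-prem-s3-compatible-store")
```

The point of the sketch is only that each layer is a separate decision; a real platform would attach endpoints, credentials, and versions to each field.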
2. Orchestrate, Orchestrate, Orchestrate
Efficiency increases dramatically through abstraction and orchestration. Because a data engineer in the cloud now has full-stack concerns, orchestration can be a data engineer's best-kept secret.
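To make the idea of orchestration concrete, here is a minimal sketch that runs pipeline steps in dependency order using Python's standard-library `graphlib` (Python 3.9+). The task names and the `run_pipeline` helper are hypothetical; a production orchestrator would add scheduling, retries, and monitoring on top of this basic ordering.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, dependencies):
    """Run each task after all the tasks it depends on.

    tasks: maps a task name to a zero-argument callable.
    dependencies: maps a task name to the set of task names it depends on.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    log = []
    for name in order:
        log.append(tasks[name]())
    return order, log


# Hypothetical three-step pipeline: ingest, then transform, then publish.
tasks = {
    "ingest": lambda: "ingested",
    "transform": lambda: "transformed",
    "publish": lambda: "published",
}
dependencies = {
    "ingest": set(),
    "transform": {"ingest"},
    "publish": {"transform"},
}

order, log = run_pipeline(tasks, dependencies)
```

Declaring dependencies instead of hard-coding the call sequence is the abstraction that lets an orchestrator decide what runs when.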
3. Copying Data Creates More Problems than It Solves
Fundamentally, once data lands in the enterprise, it should not be copied around, except, of course, for backup, recovery, and disaster recovery scenarios. Making this data accessible to as many business units, data scientists, and analysts as possible, with as few new copies as possible, is THE data engineering puzzle to solve.
This is where a buffer pool helped in the legacy DBMS world. It made sure the compute (query engine) always had consistent, performant access to data, in a format suited to query processing rather than one optimized for storage. Technologies like Alluxio can dramatically simplify life by bringing data closer to compute, making it more performant and accessible.
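The buffer-pool idea can be sketched as a read-through cache sitting between compute and a slower store: the first read goes to remote storage, and later reads are served locally without creating a managed copy elsewhere. This is a simplified illustration only. It is not Alluxio's API, and `ReadThroughCache` and `slow_object_store_get` are hypothetical names.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Serve repeated reads from memory; fetch from the backing store on a miss."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, bytes] = {}
        self.misses = 0

    def read(self, key: str) -> bytes:
        if key not in self._cache:
            # Cache miss: go to the (slow) backing store exactly once per key.
            self.misses += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]


remote_reads = []


def slow_object_store_get(key: str) -> bytes:
    """Stand-in for a remote object store read; records each remote access."""
    remote_reads.append(key)
    return f"contents of {key}".encode()


cache = ReadThroughCache(slow_object_store_get)
first = cache.read("sales/2020.parquet")
second = cache.read("sales/2020.parquet")  # served from the cache
```

A real caching layer also has to handle eviction, invalidation, and consistency with the underlying store, all of which this sketch leaves out.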
4. S3-Compatible in the Cloud, S3-Compatible on Premises
Because of the popularity of AWS S3, object stores in general will be the next dominant storage system, at least for a few years (typically a 5-8-year cycle). Think ahead and pick a storage tier that will last for some time; S3-compatible object stores should be your primary choice. While they are not a great fit for every data-driven workload, many technologies help remove their deficiencies.
5. SQL and Structured Data Is Still In!
While SQL has existed since the 1970s, it is still the easiest way for analysts to understand and do something with data. AI models will continue to evolve, but SQL has lasted close to 50 years. Pick two, at most three, frameworks to bet on and invest in, but build a platform that will, over time, support as many as needed. Currently, Presto SQL is turning into a popular query engine choice for the disaggregated stack.
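As a small reminder of why SQL remains so approachable, the sketch below answers a typical analyst question (total order amount per region) in one declarative statement. It uses Python's built-in `sqlite3` purely as a stand-in engine, not Presto, and the `orders` table and `revenue_by_region` helper are hypothetical.

```python
import sqlite3


def revenue_by_region(rows):
    """Load (region, amount) rows into an in-memory table and total them per region."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) AS total "
            "FROM orders GROUP BY region ORDER BY region"
        )
        return cursor.fetchall()
    finally:
        conn.close()


totals = revenue_by_region([("east", 10), ("west", 5), ("east", 20)])
```

The same `SELECT ... GROUP BY` statement would look essentially identical on most SQL engines, which is a large part of SQL's staying power.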
This blog was originally posted on Medium.