5 Data Tasks That Keep Data Engineers Awake at Night
5 Data Tasks That Keep Data Engineers Awake at Night
When it comes to integrating and managing data, there are quite a few tasks that are downright tedious. Data engineering is a tough job, but somebody's gotta do it!
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
As a data engineer, I enjoy my work. But when it comes to integrating and managing data, there are quite a few tasks that are downright tedious. From data remodeling to juggling a hodgepodge of software to address disaster recovery, sometimes, we are simply drowning. As they say, it’s a tough job, but somebody's gotta do it!
1. Data Remodeling
When you start building a solution from scratch, you are dealing with a clean slate. The infrastructure is built, all the data has been modeled perfectly in an almost artistic manner, and it is now flowing in an elegant and clean fashion.
However, problems start real soon — basically, the minute those damn analysts and data scientists get their hands on your magnificent creation. They keep asking for data augmentation, integration of new sources of data, and different data models for asking different kinds of questions. It falls on your shoulders, and you feel like you are somehow personally at fault for preventing your company from data-driven value creation, while those damn analysts hold no appreciation for the remarkable work of art that you’ve created.
Data modeling progresses from conceptual model to logical model to physical schema, establishing a robust world of data objects and their relationships to other data objects. Mainly, it provides the analysts the ability to ask the questions they need in a fast cost-effective and simple manner. It is precisely its robust nature, combined with the frequency of changes needed, that creates the never ending cycle of integration, data preparation, and modeling.
2. Scaling Clusters
Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive, cumbersome, and time-consuming. They also stretch physical infrastructure capacity, creating risk for under-provisioned resources.
The problematic nature of scaling clusters is even greater with RDBMS, given their architecture is run on a single server. In practical terms, this means that when an RDBMS needs to scale, you must buy bigger, more complex, and expensive proprietary hardware with more processing power, memory, and storage capacity. All this involves lengthy downtime and configurations to make the change.
Scaling clusters in cloud infrastructure generally solves the scalability issue. BigQuery’s most significant advantage, for instance, is the seamless and fast resizing of a cluster, up to petabyte scale. Unlike Redshift, there is no need to constantly track and analyze the cluster size and growth in an effort to optimize its size to correspond to the current dataset requirements.
Redshift scalability allows users to enjoy an increase in performance when resources including memory and I/O capacity increase. To get the best for their buck, Panoply seamlessly scales Redshift users’ cloud footprints according to the amount of data and the number and complexity of queries. It scales the cluster on demand — ensuring the data warehouse performance is well balanced with costs.
3. Backup and Disaster Recovery
Traditionally, businesses have cobbled together multiple software solutions to address disaster recovery (DR), backup, and archival as a part of the larger data protection practices. This approach is mind-bogglingly inefficient for the data engineer tasked with managing each of these solutions. It is also a relatively expensive approach. A different way of dealing with DR is to leverage cloud architecture for secondary workflows such as backup, archival and disaster recovery (DR).
BigQuery automatically replicates data to ensure its availability and durability. However, complete loss of data due to disaster is less common than the need for fast, instant restoration of, for example, a specific table or even a specific record. For both purposes, Redshift automatically stores backups to S3 and enables you to revisit data at any point in time over the last ninety days. In all cases, retrieval includes a series of actions that can make the instant recovery a cumbersome, lengthy operation.
Since Panoply is powered by Redshift, backup to S3 is obvious, but it takes it one step further. Leveraging Panoply’s Revision History tables, users keep track of every change to any of their database rows, right within their data warehouse, making it immediately available to analysts with simple SQL queries. This makes file uploading to S3 and database extraction redundant when the need to travel back to any point in time and quickly see how the data has changed arises.
4. ETL and Data Prep
Loading your data in a clean and logical way to the analytics architecture is not only the first step in data management but is also definitely one of its more important stages. The difference in loading methodologies can affect the scalability of the process and the richness of the data loaded, and in most cases, parts of raw data are actually lost.
When loading data, we usually think to load only the important data — meaning the actual data that we want to analyze. The problem with this is that in most cases (most cases being all), you never know on day one which data is going to be important for analysis in the future, and thus long-thought experiments and tests of how and which data to load become worthless the minute after analysis begins.
Essentially, the ETL and data warehousing must fulfill the function of being a screening protocol that ultimately produces clean data sets that are easy for analytics programs to scrutinize.
Very frequently, you end up running supporting software on large numbers of servers so that they can warehouse data from multiple sources, including different OLTPs.
Due to the wide range of possible data inconsistencies and the sheer data volume, data cleaning is considered to be one of the biggest challenges in data warehousing. A number of tools of varying functionality are available to support ETL tasks. Even with these tools, a significant portion of the cleaning and transformation work has to be done manually or by low-level programs.
Snowflake supports simultaneous users’ queries through different virtual warehouses. You can run your overnight ETL processes run on slower and less expensive warehouse resources, and then enable real-time ad-hoc queries through a more powerful warehouse during business hours. However, with Redshift scale and operational efficiency, ETL can be referred to as a rigid and outdated paradigm.
Leveraging Redshift scalability with Panoply, you can forget about overnight ETL processes. Panoply follows the ELT process whereby all of the raw data is instantly available in real-time and all transformations happen asynchronously on query time. This makes Panoply both a data lake and a data warehouse, allowing users to have constant and real-time access to their raw data. This means they can iterate their transformations in real-time, with updates instantly applied on newly inserted data as well. Finally, customized, advanced transformations are also possible via the Panoply UI console and take just minutes to set up and run.
5. Performance Troubleshooting
A bottomless pit, an abyss, the void…should I keep going? If the system is slow, just about anything could be the problem. It could be poor data modeling, improper indexing, network issues, storage hardware — basically anything.
Then, once you begin (especially in the case of an issue with a private cloud or unmanaged solutions), the data engineer clears one bottleneck only to bump into another one. That troubleshooting the data warehouse can be a bit intimidating. Sometimes, even figuring out where to begin to look can be difficult.
More often than not, and more frustrating than anything (at least from my personal experience), is when the trouble lays in the type of queries being run by irresponsible analysts, like some select
* over an enormous table. I try raising knowledge and awareness of these as part of my tasks. Other frequent issues I see are around the way data is structured in the warehouse — for instance, in a case where data is modeled with many subtables, analysis on even a small set of data can be lengthy.
Published at DZone with permission of Yaniv Leven . See the original article here.
Opinions expressed by DZone contributors are their own.