Data Ingestion Best Practices
Data Ingestion Best Practices
Data ingestion is something you likely have to deal with pretty regularly, so let's examine some best practices to help ensure that your next run is as good as it can be.
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
Data is the fuel that powers many of the enterprise’s mission-critical engines, from business intelligence to predictive analytics; data science to machine learning. To be fully useful, data, like any fuel, must be abundant, readily available, and clean. The process of data ingestion — preparing data for analysis — usually includes steps called extract (taking the data from its current location), transform (cleansing and normalizing the data), and load (placing the data in a database where it can be analyzed). Enterprises typically have an easy time with extract and load, but many run into problems with transform. The result can be an analytic engine sitting idle because it doesn’t have ingested data to process.
In light of this reality, here are some best practices to consider regarding data ingestion.
Expect Difficulties and Plan Accordingly
The dirty secret of data ingestion is that collecting and cleansing the data reportedly takes 60 percent to 80 percent of the scheduled time in any analytics project. We imagine data scientists spending most of their time running algorithms, examining results, and then refining their algorithms for the next run. This is the exhilarating part of the job, but the reality is that data scientists spend most of their time trying to wrangle the data into shape so they can begin their analytic work. As the size of big data continues to grow, this part of the job gets bigger all the time. Many enterprises begin data analytics projects without understanding this, and then they become surprised or disappointed when the data ingestion process does not meet their initial schedules. Meanwhile, other teams have developed analytic engines that assume the presence of clean ingested data and are left waiting idly while the data ingestion effort flounders. There is no magic bullet that can help you avoid these difficulties. Expect them, and plan for them. For example, you may want to schedule more time for data ingestion, assign more people to it, bring in external expertise or defer the start of developing the analytic engines until the data ingestion part of the project is well underway.
Automate Data Ingestion
In the good old days, when data was small and resided in a few-dozen tables at most, data ingestion could be performed manually. A human being defined a global schema and then assigned a programmer to each local data source to understand how it should be mapped into the global schema. Individual programmers wrote mapping and cleansing routines in their favorite scripting languages and then ran them accordingly. Today, data has gotten too large, both in size and variety, to be curated manually. You need to develop tools that automate the ingestion process wherever possible. For example, rather than manually defining a table’s metadata, e.g., its schema or rules about minimum and maximum valid values, a user should be able to define this information in a spreadsheet, which is then read by a tool that enforces the specified metadata. This type of automation, by itself, can reduce the burden of data ingestion. But, in many cases, it does not eliminate the ingestion bottleneck, given the sheer number of tables involved. When thousands of tables must be ingested, filling out thousands of spreadsheets is better than writing thousands of ingestion scripts. However, it is still not a scalable or manageable task.
Use Artificial Intelligence
A variety of products have been developed that employ machine learning and statistical algorithms to automatically infer information about data being ingested and largely eliminate the need for manual labor. These include open-source systems like Data Tamer and commercial products like Tamr, Trifacta, and Paxata. Some examples of processes that these systems can automate include the following:
- Infer the global schema from the local tables mapped to it.
- Given a local table, infer which global table it should be ingested into.
- Infer synonyms for data normalization. For example, the abbreviations “in.” and ”in,” a straight double-quotation mark (") and the word “inches” are all synonyms.
- Detect duplicate records based on fuzzy matching. For example, “Moshe Kranc” and “M. Kranc” are the same person.
These systems rely on humans to provide training data and to resolve gray areas where the algorithm cannot make a clear determination. Newer systems, such as Informatica’s CLAIRE or the open-source ActiveClean project, are touted as tools that can eliminate humans entirely.
The bottom line is that these products are real, they work and they should be part of any enterprise’s data ingestion roadmap.
Make it Self-Service
In a midsize enterprise, dozens of new data sources will need to be ingested every week. A centralized IT organization that has to implement every request will inevitably become a bottleneck. The solution is to make data ingestion self-service by providing easy-to-use tools for preparing data for ingestion to users who want to ingest new data sources. For example, give your users self-service tools to detect and cleanse missing values, outlier values, and duplicate records before they try to ingest the data into the global database.
Govern the Data to Keep it Clean
Once you have gone to the trouble of cleansing your data, you will want to keep it clean. This means introducing data governance with a data steward responsible for the quality of each data source.
This responsibility includes the following: defining the schema and cleansing rules, deciding which data should be ingested into each data source, and managing the treatment of dirty data. Of course, data governance includes other aspects besides data quality, such as data security and compliance with regulatory standards such as GDPR and master data management. Achieving all these goals requires a cultural shift in the way the organization relates to data, and it requires a data steward who can champion the required efforts and be accountable for the results.
Advertise Your Cleansed Data
Once you have cleansed a specific data source, will other users be able to find it easily? If your data integration is always done point-to-point, as requested by customers, there is no way for any customer to find data already cleansed for a different customer that could be useful. Your organization should implement a pub-sub (publish-subscribe) model with a registry of previously cleansed data available for lookup by all your users.
Ultimately, these best practices, when taken together, can be the difference between the success and failure of your specific data ingestion projects.
Published at DZone with permission of Moshe Kranc , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.