We are now seeing a transformation in the world of data, marked by tension between the old world (single-source-of-truth data warehouses with top-down data governance) and the new world (distributed, self-service analytics with grassroots management). In organizations of all sizes, self-service reporting and analysis is becoming the norm. Where previously people were given data in the form of a packaged report, today they are free to discover and explore their own data.
Data curation is emerging as a technique to support data governance, especially in data-driven organizations. As self-service data visualization tools have taken off, more business users across organizations are discovering and transforming their own data to help make critical business decisions. Sharing the nuances and best practices of how to use data becomes ever more critical in this environment.
Finding the Path to Successful Data Curation
Getting started with data curation, however, can be a challenging endeavor due to the broad distribution of data knowledge across an organization. Pieces of data knowledge are often spread across Wiki pages, data dictionaries, email, chat, social, and raw web content, which the data curator needs to identify, understand, and propagate. Some challenges for the data curator include:
- Documentation. The priorities of what data to document aren’t initially obvious. Data knowledge is distributed in too many places, and the data is changing too rapidly. The velocity of data growth outpaces the rate at which people can be assigned to document knowledge of the data.
- Propagation. It is hard to make data knowledge easily discoverable at the right point in time. How often a given piece of data is used is not always predictable, and data assets are often replicated in different formats and in multiple storage locations.
- Data quality. It is hard to distinguish high-quality data from data that is inaccurate or stale. Doing so often requires subject matter expertise in the business function associated with that data. It can be nearly impossible for a non-expert to know which data source is the accurate one.
- Data definitions. Even an accurate, up-to-date data asset can be used in different ways by different teams. For instance, a product team might analyze clickstream data in two-minute intervals, whereas the marketing team might consider two-day sessions. Both methods are valid, but they result in different numbers for the same metric.
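To make the data-definitions point concrete, here is a minimal sketch of how the same clickstream can produce two different session counts. It assumes a "session" ends when the pause between clicks exceeds a chosen gap, which is only one possible reading of the two-minute and two-day definitions above. The click times and the `count_sessions` helper are hypothetical.

```python
from datetime import datetime, timedelta


def count_sessions(timestamps, gap):
    """Count sessions; a new one starts when the pause since the last click exceeds `gap`."""
    ordered = sorted(timestamps)
    if not ordered:
        return 0
    sessions = 1
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > gap:
            sessions += 1
    return sessions


# Hypothetical click times for a single visitor (illustrative data only).
clicks = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 1),
    datetime(2024, 1, 1, 9, 30),
    datetime(2024, 1, 2, 8, 0),
    datetime(2024, 1, 5, 8, 0),
]

# Assumed definitions: product uses a two-minute gap, marketing a two-day gap.
product_sessions = count_sessions(clicks, timedelta(minutes=2))
marketing_sessions = count_sessions(clicks, timedelta(days=2))

print(product_sessions, marketing_sessions)  # prints "4 2"
```

Neither number is wrong. What the curator needs to record is which definition produced which figure, so that two teams do not compare them as if they were the same metric.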
The consumer world of the Internet faced similar challenges. Initially, we expected machines to do the lion’s share of the work to automate all Internet content, ensure that it was accurate, and propagate it to the masses. What we now know is that curation needs human input, especially when it comes to evaluating and labeling the quality of content. We can’t completely automate curation. Organizations on a self-service analytics path need to identify where humans must offer input and where computers can automatically document.
As important as the role of data curator is to self-service analytics, data curation best practices are still in their infancy. Organizations are experimenting with how to integrate machines into the data curation process yet still give data curators the appropriate amount of control. Here are some steps to finding your organization’s optimal balance between humans and machines in data curation:
Browse the Data
This is a big task: it is impossible for one person to manually search through all sources of data to find those of the highest interest or importance. Just starting with a list of all database sources would be overwhelming, let alone knowing all of the tables within those sources. But it might not be necessary to browse all of your data first. Machines can be trained to match patterns and surface the most important data. Using machine learning can save a data curator, and an organization, a tremendous amount of time.
Create Context for Data Knowledge
The key to creating context is to document the data effectively and provide the most useful information possible to enable appropriate use. This is not just about documenting technical information (e.g., column labels and tables), but about creating context that explains how people should use the information. In this endeavor, the information surrounding a number can be more important than the numeric value itself. There may be hundreds of different uses of one data source. For example, when defining what constitutes a “U.S. state,” the Shipping Department might not include Hawaii because it is a shipping exception, but the Finance Department would include it in a list of states as a revenue source. This is why it is critical to document such nuances: they provide the context for the data.
Share the Data Knowledge
You also need to make the data discoverable and to promote it to as broad an audience as possible. In order to obtain this breadth of reach, you need to actively share it via push methods such as emails and alert notifications, as well as just-in-time methods such as a suggestion-oriented query tool. Sharing data knowledge also means making pull methods of access available such as data catalogs. Through actively sharing the data knowledge, the data curator helps distribute it to the right people at exactly the right moment.
Update the Data Knowledge
Finally, you need to propagate changes to the data knowledge; that is, you need to stay on top of technical changes to the data. For example, when a data curator updates a column label, the label should be updated automatically in the other tables and sources that use that same data. This is difficult to do without technology: how do you know where all of the other sources that leverage that data are? And even if you do know all of the sources, you don’t want to have to make the update in every place the data occurs.
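As a rough illustration of what automatic propagation could look like, the sketch below walks a lineage graph and applies a new label to every downstream column. The `lineage` mapping, the column names, and the `propagate_label` helper are all hypothetical; a real catalog would derive lineage from its own metadata rather than from a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage: each column maps to the columns derived from it.
lineage = {
    "warehouse.orders.cust_id": [
        "reports.daily_orders.cust_id",
        "marts.customers.cust_id",
    ],
    "marts.customers.cust_id": ["dashboards.churn.cust_id"],
}

# Hypothetical labels currently shown to users.
labels = {
    "warehouse.orders.cust_id": "cust id",
    "reports.daily_orders.cust_id": "cust id",
    "marts.customers.cust_id": "cust id",
    "dashboards.churn.cust_id": "cust id",
}


def propagate_label(source, new_label, lineage, labels):
    """Apply new_label to source and everything downstream; return the columns touched."""
    updated = []
    seen = {source}
    queue = deque([source])
    while queue:  # breadth-first walk over the lineage graph
        column = queue.popleft()
        labels[column] = new_label
        updated.append(column)
        for child in lineage.get(column, []):
            if child not in seen:  # guard against cycles and shared descendants
                seen.add(child)
                queue.append(child)
    return updated


updated = propagate_label("warehouse.orders.cust_id", "Customer ID", lineage, labels)
print(updated)
```

The curator edits the label once at the source, and the traversal finds the other places that need the change.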
Technology is a critical driver of successful data curation in the self-service analytics world. An organization should pick a technology that provides flexibility and helps people manage data curation across the full range of control, from human-led to machine-led.
Key components of the data curation technology should include the following:
1. Complete Repository of Data Knowledge
The technology should allow your organization to build a comprehensive description of data knowledge in all forms of the data objects that exist. There should be a link between all of the data knowledge sources. A vendor like Alation will provide a complete repository for all the data assets and data knowledge in your organization, as well as automatically generate lineage of where the data came from, where it’s going, and how it’s being moved.
2. Automated Documentation
The technology should automatically discover for you who the top users are, the top queries that are written, the lineage of the data knowledge, and the most popular data knowledge used.
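A very small version of this idea can be sketched with a query log. The example below assumes a hypothetical log of (user, SQL text) pairs and uses a simple regular expression to pull table names out of FROM and JOIN clauses. A production tool would use a proper SQL parser, so treat this as an illustration rather than a description of any vendor's implementation.

```python
import re
from collections import Counter

# Naive pattern: the identifier that follows FROM or JOIN (assumption: no subqueries or quoting).
TABLE_PATTERN = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)

# Hypothetical query log entries: (user, SQL text).
query_log = [
    ("ana", "SELECT * FROM sales.orders"),
    ("ana", "SELECT o.id FROM sales.orders o JOIN sales.customers c ON o.cust_id = c.id"),
    ("raj", "select count(*) from sales.orders"),
    ("raj", "SELECT * FROM hr.employees"),
    ("ana", "SELECT name FROM sales.customers"),
]


def summarize(log):
    """Return (queries per user, references per table) as Counters."""
    users = Counter(user for user, _ in log)
    tables = Counter()
    for _, sql in log:
        tables.update(name.lower() for name in TABLE_PATTERN.findall(sql))
    return users, tables


top_users, top_tables = summarize(query_log)
print(top_users.most_common(1))   # prints [('ana', 3)]
print(top_tables.most_common(2))  # prints [('sales.orders', 3), ('sales.customers', 2)]
```

Counts like these give a curator a starting list of which tables and which people to focus on first.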
3. Collaborative Editing
The technology should make documentation a living document, so people can comment on it, change it, revise it, and make it better on a continuous basis.
4. Knowledge Propagation
You should be able to syndicate data knowledge across the organization. Instead of propagating knowledge to 1x the audience, technology should help it reach 7x the number of people who are using information for analytical and discovery purposes.
5. Query Tool
Technology should give users freedom in a query writing tool. As they are typing SQL, the technology should provide auto-completion to allow users to work faster and write more accurate queries.
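For a sense of how auto-completion might work at its simplest, here is a sketch that suggests table names by prefix using a sorted list and binary search. The `Autocompleter` class and the table names are hypothetical. Real query tools typically also rank suggestions by popularity and by the context of the query being typed.

```python
from bisect import bisect_left


class Autocompleter:
    """Prefix-based suggestions over a fixed set of names (case-insensitive)."""

    def __init__(self, names):
        # Lowercase and de-duplicate, then sort so matches for a prefix are contiguous.
        self._names = sorted({name.lower() for name in names})

    def suggest(self, prefix, limit=5):
        prefix = prefix.lower()
        start = bisect_left(self._names, prefix)  # first name >= prefix
        matches = []
        for name in self._names[start:]:
            if not name.startswith(prefix):
                break  # sorted order means no later name can match
            matches.append(name)
            if len(matches) == limit:
                break
        return matches


# Hypothetical table names known to the catalog.
completer = Autocompleter(
    ["orders", "order_items", "customers", "ORDER_STATUS", "employees"]
)
suggestions = completer.suggest("ord")
print(suggestions)  # prints ['order_items', 'order_status', 'orders']
```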
It is important that an organization employ an iterative and Agile approach to becoming a data-driven organization. As any organization goes through cultural and corporate changes, it needs to revisit its data curation framework over time. And, the actual users of data knowledge need to embrace data curation as a means to their goal of making data-driven decisions for the business. Finally, technology should be employed to help balance where humans conduct and manage the data curation process and where machines do so.
By following the framework proposed in this post, organizations can implement a data curation process that supports their culture and business goals and that is agile enough to adapt as organizational goals and market forces require. For a more detailed explanation of how to implement a successful self-service analytics approach, view the whitepaper "Enabling Self-Service Analytics for Data Driven Organizations."