Deriving Value from the Data Lake
What questions about your data do you need to answer, and how can the data lake provide those answers?
Excerpt from ebook, Architecting Data Lakes: Data Management Architectures for Advanced Business Use Cases, by Ben Sharma and Alice LaPlante.
The purpose of a data lake is to provide value to the business by serving users. From a user perspective, these are the most important questions to ask about the data:
- What is in the data lake (the catalog)?
- What is the quality of the data?
- What is the profile of the data?
- What is the metadata of the data?
- How can users perform enrichments, cleanups, enhancements, and aggregations without going to IT (that is, how can they use the data lake in a self-service way)?
- How can users annotate and tag the data?
Answering these questions requires that proper architecture, governance, and security rules are put in place and adhered to, so that the right people get access to the right data in a timely manner. It also requires strict governance over the onboarding of data sets, naming conventions that are established and enforced, and security policies that ensure role-based access control.
For our purposes, self-service means that non-technical business users can access and analyze data without involving IT.
In a self-service model, users should be able to see the metadata and profiles and understand what the attributes of each data set mean. The metadata must provide enough information for users to create new data formats out of existing data formats, using enrichments and analytics.
Also, in a self-service model, the catalog will be the foundation for users to register all of the different data sets in the data lake. This means that users can go to the data lake and search to find the data sets they need. They should also be able to search on any kind of attribute—for example, on a time window such as January 1st to February 1st—or based on a subject area, such as marketing versus finance. Users should also be able to find data sets based on attributes—for example, they could enter, “Show me all of the data sets that have a field called discount or percentage.”
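The kinds of catalog searches described above can be sketched in a few lines of Python. This is only an illustration: the `CatalogEntry` structure, the `search` function, and the sample data sets are hypothetical and do not come from any particular data lake product.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass
class CatalogEntry:
    """One registered data set; every attribute name here is hypothetical."""
    name: str
    subject_area: str
    fields: Tuple[str, ...]
    first_record: date
    last_record: date


def search(catalog, field_names=None, subject_area=None, window=None):
    """Return the entries that match every criterion that was supplied."""
    results = []
    for entry in catalog:
        # Field search: keep entries that contain at least one requested field.
        if field_names and not set(field_names) & set(entry.fields):
            continue
        if subject_area and entry.subject_area != subject_area:
            continue
        if window:
            start, end = window
            # Keep entries whose records overlap the requested time window.
            if entry.last_record < start or entry.first_record > end:
                continue
        results.append(entry)
    return results


catalog = [
    CatalogEntry("promo_codes", "marketing", ("code", "discount"),
                 date(2016, 1, 1), date(2016, 3, 31)),
    CatalogEntry("ledger", "finance", ("account", "amount"),
                 date(2015, 1, 1), date(2016, 12, 31)),
    CatalogEntry("tax_rates", "finance", ("region", "percentage"),
                 date(2016, 2, 15), date(2016, 6, 30)),
]

# "Show me all of the data sets that have a field called discount or percentage."
by_field = search(catalog, field_names=["discount", "percentage"])

# Finance data sets with records between January 1st and February 1st.
in_window = search(catalog, subject_area="finance",
                   window=(date(2016, 1, 1), date(2016, 2, 1)))
```

A real catalog would index this metadata rather than scan it, but the search criteria map onto the same three ideas: field names, subject area, and time window.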
It is in the self-service capability that best practices for the various types of metadata come into play. Business users are interested in the business metadata, such as the source systems, the frequency with which the data comes in, and the descriptions of the data sets or attributes. Users are also interested in knowing the technical metadata: the structure, format, and schema of the data.
When it comes to operational data, users want to see information about lineage, including when the data was ingested into the data lake, and whether it was raw at the time of ingestion. If the data was not raw when ingested, users should be able to see how it was created, and what other data sets were used to create it. Also important to operational data is the quality of the data. Users should be able to define certain rules about data quality, and use them to perform checks on the data sets.
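As a rough sketch of user-defined quality rules, the Python below treats each rule as a named predicate over a row and counts how many rows fail it. The helper names (`not_null`, `in_range`, `run_checks`) and the sample rows are assumptions made for illustration, not part of any specific tool.

```python
def not_null(field):
    """Build a rule that passes when the field is present and not None."""
    return lambda row: row.get(field) is not None


def in_range(field, low, high):
    """Build a rule that passes when the field lies within [low, high]."""
    return lambda row: row.get(field) is not None and low <= row[field] <= high


def run_checks(rows, rules):
    """Return a mapping of rule name to the number of rows that fail it."""
    return {
        name: sum(1 for row in rows if not rule(row))
        for name, rule in rules.items()
    }


# Hypothetical sample rows from a data set in the lake.
rows = [
    {"customer_id": 1, "discount": 10},
    {"customer_id": None, "discount": 25},
    {"customer_id": 3, "discount": 140},
]

# Rules a business user might define for this data set.
rules = {
    "customer_id is present": not_null("customer_id"),
    "discount between 0 and 100": in_range("discount", 0, 100),
}

failures = run_checks(rows, rules)
```

Keeping the rules as data (a name plus a predicate) means users can add or remove checks without changing the code that runs them.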
Users may also want to see the ingestion history. If a user is looking at streaming data, for example, they might search for days where no data came in, as a way of ensuring that those days are not included in the representative data sets for campaign analytics. Overall, access to lineage information, the ability to perform quality checks, and ingestion history give business users a good sense of the data, so they can quickly begin analytics.
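The search for days with no incoming data could look something like the following Python sketch. It assumes the ingestion history is available as a list of dates on which data arrived; the function name and sample dates are hypothetical.

```python
from datetime import date, timedelta


def days_without_data(ingested_days, start, end):
    """List every day in [start, end] on which nothing was ingested."""
    seen = set(ingested_days)
    gaps = []
    day = start
    while day <= end:
        if day not in seen:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


# Hypothetical ingestion history: one entry per batch that arrived.
ingested = [
    date(2016, 1, 1),
    date(2016, 1, 2),
    date(2016, 1, 4),
    date(2016, 1, 4),  # two batches on the same day
    date(2016, 1, 7),
]

gaps = days_without_data(ingested, date(2016, 1, 1), date(2016, 1, 7))
```

The resulting list of gap days can then be excluded when building a representative data set for campaign analytics.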
Controlling and Allowing Access
When providing various users (whether C-level executives, business analysts, or data scientists) with the tools they need, security is critical. Consistently setting and enforcing security policies is essential for successful use of a data lake. In-memory technologies should support different access patterns for each user group, depending on their needs. For example, a report generated for a C-level executive may be very sensitive, and should not be available to others who don’t have the same access privileges. In addition, you may have business users who want to use data in a low-latency manner because they are interacting with data in real time, with a BI tool; in this case, they need a speedy response. Data scientists may need more flexibility, with less governance; for this group, you might create a sandbox for exploratory work. By the same token, users in a company’s marketing department should not have access to the same data as users in the finance department. With security policies in place, users only have access to the data sets assigned to their privilege levels.
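To make the idea of role-based access concrete, here is a minimal Python sketch in which each role is granted a set of data sets and each user holds one or more roles. The role names, user names, and data set names are all hypothetical, and a production system would rely on the security framework of its platform rather than in-memory dictionaries.

```python
# Hypothetical roles, users, and data set names, for illustration only.
ROLE_GRANTS = {
    "marketing_analyst": {"campaign_results", "web_clicks"},
    "finance_analyst": {"general_ledger"},
    "executive": {"campaign_results", "general_ledger", "board_report"},
}

USER_ROLES = {
    "ana": {"marketing_analyst"},
    "raj": {"finance_analyst"},
    "kim": {"executive"},
}


def can_read(user, data_set):
    """True if any of the user's roles grants access to the data set."""
    return any(
        data_set in ROLE_GRANTS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )


def visible_data_sets(user):
    """Return the sorted names of every data set the user may see."""
    allowed = set()
    for role in USER_ROLES.get(user, set()):
        allowed |= ROLE_GRANTS.get(role, set())
    return sorted(allowed)
```

With these grants, a marketing user can read `campaign_results` but not `general_ledger`, and an unknown user sees nothing, which matches the default-deny behavior one would normally want.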
You may also use security features to enable users to interact with the data, and contribute to data preparation and enrichment. For example, as users find data in the data lake through the catalog, they can be allowed to clean up the data, and enrich the fields in a data set, in a self-service manner.
Access controls can also enable a collaborative approach for accessing and consuming the data. For example, if one user finds a data set that she feels is important to a project, and there are three other team members on that same project, she can create a workspace with that data, so that it’s shared, and the team can collaborate on enrichments.
Using a Bottom-Up Approach to Data Governance to Rank Data Sets
The bottom-up approach to data governance, discussed in Chapter 2, enables you to rank the usefulness of data sets by crowdsourcing. When users rate which data sets are the most valuable, word spreads to other users, who can then make productive use of that data. This way, you are creating a single source of truth from the bottom up, rather than the top down.
To do this, you need a rating and ranking mechanism as part of your integrated data lake management platform. The obvious place for this bottom-up, watermark-based governance model would be the catalog. Thus the catalog has to have rating functions.
But it’s not enough to show what others think of a data set. An integrated data lake management and governance solution should show users the rankings of the data sets from all users, but it should also offer a personalized data rating, so that each individual can see what they have personally found useful whenever they go to the catalog.
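One way to picture a catalog that holds both views is the small Python sketch below: it stores one rating per user per data set, ranks data sets by their average rating, and can return any individual's own ratings. The `RatingStore` class, the 1 to 5 star scale, and the sample names are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


class RatingStore:
    """Keeps one star rating per (data set, user) pair."""

    def __init__(self):
        self._ratings = defaultdict(dict)  # data set -> {user: stars}

    def rate(self, user, data_set, stars):
        if not 1 <= stars <= 5:
            raise ValueError("stars must be between 1 and 5")
        # A repeated rating by the same user replaces the earlier one.
        self._ratings[data_set][user] = stars

    def ranking(self):
        """Data set names ordered by average rating, best first."""
        averages = {ds: mean(r.values()) for ds, r in self._ratings.items()}
        # Ties are broken alphabetically so the order is deterministic.
        return sorted(averages, key=lambda ds: (-averages[ds], ds))

    def personal(self, user):
        """The ratings this particular user has given."""
        return {ds: r[user] for ds, r in self._ratings.items() if user in r}


store = RatingStore()
store.rate("ana", "customers", 5)
store.rate("raj", "customers", 4)
store.rate("ana", "web_clicks", 2)
store.rate("raj", "transactions", 5)

overall = store.ranking()
mine = store.personal("ana")
```

The overall ranking reflects everyone's ratings, while the personal view shows only what the signed-in user has rated.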
Users also need tools to create new data models out of existing data sets. For example, users should be able to take a customer data set and a transaction data set and create a “most valuable customer” data set by grouping customers by transactions, and figuring out when customers are generating the most revenue. Being able to do these types of enrichments and transformations is important from an end-to-end perspective.
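The "most valuable customer" example can be sketched in plain Python as a join of the two data sets followed by an aggregation. The column names (`customer_id`, `month`, `amount`) and the sample records are hypothetical, and defining "most valuable" as total revenue is one possible interpretation rather than the only one.

```python
from collections import defaultdict

# Hypothetical customer and transaction data sets.
customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
    {"customer_id": 3, "name": "Initech"},
]

transactions = [
    {"customer_id": 1, "month": "2016-01", "amount": 120.0},
    {"customer_id": 2, "month": "2016-01", "amount": 300.0},
    {"customer_id": 1, "month": "2016-02", "amount": 450.0},
    {"customer_id": 2, "month": "2016-02", "amount": 50.0},
]


def most_valuable_customers(customers, transactions):
    """Join customers to transactions and rank them by total revenue."""
    totals = defaultdict(float)
    by_month = defaultdict(lambda: defaultdict(float))
    for t in transactions:
        totals[t["customer_id"]] += t["amount"]
        by_month[t["customer_id"]][t["month"]] += t["amount"]

    rows = []
    for c in customers:
        cid = c["customer_id"]
        if cid not in totals:
            continue  # customers with no transactions are left out
        months = by_month[cid]
        rows.append({
            "customer_id": cid,
            "name": c["name"],
            "total_revenue": totals[cid],
            # The month in which this customer generated the most revenue.
            "peak_month": max(months, key=months.get),
        })
    return sorted(rows, key=lambda r: r["total_revenue"], reverse=True)


mvc = most_valuable_customers(customers, transactions)
```

The derived data set could then be registered back in the catalog so that other users can find and rate it.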
Published at DZone with permission of Ben Sharma, DZone MVB.