Share Data, Don't Extract Data

A big data expert discusses the basic concepts behind data sharing and why it has the potential to shake up more traditional data warehouses.

The advent of cloud-native data warehouses, such as Snowflake, is enabling fundamental changes in the way we construct and think about data warehouses and BI systems. For example, a core capability of Snowflake is Data Sharing. This lets any user of Snowflake grant access to secure, personalized views of their data to any other Snowflake user, even across different companies, without copying, preparing, extracting, downloading, or transmitting files of data.

This may sound simple, but it is a transformational advantage. It’s possible because you can think of Snowflake as a single global database. Everyone using Snowflake is actually using the same database; it’s only the access rules that keep each user’s data private to them. But with a few SQL commands or GUI clicks, these rules can be changed to enable what Snowflake calls Data Sharing.

When one user shares data with another in Snowflake, they are granting SELECT or other access rights on their existing data table to the other user. This has always been possible for users of the same database instance in traditional databases, but now everyone in the world is basically using the same database instance, so anyone can share with anyone else.
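
As a rough sketch of what those "few SQL commands" might look like, the Python below only assembles the statements as strings. It assumes Snowflake's `CREATE SHARE`, `GRANT ... TO SHARE`, `ALTER SHARE ... ADD ACCOUNTS`, and `CREATE DATABASE ... FROM SHARE` statements; in that syntax the grant goes to a share object rather than directly to the other user. Every share, database, table, and account name is hypothetical, and nothing here connects to a database.

```python
def build_provider_statements(share, database, schema, table, consumer_accounts):
    """Return the provider-side SQL statements, in order, as plain strings.

    All identifiers are hypothetical and are NOT validated or quoted here;
    this illustrates the sequence of grants, it is not a safe SQL builder.
    """
    accounts = ", ".join(consumer_accounts)
    return [
        # 1. Create an empty share object owned by the provider.
        f"CREATE SHARE {share};",
        # 2. Let the share see the database and schema that hold the table.
        f"GRANT USAGE ON DATABASE {database} TO SHARE {share};",
        f"GRANT USAGE ON SCHEMA {database}.{schema} TO SHARE {share};",
        # 3. Grant read access on the existing table; no data is copied.
        f"GRANT SELECT ON TABLE {database}.{schema}.{table} TO SHARE {share};",
        # 4. Name the consumer accounts that are allowed to use the share.
        f"ALTER SHARE {share} ADD ACCOUNTS = {accounts};",
    ]


def build_consumer_statement(local_database, provider_account, share):
    """Return the consumer-side statement that exposes the share as a database."""
    return f"CREATE DATABASE {local_database} FROM SHARE {provider_account}.{share};"


if __name__ == "__main__":
    # Hypothetical retailer (provider) sharing point-of-sale data with a supplier.
    for statement in build_provider_statements(
        "retail_share", "retail_db", "public", "pos_sales", ["cpg_account"]
    ):
        print(statement)
    print(build_consumer_statement("retail_data", "retailer_account", "retail_share"))
```

The point of the sketch is that every step is a change to access rules; none of the statements moves or copies the underlying rows.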

Many data warehouses consist of data sourced from inside a company itself, as well as data sourced from partners, suppliers, and paid data providers. For example, a manufacturer of consumer goods often gets a lot of its data from the retailers that sell its products. Without that data from retailers, the manufacturer doesn’t know who’s buying, or where, when, or what mix of its products is actually selling.

Before data sharing, this data needed to be extracted from the retailer, transmitted to the consumer packaged goods (CPG) company, and loaded into another database. This introduced considerable expense, delay, and inefficiency, and also precluded drilling into the lowest levels of data, which are too voluminous to extract and transmit. The process was so complex that the most sophisticated retailers just gave suppliers access to their own data warehouse and BI systems, with Walmart’s Retail Link being the biggest example.

With Data Sharing, nothing needs to be extracted, transmitted, loaded, or maintained, nor does a company need to incur the costs of their suppliers’ or partners’ BI. The owner of the data, known as the share’s Provider, just needs to share data. This is entirely free for the sharing company, as the other party, called the share’s Consumer, pays for any queries they run themselves. Taking out the cost and complexity of extracting and transmitting data, or the cost of a Retail Link-style supplier-facing BI system, can be a huge cost savings for the provider, and also gives share consumers better access to fresher and more detailed data than was previously possible.

Skipping the copying, extraction, and transmission of data to partners and customers sure makes things simpler. The recipient of the share, the consumer, can simply use the tables in their BI tool the same way they would their own. The data is just magically there — fresh, detailed, and ready for them to access on demand.

This does, however, push some additional responsibilities onto the BI layer, meaning tools such as Zoomdata. Without an ETL process loading inbound data, there isn’t an obvious place to do any needed transformations: the things that allow data from one organization to mesh seamlessly with data from another. Using the shared data as a source for further transformation would be an option, but it would require the data to be copied, destroying the freshness and efficiency benefits of using data sharing to begin with.

So, ideally, a BI tool can do some data alignment on the fly, while actually querying data, without ever needing to make or store any copies. In Zoomdata, we call these features multisource analysis, and we originally developed them to help align data from multiple data sources, for example, to join data from Snowflake to data from Hadoop. They are also what’s needed to help align a share consumer’s existing data with a share provider’s shared data. We’ve built a raft of features into Zoomdata to enable this, such as cross-source filtering, data fusion, and keysets for ad-hoc cohorting and set analysis. The trick is to align data as needed on the fly, at the time of each query, as opposed to through copy-creating transformation processes.
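
To make "aligning at query time" concrete, here is a minimal Python sketch. It is not how Zoomdata implements multisource analysis; it only illustrates the idea of reconciling keys while rows stream past, without writing a transformed copy anywhere. The two generator functions are hypothetical stand-ins for live queries against a provider's share and the consumer's own data, and the key formats are invented for the example.

```python
from collections import defaultdict


def query_shared_sales():
    """Hypothetical stand-in for a live query against the provider's shared table."""
    yield {"upc": "0001-2345", "store": "TX-01", "units": 12}
    yield {"upc": "0001-2345", "store": "CA-01", "units": 8}
    yield {"upc": "0009-9999", "store": "TX-01", "units": 5}
    yield {"upc": "0007-7777", "store": "NY-01", "units": 3}  # not one of our products


def query_own_products():
    """Hypothetical stand-in for a live query against the consumer's own data."""
    yield {"sku": "00012345", "brand": "Acme Soap"}
    yield {"sku": "00099999", "brand": "Acme Shampoo"}


def normalize_key(code):
    """Align the two organizations' product codes (assumed to differ only by dashes)."""
    return code.replace("-", "").strip()


def units_by_brand():
    """Join the two sources on the normalized key and aggregate, all at query time."""
    # The small side is held in memory only for the duration of this query.
    brand_by_key = {normalize_key(p["sku"]): p["brand"] for p in query_own_products()}
    totals = defaultdict(int)
    # The shared rows are streamed and aggregated; no copy of them is stored.
    for row in query_shared_sales():
        brand = brand_by_key.get(normalize_key(row["upc"]))
        if brand is not None:
            totals[brand] += row["units"]
    return dict(totals)


if __name__ == "__main__":
    print(units_by_brand())  # {'Acme Soap': 20, 'Acme Shampoo': 5}
```

Because the key normalization runs inside the query, a change in the provider's data shows up in the next result without any reload step.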

Taking this one step further, BI tools of the future can also help users discover data that may be useful to them and may align with their existing data, but that they may not even know exists. This can range from something as simple as showing users a list of available Data Shares that may be relevant to their at-hand analysis, to sophisticated approaches such as auto-suggesting and robotic back-testing across thousands of shared sources to automatically determine which shares could help provide lift to a machine learning or marketing algorithm, or provide additional alpha to a stock trading system.
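
The simple end of that range could look something like the Python sketch below, which ranks candidate shares by how many of the user's own join keys they cover. This is only one possible relevance heuristic, chosen for illustration; the share names, keys, and the 0.5 threshold are all hypothetical, and it says nothing about back-testing or lift.

```python
def key_overlap(own_keys, share_keys):
    """Fraction of our own keys that also appear in the candidate share."""
    own = set(own_keys)
    if not own:
        return 0.0
    return len(own & set(share_keys)) / len(own)


def rank_shares(own_keys, shares, min_overlap=0.5):
    """Return (share name, overlap) pairs at or above the threshold, best first."""
    scored = [(name, key_overlap(own_keys, keys)) for name, keys in shares.items()]
    relevant = [(name, score) for name, score in scored if score >= min_overlap]
    return sorted(relevant, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical store identifiers from the user's current analysis.
    own_store_ids = ["TX-01", "TX-02", "CA-01", "NY-01"]
    # Hypothetical catalog of available shares and the keys each one contains.
    available_shares = {
        "weather_by_store": ["TX-01", "TX-02", "CA-01", "NY-01", "WA-01"],
        "foot_traffic": ["TX-01", "CA-01"],
        "commodity_prices": ["WHEAT", "CORN"],
    }
    for name, score in rank_shares(own_store_ids, available_shares):
        print(f"{name}: {score:.2f}")
```

A share that covers none of the user's keys, like the commodity prices example here, simply drops out of the list.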

BI has always been the place where business users and data come together. Data sharing, through databases like Snowflake, enables BI users to tap into not only their own data, but also data from other providers and other companies. We at Zoomdata envision a future where your BI tool is your portal into all of the world’s data, nicely indexed, searchable, and immediately aligned and compatible with your own data through a few simple clicks. Snowflake Data Sharing provides the underpinnings to make that possible, and we’re tightly integrating data sharing into Zoomdata to help make it a reality.

Topics:
data sources, big data, data warehouses, bi, data sharing

Published at DZone with permission of

