Connected Data Ponds: The Evolution of Data Lakes
This article introduces the concept of data pools: free-flowing data stores distributed across an organization's data centers and clouds. Sometimes segments of your data must live in separate locations for security, performance, usage, privacy, or cost reasons.
A lot has been said about Data Lakes over the past five years. The call to action from our industry to customers was to take all of your data-at-rest in databases and warehouses, add to it the data-in-motion from everything in your ecosystem, and store the resulting terabytes and petabytes in a Data Lake in order to get a complete 360-degree view of the world and the best value from your data. Collect it all.
The Data Lake metaphor arose because a 'lake' neatly captures one of the basic tenets of Big Data: collect all of the data in the ecosystem so that it can be analyzed for pertinent patterns using all kinds of analysis, including autonomous machine learning. After all, one of the basic tenets of data science is that the more data you can get, the better your analysis will ultimately be.
Certainly, the Data Lake concept was a good evolution over its precursors: warehouses, marts, and so on. However, given the inherently distributed nature of data across data centers and clouds, and today's always-connected IoT world where data comes from billions of things that can connect over anything from GPS to cellular to Wi-Fi networks, we need to advance our thinking.
Today, we need to think in terms of interconnected Data Pools, or a Connected Data Architecture: pools in clouds that connect to pools in our data centers, which in turn flow to pools at the edge and everywhere in between. All data connected and flowing, all the time.
So what is the definition of a Data Pool and a Connected Data Architecture? Just as water in nature moves naturally between pools, data needs to move freely from one pool to another based on need, and to do this the pools all need to be connected to each other at all times.
Beyond the basic metaphor, Data Pools need to ensure that connected data can flow freely to wherever it is optimal for the business to analyze it and extract value. In short, any data needs to be able to connect to any other data to get the most value from analysis.
Which means your Connected Data Architecture needs to transcend vendors and platforms.
Which means your Connected Data Architecture must provide connectivity between clouds (such as Azure, AWS, Google, ...), between data centers and other on-premises pools (running Linux, Docker, or virtual services), to and from micro-storage pools (on a car or sensor), and, of course, data lakes.
Beyond this, the most common criteria also include:
Privacy: Keep certain data inside the firewall and transmit only data elements that are not personally identifiable information (PII) to the other connected data sets.
Legislative or political: Data might have to be kept in the locale where it was created.
Governance: Data of a certain type might be allowed on the public cloud, private cloud, accessible by only certain people and so on.
Performance: Keep data close to the decision maker or the place where it needs to be analyzed, especially in low-bandwidth situations. For instance, data from a particular oil well with low connectivity to shore, or from a moving car, might be retained and analyzed locally so it can be acted on instantly.
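As a hypothetical illustration of how these criteria could drive data placement, the sketch below routes each record to a pool by applying the rules in priority order and strips PII before anything leaves the firewall. The pool names, record fields, and policy rules are all invented for this sketch; they are not part of any particular product or standard.

```python
# Hypothetical sketch: policy-based routing of records to data pools.
# Pool names, record fields, and rules are invented for illustration.

PII_FIELDS = {"name", "email", "ssn"}  # fields never sent outside the firewall

def strip_pii(record):
    """Return a copy of the record with PII fields removed (privacy criterion)."""
    return {k: v for k, v in record.items() if k not in PII_FIELDS}

def choose_pool(record):
    """Apply placement criteria in priority order."""
    # Legislative/political: data stays in the locale where it was created.
    if record.get("locale") == "EU":
        return "eu-private-pool"
    # Performance: low-connectivity sources keep data at the edge.
    if record.get("source") in {"oil-well-7", "vehicle-sensor"}:
        return "edge-pool"
    # Governance/default: anything else may go to the public cloud.
    return "public-cloud-pool"

def route(record):
    """Pick a pool; strip PII if the record is leaving the firewall."""
    pool = choose_pool(record)
    if pool == "public-cloud-pool":
        record = strip_pii(record)
    return pool, record

pool, payload = route({"name": "Ada", "locale": "US", "reading": 42})
print(pool, payload)  # public-cloud-pool {'locale': 'US', 'reading': 42}
```

In a real Connected Data Architecture this decision would of course be enforced by the data-movement layer itself rather than by application code, but the priority ordering of the criteria is the essential design choice either way.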
Published at DZone with permission of , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.