Data Virtualization — The Evolution of the Data Lake
Go in-depth on data virtualization.
The Birth of the Data Lake
The idea of the traditional data center being centered on relational database technology is quickly evolving. The adoption of big data is causing a paradigm shift in the IT industry that rivals the release of relational databases and SQL in the early 80s. We are seeing unprecedented growth in data volume.
This growth is the result of the myriad of new data sources that have been created within the last 10 years. Things like machine sensors that collect data from everything from your car to your blender, medical devices, RFID readers, web logs, and especially social media are generating terabytes of data every day. This new “smart data” can provide tremendous business value if it can be mined and analyzed.
Data Is Everywhere!
The problem with all this new data is that the majority of it is unstructured. Storing and analyzing it has far exceeded the capacity of the traditional RDBMS. For businesses, the challenge was to find a way to incorporate these unstructured sources of data with their traditional business data, such as customer and sales information.
This would provide a 360-degree view of their customers' buying habits. Additionally, it would help a company make more targeted strategic decisions on how to increase business. This dilemma produced the concept of the data lake. A data lake is essentially a large holding area for raw data. Data lakes are low cost, highly scalable, able to support extremely large data volumes, and able to accept data in its native raw format from a wide variety of data sources.
The repository of choice has primarily been Hadoop. Hadoop allows you to store combinations of both structured and unstructured data. It is essentially a distributed file system paired with a parallel processing framework, which allows you to process large amounts of data in a timely fashion. The data can be analyzed via different methods, such as MapReduce, Hive (SQL), and, more recently, Apache Spark.
Typical Data Lake Architecture
Data Lake vs Data Warehouse
It is important to understand the difference between data lakes and data warehouses. A data warehouse is highly structured. Much effort goes into developing schemas and hierarchies before any data is loaded into the warehouse. A data lake, by contrast, imposes no hierarchy or structure on the way data is stored. The structure is applied afterward, when the data is read, so multiple schemas can be applied to the same data in a data lake.
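To make the idea of applying structure afterward concrete, here is a minimal sketch in Python. The raw events, the field names, and the helper `read_with_schema` are all hypothetical and exist only for illustration; a real data lake would use an engine such as Hive or Spark rather than hand-written parsing.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (one JSON document per line).
RAW_EVENTS = """\
{"device": "car-17", "ts": "2019-06-01T10:00:00", "speed_kmh": 82, "customer_id": 5}
{"device": "blender-3", "ts": "2019-06-01T10:05:00", "rpm": 12000, "customer_id": 9}
{"device": "car-17", "ts": "2019-06-01T10:10:00", "speed_kmh": 64, "customer_id": 5}
"""


def read_with_schema(raw_text, fields):
    """Project each raw record onto the requested fields at read time.

    Fields that a record does not contain come back as None.
    """
    rows = []
    for line in raw_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        rows.append({name: record.get(name) for name in fields})
    return rows


# Two different schemas applied to the same stored data.
usage_view = read_with_schema(RAW_EVENTS, ["customer_id", "device"])
telemetry_view = read_with_schema(RAW_EVENTS, ["device", "ts", "speed_kmh"])
```

Nothing about the stored data changes between the two reads. Only the schema chosen by the reader differs, which is the opposite of a warehouse, where the schema is fixed before loading.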
Shortcomings of the Traditional Data Lake
This concept of merging all your data sources into a common repository has caused challenges for many organizations. The first is constant data replication. The main repository has to be kept in sync with the local data sources, which typically means running numerous ETL processes.
There is a high potential for data inconsistencies, because the data is only as current as the last sync point. The second issue is that as your data lake grows, you may have new groups of analysts looking for different views of the data. This results in unnecessary duplication of data.
The third and possibly biggest challenge is data security and governance. In addition, GDPR restricts where some data may be located. Sensitive data often cannot be moved into the cloud or into a centralized repository; it has to remain in its native location. This limits your ability to use that data for analytics.
Data virtualization can be the solution to overcoming the shortcomings of a centralized repository. Let's start with an understanding of what exactly data virtualization is. Data virtualization is the ability to view, access, and analyze data without the need to know its location. It can integrate data sources across multiple data types and locations, turning them into a single logical view without any data replication or movement.
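As a rough illustration of a single logical view without replication, consider the following Python sketch. The two in-memory lists stand in for sources that stay in their native location, and the `VirtualView` class and its catalog are hypothetical names invented for this example, not part of any data virtualization product.

```python
# Two hypothetical sources that stay where they are: a CRM table and a web log.
crm_customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
web_clicks = [
    {"customer_id": 1, "page": "/pricing"},
    {"customer_id": 1, "page": "/signup"},
]


class VirtualView:
    """A logical view that joins sources on demand and stores no copy."""

    def __init__(self, catalog):
        # catalog maps a logical name to a zero-argument function that reads the source
        self._catalog = catalog

    def customer_activity(self):
        customers = self._catalog["customers"]()
        clicks = self._catalog["clicks"]()
        pages_by_customer = {}
        for click in clicks:
            pages_by_customer.setdefault(click["customer_id"], []).append(click["page"])
        return [
            {"name": c["name"], "pages": pages_by_customer.get(c["customer_id"], [])}
            for c in customers
        ]


view = VirtualView({
    "customers": lambda: crm_customers,
    "clicks": lambda: web_clicks,
})

before = view.customer_activity()
web_clicks.append({"customer_id": 2, "page": "/docs"})  # the source changes
after = view.customer_activity()  # the view reflects the change on the next read
```

Because the view reads the sources each time it is queried, there is no sync point to fall behind, and the caller never needs to know where `customers` or `clicks` physically live.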
The benefits of data virtualization:
- Reduction of errors via increased data accuracy.
- Lower resource consumption, because fewer ETL processes are run.
- Data can be categorized.
- Governance rules can be enforced.
- Lower disk requirements.
- Easier to share data across organizations.
Data Virtualization vs Federation
It is important to understand the difference between data virtualization and data federation.
Data federation is the technology that allows you to logically map remote data sources and execute distributed queries against those multiple sources from a single location. Data virtualization is a platform that allows end users to retrieve and manipulate data without knowing any technical details about the data, such as how it is formatted or where it is physically located.
It gives end users a self-service data mart of multiple data sources that can be joined into a single customer view.
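The distinction can be sketched with Python's built-in `sqlite3` module. This is only an analogy under stated assumptions: two local SQLite files stand in for remote sources, and the database, table, and view names (`crm`, `sales`, `customer_sales`) are hypothetical. The `ATTACH DATABASE` step plays the role of federation, and the temporary view plays the role of the virtualization layer that hides source details from the end user.

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
crm_path = os.path.join(workdir, "crm.db")
sales_path = os.path.join(workdir, "sales.db")

# Build two independent databases to stand in for remote sources.
crm = sqlite3.connect(crm_path)
crm.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
crm.commit()
crm.close()

sales = sqlite3.connect(sales_path)
sales.execute("CREATE TABLE orders (customer_id INTEGER, amount_cents INTEGER)")
sales.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1000), (1, 550), (2, 2000)])
sales.commit()
sales.close()

# Federation: map both sources into one session and run a distributed query.
hub = sqlite3.connect(":memory:")
hub.execute("ATTACH DATABASE ? AS crm", (crm_path,))
hub.execute("ATTACH DATABASE ? AS sales", (sales_path,))
federated = hub.execute(
    "SELECT c.name, SUM(o.amount_cents) "
    "FROM crm.customers AS c JOIN sales.orders AS o ON o.customer_id = c.id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()

# Virtualization: a logical view hides where the data lives and how it is joined.
hub.execute(
    "CREATE TEMP VIEW customer_sales AS "
    "SELECT c.name AS name, SUM(o.amount_cents) AS total_cents "
    "FROM crm.customers AS c JOIN sales.orders AS o ON o.customer_id = c.id "
    "GROUP BY c.name"
)
virtual = hub.execute(
    "SELECT name, total_cents FROM customer_sales ORDER BY name"
).fetchall()
```

In the federated query the author has to know that customers live in `crm` and orders live in `sales`. The query against `customer_sales` needs neither fact, which is the self-service experience described above.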