An Introduction to Data Virtualization and Its Use Cases


We explore why data virtualization could prove to be the essential solution for your data and integration needs.

· Big Data Zone ·

Data Virtualization

Data virtualization addresses several common data and integration problems, and the market for these solutions is booming, with strong year-over-year growth. But let's start with a definition first.


Data virtualization is the process of inserting a data access layer between data sources and data consumers to simplify access. In practice, the tool acts as a kind of SQL engine that can query very heterogeneous data sources, ranging from traditional SQL databases to text or PDF files, or a streaming source like Kafka. In short, you have data, you can query it, and you can join across these sources. You can therefore offer a unified and complete view of the data, even if it is "exploded" across several systems. On top of that, a cache and a query optimizer minimize the performance impact on the source systems. And, of course, a data catalog helps you find your way through all the data in your IT infrastructure. From this we can deduce two main use cases.
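To make the idea concrete, here is a minimal sketch in Python of what the virtualization layer does conceptually: it exposes one SQL interface over two heterogeneous "systems," a relational table and a CSV file. The table and column names are illustrative, not from any product, and a real engine would query the CSV lazily rather than copying it.

```python
import csv
import io
import sqlite3

# A CSV string standing in for a flat file on disk (one "exotic" source).
CSV_SOURCE = "id,name\n1,Alice\n2,Bob\n"

# An in-memory SQL database standing in for a traditional RDBMS source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 120.0), (1, 80.0), (2, 35.5)])

# Surface the non-SQL source as a queryable table.
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
for row in csv.DictReader(io.StringIO(CSV_SOURCE)):
    conn.execute("INSERT INTO customers VALUES (?, ?)",
                 (row["id"], row["name"]))

# One unified query joining both "systems".
rows = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('Alice', 200.0), ('Bob', 35.5)]
```

The consumer sees a single relational view; where each row physically lives is the layer's problem, not the consumer's.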

Integration Use Cases

This is the first use case that inevitably comes to mind. Consider a large bank that had somewhat missed its "digital" shift and was struggling to offer its customers a portal containing all their relevant data. Old, heterogeneous, even exotic systems, unable to support thousands of simultaneous customer connections, were this bank's bread and butter. With a data virtualization solution, exposing data is about as fast as writing a simple DAO class in Java with Hibernate: you add connections, find your data with the data catalog, write your queries, expose them as an API, and you have everything you need. It's that simple. Building an API, if you know where to find the data, literally takes five minutes. Of course, you will not implement inserts and updates this way, but remember that reads make up roughly 80 percent of requests on a client portal, and the CQRS pattern does not exist for nothing! You will implement your inserts and updates via APIs that call the existing transactional systems.
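The "five-minute read API" above can be sketched as a single read-only handler over the virtual layer. This is a hypothetical illustration (the table, column, and function names are invented): in a real deployment the function would sit behind a route in your web framework, and writes would go through the existing transactional services, per CQRS.

```python
import json
import sqlite3

# Stand-in for the virtual layer: one queryable source of account data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (customer_id INTEGER, iban TEXT, balance REAL)")
conn.execute("INSERT INTO accounts VALUES (42, 'FR76...', 1500.0)")

def get_accounts(customer_id: int) -> str:
    """Read-only endpoint: query the virtual layer, serialize, done."""
    rows = conn.execute(
        "SELECT iban, balance FROM accounts WHERE customer_id = ?",
        (customer_id,),
    ).fetchall()
    return json.dumps([{"iban": i, "balance": b} for i, b in rows])

print(get_accounts(42))  # [{"iban": "FR76...", "balance": 1500.0}]
```

Note there is no extraction step anywhere: the endpoint queries the sources in place.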

This saves you a lot of time: there is no need to extract the data via an ETL or change data capture pipeline, which would require you to understand all of the data up front.

The Data Use Cases

Two typical use cases are the virtual data warehouse and the virtual data lake.

Concerning the virtual data warehouse, it can be set up much faster than a traditional one. With a traditional data warehouse, you need to build many ETL flows, and when a new business need requires a new flow, you write a spec, send it off, wait for the developers to finish, then test. Either way, you have lost a month. With a virtual data warehouse, you write your query, and that's it!
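The contrast can be shown in a few lines: in a virtual data warehouse, a new business need is answered with a new query or view over the existing sources, not a new ETL flow. A minimal sketch, with invented table names:

```python
import sqlite3

# Stand-in for data already reachable through the virtual layer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 10.0), ("north", 5.0), ("south", 7.0)])

# New business need: revenue per region. In the virtual approach this is
# one new view -- minutes of work, not a month-long ETL delivery.
conn.execute("""
    CREATE VIEW revenue_by_region AS
    SELECT region, SUM(amount) AS revenue FROM sales GROUP BY region
""")
rows = conn.execute(
    "SELECT * FROM revenue_by_region ORDER BY region").fetchall()
print(rows)  # [('north', 15.0), ('south', 7.0)]
```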

The virtual data lake use case, in turn, lets you consolidate data whatever its origin. It spares you many questions about how the data is integrated and makes it easy to expose the data in a clean, well-organized form. Many data lake projects have been lost in a V-model approach, where producing that organized data takes one to two years.
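Consolidation regardless of origin can be sketched as a unified record stream over sources in different formats, here a JSON feed and a CSV extract. This is a toy illustration with invented source names; a real virtual data lake would read the sources in place rather than from strings.

```python
import csv
import io
import json

JSON_SOURCE = '[{"id": 1, "source": "crm"}]'  # stands in for an API feed
CSV_SOURCE = "id,source\n2,erp\n"             # stands in for a file extract

def unified_records():
    """Yield records in one common shape, whatever the original format."""
    for rec in json.loads(JSON_SOURCE):
        yield {"id": int(rec["id"]), "source": rec["source"]}
    for rec in csv.DictReader(io.StringIO(CSV_SOURCE)):
        yield {"id": int(rec["id"]), "source": rec["source"]}

records = list(unified_records())
print(records)  # [{'id': 1, 'source': 'crm'}, {'id': 2, 'source': 'erp'}]
```

Consumers see one homogeneous dataset; format differences are absorbed at the access layer instead of during a year-long integration project.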

Finally, via the data catalog, you can offer data access to business users, business analysts, data scientists, and BI experts. In short, you are democratizing data access.
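At its simplest, the data catalog is a searchable index of datasets: names, owning sources, and tags that let an analyst find data before writing a single query. A hypothetical sketch (the entries are invented):

```python
# Toy catalog: each entry describes a dataset reachable via the virtual layer.
CATALOG = [
    {"name": "customers", "source": "crm_db", "tags": ["client", "gdpr"]},
    {"name": "orders", "source": "erp_db", "tags": ["sales"]},
    {"name": "clickstream", "source": "kafka", "tags": ["events", "web"]},
]

def search(term: str) -> list:
    """Return dataset names whose name or tags match the search term."""
    term = term.lower()
    return [d["name"] for d in CATALOG
            if term in d["name"] or term in d["tags"]]

print(search("sales"))   # ['orders']
print(search("client"))  # ['customers']
```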


Opinions expressed by DZone contributors are their own.
