Over a million developers have joined DZone.

What Is Data Extraction?

DZone's Guide to

What Is Data Extraction?

You can't do data science without first extracting and ingesting your data. But what does this process look like? Read on to find out!

· Big Data Zone ·
Free Resource

The open source HPCC Systems platform is a proven, easy to use solution for managing data at scale. Visit our Easy Guide to learn more about this completely free platform, test drive some code in the online Playground, and get started today.

Data Extraction Defined

Data extraction is a process that involves the retrieval of data from various sources. Frequently, companies extract data in order to process it further, migrate the data to a data repository (such as a data warehouse or a data lake) or to further analyze it. It's common to transform the data as a part of this process. For example, you might want to perform calculations on the data — such as aggregating sales data — and store those results in the data warehouse. If you are extracting the data to store it in a data warehouse, you might want to add additional metadata or enrich the data with timestamps or geolocation data. Finally, you likely want to combine the data with other data in the target data store. These processes, collectively, are called ETL, or Extraction, Transformation, and Loading. Extraction is the first key step in this process.

How Is Data Extracted?

Structured Data

If the data is structured, the data extraction process is generally performed within the source system. It's common to perform data extraction using one of the following methods:

  • Full extraction. Data is completely extracted from the source, and there is no need to track changes. The logic is simpler, but the system load is greater.
  • Incremental extraction. Changes in the source data are tracked since the last successful extraction so that you do not go through the process of extracting all the data each time there is a change. To do this, you might create a change table to track changes, or check timestamps. Some data warehouses have change data capture (CDC) functionality built in. The logic for incremental extraction is more complex, but the system load is reduced.

Unstructured Data

When you work with unstructured data, a large part of your task is to prepare the data in such a way that it can be extracted. Most likely, you will store it in a data lake until you plan to extract it for analysis or migration. You'll probably want to clean up "noise" from your data by doing things like removing whitespace and symbols, removing duplicate results, and determining how to handle missing values.

Data Extraction Challenges

Usually, you extract data in order to move it to another system or for data analysis (or both). If you intend to analyze it, you are likely performing ETL so that you can pull data from multiple sources and run analysis on it together. The challenge is ensuring that you can join the data from one source with the data from other sources so that they play well together. This can require a lot of planning, especially if you are bringing together data from structured and unstructured sources.

Another challenge with extracting data is security. Often some of your data contains sensitive information. It may, for example, contain PII (personally identifiable information), or other information that is highly regulated. You may need to remove this sensitive information as a part of the extraction, and you will also need to move all of your data securely. For example, you may want to encrypt the data in transit as a security measure.

Types of Data Extraction Tools

Batch processing tools: Legacy data extraction tools consolidate your data in batches, typically during off-hours to minimize the impact of using large amounts of compute power. For closed, on-premise environments with a fairly homogeneous set of data sources, a batch extraction solution may be a good approach.

Open source tools: Open source tools can be a good fit for budget-limited applications, assuming the supporting infrastructure and knowledge is in place. Some vendors offer limited or "light" versions of their products as open source as well.

Cloud-based tools: Cloud-based tools are the latest generation of extraction products. Generally the focus is on the real time extraction of data as part of an ETL/ELT process and cloud-based tools excel in this area, helping take advantage of all the cloud has to offer for data storage and analysis. These tools also take the worry out of security and compliance as today's cloud vendors continue to focus on these areas, removing the need for developing this expertise in-house.

Managing data at scale doesn’t have to be hard. Find out how the completely free, open source HPCC Systems platform makes it easier to update, easier to program, easier to integrate data, and easier to manage clusters. Download and get started today.

big data ,data extraction ,unstructured data

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}