What Is Data Transformation?
A look at data transformation and how it helps developers convert data from one format or structure into another.
Data Transformation Defined
Data transformation is the process of converting data from one format or structure into another. It is critical to activities such as data integration and data management. Depending on the needs of your project, it can include a range of activities: converting data types, cleansing data by removing nulls or duplicates, enriching the data, or performing aggregations.
Typically, the process involves two stages.
In the first stage, you:
- Perform data discovery where you identify the sources and data types.
- Determine the structure and data transformations that need to occur.
- Perform data mapping to define how individual fields are mapped, modified, joined, filtered, and aggregated.
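As a rough illustration of the data mapping step, a mapping can be written down as a small table that pairs each target field with a source field and a conversion. The field names (`EmpID`, `Name`, `Salary`) and the `apply_mapping` helper below are hypothetical, chosen only for this sketch:

```python
# Hypothetical mapping spec: target field -> (source field, conversion function).
FIELD_MAP = {
    "employee_id": ("EmpID", int),
    "full_name": ("Name", str.strip),
    "salary": ("Salary", float),
}


def apply_mapping(record, field_map):
    """Build a target record by renaming and converting each mapped source field."""
    return {
        target: convert(record[source])
        for target, (source, convert) in field_map.items()
    }


# Invented source row, with every value arriving as text.
source_row = {"EmpID": "42", "Name": "  Ada Lovelace ", "Salary": "1000.50"}
mapped_row = apply_mapping(source_row, FIELD_MAP)
print(mapped_row)
```

Keeping the mapping as data rather than scattering it through code makes it easier to review against what was found during data discovery.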
In the second stage, you:
- Extract data from the original source. The range of sources can vary, including structured sources, like databases, or streaming sources, such as telemetry from connected devices, or log files from customers using your web applications.
- Perform transformations. You apply the mapped operations to the data, such as aggregating sales data, converting date formats, editing text strings, or joining rows and columns.
- Send the data to the target store. The target might be a database or a data warehouse that handles structured and unstructured data.
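The three steps of the second stage can be sketched with the Python standard library. This is a minimal sketch under stated assumptions: the CSV text stands in for a real source, an in-memory SQLite database stands in for the target store, and the column names and the `extract`, `transform`, and `load` functions are invented for the example.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Stand-in for a real source; the columns are invented for this example.
RAW_CSV = """region,sale_date,amount
east,01/15/2024,100.00
west,01/16/2024,250.50
east,01/17/2024,49.50
"""


def extract(text):
    """Extract: read rows from CSV text as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert MM/DD/YYYY dates to ISO 8601 and amounts to floats."""
    result = []
    for row in rows:
        iso_date = datetime.strptime(row["sale_date"], "%m/%d/%Y").date().isoformat()
        result.append((row["region"], iso_date, float(row["amount"])))
    return result


def load(rows, conn):
    """Load: write the transformed rows to the target table."""
    conn.execute("CREATE TABLE sales (region TEXT, sale_date TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # assumed target store for the sketch
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT region, sale_date, amount FROM sales ORDER BY sale_date"
).fetchall()
print(loaded)
```

Separating the steps into three functions mirrors the list above and lets each one be tested or swapped out on its own.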
Why Transform Data?
You might want to transform your data for a number of reasons. Generally, businesses want to transform data to make it compatible with other data, move it to another system, join it with other data, or aggregate information in the data.
For example, consider the following scenario: your company has purchased a smaller company, and you need to combine information for the Human Resources departments. The purchased company uses a different database than the parent company, so you'll need to do some work to ensure that these records match. Each of the new employees has been issued an employee ID, so this can serve as a key. However, you'll need to change the formatting for the dates, remove any duplicate rows, and ensure that there are no null values in the Employee ID field so that all employees are accounted for. All of these critical functions are performed in a staging area before you load the data to the final target.
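The cleanup described in this scenario might look something like the following. It is a sketch only: the sample records, the field names, the source date format (MM/DD/YYYY), and the `clean` helper are all assumptions made for illustration, and a real staging area would typically live in a database rather than in Python lists.

```python
from datetime import datetime

# Invented sample records from the acquired company's system.
acquired = [
    {"employee_id": "1001", "name": "Ana", "hire_date": "03/15/2019"},
    {"employee_id": "1001", "name": "Ana", "hire_date": "03/15/2019"},  # duplicate row
    {"employee_id": None, "name": "Ben", "hire_date": "07/01/2020"},  # null key
    {"employee_id": "1002", "name": "Cy", "hire_date": "11/30/2021"},
]


def clean(records):
    """Return (cleaned, rejected): rows ready to load and rows missing an ID."""
    cleaned, rejected, seen = [], [], set()
    for rec in records:
        if not rec["employee_id"]:
            rejected.append(rec)  # set aside for follow-up rather than dropping silently
            continue
        row_key = tuple(sorted(rec.items()))
        if row_key in seen:
            continue  # exact duplicate of a row already kept
        seen.add(row_key)
        iso_date = datetime.strptime(rec["hire_date"], "%m/%d/%Y").date().isoformat()
        cleaned.append({**rec, "hire_date": iso_date})
    return cleaned, rejected


cleaned, rejected = clean(acquired)
print(cleaned)
print(rejected)
```

Rows with a missing employee ID are returned separately here so that someone can resolve them; whether to reject, repair, or hold such rows is a project decision.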
Other common reasons to transform data include:
- You are moving your data to a new data store; for example, you are moving to a cloud data warehouse and you need to change the data types.
- You want to join unstructured data or streaming data with structured data so you can analyze the data together.
- You want to add information to your data to enrich it, such as performing lookups, adding geolocation data, or adding timestamps.
- You want to perform aggregations, such as totaling sales by region or comparing sales across regions.
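Two of the reasons above, enrichment through a lookup and aggregation by region, can be illustrated together. The store codes, the `REGION_BY_STORE` lookup table, and the helper functions are hypothetical and exist only for this sketch:

```python
from collections import defaultdict

# Hypothetical lookup table used to enrich each sale with its region.
REGION_BY_STORE = {"S1": "east", "S2": "west"}

# Invented sales records.
sales = [
    {"store": "S1", "amount": 100.0},
    {"store": "S2", "amount": 250.0},
    {"store": "S1", "amount": 50.0},
]


def enrich(rows, lookup):
    """Add a 'region' field to each row; unmatched stores become 'unknown'."""
    return [{**row, "region": lookup.get(row["store"], "unknown")} for row in rows]


def total_by_region(rows):
    """Sum the 'amount' field for each region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


totals = total_by_region(enrich(sales, REGION_BY_STORE))
print(totals)
```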
How Is Data Transformed?
There are a few different ways to transform data:
- Scripting. Some companies write scripts in SQL or Python to extract and transform the data.
- On-premises ETL tools. ETL (Extract, Transform, Load) tools can take much of the pain out of scripting the transformations by automating the process. These tools are typically hosted on your company's site, and may require extensive in-house expertise and significant infrastructure spending.
- Cloud-based ETL tools. These ETL tools are hosted in the cloud, where you can leverage the expertise and infrastructure of the vendor.
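To give a feel for the scripting approach, the sketch below runs a SQL transformation from Python using the standard library's `sqlite3` module. The `raw_orders` and `orders` tables and their columns are invented for the example, and SQLite is used only because it needs no setup; a real script would target your own database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the sketch

# Create invented raw data, then transform it with plain SQL:
# trim and lowercase customer names, and cast text amounts to numbers.
conn.executescript("""
CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount TEXT);
INSERT INTO raw_orders VALUES
    (1, ' alice ', '10.50'),
    (2, 'BOB', '4.25'),
    (3, 'Alice', '2.00');

CREATE TABLE orders AS
SELECT order_id,
       lower(trim(customer)) AS customer,
       CAST(amount AS REAL)  AS amount
FROM raw_orders;
""")

# Aggregate the cleaned data per customer.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)
```

ETL tools automate this kind of hand-written logic, along with the scheduling, monitoring, and error handling around it.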
Data Transformation Challenges
Data transformation can be difficult for a number of reasons:
- Time-consuming. You may need to extensively cleanse the data so you can transform or migrate it. This can be extremely time-consuming, and is a common complaint amongst data scientists working with unstructured data.
- Costly. Depending on your infrastructure, transforming your data may require a team of experts and substantial infrastructure costs.
- Slow. Because the process of extracting and transforming data can be a burden on your system, it is often done in batches, which means you may have to wait up to 24 hours for the next batch to be processed. This can cost you time in making business decisions.
Published at DZone with permission of Garrett Alley, DZone MVB. See the original article here.