
Data Transformation Tips for the Trenches

Here are some quick and dirty data transformation techniques you might find useful. I'm not much into exposition, so let's get started.

By Brad Hanks · Updated Nov. 23, 2022 · Tutorial


Depending on the application, data transformation serves different purposes. In regression problems, it can help prevent overfitting; in classification, it can significantly improve model accuracy.

In contrast, scaling does not always improve the performance of unsupervised techniques such as k-means clustering. Scaling can still be beneficial in this scenario, since it reduces data sparsity.

Techniques for Transformation

Which transformation technique is appropriate depends on the data and the application at hand:

Importance-Based Transformation

One of the most important decisions when transforming variables is choosing which variables to transform. You can do this automatically using feature-selection methods such as the variance inflation factor (VIF) test or a decision tree/random forest model's feature importances.
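
To make this concrete, here is a minimal sketch of importance-based selection using a random forest's feature importances. The data frame and column names are made up for illustration, and scikit-learn is assumed to be available:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for a real data set; the columns are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 65, 500),
    "income": rng.lognormal(10, 1, 500),
    "tenure": rng.integers(0, 30, 500),
})
df["target"] = 0.002 * df["income"] + 3 * df["tenure"] + rng.normal(0, 5, 500)

X, y = df[["age", "income", "tenure"]], df["target"]
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Rank predictors by importance; focus transformation effort on the top ones.
print(pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False))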

Variable Distribution

The type of output variable dictates the kind of model, the choice of loss function, and the performance metric (e.g., a binary class label versus probability values).

Importance of interpretation: if the analysis focuses on understanding industries, groups, or cases rather than on prediction, then it makes more sense to keep the variables in their original form.

Logarithmic Transformation

Logarithmic transformation takes the natural log of every value in a variable, which compresses large values and stretches out small positive values disproportionately.

 Common cases are: 

  • The target variable has a nonlinear relationship with the predictor; taking the log of the target can make the relationship approximately linear.
  • The data is highly skewed and you want the skew brought closer to 0. You can still use arithmetic operations on the transformed data, although the metrics you monitor will be on the transformed scale (transformed variables). Important note: to apply a log to a zero or negative value, you must first add (or subtract) an arbitrary constant so that the transformed variable is > 0.
  • For a variable X and an arbitrary constant C, the transformation is Y = log(X + C). Different values of C give different distributions, so I recommend testing the results by checking the metrics of interest (a minimal sketch follows this list).
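
As a minimal sketch of the points above (the variable name and the constant are hypothetical), assuming pandas and NumPy:

import numpy as np
import pandas as pd

# Hypothetical skewed variable that contains zeros.
x = pd.Series([0, 3, 12, 150, 4_200, 98_000], name="revenue")

C = 1                  # arbitrary constant so that every value is > 0
y = np.log(x + C)      # Y = log(X + C)

# Compare skew before and after; try a few values of C and re-check your metrics.
print(x.skew(), y.skew())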

Log Normalization

If the natural log (base e) of a variable follows a normal distribution, then the variable itself is log-normally distributed.

Here’s a log trick: if a variable is uniformly distributed on (0, 1), its negative natural log follows an exponential distribution, and the reverse transformation maps exponential data back to a uniform distribution. Therefore, evidence of uniformity in the transformed data suggests that the original data is exponentially distributed.

This property lets you detect underlying exponential structure in raw data with a simple EDA technique. I recommend trying both the transformed and untransformed cases before drawing any conclusions.
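
Here is a minimal sketch of that EDA check, assuming NumPy and SciPy; the synthetic data and the rate estimate (1 / mean) are stand-ins for a real data set:

import numpy as np
from scipy import stats

# Synthetic raw data suspected to be exponential.
rng = np.random.default_rng(1)
x = rng.exponential(scale=3.0, size=1_000)

# If X is exponential, exp(-X / mean(X)) should look roughly uniform on (0, 1).
u = np.exp(-x / x.mean())

# Quick check: Kolmogorov-Smirnov test against the standard uniform distribution.
print(stats.kstest(u, "uniform"))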

Standardization vs. Normalization

Both are ways of rescaling variables. Standardization scales based on variability, producing values with mean 0 and variance 1, whereas normalization scales based on the range of the data (for example, into [0, 1]). (It might go without saying, but I'll say it anyway.)

To apply these techniques, you must decide whether being on the same scale or within the same range is more important. There are many widely used variations, such as min-max scaling, z-score standardization, and others.
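
A minimal sketch of the difference, using plain pandas on a made-up variable:

import pandas as pd

# Hypothetical variable on a much larger scale than its neighbours.
x = pd.Series([12_000, 45_000, 47_500, 61_000, 250_000], name="salary")

standardized = (x - x.mean()) / x.std()          # mean 0, unit variance
min_max = (x - x.min()) / (x.max() - x.min())    # rescaled into [0, 1]

print(pd.DataFrame({"standardized": standardized, "min_max": min_max}))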

Clipping

Clipping involves capping or discarding raw values that exceed an extreme-value threshold in order to reduce the variability of the transformed data.
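
For example, here is a minimal sketch that caps a made-up variable at percentile thresholds instead of dropping rows, assuming pandas:

import pandas as pd

# Hypothetical variable with one extreme outlier.
x = pd.Series([3, 5, 7, 9, 11, 400], name="orders")

# Cap values at the 5th and 95th percentiles to reduce variability.
clipped = x.clip(lower=x.quantile(0.05), upper=x.quantile(0.95))
print(clipped)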

Transforming Into Normal Distribution 

You can also simply transform your variable into the desired distribution (see the sketch below). Format conversion for text variables may include lemmatization, stemming, and stop-word removal; for numeric variables, it may mean replacing values with similar numeric representations. Other techniques include transformation pipelines and re-indexing, among others.
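
One way to pull a skewed, strictly positive variable toward a normal distribution is a Box-Cox power transform; here is a minimal sketch with synthetic data, assuming SciPy is available:

import numpy as np
from scipy import stats

# Synthetic right-skewed, strictly positive variable.
rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1_000)

# Box-Cox searches for the power transform that brings the data closest to normal.
transformed, lam = stats.boxcox(x)
print(f"lambda = {lam:.3f}, skew before = {stats.skew(x):.2f}, after = {stats.skew(transformed):.2f}")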

There are dozens of transformation techniques, but only some will work well with your data set, depending on its format and type. The choice depends heavily on experience. Here is a small example using Python that I cherry-picked:

 
jobNumsFormed["Income"] = jobNumsFormed["Income"].str[1:]

removeComma = "{:,}"

jobNumsFormed["Income"] = jobNumsFormed["Income"].str.replace(removeComma,"")

jobNumsFormed["Income"] = jobNumsFormed["Income"].astype("int")/1000000  #replace string with int and scale down into million dollar 



Opinions expressed by DZone contributors are their own.
