MachineX: Improve Accuracy of Your ML Models Before Even Writing Them
One of the most important parts of machine learning is preparing data for machine learning. Today, we will be looking at a part of this process: feature scaling.
Every machine learning practitioner will agree with me when I say that one of the most important parts of machine learning is data preparation. It certainly takes some experience to prepare data properly and effectively. Although data preparation is a really big topic in itself, today we will only be looking at one part of the process: feature scaling.
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
So, in simple terms, feature scaling brings the features of a dataset to approximately the same scale.
That's Fine, but How Does It Help Improve Accuracy?
Let's understand this using an example. Suppose we want to find out how an athlete's speed depends on their height and weight. Our dataset consists of heights in the range of, say, 4 to 7 feet, and weights in the range of 70 to 120 kilograms. As we can clearly see, the weights span a much wider numeric range (50 units) than the heights (3 units). If we feed these features to our ML model as they are, many models will give more weightage to an athlete's weight than to their height, simply because its numbers are bigger. A change in weight will then be reflected much more than a comparable change in height, which is undesirable and can hurt the accuracy of our model. If we scale weight down to approximately the same scale as height, changes in both features are reflected more equally, which typically gives us better accuracy than before.
With a few exceptions (tree-based models, for example), machine learning algorithms don't perform well when the input numerical attributes have very different scales. Since the range of values of raw data varies widely, the objective functions of some machine learning algorithms will not work properly without normalization. For example, many classifiers, such as k-nearest neighbors, calculate the distance between two points using the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
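To make the distance argument concrete, here is a minimal sketch in plain Python. The three athletes and the helper names (`euclidean`, `scale`) are hypothetical and chosen only for illustration; the ranges (4 to 7 feet, 70 to 120 kg) come from the example above.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def scale(point, mins, maxs):
    """Rescale each coordinate to [0, 1] using the given feature ranges."""
    return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(point, mins, maxs))

# Hypothetical athletes as (height in feet, weight in kg).
a = (5.0, 70.0)
b = (7.0, 72.0)  # much taller than a, almost the same weight
c = (5.0, 75.0)  # same height as a, slightly heavier

# Raw features: the 5 kg gap outweighs the 2 ft gap.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Feature ranges assumed from the article's example.
mins, maxs = (4.0, 70.0), (7.0, 120.0)
scaled_ab = euclidean(scale(a, mins, maxs), scale(b, mins, maxs))
scaled_ac = euclidean(scale(a, mins, maxs), scale(c, mins, maxs))

print(f"raw:    d(a, b) = {raw_ab:.2f}, d(a, c) = {raw_ac:.2f}")
print(f"scaled: d(a, b) = {scaled_ab:.2f}, d(a, c) = {scaled_ac:.2f}")
```

On the raw data, athlete b looks closer to a than c does, even though b differs by two thirds of the whole height range. After scaling, the ordering flips and the height difference counts for what it is.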
Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it.
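The effect on gradient descent can be sketched with a tiny experiment. Everything below is an assumption made for illustration: the six athletes, the noise-free linear "speed" target, the hand-picked learning rates, and the helper names `gd_iterations` and `min_max`. Each learning rate is roughly as large as its dataset tolerates, so this is a toy demonstration rather than a benchmark.

```python
def gd_iterations(X, y, lr, tol=1e-6, max_iters=10_000):
    """Batch gradient descent on mean squared error.

    Returns the iteration at which the loss drops below tol,
    or max_iters if that never happens.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for it in range(1, max_iters + 1):
        residuals = [sum(wj * xj for wj, xj in zip(w, row)) - t
                     for row, t in zip(X, y)]
        loss = sum(r * r for r in residuals) / (2 * n)
        if loss < tol:
            return it
        grad = [sum(r * row[j] for r, row in zip(residuals, X)) / n
                for j in range(d)]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return max_iters

def min_max(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical data: heights in feet, weights in kg, made-up linear target.
heights = [4.0, 5.0, 5.5, 6.0, 6.5, 7.0]
weights = [70.0, 120.0, 85.0, 100.0, 75.0, 110.0]
speeds = [2.0 + 1.5 * h - 0.02 * w for h, w in zip(heights, weights)]

# The leading 1.0 in each row is the bias (intercept) term.
raw_X = [[1.0, h, w] for h, w in zip(heights, weights)]
scaled_X = [[1.0, h, w] for h, w in zip(min_max(heights), min_max(weights))]

# The raw features need a tiny learning rate to stay stable.
raw_iters = gd_iterations(raw_X, speeds, lr=1e-4)
scaled_iters = gd_iterations(scaled_X, speeds, lr=0.5)

print("raw features:   ", raw_iters, "iterations (10000 means not converged)")
print("scaled features:", scaled_iters, "iterations")
```

With the raw features, the large weight values force a very small learning rate, so progress along the height and bias directions is extremely slow and the run hits the iteration cap. With the scaled features, a much larger learning rate is stable and the loss reaches the tolerance well within the cap.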
Two of the most common ways to apply feature scaling, i.e. to get all attributes to have the same scale, are min-max scaling and standardization.
Min-max scaling (AKA normalization) is quite simple: values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min.
z = (x - min(x)) / (max(x) - min(x))
where:
x = value to be normalized
min(x) = minimum value of the attribute in the dataset
max(x) = maximum value of the attribute in the dataset
z = value after normalization
So, if a dataset has some attribute with the values 29, 59, 53, 76, min-max scaling would produce 0, 0.64, 0.51, 1 (rounded to two decimal places).
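As a minimal sketch, min-max scaling can be written in a few lines of plain Python. The function name `min_max_scale` and the handling of a constant column (mapping everything to 0.0) are choices made here for illustration, not part of any particular library.

```python
def min_max_scale(values):
    """Rescale values to [0, 1]: subtract the min, divide by (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute carries no information, map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([29, 59, 53, 76])
print([round(z, 2) for z in scaled])  # [0.0, 0.64, 0.51, 1.0]
```

In practice, the min and max should be computed on the training data only and then reused to transform validation and test data, so that the same mapping is applied everywhere.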
Standardization is quite different. First, it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance. Unlike min-max scaling, standardization does not bound values to a specific range, which may be a problem for some algorithms (e.g. neural networks often expect an input value ranging from 0 to 1). However, standardization is much less affected by outliers.
z = (x - µ) / σ
where:
x = value to be standardized
µ = mean of the attribute in the dataset
σ = standard deviation of the attribute in the dataset
z = value after standardization
So, if a dataset has some attribute with the values 29, 59, 53, 76, standardization would produce -1.50, 0.28, -0.07, 1.29 (using the population standard deviation, rounded to two decimal places).
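Here is a minimal sketch of standardization using Python's built-in `statistics` module. It subtracts the mean and divides by the population standard deviation (`statistics.pstdev`), which is the usual z-score definition. The function name `standardize` and the handling of a constant column are assumptions made for illustration.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # Assumption: a constant attribute is mapped to all zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

z = standardize([29, 59, 53, 76])
print([round(v, 2) for v in z])  # [-1.5, 0.28, -0.07, 1.29]
```

The result has zero mean and unit variance, but unlike min-max scaling, the values are not confined to a fixed range. Using the sample standard deviation (`statistics.stdev`) instead would give slightly smaller magnitudes on a dataset this small.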
Feature scaling can improve the accuracy of your ML models significantly, but implementing it can be quite tricky and can only be mastered through experience. It is certainly one of those things you cannot ignore.
Published at DZone with permission of Akshansh Jain, DZone MVB. See the original article here.