10 Feature Scaling Techniques in Machine Learning


In this article at OpenGenus, we will explore feature scaling techniques in Machine Learning and understand when to use a particular feature scaling technique.

Table of contents

  1. Introduction
  2. Why Feature Scaling Matters
  3. Different Feature Scaling Techniques
  4. Choosing the Right Technique
  5. Illustration
  6. Conclusion

Alright, let's get started.

Introduction

In machine learning, feature scaling plays a vital role in achieving optimal performance of models. It is the process of transforming numerical features into a common scale, enabling fair comparisons and avoiding the dominance of certain features due to their larger magnitudes. In this article, we will dive into the world of feature scaling techniques and understand their importance in machine learning workflows.

Why Feature Scaling Matters

Machine learning algorithms often rely on distance calculations, such as in clustering or k-nearest neighbors. These algorithms can be sensitive to the scale of features. If features have different scales, those with larger magnitudes may dominate the learning process, leading to biased results. Moreover, features with smaller scales may be overlooked. Feature scaling helps mitigate these issues and ensures that all features contribute equally to model training and evaluation.
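
To make this concrete, the short sketch below (with made-up numbers) shows how a large-magnitude feature can swamp a small-magnitude one in a Euclidean distance calculation:

import numpy as np

# Two samples with features on very different scales: [small-scale, large-scale]
a = np.array([2.0, 50000.0])
b = np.array([9.0, 51000.0])

# Without scaling, the distance is driven almost entirely by the second feature
print(np.linalg.norm(a - b))   # approximately 1000.02

# After rescaling both features to roughly [0, 1], the first feature matters again
a_scaled = np.array([2.0 / 10, 50000.0 / 100000])
b_scaled = np.array([9.0 / 10, 51000.0 / 100000])
print(np.linalg.norm(a_scaled - b_scaled))   # approximately 0.70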

Different Feature Scaling Techniques:

This section dives into various data transformation and scaling techniques, from simple ones like standardization, min-max scaling and logarithmic scaling to slightly more advanced techniques like power transformation, binning, quantile transformation, unit vector scaling, binary scaling, and max absolute scaling.

10 Feature Scaling Techniques in Machine Learning are:

  1. Standardization (Z-score normalization)
  2. Min-Max Scaling
  3. Robust Scaling
  4. Logarithmic Scaling
  5. Power Transformation
  6. Binning
  7. Quantile Transformation
  8. Unit Vector Scaling
  9. Binary Scaling
  10. Max Absolute Scaling

We will discuss each technique in detail, highlighting their purpose, benefits, and applications.

  1. Standardization (Z-score normalization):
    Standardization transforms data to have zero mean and unit variance. Each feature is scaled independently, so its distribution is centered around zero with a standard deviation of 1. This technique works well for most machine learning algorithms, particularly those that assume zero-centered inputs, although it is sensitive to strong outliers. It is defined as:
x_standardized = (x - mean(x)) / std(x)

where:

  • x is the original data point
  • mean(x) is the mean of the data
  • std(x) is the standard deviation of the data

Pseudocode:

import numpy as np

def standardization(data):
   # Center the data at zero and scale it to unit variance
   mean = np.mean(data)
   std_dev = np.std(data)
   standardized_data = (data - mean) / std_dev
   return standardized_data
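
For example, applied to a small sample array (the values are arbitrary), the function above gives:

import numpy as np

data = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
print(standardization(data))
# approximately [-1.414 -0.707  0.     0.707  1.414]

scikit-learn's StandardScaler applies the same transformation to each feature column of a dataset.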
  2. Min-Max Scaling:
    Min-Max scaling, also known as normalization, scales features to a specific range (e.g., [0, 1]). It maintains the relative relationship between values and preserves the shape of the distribution. This technique is suitable when features have a known range and there are no significant outliers. The formula for Min-Max scaling is:
x_scaled = (x - min(x)) / (max(x) - min(x))

where:

  • x is the original data point
  • min(x) is the minimum value in the data
  • max(x) is the maximum value in the data

Pseudocode:

import numpy as np

def min_max_scaling(data):
   # Rescale the data linearly into the range [0, 1]
   min_val = np.min(data)
   max_val = np.max(data)
   scaled_data = (data - min_val) / (max_val - min_val)
   return scaled_data
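
As a quick check with arbitrary sample values:

import numpy as np

data = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
print(min_max_scaling(data))
# [0.   0.25 0.5  0.75 1.  ]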
  3. Robust Scaling:
    Robust scaling is useful when data contains outliers. It scales features using statistics that are robust to outliers, such as the median and interquartile range (IQR). By using percentiles instead of mean and standard deviation, this technique reduces the impact of extreme values. The formula for robust scaling is:
x_scaled = (x - median(x)) / IQR(x)

where:

  • x is the original data point
  • median(x) is the median of the data
  • IQR(x) is the interquartile range of the data (75th percentile - 25th percentile)

Pseudocode:

import numpy as np

def robust_scaling(data):
   # Scale using statistics that are insensitive to outliers
   median = np.median(data)
   q1 = np.percentile(data, 25)
   q3 = np.percentile(data, 75)
   iqr = q3 - q1
   scaled_data = (data - median) / iqr
   return scaled_data
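
A small example with an arbitrary outlier shows why this helps: the extreme value barely affects how the remaining points are scaled.

import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
print(robust_scaling(data))
# [-1.  -0.5  0.   0.5 48.5]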
  4. Logarithmic Scaling:
    Logarithmic scaling transforms data using the logarithm function. It is effective when dealing with highly skewed data or data with exponential growth patterns. Applying the logarithm can help normalize the distribution and reduce the impact of extreme values. Note that the form below requires values greater than -1 (in practice, non-negative data). The formula for logarithmic scaling is:
x_scaled = log(1+x)

where:

  • x is the original data point

Pseudocode:

import numpy as np

def logarithmic_scaling(data):
   # log1p computes log(1 + x), which is safe for zero values
   scaled_data = np.log1p(data)
   return scaled_data
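
For instance, on arbitrary values spanning several orders of magnitude:

import numpy as np

data = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])
print(logarithmic_scaling(data))
# approximately [0.693 2.398 4.615 6.909 9.21 ]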
  5. Power Transformation:
    Power transformation is a versatile technique used to modify the distribution of data. By applying a power function to each data point, we can address issues such as skewness and unequal variance. The choice of power function depends on the desired transformation. Common power transformations include square root, logarithmic, and reciprocal transformations. These transformations can make the data conform more closely to assumptions required by statistical tests or models. The formula for power transformation is:
x_transformed = x^p

where:

  • x is the original data point
  • p is the power parameter for the transformation

Pseudocode:

import numpy as np

def power_transformation(data, power):
   # Raise each value to the given power (e.g., 0.5 for a square-root transform)
   transformed_data = np.power(data, power)
   return transformed_data
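
For example, a square-root transform (p = 0.5) on arbitrary right-skewed values pulls in the long tail:

import numpy as np

data = np.array([1.0, 4.0, 9.0, 16.0, 100.0])
print(power_transformation(data, 0.5))
# [ 1.  2.  3.  4. 10.]

In practice, library implementations such as scikit-learn's PowerTransformer (Box-Cox or Yeo-Johnson) estimate the transformation parameter from the data instead of relying on a hand-picked exponent.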
  6. Binning:
    Binning is a process of dividing continuous variables into smaller, discrete intervals or bins. It is particularly useful when dealing with large ranges of data. Binning simplifies data analysis by converting continuous data into categorical data, allowing for easier interpretation and analysis. It can be performed by specifying the number of bins or defining the width of each bin. Binning is often used in data preprocessing to create histograms, identify outliers, or create new features for machine learning algorithms. It does not have a specific mathematical formula associated with it.

Pseudocode:

import numpy as np

def binning(data, num_bins):
   # Split the value range into num_bins equal-width intervals
   bin_edges = np.linspace(np.min(data), np.max(data), num_bins + 1)
   # Drop the last edge so the maximum value lands in bin num_bins (indices 1..num_bins)
   binned_data = np.digitize(data, bin_edges[:-1])
   return binned_data
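
For example, dividing arbitrary values into three equal-width bins:

import numpy as np

data = np.array([1.0, 7.0, 12.0, 18.0, 25.0])
print(binning(data, num_bins=3))
# [1 1 2 3 3]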
  7. Quantile Transformation:
    Quantile transformation is a technique that transforms the distribution of a variable to a specified distribution, commonly the standard normal distribution. It involves mapping the original variable values to the corresponding quantiles of the desired distribution. This transformation helps achieve a more symmetric and uniform distribution, making it suitable for statistical techniques and machine learning algorithms that assume normality. It is particularly valuable when dealing with skewed data. It does not have a specific mathematical formula associated with it.

Pseudocode:

import numpy as np

def quantile_transform(data):
   # Map each value to its empirical quantile (its rank scaled to [0, 1])
   sorted_data = np.sort(data)
   quantiles = np.linspace(0, 1, len(data))
   transformed_data = np.interp(data, sorted_data, quantiles)
   return transformed_data
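
Note that the pseudocode above maps values onto a uniform [0, 1] distribution of ranks. For example, with arbitrary data containing an outlier:

import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
print(quantile_transform(data))
# [0.   0.25 0.5  0.75 1.  ]

To target a standard normal distribution instead, as described above, scikit-learn's QuantileTransformer with output_distribution='normal' is a common choice.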
  8. Unit Vector Scaling:

Unit vector scaling, also known as vector normalization or L2 normalization, scales an entire feature vector so that it has unit length. Each component is divided by the Euclidean norm of the vector, which preserves the direction and the relative proportions of the components while fixing the overall magnitude to 1. This technique is helpful when the direction of the data matters more than its magnitude, for example with text represented as TF-IDF vectors or with methods based on cosine similarity. The formula for unit vector scaling is:

x_scaled = x / ∥x∥

where:

  • x is the original data point
  • ∥x∥ represents the Euclidean norm of x

Pseudocode:

import numpy as np

def unit_vector_scaling(data):
   # Divide by the Euclidean (L2) norm so the result has unit length
   norm = np.linalg.norm(data)
   scaled_data = data / norm
   return scaled_data
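
A quick example with an arbitrary two-dimensional vector:

import numpy as np

data = np.array([3.0, 4.0])
print(unit_vector_scaling(data))
# [0.6 0.8]  (the scaled vector has Euclidean length 1)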
  9. Binary Scaling:
    Binary scaling, or binarization, converts numerical data into binary values. It involves setting a threshold value and assigning 0 to all values below the threshold and 1 to values equal to or above it. Binarization is useful when specific algorithms require binary inputs, such as association rule mining or certain feature selection methods. It simplifies the data by categorizing it into two classes, which can be advantageous in some scenarios. Binary scaling is defined by the thresholding rule itself rather than by a separate scaling formula.

Pseudocode:

import numpy as np

def binary_scaling(data, threshold):
   # Map values below the threshold to 0 and all other values to 1
   scaled_data = np.where(data >= threshold, 1, 0)
   return scaled_data
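
For example, with an arbitrary threshold of 0.5:

import numpy as np

data = np.array([0.2, 0.5, 0.8, 1.5])
print(binary_scaling(data, threshold=0.5))
# [0 1 1 1]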
  10. Max Absolute Scaling:
    Max Absolute scaling is a normalization technique that scales the values of a variable to the range [-1, 1]. It involves dividing each data point by the maximum absolute value among all the data points. Max Absolute scaling preserves the sign of the original values while ensuring they fall within the specified range. Because it neither shifts nor centers the data, it is especially useful for sparse data, where preserving zero entries matters; note, however, that a single extreme outlier determines the scaling factor, so this technique is not robust to outliers. The formula for max absolute scaling is:
x_scaled = x / max(∣x∣)

where:

  • x is the original data point
  • max(∣x∣) represents the maximum absolute value in the data

Pseudocode:

import numpy as np

def max_absolute_scaling(data):
   # Divide by the largest absolute value so results fall in [-1, 1]
   max_val = np.max(np.abs(data))
   scaled_data = data / max_val
   return scaled_data
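
For example, on arbitrary values with mixed signs:

import numpy as np

data = np.array([-4.0, -2.0, 0.0, 3.0, 8.0])
print(max_absolute_scaling(data))
# [-0.5   -0.25   0.     0.375  1.   ]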

Choosing the Right Technique:

The choice of feature scaling technique depends on the characteristics of the data and the specific machine learning algorithm being used. It is essential to consider factors such as the presence of outliers, the distribution of features, and the requirements of the algorithm.

When performing feature scaling, it is crucial to apply the scaling method consistently to both the training and test datasets. This ensures that the same scaling is applied to unseen data during model evaluation. Additionally, it is important to avoid data leakage by fitting the scaler only on the training data and then transforming both the training and test data.
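
As a minimal sketch of this workflow, using scikit-learn's StandardScaler and made-up numbers:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features: [age, income]
X_train = np.array([[25.0, 50000.0], [40.0, 80000.0], [60.0, 120000.0]])
X_test = np.array([[30.0, 60000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training-set mean and std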

Illustration

Suppose we have a dataset that includes two features: "Age" and "Income" to predict whether a person is likely to default on a loan. The "Age" feature ranges from 18 to 80, while the "Income" feature ranges from 20,000 to 200,000. Without proper feature scaling, the model may give undue importance to the "Income" feature due to its larger magnitude, potentially overlooking the predictive power of "Age."

Possible Solution:

To address this issue, we can apply feature scaling techniques. Let's explore two common techniques: standardization and min-max scaling.

  1. Standardization:

By applying standardization, we can transform both features to have zero mean and unit variance. This ensures that both "Age" and "Income" are on the same scale, making them comparable. The formula for standardization is x_scaled = (x - mean(x)) / std(x)

  2. Min-Max Scaling:

Min-max scaling can be used to transform the features into a specific range, such as [0, 1]. This technique ensures that both "Age" and "Income" are scaled proportionally within the given range. The formula for min-max scaling is x_scaled = (x - min(x)) / (max(x) - min(x))

By applying either of these scaling techniques, we can ensure that both "Age" and "Income" contribute equally to the model's training process and avoid the dominance of one feature over the other. This allows the model to make informed decisions based on the relative importance of each feature.
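
For concreteness, here is a small sketch of both options on made-up "Age" and "Income" values:

import numpy as np

age = np.array([18.0, 35.0, 52.0, 80.0])
income = np.array([20000.0, 60000.0, 120000.0, 200000.0])

# Standardization: both features end up with zero mean and unit variance
age_std = (age - age.mean()) / age.std()
income_std = (income - income.mean()) / income.std()

# Min-max scaling: both features end up in the range [0, 1]
age_mm = (age - age.min()) / (age.max() - age.min())
income_mm = (income - income.min()) / (income.max() - income.min())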

Results:

In this example, if we did not apply feature scaling, the model might heavily rely on the "Income" feature due to its larger magnitude, potentially overlooking the valuable information provided by the "Age" feature. However, by employing feature scaling techniques like standardization or min-max scaling, we bring both features to a common scale, enabling fair comparisons and ensuring that both "Age" and "Income" have an equal influence on the model's predictions.

Remember, feature scaling is not only crucial in scenarios with disparate magnitudes but also plays a role in algorithms that rely on distance-based calculations or optimization techniques. By incorporating appropriate feature scaling techniques, we can enhance the performance and reliability of machine learning models.

Conclusion:

Feature scaling is a critical step in the preprocessing of data for machine learning. It allows algorithms to effectively process features with different scales, avoiding biased results and ensuring fair comparisons. Understanding and applying appropriate feature scaling techniques contribute to the overall performance and reliability of machine learning models.

By utilizing techniques such as standardization, min-max scaling, robust scaling, and logarithmic scaling, data scientists can enhance the accuracy and effectiveness of their models.

Selecting the right feature scaling technique requires thoughtful consideration of the data characteristics and the algorithm's requirements. Experimentation and evaluation are key to determining the most suitable scaling approach for a given machine learning task.
