INT4 Quantization (with code demonstration)

TABLE OF CONTENTS

  • INT4 Quantization
  • Importance
  • Working of INT4 Quantization
  • Simple Demonstration
  • Final Implementation
  • INT4 VS INT8
    • Quantization Techniques
    • Training Strategies
  • INT4 Models: Accuracy & Performance
  • Use Case
  • Pros & Cons
  • Conclusion

INT4 Quantization

INT4 quantization is a technique used to optimize deep learning models by reducing their size and computational costs. It achieves this by using 4-bit integers instead of 32-bit floating-point numbers.

This approach makes the models smaller, faster, and more power-efficient, although there might be a slight decrease in accuracy. This trade-off is beneficial for deploying models on devices with limited resources.

Key Points of INT4 Quantization:

  • 4-Bit Representation: Uses only 4 bits to represent each value.

  • Memory and Speed: Reduces memory usage and speeds up inference (model predictions) by enabling faster arithmetic operations.

  • Accuracy Trade-Off: May lead to some loss in model accuracy.

  • Model Adaptation: Often requires extra calibration or fine-tuning to keep the model accurate after quantization.

INT4 quantization involves converting model parameters and activations to a 4-bit integer format, significantly reducing model size and computational requirements while aiming to preserve performance.

Importance

Model Compression: INT4 dramatically reduces the file size of models by using 4-bit integers for weights instead of 32-bit floating-point numbers. This is crucial for efficient storage and transmission of models, especially in bandwidth-constrained environments.

Inference Speed: Smaller models utilizing INT4 for weights can infer (run predictions) faster, particularly on hardware optimized for low-precision math. This speed advantage is beneficial in real-time applications where rapid processing is essential.

Energy Efficiency: Lower precision computations like INT4 require less power consumption. This makes INT4 ideal for deployment on battery-powered devices, extending battery life and reducing operational costs.

Working of INT4 Quantization

Key Steps in INT4 Quantization:

Range Determination: Analyze the full range of weights in the neural network.

Binning: Divide this range into 16 bins (since 4 bits allow for 16 possible values).

Mapping: Map each weight to the nearest value within these 16 bins.

Storage: Instead of storing weights as 32-bit floats, store them as 4-bit integers (INT4).

Dequantization: Convert the 4-bit values back to floating-point numbers during inference.
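
As a quick illustrative example, suppose the weights span [-1.0, 1.0]. Dividing this range into 16 bins gives a bin width of 2.0 / 16 = 0.125, so a weight of 0.30 falls into bin 10 (which covers [0.25, 0.375)), is stored as the 4-bit integer 10, and is reconstructed at inference time as the bin center 0.3125, an error of only 0.0125.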

Simple Demonstration

The process can be broken down into three main parts: quantization, dequantization, and error measurement.

Quantization

Quantization involves compressing floating-point weights into 4-bit integer representations. This is achieved by:

import numpy as np

def int4_quantize(weights):
    # Determine the range of the weights
    w_min = np.min(weights)
    w_max = np.max(weights)

    # Create 16 bins (2^4 levels for INT4)
    num_levels = 16
    bin_edges = np.linspace(w_min, w_max, num_levels + 1)

    # Map each weight to a 4-bit integer index in [0, 15]
    quantized = np.digitize(weights, bin_edges) - 1  # -1 because np.digitize starts at 1
    quantized = np.clip(quantized, 0, num_levels - 1)  # keep indices in the valid range

    # Lookup table used later for dequantization: the center of each bin
    dequant_lookup = (bin_edges[:-1] + bin_edges[1:]) / 2
    return quantized, dequant_lookup

Dequantization

Dequantization is the process of converting the quantized 4-bit integers back to floating-point numbers. This step is crucial during model inference to use the compressed weights effectively.

def int4_dequantize(quantized, dequant_lookup):
    return dequant_lookup[quantized]

# Quantize some example weights, then convert the 4-bit indices
# back to floating-point numbers via the lookup table
weights = np.random.randn(1000)
quantized, dequant_lookup = int4_quantize(weights)
reconstructed = int4_dequantize(quantized, dequant_lookup)

Error Measurement

To assess the impact of quantization on the model's accuracy, we measure the average squared difference between the original weights and the reconstructed weights.

  • Low MSE Value: Indicates that the average squared difference between the original and reconstructed weights is very small.
  • Minimal Loss in Accuracy: Implies that the quantization and dequantization processes have kept the original weights nearly unchanged, with very little loss or error.

# Measure the mean squared error between original and reconstructed weights
error = np.mean((weights - reconstructed) ** 2)
print(f"Mean squared error: {error}")

Final Implementation:

import numpy as np

def int4_quantize(weights):
    w_min = np.min(weights)
    w_max = np.max(weights)
    num_levels = 16  # 2^4 levels for INT4
    bin_edges = np.linspace(w_min, w_max, num_levels + 1)
    quantized = np.digitize(weights, bin_edges) - 1  # -1 because np.digitize starts at 1
    quantized = np.clip(quantized, 0, num_levels - 1)  # keep indices in [0, 15]
    dequant_lookup = (bin_edges[:-1] + bin_edges[1:]) / 2  # bin centers
    return quantized, dequant_lookup

def int4_dequantize(quantized, dequant_lookup):
    return dequant_lookup[quantized]

weights = np.random.randn(1000)  
quantized, dequant_lookup = int4_quantize(weights)
reconstructed = int4_dequantize(quantized, dequant_lookup)

error = np.mean((weights - reconstructed) ** 2)
print(f"Mean squared error: {error}")

# Example output (the exact value varies with the random weights):
# Mean squared error: ~0.014

A mean squared error of this size (roughly the squared bin width divided by 12, the expected error of a uniform quantizer) is small compared to the unit variance of the original weights, indicating that quantization and dequantization preserve the weights closely.
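
Note that NumPy has no native 4-bit integer type, so the quantized array above still occupies a full integer per weight. To realize the storage savings in practice, the 4-bit indices can be packed two per byte. The helpers below (pack_int4 and unpack_int4 are illustrative names, not library functions) sketch one way to do this, continuing from the snippet above:

def pack_int4(indices):
    # Pack pairs of 4-bit indices (values 0..15) into single uint8 bytes
    indices = np.asarray(indices, dtype=np.uint8)
    if indices.size % 2:  # pad with a zero if the count is odd
        indices = np.concatenate([indices, np.zeros(1, dtype=np.uint8)])
    return (indices[0::2] << 4) | indices[1::2]

def unpack_int4(packed, count):
    # Recover the original 4-bit indices from the packed bytes
    high = (packed >> 4) & 0x0F
    low = packed & 0x0F
    return np.stack([high, low], axis=1).reshape(-1)[:count]

packed = pack_int4(quantized)            # 1000 weights -> 500 bytes
restored = unpack_int4(packed, len(quantized))
assert np.array_equal(restored, quantized)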

INT4 VS INT8

INT4 quantization utilizes 4-bit integers for weights and activations, which allows for high compression but may result in lower accuracy. This method aims to strike a balance between reducing model size and maintaining acceptable performance levels. In contrast, INT8 quantization employs 8-bit integers for weights and activations, offering less compression but generally higher accuracy due to its improved precision. It is typically preferred when preserving model accuracy is critical, even if it means dealing with larger model sizes compared to INT4 quantization.

  • Quantization Techniques

INT4:

1. Symmetric quantization:
- Maps floating-point values to a symmetric range around zero
- Uses a single scale factor for both positive and negative values
- Range: typically [-7, 7] or [-8, 7] for signed INT4
- Formula: q = round(x / scale), dequantized as x' = q * scale

2. Asymmetric quantization:
- Uses a scale factor together with a zero-point offset
- Allows for better representation of asymmetric (skewed) distributions
- Range: typically [0, 15] for unsigned INT4
- Formula: q = round(x / scale) + zero_point, dequantized as x' = (q - zero_point) * scale

3. Logarithmic quantization:
- Maps values to a logarithmic scale, storing the sign and a rounded exponent
- Better represents wide dynamic ranges
- Formula: x' = sign(x) * 2^round(log2(|x|))

q: quantized integer value
x: original floating-point value
x': dequantized (reconstructed) value
scale: scaling factor
zero_point: offset that shifts the quantized range
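
To make the two main schemes concrete, here is a minimal NumPy sketch of symmetric and asymmetric INT4 quantization following the formulas above, using the signed range [-8, 7] and the unsigned range [0, 15] respectively (the function names are illustrative):

import numpy as np

def quantize_symmetric_int4(x):
    # One scale for positive and negative values; integers in [-8, 7]
    scale = np.max(np.abs(x)) / 7.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale  # dequantize with q * scale

def quantize_asymmetric_int4(x):
    # Scale plus zero-point; integers in [0, 15]
    scale = (x.max() - x.min()) / 15.0
    zero_point = int(np.round(-x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 15).astype(np.uint8)
    return q, scale, zero_point  # dequantize with (q - zero_point) * scale

x = np.random.randn(8)
q_sym, s = quantize_symmetric_int4(x)
q_asym, s2, zp = quantize_asymmetric_int4(x)
print(x - q_sym * s)                                # small symmetric errors
print(x - (q_asym.astype(np.int32) - zp) * s2)      # small asymmetric errors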

INT8:

1. Linear quantization:
- Maps floating-point values linearly to the INT8 range
- Typically uses the range [-128, 127]
- Formula: q = round(x / scale)

2. Affine quantization:
- Similar to asymmetric quantization in INT4
- Uses a scale and zero-point
- Formula: q = round((x / scale) + zero_point)

3. Power-of-two quantization:
- Restricts scale factors to powers of 2
- Simplifies multiplication operations
- Formula: q = round(x * 2^n), where n is an integer
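
The INT8 schemes follow the same pattern as their INT4 counterparts, just with 256 levels instead of 16. The small sketch below (affine_quantize is an illustrative helper, not a library API) compares the reconstruction error of affine quantization at 4 and 8 bits:

import numpy as np

def affine_quantize(x, num_bits):
    # Affine (asymmetric) quantization to the unsigned range [0, 2^num_bits - 1]
    qmax = 2 ** num_bits - 1
    scale = (x.max() - x.min()) / qmax
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale  # dequantized approximation of x

x = np.random.randn(10000)
for bits in (4, 8):
    mse = np.mean((x - affine_quantize(x, bits)) ** 2)
    print(f"INT{bits} reconstruction MSE: {mse:.6f}")

# INT8 typically shows a much smaller reconstruction error than INT4,
# at the cost of twice the storage per weight.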

  • Training Strategies

INT4:

1. Post-Training Quantization (PTQ):
- Model trained in FP32 (32-bit floating point), then quantized to INT4.
- Significantly reduces memory and computational requirements.
- Accuracy is harder to preserve than with INT8 because of the much lower precision.

2. Quantization-Aware Training (QAT):
- Trained with fake ("dummy") quantization nodes emulating INT4 quantization during the forward pass (see the sketch after the INT8 strategies below).
- Allows the model to adapt to the constraints of low precision during training.
- Potential for better accuracy retention compared to PTQ.

INT8:

1. Post-Training Quantization (PTQ):
- Model initially trained in FP32.
- Quantized to INT8 either statically (using a representative calibration dataset) or dynamically (computing ranges during inference).
- Easier implementation without altering training.
- Minimal accuracy loss, suitable for many applications.

2. Quantization-Aware Training (QAT):
- Trained with fake quantization nodes simulating INT8 during the forward pass.
- Helps the model learn weights robust to INT8 quantization.
- Often results in higher accuracy compared to PTQ.
- More resilient to precision loss from quantization.
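
The fake quantization nodes mentioned above simply simulate the quantize-dequantize round trip in floating point during the forward pass, so the network learns weights that survive the rounding. Below is a minimal NumPy sketch of such a node, assuming symmetric quantization and leaving out the straight-through gradient handling that a real framework such as PyTorch provides:

import numpy as np

def fake_quantize(x, num_bits=4):
    # Simulate quantize -> dequantize in floating point (a "fake" quant node).
    # The output stays float, but only 2^num_bits distinct levels are possible,
    # so downstream layers train against quantized-looking values.
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # e.g. [-8, 7]
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), qmin, qmax)
    return q * scale

w = np.random.randn(5)
print(w)
print(fake_quantize(w, num_bits=4))  # same shape and dtype, coarser values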

INT4 Models: Accuracy & Performance

ACCURACY:

  • INT4 quantization can lead to some loss in model accuracy compared to higher precision formats like FP32 or INT8.
  • The impact on accuracy varies depending on the model architecture, task, and dataset.
  • In many cases, especially for larger models, the accuracy loss can be kept relatively small (e.g., 1-2% drop) with proper quantization techniques.

PERFORMANCE:

  • Size Reduction: Models can be significantly compressed, sometimes up to 8 times smaller compared to FP32.
  • Inference Speed: Faster inference times are achievable, especially on hardware optimized for low-precision operations.
  • Resource Efficiency: Lower memory bandwidth requirements and potential energy savings, making it suitable for edge devices.
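
To put the size reduction in perspective with a rough, illustrative calculation: a model with 7 billion parameters needs about 7 x 10^9 x 4 bytes ≈ 28 GB in FP32, but only about 7 x 10^9 x 0.5 bytes ≈ 3.5 GB in INT4 (plus a small overhead for scales and zero-points), which is the 8x reduction noted above.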

Use Case

Some of the most common and widely deployed use cases:

Mobile Device Applications:
INT4 quantization is highly valuable for smartphones, allowing complex AI models to run efficiently. This enables:

  • On-device image recognition and processing
  • Natural language processing for keyboards and voice assistants
  • Real-time translation services
  • Facial recognition for security and photo organization
The result is a smaller app size, faster inference, lower battery consumption, and enhanced privacy, since data is processed locally.

Edge Computing in IoT Devices:
In Internet of Things (IoT) devices, INT4 quantization enables:

  • Smart home devices (e.g., security cameras with object detection)
  • Industrial sensors for predictive maintenance
  • Wearable devices for health monitoring and activity tracking

Embedded Systems in Autonomous Vehicles:
In autonomous vehicles, INT4 quantization enables:

  • Real-time object detection and tracking
  • Lane detection and navigation assistance
  • Sensor data processing and fusion

INT4 quantization enables advanced AI models in resource-constrained environments by balancing performance and efficiency.

Pros & Cons

Let's explore the primary Pros & Cons of INT4 quantization.

Pros:

  • Reduced Model Size: It reduces model size by using 4-bit instead of 32-bit weights and activations. This achieves up to 8x compression, easing deployment on devices with limited storage.

  • Lower Memory Bandwidth Usage: It reduces data transfer between memory and the processor, enhancing energy efficiency and speed. This is particularly beneficial for devices with limited memory bandwidth.

  • Faster Inference Speed: INT4 operations are faster than floating-point operations on many platforms due to specialized hardware for low-precision integer arithmetic. This results in reduced latency and higher throughput during inference.

  • Scalability: INT4 quantized models are ideal for edge devices and mobile phones due to their smaller size, lower memory needs, and faster inference. This broadens AI capabilities to more devices and applications.

Cons:

  • Limited Range: INT4 can only represent 16 distinct values, which may not capture the full dynamic range of weights and activations in complex neural networks.

  • Information Loss: The conversion from higher precision (e.g., 32-bit floating point) to 4-bit integers inevitably results in loss of information.

  • Reduced Expressiveness: The limited precision can impact the model's ability to represent fine-grained differences in feature importance or subtle patterns in the data.

  • Accumulation of Errors: Throughout the network, small errors due to quantization can compound, potentially leading to larger discrepancies in the final output.

Conclusion

INT4 quantization is a powerful technique for optimizing neural networks, offering significant reductions in model size and computational requirements. By compressing 32-bit floating-point numbers to 4-bit integers, it enables faster inference and lower memory usage, making AI models more suitable for resource-constrained devices. While there's a potential trade-off in accuracy, the benefits often outweigh the drawbacks for many applications, especially in edge computing and mobile scenarios. As hardware support improves, INT4 quantization is likely to become increasingly important in deploying efficient AI models across a wide range of devices and use cases.

Vidhi Srivastava
Intern @ OpenGenus IQ | Algorithms & Data Structures | Exploring code and writing tech. Connect with me at https://www.linkedin.com/in/vidhisrivastava01/

Improved & Reviewed by: Aditya Chatterjee