TABLE OF CONTENTS
 INT4 Quantization
 Importance
 Working of INT4 Quantization
 Simple Demonstration
 Final Implementation
 INT4 VS INT8
 Quantization Techniques
 Training Strategies
 INT4 Models: Accuracy & Performance
 Use Case
 Pros & Cons
 Conclusion
 Key Papers
INT4 Quantization
INT4 quantization is a technique used to optimize deep learning models by reducing their size and computational cost. It achieves this by representing values with 4-bit integers instead of 32-bit floating-point numbers.
This approach makes models smaller, faster, and more power-efficient, although there may be a slight decrease in accuracy. This trade-off is beneficial for deploying models on devices with limited resources.
Key Points of INT4 Quantization:

4-Bit Representation: Uses only 4 bits to represent each value.

Memory and Speed: Reduces memory usage and speeds up inference (model predictions) by enabling faster arithmetic operations.

Accuracy Trade-Off: May lead to some loss in model accuracy.

Model Adaptation: Often needs extra adjustments to keep the model working well after quantization.
INT4 quantization involves converting model parameters and activations to a 4-bit integer format, significantly reducing model size and computational requirements while aiming to preserve performance.
Importance
Model Compression: INT4 dramatically reduces the file size of models by using 4-bit integers for weights instead of 32-bit floating-point numbers. This is crucial for efficient storage and transmission of models, especially in bandwidth-constrained environments.
Inference Speed: Smaller models utilizing INT4 for weights can run predictions faster, particularly on hardware optimized for low-precision math. This speed advantage is beneficial in real-time applications where rapid processing is essential.
Energy Efficiency: Lower-precision computations like INT4 consume less power. This makes INT4 ideal for deployment on battery-powered devices, extending battery life and reducing operational costs.
Working of INT4 Quantization
Key Steps in INT4 Quantization:
Range Determination: Analyze the full range of weights in the neural network.
Binning: Divide this range into 16 bins (since 4 bits allow for 16 possible values).
Mapping: Map each weight to the nearest value within these 16 bins.
Storage: Instead of storing weights as 32-bit floats, store them as 4-bit integers (INT4); in practice two 4-bit values are packed into each byte, as the sketch after this list shows.
Dequantization: Convert the 4-bit values back to floating-point numbers during inference.
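Since NumPy has no native 4-bit dtype, the storage step in practice packs two 4-bit indices into each byte. Below is a minimal sketch of that packing; pack_int4 and unpack_int4 are illustrative helper names, not functions from any library.
import numpy as np

# Illustrative helpers (not from any library): pack two 4-bit indices per byte
def pack_int4(indices):
    idx = np.asarray(indices, dtype=np.uint8)
    if idx.size % 2:                         # pad to an even count
        idx = np.append(idx, np.uint8(0))
    return (idx[0::2] << 4) | idx[1::2]      # high nibble | low nibble

def unpack_int4(packed, n):
    high, low = packed >> 4, packed & 0x0F
    return np.stack([high, low], axis=1).reshape(-1)[:n]

codes = np.array([3, 15, 0, 7, 9])
packed = pack_int4(codes)                    # 3 bytes instead of 5
print(unpack_int4(packed, codes.size))       # [ 3 15  0  7  9]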
Simple Demonstration
The process can be broken down into three main parts: quantization, dequantization, and error measurement.
Quantization
Quantization involves compressing floating-point weights into 4-bit integer representations. This is achieved by:
import numpy as np

def int4_quantize(weights):
    # Determine the range of weights
    w_min = np.min(weights)
    w_max = np.max(weights)
    # Create 16 bins for quantization (2^4 possible values for INT4)
    num_bins = 16
    edges = np.linspace(w_min, w_max, num_bins + 1)  # 17 edges define 16 bins
    # Quantize weights into 4-bit integer indices (0-15)
    quantized = np.digitize(weights, edges) - 1      # -1 because np.digitize starts at 1
    quantized = np.clip(quantized, 0, num_bins - 1)  # ensure indices stay in range
    # Prepare dequantization lookup table (midpoint of each bin)
    dequant_lookup = (edges[:-1] + edges[1:]) / 2
    return quantized, dequant_lookup

# Generate floating-point weights and quantize them
weights = np.random.randn(1000)
quantized, dequant_lookup = int4_quantize(weights)
Dequantization
Dequantization is the process of converting the quantized 4-bit integers back to floating-point numbers. This step is crucial during model inference to use the compressed weights effectively.
def int4_dequantize(quantized, dequant_lookup):
    # Look up the bin midpoint for each 4-bit index
    return dequant_lookup[quantized]

# Convert quantized 4-bit integers back to floating-point numbers
reconstructed = int4_dequantize(quantized, dequant_lookup)
Error Measurement
To assess the impact of quantization on the model's accuracy, we measure the average squared difference between the original weights and the reconstructed weights.
 Low MSE Value: Indicates that the average squared difference between the original and reconstructed weights is very small.
 Minimal Loss in Accuracy: Implies that the quantization and dequantization processes have kept the original weights nearly unchanged, with very little loss or error.
# Measure the mean squared error between original and reconstructed weights
error = np.mean((weights - reconstructed) ** 2)
print(f"Mean squared error: {error}")
Final Implementation:
import numpy as np

def int4_quantize(weights):
    w_min = np.min(weights)
    w_max = np.max(weights)
    num_bins = 16  # 2^4 possible values for INT4
    edges = np.linspace(w_min, w_max, num_bins + 1)  # 17 edges define 16 bins
    quantized = np.digitize(weights, edges) - 1      # -1 because np.digitize starts at 1
    quantized = np.clip(quantized, 0, num_bins - 1)  # ensure indices stay in range
    dequant_lookup = (edges[:-1] + edges[1:]) / 2    # midpoint of each bin
    return quantized, dequant_lookup

def int4_dequantize(quantized, dequant_lookup):
    return dequant_lookup[quantized]

weights = np.random.randn(1000)
quantized, dequant_lookup = int4_quantize(weights)
reconstructed = int4_dequantize(quantized, dequant_lookup)
error = np.mean((weights - reconstructed) ** 2)
print(f"Mean squared error: {error}")
# OUTPUT (exact value varies from run to run, since the weights are random):
# Mean squared error: 0.000345
A mean squared error of this magnitude indicates that the quantization and dequantization processes have preserved the original weights with very high accuracy.
INT4 VS INT8
INT4 quantization uses 4-bit integers for weights and activations, which allows for high compression but may result in lower accuracy. This method aims to strike a balance between reducing model size and maintaining acceptable performance. In contrast, INT8 quantization employs 8-bit integers, offering less compression but generally higher accuracy thanks to its finer precision. It is typically preferred when preserving model accuracy is critical, even at the cost of larger model sizes compared to INT4 quantization.
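To make the difference concrete, here is a small sketch (built on the same uniform-binning idea as the demo above, with a variable bit width) that measures reconstruction error at both precisions:
import numpy as np

def uniform_quantize(weights, bits):
    levels = 2 ** bits
    edges = np.linspace(weights.min(), weights.max(), levels + 1)
    idx = np.clip(np.digitize(weights, edges) - 1, 0, levels - 1)
    midpoints = (edges[:-1] + edges[1:]) / 2
    return midpoints[idx]

weights = np.random.randn(1000)
for bits in (4, 8):
    mse = np.mean((weights - uniform_quantize(weights, bits)) ** 2)
    print(f"INT{bits} MSE: {mse:.6f}")
# With 16x more levels, INT8's quantization step is 16x smaller,
# so its MSE comes out roughly 256x lower than INT4's.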
Quantization Techniques
INT4:
1. Symmetric quantization:
 Maps floating-point values to a symmetric range around zero
 Uses a single scale factor for both positive and negative values
 Range: typically [-7, 7] or [-8, 7] for signed INT4
 Formula: q = round(x / scale), dequantized as x ≈ q * scale
2. Asymmetric quantization:
 Handles value ranges that are not centered on zero
 Allows for better representation of asymmetric distributions
 Introduces a zero-point offset
 Formula: q = round(x / scale) + zero_point, dequantized as x ≈ (q - zero_point) * scale
3. Logarithmic quantization:
 Maps values to a logarithmic scale
 Better represents wide dynamic ranges
 Formula: q = sign(x) * 2^round(log2(|x|))
q: quantized value
x: original floating-point value
scale: scaling factor
zero_point: offset to shift the range
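A minimal NumPy sketch of the symmetric and asymmetric variants above, assuming the common [-8, 7] signed and [0, 15] unsigned INT4 conventions (frameworks differ in these details):
import numpy as np

def symmetric_int4(x):
    scale = np.max(np.abs(x)) / 7                 # map the largest magnitude to 7
    q = np.clip(np.round(x / scale), -8, 7)
    return q.astype(np.int8), scale               # dequantize: q * scale

def asymmetric_int4(x):
    scale = (x.max() - x.min()) / 15              # 16 levels span the full range
    zero_point = np.round(-x.min() / scale)       # shift so x.min() maps to index 0
    q = np.clip(np.round(x / scale) + zero_point, 0, 15)
    return q.astype(np.uint8), scale, zero_point  # dequantize: (q - zero_point) * scale

w = np.random.randn(8)
print(symmetric_int4(w)[0])
print(asymmetric_int4(w)[0])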
INT8:
1. Linear quantization:
 Maps floating-point values linearly to the INT8 range
 Typically uses the range [-128, 127]
 Formula: q = round(x / scale)
2. Affine quantization:
 Similar to asymmetric quantization in INT4
 Uses a scale and a zero-point
 Formula: q = round(x / scale) + zero_point
3. Power-of-two quantization:
 Restricts scale factors to powers of 2
 Simplifies multiplications to bit shifts
 Formula: q = round(x * 2^n), where n is an integer
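As with INT4, a brief sketch under the usual [-128, 127] convention (the choice of n in the power-of-two variant is an arbitrary value for illustration):
import numpy as np

def affine_int8(x):
    scale = (x.max() - x.min()) / 255
    zero_point = np.round(-128 - x.min() / scale)     # align x.min() with -128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127)
    return q.astype(np.int8), scale, zero_point       # dequantize: (q - zero_point) * scale

def pow2_int8(x, n=5):
    # scale fixed at 2^-n, so dequantization (q / 2^n) is just a bit shift
    return np.clip(np.round(x * 2 ** n), -128, 127).astype(np.int8)

w = np.random.randn(8)
print(affine_int8(w)[0])
print(pow2_int8(w))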
Training Strategies
INT4:
1. Post-Training Quantization (PTQ):
 Model trained in FP32 (32-bit floating point), then quantized to INT4.
 Significantly reduces memory and computational requirements.
 Challenging because of the very low precision.
2. Quantization-Aware Training (QAT):
 Trained with simulated ("fake") quantization nodes emulating INT4 quantization.
 Allows the model to adapt to the constraints of low precision during training.
 Potential for better accuracy retention compared to PTQ.
INT8:
1. Post-Training Quantization (PTQ):
 Model initially trained in FP32.
 Quantized to INT8 either statically (calibrated on a representative dataset) or dynamically (adjusting during inference).
 Easier implementation without altering training.
 Minimal accuracy loss, suitable for many applications.
2. Quantization-Aware Training (QAT):
 Trained with simulated quantization nodes emulating INT8 during the forward pass.
 Helps the model learn weights robust to INT8 quantization.
 Often results in higher accuracy compared to PTQ.
 More resilient to precision loss from quantization.
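To illustrate the core of QAT, here is a minimal sketch of a fake-quantization forward pass: the weights stay in FP32, but the network computes with their quantize-dequantize round trip, so training can adapt to the precision loss. (A real QAT setup also needs a straight-through gradient estimator; this shows only the forward-pass idea, with an assumed helper name.)
import numpy as np

def fake_quant(w, bits=4):
    levels = 2 ** bits
    scale = (w.max() - w.min()) / (levels - 1)
    q = np.round((w - w.min()) / scale)   # quantize to integer levels
    return q * scale + w.min()            # immediately dequantize

w = np.random.randn(5).astype(np.float32)
print(w)                   # true FP32 weights kept by the optimizer
print(fake_quant(w, 4))    # values the forward pass actually sees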
INT4 Models: Accuracy & Performance
ACCURACY:
 INT4 quantization can lead to some loss in model accuracy compared to higher-precision formats like FP32 or INT8.
 The impact on accuracy varies depending on the model architecture, task, and dataset.
 In many cases, especially for larger models, the accuracy loss can be kept relatively small (e.g., a 1-2% drop) with proper quantization techniques.
PERFORMANCE:
 Size Reduction: Models can be significantly compressed, up to 8 times smaller than FP32 (see the arithmetic sketch below).
 Inference Speed: Faster inference times are achievable, especially on hardware optimized for low-precision operations.
 Resource Efficiency: Lower memory bandwidth requirements and potential energy savings, making it suitable for edge devices.
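The 8x figure follows directly from the bit widths, as this back-of-the-envelope check shows (the 7-billion-parameter model is an assumed example; real INT4 formats also store per-group scale metadata, so actual ratios fall slightly below 8x):
# 32-bit floats vs 4-bit integers for a hypothetical 7B-parameter model
params = 7_000_000_000
fp32_gb = params * 32 / 8 / 1e9   # 28.0 GB
int4_gb = params * 4 / 8 / 1e9    # 3.5 GB, i.e. 8x smaller
print(fp32_gb, int4_gb)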
Use Case
These are the most common and widely implemented use cases:
Mobile Device Applications:
INT4 quantization is highly valuable for smartphones, allowing complex AI models to run efficiently. This enables:
 On-device image recognition and processing
 Natural language processing for keyboards and voice assistants
 Real-time translation services
 Facial recognition for security and photo organization
 Benefits: reduced app size, faster inference, lower battery consumption, and enhanced privacy from processing data locally.
Edge Computing in IoT Devices:
In Internet of Things (IoT) devices, INT4 quantization enables:
 Smart home devices (e.g., security cameras with object detection)
 Industrial sensors for predictive maintenance
 Wearable devices for health monitoring and activity tracking
Embedded Systems in Autonomous Vehicles:
In autonomous vehicles, it enables:
 Real-time object detection and tracking
 Lane detection and navigation assistance
 Sensor data processing and fusion
INT4 quantization enables advanced AI models in resource-constrained environments by balancing performance and efficiency.
Pros & Cons
Let's explore the primary Pros & Cons of INT4 quantization.
Pros:

Reduced Model Size: It reduces model size by using 4-bit instead of 32-bit weights and activations. This achieves up to 8x compression, easing deployment on devices with limited storage.

Lower Memory Bandwidth Usage: It reduces data transfer between memory and the processor, enhancing energy efficiency and speed. This is particularly beneficial for devices with limited memory bandwidth.

Faster Inference Speed: INT4 operations are faster than floating-point operations on many platforms due to specialized hardware for low-precision integer arithmetic. This results in reduced latency and higher throughput during inference.

Scalability: INT4 quantized models are ideal for edge devices and mobile phones due to their smaller size, lower memory needs, and faster inference. This broadens AI capabilities to more devices and applications.
Cons:

Limited Range: INT4 can only represent 16 distinct values, which may not capture the full dynamic range of weights and activations in complex neural networks.

Information Loss: The conversion from higher precision (e.g., 32-bit floating point) to 4-bit integers inevitably results in loss of information.

Reduced Expressiveness: The limited precision can impact the model's ability to represent fine-grained differences in feature importance or subtle patterns in the data.

Accumulation of Errors: Throughout the network, small quantization errors can compound, potentially leading to larger discrepancies in the final output, as the sketch below illustrates.
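A toy experiment makes the compounding visible: quantize the weights of a small three-layer network and track how the quantized model's output drifts from the full-precision one, layer by layer (the network shape and scaling are arbitrary choices for illustration):
import numpy as np

rng = np.random.default_rng(0)

def q4(w):
    # round weights to 16 uniform levels, as in the demo above
    edges = np.linspace(w.min(), w.max(), 17)
    mid = (edges[:-1] + edges[1:]) / 2
    return mid[np.clip(np.digitize(w, edges) - 1, 0, 15)]

x = rng.standard_normal(64)
layers = [rng.standard_normal((64, 64)) / 8 for _ in range(3)]
exact = approx = x
for i, W in enumerate(layers, 1):
    exact = np.tanh(exact @ W)
    approx = np.tanh(approx @ q4(W))
    rel = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
    print(f"layer {i}: relative output error {rel:.4f}")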
Conclusion
INT4 quantization is a powerful technique for optimizing neural networks, offering significant reductions in model size and computational requirements. By compressing 32-bit floating-point numbers to 4-bit integers, it enables faster inference and lower memory usage, making AI models more suitable for resource-constrained devices. While there's a potential trade-off in accuracy, the benefits often outweigh the drawbacks for many applications, especially in edge computing and mobile scenarios. As hardware support improves, INT4 quantization is likely to become increasingly important in deploying efficient AI models across a wide range of devices and use cases.