FP32 (Floating point format for Deep Learning)
FP32 is a floating point data format for Deep Learning where data is represented as a 32-bit floating point number. FP32 is the most widely used data format across all Machine Learning/ Deep Learning applications.
Table of contents:
 Introduction to FP32 (Floating point 32 bits)
 Components in FP32
 Use of FP32
 FP32 conversion to FP16 and FP64
 FP32 vs FP16 vs FP64 vs INT8
Introduction to FP32 (Floating point 32 bits)
FP32 is also known as the single precision floating point format. The size of the floating point format impacts the following:
 Fewer bits mean less memory consumption (size of data)
 More bits mean more accuracy (results need to be reasonably accurate)
 Fewer bits mean reduced training/ inference time (impacts arithmetic and network bandwidth)
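As a quick illustration of the memory point, here is a minimal sketch using NumPy that compares the size of the same array stored at different precisions:

import numpy as np

# One million values stored at different precisions
data = np.random.rand(1_000_000)
for dtype in (np.float64, np.float32, np.float16):
    arr = data.astype(dtype)
    # nbytes = number of elements * size of one element
    print(dtype.__name__, arr.nbytes / 1e6, "MB")
# float64 8.0 MB
# float32 4.0 MB
# float16 2.0 MB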
Components in FP32
There are 32 bits in FP32 which are divided as follows from left to right:
 1 bit: Sign bit
 8 bits: Exponent
 23 bits: Fraction
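We can inspect this layout directly. The following is a minimal Python sketch that uses the standard struct module to print the sign, exponent and fraction bits of a number:

import struct

def fp32_bits(x):
    # Reinterpret the 32-bit float as an unsigned integer, then format as binary
    (n,) = struct.unpack(">I", struct.pack(">f", x))
    b = format(n, "032b")
    return "sign=" + b[0] + " exponent=" + b[1:9] + " fraction=" + b[9:]

print(fp32_bits(1.0))   # sign=0 exponent=01111111 fraction=00000000000000000000000
print(fp32_bits(-2.5))  # sign=1 exponent=10000000 fraction=01000000000000000000000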
A floating point number is represented as having three components:
 Integer component (say X) (a single digit, 0 to 9)
 Decimal component (say Y)
 Exponent (say E)
The floating point number is written as X.YeE, which means X.Y * 10^E.
So, a floating point number such as 1.92e-4 is the same as 0.000192.
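A quick check of this notation in Python:

x = 1.92e-4              # X = 1, Y = 92, E = -4, i.e. 1.92 * 10^-4
print(x)                 # 0.000192
print(format(x, ".2e"))  # 1.92e-04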
This number is stored internally using 32 bits. The number of bits determines:
 The range of the decimal component that can be represented (determines the accuracy)
 The range of exponent
 In short, it determines the range and accuracy of floating point numbers
In FP32, 9 bits (the sign bit and the 8 exponent bits) are used for the sign and range, and 23 bits (the fraction) are used for the accuracy/ decimal part.
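NumPy's finfo reports exactly these properties for FP32:

import numpy as np

info = np.finfo(np.float32)
print(info.bits)  # 32, total number of bits
print(info.max)   # ~3.4028235e+38, largest representable value (range)
print(info.tiny)  # ~1.1754944e-38, smallest positive normal value (range)
print(info.eps)   # ~1.1920929e-07, spacing between 1.0 and the next value (accuracy)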
Use of FP32
The use of the FP32 datatype is as follows:
 FP32 is supported in all major Deep Learning Inference software
 FP32 is supported in all x86 CPUs and NVIDIA GPUs
 FP32 is the default float datatype in Programming Languages
 FP32 in Deep Learning models
FP32 is supported in all major Deep Learning Inference software
TensorFlow supports FP32 as a standard datatype:
tf.float32
PyTorch supports FP32 as the default float datatype:
torch.float
torch.float32
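A minimal sketch of both in use (assuming tensorflow and torch are installed):

import tensorflow as tf
import torch

t = tf.constant([1.5, 2.5], dtype=tf.float32)
print(t.dtype)                       # <dtype: 'float32'>

p = torch.tensor([1.5, 2.5])         # FP32 is the default
print(p.dtype)                       # torch.float32
print(torch.float is torch.float32)  # True, torch.float is an alias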
* FP32 is supported in all x86 CPUs and NVIDIA GPUs
FP32 has native hardware support on all x86 CPUs and NVIDIA GPUs, and it has been the default float size for Deep Learning calculations since the beginning. Even standard Programming Languages support FP32 as a standard float datatype.
* FP32 is the default float datatype in Programming Languages
In many high-level programming languages, the basic float type is FP32 (for example, float in C, C++ and Java), though some, like Python, use FP64 for their built-in float. FP64 is used for high precision calculations, while lower precision types like INT8 are not available in all programming languages.
INT8 and other fixed-width integer types are supported in languages like C and C++ (for example, int8_t).
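In Python, for instance, the built-in float is FP64, and FP32 comes from libraries such as NumPy:

import numpy as np

x = 1.5                      # Python's built-in float is 64-bit (FP64)
print(np.float64(x).nbytes)  # 8 bytes
print(np.float32(x).nbytes)  # 4 bytes
print(np.int8(100))          # 100, an 8-bit integer (range -128 to 127)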
* FP32 in Deep Learning models
FP32 is the most common datatype in Deep Learning and Machine Learning models. The activations, weights and inputs are in FP32.
Converting activations and weights to lower precision like INT8 is an optimization technique.
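A minimal PyTorch sketch confirming that weights, inputs and activations default to FP32:

import torch
import torch.nn as nn

model = nn.Linear(4, 2)          # a tiny layer with a weight matrix and a bias
for name, param in model.named_parameters():
    print(name, param.dtype)     # weight torch.float32, bias torch.float32

x = torch.randn(1, 4)            # inputs are FP32 by default too
print(model(x).dtype)            # torch.float32 (the activations)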
FP32 conversion to FP16 and FP64
Converting FP32 to lower precision like INT32, INT8, FP16 and others involves a loss of accuracy. In general, the approach for the conversion is to map the range of FP32 values to the range of the destination type.
Similarly, we can convert FP32 to higher precision like FP64.
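A minimal NumPy sketch of both directions, plus the range-mapping idea behind INT8 (a simplified symmetric quantization, not the scheme of any particular library):

import numpy as np

x = np.float32(0.1234567)
print(x.astype(np.float16))  # ~0.1235, precision is lost
print(x.astype(np.float64))  # lossless: every FP32 value fits in FP64

# Range mapping for INT8: scale FP32 values into [-127, 127]
vals = np.array([-2.0, 0.5, 1.75], dtype=np.float32)
scale = np.abs(vals).max() / 127.0
q = np.round(vals / scale).astype(np.int8)
print(q)           # [-127 32 111], the quantized values
print(q * scale)   # approximately the original values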
OpenVINO supports standard functions for the conversion of FP32 to FP16:
ie_fp16 InferenceEngine::PrecisionUtils::f32tof16(float x)
FP32 vs FP16 vs FP64 vs INT8

FP64 has more precision and range than FP32 and hence is used for scientific purposes such as astronomical calculations.

FP16 uses less memory than FP32 but also has less precision. It is mainly used in Deep Learning applications where the loss in precision does not significantly impact the accuracy of the system.

INT8 uses significantly less memory than FP32 and hence is used in Deep Learning applications for significant performance gains. The loss in accuracy is handled by quantization techniques.
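A quick way to see these precision differences is to store the same constant at each precision:

import numpy as np

print(np.float64(np.pi))  # 3.141592653589793, ~15-16 significant digits
print(np.float32(np.pi))  # 3.1415927, ~7 significant digits
print(np.float16(np.pi))  # ~3.14, ~3 significant digits
print(np.int8(3))         # 3, INT8 cannot store the fractional part at all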

 In terms of memory:
FP64 > FP32 > FP16 > INT8
 In terms of accuracy:
FP64 > FP32 > FP16 > INT8
 In terms of widespread use in Deep Learning applications:
FP32 > INT8 > FP16 > FP64
 In terms of preferred use in Deep Learning applications (for performance):
INT8 > FP16 > FP32 > FP64
 In terms of preferred use in Scientific calculations:
FP64 > FP32 > FP16 > INT8
INT8 and FP16 are almost never used in Scientific calculations.
With this article at OpenGenus, you must have a strong idea of FP32 (floating point 32 bits) in Machine Learning.