FP32 (Floating point format for Deep Learning)
FP32 is a floating point data format for Deep Learning where data is represented as a 32-bit floating point number. FP32 is the most widely used data format across all Machine Learning/ Deep Learning applications.
Table of contents:
 Introduction to FP32 (Floating point 32 bits)
 Components in FP32
 Use of FP32
 FP32 conversion to FP16 and FP64
 FP32 vs FP16 vs FP64 vs INT8
Introduction to FP32 (Floating point 32 bits)
FP32 is also known as the single precision floating point format. The size of the floating point format impacts the following:
 Fewer bits mean less memory consumption (size of data)
 More bits mean more accuracy (results need to be reasonably accurate)
 Fewer bits mean reduced training/ inference time (impacts arithmetic and network bandwidth)
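As a quick illustration of the memory point, here is a minimal sketch using NumPy that compares the size of the same array stored at different precisions:

import numpy as np

# One million values stored at different precisions
data = np.random.rand(1_000_000)
for dtype in (np.float64, np.float32, np.float16):
    arr = data.astype(dtype)
    # nbytes = number of elements * size of one element
    print(dtype.__name__, arr.nbytes / 1e6, "MB")
# float64 8.0 MB
# float32 4.0 MB
# float16 2.0 MB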
Components in FP32
There are 32 bits in FP32 which are divided as follows from left to right:
 1 bit: Sign bit
 8 bits: Exponent
 23 bits: Fraction
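We can inspect this layout directly. The following is a minimal Python sketch that uses the standard struct module to print the sign, exponent and fraction bits of a number:

import struct

def fp32_bits(x):
    # Reinterpret the 32-bit float as an unsigned integer, then format as binary
    (n,) = struct.unpack(">I", struct.pack(">f", x))
    b = format(n, "032b")
    return "sign=" + b[0] + " exponent=" + b[1:9] + " fraction=" + b[9:]

print(fp32_bits(1.0))   # sign=0 exponent=01111111 fraction=00000000000000000000000
print(fp32_bits(-2.5))  # sign=1 exponent=10000000 fraction=01000000000000000000000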
A floating point number is represented as having three components:
 Integer component (say X) (a single digit, 0 to 9)
 Decimal component (say Y)
 Exponent (say E)
The floating point number is written as X.YeE, which means X.Y * 10^E.
So, a floating point number such as 1.92e-4 is the same as 0.000192.
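A quick check of this notation in Python:

x = 1.92e-4              # X = 1, Y = 92, E = -4, i.e. 1.92 * 10^-4
print(x)                 # 0.000192
print(format(x, ".2e"))  # 1.92e-04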
This number is stored internally using 32 bits. The number of bits determines:
 The range of the decimal component that can be represented (determines the accuracy)
 The range of exponent
 In short, it determines the range and accuracy of floating point numbers
In FP32, 9 bits (the sign bit and the 8 exponent bits) are used for the sign and range, and 23 bits (the fraction) are used for the accuracy/ decimal part.
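NumPy's finfo reports exactly these properties for FP32:

import numpy as np

info = np.finfo(np.float32)
print(info.bits)  # 32, total number of bits
print(info.max)   # ~3.4028235e+38, largest representable value (range)
print(info.tiny)  # ~1.1754944e-38, smallest positive normal value (range)
print(info.eps)   # ~1.1920929e-07, spacing between 1.0 and the next value (accuracy)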
Use of FP32
The use of the FP32 datatype is as follows:
 FP32 is supported in all major Deep Learning Inference software
 FP32 is supported in all x86 CPUs and NVIDIA GPUs
 FP32 is the default float datatype in Programming Languages
 FP32 in Deep Learning models
FP32 is supported in all major Deep Learning Inference software
TensorFlow supports FP32 as a standard datatype:
tf.float32
PyTorch supports FP32 as the default float datatype:
torch.float
torch.float32
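A minimal sketch of both in use (assuming tensorflow and torch are installed):

import tensorflow as tf
import torch

t = tf.constant([1.5, 2.5], dtype=tf.float32)
print(t.dtype)                       # <dtype: 'float32'>

p = torch.tensor([1.5, 2.5])         # FP32 is the default
print(p.dtype)                       # torch.float32
print(torch.float is torch.float32)  # True, torch.float is an alias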
* FP32 is supported in all x86 CPUs and NVIDIA GPUs
FP32 has native hardware support on all x86 CPUs and NVIDIA GPUs, and it has been the default float size for Deep Learning calculations since the beginning. Even standard Programming Languages support FP32 as a standard float datatype.
* FP32 is the default float datatype in Programming Languages
In many high-level programming languages, the basic float type is FP32 (for example, float in C, C++ and Java), though some, like Python, use FP64 for their built-in float. FP64 is used for high precision calculations, while lower precision types like INT8 are not available in all programming languages.
INT8 and other fixed-width integer types are supported in languages like C and C++ (for example, int8_t).
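In Python, for instance, the built-in float is FP64, and FP32 comes from libraries such as NumPy:

import numpy as np

x = 1.5                      # Python's built-in float is 64-bit (FP64)
print(np.float64(x).nbytes)  # 8 bytes
print(np.float32(x).nbytes)  # 4 bytes
print(np.int8(100))          # 100, an 8-bit integer (range -128 to 127)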
* FP32 in Deep Learning models
FP32 is the most common datatype in Deep Learning and Machine Learning models. The activations, weights and inputs are in FP32.
Converting activations and weights to lower precision like INT8 is an optimization technique.
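A minimal PyTorch sketch confirming that weights, inputs and activations default to FP32:

import torch
import torch.nn as nn

model = nn.Linear(4, 2)          # a tiny layer with a weight matrix and a bias
for name, param in model.named_parameters():
    print(name, param.dtype)     # weight torch.float32, bias torch.float32

x = torch.randn(1, 4)            # inputs are FP32 by default too
print(model(x).dtype)            # torch.float32 (the activations)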
FP32 conversion to FP16 and FP64
Converting FP32 to lower precision like INT32, INT8, FP16 and others involves a loss of accuracy. In general, the approach for the conversion is to map the range of FP32 values to the range of the destination type.
Similarly, we can convert FP32 to higher precision like FP64.
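A minimal NumPy sketch of both directions, plus the range-mapping idea behind INT8 (a simplified symmetric quantization, not the scheme of any particular library):

import numpy as np

x = np.float32(0.1234567)
print(x.astype(np.float16))  # ~0.1235, precision is lost
print(x.astype(np.float64))  # lossless: every FP32 value fits in FP64

# Range mapping for INT8: scale FP32 values into [-127, 127]
vals = np.array([-2.0, 0.5, 1.75], dtype=np.float32)
scale = np.abs(vals).max() / 127.0
q = np.round(vals / scale).astype(np.int8)
print(q)           # [-127 32 111], the quantized values
print(q * scale)   # approximately the original values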
OpenVINO supports standard functions for the conversion of FP32 to FP16:
ie_fp16 InferenceEngine::PrecisionUtils::f32tof16(float x)
FP32 vs FP16 vs FP64 vs INT8

FP64 has more precision and range than FP32 and hence is used for scientific purposes such as astronomical calculations.

FP16 uses less memory than FP32 but also has less precision. It is mainly used in Deep Learning applications where the loss in precision does not significantly impact the accuracy of the system.

INT8 uses significantly less memory than FP32 and hence is used in Deep Learning applications for significant performance gains. The loss in accuracy is handled by quantization techniques.
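A quick way to see these precision differences is to store the same constant at each precision:

import numpy as np

print(np.float64(np.pi))  # 3.141592653589793, ~15-16 significant digits
print(np.float32(np.pi))  # 3.1415927, ~7 significant digits
print(np.float16(np.pi))  # ~3.14, ~3 significant digits
print(np.int8(3))         # 3, INT8 cannot store the fractional part at all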

 In terms of memory:
FP64 > FP32 > FP16 > INT8
 In terms of accuracy:
FP64 > FP32 > FP16 > INT8
 In terms of widespread use in Deep Learning applications:
FP32 > INT8 > FP16 > FP64
 In terms of preferred use in Deep Learning applications (for performance):
INT8 > FP16 > FP32 > FP64
 In terms of preferred use in Scientific calculations:
FP64 > FP32 > FP16 > INT8
INT8 and FP16 are almost never used in Scientific calculations.
With this article at OpenGenus, you must have a strong idea of FP32 (floating point 32 bits) in Machine Learning.