This article presents the formula for calculating the theoretical max FLOPs of a given CPU system and explains the logic behind it.

Table of contents:

- CPU FLOPs (theoretical max)
- Formula for CPU FLOPs (theoretical max)
- Understanding the FLOPs formula

## CPU FLOPs (theoretical max)

The theoretical max FLOPs of a CPU is the maximum number of floating-point operations that a given CPU system can compute in one second.

This number is the baseline for calculating how efficient a given program is. For a given program:

```
Actual FLOPs = Total number of operations / Time taken
```

Efficiency is calculated as:

```
Efficiency = Actual FLOPs / Theoretical max FLOPs
```
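
The two ratios above can be sketched in Python. This is a minimal illustration rather than a benchmarking tool: the function names `actual_flops` and `efficiency` and the sample numbers are hypothetical, chosen only for this example.

```python
def actual_flops(total_operations: float, time_taken_s: float) -> float:
    """Achieved floating-point operations per second."""
    if time_taken_s <= 0:
        raise ValueError("time_taken_s must be positive")
    return total_operations / time_taken_s


def efficiency(actual: float, theoretical_max: float) -> float:
    """Fraction of the theoretical max that the program actually reached."""
    if theoretical_max <= 0:
        raise ValueError("theoretical_max must be positive")
    return actual / theoretical_max


# Hypothetical numbers: a program performing 2 x 10^12 operations in 4 seconds
# on a system whose theoretical max is 1 x 10^12 operations per second.
achieved = actual_flops(2e12, 4.0)
print(f"Actual FLOPs: {achieved:.3e}")
print(f"Efficiency:   {efficiency(achieved, 1e12):.0%}")
```

In practice, the total number of operations comes from analysing the algorithm (for example, counting multiplications and additions), and the time taken comes from a timer around the hot loop.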

## Formula for CPU FLOPs (theoretical max)

The formula for CPU FLOPs (theoretical max) is:

```
Theoretical Maximum FLOPS = Clock Speed x Number of Cores x
SIMD factor x FMA factor x Super-scalarity factor
```

where:

- SIMD factor = SIMD width / size of data type
  - SIMD width is usually 256 or 512 bits
- FMA factor = FMA width / size of data type
  - FMA width is usually 128 or 256 bits (if supported)
- Super-scalarity factor = usually 1 or 2

For example, for Intel's CascadeLake CPU, we have:

- Clock Speed = 2.5 GHz
- Number of cores = 56
- SIMD width = 512 bits
- FMA width = 256 bits
- Super-scalarity factor = 2

Theoretical Maximum FLOPS for FP32 data = 2.5 x 10^{9} x 56 x 512/32 x 256/32 x 2

Theoretical Maximum FLOPS for FP32 data = 3.584 x 10^{13} operations per second = **35,840 GFLOPs**

If we use only 1 core (that is, run on a single thread), the Theoretical Maximum FLOPS for FP32 data is 2.5 x 10^{9} x 512/32 x 256/32 x 2 = **640 GFLOPs**
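
The formula can also be written as a small Python function, which makes it easy to try other CPUs or data types. This is a sketch of the article's formula only: the function name, parameter names and the default super-scalarity factor of 1 are choices made for this illustration.

```python
def theoretical_max_flops(clock_hz: float,
                          cores: int,
                          simd_width_bits: int,
                          fma_width_bits: int,
                          dtype_bits: int,
                          superscalar_factor: int = 1) -> float:
    """Clock Speed x Number of Cores x SIMD factor x FMA factor
    x Super-scalarity factor, as defined in this article."""
    simd_factor = simd_width_bits / dtype_bits
    fma_factor = fma_width_bits / dtype_bits
    return clock_hz * cores * simd_factor * fma_factor * superscalar_factor


# Values from the CascadeLake example in this section (FP32 data, factor of 2).
all_cores = theoretical_max_flops(2.5e9, 56, 512, 256, 32, superscalar_factor=2)
one_core = theoretical_max_flops(2.5e9, 1, 512, 256, 32, superscalar_factor=2)

print(f"All cores: {all_cores / 1e9:,.0f} GFLOPs")
print(f"One core:  {one_core / 1e9:,.0f} GFLOPs")
```

Keeping every factor as a separate parameter mirrors the formula term by term, so each value can be checked against the CPU's specification individually.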

## Understanding the FLOPs formula

- Clock Speed

Clock Speed is the number of cycles the CPU can process each second. Most instructions in the AVX2 and AVX512 instruction sets take one clock cycle. A few instructions take 2 or more cycles, but 1 cycle per instruction is a good enough approximation for this calculation.

So, if you have the assembly code, you can count the number of instructions to estimate the number of CPU cycles needed to run the code.

For example, an Intel CascadeLake CPU with a clock speed of 2.7 GHz runs 2.7 x 10^{9} clock cycles per second.
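
As a rough sketch of this estimate, the snippet below turns an instruction count into cycles and then into time, using the simplification of 1 cycle per instruction. The function name and the instruction count are hypothetical.

```python
def estimated_runtime_s(instruction_count: float,
                        clock_hz: float,
                        cycles_per_instruction: float = 1.0) -> float:
    """Estimate run time from an instruction count and a clock speed.

    Assumes every instruction costs the same number of cycles, which is
    only an approximation.
    """
    cycles = instruction_count * cycles_per_instruction
    return cycles / clock_hz


# Hypothetical program: 5 x 10^9 instructions on a 2.5 GHz core.
print(f"Estimated time: {estimated_runtime_s(5e9, 2.5e9)} s")
```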

You can get the clock speed / CPU frequency using the following command:

```
lscpu | grep GHz
```
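
If the frequency is needed inside a script, it can be extracted from a line of that output. The sketch below assumes the matched line contains a substring such as `2.70GHz`, as the model-name line of Intel CPUs typically does. The sample line and the helper name `parse_ghz` are illustrative only.

```python
import re
from typing import Optional


def parse_ghz(line: str) -> Optional[float]:
    """Return the first '<number>GHz' value found in the line, or None."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*GHz", line)
    return float(match.group(1)) if match else None


# Hypothetical line, shaped like the output of `lscpu | grep GHz`.
sample = "Model name: Intel(R) Xeon(R) CPU @ 2.70GHz"
ghz = parse_ghz(sample)
if ghz is not None:
    print(f"Clock speed: {ghz} GHz = {ghz * 1e9:.2e} cycles per second")
```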

- Number of cores

An ideal program uses all cores to the fullest for maximum performance. The reported clock speed is for one core, so the number of cores (number_of_cores) is a multiplicative factor when calculating the theoretical maximum FLOPs.

For example, the Intel CascadeLake CPU has 56 cores.

- SIMD Unit

SIMD stands for Single Instruction Multiple Data. SIMD instructions allow the same operation to be applied to multiple data elements in a single clock cycle.

It is an improvement over SISD (Single Instruction Single Data), where only one operation is processed in one clock cycle. Today, all CPU architectures come with SIMD instructions.

SIMD instructions include SSE and AVX instructions.

If a particular CPU has SIMD width of 512 bits and you are working with FP32 data (floating point 32 bits), then the number of operations that can be processed in one clock cycle is 16 (= 512/32).

This also explains why a smaller data type can perform better. With 8-bit integers, 64 operations (= 512/8) can be accommodated per clock cycle.
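
The same division can be tabulated for several data types. This is a minimal sketch: the function name `simd_lanes` is made up for this example, and FP64 and FP16 are added only to show the trend.

```python
def simd_lanes(simd_width_bits: int, dtype_bits: int) -> int:
    """Number of elements of the given size that fit in one SIMD register."""
    if simd_width_bits % dtype_bits != 0:
        raise ValueError("SIMD width must be a multiple of the data type size")
    return simd_width_bits // dtype_bits


# Operations per clock cycle at a SIMD width of 512 bits.
for name, bits in [("FP64", 64), ("FP32", 32), ("FP16", 16), ("INT8", 8)]:
    print(f"{name}: {simd_lanes(512, bits)} operations per clock cycle")
```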

```
SIMD width = Number of SIMD Units x [(Number of Multiply_add units x 2)
+ Number of Multiply units]
```

For simplicity, if the CPU supports AVX512, then the SIMD width is 512 bits.

If AVX2 is the highest instruction set the CPU supports, then the SIMD width is 256 bits.
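
This rule of thumb can be sketched as a lookup over the CPU's feature flags. The flag names used here (`avx512f`, `avx2`, `avx`, `sse2`, `sse`) are the ones Linux reports in `/proc/cpuinfo`. The function name and the sample flag lists are hypothetical, and the 128-bit SSE case goes beyond the two cases stated in the text.

```python
from typing import Iterable, Optional


def simd_width_bits(cpu_flags: Iterable[str]) -> Optional[int]:
    """Pick the SIMD width from the widest supported instruction set."""
    flags = {flag.lower() for flag in cpu_flags}
    if "avx512f" in flags:                 # AVX512 foundation instructions
        return 512
    if "avx2" in flags or "avx" in flags:  # 256-bit registers
        return 256
    if "sse2" in flags or "sse" in flags:  # 128-bit registers
        return 128
    return None                            # no known SIMD extension found


# Hypothetical flag lists for two different CPUs.
print(simd_width_bits(["sse", "sse2", "avx", "avx2", "avx512f"]))
print(simd_width_bits(["sse", "sse2", "avx", "avx2"]))
```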

- Super-scalarity factor

Super-scalarity factor is the number of instructions that can be processed in one CPU cycle. By default, it is 1.

Note that pipelining is a different technique from super-scalarity.

With this, you should have a complete understanding of the formula used to calculate the theoretical max FLOPs for a given CPU system.