Branch prediction explained with a code example

Reading time: 25 minutes | Coding time: 10 minutes

Branch prediction is an optimization technique which predicts the path a code will take before it is known for sure. This matters because while executing a code, the machine loads the few code statements in advance and keeps it in the pipeline. The problem occurs when there is a conditional branch in which case, there are two possible paths or code section that can be executed.

Consider there are 3 lines of code L1, L2 and L3:


L1
L2
L3

You can think of the computer to have two operational units:

  • One executes commands or performs arithmetic/ logical operations
  • One fetches data from the main memory

Both of the above two process can run in parallel to make your program execute faster. In this sense, when the computer is performing the operations in L1, one part of the computer will read the instructions of L2.

Now, reading from main memory takes a lot of time, so, by reading in parallel to execution, your computing device is saving a lot of time.

The problem arises in case of conditional statement. Consider that L1 is a conditional statement. If L1 evaluates to be true, the next instruction will be L2 or else it will be L3.

The computer will know which statement to execute after the evaluation of L1 but it will not sit ideal during the evaluation time. Your computing device will make a smart guess and choose either of L2 or L3 and load it into memory. If the guess turns out to be correct, everything goes well but it is turns out to be wrong, then the loaded instruction should be dropped and the actual instruction has to be read which consumes additional time and hence, the program execution goes slow.

In general, the guess made by your computing device is based upon:

  • The path taken in previous execution of the program

This is the general idea behind branch prediction. There are a lot of other details behind the exact execution of this optimization which we will cover in a separate article.

We will observe branch prediction in practice by following an example.

Let us analyse branch prediction by going through an example

Testing code

The following code in C++ demonstrates branch prediction by using a conditional branch on a sorted and unsorted data and later, replaces the branch with a bitwise operation.

Take a look into this code:


#include <algorithm>
#include <ctime>
#include <iostream>
#define LOOP 1000000
void sorted_data_with_branch() 
{
  std::cout << "Sorted Data with branch: ";
  // Generate data
  const unsigned arraySize = 32768;
  int data[arraySize];
  for (unsigned c = 0; c < arraySize; ++c)
    data[c] = std::rand() % 256;
  std::sort(data, data + arraySize);
  clock_t start = clock();
  long long sum = 0;
  for (unsigned i = 0; i < LOOP; ++i)
  {
    for (unsigned c = 0; c < arraySize; ++c)
    {
      if (data[c] >= 128)
        sum += data[c];
    }
  }
  double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
  std::cout << elapsedTime << " seconds " << std::endl;
}
void unsorted_data_with_branch() 
{
  std::cout << "Unsorted data with branch: ";
  const unsigned arraySize = 32768;
  int data[arraySize];
  for (unsigned c = 0; c < arraySize; ++c)
    data[c] = std::rand() % 256;
  clock_t start = clock();
  long long sum = 0;
  for (unsigned i = 0; i < LOOP; ++i)
  {
    for (unsigned c = 0; c < arraySize; ++c)
    {
      if (data[c] >= 128)
        sum += data[c];
    }
  }
  double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
  std::cout << elapsedTime << " seconds" << std::endl;
}
void sorted_data_with_bitwise()
{
  std::cout << "Sorted Data without branch and with a bitwise operation: ";
  // Generate data
  const unsigned arraySize = 32768;
  int data[arraySize];
  for (unsigned c = 0; c < arraySize; ++c)
    data[c] = std::rand() % 256;
  std::sort(data, data + arraySize);
  clock_t start = clock();
  long long sum = 0;
  for (unsigned i = 0; i < LOOP; ++i)
  {
    for (unsigned c = 0; c < arraySize; ++c)
    {
        int t = (data[c] - 128) >> 31;
        sum += ~t & data[c];
    }
  }
  double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
  std::cout << elapsedTime << " seconds" << std::endl;
}
void unsorted_data_with_bitwise()
{
  std::cout << "Unsorted Data without branch and with a bitwise operation: ";
  // Generate data
  const unsigned arraySize = 32768;
  int data[arraySize];
  for (unsigned c = 0; c < arraySize; ++c)
    data[c] = std::rand() % 256;
  clock_t start = clock();
  long long sum = 0;
  for (unsigned i = 0; i < LOOP; ++i)
  {
    for (unsigned c = 0; c < arraySize; ++c)
    {
        int t = (data[c] - 128) >> 31;
        sum += ~t & data[c];
    }
  }
  double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
  std::cout << elapsedTime << " seconds" << std::endl;
}
int main()
{
  sorted_data_with_branch();
  unsorted_data_with_branch();
  sorted_data_with_bitwise();
  unsorted_data_with_bitwise();
  return 0;
}

Save the above code in a file named "branch.cpp" and then, compile and execute the code using the following command:


g++ branch.cpp
./a.out

Results Analysed

Results of the testing code with no compiler optimization is as follows:


Sorted Data with branch: 65.3733 seconds
Unsorted data with branch: 138.658 seconds
Sorted Data without branch and with a bitwise operation: 72.1379 seconds
Unsorted Data without branch and with a bitwise operation: 72.1375 seconds

The exact time will vary depending on the hardware infrastructure but the relative values will be the same.

We can make the following key observations:

  • Time taken for code with the conditional branch with sorted data is nearly half that of with unsorted or random data. This is because of branch prediction.

  • Time taken for code with bitwise operation with sorted data and unsorted / random data is nearly same. Branch prediction does not play any role in this case.

  • Time taken for code with bitwise operation is slightly more than the code with conditional branch and sorted data.

Visualization of code

Following illustration demonstrates the branch prediction in the above code:


T = branch taken
N = branch not taken

data[] = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
branch = N  N  N  N  N  ...   N    N    T    T    T  ...   T    T    T  ...
    
       = NNNNNNNNNNNN ... NNNNNNNTTTTTTTTT ... TTTTTTTTTT  (easy to predict)

When the data is completely random, the branch predictor is rendered useless because it cannot predict random data. Thus, there will be around 50% misprediction. (no better than random guessing)


data[] = 226, 185, 125, 158, 198, 144, 217, 79, 202, 118,  14, 150, ...
branch =   T,   T,   N,   T,   T,   T,   T,  N,   T,   N,   N,   T, ...

       = TTNTTTTNTNNTTTN ...   (random - hard to predict)