__builtin_popcount and POPCNT

Get FREE domain for 1st year and build your brand new site

Internship at OpenGenus

In this article, we have explored about __builtin_popcount - a built-in function of GCC, which helps us to count the number of 1's(set bits) in an integer in C and C++. POPCNT is the assemby instruction used in __builtin_popcount.

The population count (or popcount) of a specific value is the number of set bits in that value. Calculating the population count efficiently has been widely studied with implementations existing for both software and hardware.
__builtin_popcount is a compiler-specific extension built into GCC, so no need of adding any header or library to use this function. Prototype of this function is as follows.

int __builtin_popcount (unsigned int x)

This function call returns an integer which is the count of non zero bits of a given number/integer.Example for the same is as given below.

#include <stdio.h>

int main(){
    int num = 22; // 22 in binary = 00000000 00000000 00000000 00010110
    printf("Number of 1's is = %d", __builtin_popcount(num));
    return 0;
}

Output:

Number of 1's is = 3

Time complexity of this function is O(NUM). i.e it directly depends on number of bits in that input number.We can mimic same functionality by our own function as shown below,

int setbit_Count(int NUM){
    int count=0;
    while(NUM>0){
        count+=(NUM&1);
        NUM=NUM>>1;
    }
return count;
}

Similarly we can use __builtin_popcountl for long data type and __builtin_popcountll long long data types. Both returns an integer type for count of 1's. Prototypes for these functions are as follows.

int __builtin_popcountl (unsigned long)

int __builtin_popcountll (unsigned long long)

Similar to this built-in library function of GCC, C++ also provides similar function as std::popcount with similar features.

POPCNT

Internally, __builtin_popcount uses a specific hardware instruction. In a x86 architecture POPCNT is used by compiler. The first CPU to support the POPCNT instruction was Intel's Nehalem.To make use of POPCNT compiler should support SSE4.

SSE (Streaming SIMD Extensions) is a processor technology that enables single instruction multiple data. Older processors only process a single data element per instruction. SSE enables the instruction to handle multiple data elements. It's used in intensive applications, such as 3D graphics, for faster processing. By default compiler uses __popcountdi2 for calculationg set bits. Without SSE4 performance of this operation(calculating set bits) will be slower.

We can provide target to GCC to use SSE4 as:

#pragma GCC target ("sse4")

The above instruction specifies that SSE4 standards should be followed while compiling. Below are the example assembly codes with and without the use of SSE4.

Without SSE4 - x86-64 GCC 9.2

int popcount(int x) {
    return __builtin_popcount(x);
}

compiler output(assembly code) - compilation time - 809ms

popcount(int):
    sub     rsp, 8
    mov     edi, edi
    call    __popcountdi2
    add     rsp, 8
    ret

With SSE4 - x86-64 GCC 9.2

#pragma GCC target("sse4")

int popcount(int x) {
    return __builtin_popcount(x);
}

compiler output(assembly code) - compilation time - 777ms

popcount(int):
    xor     eax, eax
    popcnt  eax, edi
    ret

popcnt calculates the number of bits set to 1 in the second operand (source) and returns the count in the first operand (a destination register).

Example

There are N ≤ 5000 workers. Each worker is available during some days of this month (which has 30 days). For each worker, you are given a set of numbers, each from interval (1,30), representing his/her availability for that month. You need to assign an important project to two workers but they will be able to work on the project only when they are both available. Find two workers that are best for the job — maximize the number of days when both these workers are available.

  • Firstly, We can think about the availability of a worker as a binary string of length 30, which can be stored in a single int. With this representation, we can count the intersection size in O(1) by using _builtinpopcount(x[i] & x[j]). We can compute the intersection of two workers (two sets) in O(30) by using e.g. two pointers for two sorted sequences. Doing that for every pair of workers yields O(N^2 * 30). The complexity becomes O(N^2).

For example, If a worker is available on these days of month then the same can be expresses in binary as-

  • {2,5,6,9,15,17,18,22,26,27} (Number of days available w.r.t perticular worker)
  • 01001100 10000010 11000100 01100000 (in Binary)
  • 1283638368 (in Decimal)
    This data is stored in an integer array according to number of workers.
#pragma GCC target("sse4")
#include <stdio.h>

const int K = 30;    //Number of days
unsigned int x[N];   //Number of workers
//Each entry of x[N] is populated.

int intersection(int i, int j) {
	int total = 0;
	total = __builtin_popcount(x[i] & x[j]);
	return total;
}

int main(){
    int length = sizeof(x)/sizeof(x[0]);
    int temp=0, max=0, a=0, b=1;
    for(int i=0; i<length-1; i++){
        for(int j=1; j<length; j++){
            temp = intersection(i, j);
            if(temp > max){
                a = i;
                b = j;
                max = temp;
             }
         }
     }
/*
after this iteration a and b will contain id's of workers 
where maximum the number of days when both these workers are available.
Edge cases are not included here.
*/
return 0;
}

This program shows the use of __builtin_popcount as a part of larger problem statement.

With this article at OpenGenus, you must have a complete idea of __builtin_popcount and POPCNT. Enjoy.