
In this article at OpenGenus, you must have the complete idea of VPOPCNT assembly instruction that is used to get the number of set bits.
Table of contents:
- VPOPCNT
- Assembly code with VPOPCNT
- C++ Implementation using VPOPCNT with intrinsic
VPOPCNT
VPOPCNT is a vectorized assembly instruction to count the number of set bits in a given register/ data. A data consist of 0 and 1. A set bit refer to 1.
It is supported in AVX512 and AVX256 as well. If you have a system that supports AVX512, it means VPOPCNT can be used to get the number of set bits in 512 bits using one instruction.
VPOPCNT is executed in 3 clock cycles.
Assembly code with VPOPCNT
The code in this section has a data in eax register. It uses vpopcnt to find the number of set bits and stores the result in eax (same register).
Following is the assembly code with VPOPCNT instruction:
section .text
global _start
_start:
; input integer in eax
; use vpopcnt instruction to count number of set bits
vpopcnt eax, eax
; result in eax
; exit program
mov eax, 1
xor ebx, ebx
int 0x80
C++ Implementation using VPOPCNT with intrinsic
It use 2 main intrinsics:
_mm256_popcnt_epi32
intrinsic to count number of set bits in a data of 256 bits_mm256_add_epi32
to add the number of set bits from_mm256_popcnt_epi32
across all data
Following is the complete C++ code using VPOPCNT as an intrinsic:
#include <immintrin.h>
// Compute the number of set bits in the given array of integers
int popcnt(const uint32_t* data, size_t n) {
size_t i = 0;
__m256i sum = _mm256_setzero_si256();
// Process the data in chunks of 8 integers
for (; i + 8 <= n; i += 8) {
__m256i chunk = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
sum = _mm256_add_epi32(sum, _mm256_popcnt_epi32(chunk));
}
// Process any remaining integers
uint32_t remaining[8] = {0};
for (; i < n; ++i) {
remaining[i % 8] = data[i];
}
__m256i chunk = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(remaining));
sum = _mm256_add_epi32(sum, _mm256_popcnt_epi32(chunk));
// Compute the total popcount by summing the individual counts
uint32_t count[8];
_mm256_storeu_si256(reinterpret_cast<__m256i*>(count), sum);
return count[0] + count[1] + count[2] + count[3] + count[4] + count[5] + count[6] + count[7];
}
With this article at OpenGenus, you must have the complete idea of VPOPCNT vectorized instruction.