In this article at OpenGenus, we explain the VPOPCNT assembly instruction, which is used to count the number of set bits.
Table of contents:
- VPOPCNT
- Assembly code with VPOPCNT
- C++ Implementation using VPOPCNT with intrinsic
VPOPCNT
VPOPCNT is a vectorized assembly instruction that counts the number of set bits in each element of a vector register. Data consists of bits that are either 0 or 1; a set bit is a bit equal to 1.
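To make "counting set bits" concrete, here is a minimal scalar sketch in C++. The helper name count_set_bits is hypothetical and used only for illustration. It does not use any vector instruction; it simply clears the lowest set bit on each loop iteration (often called Kernighan's method).

```cpp
#include <cstdint>

// Hypothetical helper (not from the article): counts the 1 bits in a 32-bit value.
// Each iteration clears the lowest set bit, so the loop runs once per set bit.
int count_set_bits(std::uint32_t x) {
    int count = 0;
    while (x != 0) {
        x &= x - 1;  // clear the lowest set bit
        ++count;
    }
    return count;
}
```

For example, count_set_bits(0xB) returns 3, because 0xB is 1011 in binary. VPOPCNT performs this kind of count on every element of a vector at once.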
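It is part of AVX-512: the VPOPCNTD and VPOPCNTQ forms (32-bit and 64-bit elements) belong to the AVX512_VPOPCNTDQ extension, and the 128-bit and 256-bit register forms additionally require AVX512VL. There is no VPOPCNT in plain AVX or AVX2. If you have a system that supports AVX512_VPOPCNTDQ, one instruction counts the set bits of every element in a 512-bit register, producing one count per element rather than a single total.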
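VPOPCNT has a latency of about 3 clock cycles on Intel processors that support it, such as Ice Lake; the exact figure depends on the microarchitecture.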
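Because not every CPU supports this instruction, a program may want to check for it at run time before using it. The following is a minimal sketch that assumes GCC or Clang on an x86 target: __builtin_cpu_supports is a compiler builtin, not standard C++, and the helper names has_vpopcntdq_512 and has_vpopcntdq_256 are hypothetical.

```cpp
// Assumption: GCC or Clang targeting x86/x86-64 (uses a compiler builtin).
// Hypothetical helper names, for illustration only.

// True if the 512-bit VPOPCNTD/VPOPCNTQ forms can be used.
bool has_vpopcntdq_512() {
    // AVX512_VPOPCNTDQ builds on the AVX-512 Foundation (AVX512F).
    return __builtin_cpu_supports("avx512f") &&
           __builtin_cpu_supports("avx512vpopcntdq");
}

// True if the 128-bit and 256-bit forms can be used as well.
bool has_vpopcntdq_256() {
    // The narrower register forms additionally need AVX512VL.
    return has_vpopcntdq_512() && __builtin_cpu_supports("avx512vl");
}
```

A caller could use these checks to choose between a VPOPCNT code path and a scalar fallback.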
Assembly code with VPOPCNT
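The code in this section assumes the input data is in the zmm0 vector register as 16 packed 32-bit integers. VPOPCNT works on vector registers only, so it cannot be applied to a general-purpose register such as eax (the scalar POPCNT instruction covers that case). The code uses vpopcntd to count the set bits of each element and stores the 16 counts in zmm1.
Following is the assembly code (NASM syntax) with the VPOPCNT instruction:
section .text
global _start
_start:
; input: 16 packed 32-bit integers in zmm0
; use vpopcntd to count the set bits of each 32-bit element
vpopcntd zmm1, zmm0
; result: 16 per-element counts in zmm1
; exit program (Linux int 0x80 ABI: eax = 1 is sys_exit, ebx = exit status)
mov eax, 1
xor ebx, ebx
int 0x80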
C++ Implementation using VPOPCNT with intrinsic
It uses 2 main intrinsics:
- _mm256_popcnt_epi32: counts the number of set bits in each of the eight 32-bit elements of a 256-bit vector
- _mm256_add_epi32: adds the per-element counts from _mm256_popcnt_epi32 into a running sum across all the data
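The element-wise behaviour of these two intrinsics can be sketched in portable C++. This is only an illustrative model, not how the intrinsics are implemented: the type Lanes and the functions model_popcnt_epi32 and model_add_epi32 are hypothetical names, with an array of eight 32-bit values standing in for one 256-bit register.

```cpp
#include <array>
#include <bitset>
#include <cstddef>
#include <cstdint>

// Eight 32-bit lanes standing in for one 256-bit register (illustrative model only).
using Lanes = std::array<std::uint32_t, 8>;

// Models _mm256_popcnt_epi32: each output lane holds the set-bit count
// of the matching input lane.
Lanes model_popcnt_epi32(const Lanes& v) {
    Lanes out{};
    for (std::size_t i = 0; i < v.size(); ++i) {
        out[i] = static_cast<std::uint32_t>(std::bitset<32>(v[i]).count());
    }
    return out;
}

// Models _mm256_add_epi32: lane-by-lane 32-bit addition (wraps on overflow).
Lanes model_add_epi32(const Lanes& a, const Lanes& b) {
    Lanes out{};
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}
```

Neither operation mixes lanes, which is why the vectorized code still has to add the eight per-lane sums together at the end to get a single total.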
Following is the complete C++ code using VPOPCNT as an intrinsic:
#include <immintrin.h>
#include <cstddef>
#include <cstdint>
// Requires AVX512_VPOPCNTDQ and AVX512VL
// (e.g. compile with -mavx512vpopcntdq -mavx512vl on GCC or Clang)
// Compute the number of set bits in the given array of integers
int popcnt(const uint32_t* data, size_t n) {
    size_t i = 0;
    // Eight 32-bit running sums, one per lane
    __m256i sum = _mm256_setzero_si256();
    // Process the data in chunks of 8 integers
    for (; i + 8 <= n; i += 8) {
        __m256i chunk = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        sum = _mm256_add_epi32(sum, _mm256_popcnt_epi32(chunk));
    }
    // Process any remaining integers: copy them into a zero-padded buffer
    // (i is a multiple of 8 here, so i % 8 starts at 0; the zero padding adds nothing to the count)
    uint32_t remaining[8] = {0};
    for (; i < n; ++i) {
        remaining[i % 8] = data[i];
    }
    __m256i chunk = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(remaining));
    sum = _mm256_add_epi32(sum, _mm256_popcnt_epi32(chunk));
    // Compute the total popcount by summing the eight per-lane counts
    uint32_t count[8];
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(count), sum);
    return static_cast<int>(count[0] + count[1] + count[2] + count[3] +
                            count[4] + count[5] + count[6] + count[7]);
}
With this article at OpenGenus, you should now have a clear idea of the VPOPCNT vectorized instruction.