GCC Compiler Intrinsics

In this article, we will discuss the GNU Compiler Collection (GCC), the fundamentals of intrinsics, some of the ways in which these intrinsics can speed up vector code, and we will also take a look at a list of some of the x86 intrinsics that GCC offers.

GCC Compiler

The GNU Compiler Collection, or GCC, is a compiler produced by the GNU Project. It could initially handle only the C programming language, but it has since grown to support numerous programming languages, hardware architectures and operating systems, and it has been ported to more platforms than any other compiler. It is also available for many embedded systems. GCC is the official compiler of the GNU operating system and the standard compiler of many Unix-like operating systems. It can also compile code for some of the most widely used operating systems, namely Windows, Android and iOS.

We will further discuss GCC's intrinsics, and how they can help speed up vector code.

What Are Intrinsics?

Intrinsics, also known as built-ins, are functions that the compiler itself recognizes and implements, without the program having to declare them. The compiler does not have to link to a run-time library in order to perform an operation involving intrinsics.

Intrinsics, unlike in-line functions (which also omit the overhead of a function call), are provided entirely by the compiler, and there isn't any place in the source code where they are explicitly defined. Another difference between in-line functions and intrinsics is that in-line functions will not work if we do not include the header file that contains the definition of the function that we want to use, which isn't the case for intrinsics.
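As a small illustration of a function that needs no declaration, the sketch below calls GCC's `__builtin_popcount` without including any header. The wrapper name `count_set_bits` is a hypothetical name chosen for this example.

```c
/* No #include is needed for the built-in itself: GCC recognizes
   __builtin_popcount by name and expands it during compilation. */
unsigned count_set_bits(unsigned value)
{
    /* Counts the 1-bits in value. */
    return (unsigned)__builtin_popcount(value);
}
```

Calling `count_set_bits(0xF0u)` gives 4. When the target supports it (for example with `-mpopcnt`), GCC can compile the built-in down to a single POPCNT instruction.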

Intrinsics are not essential, since there are library-provided implementations for most functions. However, intrinsics offer us a gateway between assembly language and standard C language, thus allowing us to get more out of our specific processor, while still letting the compiler handle most of the grunt-tasks, like type checking and register allocation. This optimizes our code and can speed it up significantly, depending on the type of problem that we're trying to tackle.

GCC's x86 Intrinsics

Here, we will take a look at some of GCC's x86 intrinsics that are useful for vector processing. In vector processing, instead of applying the same instruction to multiple data elements one at a time, we apply it to a group of data elements all at once. This can save a lot of time whenever the same operation has to be performed on many data elements, which is often the case in convolutional neural network (CNN) based systems. This is also known as single instruction, multiple data, or SIMD, processing. SSE (Streaming SIMD Extensions) is a practical implementation of the SIMD architecture on x86.

To get the most out of our vector hardware, we do three things: we pass GCC the command-line options that enable the SIMD instruction sets we want, we include the corresponding header files (for MMX, SSE1 or SSE2), and we store our data in one of the vector types those headers provide.
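The three steps can be sketched as follows, assuming an x86 target with SSE enabled (`-msse`, which is on by default for x86-64). The function name `add_arrays` is hypothetical. The header is `xmmintrin.h`, the vector type is `__m128`, and the work is done by the SSE wrapper functions `_mm_loadu_ps`, `_mm_add_ps` and `_mm_storeu_ps`.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Adds two float arrays element-wise, four lanes per iteration. */
void add_arrays(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;

    /* Vector part: process four floats at a time. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }

    /* Scalar tail: handle the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

The unaligned load and store variants are used so that the sketch works for any array address; aligned variants exist but require 16-byte aligned data.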

Listed below are a few GCC command-line options that can be used to generate SIMD code:

  • -mmmx (MMX)
  • -mfpmath=sse (Instructs GCC to use SSE instructions for scalar floating-point arithmetic)
  • -msse (SSE)
  • -msse2 (SSE2)

To enable later generations of the SIMD architecture, we can also use the following GCC command-line options:

  • -mavx (AVX)
  • -mavx2 (AVX2)
  • -mavx512f (AVX512 Foundation)
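One way to check which of these options are in effect is to test the macros that GCC predefines for each enabled instruction set, such as `__SSE2__` or `__AVX2__`. The sketch below is a minimal example; the function name `enabled_simd_level` is hypothetical.

```c
/* Reports the widest x86 SIMD extension enabled at compile time,
   based on the macros GCC predefines for each -m option. */
const char *enabled_simd_level(void)
{
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#elif defined(__MMX__)
    return "MMX";
#else
    return "none";
#endif
}
```

The result reflects what the compiler was allowed to generate, not what the CPU running the program supports; the two can differ when a binary is moved to another machine.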

The include files that we will need for MMX, SSE1 and SSE2 are:

  • mmintrin.h (x86 MMX)
  • xmmintrin.h (x86 SSE1)
  • emmintrin.h (x86 SSE2)

And finally, some of the x86 vector types that we will use are as follows:

  • __m64 (MMX): 64 bits, eight 8-bit integers, four 16-bit shorts or two 32-bit integers.
  • __m128 (SSE1): 128 bits, four single-precision floating-point values.
  • __m128i (SSE2): 128 bits, packed integers of any size.
  • __m128d (SSE2): 128 bits, two 64-bit doubles.
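The built-in signatures in this article use shorter type names such as v8qi, v4hi and v4sf. GCC does not predeclare these names; they are conventionally defined with GCC's generic `vector_size` attribute, which is also how GCC's headers build types like `__m128`. The typedefs and the helper `add_v4sf` below are assumptions made for illustration.

```c
/* Vector types in the naming style used by the built-in signatures:
   element count, then q/h/s for 8/16/32-bit, then i (integer) or f (float). */
typedef char  v8qi __attribute__((vector_size(8)));   /* eight 8-bit integers  */
typedef short v4hi __attribute__((vector_size(8)));   /* four 16-bit integers  */
typedef int   v2si __attribute__((vector_size(8)));   /* two 32-bit integers   */
typedef float v4sf __attribute__((vector_size(16)));  /* four 32-bit floats    */

/* GCC lets ordinary operators act lane by lane on such types. */
v4sf add_v4sf(v4sf a, v4sf b)
{
    return a + b;
}
```

With `v4sf a = {1, 2, 3, 4}` and `v4sf b = {10, 20, 30, 40}`, `add_v4sf(a, b)` holds 11, 22, 33 and 44, and individual lanes can be read with subscripts such as `r[0]`.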

By using '-mmmx', the following x86 intrinsics become available:

v8qi __builtin_ia32_paddb (v8qi, v8qi)
v4hi __builtin_ia32_paddw (v4hi, v4hi)
v2si __builtin_ia32_paddd (v2si, v2si)
v8qi __builtin_ia32_psubb (v8qi, v8qi)
v4hi __builtin_ia32_psubw (v4hi, v4hi)
v2si __builtin_ia32_psubd (v2si, v2si)
v8qi __builtin_ia32_paddsb (v8qi, v8qi)
v4hi __builtin_ia32_paddsw (v4hi, v4hi)
v8qi __builtin_ia32_psubsb (v8qi, v8qi)
v4hi __builtin_ia32_psubsw (v4hi, v4hi)
v8qi __builtin_ia32_paddusb (v8qi, v8qi)
v4hi __builtin_ia32_paddusw (v4hi, v4hi)
v8qi __builtin_ia32_psubusb (v8qi, v8qi)
v4hi __builtin_ia32_psubusw (v4hi, v4hi)
v4hi __builtin_ia32_pmullw (v4hi, v4hi)
v4hi __builtin_ia32_pmulhw (v4hi, v4hi)
di __builtin_ia32_pand (di, di)
di __builtin_ia32_pandn (di, di)
di __builtin_ia32_por (di, di)
di __builtin_ia32_pxor (di, di)
v8qi __builtin_ia32_pcmpeqb (v8qi, v8qi)
v4hi __builtin_ia32_pcmpeqw (v4hi, v4hi)
v2si __builtin_ia32_pcmpeqd (v2si, v2si)
v8qi __builtin_ia32_pcmpgtb (v8qi, v8qi)
v4hi __builtin_ia32_pcmpgtw (v4hi, v4hi)
v2si __builtin_ia32_pcmpgtd (v2si, v2si)
v8qi __builtin_ia32_punpckhbw (v8qi, v8qi)
v4hi __builtin_ia32_punpckhwd (v4hi, v4hi)
v2si __builtin_ia32_punpckhdq (v2si, v2si)
v8qi __builtin_ia32_punpcklbw (v8qi, v8qi)
v4hi __builtin_ia32_punpcklwd (v4hi, v4hi)
v2si __builtin_ia32_punpckldq (v2si, v2si)
v8qi __builtin_ia32_packsswb (v4hi, v4hi)
v4hi __builtin_ia32_packssdw (v2si, v2si)
v8qi __builtin_ia32_packuswb (v4hi, v4hi)
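In practice these built-ins are usually reached through the wrapper functions in `mmintrin.h` rather than called by name. The sketch below uses `_mm_adds_pu8`, which in GCC's header wraps `__builtin_ia32_paddusb`, to add eight unsigned bytes with saturation. The function name `add_saturate_u8x8` is hypothetical, and an x86 target with MMX enabled is assumed.

```c
#include <mmintrin.h>  /* MMX: __m64, _mm_adds_pu8, _mm_empty */
#include <stdint.h>
#include <string.h>

/* Adds eight pairs of unsigned bytes; sums above 255 clamp to 255. */
void add_saturate_u8x8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8])
{
    __m64 va, vb, vr;

    memcpy(&va, a, sizeof va);      /* copy 8 bytes into each 64-bit vector */
    memcpy(&vb, b, sizeof vb);
    vr = _mm_adds_pu8(va, vb);      /* PADDUSB: unsigned saturating add */
    memcpy(out, &vr, sizeof vr);

    _mm_empty();                    /* leave MMX state before any x87 code runs */
}
```

Adding 250 and 10 in the same lane gives 255 rather than wrapping around to 4, which is the difference between `paddusb` and the plain `paddb`.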

By using either '-msse' or a combination of '-m3dnow' and '-march=athlon', the following x86 intrinsics become available:

v4hi __builtin_ia32_pmulhuw (v4hi, v4hi)
v8qi __builtin_ia32_pavgb (v8qi, v8qi)
v4hi __builtin_ia32_pavgw (v4hi, v4hi)
v4hi __builtin_ia32_psadbw (v8qi, v8qi)
v8qi __builtin_ia32_pmaxub (v8qi, v8qi)
v4hi __builtin_ia32_pmaxsw (v4hi, v4hi)
v8qi __builtin_ia32_pminub (v8qi, v8qi)
v4hi __builtin_ia32_pminsw (v4hi, v4hi)
int __builtin_ia32_pextrw (v4hi, int)
v4hi __builtin_ia32_pinsrw (v4hi, int, int)
int __builtin_ia32_pmovmskb (v8qi)
void __builtin_ia32_maskmovq (v8qi, v8qi, char *)
void __builtin_ia32_movntq (di *, di)
void __builtin_ia32_sfence (void)

By using '-msse', the following x86 intrinsics become available:

int __builtin_ia32_comieq (v4sf, v4sf)
int __builtin_ia32_comineq (v4sf, v4sf)
int __builtin_ia32_comilt (v4sf, v4sf)
int __builtin_ia32_comile (v4sf, v4sf)
int __builtin_ia32_comigt (v4sf, v4sf)
int __builtin_ia32_comige (v4sf, v4sf)
int __builtin_ia32_ucomieq (v4sf, v4sf)
int __builtin_ia32_ucomineq (v4sf, v4sf)
int __builtin_ia32_ucomilt (v4sf, v4sf)
int __builtin_ia32_ucomile (v4sf, v4sf)
int __builtin_ia32_ucomigt (v4sf, v4sf)
int __builtin_ia32_ucomige (v4sf, v4sf)
v4sf __builtin_ia32_addps (v4sf, v4sf)
v4sf __builtin_ia32_subps (v4sf, v4sf)
v4sf __builtin_ia32_mulps (v4sf, v4sf)
v4sf __builtin_ia32_divps (v4sf, v4sf)
v4sf __builtin_ia32_addss (v4sf, v4sf)
v4sf __builtin_ia32_subss (v4sf, v4sf)
v4sf __builtin_ia32_mulss (v4sf, v4sf)
v4sf __builtin_ia32_divss (v4sf, v4sf)
v4si __builtin_ia32_cmpeqps (v4sf, v4sf)
v4si __builtin_ia32_cmpltps (v4sf, v4sf)
v4si __builtin_ia32_cmpleps (v4sf, v4sf)
v4si __builtin_ia32_cmpgtps (v4sf, v4sf)
v4si __builtin_ia32_cmpgeps (v4sf, v4sf)
v4si __builtin_ia32_cmpunordps (v4sf, v4sf)
v4si __builtin_ia32_cmpneqps (v4sf, v4sf)
v4si __builtin_ia32_cmpnltps (v4sf, v4sf)
v4si __builtin_ia32_cmpnleps (v4sf, v4sf)
v4si __builtin_ia32_cmpngtps (v4sf, v4sf)
v4si __builtin_ia32_cmpngeps (v4sf, v4sf)
v4si __builtin_ia32_cmpordps (v4sf, v4sf)
v4si __builtin_ia32_cmpeqss (v4sf, v4sf)
v4si __builtin_ia32_cmpltss (v4sf, v4sf)
v4si __builtin_ia32_cmpless (v4sf, v4sf)
v4si __builtin_ia32_cmpunordss (v4sf, v4sf)
v4si __builtin_ia32_cmpneqss (v4sf, v4sf)
v4si __builtin_ia32_cmpnltss (v4sf, v4sf)
v4si __builtin_ia32_cmpnless (v4sf, v4sf)
v4si __builtin_ia32_cmpordss (v4sf, v4sf)
v4sf __builtin_ia32_maxps (v4sf, v4sf)
v4sf __builtin_ia32_maxss (v4sf, v4sf)
v4sf __builtin_ia32_minps (v4sf, v4sf)
v4sf __builtin_ia32_minss (v4sf, v4sf)
v4sf __builtin_ia32_andps (v4sf, v4sf)
v4sf __builtin_ia32_andnps (v4sf, v4sf)
v4sf __builtin_ia32_orps (v4sf, v4sf)
v4sf __builtin_ia32_xorps (v4sf, v4sf)
v4sf __builtin_ia32_movss (v4sf, v4sf)
v4sf __builtin_ia32_movhlps (v4sf, v4sf)
v4sf __builtin_ia32_movlhps (v4sf, v4sf)
v4sf __builtin_ia32_unpckhps (v4sf, v4sf)
v4sf __builtin_ia32_unpcklps (v4sf, v4sf)
v4sf __builtin_ia32_cvtpi2ps (v4sf, v2si)
v4sf __builtin_ia32_cvtsi2ss (v4sf, int)
v2si __builtin_ia32_cvtps2pi (v4sf)
int __builtin_ia32_cvtss2si (v4sf)
v2si __builtin_ia32_cvttps2pi (v4sf)
int __builtin_ia32_cvttss2si (v4sf)
v4sf __builtin_ia32_rcpps (v4sf)
v4sf __builtin_ia32_rsqrtps (v4sf)
v4sf __builtin_ia32_sqrtps (v4sf)
v4sf __builtin_ia32_rcpss (v4sf)
v4sf __builtin_ia32_rsqrtss (v4sf)
v4sf __builtin_ia32_sqrtss (v4sf)
v4sf __builtin_ia32_shufps (v4sf, v4sf, int)
void __builtin_ia32_movntps (float *, v4sf)
int __builtin_ia32_movmskps (v4sf)
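The comparison and mask built-ins are often used together. The sketch below goes through the `xmmintrin.h` wrappers `_mm_cmplt_ps` and `_mm_movemask_ps`, which correspond to the CMPLTPS and MOVMSKPS instructions, to find which lanes of one vector are smaller than the matching lanes of another. The function name `lanes_less_than` is hypothetical, and SSE is assumed to be enabled.

```c
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_cmplt_ps, _mm_movemask_ps */

/* Returns a 4-bit mask: bit i is set when a[i] < b[i]. */
int lanes_less_than(const float a[4], const float b[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 lt = _mm_cmplt_ps(va, vb);  /* lane is all ones where a < b, else zero */

    return _mm_movemask_ps(lt);        /* gather the top bit of each lane */
}
```

For a = {1, 5, 3, 7} and b = {2, 4, 4, 6}, lanes 0 and 2 satisfy the comparison, so the function returns 5 (binary 0101).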

Similarly, the following x86 intrinsics become available when '-msse3' is used:

v2df __builtin_ia32_addsubpd (v2df, v2df)
v4sf __builtin_ia32_addsubps (v4sf, v4sf)
v2df __builtin_ia32_haddpd (v2df, v2df)
v4sf __builtin_ia32_haddps (v4sf, v4sf)
v2df __builtin_ia32_hsubpd (v2df, v2df)
v4sf __builtin_ia32_hsubps (v4sf, v4sf)
v16qi __builtin_ia32_lddqu (char const *)
void __builtin_ia32_monitor (void *, unsigned int, unsigned int)
v2df __builtin_ia32_movddup (v2df)
v4sf __builtin_ia32_movshdup (v4sf)
v4sf __builtin_ia32_movsldup (v4sf)
void __builtin_ia32_mwait (unsigned int, unsigned int)
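As an example of the SSE3 group, the sketch below sums the four lanes of a vector with two horizontal adds, using the `pmmintrin.h` wrapper `_mm_hadd_ps` for HADDPS. The function name `horizontal_sum` is hypothetical. The `target("sse3")` attribute is an addition made here so that the function compiles even without `-msse3` on the command line; with that option the attribute is unnecessary.

```c
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps (also pulls in the SSE/SSE2 headers) */

/* Returns v[0] + v[1] + v[2] + v[3] using two horizontal adds. */
__attribute__((target("sse3")))
float horizontal_sum(const float v[4])
{
    __m128 x = _mm_loadu_ps(v);

    x = _mm_hadd_ps(x, x);   /* (v0+v1, v2+v3, v0+v1, v2+v3) */
    x = _mm_hadd_ps(x, x);   /* every lane now holds v0+v1+v2+v3 */

    return _mm_cvtss_f32(x); /* extract the lowest lane */
}
```

Note that the additions are grouped as (v0+v1)+(v2+v3), so for general floating-point inputs the result can differ in the last bits from a left-to-right scalar sum.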

Making these intrinsics available through GCC's command-line options can help speed up our code. However, vectorized code can still turn out slower than its scalar counterpart, so it is essential to test thoroughly and to compare speed, accuracy and reliability, while also watching for compiler bugs and optimization issues.
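Accuracy is worth checking because some SIMD instructions trade precision for speed. RCPPS, for instance, returns only an approximate reciprocal; Intel documents a maximum relative error of 1.5 × 2^-12. The sketch below compares it against exact division; the function name `max_rcp_relative_error` is hypothetical, and SSE is assumed to be enabled.

```c
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_rcp_ps, _mm_storeu_ps */

/* Largest relative error of the fast reciprocal (RCPPS) compared with
   exact scalar division, over four non-zero inputs. */
float max_rcp_relative_error(const float x[4])
{
    float approx[4];
    float worst = 0.0f;
    int i;

    _mm_storeu_ps(approx, _mm_rcp_ps(_mm_loadu_ps(x)));

    for (i = 0; i < 4; ++i) {
        float exact = 1.0f / x[i];               /* scalar reference value */
        float err = (approx[i] - exact) / exact; /* signed relative error */
        if (err < 0.0f)
            err = -err;
        if (err > worst)
            worst = err;
    }
    return worst;
}
```

A check like this, run over representative inputs, shows whether the approximation is acceptable for a given application before the faster instruction is adopted.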

Some Simple Examples

  1. Let's say we wish to multiply four pairs of 32-bit single-precision floating-point values. If we used scalar code to do so, we would have to perform the multiplication operation four times. This leaves a lot of time on the table, since we can accomplish the same task much quicker with the help of intrinsics.

We first make the SSE instructions available to the compiler:
-mfpmath=sse -msse

Then, the following intrinsic becomes available:
v4sf __builtin_ia32_mulps (v4sf, v4sf)

This particular intrinsic generates the machine instruction 'MULPS', which allows for the SIMD multiplication of four pairs of single-precision floating-point (32-bit) values at once.
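A minimal sketch of this example, written with the `xmmintrin.h` wrapper `_mm_mul_ps` for MULPS rather than the raw built-in, could look like this. The function name `multiply4` is hypothetical, and SSE is assumed to be enabled.

```c
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps */

/* Multiplies four pairs of floats with a single MULPS. */
void multiply4(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    _mm_storeu_ps(out, _mm_mul_ps(va, vb));  /* out[i] = a[i] * b[i] */
}
```

For a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, `out` receives 5, 12, 21 and 32.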

  2. Similarly, let's say that we come across a situation wherein we need to compare four pairs of signed 16-bit integers and keep the smaller value of each pair. Again, doing so using scalar code would imply the repeated use of the same instruction. However, doing so using intrinsics can make this particular operation much quicker.

We first make the SSE instructions available to the compiler:
-mfpmath=sse -msse
OR, we can use a combination of -m3dnow and -march=athlon.

Then, the following intrinsic becomes available:
v4hi __builtin_ia32_pminsw (v4hi, v4hi)

This particular intrinsic generates the machine instruction 'PMINSW', which compares four pairs of signed 16-bit integers (the v4hi operands are 64-bit MMX registers) and produces the minimum of each pair, all at once.
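This example can be sketched with the `xmmintrin.h` wrapper `_mm_min_pi16`, which in GCC's header wraps `__builtin_ia32_pminsw` and operates on four signed 16-bit lanes held in a 64-bit `__m64`. The function name `min4_i16` is hypothetical, and an x86 target with SSE enabled is assumed.

```c
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>  /* SSE: _mm_min_pi16 (PMINSW on 64-bit MMX operands) */

/* Writes min(a[i], b[i]) for four signed 16-bit lanes. */
void min4_i16(const int16_t a[4], const int16_t b[4], int16_t out[4])
{
    __m64 va, vb, vr;

    memcpy(&va, a, sizeof va);   /* pack four 16-bit values into 64 bits */
    memcpy(&vb, b, sizeof vb);
    vr = _mm_min_pi16(va, vb);   /* lane-wise signed minimum */
    memcpy(out, &vr, sizeof vr);

    _mm_empty();                 /* leave MMX state before any x87 code runs */
}
```

For a = {1, -5, 300, -32768} and b = {2, -6, 299, 0}, `out` receives 1, -6, 299 and -32768.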

  3. Finally, let's say that we want to perform the bitwise logical AND of four pairs of 32-bit values. The use of intrinsics can substantially speed up this process, since performing the above operation using scalar code would require multiple iterations.

We first make the SSE instructions available to the compiler:
-mfpmath=sse -msse

Then, the following intrinsic becomes available:
v4sf __builtin_ia32_andps (v4sf, v4sf)

This particular intrinsic generates the machine instruction 'ANDPS', which performs the bitwise logical AND of four pairs of 32-bit values at once.
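A common use of ANDPS is clearing the sign bit of four floats at once, which yields their absolute values. The sketch below does this with the wrapper `_mm_and_ps`. The function name `abs4` is hypothetical, and the mask is built with SSE2 helpers from `emmintrin.h`, so SSE2 is assumed to be enabled (it is by default on x86-64).

```c
#include <emmintrin.h>  /* SSE2 helpers; also pulls in the SSE header */

/* Computes |in[i]| for four floats by masking off each sign bit. */
void abs4(const float in[4], float out[4])
{
    /* Every lane is 0x7FFFFFFF: all bits set except the sign bit. */
    const __m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));

    _mm_storeu_ps(out, _mm_and_ps(_mm_loadu_ps(in), mask));  /* ANDPS */
}
```

For in = {-1.5, 2.0, -0.0, -8.0}, `out` receives 1.5, 2.0, 0.0 and 8.0.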

Conclusion

In this article at OpenGenus, we learned about the GNU Compiler Collection (GCC), the fundamentals of intrinsics, how intrinsics can be used to speed up vector code, and we also took a look at the intrinsics that become available on using different GCC command-line statements, such as '-mmmx', '-msse', '-msse3'. Finally, we looked at a few simple applications of intrinsics which can help speed up specific instructions.

Thanks for reading!