SIMD & SSE Instruction Set

In this article, we will discuss scalar computing (and some of its drawbacks), the need for vector/parallel computing, the fundamental concepts behind single instruction, multiple data (or SIMD) architecture, as well as one of its implementations in modern CPUs, namely: Intel's Streaming SIMD Extensions, or SSE for short.

Scalar Computing (SISD Architecture)

Before parallelization was widely implemented in computer systems around the world, scalar processors were used to solve computational problems. Scalar processors could only process one data element at a time. Only after one instruction was completed could the next one be executed, and this meant that it took a substantial amount of time to process large amounts of data. Thus, even though scalar systems are simpler and some of their functions are often quicker, processing data elements sequentially bogs down the overall speed and efficiency of the processor when we have to work on a large amount of data. This type of processor architecture is known as single instruction, single data (SISD) architecture.

The Intel 486 (or i486) is an example of a scalar processor. It was introduced in 1989.

Vector Computing (SIMD Architecture)

In order to combat these drawbacks, Intel began integrating single instruction, multiple data (or SIMD) vector capabilities into their processors in the late 1990s, and most modern processors now include SIMD units. We are often required to perform identical operations across multiple data elements, and this is where SIMD instructions can substantially accelerate performance: instead of performing the same instruction on each data element one at a time, we can perform the instruction on a group of data elements all at once.

The operands are held in special wide registers, each of which packs several data elements side by side.

An Example That Illustrates The Difference Between SISD & SIMD Instructions

Let us assume that we have three integer arrays: X, Y and Z. We wish to add the elements of X and Y in a loop and store the results in the array Z. In each iteration of the loop, we have to perform two load operations (for the elements of X and Y), an addition operation, and a store operation (in Z). These operations are exactly the same in each and every iteration of the loop. The only thing that changes is the elements that the operations are performed on. This means that we can solve this particular problem much more quickly with the help of SIMD instructions, since we can process multiple elements simultaneously.

With SIMD instructions and 128-bit registers, we can process not one, but four 32-bit integer elements at the same time. A single load instruction can get four elements of the integer array X into a 128-bit SIMD register, and the same goes for the integer array Y. Then, a single add instruction can add the values in both registers. Finally, a single store instruction can store all of these newly obtained values in the integer array Z.

Programming in this way is not as straightforward as programming without accounting for parallelization, but doing so can save us a lot of time if the situation calls for the use of SIMD instructions.
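The loop described above can be sketched in C as follows. This is only an illustration: the function names are our own, and because the original SSE instructions work on floating-point data, the SIMD version assumes the 128-bit integer intrinsics that SSE2 later added (available on every x86-64 compiler target).

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics (assumed available) */
#include <stddef.h>
#include <stdint.h>

/* SISD version: two loads, one add and one store per element. */
void add_arrays_scalar(const int32_t *x, const int32_t *y, int32_t *z, size_t n)
{
    for (size_t i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}

/* SIMD version: four 32-bit elements per iteration in 128-bit registers. */
void add_arrays_simd(const int32_t *x, const int32_t *y, int32_t *z, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i vx = _mm_loadu_si128((const __m128i *)(x + i)); /* load 4 from X */
        __m128i vy = _mm_loadu_si128((const __m128i *)(y + i)); /* load 4 from Y */
        __m128i vz = _mm_add_epi32(vx, vy);                     /* 4 additions   */
        _mm_storeu_si128((__m128i *)(z + i), vz);               /* store 4 to Z  */
    }
    for (; i < n; i++)      /* scalar tail when n is not a multiple of 4 */
        z[i] = x[i] + y[i];
}
```

Both functions produce the same contents in Z; the SIMD version simply needs roughly a quarter of the loop iterations.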

SSE Instruction Set

Intel's Streaming SIMD Extensions (SSE) was a SIMD instruction set extension to the existing 32-bit (x86) CPU architecture. Intel first introduced this technology in 1999 with their Pentium III line of processors, starting with 'Katmai'. It provided SIMD instructions as well as eight 128-bit registers (XMM0 - XMM7), as opposed to the 32-bit registers that were used by traditional scalar processors at the time. Each XMM register could hold four 32-bit single-precision floating-point values, so up to four data elements could be processed simultaneously, which greatly accelerated the supported operations. A new 32-bit control/status register (MXCSR) was also introduced. SSE included 70 new instructions (65 unique mnemonics in 70 encodings) over MMX, which was Intel's previous implementation of SIMD architecture. Advanced imaging, speech recognition and 3-D video are some technologies that benefited from the introduction of SSE.

SSE was effectively an extension to MMX, and because its floating-point instructions used the separate XMM registers, SSE and MMX instructions could be mixed with no penalties to the system's performance.

Let us now take a look at all of the new instructions that were introduced in Intel's Streaming SIMD Extensions technology, as well as one example of each type of instruction. In order to understand these instructions better, we will simplify them to only operate on one 32-bit single-precision floating-point value, and not four, wherever applicable.

Data Transfer Instructions

Intel/AMD Mnemonic	Description
MOVAPS	Move four aligned packed single-precision floating-point values between XMM registers or memory.
MOVHLPS	Move two packed single-precision floating-point values from the high quadword of an XMM register to the low quadword of another XMM register.
MOVHPS	Move two packed single-precision floating-point values to or from the high quadword of an XMM register or memory.
MOVLHPS	Move two packed single-precision floating-point values from the low quadword of an XMM register to the high quadword of another XMM register.
MOVLPS	Move two packed single-precision floating-point values to or from the low quadword of an XMM register or memory.
MOVMSKPS	Extract sign mask from four packed single-precision floating-point values.
MOVSS	Move scalar single-precision floating-point value between XMM registers or memory.
MOVUPS	Move four unaligned packed single-precision floating-point values between XMM registers or memory.

For example:
If we perform the operation 'MOVAPS xmm1, xmm2/m128', four packed single-precision floating-point values are moved from the source operand (xmm2/m128) to the destination operand (xmm1). When the source is a memory location, it must be aligned on a 16-byte boundary; MOVUPS lifts that restriction. The source is left unchanged, so after the move both operands contain the same values.

If our source operand is as follows:

S	1111 1111 1111 1111......1111 (128 bits)

Then, on performing the above operation, our destination operand will now look like this:

D	1111 1111 1111 1111......1111 (128 bits)
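In C, these moves are usually reached through compiler intrinsics rather than written by hand. The sketch below uses function names of our own choosing and shows the aligned and unaligned variants side by side. The instruction named in each comment is what the intrinsic typically maps to; a compiler is free to pick a different encoding.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Copy four floats with aligned moves. Both pointers must be 16-byte
   aligned; with MOVAPS an unaligned address faults at run time. */
void copy4_aligned(const float *src, float *dst)
{
    assert((uintptr_t)src % 16 == 0 && (uintptr_t)dst % 16 == 0);
    __m128 v = _mm_load_ps(src);    /* typically MOVAPS xmm, m128 */
    _mm_store_ps(dst, v);           /* typically MOVAPS m128, xmm */
}

/* Same copy with unaligned moves; any address is accepted. */
void copy4_unaligned(const float *src, float *dst)
{
    __m128 v = _mm_loadu_ps(src);   /* typically MOVUPS xmm, m128 */
    _mm_storeu_ps(dst, v);          /* typically MOVUPS m128, xmm */
}
```

In both cases the source is left untouched and the destination ends up holding the same four values.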

Packed Arithmetic Instructions

Intel/AMD Mnemonic	Description
ADDPS	Add packed single-precision floating-point values.
ADDSS	Add scalar single-precision floating-point values.
DIVPS	Divide packed single-precision floating-point values.
DIVSS	Divide scalar single-precision floating-point values.
MAXPS	Return maximum packed single-precision floating-point values.
MAXSS	Return maximum scalar single-precision floating-point values.
MINPS	Return minimum packed single-precision floating-point values.
MINSS	Return minimum scalar single-precision floating-point values.
MULPS	Multiply packed single-precision floating-point values.
MULSS	Multiply scalar single-precision floating-point values.
RCPPS	Compute reciprocals of packed single-precision floating-point values.
RCPSS	Compute reciprocal of scalar single-precision floating-point values.
RSQRTPS	Compute reciprocals of square roots of packed single-precision floating-point values.
RSQRTSS	Compute reciprocal of square root of scalar single-precision floating-point values.
SQRTPS	Compute square roots of packed single-precision floating-point values.
SQRTSS	Compute square root of scalar single-precision floating-point values.
SUBPS	Subtract packed single-precision floating-point values.
SUBSS	Subtract scalar single-precision floating-point values.

For example:
If we perform the operation 'ADDPS xmm1, xmm2/m128', the four packed single-precision floating-point values of the source operand (xmm2/m128) are added with the values of the destination operand (xmm1), and the obtained packed result is stored in the destination operand.

If our source operand is as follows:

S	0010 1011 0111 0011 1000 0000 1100 0011

And if our destination operand is as follows:

D	0010 1001 1100 1110 0101 1111 0001 1111

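Then, on performing the above operation, each 32-bit pattern is interpreted as an IEEE 754 single-precision value and the two values are added as floating-point numbers (not as integers). With the default round-to-nearest mode, our destination operand will now look like this:

D	0010 1011 1000 0110 1010 0110 0101 0011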
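A small C sketch can be used to check such bit patterns on a real machine. The helper names are hypothetical; the code copies each 32-bit pattern into a float without converting it, performs the packed addition, and returns the bits of the first result element.

```c
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>

/* Reinterpret a 32-bit pattern as a float, and back, without changing bits. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

static uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* ADDPS on two registers whose four elements all hold the given patterns;
   returns the pattern found in the lowest element of the result. */
uint32_t addps_lane_bits(uint32_t dst_bits, uint32_t src_bits)
{
    __m128 d = _mm_set1_ps(bits_to_float(dst_bits));
    __m128 s = _mm_set1_ps(bits_to_float(src_bits));
    d = _mm_add_ps(d, s);                    /* ADDPS: four float additions */
    return float_to_bits(_mm_cvtss_f32(d));  /* read back element 0 */
}
```

With the operands from the example, `addps_lane_bits(0x29CE5F1F, 0x2B7380C3)` returns 0x2B86A653 under the default rounding mode, whereas plain integer addition of the same two patterns would give 0x5541DFE2.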

Compare Instructions

Intel/AMD Mnemonic	Description
CMPPS	Compare packed single-precision floating-point values.
CMPSS	Compare scalar single-precision floating-point values.
COMISS	Perform ordered comparison of scalar single-precision floating-point values and set flags in EFLAGS register.
UCOMISS	Perform unordered comparison of scalar single-precision floating-point values and set flags in EFLAGS register.

For example:
If we perform the operation 'CMPPS xmm1, xmm2/m128, imm8', the packed single-precision floating-point values in the destination operand (xmm1) and the source operand (xmm2/m128) are compared using imm8 as the comparison predicate. This comparison predicate specifies what kind of comparison will take place between the two operands. If the comparison is true for a pair of values, the corresponding result is a doubleword mask of all 1s, and if it is false, the result is a doubleword mask of all 0s. The results are written to the destination operand.

If our comparison predicate is 0 (which selects the 'equal' comparison), and our source register is as follows:

S	0110 1011 1111 0011 1010 0000 1100 0011

And if our destination operand is as follows:

D	0110 1011 1111 0011 1010 0000 1100 0011

Then, on performing the above operation, our destination operand will now look like this, since the comparison between the two values will produce the result 'TRUE':

D	1111 1111 1111 1111 1111 1111 1111 1111
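The all-1s and all-0s masks are rarely inspected directly. A common pattern, sketched below with a function name of our own, is to follow the comparison with MOVMSKPS (from the data transfer table) to condense the four masks into four bits of an ordinary integer.

```c
#include <xmmintrin.h>

/* Compare four elements for equality (CMPPS with predicate 0) and condense
   the per-element masks into a 4-bit integer (MOVMSKPS).
   Bit i of the result is 1 when element i of a equals element i of b. */
int equal_lanes_mask(__m128 a, __m128 b)
{
    __m128 mask = _mm_cmpeq_ps(a, b);  /* each element: all 1s or all 0s */
    return _mm_movemask_ps(mask);      /* collect the top bit of each element */
}
```

For instance, comparing the elements (1, 2, 3, 4) with (1, 0, 3, 0), listed from lowest to highest, yields 5 (binary 0101), since only elements 0 and 2 match.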

Logical Instructions

Intel/AMD Mnemonic	Description
ANDNPS	Perform bitwise logical AND NOT of packed single-precision floating-point values.
ANDPS	Perform bitwise logical AND of packed single-precision floating-point values.
ORPS	Perform bitwise logical OR of packed single-precision floating-point values.
XORPS	Perform bitwise logical XOR of packed single-precision floating-point values.

For example:
If we perform the operation 'ANDPS xmm1, xmm2/m128', the four packed single-precision floating-point values of the source operand (xmm2/m128) are AND-ed with the values of the destination operand (xmm1), and the obtained result is stored in the destination operand.

If our source operand is as follows:

S	0010 1011 0111 0011 1000 0000 1100 0011

And if our destination operand is as follows:

D	0010 1001 1100 1110 0101 1111 0001 1111

Then, on performing the above operation, our destination operand will now look like:

D	0010 1001 0100 0010 0000 0000 0000 0011
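One practical use of ANDPS is to clear the sign bit of every element, which gives the absolute value of four floats at once. The following is a minimal sketch of that idea; the function name and the way the mask is built are our own choices, not something prescribed by SSE.

```c
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>

/* Absolute value of four floats: AND each element with 0x7FFFFFFF, which
   clears the IEEE 754 sign bit and leaves exponent and mantissa intact. */
__m128 abs_ps(__m128 v)
{
    const uint32_t bits[4] = {0x7FFFFFFFu, 0x7FFFFFFFu, 0x7FFFFFFFu, 0x7FFFFFFFu};
    float mask_lanes[4];
    memcpy(mask_lanes, bits, sizeof mask_lanes);  /* reinterpret, do not convert */
    __m128 mask = _mm_loadu_ps(mask_lanes);
    return _mm_and_ps(v, mask);                   /* ANDPS */
}
```

Because the operation is purely bitwise, it does not matter to the processor that the operands are nominally floating-point values.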

Shuffle & Unpack Instructions

Intel/AMD Mnemonic	Description
SHUFPS	Shuffles values in packed single-precision floating-point operands.
UNPCKHPS	Unpacks and interleaves the two high-order values from two single-precision floating-point operands.
UNPCKLPS	Unpacks and interleaves the two low-order values from two single-precision floating-point operands.

For example:
If we perform the operation 'SHUFPS xmm1, xmm2/m128, imm8', two of the four packed single-precision floating-point values in the destination operand (xmm1) are moved into the low quadword of the destination operand, and two of the four values in the source operand (xmm2/m128) are moved into the high quadword of the destination operand. The select operand (imm8) holds four 2-bit fields, one per result position, and each field decides which value is moved to that position.

Thus, if our select operand is '0000 0000', and our source register is as follows:

S	0000 0000 0000.............0000 (128 bits)

And if our destination operand is as follows:

D	1111 1111 1111.............1111 (128 bits)

Then, on performing the above operation, our destination operand will now look like:

D	0000 0000......0000 (64 bits) 1111 1111......1111 (64 bits)
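The same shuffle can be tried out from C. In this sketch the function names are ours; `_MM_SHUFFLE` is the standard helper macro that packs four 2-bit selectors into an imm8, listed from the highest result position to the lowest.

```c
#include <xmmintrin.h>

/* SHUFPS with imm8 = 0: every 2-bit selector is 0, so the two low result
   elements take element 0 of dst and the two high ones take element 0 of src. */
__m128 shuffle_select_zero(__m128 dst, __m128 src)
{
    return _mm_shuffle_ps(dst, src, _MM_SHUFFLE(0, 0, 0, 0));
}

/* A different selector: reverse the four elements of one register. */
__m128 reverse_lanes(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}
```

With dst = (1, 2, 3, 4) and src = (5, 6, 7, 8), listed from lowest to highest element, `shuffle_select_zero` gives (1, 1, 5, 5), and `reverse_lanes` applied to dst gives (4, 3, 2, 1).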

Conversion Instructions

Intel/AMD Mnemonic	Description
CVTPI2PS	Convert packed doubleword integers to packed single-precision floating-point values.
CVTPS2PI	Convert packed single-precision floating-point values to packed doubleword integers.
CVTSI2SS	Convert doubleword integer to scalar single-precision floating-point value.
CVTSS2SI	Convert scalar single-precision floating-point value to a doubleword integer.
CVTTPS2PI	Convert with truncation packed single-precision floating-point values to packed doubleword integers.
CVTTSS2SI	Convert with truncation scalar single-precision floating-point value to scalar doubleword integer.

For example:
If we perform the operation 'CVTPI2PS xmm, mm/m64', two packed signed doubleword integers in the source operand (mm/m64) are converted to two packed single-precision floating-point values in the destination operand (xmm). The obtained results are stored in the low quadword of the destination operand while the high quadword remains unchanged.

If our source operand is as follows:

S	1111 1111 1111.............1111 (64 bits)

And if our destination operand is as follows:

D	0000 0000 0000.............0000 (128 bits)

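Then, on performing the above operation, each source doubleword (all 1s, i.e. the signed integer -1) is converted to the single-precision value -1.0, whose bit pattern is 1011 1111 1000 0000 0000 0000 0000 0000. Our destination operand will now look like this:

D	0000 0000......0000 (64 bits, high quadword remains unchanged) 1011 1111 1000 0000......0000 (32 bits) 1011 1111 1000 0000......0000 (32 bits)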
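The conversions can be explored with the scalar instructions from the table, which avoid the MMX registers that CVTPI2PS needs. The sketch below uses hypothetical function names and shows an integer-to-float conversion together with the difference between the rounding and the truncating float-to-integer conversions.

```c
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>

/* CVTSI2SS: convert a signed 32-bit integer to float; return the float's bits. */
uint32_t int_to_float_bits(int32_t value)
{
    float f = _mm_cvtss_f32(_mm_cvtsi32_ss(_mm_setzero_ps(), value));
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* CVTSS2SI: round using the current MXCSR rounding mode
   (round to nearest, ties to even, unless the program changed it). */
int32_t float_to_int_rounded(float f)
{
    return _mm_cvtss_si32(_mm_set_ss(f));
}

/* CVTTSS2SI: always truncate toward zero, regardless of MXCSR. */
int32_t float_to_int_truncated(float f)
{
    return _mm_cvttss_si32(_mm_set_ss(f));
}
```

Converting the integer -1 yields the pattern 0xBF800000, which is -1.0 in single precision. With the default rounding mode, 2.7 converts to 3 with CVTSS2SI but to 2 with CVTTSS2SI.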

MXCSR Status/Control Instructions

Intel/AMD Mnemonic	Description
LDMXCSR	Load the MXCSR register from memory.
STMXCSR	Store the MXCSR register state to memory.

These instructions are simply used to load the MXCSR register from m32, and to store the contents of the MXCSR register to m32.
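A typical reason to touch MXCSR is to change the rounding mode temporarily. The sketch below is one way this might look in C; the function name is ours, and the `volatile` qualifiers are a precaution we add so that an optimizing compiler does not perform the conversion outside the region where the mode is changed.

```c
#include <xmmintrin.h>

/* Convert f to an integer under a chosen MXCSR rounding mode, then restore
   the previous register contents. `mode` is one of _MM_ROUND_NEAREST,
   _MM_ROUND_DOWN, _MM_ROUND_UP or _MM_ROUND_TOWARD_ZERO. */
int convert_with_rounding(float f, unsigned int mode)
{
    volatile float in = f;   /* volatile: keep the conversion between */
    volatile int out;        /* the two MXCSR writes below            */

    unsigned int saved = _mm_getcsr();                            /* STMXCSR */
    _mm_setcsr((saved & ~(unsigned int)_MM_ROUND_MASK) | mode);   /* LDMXCSR */
    out = _mm_cvtss_si32(_mm_set_ss(in));     /* CVTSS2SI honours the mode */
    _mm_setcsr(saved);                        /* LDMXCSR: restore */
    return out;
}
```

For example, 2.7 converts to 3 under `_MM_ROUND_NEAREST` and to 2 under `_MM_ROUND_TOWARD_ZERO`.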

64-bit SIMD Integer Instructions

Intel/AMD Mnemonic	Description
PAVGB	Compute average of packed unsigned byte integers.
PAVGW	Compute average of packed unsigned word integers.
PEXTRW	Extract word.
PINSRW	Insert word.
PMAXSW	Maximum of packed signed word integers.
PMAXUB	Maximum of packed unsigned byte integers.
PMINSW	Minimum of packed signed word integers.
PMINUB	Minimum of packed unsigned byte integers.
PMOVMSKB	Move byte mask.
PMULHUW	Multiply packed unsigned integers and store high result.
PSADBW	Compute sum of absolute differences.
PSHUFW	Shuffle packed integer word in MMX register.

For example:
If we perform the operation 'PMINSW mm1, mm2/m64', the packed signed word integers in the destination operand (mm1) are compared with those in the source operand (mm2/m64), and the minimum value for each pair of word integers is returned to the destination operand. (SSE2 later added a form of the same instruction that works on 128-bit XMM registers.)

If our source operand is as follows:

S	1010 1010 1111 1111 1010 1010 1111 1111

And if our destination operand is as follows:

D	1010 0101 0101 1010 1010 0101 0101 1010

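Then, on performing the above instruction, each 16-bit word is compared as a signed integer. Both operands hold negative words (1010 1010 1111 1111 is -21761 and 1010 0101 0101 1010 is -23206), so the smaller destination word is kept in each position and our destination operand will look like this:

D	1010 0101 0101 1010 1010 0101 0101 1010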
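The signed comparison can be checked with a short C sketch. As an assumption made for portability, it uses the 128-bit XMM form of PMINSW that SSE2 introduced, since MMX intrinsics are not available on every 64-bit compiler; the per-word behaviour is the same. The function name is hypothetical.

```c
#include <emmintrin.h>  /* SSE2: 128-bit form of PMINSW (assumed available) */
#include <stdint.h>

/* PMINSW on registers whose words all hold the given 16-bit patterns;
   returns the pattern found in word 0 of the result. */
uint16_t pminsw_lane(uint16_t dst_word, uint16_t src_word)
{
    __m128i d = _mm_set1_epi16((short)dst_word);
    __m128i s = _mm_set1_epi16((short)src_word);
    d = _mm_min_epi16(d, s);                   /* PMINSW: signed minimum per word */
    return (uint16_t)_mm_extract_epi16(d, 0);  /* PEXTRW: read word 0 */
}
```

With the words from the example, `pminsw_lane(0xA55A, 0xAAFF)` returns 0xA55A. Because the comparison is signed, `pminsw_lane(0x0001, 0xFFFF)` returns 0xFFFF, which is -1.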

Miscellaneous Instructions

Intel/AMD Mnemonic	Description
MASKMOVQ	Non-temporal store of selected bytes from an MMX register into memory.
MOVNTPS	Non-temporal store of four packed single-precision floating-point values from an XMM register into memory.
MOVNTQ	Non-temporal store of quadword from an MMX register into memory.
PREFETCHNTA	Prefetch data into non-temporal cache structure and into a location close to the processor.
PREFETCHT0	Prefetch data into all levels of the cache hierarchy.
PREFETCHT1	Prefetch data into level 2 cache and higher.
PREFETCHT2	Prefetch data into level 3 cache and higher (or an implementation-specific choice).
SFENCE	Serialize store operations.

For example:
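If we perform the operation 'MASKMOVQ mm1, mm2', selected bytes from the source operand (mm1) are stored into a 64-bit memory location, whose address is taken implicitly from the DS:EDI registers. The mask operand (mm2) selects which bytes from the source operand are written into the memory: a byte is written only if the most significant bit of the corresponding mask byte is 1.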

If our source operand is as follows:

S	1111 1111 1111.............1111 (64 bits)

And if our mask operand is as follows (where the bit with index 7 is 1):

M	0000 0000 0000.............0000 (56 bits) 1000 0000

Then, the bits from indices 0-7 of the source operand (its lowest byte) are written into the memory. Let us assume that, prior to us performing the above operation, the memory location in question contains all 0s. After we perform said operation, the memory location changes to this:

Mem	0000 0000...... 0000 (56 bits) 1111 1111

The rest of the bits in the memory remain unchanged, as bit 7 is the most significant bit of the lowest mask byte and is the only such bit set to 1; the most significant bits of the other seven mask bytes are 0.
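The other instructions in this table are also reachable from C. The following sketch, with function names and a prefetch distance that are purely illustrative, fills a buffer with MOVNTPS followed by SFENCE, and sums an array while issuing PREFETCHT0 hints. Whether either technique actually helps depends on the workload and the processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Fill dst[0..n) with value using non-temporal stores, which try to avoid
   polluting the cache. dst must be 16-byte aligned and n a multiple of 4. */
void fill_streaming(float *dst, size_t n, float value)
{
    assert((uintptr_t)dst % 16 == 0 && n % 4 == 0);
    __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, v);   /* MOVNTPS */
    _mm_sfence();                    /* SFENCE: order these stores before
                                        any store that follows */
}

/* Sum an array while hinting that data further ahead will be needed soon.
   The distance of 64 floats is an arbitrary illustrative value. */
float sum_with_prefetch(const float *src, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i % 16 == 0 && i + 64 < n)
            _mm_prefetch((const char *)(src + i + 64), _MM_HINT_T0); /* PREFETCHT0 */
        total += src[i];
    }
    return total;
}
```

A prefetch is only a hint: it never changes the result of the program, only (possibly) how long the later loads take.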

Future Iterations

SSE was incrementally improved over the years.

SSE2 added the double-precision floating-point format (64-bit) for all SSE operations, and also allowed for MMX integer operations to take place in 128-bit XMM registers, as opposed to the 64-bit MMX registers that were used in the first iteration of SSE.
SSE3 brought about less noticeable changes. It introduced a few new thread management instructions and allowed for 'horizontal' computing. This meant that two numbers that were stored in the same register could be added or subtracted.
SSSE3 was an upgrade to SSE3, and introduced 16 new permutation, addition and accumulation instructions.
SSE4 was a major upgrade to SSSE3, and introduced a dot product instruction, new integer instructions, a population count instruction, and more.
AVX, AVX2 and AVX-512 widened the registers to 256 and then 512 bits, and AVX-512 also increased the number of registers. This allowed for significantly accelerated performance when dealing with computationally intensive workloads. Many machine learning tasks benefited from the introduction of these instruction set extensions.

Conclusion

In this article at OpenGenus, we learnt about scalar processing (SISD), vector processing (SIMD) and Intel's Streaming SIMD Extensions (SSE) instruction set, which is an implementation of SIMD architecture in modern CPUs. We also took a look at all of the new instructions that were introduced with SSE, and discussed the improvements that were made to SSE technology in future iterations.

Thanks for reading!