Assembly language: ARM Architecture


In this article we explore the assembly language for the ARM RISC computer architecture.

Table of contents.

  1. Introduction to assembly language.
  2. The ARM assembly language.
  3. Summary.
  4. References.

Introduction to assembly language.

Assembly languages are processor specific and are fundamental to compiler design.
In this article we shall use the gcc compiler and assembler for our examples.

Hello World

#include<stdio.h>

int main(int argc, char *argv[]){
    printf("hello %s\n", "world");
    return 0;
}

Compilation

gcc -S test.c -o test.s

#view the compiled assembly code
cat test.s

Output

        .file   "test.c"
        .text
        .section        .rodata
.LC0:
        .string "world"
.LC1:
        .string "hello %s\n"
        .text
        .globl  main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        subq    $16, %rsp
        movl    %edi, -4(%rbp)
        movq    %rsi, -16(%rbp)
        leaq    .LC0(%rip), %rsi
        leaq    .LC1(%rip), %rdi
        movl    $0, %eax
        call    printf@PLT
        movl    $0, %eax
        leave
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (Debian 10.2.1-6) 10.2.1 20210110"
        .section        .note.GNU-stack,"",@progbits

The listing above was generated on an x86-64 machine, so it shows x86-64 instructions. The output of your compiler may be different.

Assembly code elements.

Regardless of the CPU architecture, assembly code will have the following elements:

  1. Directives.
    They begin with a . (dot) and are used to indicate structural information that is useful to the assembler, linker or debugger.
    .data indicates the start of the data segment.
    .text indicates the start of the program segment.
    .string indicates a string constant within the data section.
    .globl main indicates that the label main is a global symbol that can be accessed by other code modules.

  2. Labels
    These end with a colon and, by their position, indicate the relationship between names and locations.
    For example:
    The label .LC0: indicates that the following string should be called .LC0.
    The label main: indicates that the instruction PUSHQ %rbp is the first instruction of the main function.
    Labels beginning with a . (dot) are temporary local labels generated by the compiler. Labels do not become part of the machine code themselves, but they are present in the resulting object code for the purposes of linking and in the executable file for the purpose of debugging.

  3. Instructions
    These are the actual operations, like PUSHQ %rbp, and are indented for visual distinction from directives and labels.
    Note that instructions in GNU assembly are not case sensitive. This article writes them in uppercase in the text for consistency.
    We can take the assembly code test.s and assemble it into a runnable program.

Compiling assembly to an executable

gcc test.s -o test
#run executable
./test

Output

hello world

Compiling to object code.

gcc test.s -c -o test.o

We use the nm utility to display the symbols (names) present in the object code.

nm test.o

Output

                 U _GLOBAL_OFFSET_TABLE_
0000000000000000 T main
                 U printf

The above information from the object code is available to the linker.
main is present in the text (T) section of the object at location 0.
printf is undefined (U) since it will be obtained from the standard library.
.LC0 does not appear because it is a temporary local label and was not declared with .globl.

The ARM Assembly language.

ARM is a family of CPUs based on the RISC architecture.
RISC processors implement a smaller, simpler set of instructions, which lets them execute more instructions per second (commonly measured in MIPS, millions of instructions per second) by removing rarely needed instructions and optimizing pathways.
Compared to CISC architectures, they typically deliver competitive performance at a fraction of the power.

Registers and data types.

ARM-32 has 16 general purpose registers from r0-r15 with the following conventions for use.

r0 - r10 are general purpose.
r11 - Frame pointer(fp)
r12 - Intra-Procedure-Call Scratch Register(ip)
r13 - Stack pointer(sp)
r14 - Link Register(lr), holds the return address
r15 - Program Counter(pc)

ARM also has 2 additional registers that cannot be accessed directly: the Current Program Status Register(CPSR) and the Saved Program Status Register(SPSR), which hold the results of comparison operations and privileged data regarding the process state.
These are set as side effects of some operations.

ARM suffixes for data sizes.

Data type         Suffix  Size
Byte              B       8 bits
Halfword          H       16 bits
Word              W       32 bits
Double Word       -       64 bits
Signed Byte       SB      8 bits
Signed Halfword   SH      16 bits
Signed Word       SW      32 bits

There is no register naming structure for anything below a word.
The signed types are used to provide the appropriate sign-extension when loading a small data type into a larger register.
If no suffix is given the assembler will assume an unsigned word operand.
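
The difference between the unsigned and signed suffixes can be sketched in C. The helper names below (load_byte_unsigned, load_byte_signed) are hypothetical and are not part of any ARM toolchain; they only model what a byte load with the B or SB suffix would leave in a 32-bit register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of a byte load with the B suffix:
   the upper 24 bits of the register are filled with zeros. */
uint32_t load_byte_unsigned(const uint8_t *addr) {
    assert(addr != NULL);
    return (uint32_t)*addr;
}

/* Model of a byte load with the SB suffix:
   bit 7 of the byte is copied into the upper 24 bits. */
uint32_t load_byte_signed(const uint8_t *addr) {
    assert(addr != NULL);
    int32_t value = *addr;      /* 0..255 */
    if (value & 0x80) {
        value -= 0x100;         /* reinterpret as -128..-1 */
    }
    return (uint32_t)value;     /* two's complement bit pattern */
}
```

For the byte 0xF0, the unsigned model yields 0x000000F0 while the signed model yields 0xFFFFFFF0, which is the same value (-16) widened to 32 bits.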

Moving data involves two classes of instructions:
MOV, which copies data and constants between registers
LDR (load) and STR (store), which move data between registers and memory

MOV
Moves a known immediate value into a register, or copies one register into another.
Immediate values are denoted by # and must be 16 bits or less, otherwise LDR is used.

In ARM instructions the destination register is on the left and the source on the right, with the exception of STR.

Mode Example
Immediate MOV r0, #3
Register MOV r1, r0

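A mnemonic suffix for the data type is appended to the load and store instructions (for example LDRB or STRH) so that we know what is being transferred and how it is done.

LDR and STR move values between registers and memory. LDR loads from the given address into the register named first. STR is the exception to the usual operand order: the register named first is the source and the address is the destination.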

In the simplest case,

LDR Rd, [Ra]
STR Rs, [Ra]

Rd denotes the destination register.
Rs denotes the source register.
Ra denotes the register containing the address

ARM addressing modes.

Address Mode                            Example
Literal                                 LDR Rd, =0xABCD1234
Absolute Address                        LDR Rd, =label
Register Indirect                       LDR Rd, [Ra]
Pre-indexing - Immediate                LDR Rd, [Ra, #4]
Pre-indexing - Register                 LDR Rd, [Ra, Ro]
Pre-indexing - Immediate & Writeback    LDR Rd, [Ra, #4]!
Pre-indexing - Register & Writeback     LDR Rd, [Ra, Ro]!
Post-indexing - Immediate               LDR Rd, [Ra], #4
Post-indexing - Register                LDR Rd, [Ra], Ro

As can be seen, LDR can load a full 32-bit literal into a register. However, unlike the x86 architecture, there is no single instruction that loads a value from an absolute memory address.

To do this in ARM we first load the address into a register and perform a register-indirect load as shown below.

LDR r1, =x
LDR r2, [r1]

Pre-indexing modes add a constant or a register to a base register and load from the computed address.

An example
LDR r1, [r2, #4]   @ Load from address r2 + 4
LDR r1, [r2, r3]   @ Load from address r2 + r3

For writing back to the base register, a ! (bang) character is used to indicate that the computed address should be saved to the base register after the value is loaded.

An example
LDR r1, [r2, #4]!  @ Load from r2 + 4 then r2 += 4
LDR r1, [r2, r3]!  @ Load from r2 + r3 then r2 += r3

Post-indexing performs the above operations in reverse order. It first loads from the address in the base register, then the base register is incremented.

LDR r1, [r2], #4   @ Load from r2 then r2 += 4
LDR r1, [r2], r3   @ Load from r2 then r2 += r3

Pre-indexing and post-indexing modes enable single-instruction implementation for operations such as b = a++
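
As a rough illustration, the two write-back modes can be modelled in C with a pointer that plays the role of the base register. The function names below are hypothetical, and the step is fixed at one 32-bit word (4 bytes) to mirror the #4 examples above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pre-indexing with write-back, like LDR Rd, [Ra, #4]!:
   advance the base by one word first, then load from the new address. */
int32_t load_pre_indexed(const int32_t **base) {
    assert(base != NULL && *base != NULL);
    *base += 1;                 /* one int32_t is 4 bytes */
    return **base;
}

/* Post-indexing, like LDR Rd, [Ra], #4:
   load from the current address first, then advance the base by one word. */
int32_t load_post_indexed(const int32_t **base) {
    assert(base != NULL && *base != NULL);
    int32_t value = **base;
    *base += 1;
    return value;
}
```

The post-indexed form corresponds to the C idiom b = *p++ on a pointer p, which is why a compiler can often emit it as a single load instruction.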

Large literals are stored in a literal pool, a small region of data inside the code section of the program. This is necessary because every ARM instruction must fit in a 32-bit word, which leaves no room for a full 32-bit literal.
The literal is loaded from the pool with a PC-relative load instruction, so the pool must lie within ±4096 bytes of the loading instruction.

An = before a large literal or a label tells the assembler to place the value in a literal pool and emit the corresponding PC-relative load instead.

An example:

LDR r1, =x
LDR r2, [r1]

The above instructions load the address of x into r1 and the value of x into r2.
They will be expanded into

LDR r1, .L1
LDR r2, [r1]
B   .end
.L1:
    .word x
.end:

That is, load the address of x from the adjacent literal pool, then load the value of x.

Basic Arithmetic.

ARM provides three-address arithmetic instructions on registers.
ADD and SUB specify the result register as the first argument and compute on the second and third arguments.
The third operand may be an 8-bit constant or a register with an optional shift applied.
Carry-in variants add the C bit of the CPSR to the result.
All take an optional S suffix which sets the condition flags on completion.

Instruction Example
Add ADD Rd, Rm, Rn
Add with carry-in ADC Rd, Rm, Rn
Subtract SUB Rd, Rm, Rn
Subtract with carry-in SBC Rd, Rm, Rn
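
A common use of the carry-in variant is adding numbers wider than one register. The sketch below models a 64-bit addition built from two 32-bit additions; the function name add64_with_carry is hypothetical, and the carry variable stands in for the C bit of the CPSR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add two 64-bit values held as (hi, lo) pairs of 32-bit words.
   The low-word addition plays the role of ADDS (it produces the carry),
   the high-word addition plays the role of ADC (it consumes the carry). */
void add64_with_carry(uint32_t a_hi, uint32_t a_lo,
                      uint32_t b_hi, uint32_t b_lo,
                      uint32_t *sum_hi, uint32_t *sum_lo) {
    assert(sum_hi != NULL && sum_lo != NULL);
    uint32_t lo = a_lo + b_lo;                 /* wraps modulo 2^32 */
    uint32_t carry = (lo < a_lo) ? 1u : 0u;    /* models the C flag */
    *sum_lo = lo;
    *sum_hi = a_hi + b_hi + carry;
}
```

Adding 1 to the pair (0, 0xFFFFFFFF) wraps the low word to 0 and carries into the high word, giving (1, 0).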

For multiplication the process is the same, except that multiplying two 32-bit numbers can yield a 64-bit result.
The MUL instruction discards the high bits of the result, while UMULL places the 64-bit result in two 32-bit registers.
The signed variant SMULL sign extends the high register.

Instruction Example
Multiplication MUL Rd, Rm, Rn
Unsigned Long Multiplication UMULL RdLo, RdHi, Rm, Rn
Signed Long Multiplication SMULL RdLo, RdHi, Rm, Rn
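
The difference between the three multiplications can be modelled in C with a 64-bit intermediate. The function names below are hypothetical; the two output pointers stand in for the low and high result registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of MUL: only the low 32 bits of the product are kept. */
uint32_t mul_low(uint32_t rm, uint32_t rn) {
    return rm * rn;
}

/* Model of UMULL: the full unsigned 64-bit product is split into two words. */
void umull_model(uint32_t rm, uint32_t rn, uint32_t *rd_lo, uint32_t *rd_hi) {
    assert(rd_lo != NULL && rd_hi != NULL);
    uint64_t product = (uint64_t)rm * (uint64_t)rn;
    *rd_lo = (uint32_t)(product & 0xFFFFFFFFu);
    *rd_hi = (uint32_t)(product >> 32);
}

/* Model of SMULL: the operands are treated as signed, so the high word
   carries the sign of the product. */
void smull_model(int32_t rm, int32_t rn, uint32_t *rd_lo, uint32_t *rd_hi) {
    assert(rd_lo != NULL && rd_hi != NULL);
    int64_t product = (int64_t)rm * (int64_t)rn;
    uint64_t bits = (uint64_t)product;         /* two's complement bit pattern */
    *rd_lo = (uint32_t)(bits & 0xFFFFFFFFu);
    *rd_hi = (uint32_t)(bits >> 32);
}
```

Multiplying 0xFFFFFFFF by 2 shows the split: the low word is 0xFFFFFFFE in every case, the unsigned high word is 1, and the signed high word (for -1 times 2) is 0xFFFFFFFF.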

The base ARM-32 instruction set has no division instruction, since division can't be carried out in a single pipelined cycle. It is instead accomplished by repeated subtraction or, more efficiently, by invoking an external function in the run-time library which computes the quotient.
The function is called __aeabi_idiv.

An example for the division of 14 by 3

MOV r0, #14
MOV r1, #3
BL  __aeabi_idiv

After the division, register r0 will contain the quotient, 4.
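
The repeated-subtraction approach mentioned above can be sketched in C as follows. This is only an illustration for unsigned operands under a hypothetical name; it is not how __aeabi_idiv is implemented, and a real run-time library would use a faster algorithm.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned division by repeated subtraction:
   count how many times the divisor fits into the dividend. */
uint32_t divide_by_subtraction(uint32_t dividend, uint32_t divisor) {
    assert(divisor != 0);       /* division by zero is not defined */
    uint32_t quotient = 0;
    while (dividend >= divisor) {
        dividend -= divisor;
        quotient++;
    }
    return quotient;
}
```

Calling divide_by_subtraction(14, 3) performs four subtractions and returns 4, leaving a remainder of 2 in the local dividend.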

Logical instructions.

These are bitwise-and, bitwise-or, bitwise-exclusive-or and bitwise-bit-clear (the bitwise-and of the first value with the inversion of the second value).
The move-not MVN instruction is used to perform a bitwise-not while moving from register to register.

Instruction Example
bitwise-and AND Rd, Rm, Rn
bitwise-or ORR Rd, Rm, Rn
bitwise-xor EOR Rd, Rm, Rn
bitwise-bit-clear BIC Rd, Rm, Rn
move-not MVN Rd, Rn
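
The two less familiar operations, bit-clear and move-not, map directly onto C operators. The helper names below are hypothetical and only model the result that would be written to Rd.

```c
#include <assert.h>
#include <stdint.h>

/* Model of BIC Rd, Rm, Rn: keep the bits of rm that are NOT set in rn. */
uint32_t bic_model(uint32_t rm, uint32_t rn) {
    return rm & ~rn;
}

/* Model of MVN Rd, Rn: bitwise-not of the source. */
uint32_t mvn_model(uint32_t rn) {
    return ~rn;
}

/* Typical use of the BIC pattern: clear one bit of a value. */
uint32_t clear_bit(uint32_t value, unsigned bit) {
    assert(bit < 32);
    return bic_model(value, UINT32_C(1) << bit);
}
```

For example, bic_model(0xFF, 0x0F) clears the low four bits and gives 0xF0.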

Comparison and branches

The CMP instruction compares two values by subtracting them and sets the condition flags, such as N (negative) and Z (zero), in the CPSR to be read by the following instructions.
For comparing a register and immediate value, the immediate value is the second operand.

An example

CMP Rd, Rn
CMP Rd, #imm
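
To make the flag behaviour concrete, here is a minimal C model of a comparison. The Flags struct and function names are hypothetical; the model also computes the C (carry) and V (overflow) flags, since signed and unsigned conditions depend on them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool n;  /* result is negative (bit 31 set) */
    bool z;  /* result is zero */
    bool c;  /* no borrow occurred (unsigned rd >= rn) */
    bool v;  /* signed overflow occurred */
} Flags;

/* Model of CMP Rd, Rn: compute rd - rn and keep only the flags. */
Flags cmp_model(uint32_t rd, uint32_t rn) {
    uint32_t result = rd - rn;   /* wraps modulo 2^32 */
    Flags f;
    f.n = (result >> 31) != 0;
    f.z = (result == 0);
    f.c = (rd >= rn);
    f.v = (((rd ^ rn) & (rd ^ result)) >> 31) != 0;
    assert(f.z == (rd == rn));   /* Z is set exactly when the operands are equal */
    return f;
}

bool cond_eq(Flags f) { return f.z; }           /* condition used by BEQ */
bool cond_lt(Flags f) { return f.n != f.v; }    /* condition used by BLT (signed) */
bool cond_hi(Flags f) { return f.c && !f.z; }   /* condition used by BHI (unsigned) */
```

The bit pattern 0x80000000 compared with 1 shows why both kinds of condition exist: it is less than 1 when read as a signed number, but higher than 1 when read as unsigned.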

Branch Instructions.

Instruction Meaning
B Branch always
BX Branch and exchange
BEQ Equal
BNE Not equal
BGT greater than
BGE greater than or equal
BLT less than
BLE Less than or equal
BMI Negative
BL Branch and Link
BLX Branch-Link-Exchange
BVS Overflow Set
BVC Overflow Clear
BHI Higher (unsigned >)
BHS Higher or same (unsigned >=)
BLO Lower (unsigned <)
BLS Lower or same (unsigned <=)
BPL Positive or zero

The character S is appended to an arithmetic instruction to make it update the CPSR, e.g. SUBS will subtract, store the result and then update the CPSR.
The reason for this is that conditional branch instructions consult the previously set values of the CPSR and jump to a label if the relevant flags are set.
A branch without conditions is specified with B.

An example of counting from 0 to 5

        MOV r0, #0
loop:   ADD r0, r0, #1
        CMP r0, #5
        BLT loop

Assigning global variable y, y = 10 if x > 0 else y = 20

        LDR r0, =x
        LDR r0, [r0]
        CMP r0, #0
        BGT .L1
.L0:
        MOV r0, #20
        B   .L2
.L1:
        MOV r0, #10
.L2:
        LDR r1, =y
        STR r0, [r1]

The BL (branch-and-link) instruction is used to implement function calls: it sets the link register to the address of the next instruction and jumps to the given label.
The link register is then used as the return address when the function terminates.

The BX instruction branches to the address given in a register and is most often used to return from a function call by branching to the link register.

BLX performs a branch-and-link to the address given in a register and is used to invoke function pointers, virtual methods or other indirect jumps.
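
In C, the kind of call that needs a register-indirect branch-and-link is a call through a function pointer. The sketch below uses hypothetical names; for the call inside apply, an ARM compiler would typically load the target address into a register and use an instruction such as BLX, although an optimizing compiler may inline the call instead.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*unary_fn)(int);

int square_of(int x) { return x * x; }
int negate(int x) { return -x; }

/* The callee is only known at run time, so the call cannot be a
   direct BL to a fixed label. */
int apply(unary_fn fn, int x) {
    assert(fn != NULL);
    return fn(x);
}
```

Both apply(square_of, 4) and apply(negate, 3) go through the same call site with a different target address.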

An example of conditional execution

if(a < b){
    a++;
}else{
    b++;
}

We make each of the two additions conditional upon a previous comparison: the addition whose condition holds is executed and the other is skipped.
Assuming that a and b are held in r0 and r1 respectively, it translates to,

CMP   r0, r1
ADDLT r0, r0, #1
ADDGE r1, r1, #1

The stack.

This is an auxiliary data structure used to record the function call history of a program along with local variables that don't fit in registers.
The stack grows downwards from high addresses to low addresses.
The sp register (stack pointer) keeps track of the bottom-most item on the stack.
To push the r0 register onto the stack, we subtract the size of the register from sp and store r0 to the location pointed to by sp.

SUB sp, sp, #4
STR r0, [sp]

The same can be done in a single instruction by using pre-indexing and write-back.

STR r0, [sp, #-4]!

Pushing to the stack
PUSH does the same and can also move any number of registers to the stack. {} are used to indicate the list of registers.

An example

PUSH {r0, r1, r2}

Popping involves the opposite

LDR r0, [sp]
ADD sp, sp, #4

With a single instruction

LDR r0, [sp], #4

Popping a set of registers

POP {r0, r1, r2}
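
The push and pop sequences above can be modelled in C with an array and an index that plays the role of sp. The Stack type, its size and the function names are assumptions made for this sketch; the index counts words rather than bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS 16

typedef struct {
    uint32_t mem[STACK_WORDS];
    size_t sp;   /* index of the most recently pushed word, STACK_WORDS when empty */
} Stack;

void stack_init(Stack *s) {
    assert(s != NULL);
    s->sp = STACK_WORDS;        /* the stack starts at the high end */
}

/* Like STR r0, [sp, #-4]!: move sp down one word, then store. */
void stack_push(Stack *s, uint32_t value) {
    assert(s != NULL && s->sp > 0);             /* guard against overflow */
    s->sp -= 1;
    s->mem[s->sp] = value;
}

/* Like LDR r0, [sp], #4: load, then move sp up one word. */
uint32_t stack_pop(Stack *s) {
    assert(s != NULL && s->sp < STACK_WORDS);   /* guard against underflow */
    uint32_t value = s->mem[s->sp];
    s->sp += 1;
    return value;
}
```

Pushing 1 and then 2 lowers sp twice; popping returns 2 first and then 1, after which sp is back at its starting position.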

Calling a function.

ARM register assignments.

Register  Purpose               Saver
r0        argument 0 / result   not saved
r1        argument 1            caller saves
r2        argument 2            caller saves
r3        argument 3            caller saves
r4        scratch               callee saves
r5        scratch               callee saves
r6        scratch               callee saves
r7        scratch               callee saves
r8        scratch               callee saves
r9        scratch               callee saves
r10       scratch               callee saves
r11       frame pointer         callee saves
r12       intraprocedure        caller saves
r13       stack pointer         callee saves
r14       link register         caller saves
r15       program counter       saved in link register

ARM calling convention
The first 4 arguments are placed in registers r0, r1, r2 and r3.
Additional arguments are pushed onto the stack in reverse order.
The caller saves r0-r3 and r12 if needed.
The caller must always save the link register (r14).
The callee must save r4-r11 if needed.
The result is placed in r0.

To call a function, we place the desired arguments in registers r0-r3, save the current value of the link register and use the BL instruction to jump to the function.
When the function returns, we restore the previous value of the link register and examine the result stored in register r0.

An example

#include <stdio.h>

int x = 0;
int y = 10;
int main(){
    x = printf("value: %d\n", y);
    return 0;
}

Which translates to

.data
    x:  .word 0
    y:  .word 10
    S0: .ascii "value: %d\012\000"
.text
    main:
        LDR r0, =S0  @ Load address of S0
        LDR r1, =y   @ Load address of y
        LDR r1, [r1] @ Load value of y
        PUSH {ip,lr} @ Save registers
        BL  printf   @ Call printf
        POP  {ip,lr} @ Restore registers
        LDR r1, =x   @ Load address of x
        STR r0, [r1] @ Store return value in x.
        MOV r0, #0   @ Return 0 from main
        BX  lr       @ Return to caller
.end

Defining a leaf function.

A leaf function is a function that computes a value without calling other functions.
Leaf functions are easy to write since function arguments are passed in registers.

An example

square: function integer ( x: integer ) ={
    return x * x;
}

Which translates to

.global square
square:
    MUL  r0, r0, r0   @ multiply argument by itself
    BX   lr           @ return to caller

In the general case, a more complex approach is needed because the above function will not work for a function that wants to invoke other functions since the stack is not set up properly.

Defining a complex function.

A complex function is one that can invoke other functions and compute expressions of arbitrary complexity, then return to the caller with the original state intact.

An example of a function that takes 3 arguments and uses 2 local variables.

func:
    PUSH {fp}        @ save the frame pointer
    MOV  fp, sp      @ set the new frame pointer
    PUSH {r0,r1,r2}  @ save the arguments on the stack
    SUB  sp, sp, #8  @ allocate two more local variables
    PUSH {r4-r10}    @ save callee-saved registers
    
    @@@ body of function goes here @@@
    
    POP  {r4-r10}     @ restore callee saved registers
    MOV  sp, fp       @ reset stack pointer
    POP  {fp}         @ recover previous frame pointer
    BX   lr           @ return to the caller

With this method we ensure that all values held in registers are saved onto the stack so that no data is lost.

The resulting stack layout is similar to the x86 stack.

An example

compute: function integer( a: integer, b: integer, c: integer ) ={
    x:integer = a + b + c;
    y:integer = x * 5;
    return y;
}

Which translates to

.global compute
compute:
@@@@@@@@@@@@@@@@@@ preamble of function sets up stack
PUSH {fp}        @ save the frame pointer
MOV  fp, sp      @ set the new frame pointer
PUSH {r0,r1,r2}  @ save the arguments on the stack
SUB  sp, sp, #8  @ allocate two more local variables
PUSH {r4-r10}    @ save callee-saved registers

@@@@@@@@@@@@@@@@@@@@@@@@ body of function starts here

LDR  r0, [fp,#-12]     @ load argument 0 (a) into r0
LDR  r1, [fp,#-8]      @ load argument 1 (b) into r1
LDR  r2, [fp,#-4]      @ load argument 2 (c) into r2
ADD  r1, r1, r2        @ add the args together
ADD  r0, r0, r1
STR  r0, [fp,#-20]     @ store the result into local 0 (x)
LDR  r0, [fp,#-20]     @ load local 0 (x) into a register
MOV  r1, #5            @ move 5 into a register
MUL  r2, r0, r1        @ multiply both into r2
STR  r2, [fp,#-16]     @ store the result in local 1 (y)
LDR  r0, [fp,#-16]     @ move local 1 (y) into the result

@@@@@@@@@@@@@@@@@@@ epilogue of function restores the stack

POP  {r4-r10}     @ restore callee saved registers
MOV  sp, fp       @ reset stack pointer
POP  {fp}         @ recover previous frame pointer
BX   lr           @ return to the caller

The ARM stack frame.

Contents            Address
Saved r12           [fp, #8]
Old lr              [fp, #4]
Old frame pointer   [fp] (fp points here)
Argument 2          [fp, #-4]
Argument 1          [fp, #-8]
Argument 0          [fp, #-12]
Local variable 1    [fp, #-16]
Local variable 0    [fp, #-20]
Saved r10           [fp, #-24]
Saved r9            [fp, #-28]
Saved r8            [fp, #-32]
Saved r7            [fp, #-36]
Saved r6            [fp, #-40]
Saved r5            [fp, #-44]
Saved r4            [fp, #-48] (sp points here)
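
The offsets in the table follow a simple pattern of 4 bytes per slot. As a small worked check, the C helpers below recompute them; the function names are hypothetical and the constants apply only to this particular frame layout (3 arguments, 2 locals, r4-r10 saved).

```c
#include <assert.h>

/* fp-relative byte offset of argument i (0..2), stored by PUSH {r0,r1,r2}. */
int argument_offset(int i) {
    assert(i >= 0 && i <= 2);
    return -12 + 4 * i;
}

/* fp-relative byte offset of local variable j (0..1), reserved by SUB sp, sp, #8. */
int local_offset(int j) {
    assert(j >= 0 && j <= 1);
    return -20 + 4 * j;
}

/* fp-relative byte offset of saved register r4..r10, stored by PUSH {r4-r10}. */
int saved_register_offset(int reg) {
    assert(reg >= 4 && reg <= 10);
    return -48 + 4 * (reg - 4);
}
```

The lowest-numbered register always lands at the lowest address, which is why r4 sits at [fp, #-48], where sp points, and r10 at [fp, #-24].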

Another approach could be for the callee to first push all arguments and scratch registers onto the stack and then allocate space for local variables.

Another approach could be for the callee to PUSH {fp, ip, lr, pc} onto the stack before pushing arguments and local variables.
This provides stack back-trace debugging information, so that a debugger can look backwards and easily reconstruct the execution state of the program.

These approaches are valid as long as the function uses the stack frame consistently.

To optimize the function, we can note that it does not use the callee-saved registers (such as r4 and r5), so there is no need to save and restore them.
Also, we can keep the arguments in registers without saving them to the stack and compute the result directly into r0 without using a local variable.

64-bit differences.

The 64-bit ARM architecture provides two execution modes: A32, which supports the 32-bit instruction set, and A64, which supports a new 64-bit execution model. A 64-bit CPU can therefore run a mix of 32-bit and 64-bit programs side by side.

Differences between the A32 and A64

Word size
A64 instructions are a fixed size of 32 bits, however registers and address computations are 64 bit.

Registers
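A64 has 31 general purpose 64-bit registers, x0 to x30, plus a register number 31 that is read as either the zero register (xzr) or the stack pointer (sp) depending on the instruction.
x0 to x7, arguments and results
x9 to x15, caller-saved scratch registers
x16 and x17, intra-procedure-call scratch registers
x29, the frame pointer
x30, the link register
xzr, a dedicated zero register that reads as zero and ignores writes
sp, the stack pointer.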

Instructions
A64 instructions are similar to A32 with the exception of conditional predicates which are no longer part of every instruction.

Calling convention
Invoking a function places the first 8 arguments in registers x0 to x7 and pushes the remaining ones onto the stack.
The caller preserves x9 - x15 and x30.
The callee preserves x19 - x29.
The return value is placed in x0, while extended return values are pointed to by x8.

Summary.

Assembly language enables programmers to write human-readable code that is close to machine language and gives full control over the tasks the computer performs.
It is memory efficient, fast and hardware oriented.
It also comes with drawbacks: writing assembly code takes considerable time and effort because of its complexity and syntax, it is not portable between computer architectures, and longer programs require more memory.

References.

  1. RISC and CISC computer architectures
  2. Writing ARM Assembly Documentation