Assembly language: ARM Architecture
In this article we explore the assembly language for the ARM RISC computer architecture.
Table of contents.
- Introduction to assembly language.
- The ARM assembly language.
- Summary.
- References.
Introduction to assembly language.
Assembly languages are processor specific and are fundamental to compiler design.
In this article we shall use the gcc compiler and assembler for our examples.
Hello World
#include <stdio.h>
int main(int argc, char *argv[]){
    printf("hello %s\n", "world");
    return 0;
}
Compilation
gcc -S test.c -o test.s
#view the compiled assembly code
cat test.s
Output
.file "test.c"
.text
.section .rodata
.LC0:
.string "world"
.LC1:
.string "hello %s\n"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl %edi, -4(%rbp)
movq %rsi, -16(%rbp)
leaq .LC0(%rip), %rsi
leaq .LC1(%rip), %rdi
movl $0, %eax
call printf@PLT
movl $0, %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Debian 10.2.1-6) 10.2.1 20210110"
.section .note.GNU-stack,"",@progbits
The output above is x86-64 assembly produced on a Debian system; the output of your compiler may be different.
Assembly code elements.
Regardless of the CPU architecture, assembly code will have the following elements:
- Directives.
They begin with a . (dot) and are used to indicate structural information that is useful to the assembler, linker or debugger.
.data indicates the start of the data segment.
.text indicates the start of the program segment.
.string indicates a string constant within the data section.
.globl main indicates that the label main is a global symbol that can be accessed by other code modules.
- Labels.
These end with a colon and, by their position, associate names with locations.
An example;
The label .LC0: indicates that the following string should be called .LC0.
The label main: indicates that the instruction pushq %rbp is the first instruction of the main function.
Labels beginning with a . (dot) are temporary local labels generated by the compiler. Labels don't become part of the machine code itself, but global ones are recorded in the resulting object code for the purposes of linking and in the executable file for the purpose of debugging.
- Instructions.
These are the actual operations, such as pushq %rbp, and are indented for visual distinction from directives and labels.
Note that instruction mnemonics in GNU assembly are not case sensitive; in the ARM sections of this article they are written in uppercase for consistency.
We can take the assembly code test.s and compile it to a runnable program.
Compiling assembly to an executable
gcc test.s -o test
#run executable
./test
Output
hello world
Compiling to object code.
gcc test.s -c -o test.o
We use the nm utility to display the symbols (names) present in the object code.
nm test.o
Output
U _GLOBAL_OFFSET_TABLE_
0000000000000000 T main
U printf
The above information from object code is available to the linker.
main is present in the text (T) section of the object at location zero.
printf is undefined (U) since it will be obtained from the standard library.
.LC0 and .LC1 do not appear because they are local labels that were not declared with .globl.
The ARM Assembly language.
ARM is a family of CPUs based on the RISC architecture.
RISC processors implement a smaller set of simple instructions. By removing rarely needed instructions and optimizing pathways, they can execute each instruction quickly and so reach a high rate of millions of instructions per second (MIPS).
Compared to CISC designs, they tend to deliver good performance at a fraction of the power.
Registers and data types.
ARM-32 has 16 general purpose registers from r0-r15 with the following conventions for use.
r0 - r10 are general purpose.
r11 - Frame pointer(fp)
r12 - Intra-Procedure-Call Scratch Register(ip)
r13 - Stack pointer(sp)
r14 - Link Register(Return Address)
r15 - Program Counter(pc)
ARM also has 2 additional registers that cannot be accessed directly: the Current Program Status Register (CPSR) and the Saved Program Status Register (SPSR). They hold the results of comparison operations and privileged data regarding the process state.
These can be set as side effects for some operations.
ARM suffixes for data sizes.
Data type | Suffix | Size |
---|---|---|
Byte | B | 8 bits |
HalfWord | H | 16 bits |
Word | W | 32 bits |
Double Word | - | 64 bits |
Signed Byte | SB | 8 bits |
Signed HalfWord | SH | 16 bits |
Signed Word | SW | 32 bits |
There is no register naming structure for anything below a word.
The signed types are used to provide the appropriate sign-extension when loading a small data type into a larger register.
If no suffix is given the assembler will assume an unsigned word operand.
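To make the difference between the unsigned and signed suffixes concrete, here is a small C sketch of what a byte or halfword load does to the upper bits of a 32-bit register. The function names are invented for this illustration and are not part of any ARM toolchain.

```c
#include <stdint.h>

/* Model of an unsigned byte load (B suffix): upper 24 bits become zero. */
uint32_t zero_extend_byte(uint8_t b) {
    return (uint32_t)b;
}

/* Model of a signed byte load (SB suffix): bit 7 is copied into bits 8..31. */
int32_t sign_extend_byte(uint8_t b) {
    int32_t v = b;
    return v >= 0x80 ? v - 0x100 : v;
}

/* Model of a signed halfword load (SH suffix): bit 15 is copied into bits 16..31. */
int32_t sign_extend_halfword(uint16_t h) {
    int32_t v = h;
    return v >= 0x8000 ? v - 0x10000 : v;
}
```

With the byte 0xF0, zero extension yields 240 while sign extension yields -16, which is why the choice of suffix matters when the value in memory is a signed type.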
Moving data between registers and memory involves two classes of instructions, namely:
MOV, which copies data and constants between registers
LDR (load) and STR (store), which move data between registers and memory
MOV
Copies a known immediate value into a register, or copies one register into another.
Immediate values are denoted by # and must fit in the instruction encoding: the classic A32 MOV accepts an 8-bit value rotated right by an even number of bits, and MOVW (ARMv6T2 and later) accepts any 16-bit value. Larger constants are loaded with LDR.
In ARM instructions the destination register is written on the left and the source on the right, with the exception of STR.
Mode | Example |
---|---|
Immediate | MOV r0, #3 |
Register | MOV r1, r0 |
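Whether a constant can be used as a MOV immediate depends on the encoding. As a rough sketch, assuming the classic A32 rule (an 8-bit value rotated right by an even amount) and the 16-bit MOVW form, the check could be written in C as follows. The function names are hypothetical helpers for this article only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits, 0 <= n < 32. */
static uint32_t rotl32(uint32_t v, unsigned n) {
    return n == 0 ? v : (v << n) | (v >> (32u - n));
}

/* True if v equals some 8-bit value rotated right by an even amount,
   which is the immediate form accepted by the classic A32 MOV. */
bool fits_a32_immediate(uint32_t v) {
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Undo a right rotation by rot and see if 8 bits are enough. */
        if (rotl32(v, rot) <= 0xFFu) {
            return true;
        }
    }
    return false;
}

/* True if v fits the 16-bit immediate of MOVW. */
bool fits_movw_immediate(uint32_t v) {
    return v <= 0xFFFFu;
}
```

For example, 0xFF000000 and 0x3FC pass the first check, while 0x101 does not and would need MOVW or an LDR from the literal pool.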
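A mnemonic suffix for the data type is appended to the load and store instructions (for example LDRB or STRH) so that we know what is being transferred and how it is done.
LDR and STR are used to move values between registers and memory. For LDR the first argument is the destination register and the second is the source address; for STR the first argument is the source register and the second is the destination address.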
In the simplest case,
LDR Rd, [Ra]
STR Rs, [Ra]
Rd denotes the destination register.
Rs denotes the source register.
Ra denotes the register containing the address
ARM addressing modes.
Address Mode | Example |
---|---|
Literal | LDR Rd, =0xABCD1234 |
Absolute Address | LDR Rd, =label |
Register Indirect | LDR Rd, [Ra] |
Pre-indexing - Immediate | LDR Rd, [Ra, #4] |
Pre-indexing - Register | LDR Rd, [Ra, Ro] |
Pre-indexing - Immediate & Writeback | LDR Rd, [Ra, #4]! |
Pre-indexing - Register & Writeback | LDR Rd, [Ra, Ro]! |
Post-indexing - Immediate | LDR Rd, [Ra], #4 |
Post-indexing - Register | LDR Rd, [Ra], Ro |
As can be seen, LDR can load a full 32-bit literal into a register. However, unlike the x86 architecture, there is no single instruction that loads a value from a given memory address.
To do this in ARM we first load the address into a register and perform a register-indirect load as shown below.
LDR r1, =x
LDR r2, [r1]
Pre-indexing modes add a constant or a register to a base register and load from the computed address.
An example
LDR r1, [r2, #4] @ Load from address r2 + 4
LDR r1, [r2, r3] @ Load from address r2 + r3
For writing back to the base register, a ! (bang) character is used to indicate that the computed address should be saved to the base register after the value is loaded.
An example
LDR r1, [r2, #4]! @ Load from r2 + 4 then r2 += 4
LDR r1, [r2, r3]! @ Load from r2 + r3 then r2 += r3
Post-indexing performs the above operation in reverse order. It first loads from the address in the base register, then increments the base register.
LDR r1, [r2], #4 @ Load from r2 then r2 += 4
LDR r1, [r2], r3 @ Load from r2 then r2 += r3
Pre-indexing and post-indexing modes enable single-instruction implementations of operations such as b = *a++, where a is a pointer.
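The three indexed forms map closely onto C pointer idioms. The sketch below models the base register as a pointer to 32-bit words, so an offset of #4 corresponds to one element. The function names are made up for illustration.

```c
#include <stdint.h>

/* LDR Rd, [Ra, #4]  : load from Ra + 4, Ra itself is unchanged. */
int32_t ldr_offset(const int32_t *ra) {
    return ra[1];
}

/* LDR Rd, [Ra, #4]! : load from Ra + 4 and write the new address back to Ra. */
int32_t ldr_pre_writeback(const int32_t **ra) {
    *ra += 1;
    return **ra;
}

/* LDR Rd, [Ra], #4  : load from Ra, then advance Ra by 4. */
int32_t ldr_post(const int32_t **ra) {
    int32_t value = **ra;
    *ra += 1;
    return value;
}
```

In C terms, the pre-indexed form with writeback behaves like `*++p` and the post-indexed form like `*p++`, which is why a compiler can often translate those expressions to a single load.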
Large literals are stored in a literal pool, a small region of data inside the code section of the program. This is necessary because every ARM instruction must fit in a 32-bit word, which leaves no room for a full 32-bit constant.
The literal is loaded from the pool with a PC-relative load instruction, so the pool must lie within roughly ±4 KB (a 12-bit offset) of the loading instruction.
An = before a large literal or a label tells the assembler to place the value into a literal pool and emit the corresponding PC-relative load instead.
An example;
LDR r1, =x
LDR r2, [r1]
The above instructions load the address of x into r1 and the value of x into r2.
They will be expanded into
LDR r1, .L1
LDR r2, [r1]
B .end
.L1:
.word x
.end:
That is, load the address of x from the adjacent literal pool, then load the value of x.
Basic Arithmetic.
ARM provides three-address arithmetic instructions on registers.
ADD and SUB specify the result register as the first argument and operate on the second and third arguments.
The third operand may be an 8-bit constant or a register with an optional shift applied.
Carry-in variants add the C bit of the CPSR to the result.
All take an optional S suffix which sets the condition flags on completion.
Instruction | Example |
---|---|
Add | ADD Rd, Rm, Rn |
Add with carry-in | ADC Rd, Rm, Rn |
Subtract | SUB Rd, Rm, Rn |
Subtract with carry-in | SBC Rd, Rm, Rn |
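The carry-in variants exist mainly so that values wider than a register can be processed in pieces. As a sketch, a 64-bit addition on a 32-bit machine can be modelled in C as an ADDS on the low words followed by an ADC on the high words. The function name and parameter layout are assumptions made for this example.

```c
#include <stdint.h>

/* Adds two 64-bit values given as (low, high) 32-bit halves, the way an
   ADDS / ADC instruction pair would. */
uint64_t add64_via_adc(uint32_t a_lo, uint32_t a_hi,
                       uint32_t b_lo, uint32_t b_hi) {
    uint32_t lo = a_lo + b_lo;             /* ADDS lo, a_lo, b_lo */
    uint32_t carry = lo < a_lo ? 1u : 0u;  /* C flag: set on unsigned wrap-around */
    uint32_t hi = a_hi + b_hi + carry;     /* ADC  hi, a_hi, b_hi */
    return ((uint64_t)hi << 32) | lo;
}
```

Adding 1 to 0x00000000FFFFFFFF this way wraps the low word to zero and carries into the high word, giving 0x0000000100000000.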
For multiplication, the process is the same except that multiplying two 32-bit numbers could yield a 64-bit result.
The MUL instruction discards the high bits of the result, while UMULL places the 64-bit result in two 32-bit registers.
The signed variant SMULL sign extends the high register.
Instruction | Example |
---|---|
Multiplication | MUL Rd, Rm, Rn |
Unsigned Long Multiplication | UMULL RdLo, RdHi, Rm, Rn |
Signed Long Multiplication | SMULL RdLo, RdHi, Rm, Rn |
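The relationship between the three multiply instructions can be sketched in C with 64-bit intermediate values. The function names below are illustrative only; the 64-bit return value stands for the RdHi:RdLo register pair.

```c
#include <stdint.h>

/* MUL: keeps only the low 32 bits of the product. */
uint32_t mul_model(uint32_t rm, uint32_t rn) {
    return rm * rn;
}

/* UMULL: full unsigned 64-bit product (high word : low word). */
uint64_t umull_model(uint32_t rm, uint32_t rn) {
    return (uint64_t)rm * rn;
}

/* SMULL: full signed 64-bit product, so the high word carries the sign. */
int64_t smull_model(int32_t rm, int32_t rn) {
    return (int64_t)rm * rn;
}
```

Multiplying 0x10000 by 0x10000 gives 0 with the MUL model because the only set bit lands in the discarded high word, while the UMULL model keeps it as 0x100000000.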
Many 32-bit ARM cores have no division instruction (SDIV and UDIV exist only on some later cores), since division can't be carried out in a single pipelined cycle. On those cores it is accomplished by repeated subtraction or, more efficiently, by invoking an external function in the run-time library that computes the quotient.
The function is referred to as __aeabi_idiv.
An example for the division of 14 by 3
MOV r0, #14
MOV r1, #3
BL __aeabi_idiv
After the division, register r0 will contain the quotient, 4.
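The repeated-subtraction approach is easy to picture in C. This is only a sketch of the idea for unsigned operands, with a made-up function name; it is not how __aeabi_idiv is implemented, and real library routines use much faster shift-and-subtract schemes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned division by repeated subtraction.
   Returns the quotient; stores the remainder in *rem when rem is not NULL. */
uint32_t udiv_by_subtraction(uint32_t n, uint32_t d, uint32_t *rem) {
    assert(d != 0);        /* division by zero is not handled in this sketch */
    uint32_t q = 0;
    while (n >= d) {       /* CMP n, d ; branch out when n is lower */
        n -= d;            /* SUB n, n, d */
        q++;               /* ADD q, q, #1 */
    }
    if (rem != NULL) {
        *rem = n;
    }
    return q;
}
```

Dividing 14 by 3 takes four trips around the loop and leaves a remainder of 2. The running time grows with the quotient, which is why this method is only reasonable for small values.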
Logical instructions.
These are bitwise-and, bitwise-or, bitwise-exclusive-or and bitwise-bit-clear(bitwise-and first value and inversion of second value).
The move-not MVN instruction is used to perform a bitwise-not while moving from register to register.
Instruction | Example |
---|---|
bitwise-and | AND Rd, Rm, Rn |
bitwise-or | ORR Rd, Rm, Rn |
bitwise-xor | EOR Rd, Rm, Rn |
bitwise-bit-clear | BIC Rd, Rm, Rn |
move-not | MVN Rd, Rn |
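The two less familiar instructions in the table, BIC and MVN, correspond to simple C expressions. The function names are chosen for this illustration only.

```c
#include <stdint.h>

/* BIC Rd, Rm, Rn : clear in Rm every bit that is set in Rn. */
uint32_t bic_model(uint32_t rm, uint32_t rn) {
    return rm & ~rn;
}

/* MVN Rd, Rn : bitwise NOT of Rn. */
uint32_t mvn_model(uint32_t rn) {
    return ~rn;
}
```

For instance, clearing the low nibble mask 0x0F out of 0xFF leaves 0xF0, and MVN of 0 gives 0xFFFFFFFF.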
Comparison and branches
The CMP instruction compares two values by subtracting the second from the first and sets the N (negative), Z (zero), C (carry) and V (overflow) flags in the CPSR to be read by following instructions.
For comparing a register and immediate value, the immediate value is the second operand.
An example
CMP Rd, Rn
CMP Rd, #imm
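CMP behaves like a subtraction whose result is thrown away, keeping only the flags. The following C sketch computes the four condition flags for a comparison and derives a few of the conditions that branch mnemonics such as BLT, BGE and BHI test. The struct and function names are assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool n;  /* result is negative           */
    bool z;  /* result is zero               */
    bool c;  /* no borrow (unsigned rn >= op2) */
    bool v;  /* signed overflow              */
} arm_flags;

/* Flags produced by CMP rn, op2 (that is, by rn - op2). */
arm_flags cmp_model(uint32_t rn, uint32_t op2) {
    uint32_t diff = rn - op2;
    arm_flags f;
    f.n = (diff >> 31) != 0;
    f.z = diff == 0;
    f.c = rn >= op2;
    /* Overflow: operands differ in sign and the result's sign differs from rn. */
    f.v = (((rn ^ op2) & (rn ^ diff)) >> 31) != 0;
    return f;
}

bool cond_eq(arm_flags f) { return f.z; }                 /* BEQ */
bool cond_lt(arm_flags f) { return f.n != f.v; }          /* BLT, signed <    */
bool cond_ge(arm_flags f) { return f.n == f.v; }          /* BGE, signed >=   */
bool cond_gt(arm_flags f) { return !f.z && f.n == f.v; }  /* BGT, signed >    */
bool cond_hi(arm_flags f) { return f.c && !f.z; }         /* BHI, unsigned >  */
bool cond_lo(arm_flags f) { return !f.c; }                /* BLO, unsigned <  */
```

Comparing 0xFFFFFFFF with 1 shows why both signed and unsigned conditions exist: read as a signed value the first operand is -1, so LT holds, while read as unsigned it is the largest 32-bit value, so HI holds as well.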
Branch Instructions.
Instruction | Meaning |
---|---|
B | Branch always |
BX | Branch and exchange |
BEQ | Equal |
BNE | Not equal |
BGT | greater than |
BGE | greater than or equal |
BLT | less than |
BLE | Less than or equal |
BMI | Negative |
BL | Branch and Link |
BLX | Branch-Link-Exchange |
BVS | Overflow Set |
BVC | Overflow Clear |
BHI | Higher (unsigned >) |
BHS | Higher or same (unsigned >=) |
BLO | Lower (unsigned <) |
BLS | Lower or same (unsigned <=) |
BPL | Positive or zero |
The character S is appended to an arithmetic instruction so as to update the CPSR, e.g. SUBS will subtract, store the result and then update the CPSR.
The reason for this is that conditional branch instructions consult the previously set flags of the CPSR and jump to a label if the relevant flags are set.
A branch without conditions is specified with B.
An example of counting from 0 to 5
MOV r0, #0
loop: ADD r0, r0, #1
CMP r0, #5
BLT loop
Assigning global variable y, y = 10 if x > 0 else y = 20
LDR r0, =x
LDR r0, [r0]
CMP r0, #0
BGT .L1
.L0:
MOV r0, #20
B .L2
.L1:
MOV r0, #10
.L2:
LDR r1, =y
STR r0, [r1]
The BL (branch-and-link) instruction is used to implement function calls: it sets the link register to the address of the next instruction and jumps to the given label.
The link register is then used as the return address when the function terminates.
The BX instruction branches to the address given in a register and is used to return from a function call by branching to the link register.
BLX performs a branch-and-link to the address given by the register and is used to invoke function pointers, virtual methods or other indirect jumps.
An example of conditional execution
if(a < b){
a++;
}else{
b++;
}
We make each of the two additions conditional upon a previous comparison; whichever condition holds true will be executed and the other skipped.
Assuming that a and b are held in r0 and r1 respectively, it translates to,
CMP r0, r1
ADDLT r0, r0, #1
ADDGE r1, r1, #1
The stack.
This is an auxiliary data structure used to record the function call history of a program along with local variables that don't fit in registers.
The stack grows downwards from high values to low values.
The sp register(stack pointer) keeps track of the bottom-most item on the stack.
To push r0 register onto the stack, we subtract the size of the register from sp and store r0 to the location pointed to by sp.
SUB sp, sp, #4
STR r0, [sp]
Using a single instruction by use of pre-indexing and write-back.
STR r0, [sp, #-4]!
Pushing to stack
PUSH does the same and can also move any number of registers onto the stack at once. Braces {} are used to indicate the list of registers.
An example
PUSH {r0, r1, r2}
Popping involves the opposite
LDR r0, [sp]
ADD sp, sp, #4
With a single instruction
LDR r0, [sp], #4
Popping a set of registers
POP {r0, r1, r2}
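The push and pop sequences above can be modelled in C with an array and an index that plays the role of sp. This is a minimal sketch under stated assumptions: the stack holds 16 words, sp counts words rather than bytes, and all names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16

typedef struct {
    uint32_t mem[STACK_WORDS];
    uint32_t sp;  /* index of the most recently pushed word;
                     equals STACK_WORDS when the stack is empty */
} arm_stack;

void stack_init(arm_stack *s) {
    s->sp = STACK_WORDS;   /* the stack starts at the highest address */
}

void stack_push(arm_stack *s, uint32_t value) {
    assert(s->sp > 0);     /* overflow is not handled in this sketch */
    s->sp -= 1;            /* SUB sp, sp, #4 */
    s->mem[s->sp] = value; /* STR r0, [sp]   */
}

uint32_t stack_pop(arm_stack *s) {
    assert(s->sp < STACK_WORDS);     /* popping an empty stack is a bug */
    uint32_t value = s->mem[s->sp];  /* LDR r0, [sp]   */
    s->sp += 1;                      /* ADD sp, sp, #4 */
    return value;
}
```

Because sp moves down on a push and up on a pop, values come back in the reverse of the order in which they were pushed.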
Calling a function.
ARM register assignments.
Register | Purpose | Saver |
---|---|---|
r0 | argument 0 / result | not saved |
r1 | argument 1 | caller saves |
r2 | argument 2 | caller saves |
r3 | argument 3 | caller saves |
r4 | scratch | callee saves |
r5 | scratch | callee saves |
r6 | scratch | callee saves |
r7 | scratch | callee saves |
r8 | scratch | callee saves |
r9 | scratch | callee saves |
r10 | scratch | callee saves |
r11 | frame pointer | callee saves |
r12 | intraprocedure | caller saves |
r13 | stack pointer | callee saves |
r14 | link register | caller saves |
r15 | program counter | saved in link register |
ARM calling convention
The first 4 arguments are placed in r0, r1, r2, r3 registers.
Additional arguments are pushed onto the stack in reverse order.
The caller saves r0-r3 and r12 if needed.
The caller must always save the link register(r14).
The callee must save r4-r11 if needed.
The results are placed in r0.
To call a function, we place the desired arguments in registers r0-r3, save the current value of the link register and use the BL instruction to jump to the function.
When the function returns we restore the previous value of the link register and examine the result stored in register r0.
An example
#include <stdio.h>
int x = 0;
int y = 10;
int main(){
    x = printf("value: %d\n", y);
}
Which translates to
.data
x: .word 0
y: .word 10
S0: .ascii "value: %d\012\000"
.text
main:
LDR r0, =S0 @ Load address of S0
LDR r1, =y @ Load address of y
LDR r1, [r1] @ Load value of y
PUSH {ip,lr} @ Save registers
BL printf @ Call printf
POP {ip,lr} @ Restore registers
LDR r1, =x @ Load address of x
STR r0, [r1] @ Store return value in x.
BX lr @ Return to the caller
.end
Defining a leaf function.
A leaf function is a function that computes a value without calling other functions.
Such functions are easy to write since function arguments are passed in registers.
An example
square: function integer ( x: integer ) ={
return x * x;
}
Is translated to.
.global square
square:
MUL r0, r0, r0 @ multiply argument by itself
BX lr @ return to caller
In the general case a more complex approach is needed: the above function will not work if it wants to invoke other functions, since the stack is not set up properly.
Defining a complex function.
A complex function is one that is able to invoke other functions, compute expressions of arbitrary complexity and return to the caller with the original state intact.
An example of a function that takes 3 arguments and uses 2 local variables.
func:
PUSH {fp} @ save the frame pointer
MOV fp, sp @ set the new frame pointer
PUSH {r0,r1,r2} @ save the arguments on the stack
SUB sp, sp, #8 @ allocate two more local variables
PUSH {r4-r10} @ save callee-saved registers
@@@ body of function goes here @@@
POP {r4-r10} @ restore callee saved registers
MOV sp, fp @ reset stack pointer
POP {fp} @ recover previous frame pointer
BX lr @ return to the caller
With this method we ensure that all values held in registers are saved onto the stack so that data won't be lost.
This stack will be similar to the x86 stack.
An example
compute: function integer( a: integer, b: integer, c: integer ) ={
x:integer = a + b + c;
y:integer = x * 5;
return y;
}
Which translates to
.global compute
compute:
@@@@@@@@@@@@@@@@@@ preamble of function sets up stack
PUSH {fp} @ save the frame pointer
MOV fp, sp @ set the new frame pointer
PUSH {r0,r1,r2} @ save the arguments on the stack
SUB sp, sp, #8 @ allocate two more local variables
PUSH {r4-r10} @ save callee-saved registers
@@@@@@@@@@@@@@@@@@@@@@@@ body of function starts here
LDR r0, [fp,#-12] @ load argument 0 (a) into r0
LDR r1, [fp,#-8] @ load argument 1 (b) into r1
LDR r2, [fp,#-4] @ load argument 2 (c) into r2
ADD r1, r1, r2 @ add the args together
ADD r0, r0, r1
STR r0, [fp,#-20] @ store the result into local 0 (x)
LDR r0, [fp,#-20] @ load local 0 (x) into a register
MOV r1, #5 @ move 5 into a register
MUL r2, r0, r1 @ multiply both into r2
STR r2, [fp,#-16] @ store the result in local 1 (y)
LDR r0, [fp,#-16] @ move local 1 (y) into the result
@@@@@@@@@@@@@@@@@@@ epilogue of function restores the stack
POP {r4-r10} @ restore callee saved registers
MOV sp, fp @ reset stack pointer
POP {fp} @ recover previous frame pointer
BX lr @ return to the caller
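The source function above is written in a small teaching language rather than C. For checking the assembly by hand, an equivalent C version might look like this; it is a direct transcription and assumes 32-bit int arithmetic.

```c
/* C transcription of the compute function shown above. */
int compute(int a, int b, int c) {
    int x = a + b + c;  /* local 0 in the assembly listing */
    int y = x * 5;      /* local 1 in the assembly listing */
    return y;           /* ends up in r0 */
}
```

Calling compute(1, 2, 3) should leave 30 in r0, which makes a convenient test value when stepping through the assembly in a debugger.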
The ARM stack frame.
Contents | Address |
---|---|
Old lr | [fp, #8] |
Saved r12 | [fp, #4] |
Old frame pointer | [fp] (fp points here) |
Argument 2 | [fp, #-4] |
Argument 1 | [fp, #-8] |
Argument 0 | [fp, #-12] |
Local variable 1 | [fp, #-16] |
Local variable 0 | [fp, #-20] |
Saved r10 | [fp, #-24] |
Saved r9 | [fp, #-28] |
Saved r8 | [fp, #-32] |
Saved r7 | [fp, #-36] |
Saved r6 | [fp, #-40] |
Saved r5 | [fp, #-44] |
Saved r4 | [fp, #-48] (sp points here) |
Another approach could be for the callee to first push all arguments and scratch registers onto the stack and then allocate space for local variables.
Yet another approach could be for the callee to PUSH {fp, ip, lr, pc} onto the stack before pushing arguments and local variables.
This would provide stack back-trace debugging information so that the debugger can look backwards and easily reconstruct the correct execution state of the program.
These approaches are valid as long as the function uses the stack frame consistently.
To optimize the function, we can avoid using callee-saved registers such as r4 and r5, so there is no need to save and restore them.
Also, we can keep arguments in registers without saving them to the stack and compute the result directly into r0 without using a local variable.
64-bit differences.
The 64-bit ARM architecture defines two execution states: AArch32, which runs the 32-bit A32 instruction set described above, and AArch64, which runs the new 64-bit A64 instruction set. A CPU that implements both states can run 32-bit and 64-bit programs side by side under a 64-bit operating system.
Differences between the A32 and A64
Word size
A64 instructions are a fixed size of 32 bits, however registers and address computations are 64 bits wide.
Registers
A64 has 31 general purpose 64-bit registers, x0 to x30. Register number 31 is not a general purpose register: depending on the instruction it names either the zero register (xzr) or the stack pointer (sp).
x0 to x7, arguments and results
x8, the indirect result register
x9 to x15, caller-saved temporaries
x16 and x17, intra-procedure-call scratch registers
x18, the platform register
x19 to x28, callee-saved registers
x29, the frame pointer
x30, the link register
Instructions
A64 instructions are similar to A32 with the exception of conditional predicates, which are no longer part of every instruction.
Calling convention
To invoke a function, the first 8 arguments are placed in registers x0 to x7 and the remainder are pushed onto the stack.
The caller preserves x9 - x15 and x30.
The callee preserves x19 - x29.
The return value is placed in x0 while extended return values are pointed to by x8.
Summary.
Assembly language enables programmers to write human-readable code that is close to machine language and gives full control over what tasks the computer should perform.
It can be memory efficient and fast, and it is hardware oriented.
It also comes with some drawbacks: writing assembly code takes a lot of time and effort, the syntax is complex, and the code is not portable between computer architectures.
References.