In this article, we have presented 30 techniques which you can use to optimize your C++ code so that it performs optimally and passes the time limits in competitive programming contests.
Table of contents:
- Compile C++ code optimally
- Use aligned memory allocation
- Optimize Memory allocation
- Always free dynamically allocated memory
- Use multi-threaded execution
- Use third-party optimized libraries
- Minimize use of Jumps and Branches
- Use Likely and Unlikely
- Consider the order of accessing array indices
- Avoid Local variables, Use Global variables
- Reduce number of function parameters
- Pass by reference (not value)
- Avoid casting
- Loop unrolling, Duff's device
- Use intrinsics where possible
- Use register keyword
- Use Inline keyword
- Use cache optimally
- Use std::vector over std::list
- Use memset, memcpy
- Use qsort and bsearch
- Minimize Read and Write I/O
- Optimize I/O
- Reuse objects and memory
- Use template meta-programming
- Use Initialization over Assignment
- Use prefix
- Use Operator=
- Use explicit constructor
- Use reference instead of pointer
We will go through all points in detail.
1. Compile C++ code optimally
Use GCC's optimization options to create an optimized C++ executable. The -O0 option turns off all optimizations and is used when optimization could expose specific bugs within the code, while -Os optimizes for a small binary size.
If your code is safe from such bugs, compile using the -O3 option, which aggressively applies all optimizations on the C++ code to create an executable.
Use a specific -march option depending on the system you want to run on. If your CPU supports advanced instruction sets such as AVX-512, enable them as a compiler flag, for example:
-O3 -march=icelake-server
Try different compilers depending on your use case such as:
- AOCC
- GCC
- G++
2. Use aligned memory allocation
Instead of malloc(), use aligned_alloc().
aligned_alloc() allocates aligned memory which helps C++ code to take advantage of cache and perform optimally.
aligned_alloc(size_t alignment, size_t size)
alignment should be a power of 2, typically a cache-line size such as 64, and size should be a multiple of the alignment. Memory returned by malloc is usually aligned to only 8 or 16 bytes, depending on the platform.
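A minimal usage sketch, assuming a 64-byte cache line:

#include <cstdlib>
#include <cstdio>

int main() {
    // Request 1024 bytes aligned to a 64-byte boundary (a typical cache-line size).
    // The size passed to aligned_alloc should be a multiple of the alignment.
    float *data = static_cast<float*>(std::aligned_alloc(64, 1024));
    if (data == nullptr)
        return 1;                 // allocation failed
    data[0] = 3.14f;
    std::printf("%f\n", data[0]);
    std::free(data);              // aligned memory is released with free()
    return 0;
}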
3. Optimize Memory allocation
Use libraries like TCMalloc to optimize the use of malloc. TCMalloc stands for thread caching malloc.
export LD_PRELOAD=<path to TCMalloc>
Compile your C++ code using the flag "-ltcmalloc" (or preload the library as above). This library optimizes all memory-related functions such as malloc, aligned_alloc and others.
4. Always free dynamically allocated memory
Free dynamically allocated memory as soon as it is no longer needed within your code. This helps avoid out-of-memory issues and keeps memory usage low.
free(data);
This will also help avoid memory-related issues such as memory leaks.
5. Use multi-threaded execution
Use OpenMP and add support for multiple threads to execute your C++ code.
If there are 64 cores in your system, set OMP_NUM_THREADS accordingly and use all cores in your system (to 100% usage):
export OMP_NUM_THREADS=64
Explore the optimal settings for other environment variables as well, such as:
OMP_WAIT_POLICY
OMP_PROC_BIND
OMP_PLACES
KMP_AFFINITY (Intel runtime)
KMP_HW_SUBSET (Intel runtime)
If neither your code nor any library you are using contains OpenMP pragmas, these settings will have no effect and your code will run in a single thread. Make sure to use OpenMP pragmas to utilize the entire system.
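A minimal sketch of an OpenMP parallel loop, compiled with g++ -fopenmp; the vector addition here is only an illustrative workload:

#include <vector>
#include <cstdio>

int main() {
    const int n = 1000000;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

    // Iterations are distributed across all available threads,
    // controlled by OMP_NUM_THREADS at run time.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    std::printf("c[0] = %f\n", c[0]);
    return 0;
}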
6. Use third-party optimized libraries
Use third-party optimized libraries such as Eigen, OpenBLAS, FLAME BLIS, MKL if your code requires a function implemented by these libraries.
The most commonly used routine from these libraries is GEMM, their optimized general matrix multiplication.
7. Minimize use of Jumps and Branches
Jump statements such as continue, goto and break should be avoided where possible. They limit the compiler's ability to optimize, and there are usually alternatives. Switch statements should also be used sparingly when performance is a consideration.
Even branch statements such as if/else and loops such as for and do-while carry a misprediction cost, so keep their use to a minimum where possible.
8. Use Likely and Unlikely
If you are using branches in your C++ code and cannot easily avoid them, then use the [[likely]] or [[unlikely]] attributes (C++20) so that it is easier for the compiler to optimize your code.
Go through this article to understand this optimization.
Following is a sample code snippet using this optimization:
if (a > 3) [[likely]] {
    std::cout << "a is greater than 3" << std::endl;
} else [[unlikely]] {
    std::cout << "a is not greater than 3" << std::endl;
}
9. Consider the order of accessing array indices
Access patterns matter a lot when dealing with arrays.
In most cases, a 2D array is stored row by row (row-major order), so if the elements being accessed are contiguous within the same row, access is fast because the row is already cached. If the elements come from different rows, there is a major performance impact.
These are important considerations when optimizing even simple algorithms such as brute-force matrix multiplication.
The usual fix is to reorder the nested loops; for example, the 3rd nested loop may become the 2nd, as in the sketch below.
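A sketch of this reordering for square matrices stored as flat row-major arrays (the dimension N is an assumption for illustration):

const int N = 512; // assumed matrix dimension

// Naive i-j-k order: the innermost loop walks B column-wise (strided access).
void multiply_ijk(const double* A, const double* B, double* C) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

// Reordered i-k-j: the innermost loop walks both B and C row-wise (contiguous).
void multiply_ikj(const double* A, const double* B, double* C) {
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k)
            for (int j = 0; j < N; ++j)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}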
10. Avoid Local variables, Use Global variables
Always allocate memory and declare/initialize variables at the beginning of the code and keep all variables global.
Local variables incur overhead when variable declarations and memory allocations get scattered throughout the code.
11. Reduce number of function parameters
Reduce the number of parameters passed to a function. If possible, group multiple values in an object of a custom class and then pass one object, as in the sketch below.
In specific cases, this has a moderate impact on code performance. Always avoid passing unnecessary values, since each argument must be pushed onto the stack (or held in a register), which is an overhead.
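A hypothetical illustration of grouping parameters (the RenderOptions name and its fields are made up for this example):

#include <string>

struct RenderOptions {
    int width;
    int height;
    int depth;
    bool antialias;
    std::string title;
};

// Instead of render(int width, int height, int depth, bool antialias, const std::string& title),
// pass a single object by const reference:
void render(const RenderOptions& options) {
    // ... use options.width, options.height, and so on
}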
12. Pass by reference (not value)
When you pass by value, the object is copied and new memory is allocated for the copy. Memory allocation is always an overhead and, in this case, the data is copied as well. Passing by (const) reference avoids both, as shown below.
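A small sketch contrasting the two calling conventions:

#include <vector>

// Pass by value: the whole vector is copied on every call.
long sumByValue(std::vector<long> values) {
    long total = 0;
    for (long v : values) total += v;
    return total;
}

// Pass by const reference: no copy, the caller's data is read in place.
long sumByReference(const std::vector<long>& values) {
    long total = 0;
    for (long v : values) total += v;
    return total;
}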
13. Avoid casting
Avoid casting wherever possible, as type conversions can be expensive at run time. Declare variables with the datatype that is actually required instead of relying on casts such as:
(float *) a
14. Loop unrolling, Duff's device
Loop unrolling is a technique in which we reduce the number of loop-control operations (increments and condition checks) by manually unrolling the loop body.
Duff's device is a combination of a switch statement and a do/while loop that copies data from a source to a destination while unrolling the copy loop.
Learn more about this optimization with examples; a simple unrolling sketch follows.
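A minimal sketch of manual unrolling by a factor of four:

// Plain loop: one loop-control check per element.
long sumPlain(const int* data, int n) {
    long total = 0;
    for (int i = 0; i < n; ++i)
        total += data[i];
    return total;
}

// Unrolled by 4: one check per four elements, plus a small cleanup loop.
long sumUnrolled(const int* data, int n) {
    long total = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4)
        total += data[i] + data[i + 1] + data[i + 2] + data[i + 3];
    for (; i < n; ++i)        // handle the remaining n % 4 elements
        total += data[i];
    return total;
}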
15. Use intrinsics where possible
Intrinsics are functions that the C++ compiler translates directly into specific assembly instructions. When used correctly, these calls produce highly optimized code.
One example of intrinsic is:
__m512 _mm512_add_ps (__m512 a, __m512 b);
This adds two vectors of 16 float values each, i.e. two 512-bit registers, together.
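A sketch using this intrinsic, assuming a CPU with AVX-512F support and compilation with -mavx512f:

#include <immintrin.h>
#include <cstdio>

int main() {
    alignas(64) float a[16], b[16], c[16];
    for (int i = 0; i < 16; ++i) { a[i] = 1.0f * i; b[i] = 2.0f * i; }

    // Load two 512-bit vectors (16 floats each), add them with one
    // instruction, and store the result back to aligned memory.
    __m512 va = _mm512_load_ps(a);
    __m512 vb = _mm512_load_ps(b);
    __m512 vc = _mm512_add_ps(va, vb);
    _mm512_store_ps(c, vc);

    std::printf("c[15] = %f\n", c[15]); // 15 + 30 = 45
    return 0;
}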
16. Use register keyword
Declare heavily used variables with the register keyword so that the compiler may place them in a CPU register if possible. This is the easiest optimization to try, but note that register is only a hint: it was deprecated in C++11 and removed as a storage-class specifier in C++17, so modern compilers generally make this decision on their own.
register int data;
You can even place an entire array in a register. See how to do this.
17. Use Inline keyword
Define small, frequently called functions as inline. This encourages the compiler to move the function code to the place where the function is called, so the overhead of the function call is eliminated.
inline int sample_function(parameters)
{
// function code
}
18. Use cache optimally
Write cache-friendly C++ code to minimize cache misses. The main points are:
- Temporal locality: data that has been accessed recently is likely to be accessed again soon, so reuse data while it is still in cache.
- Spatial locality: rearrange data accesses such that memory accesses are contiguous, since neighbouring data is brought into cache together.
- Use std::vector over std::list as vector is cache friendly while list is cache unfriendly.
- Use the cache blocking (tiling) technique; see the sketch after this list.
- Avoid virtual functions where possible, as virtual calls increase cache misses.
- Avoid unpredictable branches, which are cache unfriendly.
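A sketch of cache blocking applied to a matrix transpose (N and BLOCK are illustrative values, with N divisible by BLOCK):

const int N = 1024;   // assumed matrix dimension
const int BLOCK = 64; // tile size chosen so a tile fits comfortably in cache

// Transpose block by block so that reads and writes stay within a small,
// cache-resident tile at any given time.
void transposeBlocked(const double* in, double* out) {
    for (int bi = 0; bi < N; bi += BLOCK)
        for (int bj = 0; bj < N; bj += BLOCK)
            for (int i = bi; i < bi + BLOCK; ++i)
                for (int j = bj; j < bj + BLOCK; ++j)
                    out[j * N + i] = in[i * N + j];
}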
19. Use std::vector over std::list
Use std::vector over std::list as vector is cache friendly while list is cache unfriendly.
20. Use memset, memcpy
If you need to initialize an array with the same value, then use memset instead of looping through all values and setting them one by one. memset is typically implemented as a compiler intrinsic with vectorized code, so it is extremely fast. Note that it fills individual bytes, so it is suitable for values such as 0 (or -1 for integer arrays) but not for arbitrary values.
memset(array, value, size_of_array_in_bytes);
Similarly, if you need to copy an array to another array, then use memcpy.
memcpy(destination_array, src_array, size_of_src_array_in_bytes);
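A small usage sketch:

#include <cstring>

int main() {
    int source[1000];
    int destination[1000];

    // Zero the whole array: safe because every byte of the value 0 is 0.
    std::memset(source, 0, sizeof(source));

    // Copy the whole array in one call; the size is given in bytes.
    std::memcpy(destination, source, sizeof(source));
    return 0;
}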
21. Use qsort and bsearch
To sort an array in C++, even one of custom objects, you can use qsort(). It is a highly optimized generic sort (traditionally based on QuickSort) provided by the C standard library, available through <cstdlib>.
qsort(array, number_of_elements, sizeof(int), comparison_function);
Similarly, for Binary Search, use bsearch().
bsearch(&search_value, array, number_of_elements, sizeof(int),
    comparison_function);
These are usually faster than a manual implementation.
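A sketch showing both calls with an integer comparator (note that bsearch takes a pointer to the key as its first argument and requires an already sorted array):

#include <cstdlib>
#include <cstdio>

// Returns a negative value, zero or a positive value for <, == and >.
int compareInts(const void* a, const void* b) {
    int x = *static_cast<const int*>(a);
    int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

int main() {
    int array[] = {42, 7, 19, 3, 25};
    const size_t n = sizeof(array) / sizeof(array[0]);

    std::qsort(array, n, sizeof(int), compareInts);

    int key = 19;
    int* found = static_cast<int*>(
        std::bsearch(&key, array, n, sizeof(int), compareInts));

    std::printf("found: %d\n", found ? *found : -1);
    return 0;
}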
22. Minimize Read and Write I/O
Use buffering for reads and writes. For output, build the output in a string (or buffer) and print it once instead of using multiple print statements.
Do not use std::endl in print statements, as it forces a flush of the stream; use the newline character "\n" instead.
23. Optimize I/O
Use std::basic_istream::read to read a massive chunk of data at once instead of reading the data line by line.
Place this line before taking any input to avoid iostreams and stdio synchronization:
ios_base::sync_with_stdio(false);
Untie cin from cout: by default, reading from cin first flushes cout, which is an overhead. Untying them lets iostreams work efficiently:
cin.tie(NULL);
On MSVC, turn off stream locking on key functions such as getchar() and putchar() by placing this line before all include statements:
#define _CRT_DISABLE_PERFCRIT_LOCKS
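Putting the I/O optimizations together, a typical fast-I/O skeleton (the sum-of-integers body is just a placeholder workload):

#define _CRT_DISABLE_PERFCRIT_LOCKS   // MSVC-specific; ignored elsewhere
#include <iostream>

int main() {
    std::ios_base::sync_with_stdio(false); // stop syncing iostreams with stdio
    std::cin.tie(nullptr);                 // stop flushing cout before every cin read

    int n;
    std::cin >> n;
    long long sum = 0;
    for (int i = 0; i < n; ++i) {
        int x;
        std::cin >> x;
        sum += x;
    }
    std::cout << sum << '\n';              // '\n' instead of std::endl avoids a flush
    return 0;
}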
24. Reuse objects and memory
Memory allocation is expensive so it is always wise to reuse existing objects and memory whenever possible.
If possible, use Singleton design pattern.
Design a custom memory pool: much software maintains a pool of already-allocated memory and reuses memory from this pool for new operations. A minimal sketch follows.
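A minimal sketch of a fixed-capacity object pool (the class and the MyRequest type in the usage comment are hypothetical):

#include <vector>
#include <cstddef>

// Memory for Capacity objects is allocated once up front; objects are then
// handed out and taken back without touching the allocator again.
template <typename T, std::size_t Capacity>
class ObjectPool {
    std::vector<T> storage;    // pre-allocated objects
    std::vector<T*> freeList;  // objects currently available for reuse
public:
    ObjectPool() : storage(Capacity) {
        for (T& obj : storage) freeList.push_back(&obj);
    }
    T* acquire() {             // returns nullptr when the pool is exhausted
        if (freeList.empty()) return nullptr;
        T* obj = freeList.back();
        freeList.pop_back();
        return obj;
    }
    void release(T* obj) {     // hand the object back for reuse
        freeList.push_back(obj);
    }
};

// Usage: ObjectPool<MyRequest, 1024> pool; MyRequest* r = pool.acquire(); ...; pool.release(r);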
25. Use template meta-programming
This allows the code to shift from dynamic (run-time) polymorphism to compile-time polymorphism, which can bring significant performance improvements because calls are resolved and inlined at compile time.
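One common way to illustrate this is the CRTP (curiously recurring template pattern), used here as an assumed example of static polymorphism:

#include <cstdio>

// The base class calls into the derived class through a template parameter,
// so the call is resolved at compile time with no vtable lookup.
template <typename Derived>
struct Shape {
    double area() const {
        return static_cast<const Derived*>(this)->areaImpl();
    }
};

struct Square : Shape<Square> {
    double side;
    explicit Square(double s) : side(s) {}
    double areaImpl() const { return side * side; }
};

int main() {
    Square sq(3.0);
    std::printf("area = %f\n", sq.area()); // resolved statically, easily inlined
    return 0;
}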
26. Use Initialization over Assignment
When an object is initialized directly with a value, only a single constructor call is made.
datatype object(value);
If an object is declared first, the default constructor is invoked, and later, when the object is assigned a value, the assignment operator is invoked as well.
datatype object;
...
object = datatype(value);
So, if an object is initialized directly, the cost of invoking the default constructor and a separate assignment can be saved.
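A concrete sketch with std::string:

#include <string>

void example() {
    // Direct initialization: a single constructor call.
    std::string greeting("hello");

    // Declare then assign: the default constructor runs first and the
    // assignment operator runs afterwards, doing strictly more work.
    std::string name;
    name = "hello";
}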
27. Use prefix
Use the prefix operator (++x) over the postfix operator (x++) where the result of the expression is not used. The postfix operator involves creating a temporary object, because the original value has to be returned while the variable itself is incremented.
Using the prefix form saves the cost of creating that temporary object. This is a significant gain for iterators and other class types, where object creation is a costly process.
28. Use Operator=
It is always advised to use a compound assignment operator such as += instead of the plain operator followed by an assignment (a = a + b), as this saves creating a temporary object. So, prefer the operator= forms (+=, -=, *=, and so on) over the plain operators.
29. Use explicit constructor
Add the keyword explicit before every single-argument constructor.
This prevents the compiler from implicitly converting another type (such as a string) into an object of your class. Such conversions are expensive when they go unnoticed.
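A sketch of the effect (the FileHandle class is a made-up example):

#include <string>

class FileHandle {
public:
    // Without 'explicit', a std::string could be silently converted into a
    // FileHandle wherever a FileHandle is expected.
    explicit FileHandle(const std::string& path) : path_(path) {}
private:
    std::string path_;
};

void open(const FileHandle&) {}

void demo() {
    std::string path = "data.txt";
    open(FileHandle(path));   // OK: the conversion is written out explicitly
    // open(path);            // error: implicit std::string -> FileHandle is disallowed
}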
30. Use reference instead of pointer
Prefer references over pointers: compilers find it difficult to optimize code that uses pointers because it is hard to determine whether two pointers refer to the same memory location (aliasing).
void Ptr(const int* p) { x += *p; }
void Ref(const int& p) { x += p; }
Additionally, a reference does not require the dereference operator and does not need the extra case of handling NULL, so it leads to cleaner code as well.
With this article at OpenGenus, you must have the complete idea of how to optimize C++ code.