# C++11 & Performance
A few notes I took from a lecture I attended a few days ago. I found some parts very interesting, even if basic. I am sharing them in case anyone is interested.
## Memory Proximity
Modern architectures generally have several levels of cache between the CPU and the RAM. On the computer I am currently working on, there are 3 levels of cache, as we can see with the `lscpu` command:
```
stac@debian:~>lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
...
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K
NUMA node0 CPU(s):     0-3
```
The nearest and smallest cache, L1 (64K), is divided into two parts:

- Instruction cache
- Data cache
How a CPU cache works is quite easy to understand, but behind the scenes, top-notch algorithms decide what should be put in it so that the cores do not waste time fetching memory from L2, L3 or, worse, the RAM. If the CPU needs to access a part of memory that is available in the L1 cache, fine: it is a cache hit. If not, it is a cache miss and the CPU will try to find the memory in the L2 cache. The process continues until the data is found, possibly ending up in the RAM (the slowest location).
The number of CPU cycles needed to access a piece of data in memory gives a clear overview of the costs:
| Memory location | Number of CPU cycles |
|---|---|
| CPU register | ~ 1 |
| L1 cache | 3-4 |
| L2 cache | 10-12 |
| L3 cache | 30-70 |
| L3 cache - other CPU socket | 100-300 |
| RAM | 100-150 |
| RAM - other CPU socket | 300-500 |
Hence, it is very important to strive to avoid cache misses: according to the table above, an access served from RAM costs dozens of times more cycles than one served from the L1 cache. Let's have a look at an example with an array of the following structures:
```cpp
struct Person {
    char first_name[30];
    char last_name[30];
    int age;
};
```
The goal is to calculate the average of the ages:
```cpp
#include <array>
#include <iostream>

constexpr int kPersonNumber = 100'000;

double average(const std::array<Person, kPersonNumber>& persons) {
    int sum = 0;
    for (const Person& person : persons) {
        sum += person.age;
    }
    return static_cast<double>(sum) / kPersonNumber;
}

int main(int argc, const char* argv[]) {
    std::array<Person, kPersonNumber> persons;
    for (int i = 0; i < kPersonNumber; ++i) {
        // Fill the array of Persons
    }
    std::cout << "Average: " << average(persons) << std::endl;
    return 0;
}
```
The `average` function loops over all the structures contained in the array, which leads to a cache miss on each iteration of the loop. We can use the Cachegrind tool to visualize the number of cache misses:
```
> valgrind --tool=cachegrind --branch-sim=yes --cache-sim=yes --cachegrind-out-file=chg.out ./myprog
> cg_annotate chg.out `pwd`/myprog.cpp
```
Here is the interesting part of the output:
```
==XXXX== D   refs:       2,146,050  (1,397,466 rd   + 748,584 wr)
==XXXX== D1  misses:       222,386  (  119,985 rd   + 102,401 wr)
==XXXX== LLd misses:       209,999  (  108,396 rd   + 101,603 wr)
==XXXX== D1  miss rate:       10.4% (      8.6%     +    13.7%  )
==XXXX== LLd miss rate:        9.8% (      7.8%     +    13.6%  )
...
       Ir I1mr ILmr      Dr    D1mr    DLmr      Dw   D1mw   DLmw      Bc Bcm Bi Bim file:function
...
1,000,027    3    3 300,006 100,002 100,001 200,006 99,999 99,999 200,000  10  0   0 :main
...
```
`D1mr` stands for Data L1 miss read: the number of read misses is of the same order of magnitude as the array size.
What would be the result if we changed the structure into a struct of arrays and changed the `average` function accordingly?
```cpp
struct Person {
    char first_name[kPersonNumber][30];
    char last_name[kPersonNumber][30];
    int age[kPersonNumber];
};

double average(const Person& persons) {
    int sum = 0;
    for (int i = 0; i < kPersonNumber; ++i) {
        sum += persons.age[i];
    }
    return static_cast<double>(sum) / kPersonNumber;
}
```
In this new version, the ages are stored in a contiguous block of memory. So when we access the `persons.age` array, which is 400'000 bytes, a big part of it can be served by the L1 cache on each iteration of the loop, reducing the number of cache misses, as we can see in the Cachegrind output:
```
==XXXX== D   refs:       2,271,052  (1,322,467 rd   + 948,585 wr)
==XXXX== D1  misses:       129,213  (   26,807 rd   + 102,406 wr)
==XXXX== LLd misses:       112,949  (   11,368 rd   + 101,581 wr)
==XXXX== D1  miss rate:        5.7% (      2.0%     +    10.8%  )
==XXXX== LLd miss rate:        5.0% (      0.9%     +    10.7%  )
...
     Ir I1mr ILmr      Dr  D1mr  DLmr      Dw   D1mw   DLmw      Bc Bcm Bi Bim file:function
...
900,026    3    3 225,006 6,252 3,098 400,006 93,749 93,749 125,000  10  0   0 :main
...
```
Here we can see a drastic decrease in cache misses (from roughly 100'000 down to 6'252 L1 data read misses).
Moreover, the fact that the data (in our case, the persons' ages) is stored in 400'000 contiguous bytes allows some compiler optimizations.
If we compile the second version of the program with gcc and the options `-O2 -march=native`, the `average` function will look like this:
```asm
average(Person const&):
        leaq    6000000(%rdi), %rax
        leaq    6400000(%rdi), %rdx
        xorl    %ecx, %ecx
.L2:
        addl    (%rax), %ecx
        addq    $4, %rax
        cmpq    %rdx, %rax
        jne     .L2
        ... (make the final division here)
        ret
```
Here we can see, between the label `.L2` and the jump back to `.L2`, that the ages are added 4 bytes at a time into the `%ecx` register.
Now, if we compile the program with the flags `-O3 -march=native`, let us have a look at the loop:
```asm
average(Person const&):
        ...
        shrl    $3, %ecx
.L4:
        addl    $1, %eax
        vpaddd  (%rdx), %ymm0, %ymm0
        addq    $32, %rdx
        cmpl    %eax, %ecx
        ja      .L4
        ... (make the final division here)
        ret
```
We can see that the compiler is able to pack 8 integers and to add them 8 by 8 thanks to the 256-bit AVX registers (`vpaddd` adds eight 32-bit integers at once into `%ymm0`). As a result, the number of operations needed to sum all the ages is divided by 8.
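For reference, here is roughly what that vectorized loop does, written by hand with AVX2 intrinsics. This is only a sketch to make the assembly readable (the hypothetical helper `sum_ages_avx2` assumes the element count is a multiple of 8 and requires compiling with `-mavx2`); the compiler generates something equivalent from the plain loop on its own.

```cpp
#include <immintrin.h>

// Hand-written equivalent of the auto-vectorized summing loop (sketch).
int sum_ages_avx2(const int* ages, int count) {
    __m256i acc = _mm256_setzero_si256();
    for (int i = 0; i < count; i += 8) {
        // Load 8 ints (32 bytes) and add them lane-wise, like vpaddd.
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(ages + i));
        acc = _mm256_add_epi32(acc, v);
    }
    // Horizontal reduction of the 8 lanes into a scalar sum.
    alignas(32) int lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
    int sum = 0;
    for (int lane : lanes) sum += lane;
    return sum;
}
```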
## Inlining and C++
Inlining is the decision taken by the compiler not to emit a real call to a function, but to write the function's assembly code directly at the point where it would have been called.
Here is a simple example that calculates the square of the number of arguments passed to the program. We could have chosen to calculate the square of a constant, but very aggressive compiler optimizations would have computed the result at compile time, without any need for inlining.
```cpp
#include <iostream>

int square(int num) {
    return num * num;
}

int main(int argc, const char* argv[]) {
    auto a = square(argc);
    std::cout << a << std::endl;
    return 0;
}
```
Compiling with gcc with no optimization (`-O0 -march=native`) would lead to a call to the `square` function (with the argument copied into the `%edi` register and the creation of a stack frame):
```asm
main:
        ...
        movl    %eax, %edi
        call    square(int)
        movl    %eax, -4(%rbp)
        movl    -4(%rbp), %eax
        movl    %eax, %esi
        movl    std::cout, %edi
        call    std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
        ...
```
The obvious optimization is to directly multiply the value passed to `main` by itself, without calling the function. This is what is done with the gcc options `-O1 -march=native`:
```asm
main:
        subq    $8, %rsp
        imull   %edi, %edi
        movl    %edi, %esi
        movl    std::cout, %edi
        call    std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
        ...
```
There is no call to the `square(int)` function anymore: this is inlining.
C++ compilers strive to inline as many functions as possible to improve program performance. But some C++ language features, like inheritance and polymorphism, do not fit well with this philosophy.
Let's have a look at the following program:
```cpp
#include <iostream>
#include <array>

class ValueProviderBase {
public:
    virtual int value() = 0;
};

class ConstValueProvider : public ValueProviderBase {
public:
    virtual int value() override { return 42; }
};

constexpr int kMax = 100'000;

int main(int argc, const char* argv[]) {
    std::array<ValueProviderBase*, kMax> v;
    ConstValueProvider p;
    for (int i = 0; i < kMax; ++i)
        v[i] = &p;

    int sum = 0;
    const std::size_t size = v.size();
    for (std::size_t i = 0; i < size; ++i)
        sum += v[i]->value();

    std::cout << sum << std::endl;
    return 0;
}
```
We have an array filled with instances of the child class `ConstValueProvider`. The goal is to compute the sum of the values. In this state, to call the method `value()`, the compiler has no choice but to find the method address in the right virtual table, which is done like this with no optimization:
```asm
call    std::array<ValueProviderBase*, 100000ul>::operator[](unsigned long)
movq    (%rax), %rax
movq    (%rax), %rdx
movq    (%rdx), %rdx
movq    %rax, %rdi
call    *%rdx
```
`std::array<>::operator[](unsigned long)` returns in `%rax` a pointer to the pointer to the real object in memory. We dereference it to get the object pointer, then fetch the address of the appropriate vtable into `%rdx`. Finally, we load the address of the method into the `%rdx` register, which is used to make the indirect call.
The reason for this is that, in general, the compiler has no way to know what is really stored in the array.
In our case, the array is filled with a single subtype of `ValueProviderBase`. Even a `static_cast` would not help, because there could be subclasses of `ConstValueProvider` in the array. This means that in C++98/03, the developer had no way to help the compiler.
With C++11, we have a way to tell the compiler that a class (or a particular method) won't be subclassed (or overridden): the keyword `final`. This makes a big difference, because if we tell the compiler to trust us on the real types contained in the array, it will be smart enough to avoid the vtable indirection.
```cpp
class ConstValueProvider : public ValueProviderBase {
public:
    virtual int value() override final { return 42; }
};

// ...

int main(int argc, const char* argv[]) {
    // ...
    for (int i = 0; i < kMax; ++i)
        sum += static_cast<ConstValueProvider*>(v[i])->value();
    // ...
}
```
Without optimization requested, gcc now does a direct call to `ConstValueProvider::value()`:
```asm
call    std::array<ValueProviderBase*, 100000ul>::operator[](unsigned long)
movq    (%rax), %rax
movq    %rax, %rdi
call    ConstValueProvider::value()
```
Better! And if we set an optimization level greater than 1, the call can be inlined. Better still, the result of the whole loop can be computed at compile time:
```asm
main:
        subq    $8, %rsp
        movl    $4200000, %esi
        movl    std::cout, %edi
        call    std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
```
The value `4200000` (42 × 100'000) is directly passed in `%esi` to be displayed! We got rid of any function call.
As a side note, gcc is smart enough to suggest where you should add the `final` keyword when the warning options `-Wsuggest-final-types` and `-Wsuggest-final-methods` are provided.
## R-value References & Move Semantics
From the start, C++ has copied objects by default. But copies take resources and time. So a smart way to avoid such copies was to use references, for instance when passing arguments to a function:
```cpp
void f1(X x) {}  // Pass by copy
void f2(X& x) {} // Pass by reference
```
But sometimes the copy mechanism is not appropriate for your needs. For instance, imagine a vector that grows so much that its storage has to be reallocated: all the objects are copied into the new buffer before the old one is deleted. This is not a bad thing in itself: the vector would not take the risk of moving objects around if an error could occur at some point in the process.
Another use case: you have defined an object that handles a resource, and this management must not be done by several objects. What you want here is to avoid sharing the resource ownership. What would be the meaning of sharing a `std::mutex` or a `std::thread`?
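For instance, `std::thread` is move-only: its copy constructor is deleted, and ownership of the underlying thread can only be transferred:

```cpp
#include <thread>
#include <utility>

int main() {
    std::thread t([] {});
    // std::thread u = t;         // error: the copy constructor is deleted
    std::thread u = std::move(t); // OK: ownership is transferred to u
    u.join();
    return 0;
}
```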
That's why an optimized mechanism to move objects in memory was added to the C++ standard. This mechanism relies upon the 'move constructor' and the 'move assignment operator'.
I won't dwell too long on this topic, but in terms of performance I was struck by two things:
- The fact that the STL is 'move-ready'. Basic structures like `std::string` will be automatically moved when needed. Often, compiling a program with the latest C++ standard version leads to higher performance... for free! Moreover, the containers provide APIs to avoid useless copies, like `std::vector<>::emplace_back`.
- The importance of the new C++ keyword `noexcept`. This keyword should be placed after each function that does not throw exceptions. It gives hints to the compiler that enable moving objects, as is the case when a `std::vector` has to reallocate all the contained objects while growing:
```cpp
#include <cstring>
#include <iostream>
#include <memory>
#include <utility>
#include <vector>

template <int N>
class X {
public:
    X() : buffer_(std::make_unique<char[]>(N)) {}

    X(const X& other) : buffer_(std::make_unique<char[]>(N)) {
        std::memcpy(buffer_.get(), other.buffer_.get(), N);
    }

    X& operator=(const X& rhs) {
        buffer_.reset(new char[N]);
        std::memcpy(buffer_.get(), rhs.buffer_.get(), N);
        return *this;
    }

    X(X&& other) noexcept : buffer_(std::move(other.buffer_)) {}

    X& operator=(X&& rhs) noexcept {
        buffer_ = std::move(rhs.buffer_);
        return *this;
    }

private:
    std::unique_ptr<char[]> buffer_;
};

constexpr int kMax = 300'000;

int main(int argc, const char* argv[]) {
    std::vector<X<1000>> v;
    v.reserve(kMax);
    for (int i = 0; i < kMax; ++i) {
        v.emplace_back();
    }
    // One more to force vector reallocation
    v.emplace_back();
    std::cout << v.size() << std::endl;
    return 0;
}
```
Using `noexcept` or not on the move constructor has an impact during reallocation:

- if `noexcept` is not used, `std::vector` will use the copy constructor
- if `noexcept` is used, `std::vector` will use the move constructor
In the second case, this avoids the calls to `std::memcpy`, hence the better performance.
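Under the hood, this choice is made with `std::move_if_noexcept`. Here is an illustrative sketch (the hypothetical `relocate` helper is not the actual libstdc++ code) of how a vector-like container relocates one element during reallocation:

```cpp
#include <iostream>
#include <new>
#include <string>
#include <utility>

// Sketch: relocate one element from the old buffer to the new one.
template <typename T>
void relocate(T* dst, T* src) {
    // std::move_if_noexcept yields T&& if T's move constructor is noexcept
    // (or if T is not copyable); otherwise const T&, selecting the copy
    // constructor so that a throwing move cannot corrupt the old buffer.
    ::new (static_cast<void*>(dst)) T(std::move_if_noexcept(*src));
    src->~T();  // destroy the source element once relocated
}

int main() {
    // Two raw slots standing in for the old and new vector buffers.
    alignas(std::string) unsigned char old_buf[sizeof(std::string)];
    alignas(std::string) unsigned char new_buf[sizeof(std::string)];

    auto* src = ::new (static_cast<void*>(old_buf)) std::string("hello");
    auto* dst = reinterpret_cast<std::string*>(new_buf);

    relocate(dst, src);  // std::string's move ctor is noexcept: moved, not copied
    std::cout << *dst << std::endl;
    dst->~basic_string();
    return 0;
}
```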
Without `noexcept`:
```
stac@debian:~/development/cpp-sandbox/vector>time ./a.out
300001
./a.out  0.15s user 0.20s system 99% cpu 0.353 total
```
With `noexcept`:
```
stac@debian:~/development/cpp-sandbox/vector>time ./a.out
300001
./a.out  0.05s user 0.10s system 97% cpu 0.149 total
```
## STL Containers
Choosing a container depends upon how you will use it, so it is worth knowing the containers' internal structures and the complexity of the methods/functions you will call most often.
For example, the best container for appending data is `std::deque`: since its internal structure is a contiguous array of pointers toward blocks of contiguous memory containing the objects, there is no need to reallocate the contained objects, as is the case with `std::vector`. Nevertheless, if you know the number of elements that will be stored, you can use the `std::vector<>::reserve` method, and in this case `std::vector` becomes the best choice. Conversely, due to the tree structure of the `std::set` container and the fact that the objects are always kept ordered, it is the worst container for appending.
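As a rough illustration, here is a minimal micro-benchmark sketch (timings are machine-dependent and indicative only):

```cpp
#include <chrono>
#include <deque>
#include <iostream>
#include <set>
#include <vector>

constexpr int kCount = 1'000'000;

// Time one append loop in milliseconds (rough micro-benchmark).
template <typename F>
long long time_ms(F f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}

int main() {
    std::cout << "vector:           " << time_ms([] {
        std::vector<int> v;
        for (int i = 0; i < kCount; ++i) v.push_back(i);
    }) << " ms\n";

    std::cout << "vector + reserve: " << time_ms([] {
        std::vector<int> v;
        v.reserve(kCount);  // no reallocation during the loop
        for (int i = 0; i < kCount; ++i) v.push_back(i);
    }) << " ms\n";

    std::cout << "deque:            " << time_ms([] {
        std::deque<int> d;
        for (int i = 0; i < kCount; ++i) d.push_back(i);
    }) << " ms\n";

    std::cout << "set:              " << time_ms([] {
        std::set<int> s;
        for (int i = 0; i < kCount; ++i) s.insert(i);  // kept ordered: slowest
    }) << " ms\n";
    return 0;
}
```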
A new type of container also appeared in C++11: `std::unordered_set` and its key-value counterpart `std::unordered_map`. Because of their hash-based structure, insertion is faster than in a `std::set` and lookup becomes blazingly fast compared to `std::vector` and `std::deque`.
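A quick usage sketch:

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    // Average O(1) insertion and lookup thanks to hashing,
    // at the cost of losing element ordering.
    std::unordered_map<std::string, int> ages;
    ages["alice"] = 33;
    ages["bob"] = 42;

    auto it = ages.find("bob");
    if (it != ages.end())
        std::cout << it->first << " is " << it->second << std::endl;
    return 0;
}
```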
Regarding value lookup in containers, while the `std::find` algorithm is fine for `std::vector` and `std::deque`, it is far better to use the `find` method of `std::set` and `std::unordered_set`.
Note that to get similar performance with a `std::vector` whose elements are sorted, `std::lower_bound` is the best alternative to the `std::find` algorithm.
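Here is a small sketch comparing the lookup approaches:

```cpp
#include <algorithm>
#include <iostream>
#include <set>
#include <unordered_set>
#include <vector>

int main() {
    std::vector<int> v = {1, 3, 5, 7, 9};  // sorted
    std::set<int> s(v.begin(), v.end());
    std::unordered_set<int> us(v.begin(), v.end());

    // Linear scan: O(n), fine for small or unsorted vectors.
    bool in_v = std::find(v.begin(), v.end(), 7) != v.end();

    // Binary search on a sorted vector: O(log n).
    auto it = std::lower_bound(v.begin(), v.end(), 7);
    bool in_sorted_v = (it != v.end() && *it == 7);

    // Container-aware lookups: O(log n) for set, ~O(1) for unordered_set.
    bool in_s = s.find(7) != s.end();
    bool in_us = us.find(7) != us.end();

    std::cout << in_v << in_sorted_v << in_s << in_us << std::endl;  // 1111
    return 0;
}
```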
Last but not least, it really seems that `std::list`, due to its poor performance, should not be the first container choice.
## Strings
### New ABI
The string is one of the most common objects used in programs. Its definition changed between C++98 and C++11.
In C++98, the standard specification fostered the use of reference counting to allow the copy-on-write optimization. Now that C++11 provides threading support, the standard forbids hidden state in `std::string` (reference counters need atomics or locks in a multithreaded environment).
As a consequence, `std::string` is no longer binary compatible between gcc 4 and gcc 5. Before, a `std::string` only held a pointer to a heap structure containing the size, the capacity, the reference counter and the buffer. The string object itself is now a 32-byte structure (on a 64-bit machine architecture).
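You can check this on your own setup (the value is implementation-defined; 32 is what the new libstdc++ ABI gives on a 64-bit machine):

```cpp
#include <iostream>
#include <string>

int main() {
    // Prints 32 with the gcc 5 libstdc++ ABI on x86_64.
    std::cout << sizeof(std::string) << std::endl;
    return 0;
}
```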
A noteworthy fact is the Small String Optimization (SSO): if the string contains less than 16 bytes, there is no heap memory allocation; the small buffer is stored inside the string object itself instead of in a separately allocated `char[]` on the heap.
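A common trick to observe SSO is to count heap allocations by replacing the global `operator new` (a sketch; the exact SSO threshold depends on the standard library):

```cpp
#include <cstdio>
#include <cstdlib>
#include <new>
#include <string>

static int alloc_count = 0;

// Replace the global allocation functions to count heap allocations.
void* operator new(std::size_t size) {
    ++alloc_count;
    if (void* p = std::malloc(size)) return p;
    throw std::bad_alloc();
}

void operator delete(void* p) noexcept { std::free(p); }

int main() {
    alloc_count = 0;
    std::string small = "short";  // fits in the in-object buffer
    std::printf("small: %d allocation(s)\n", alloc_count);  // 0 with SSO

    alloc_count = 0;
    std::string big = "a string clearly longer than the small buffer";
    std::printf("big:   %d allocation(s)\n", alloc_count);  // 1
    return 0;
}
```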
### C++17 std::string_view
An interesting new structure comes with C++17: `std::string_view`. Often, you have a `char[]` allocated on the heap and filled with some data. Generally, to avoid the overhead of a copy, you don't dare use `std::string` and its facilities. So you end up using the C primitives to read things in this `char[]`. Moreover, as the length of the buffer is lost (you only have a pointer to the first element of the buffer), you have to convey the size of the buffer in all the function signatures.
To cope with this limitation, `std::string_view` was introduced to provide read-only access to a buffer, along with its size and a standard API to get iterators over the buffer elements.
```cpp
#include <iostream>
#include <algorithm>
#include <string_view>

int main(int argc, const char* argv[]) {
    std::string_view sv(argv[1]);
    std::cout << "Size: " << sv.size() << std::endl;
    std::cout << "Number of A: " << std::count(sv.begin(), sv.end(), 'A') << std::endl;
    return 0;
}
```
The result:
```
stac@saturne:~>./a.out jhsfhsdAAAsdfjhsdkjAAsldkfjsdlkA
Size: 32
Number of A: 6
```
In this example, I used the constructor based on a null-terminated `char[]`, but you can also use `std::string_view` to work with raw contiguous memory.
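For instance, a sketch with a buffer that is not null-terminated (hypothetical data):

```cpp
#include <iostream>
#include <string_view>

int main() {
    const char buffer[] = {'C', '+', '+', '1', '7'};  // raw memory, no '\0'
    std::string_view sv(buffer, sizeof(buffer));      // pointer + explicit size
    std::cout << sv << " (size " << sv.size() << ")" << std::endl;
    return 0;
}
```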
The advantage of `std::string_view` is that it allows you to write C++-style code that looks less like old C, without the overhead of costly copies.