# Improving SmallVec's speed by 60% and why that shouldn't matter to you
`smallvec` is a library by the Servo team for reducing the number of allocations for dynamic arrays in the case that most of those arrays are below a certain size. For example, if you’re creating a vector that usually has fewer than 100 items but could have more, you can reduce the load on the allocator by using a `SmallVec` like so:
```rust
use std::fmt::Debug;
use smallvec::SmallVec;

fn print_iterator<T>(i: T)
where
    T: Iterator,
    T::Item: Debug,
{
    // Before...
    let vec: Vec<T::Item> = i.collect();
    // After
    let vec: SmallVec<[T::Item; 100]> = i.collect();
    println!("{:?}", vec);
}
```
The implementation is conceptually very simple, essentially looking like the following:
```rust
// Implemented for `[T; 1]`, `[T; 2]`, etc.
trait Array {
    type Item;
    fn ptr(&self) -> *const Self::Item;
    fn ptr_mut(&mut self) -> *mut Self::Item;
    fn size() -> usize;
}

enum SmallVec<T: Array> {
    Inline {
        len: usize,
        buffer: T,
    },
    Heap(Vec<T::Item>),
}
```
The real implementation is more complex for optimisation purposes, but it’s essentially what you see above. If the length is at most `T::size()` the elements are stored inline (on the stack); if it grows beyond that, the elements are copied onto the heap, incurring the cost of an allocation plus a copy of the existing elements. The main benefit comes not from avoiding allocation, but from better cache efficiency when you store `SmallVec` values in an array.
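To make the spill behaviour concrete, here’s a small sketch using the crate’s `spilled` method, which reports whether the contents have moved to the heap (the inline capacity of 4 here is just for illustration):

```rust
use smallvec::SmallVec;

fn main() {
    // Inline capacity of 4: the first four pushes stay on the stack.
    let mut v: SmallVec<[u32; 4]> = SmallVec::new();
    for i in 0..4 {
        v.push(i);
    }
    assert!(!v.spilled()); // still inline

    // The fifth push exceeds the inline capacity, so the elements are
    // copied into a freshly allocated heap buffer.
    v.push(4);
    assert!(v.spilled());
}
```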
Because `malloc` is fast, in many cases it’s actually slower to use `SmallVec` than to just use `Vec`, because the one-time cost of the initial allocation is dwarfed by the lifetime cost of `SmallVec`’s increased complexity. You can see that switching to `Vec` actually improves speed on many of `SmallVec`’s own benchmarks[^1]:
| name | smallvec.bench ns/iter | vec.bench ns/iter | diff ns/iter | diff % | speedup |
|---|---|---|---|---|---|
| extend_from_slice | 52 | 65 | 13 | 25.00% | x 0.80 |
| extend_from_slice_small | 23 | 29 | 6 | 26.09% | x 0.79 |
| extend | 200 | 69 | -131 | -65.50% | x 2.90 |
| extend_small | 39 | 28 | -11 | -28.21% | x 1.39 |
| from_slice | 199 | 36 | -163 | -81.91% | x 5.53 |
| from_slice_small | 38 | 30 | -8 | -21.05% | x 1.27 |
| insert | 1,222 | 1,231 | 9 | 0.74% | x 0.99 |
| insert_small | 121 | 114 | -7 | -5.79% | x 1.06 |
| macro_from_elem | 218 | 40 | -178 | -81.65% | x 5.45 |
| macro_from_elem_small | 60 | 32 | -28 | -46.67% | x 1.88 |
| push | 357 | 369 | 12 | 3.36% | x 0.97 |
| push_small | 57 | 53 | -4 | -7.02% | x 1.08 |
| macro_from_list | 47 | 33 | -14 | -29.79% | x 1.42 |
| pushpop | 306 | 253 | -53 | -17.32% | x 1.21 |
`Vec` is only slower for one method: `extend_from_slice`. Weirdly, `SmallVec` is even faster in the case where `extend_from_slice` must allocate. From a cursory glance at the source code for `std`, it looks like `SmallVec` might get its edge from inlining, since `Vec::extend_from_slice` isn’t marked `#[inline]`.
Even with the improvement on `extend_from_slice`, `SmallVec` is still a tough sell. Even in the case where it stays on the stack - the case where it’s supposed to have an edge - it’s still slower.
Wait, though: isn’t the only difference between the two the fact that `SmallVec` has to check whether it’s inline or heap-allocated? Why is `from_elem` (which tests the `vec![val; N]` macro and the equivalent `smallvec![val; N]` macro) more than 5 times slower for `SmallVec`, and why is `extend` almost 3 times as slow - slower even when it stays on the stack? It turns out that the library had some serious inefficiencies that are not inherent to the design. Let’s check out what `SmallVec::from_elem` and `SmallVec::extend` look like:
```rust
impl<A: Array> SmallVec<A> {
    pub fn from_elem(elem: A::Item, n: usize) -> Self {
        let mut v = SmallVec::with_capacity(n);
        v.insert_many(0, (0..n).map(|_| elem.clone()));
        v
    }
    // ...
}

impl<A: Array> Extend<A::Item> for SmallVec<A> {
    fn extend<I>(&mut self, iterable: I)
    where
        I: IntoIterator<Item = A::Item>,
    {
        let iter = iterable.into_iter();
        let (lower_size_bound, _) = iter.size_hint();
        let target_len = self.len + lower_size_bound;

        if target_len > self.capacity() {
            self.grow(target_len);
        }

        for elem in iter {
            self.push(elem);
        }
    }
}
```
Both seem pretty clear. For `extend` you reserve the expected number of elements and then run a `for` loop over the iterator, pushing each element onto the `SmallVec`. For `from_elem` you create a `SmallVec` preallocated with the number of elements you need and then insert `n` clones of `elem`. Hang on, though - what does the source code of `insert_many` look like?
```rust
impl<A: Array> SmallVec<A> {
    // ...
    pub fn insert_many<I: IntoIterator<Item = A::Item>>(&mut self, index: usize, iterable: I) {
        let iter = iterable.into_iter();
        let (lower_size_bound, _) = iter.size_hint();
        assert!(lower_size_bound <= std::isize::MAX as usize); // Ensure offset is indexable
        assert!(index + lower_size_bound >= index); // Protect against overflow
        self.reserve(lower_size_bound);

        unsafe {
            let old_len = self.len;
            assert!(index <= old_len);
            let ptr = self.as_mut_ptr().offset(index as isize);
            ptr::copy(ptr, ptr.offset(lower_size_bound as isize), old_len - index);

            for (off, element) in iter.enumerate() {
                if off < lower_size_bound {
                    ptr::write(ptr.offset(off as isize), element);
                    self.len = self.len + 1;
                } else {
                    // Iterator provided more elements than the hint.
                    assert!(index + off >= index); // Protect against overflow.
                    self.insert(index + off, element);
                }
            }

            let num_added = self.len - old_len;
            if num_added < lower_size_bound {
                // Iterator provided fewer elements than the hint
                ptr::copy(ptr.offset(lower_size_bound as isize), ptr.offset(num_added as isize), old_len - index);
            }
        }
    }
}
```
Gah! That’s a lot of code. Let’s go through it bit by bit (I’ve removed the assertions to make the logic clearer):
```rust
let (lower_size_bound, _) = iter.size_hint();
self.reserve(lower_size_bound);
```
This reserves enough space for the iterator’s self-reported lower bound on the number of elements it will produce. But hang on: we already know we have enough space, since we created the `SmallVec` with `with_capacity` and are feeding it an iterator of a known, fixed size. So this is useless - we don’t need to reserve any more space.
```rust
let old_len = self.len;
let ptr = self.as_mut_ptr().offset(index as isize);
ptr::copy(ptr, ptr.offset(lower_size_bound as isize), old_len - index);
```
This optimistically copies any existing elements to the end of the array. The function allows you to insert the new elements at any point - not just at the end - so if there are existing elements it needs to move them out of the way first[^2]. Again, these lines are useless: when we call this from `from_elem`, both `self.len` and `index` are 0, so this always copies 0 bytes (a no-op). We still waste cycles figuring that out, though.
```rust
for (off, element) in iter.enumerate() {
    if off < lower_size_bound {
        ptr::write(ptr.offset(off as isize), element);
        self.len = self.len + 1;
    } else {
        self.insert(index + off, element);
    }
}
```
For every element we do a branch. This means the compiler can’t optimise the loop into a `memcpy` (the fastest possible way to copy the bytes), because for each element we have to check whether or not we’ve reached the lower size bound. Technically the compiler could optimise this by splitting it into two loops (a transformation known as loop fission) and then noticing that the first loop is equivalent to a `memcpy`, but relying on it to do so is inadvisable because it requires multiple optimisation passes to be applied in the correct order. Again, when we call this function from `from_elem` we know it can only ever take the first branch and never the second, so the `off < lower_size_bound` check blocks optimisations and wastes time for no benefit.
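Here, roughly, is the fissioned form the optimiser would have to discover on its own - a hand-written sketch using the variables from the snippet above (and assuming `iter` was declared mutable):

```rust
// First loop: fills exactly the reserved space, with the bound check
// hoisted into the loop condition. The only branch left in the body is
// the iterator's own exhaustion check, so for trivial iterators LLVM
// can recognise the writes as a straight memcpy.
let mut off = 0;
while off < lower_size_bound {
    match iter.next() {
        Some(element) => {
            ptr::write(ptr.offset(off as isize), element);
            self.len += 1;
            off += 1;
        }
        None => break, // iterator under-delivered on its hint
    }
}
// Second loop: the slow path, only entered if the iterator produces
// more elements than its hint promised.
for element in iter {
    self.insert(index + off, element);
    off += 1;
}
```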
```rust
let num_added = self.len - old_len;
if num_added < lower_size_bound {
    // Iterator provided fewer elements than the hint
    ptr::copy(
        ptr.offset(lower_size_bound as isize),
        ptr.offset(num_added as isize),
        old_len - index,
    );
}
```
This is another branch that’s always skipped when we’re doing `from_elem`, since we know the iterator produces exactly the right number of elements every time. More wasted cycles.
I’m not going to go line by line through precisely how `Vec` does it, but you can check out the code here. The important thing is that for many types it’s about as fast as is reasonably possible. A lot of work has been put into optimising it, and it shows. There’s even a special path that uses the operating system’s ability to return zeroed memory “magically” when the element you pass in is equal to 0 (i.e. `Vec` requests a pointer to zeroed memory and the OS returns a pointer to memory that it knows should be zeroed, but doesn’t actually zero it until you try to access it). It also increases the vector’s length only once, after all the elements have been added, which is much better than our method of bumping the length on every iteration of the loop because you only do one addition - and the bigger the `Vec`, the bigger the difference that makes. The biggest benefit, though, is that it uses specialisation to use `memcpy` for any element that implements `Copy`, which means that when possible it will copy using SIMD.
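Specialisation (a nightly-only feature) lets a blanket impl be overridden with a faster one for types that satisfy extra bounds. A minimal sketch of the pattern - not std’s actual code, which lives behind internal traits like `SpecFromElem` - might look like this:

```rust
#![feature(specialization)] // nightly-only

// Hypothetical trait, purely to illustrate the pattern.
trait FillSlice<T> {
    fn fill_slice(&mut self, elem: T);
}

impl<T: Clone> FillSlice<T> for [T] {
    default fn fill_slice(&mut self, elem: T) {
        // Generic slow path: clone element-by-element.
        for slot in self.iter_mut() {
            *slot = elem.clone();
        }
    }
}

impl<T: Copy> FillSlice<T> for [T] {
    fn fill_slice(&mut self, elem: T) {
        // Fast path for `Copy` types: a plain copy loop that LLVM can
        // turn into a memset or SIMD fill.
        for slot in self.iter_mut() {
            *slot = elem;
        }
    }
}
```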
The optimisations that require specialisation are impossible to implement in `SmallVec` without requiring a nightly compiler, but it turns out that this doesn’t matter. You can just reuse `Vec::from_elem` (which is what `vec![elem; n]` desugars to) when you know the data will end up on the heap, and do a simple loop-and-write when you know it will live on the stack. Here’s what that looks like:
```rust
pub fn from_elem(elem: A::Item, n: usize) -> Self {
    if n > A::size() {
        vec![elem; n].into()
    } else {
        unsafe {
            let mut arr: A = ::std::mem::uninitialized();
            let ptr = arr.ptr_mut();

            for i in 0..n as isize {
                ::std::ptr::write(ptr.offset(i), elem.clone());
            }

            SmallVec {
                data: Inline { array: arr },
                len: n,
            }
        }
    }
}
```
We don’t need the OS’s magical ability to conjure up zeroed memory when we’re building on the stack, since that works at the granularity of a page - 4KiB by default on Linux - and if you’re creating an array that large, the cost of allocation is absolutely dwarfed by the cost of initialisation. LLVM can trivially optimise the `for` loop into a `memcpy`, and even into a `memset` when `A::Item = u8` (the comment at the top of the file lists this optimisation as a `TODO`, but if you look at the code you can see that it has already been implemented). We get all the benefits of `Vec`’s complex implementation, with the only cost being an `n > A::size()` check - and that check is especially cheap because `A::size()` is a constant. You can see the generated assembly here; I haven’t annotated it because of the sheer amount of code that was generated before, but the difference in the amount of code alone should make it clear that this method is significantly more efficient.
After this, we see a huge improvement:
| name | smallvec.bench ns/iter | vec.bench ns/iter | diff ns/iter | diff % | speedup |
|---|---|---|---|---|---|
| macro_from_elem | 66 | 50 | -16 | -24.24% | x 1.32 |
`Vec` is still faster, but now the difference is only 16ns instead of 178ns. I do wonder whether there’s some way to speed up the conversion from `Vec` to `SmallVec`, because that’s the only place the remaining difference could be coming from, and it should be essentially instant.
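Under the simplified enum from the start of the article, that conversion amounts to nothing more than moving the `Vec`’s buffer into the `Heap` variant - a sketch against that model, not the real crate’s layout:

```rust
// The Vec's heap buffer is moved, not copied, so this should compile
// down to a handful of pointer-sized moves.
impl<A: Array> From<Vec<A::Item>> for SmallVec<A> {
    fn from(vec: Vec<A::Item>) -> Self {
        SmallVec::Heap(vec)
    }
}
```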
Now for the fun one. Let’s see `SmallVec::extend` again:
```rust
impl<A: Array> Extend<A::Item> for SmallVec<A> {
    fn extend<I>(&mut self, iterable: I)
    where
        I: IntoIterator<Item = A::Item>,
    {
        let iter = iterable.into_iter();
        let (lower_size_bound, _) = iter.size_hint();
        let target_len = self.len + lower_size_bound;

        if target_len > self.capacity() {
            self.grow(target_len);
        }

        for elem in iter {
            self.push(elem);
        }
    }
}
```
So what’s wrong with this? Well, let’s take a look at the source of `push`:
```rust
pub fn push(&mut self, value: A::Item) {
    let cap = self.capacity();
    if self.len == cap {
        self.grow(cmp::max(cap * 2, 1))
    }
    unsafe {
        let end = self.as_mut_ptr().offset(self.len as isize);
        ptr::write(end, value);
        let len = self.len;
        self.set_len(len + 1)
    }
}
```
It’s the same problem again: we know we don’t need to grow for the first `lower_size_bound` iterations (because we already reserved space for that many elements), but we check anyway. Plus, we’re making LLVM’s job hard. It could figure out that because we set the capacity and then compare it against `self.len` on each iteration of the loop, it can split the loop in two and omit the check for the first `self.capacity - self.len` iterations, and from there work out that the writes can be vectorised - but that again relies on a specific set of optimisations being applied in a specific order. Also, we store to `self.len` on every iteration of the loop, and although LLVM could load `self.len` into a register, increment it there, and store it back afterwards, that’s not a simple enough optimisation for us to rely on it being performed. Besides, that optimisation cannot be applied to an iterator that may panic, since in that case we need `self.len` to be correct after each iteration of the loop: other code could observe it after the panic (even without `catch_unwind`, code in implementations of `Drop::drop` could observe it). We can do this optimisation ourselves and make the code easier for LLVM to understand:
```rust
fn extend<I>(&mut self, iterable: I)
where
    I: IntoIterator<Item = A::Item>,
{
    let mut iter = iterable.into_iter();
    let (lower_size_bound, _) = iter.size_hint();
    let target_len = self.len + lower_size_bound;

    if target_len > self.capacity() {
        self.grow(target_len);
    }

    // This section is new
    unsafe {
        let ptr = self.as_mut_ptr().offset(self.len() as isize);
        let len = self.len();
        let mut count = 0;
        for p in 0..lower_size_bound as isize {
            if let Some(out) = iter.next() {
                ::std::ptr::write(ptr.offset(p), out);
                count += 1;
            } else {
                break;
            }
        }
        self.set_len(len + count);
    }

    for elem in iter {
        self.push(elem);
    }
}
```
You can see that on the fast path (the path where `iter.size_hint()` tells the truth) we just `ptr::write` continuously. We do branch inside the loop, but for most iterators the branch can easily be removed by the compiler after monomorphisation - for example, for slices, vectors, and wrappers around either of those that preserve `size_hint`. For iterators where it can’t be removed, the `else` branch is taken at most once per run of the loop, which makes it extremely predictable even for CPUs with only basic branch predictors. We then do a “real” store to the `len` field only once, and `count` can easily be kept in a register. This optimisation isn’t defeated by an iterator that may panic, either: since we only store to memory after the loop, if the iterator panics mid-loop then the calling code just sees the value of `self.len` from before we entered the loop. We can do this transformation because we know that leaking data when panicking is fine, but LLVM cannot, because it would change the meaning of the program[^3].
We can see that the difference after this optimisation is large:
| name | smallvec.bench ns/iter | vec.bench ns/iter | diff ns/iter | diff % | speedup |
|---|---|---|---|---|---|
| extend | 102 | 70 | -32 | -31.37% | x 1.46 |
| extend_small | 25 | 28 | 3 | 12.00% | x 0.89 |
We’re much closer to `Vec`’s speed than before, and at last `SmallVec` is faster in the case where it doesn’t have to allocate.
I think the important takeaways from this are: you don’t need to perform every optimisation yourself, but you should make your code as simple and as optimisable as possible. You do, however, need to manually perform any optimisation that changes the meaning of your program. Even if an optimisation seems like it’d have a small effect, if you can’t expect the compiler to perform it then I think it’s worth trying - it may well unlock further compiler optimisations. Above all else, though, it’s important to actually benchmark your optimisations. You wouldn’t expect your code to be correct without testing it (even if that testing is just running the program and clicking around), so why would you expect your code to be fast if you don’t benchmark it? `SmallVec` actually had no comparisons to `Vec` in its benchmark suite until I wrote some for the purposes of this article. There are a lot of crates out there that currently use `smallvec` - crates.io lists 9 pages of them at the time of writing. If you’re using `smallvec` for its performance benefits, it’s worth benchmarking your program with `Vec` plus `Vec::with_capacity` instead. Usually the simpler implementation is the faster one.
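Such a comparison doesn’t take much code. Here’s a hypothetical pair of benchmarks using Rust’s nightly `#[bench]` harness - the names and workload are mine, not from the crate’s benchmark suite:

```rust
#![feature(test)]
extern crate test;

use smallvec::SmallVec;
use test::Bencher;

// Push 16 elements - exactly the inline capacity - so the SmallVec
// never spills and the Vec pays for one allocation up front.
#[bench]
fn push_smallvec(b: &mut Bencher) {
    b.iter(|| {
        let mut v: SmallVec<[u64; 16]> = SmallVec::new();
        for i in 0..16u64 {
            v.push(i);
        }
        v // return the value so the work isn't optimised away
    });
}

#[bench]
fn push_vec_with_capacity(b: &mut Bencher) {
    b.iter(|| {
        let mut v: Vec<u64> = Vec::with_capacity(16);
        for i in 0..16u64 {
            v.push(i);
        }
        v
    });
}
```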
[^1]: To make the benchmarks fair, I preallocated the vector with the same number of elements that the `SmallVec` preallocates on the stack (16). I believe this is fair because in a real program you can always easily replace calls to `Vec::new` with `Vec::with_capacity(N)`, where `N` is the number that you would have put into `SmallVec::<[T; N]>::new`. To see the exact methodology you can see this commit.
[^2]: During the writing of this article I realised that this function is actually unsound. One interesting thing about writing unsafe Rust is that safe code is never to be trusted: you can use values from safe code as hints, but it must be impossible for safe code to cause unsoundness by returning incorrect values. This means, for example, that `Iterator::size_hint` can return essentially whatever it wants. Likewise, it must be possible for safe code to panic at any point without causing unsoundness (although it’s allowed to cause arbitrarily bad behaviour otherwise, like infinite loops or leaking data). You can see a visual explanation of this bug in this gist. At the time of writing it’s still unfixed, but I opened an issue where you can track its progress.
[^3]: It’s actually possible for LLVM to do the increment in a register and emit the code that stores the length into the panic cleanup path. `Vec` does this explicitly with a `SetLenOnDrop` struct. Again, though, it’s a complex optimisation and you can’t rely on LLVM applying it.
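For reference, here is a minimal sketch of that pattern - std’s actual struct is private, so this is a reconstruction of the idea rather than its exact code:

```rust
// The length is tracked in a local while the loop runs; the real store
// to the vector's `len` field happens in `drop`, which also runs during
// panic unwinding - so the length stays correct without a store on
// every iteration.
struct SetLenOnDrop<'a> {
    len: &'a mut usize,
    local_len: usize,
}

impl<'a> SetLenOnDrop<'a> {
    fn new(len: &'a mut usize) -> Self {
        let local_len = *len;
        SetLenOnDrop { len, local_len }
    }

    fn increment_len(&mut self, increment: usize) {
        self.local_len += increment;
    }
}

impl<'a> Drop for SetLenOnDrop<'a> {
    fn drop(&mut self) {
        *self.len = self.local_len;
    }
}
```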