WebAssembly Troubles part 3: What About Alloca?

Preamble

This is part 3 of a 4-part miniseries on issues with WebAssembly and proposals to fix them. Part 1 here, part 2 here, part 4 here. This article assumes some familiarity with virtual machines, compilers and WebAssembly, but I’ll try to link to relevant information where necessary so even if you’re not you can follow along. Also, this series is going to come off as if I dislike WebAssembly. I love WebAssembly! I wrote a whole article about how great it is! In fact, I love it so much that I want it to be the best that it can be, and this series is me working through my complaints with the design in the hope that some or all of these issues can be addressed soon, while the ink is still somewhat wet on the specification.

If you read my first article in this series, you’ll know that WebAssembly has both locals and the stack. I made the argument that locals are not just unnecessary, but are detrimental to the performance of WebAssembly in general, and that if we used the stack alone our code would be in SSA form with lifetime information automatically. However, if you’re familiar with LLVM IR or other SSA IRs you might be familiar with an operator known as alloca, which circumvents SSA form. Why does LLVM need that escape hatch while I argue that WebAssembly does not?

Well, here’s a thought experiment for you: what WebAssembly does LLVM emit for the following Rust code?

extern { fn test(_: &mut u32); }

fn main(){
    let mut a = 0;
    unsafe { test(&mut a) };
}

In LLVM’s IR this is represented using alloca, which allocates a value on the physical stack and returns a pointer to the new value. In WebAssembly, however, you cannot create pointers to locals, so alloca has to be compiled another way. Let’s have a look at the emitted WebAssembly when compiled in release mode. I’ve simplified it by hand to make it clearer what’s going on.

(module
  (import "env" "test" (func $test (param i32)))
  (func $main
    (local $local_sp i32)
    (global.set $stack_ptr
      (local.tee $local_sp
        (i32.sub
          (global.get $stack_ptr)
          (i32.const 16))))
    (i32.store offset=12
      (local.get $local_sp)
      (i32.const 0))
    (call $test
      (i32.add
        (local.get $local_sp)
        (i32.const 12)))
    (global.set $stack_ptr
      (i32.add
        (local.get $local_sp)
        (i32.const 16))))
  (memory $memory (export "memory") 17)
  (global $stack_ptr (mut i32) (i32.const 1048576)))

Gross. LLVM generates a mutable global (similar to static mut in Rust) which it then does arithmetic on manually to manipulate the stack. Accessing the stack is suppposed to be extremely fast, and in fact if you compile the Rust code directly to native it looks something like this¹:

main:
    push rax
    mov  dword ptr [rsp + 4], 0
    lea  rdi,                 [rsp + 4]
    call qword ptr [rip + test@GOTPCREL]
    pop  rax
    ret

Contrast this assembly output with the assembly generated by Firefox for the above Wasm (simplified):

main:
  sub rsp,                         0x28
  cmp [r14 + 0x28],                rsp
  ; .Lstack-overflow is defined elsewhere
  jae .Lstack-overflow 

  mov edi,                         [r14 + 0x50]
  mov eax,                         edi
  sub eax,                         0x10
  mov [rsp + 0x1c],                eax
  mov [r14 + 0x50],                eax
  mov [r15 + rax + 0xc],           0
  add edi,                         -4
  mov [rsp],                       r14
  mov rax,                         [r14 + 0x30]
  mov r14,                         [r14 + 0x38]
  mov r15,                         [r14 + 0x18]
  call rax
  mov r14,                         [rsp]
  mov r15,                         [r14 + 0x18]
  mov eax,                         [rsp + 0x1c]
  add eax,                         0x10
  mov [r14 + 0x50],                eax
  nop
  add rsp,                         0x28
  ret

Some of this is important for WebAssembly’s sandboxing, but I hope you can see that the stack_ptr global is being loaded and stored to memory every time. This isn’t the baseline compiler, this is Firefox’s optimising compiler, and as far as I can tell Firefox’s output for WebAssembly is the fastest of any browser at time of writing.

So how do we solve this?

Separate stack and heap

The fix that I originally wanted to implement for this is to either have an alloca instruction or have each function/block declare a set of variables to be stored on the physical stack. Unfortunately, this is a total non-starter. Currently pointers in Wasm are just 32-bit integers, which means that arbitrary arithmetic can be performed on them. If you want the stack pointers to refer what I call the “physical stack” you’re out of luck (what I call the physical stack is the area of memory around rsp, like we saw LLVM generate code to access when generating native code). A malicious Wasm module could just manipulate a pointer to a stack value and overwrite the return address. PNaCl implements this method and prevents the return address from being manipulated by masking the return address when the function returns, which ensures that you can only jump to known-safe addresses. Unfortunately, manipulating the return address like this means that every single function call results in a pipeline stall since the CPU can’t start speculatively executing the code that we’re returning to.

That’s not the only problem though. Currently, the Wasm heap is untyped, which is very important for efficient compilation of languages like C and Rust that rely on viewing the heap as a blob of bytes. Heap pointers are just indices into a blob of memory. If the heap and stack were separate, there would be no good way to compile this code:

fn main() {
    let foo = Box::new(0);
    let bar = 1;

    let some_vec = vec![&*foo, &bar];
}

Here we’re storing one heap pointer and one stack pointer in an array which is itself stored on the heap. Separated stack and heap pointers means that this is impossible - you either need the heap to be typed or you need to allow manipulating the pointer to the stack value, the former is extremely slow and the latter could cause the sandboxed code to overwrite memory that it shouldn’t have access to.

A variant of this idea that would be more likely to work is to have our method of specifying variables to be stored on the physical stack also specify a linear memory section to store the stack values in. This would basically have the same semantics as the current implementation generated by LLVM but it would give the Wasm runtime more information, making it easier to optimise. However, I don’t believe that this would produce better code than option 2:

Explicit stack pointer

“So”, you ask, “why not just reserve a register for this global variable?". Well I’m glad you asked, straw man in my head. Although it wouldn’t solve every problem, having the stack pointer be stored in a global variable would improve performance significantly and allow for much better code. It still couldn’t match native, but it would get a lot closer.

The issue is that the global used for the stack pointer isn’t marked. The Wasm->native compiler doesn’t know that stack_ptr is the stack pointer and so needs to be fast to access.

So, the second possibility - which I believe has been suggested before although I can’t find the issue where it was mentioned - is to explicitly mark one or more globals as “hot”, which can act as a hint to the compiler that a register should be reserved for this global. The stack pointer can then be marked as “hot” and WebAssembly runtimes that don’t care about maximising performance (like the spec interpreter or wasmi) can ignore this hint. The biggest downside to this scheme is that it forces an ambient loss of a scratch register - simply having “hot” and “cold” globals is very course-grained and means that (for example) the Wasm can’t explicitly state that this global is cold in certain functions, which would allow those functions to use that register as scratch as long as they save and restore it². If your compiler has a decent instruction-choosing algorithm then codegen with this scheme should look relatively similar to the native code generated by LLVM. Unlike some complicated optimisations, however, I would consider implementing a good instruction choice algorithm to be the main job of a compiler from WebAssembly to native. It wouldn’t require extra, stack pointer-specific work in order to get this to compile to good code.

Personally, I believe that the second option is the best.

Why isn’t this implemented already?

Unlike the past two articles, this is pretty clearly not a fundamental design issue that should have been fixed for the MVP. This can easily be added in a backwards-compatible way - existing code can’t take advantage of this optimisation but any new code will be able to. Hopefully this article gave you some insight into why the stack isn’t quite as optimised as it should be in contemporary WebAssembly runtimes.

Join me next time where I finally inject some positivity into this series, and propose a medium-term solution to these issues and more.

The push and pop are to reserve space on the stack, it’s better if you need to reserve <8 bytes to use push and pop instead of doing arithmetic on the stack pointer yourself. ↩︎
Why not simply have every function mark which globals are hot and which are cold? Well, we have code size constraints, and having to say this on every function will bloat the format. More bloat means more bytes have to be transferred over the wire, which is something that the WebAssembly format is explicitly designed to prevent. Maybe a future extension to the format could have globals either hot or cold by default, allowing functions or even blocks to override that if they so wish. ↩︎