Stefan Fehrenbach

Fiftyfive bytes of WebAssembly

I implemented a persistent hash map in JavaScript some time ago. It uses the popCount function to count the number of 1-bits in the binary representation of a number to save space, it’s really cool.1 JavaScript’s standard library (unlike some other languages) does not have this function defined, so people use this implementation (which is from Hacker’s Delight, I think).

To be honest, I only have a vague idea of how it works and currently no time to think about it. However, I do know that there’s a POPCNT instruction in my CPU (apparently introduced with SSE4 around 20072).

I also recently read a nice article about the future of WebAssembly.3 Before, I did not really view WebAssembly very favourably. JavaScript is attractive as a compilation target because you get a high quality JIT compiler and runtime system for free. This is not yet true of WebAssembly and I don’t want to execute C++ applications in a browser.

It turns out that WebAssembly also has a POPCNT instruction, though. Wouldn’t it be cool to use that instead of three lines of bittwiddling?

Off I go to write the following WebAssembly program in the WebAssembly s-expression representation. This defines a module with one exported function called popcnt that takes a 32 bit integer and returns a 32 bit integer. The body of the function applies the 32 bit integer popcnt instruction to local 0 — the function’s first argument.

  (func (export "popcnt") (param i32) (result i32)
    (i32.popcnt (get_local 0))))

Now, the typical workflow for serious WebAssembly use is way overkill. You use a compiler to compile your code (C, C++, Rust) to WebAssembly bytecode. You probably need some special cross compilation stuff too. Then you put the bytecode file onto some webserver somewhere, so your browser can see it and load it over the network. Then you need to feed the bytestream to the WebAssembly compiler in the browser.

I don’t want to write a C/C++/Rust program for a single instruction and install their WebAssembly toolchains. There is a small tool to turn s-expression .wat files into a .wasm bytecode files called wat2wasm. I don’t want to install that either. Luckily there’s a demo website where you can also download the compiled .wasm file.4

I do that and use some Emacs macros to turn the binary file into a JavaScript source representation of an array of bytes. My WebAssembly popcount function comes out at 55 bytes, and they look like this:

Fun fact: the sequence 70 6f 70 63 6e 74 on the second line spells the name of the exported function: popcnt. (The other bytes probably mean something, too.)

Great! We have some bytes without having to set up a webserver somewhere. Now let’s feed them to WebAssembly. We instantiate something with our bytecode in bytes and an empty record of JavaScript bindings to expose on the WebAssembly side. We get back a Promise of something which has an instance which has the exports of our WebAssembly module. In our case there’s only popcnt.

I have no idea how to synchronize on a Promise to go from async land back into sync land, so imagine the rest of the code being inside the callback.

What can we do? Call popcnt! And it works!

We can also compare the JavaScript popCount against the WebAssembly popcnt.

Here are some numbers out of this highly scientific benchmark:

106 calls in WASM (ms) JS (ms)
Firefox 62 17 306
nodejs v10.11.0 18 7
Chromium 69 19 8
Firefox 63 7 300

Cool to see that Firefox 63 did indeed make WebAssembly calls faster.5 Shame though that the JavaScript version is so much worse and the WebAssembly version is still only as fast as what v8 manages to generate from the JavaScript.

Future work procrastination: check what all of this actually compiles to. Obviously these are not proper benchmarks. However, the numbers are all suspiciously close. It would be a shame to go through all of this trouble only to find out that they just use the same software implementation of POPCNT. On the other hand, I would be quite impressed if it turns out that v8 actually recognizes the software implementation and emits a POPCNT instruction. Also, check whether they work around the performance bug in Intel’s implementation.6

Find the code here.

  1. “Optimizing Hash-Array Mapped Tries for Fast and Lean Immutable JVM Collections” (Steindorfer and Vinju, OOPSLA 2015,