Saturday, September 24, 2011

Road to Emscripten 2.0

Lots and lots of work has been taking place on emscripten (the LLVM-to-JS compiler). I haven't been breaking things out into smaller releases; instead, there will be a 2.0 release in the near future (the last release was 1.5). The remaining issues before then are:
  • Fix any remaining regressions in the llvm-svn branch compared to the master branch, and merge llvm-svn to master. The llvm-svn branch uses LLVM's svn, which will soon become 3.0. Some of the code changes in LLVM have hurt our generated code, but most of those issues are now fixed. The latest update is a fix for exception handling, which has led to a 5% smaller ammo.js build (compared to 2.9), with no speed decrease :)
  • Bundle headers with emscripten. As more people have begun to use emscripten, we have been seeing more platform-specific problems, almost all due to differing system headers (for example, issues #82 and #85 on github). Bundling working headers will fix that, much as SDKs and NDKs typically bundle a complete build environment. The current plan is to use the newlib headers for libc and start from there.
As mentioned in a previous blog post, emscripten 2.0 will require LLVM 3.0, and will no longer officially support the deprecated llvm-gcc compiler (it might still work, though). This is a significant change that may affect projects using emscripten, so please let me know if you are aware of any issues there.

Note also that I will be merging llvm-svn to master before LLVM 3.0 goes stable. That means you will need to build LLVM from source at that point, since there won't be official LLVM 3.0 binaries yet. Of course, you will still be able to use a previous revision of emscripten, which works with LLVM 2.9, with no issues.

Friday, September 9, 2011

LLVM 3.0, llvm-svn Branch

LLVM 3.0 will probably be released in about a month. In preparation for that, I've gotten emscripten to work properly with LLVM svn in the llvm-svn branch.

As in the past, I intend to support only one version of LLVM at a time, since it takes too much effort to do more: our automatic tests already require several hours, and doubling that for another LLVM version would be a huge burden.

With LLVM 3.0 it looks like llvm-gcc is pretty much obsolete. It isn't being developed much, and it remains on gcc 4.2 (I am guessing due to Apple's aversion to the GPL3?). There is Dragonegg, a plugin for recent GCC versions that uses LLVM as the backend, but as of 2.9 Dragonegg is not considered mature. I am not sure whether 3.0 will be sufficiently stable.

So the question is what compilers to use with LLVM 3.0. Clang goes without saying. The good news is that Clang can finally build all of the source code in the emscripten automatic tests, which is very nice (although I did need to file a bug last week for libc++ - which is kind of ironic considering it's an LLVM project ;) but kudos to the LLVM people for the quick fix). As for other compilers, llvm-gcc seems of little importance if Clang can compile the same code, since llvm-gcc is deprecated. Dragonegg is interesting, but I am not sure it makes sense to use it before it is fully ready - we might end up wasting a lot of time on bugs.

So my current plan is to move to a single compiler, Clang, in the emscripten test suite. That is what is currently done in the llvm-svn branch. The main risk here is code that gcc can compile but Clang cannot. Is anyone aware of any significant cases of that? In particular I am curious about Python: I have not tried to compile it with Clang yet (the emscripten test suite has a prebuilt .ll file - we should fix that).

If no one raises any concerns about this plan, I'll merge the llvm-svn branch into master in the near future.

Thursday, September 8, 2011

The VTable Customization Hack

Recently my main focus in emscripten (the LLVM-to-JavaScript compiler) has been on the bindings generator: A tool to make it easy to use C++ code from within JavaScript. Why is this needed? Well, assume you have some C++ class,

class MyClass {
public:
  MyClass();
  virtual void doSomething();
};

The bindings generator will autogenerate bindings code so that you can do the following from JavaScript:

var inst = new MyClass;
inst.doSomething();

In other words, you can use that class from JavaScript almost as if it were a native JavaScript class.

It turns out that really doing this is not easy ;) One issue is callbacks from C++ into JavaScript: Imagine that you compiled some C++ library into JavaScript, and at some point the C++ code expects to receive an object that has a virtual function, which it will call. The virtual function is a common design pattern that basically lets you get a callback to your own code. Typically you would create a new subclass, implement that virtual function, create an instance, and pass it to the library. That function will then be called when needed from the library.

Why is this difficult when mixing C++ and JavaScript? The main issue is that in C++ you would be creating those new classes and functions at compile time, but in JavaScript you are doing it at runtime. Creating a new class at runtime is not simple, but it was one option I considered. However, compilation speed was too much of a concern. Instead, I went for a vtable customization approach.

The vtable of a class is a list of the addresses of its virtual functions. At runtime, a virtual call works as follows: the code looks up the object's vtable, goes to the proper index in it, loads the address there, and calls that function. So by replacing the vtable you can change what gets called. However, this still turned out to be fairly difficult. The reason is that the bindings code gets you into this situation:

// 1: Original C++ code
void MyClass::doSomething();

// 2: Autogenerated C++ bindings code
void emscripten_bind_MyClass_doSomething(MyClass *self)
{ self->doSomething(); }

// C++/JS barrier

// 3: Autogenerated JS bindings code
MyClass.prototype.doSomething = function() {
  _emscripten_bind_MyClass_doSomething(this.ptr);
};

// 4: Handwritten JS code
myClassInstance.doSomething();

The top layer is the original C++ code in the library you are compiling. Next is the generated C++ bindings code. It does almost nothing, except that it is defined as extern "C" so that there is no C++ name mangling. Below that is the JS bindings code, which also seems fairly trivial here, but generally speaking it handles type conversions, object caching and a few other crucial things. Finally, at the bottom is the handwritten JS code you create yourself.
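To make the object caching concrete, here is a minimal sketch of what a third-layer binding might do when a function returns an object. The helper names (wrapPointer, objectCache) and the stubbed second-layer function are illustrative assumptions, not emscripten's actual internals:

```javascript
// Illustrative stand-ins, not emscripten's real internals.
function OtherClass() {}

var objectCache = {}; // raw pointer -> JS wrapper, so one C++ object
                      // always maps to the same JS object

function wrapPointer(ptr, Class) {
  if (ptr === 0) return null; // a NULL pointer becomes null
  if (objectCache[ptr]) return objectCache[ptr];
  var obj = Object.create(Class.prototype);
  obj.ptr = ptr;
  objectCache[ptr] = obj;
  return obj;
}

// Pretend second-layer function that returns a raw pointer.
function _emscripten_bind_MyClass_getOther(selfPtr) { return 1234; }

// A third-layer binding: pass raw pointers down, wrap the raw result.
function getOther(self) {
  return wrapPointer(_emscripten_bind_MyClass_getOther(self.ptr), OtherClass);
}

var a = getOther({ ptr: 5678 });
var b = getOther({ ptr: 5678 });
console.log(a === b); // true: same C++ pointer, same JS wrapper
```

The caching matters because C++ code hands back the same pointer repeatedly, and handwritten JS code expects object identity to hold.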

So the idea of the vtable customization hack is to receive a concrete object, then copy and modify its vtable, replacing functions as desired. The replacements can be native, normal JS functions, and presto: your C++ library is calling back into your handwritten JS code. But how exactly do you modify the vtable? When your handwritten code wants to modify it, what it can specify is code on the third level, something like this:

customizeVTable(myClassInstance, [{
  original: MyClass.prototype.doSomething,
  replacement: function() { print('hello world!') }
}]);

Here we want to replace doSomething with a custom JS function. But what appears in the vtable is not the third-layer function specified here. It isn't even the second-layer function! It's the first-layer one. How can you get there from here?

A natural idea is to add something to the second layer,

// 2: Autogenerated C++ bindings code
void emscripten_bind_MyClass_doSomething(MyClass *self)
{ self->doSomething(); }
void *emscripten_bind_MyClass_doSomething_addr = &MyClass::doSomething;

- basically, have the address of the function in the bindings code. You can then read it at runtime and use it. But there are a few problems here. The first is that this code won't compile! The right-hand side is a pointer to a member function, a two-part value consisting of a class and a representation of the function within that class. You can't convert that to void* (well, GCC will let you, but it won't work). Even if you do get around the compilation issue, though, you will be left with that representation of the function. I had hoped it was a simple offset into the vtable - but it isn't, at least not in Clang. After some mucking around trying to figure out what in the world it was, I realized there was a better solution anyhow, because of the other reason this approach is a bad idea: it forces you to add a little bindings code for every single function. That's a lot of overhead, considering you will likely use that information for very few functions!

So instead, I arrived at the following hack:
  • Add a terminating 0 to all vtables at compile time. (This adds some overhead, but there is one vtable per class, and it's just one 32-bit value for each).
  • Copy the object's vtable.
  • Replace all the vtable elements with 'canary functions' that report back their index in the vtable when called.
  • Call the function you want to replace, through the third-layer function you have available in JavaScript.
  • Since you replaced the entire vtable, you end up calling one of those. The canary function then reports back by setting a value. That value is the index of the function you want to replace in the vtable.
  • Copy the vtable again, this time the only modification is to replace the function at the index that you just found with the replacement function you want run instead.
  • (There are some additional complications, for example due to how emscripten handles C++ function pointers in JavaScript - pointers to functions are just integers, like all pointers, so there is a lookup table to map them to actual JS functions. Another issue is that the third-layer JS bindings code will try to convert types, and if you pass it the wrong things it will fail, so calling the canaries must be done very carefully. But the description above is the main idea.)
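The steps above can be sketched in JavaScript. This is a toy model, not emscripten's actual output: it assumes a flat HEAP array for memory and function pointers that are indices into a FUNCTION_TABLE, with pointer 0 kept free so that 0 can terminate a vtable. The real implementation is in tools/bindings_generator.js:

```javascript
var HEAP = [];
var FUNCTION_TABLE = [null]; // keep pointer 0 free: 0 marks the vtable's end

function addFunction(f) {
  FUNCTION_TABLE.push(f);
  return FUNCTION_TABLE.length - 1;
}

function virtualCall(objPtr, index) {
  // object's first slot -> vtable; vtable[index] -> function pointer -> call
  return FUNCTION_TABLE[HEAP[HEAP[objPtr] + index]](objPtr);
}

// "Compiled" class with two virtual functions and a 0-terminated vtable.
var vtable = HEAP.length;
HEAP.push(addFunction(function (self) { return 'other'; }));    // index 0
HEAP.push(addFunction(function (self) { return 'original'; })); // index 1
HEAP.push(0); // the terminating 0 added at compile time
var obj = HEAP.length;
HEAP.push(vtable);

// Stand-in for the third-layer JS binding that reaches doSomething
// (in real compiled code the vtable index is baked into the binding).
function callDoSomething(objPtr) { return virtualCall(objPtr, 1); }

function customizeVTable(objPtr, callTarget, replacement) {
  var oldVTable = HEAP[objPtr];
  var foundIndex = -1;
  // Copy the vtable, replacing every entry with a canary that reports
  // its own index when called.
  var canaryVTable = HEAP.length;
  for (var i = 0; HEAP[oldVTable + i] !== 0; i++) {
    (function (index) {
      HEAP.push(addFunction(function () { foundIndex = index; }));
    })(i);
  }
  HEAP.push(0);
  // Call the target through the normal binding; the canary that fires
  // records the vtable index we were looking for.
  HEAP[objPtr] = canaryVTable;
  callTarget(objPtr);
  // Copy the vtable again, swapping in the replacement at that index.
  var newVTable = HEAP.length;
  for (var j = 0; HEAP[oldVTable + j] !== 0; j++) {
    HEAP.push(j === foundIndex ? addFunction(replacement) : HEAP[oldVTable + j]);
  }
  HEAP.push(0);
  HEAP[objPtr] = newVTable;
}

customizeVTable(obj, callDoSomething, function (self) { return 'replaced'; });
console.log(virtualCall(obj, 1)); // 'replaced'
console.log(virtualCall(obj, 0)); // 'other' - untouched entries still work
```

Note that at no point did we need to know the vtable index of doSomething in advance: the canary call discovers it for us.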
This ends up working properly. You can see the code in tools/bindings_generator.js (search for customizeVTable), and you can see it used in the latest version of ammo.js (the README there has been updated with documentation for it).