Thursday, September 8, 2011

The VTable Customization Hack

Recently my main focus in emscripten (the LLVM-to-JavaScript compiler) has been on the bindings generator: A tool to make it easy to use C++ code from within JavaScript. Why is this needed? Well, assume you have some C++ class,

class MyClass {
public:
  MyClass();
  virtual void doSomething();
};

The bindings generator will autogenerate bindings code so that you can do the following from JavaScript:

var inst = new MyClass;
inst.doSomething();

In other words, use that class from JavaScript almost as if it were a native JavaScript class.

Turns out that actually doing this is not easy ;) One issue is callbacks from C++ into JavaScript: Imagine that you compiled some C++ library into JavaScript, and at some point the C++ code expects to receive an object with a virtual function, which it will call. Virtual functions are a common way for a library to call back into your own code. Typically you would create a new subclass, implement that virtual function, create an instance, and pass it to the library. The library will then call that function when needed.

Why is this difficult when mixing C++ and JavaScript? The main issue is that in C++ you would be creating those new classes and functions at compile time. But in JavaScript you are doing it at runtime. Creating a new class at runtime is not simple, but it was one option I considered. However compilation speed was too much of a concern. Instead, I went for a vtable customization approach.

The vtable of a class is a list of the addresses of its virtual functions. Virtual function calls work as follows at runtime: The code goes to the object's vtable, loads the address stored at the proper index, and calls that function. So by replacing the vtable you can change what gets called. However this still turned out to be fairly difficult. The reason is that the bindings code gets you into this situation:

// 1: Original C++ code
void MyClass::doSomething();

// 2: Autogenerated C++ bindings code
void emscripten_bind_MyClass_doSomething(MyClass *self)
{ self->doSomething(); }


// C++/JS barrier

// 3: Autogenerated JS bindings code
MyClass.prototype.doSomething = function() {
  _emscripten_bind_MyClass_doSomething(this.ptr);
};

// 4: Handwritten JS code
myClassInstance.doSomething();

The top layer is the original C++ code in the library you are compiling. Next is the generated C++ bindings code. This does almost nothing, except that it is declared extern "C" so that there is no C++ name mangling. Below that is the JS bindings code, which also seems fairly trivial here, but generally speaking it handles type conversions, object caching and a few other crucial things. Finally, at the bottom is the handwritten JS code you create yourself.

So, the idea of the vtable customization hack is to receive a concrete object, then copy and modify its vtable, replacing functions as desired. The replacements can be native, normal JS functions, and presto: Your C++ library is calling back into your handwritten JS code. However, how do you modify the vtable, exactly? When your handwritten code wants to modify it, what it can specify is code on the third layer, something like this:

customizeVTable(myClassInstance, [{
  original: MyClass.prototype.doSomething,
  replacement: function() { print('hello world!') }
}]);

Here we want to replace doSomething with a custom JS function. But what appears in the vtable is not the third-layer function specified here. It isn't even the second-layer function! It's the first-layer one. How can you get to there, from here..?
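To see the mismatch concretely, here is a toy model in plain JavaScript. It assumes that function pointers are plain integers that index into a lookup table, which is roughly how emscripten handles them. The names FUNCTION_TABLE, HEAP, addFunction and virtualCall, as well as the addresses, are made up for this sketch and are not emscripten's actual internals.

```javascript
// Toy model: memory is an array of integers, and a function pointer is an
// integer index into FUNCTION_TABLE. All names here are hypothetical.
var FUNCTION_TABLE = [null]; // index 0 is reserved so that 0 can mean "null"
var HEAP = [];

function addFunction(fn) {
  FUNCTION_TABLE.push(fn);
  return FUNCTION_TABLE.length - 1;
}

var log = [];

// Layer 1: stands in for the compiled MyClass::doSomething.
function _MyClass_doSomething(self) { log.push('original'); }

// The vtable for MyClass lives at address 8 and has a single slot.
var VTABLE = 8;
HEAP[VTABLE] = addFunction(_MyClass_doSomething);

// An object is a block of memory whose first word points to its vtable.
var OBJECT = 16;
HEAP[OBJECT] = VTABLE;

// A virtual call: load the vtable, load the slot, look up the function, call it.
function virtualCall(self, slot) {
  var vtable = HEAP[self];
  var fnPtr = HEAP[vtable + slot];
  return FUNCTION_TABLE[fnPtr](self);
}

// Layer 2: stands in for the extern "C" bindings function.
function _emscripten_bind_MyClass_doSomething(self) { virtualCall(self, 0); }

// Layer 3: the JS wrapper that handwritten code sees.
function MyClass(ptr) { this.ptr = ptr; }
MyClass.prototype.doSomething = function() {
  _emscripten_bind_MyClass_doSomething(this.ptr);
};

// Layer 4: handwritten code. The call travels down through all the layers.
var inst = new MyClass(OBJECT);
inst.doSomething();

// The vtable slot refers to the layer-1 function...
var slotTarget = FUNCTION_TABLE[HEAP[VTABLE]];
// ...while the layer-3 wrapper appears nowhere in the function table.
var wrapperInTable = FUNCTION_TABLE.indexOf(MyClass.prototype.doSomething) !== -1;
```

In this model slotTarget is the layer-1 function and wrapperInTable is false, so the wrapper that handwritten code can name gives no direct way to find the slot that has to be patched.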

A natural idea is to add something to the second layer,

// 2: Autogenerated C++ bindings code
void emscripten_bind_MyClass_doSomething(MyClass *self)
{ self->doSomething(); }
void *emscripten_bind_MyClass_doSomething_addr = &MyClass::doSomething;

- basically, have the address of the function in the bindings code. You can then read it at runtime and use that. But there are a few problems here. The first is that this code won't compile! The right-hand side is a pointer to a member function, which is a two-part pointer consisting of a class and a representation of the function in the class. You can't convert that to void* (well, GCC will let you, but it won't work). Even if you do get around the compilation issue, though, you will be left with that representation of the function. I had hoped it was a simple offset into the vtable - but it isn't, at least not in Clang. After some mucking around trying to figure out what in the world it was, I realized there was a better solution anyhow, because of the other reason this approach is a bad idea: It forces you to add a lot of bindings code, a little for every single function. That's a lot of overhead, considering you will likely use that information for very few functions!

So instead, I arrived at the following hack:
  • Add a terminating 0 to all vtables at compile time. (This adds some overhead, but there is one vtable per class, and it's just one 32-bit value for each).
  • Copy the object's vtable.
  • Replace all the vtable elements in the copy with 'canary functions' that report back to you with their index in the vtable.
  • Call the function you want to replace, through the third-layer function you have available in JavaScript.
  • Since you replaced the entire vtable, you end up calling one of the canaries. The canary function then reports back by setting a value. That value is the index in the vtable of the function you want to replace.
  • Copy the vtable again. This time the only modification is to replace the function at the index you just found with the replacement function you want to run instead.
  • (There are some additional complications, for example due to how emscripten handles C++ function pointers in JavaScript - pointers to functions are just integers, like all pointers, so there is a lookup table to map them to actual JS functions. Another issue is that the third-layer JS bindings code will try to convert types, and if you pass it the wrong things it will fail, so calling the canaries must be done very carefully. But the description above is the main idea.)
This ends up working properly. You can see the code in tools/bindings_generator.js (search for customizeVTable), and you can see it used in the latest version of ammo.js (the README there has been updated with documentation for it).
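
The steps above can be sketched as a toy model in plain JavaScript. This is not the code in tools/bindings_generator.js: the heap, the function table and every helper name here (HEAP, FUNCTION_TABLE, addFunction, allocate, virtualCall) are assumptions made up for illustration, and the sketch ignores the type-conversion complications mentioned above.

```javascript
// Toy model: memory is an array of integers, a function pointer is an index
// into FUNCTION_TABLE. All names and addresses are hypothetical.
var FUNCTION_TABLE = [null]; // index 0 reserved, so 0 can terminate a vtable
var HEAP = [];
var nextFree = 8;

function addFunction(fn) {
  FUNCTION_TABLE.push(fn);
  return FUNCTION_TABLE.length - 1;
}

function allocate(words) {
  var ptr = nextFree;
  nextFree += words;
  return ptr;
}

var calls = [];

// A "compiled" class with two virtual functions and a 0-terminated vtable.
var VTABLE = allocate(3);
HEAP[VTABLE] = addFunction(function(self) { calls.push('first'); });
HEAP[VTABLE + 1] = addFunction(function(self) { calls.push('doSomething'); });
HEAP[VTABLE + 2] = 0;

function virtualCall(self, slot) {
  return FUNCTION_TABLE[HEAP[HEAP[self] + slot]](self);
}

// Third-layer wrapper for the second virtual function.
function MyClass() {
  this.ptr = allocate(1);
  HEAP[this.ptr] = VTABLE;
}
MyClass.prototype.doSomething = function() { virtualCall(this.ptr, 1); };

// Count slots up to the terminating 0.
function vtableLength(vtable) {
  var n = 0;
  while (HEAP[vtable + n] !== 0) n++;
  return n;
}

// Copy n slots plus the terminator into fresh memory.
function copyVTable(vtable, n) {
  var copy = allocate(n + 1);
  for (var i = 0; i <= n; i++) HEAP[copy + i] = HEAP[vtable + i];
  return copy;
}

function customizeVTable(obj, replacements) {
  var original = HEAP[obj.ptr];
  var n = vtableLength(original);

  // First copy: every slot becomes a canary that reports its own index.
  var reported = -1;
  var canaries = copyVTable(original, n);
  for (var i = 0; i < n; i++) {
    HEAP[canaries + i] = addFunction((function(index) {
      return function() { reported = index; };
    })(i));
  }

  // Second copy: identical to the original except for the replaced slots.
  var custom = copyVTable(original, n);
  replacements.forEach(function(r) {
    HEAP[obj.ptr] = canaries;
    reported = -1;
    r.original.call(obj); // goes through layer 3 and lands on a canary
    if (reported < 0) throw new Error('no canary was called');
    HEAP[custom + reported] = addFunction(r.replacement);
  });
  HEAP[obj.ptr] = custom;
}

var inst = new MyClass();
customizeVTable(inst, [{
  original: MyClass.prototype.doSomething,
  replacement: function() { calls.push('hello world!'); }
}]);

inst.doSomething();          // reaches the replacement
virtualCall(inst.ptr, 0);    // the other slot is untouched
var other = new MyClass();
other.doSomething();         // other instances keep the original vtable
// calls is now ['hello world!', 'first', 'doSomething']
```

Only the customized object points at the new vtable, so the class's shared vtable and all other instances behave as before.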

