At the "hardware" side of the emulator, we use simple C++ wrappers to emulate the MCU hardware registers, basically meaning that all direct reads and writes to them are "trapped" with C++ syntactic sugar. [...]
What you are doing is writing a hardware emulator/simulator.
What I am doing is writing a call level emulator.
Right. Both are good, if you have a reasonable HAL. (At Ell-i we don't have any reasonable HAL yet.) They serve different purposes.
My decision not to write a hardware emulator was made to reduce overhead and complexity. One of the goals is to support large virtual networks, so reducing overhead seemed important.
I see. However, you have to be very careful about what you do, as that affects what you get. Unfortunately I'm not an expert there. Years ago I produced small Linux OpenWRT images and ran a few tens of them under VMware on a laptop -- that worked nicely. But it would never have scaled up to the hundreds or millions of nodes which you need today. Furthermore, when you start scaling up, even the IPC interface may turn out to be a bottleneck, depending on your goals.
The bottom line is that it would be good to understand what you mean, exactly, with "large" and "virtual" here. Cf. e.g. 
I always thought that, if I had enough time, adding hardware emulation would be a "nice to have" to allow testing of drivers.
That's exactly what we do.
But then, parsing the memory and emulating the actual hardware also looked like it would become kind of tedious.
That was one approach I considered, but we chose another way, just like you suggest:
In any case, I now tend to think that dummy interfaces for unit tests would be a more rewarding approach than having them actually do something (like, for example, native's interfaces do), if testing drivers is the goal.
We do exactly that. Instead of emulating the actual hardware a la qemu, we emulate the hardware API. That is, we create C++ wrapper objects for each peripheral register. As in the STM32 world the peripherals themselves are represented as structs of uint32s, in our case the corresponding emulated peripheral is a struct of C++ objects.
The trick here is to write in Clean C; i.e., produce C code that can also be compiled with a C++ compiler. 
Now, with this, when a driver is accessing a register directly, instead of doing a memory write or read, as would happen in real hardware, the C++ compiler generates a call to the corresponding overloaded member operator-function. This member function then reads or sets the actual value, and produces any desired side effects.
Here is a reasonable example:
(That code would benefit from some cleanups, but you get the idea.)
As I wrote, this does not work with indirect access. We could probably make it work with indirect access as long as no explicit casts are used, but IMHO that would not be worth the effort. It is easier to keep the amount of indirect access to a minimum in the drivers, and handle it case by case for testing.
One of the advantages of the current approach is that the native platform is treated exactly like any other platform by the build system.
Right. That should be the goal.
But are you there already? Have you checked that your linker scripts are sufficiently identical? Have you disabled using shared libraries at the native side? Have you the right compiler flags there to by-default prevent the native compilation from using host-local header files? etc.
And, as you have noticed, the startup sequences are different.
The bottom line is that "treating the native platform _exactly_ like any [MCU] platform" is not trivial. It is more complex than it may look at the outset. If you want to do it really properly, you have to have a "cross compiler" for the native environment, meaning that you build a separate tool chain that uses different include files and different libraries, you have your own "boot-time" routines, etc. And you need to think very carefully about what happens when you launch a native application, i.e. emulate the boot sequence.
Trapping of signals already works in RIOT native as it is.
Right, but you do that from a library that is linked into the binary, i.e. something that is "not exactly" like on any MCU platform.
For example ctrl+c is used to gracefully exit the process.
BTW, what do you mean by "graceful" here, exactly? Do you have a signal handler that e.g. cleans up any external files created for the emulator? Or are you just relying on the underlying kernel doing the mostly-right thing?
Also, I was planning to add additional signal handlers for debugging (maybe USR1 already does something which I forgot to take out again..), and a separate socket interface for state setting/getting and event triggering (buttons/GPIO/..).
You can do those kinds of things equally easily with a separate launcher or a library that you link into your binary.
The philosophical or architectural difference between a launcher and a linked-in library is in who is in control. With a launcher, you have two "mains", the launcher main and the user main in the application shared library. When you link a library into the binary, you have only one "main", as you have noticed, and you have less control over what happens when the executable is launched.
BTW, you can also build a launcher that is able to dynamically load an executable instead of loading a shared library, but that requires somewhat more work. That's the main reason why I recommend a shared library. From that point of view, the main difference is in the command line that you use to link the object files into an executable or into a shared library.
In any case, once you have loaded a shared library into your process address space, there is very little difference between the code-that-initiated-the-load and the code-that-was-loaded. The runtime situation is almost identical to a situation where you would just have launched a single binary, with everything statically linked in.