Distributed Messaging

Heyya,

(if you're working on stabilizing for the release, don't read this message until after the release)

I've been thinking about how to implement distributed messaging, i.e., IPC between threads on different nodes.

One idea I had was to just launch a proxy thread per connection and then use the normal IPC API.

Example in Pseudocode:

proxy_tid = setup_distributed_ipc(<whatever_network_address_of_other_node>, <TID_of_target_process>)

... Starts a new thread and sends something to the other node that starts a corresponding proxy thread there. The proxy thread then just relays all messages over the network to the other proxy thread, which in turn relays them to <TID_of_target_process>. The other side can send messages in the other direction the same way.
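Very roughly, the two relay loops could look something like this (just a sketch: msg_t / msg_receive() / msg_send() are our normal IPC primitives, everything prefixed net_ is a placeholder for whatever transport connects the nodes):

    #include <stdint.h>
    #include <stddef.h>

    typedef uint32_t net_addr_t;  /* placeholder node address */
    extern int net_send(net_addr_t to, const void *buf, size_t len);  /* placeholder transport */
    extern int net_recv(void *buf, size_t len);

    typedef struct {
        net_addr_t remote;      /* peer node running the other proxy */
        int        target_tid;  /* TID of the target thread on that node */
    } proxy_ctx_t;

    /* initiating side: forward every local IPC message to the peer proxy */
    void proxy_out_loop(proxy_ctx_t *ctx)
    {
        msg_t m;
        while (1) {
            msg_receive(&m);                       /* block for a local message */
            net_send(ctx->remote, &m, sizeof(m));  /* relay it over the network */
        }
    }

    /* accepting side: inject every network message into local IPC */
    void proxy_in_loop(proxy_ctx_t *ctx)
    {
        msg_t m;
        while (1) {
            net_recv(&m, sizeof(m));        /* block for a message from the peer */
            msg_send(&m, ctx->target_tid);  /* deliver it like a local message */
        }
    }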

A more sophisticated proxy thread could implement other messaging schemes like broadcasting.

For distributed IPC to make sense, we need to extend messaging to support sized messages (e.g., bigger than 4 bytes / one value), as pointers don't make sense in a distributed environment.

What do you think?

Cheers, Kaspar

Hi,

I'm all for both: increasing the possible payload size and remote message delivery.

I guess it has some interesting challenges too .. for example all the challenges of routing :wink:

One immediate challenge (for the simple proxy thread/process without broadcasting) though: Where does TID_of_target_process come from?

Cheers, Ludwig

Hi,

I'm all for both: increasing the possible payload size and remote message delivery.

Awesome. After solving our little toolchain-building challenge, these'll be the first things I'll get my hands dirty with.

I guess it has some interesting challenges too .. for example all the challenges of routing :wink:

Uiiii. Isn't that some distant network-layer thing? :wink:

One immediate challenge (for the simple proxy thread/process without broadcasting) though: Where does TID_of_target_process come from?

Good question. We might need some TID-agnostic service publishing on the same layer that accepts the request from the proxy instance of the initiating node. Couldn't that be a service waiting on a specified TCP port?
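Just to make the idea concrete, such a registry could be as simple as a name-to-TID table that the accepting side queries (a sketch only, all names invented):

    #include <string.h>

    #define SERVICE_NAME_MAX 16
    #define SERVICE_SLOTS     8

    typedef struct {
        char name[SERVICE_NAME_MAX];  /* published service name */
        int  tid;                     /* local TID serving it */
    } service_entry_t;

    static service_entry_t services[SERVICE_SLOTS];

    /* a local thread publishes itself under a name */
    int register_service(const char *name, int tid)
    {
        for (int i = 0; i < SERVICE_SLOTS; i++) {
            if (services[i].name[0] == '\0') {
                strncpy(services[i].name, name, SERVICE_NAME_MAX - 1);
                services[i].tid = tid;
                return 0;
            }
        }
        return -1;  /* table full */
    }

    /* the accepting proxy resolves the name it got over the network */
    int lookup_service(const char *name)
    {
        for (int i = 0; i < SERVICE_SLOTS; i++) {
            if (strncmp(services[i].name, name, SERVICE_NAME_MAX) == 0) {
                return services[i].tid;
            }
        }
        return -1;  /* unknown service */
    }

The initiating node would then send a service name instead of a raw TID, and the TID never has to leave its node.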

Mmmmh, challenges.

Cheers, Kaspar

It appears to be a rite of passage for a research operating system to develop a strange, non-functional and non-interoperating networking stack at some point in its life. In entertainment, that would be called "jumping the shark" (quid google).

Don't do that.

Proxying IPC to a remote system may be a useful tool, but there are a number of dangers:

-- don't put things in the wrong layer (should the IPC layer be concerned with network addresses? Application names?);

-- make sure you have a place to handle the leakiness of the abstraction (distributed systems are different from local access);

-- don't change everything in the abstraction you are "extending" (length, pointers, ...);

-- don't move to the wrong focus (is RIOT about distributed computing?).

Grüße, Carsten

Hey,

Proxying IPC to a remote system may be a useful tool, but there are a number of dangers:

-- don't put things in the wrong layer (should the IPC layer be concerned with network addresses? Application names?);

I totally agree.

The whole idea of proxying was to not have to change the IPC layer and to keep on-node messaging as small as possible, i.e., to make no compromises for embedded systems without networking / distribution.

With inter-node messaging as an option using proxy mechanisms, it can be used wherever needed, without any drawbacks when not needed. All the networking and application-name stuff should be optional and not in the core IPC code.

Also, the proxying can be hand-tailored to whatever routing / networking layer is present.

-- make sure you have a place to handle the leakiness of the abstraction (distributed systems are different from local access);

I think most of the issues can be handled in the proxy code.

But an application definitely has to be written with distribution in mind.

Right now, most apps will wait forever if they don't get the desired reply to a message. The programmer has to take care of a possibly unexpected reply (e.g., from the proxy, "timeout" or "other node down" or "network error").
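Sketched with invented PROXY_ERR_* message types (assuming our normal msg API), the distribution-aware pattern could look like this:

    enum {
        MSG_REPLY_OK = 1,     /* normal reply from the peer */
        PROXY_ERR_TIMEOUT,    /* proxy gave up waiting */
        PROXY_ERR_NODE_DOWN,  /* peer node unreachable */
        PROXY_ERR_NETWORK,    /* transport-level failure */
    };

    void await_reply(void)
    {
        msg_t reply;
        msg_receive(&reply);   /* blocks; the proxy guarantees *some* answer */
        switch (reply.type) {
        case MSG_REPLY_OK:
            /* normal case: handle the actual payload */
            break;
        case PROXY_ERR_TIMEOUT:
        case PROXY_ERR_NODE_DOWN:
        case PROXY_ERR_NETWORK:
            /* the proxy answered, not the peer: retry or give up */
            break;
        }
    }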

-- don't change everything in the abstraction you are "extending" (length, pointers, ...);

Right now, IPC depends on a shared address space between threads to be usable. With minor changes (and the same API), we can send some data along with a message. As far as I can see, this is the only change needed in the core messaging code.
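For illustration only (the field names and the 8-byte size are invented), the message struct could grow an inline buffer next to the existing value/pointer union, so the API stays the same and the whole struct can be shipped over the wire verbatim:

    #include <stdint.h>

    typedef struct {
        int      sender_tid;       /* filled in by the kernel, as before */
        uint16_t type;
        uint16_t len;              /* valid bytes in content.data[] */
        union {
            uint32_t value;        /* classic one-value message */
            void    *ptr;          /* only meaningful on-node */
            uint8_t  data[8];      /* small inline payload, safe to serialize */
        } content;
    } msg_t;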

-- don't move to the wrong focus (is RIOT about distributed computing?).

Well, aren't WSNs inherently distributed systems? :slight_smile:

Many embedded systems consist of more than one MCU. Distributed IPC might be an easy way to link them up from a programmer's perspective: connect SPI or serial, write a minimal serial-to-IPC proxy, and have them share resources via IPC.
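For the serial case, such a proxy could frame messages with a simple length prefix (uart_write() / uart_read_byte() are placeholders for whatever the driver offers):

    #include <stdint.h>

    extern void uart_write(const uint8_t *buf, uint8_t len);  /* placeholder driver */
    extern uint8_t uart_read_byte(void);

    /* ship one message: 1 length byte, then the payload */
    void serial_send_msg(const uint8_t *data, uint8_t len)
    {
        uart_write(&len, 1);
        uart_write(data, len);
    }

    /* receive one message into buf, return its length */
    uint8_t serial_recv_msg(uint8_t *buf)
    {
        uint8_t len = uart_read_byte();
        for (uint8_t i = 0; i < len; i++) {
            buf[i] = uart_read_byte();
        }
        return len;
    }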

I've always been fascinated with distributed systems. I'll experiment with that even if it's of no use to RIOT at all. :wink:

Cheers, Kaspar

OK, that indeed makes a lot of sense (the chip soldered on next to you is less likely to go away than a random node in a general distributed system :-). Still, it might independently crash, and unless you want to crash in unison (which may be perfectly fine, though), that makes the abstraction leaky.

Grüße, Carsten