Remarks on a decade of RIOTing

As we officially approach a decade of developing RIOT, this thread aims to gather key feedback, emerging principles, and lessons learnt along the way. Your input is welcome!


What happened to the demo board idea from last year?

Cough cough… It turns out the free time that I thought I might have was actually a negative value. It was also pretty hard to come up with an idea that all interested parties could really get behind.


There were some discussions on Matrix which it could be good to partially move/summarize here, to leave some trace. @maribu @miri64 @Kaspar @oleg

Hey Manu,

On Wed, Dec 07, 2022 at 04:45:38PM +0000, Emmanuel Baccelli via RIOT wrote:

there were some discussions on Matrix which could be good to partially move/summarize here to leave some trace @maribu @miri64 @Kaspar @oleg

thanks for the reminder.

When I recall the last ten years of RIOT development a lot of things come to mind. Let me share some of my thoughts (partly also taken from the discussion on Matrix). I will start with some technical aspects:

Multitasking OS

I think RIOT can claim to be a trailblazer for multi-tasking OSes in the domain of constrained-node networks. When we started, the consensus for most embedded OSes was not to use multitasking, but rather to go for a single-threaded, purely event-driven approach, which requires various quirks (cf. Contiki's protothreads or TinyOS's nesC). Today, multitasking can be considered the default paradigm for IoT OSes, and in terms of basic architecture, Zephyr and mbed OS follow a similar approach (as far as I know).

Of course, RIOT was not the first multi-tasking OS for MCUs but among comparable candidates, the claim holds true as far as I can tell.

GNRC

The consensus seems to be:

  • bad idea: using (too many) threads
  • debatable: netapi
  • quite good idea: pktbuf

In general, I think it is fair to claim that GNRC has not been a failure, but we would design it differently today. When discussing the pros and cons, we have to bear in mind that the goal for GNRC was never to be the most efficient or most performant network stack, but to provide a base for general-purpose applications and to enable research and teaching on IoT networking. The fact that GNRC has been the default network stack in RIOT over the last ~8 years (?), even with other network stacks being available, is a good indication of it being useful.

On the other hand, debugging, stabilizing, and maintaining it has been a time-consuming task. It required a lot of effort for testing and a lot of dedication by @miri64 (and some others) to get it stable and mature.

APIs

We should talk about the benefits and pains when it comes to developing (stable) APIs like

  • sock
  • netdev
  • SAUL
  • peripheral
  • timer interfaces (swtimer, utimer, vtimer, xtimer, ztimer…)

Programming languages

In my opinion, RIOT has always been very open to enabling different programming and scripting languages on top of the core OS. However, there was certainly a strong lack of (heavy) users for most of these languages (C++, MicroPython, JavaScript…), which eventually led to many of their features remaining poorly supported or missing altogether.

The biggest success story in terms of programming languages other than C in RIOT to date is definitely Rust. I guess @chrysn and @kaspar can provide more input here.

Cheers Oleg

– panic(“mother…”); linux-2.2.16/drivers/block/cpqarray.c

Hi Oleg,

On Thu, 8 Dec 2022, Oleg Hahm via RIOT wrote:


GNRC

[…]

The fact that GNRC has been the default network stack in RIOT over the last ~8 years (?) even with other network stacks being available is a good indication for it being useful. […] On the other hand, debugging, stabilizing, and maintaining has been a time consuming task. It required a lot of efforts for testing and a lot of dedication by @miri64 (and some others) to get it stable and mature.

Hmm, not sure what the takeaway is. That good software requires testing and dedication, and that this takes time, is not really surprising.

Completely independent of GNRC, one lesson learned is definitely: testing is crucial, and we did a bad job of it in the beginning of RIOT.

APIs

We should talk about the benefits and pains when it comes to developing (stable) APIs like

  • sock
  • netdev
  • SAUL
  • peripheral

wondering why the timer API was not part of the discussion.

cheers matthias

– Matthias Waehlisch · Freie Universitaet Berlin, Computer Science …

Hi Oleg,

Regarding GNRC:

The consensus seems to be:

  • bad idea: using (too many) threads
  • debatable: netapi
  • quite good idea: pktbuf

In general, I think it is fair to claim that GNRC has not been a failure, but we would design it differently today. When discussing the pros and cons, we have to bear in mind that the goal for GNRC was never to be the most efficient or most performant network stack, but to provide a base for general-purpose applications and to enable research and teaching on IoT networking. The fact that GNRC has been the default network stack in RIOT over the last ~8 years (?), even with other network stacks being available, is a good indication of it being useful.

Although the idea of having one thread per protocol is stated as a design decision, IMO it’s more a consequence of the synchronization mechanisms that were available at that time (IPC messages). Back then there were no semaphores, thread flags, event threads, etc. Therefore I would propose to separate the design decision (generic message passing via netapi) from the actual implementation, as the implementation can be revised and improved.

Stepping aside from the actual implementation, I think the concept of netapi presents advantages that I have not seen in other network stacks:

  1. Adding a new protocol is straightforward: simply subscribe to a given packet type and inject packets back with the dispatch function. Concrete examples: LoRaWAN over netif (GNRC LoRaWAN), CCN-Lite over IEEE 802.15.4, GNRC SCHC integration. This would have been complicated with e.g. lwIP. For prototyping and research this is a huge advantage.

  2. Although presented as a disadvantage, IMHO GNRC is way easier to debug than the other supported network stacks. We have mechanisms to subscribe to packets, tools to browse packet structures, etc. It's clear that some protocol implementations are messier than others and therefore cause more headaches, but that's not a problem of GNRC's design decisions. I think many of the "known problems" of GNRC were actually caused by the (missing) lower-layer design, which translated into network stack issues.

Regarding the pktbuf, I think it’s a good idea to have centralized buffers for the network stack, but I also have concerns with the current implementation:

  1. The allocation strategy is slow, in many cases too slow for the timings of MAC layers. Therefore, I would enable memory pools (or memory slabs).
  2. Marking packets should be almost a no-op. As it is now, trimming SDU headers incurs quite some overhead (allocating a new pktsnip, maybe copying the packet, etc.).
  3. I think the pktbuf would benefit from a more defined packet structure (not just a linked list of chunks, which forces us to add a gnrc_netif_hdr snippet for holding metadata).

Again, this is more an implementation issue than a design decision.

We should talk about the benefits and pains when it comes to developing (stable) APIs like

  • sock
  • netdev
  • SAUL
  • peripheral

I would try to identify the APIs individually, since there are definitely more (timers, the Radio HAL) and there are different lessons learned for each one.

Stability is definitely the end goal of any API, but I think only a few RIOT APIs are actually stable (in terms of providing reasonable features and not changing API/behavior).

In my opinion the API design is tainted by:

  1. Early generalization.
  2. Premature optimization.
  3. Diffuse or non-existent architecture of the underlying system.

I'm not aware of any API change because of poor usability. On the contrary, it's usually because the API works OK for a single use-case/system but does not work with others (1., 3.), or because we optimize out features or mechanisms which turn out to be critical (2.).

A concrete example is netdev. While on one side it performs fine for a subset of devices (transceivers with MAC acceleration, Ethernet), it is not enough for implementing a north-bound API on top of a MAC layer, and it is not enough as a HAL (e.g. it took several years and several maintainers to provide support for OpenWSN on top of netdev, and it sadly only runs on a subset of devices).

Cheers, José



If the debugging of the network protocol can be done by subscribing to packets and analyzing them, you might be right, but that basically requires that the network stack is actually functioning. If something is broken in that infrastructure, then it becomes hard. Why is the thing suddenly stalled? Where is the leak? Where is the memory corruption coming from? At that point it's not about looking at packet data anymore, but about looking at all involved (thread) state. And that is incredibly hard with GNRC.

IMO netdev was misunderstood quite a bit.

Providing a way to pass events from the device, and with get()/set() allowing arbitrary implementations for what a netopt means, everything can be implemented on top of it. Or is there a concrete example of what wasn’t possible (vs. just not implemented)?

Hi Kaspar,

IMO netdev was misunderstood quite a bit.

Providing a way to pass events from the device, and with get()/set() allowing arbitrary implementations for what a netopt means, everything can be implemented on top of it. Or is there a concrete example of what wasn’t possible (vs. just not implemented)?

I agree that it was misunderstood. But I would like to separate the implementation details from the architecture.

Regarding the implementation details, it is clear that it's possible to implement everything with a generic function and a generic callback. So I wouldn't focus the discussion on whether it's possible or not, but rather on whether it makes sense from an architecture perspective and whether it's efficient.

The problems I see regarding architecture:

From the official RIOT documentation:

"This interface provides a uniform API for network stacks to interact with network device drivers."

From the definition it's clear that netdev sits on top of the device driver (so it acts as a HAL). However, please note that:

  1. Device classes have unique properties. IEEE 802.15.4, LoRa, and Ethernet transceivers are quite different from each other. However, two IEEE 802.15.4 devices can be abstracted, as they expose the same set of features.

  2. In an analogy with the periph API, what we are doing with netdev is like trying to implement a "generic periph API" for interacting with all peripherals, instead of having periph_timer, periph_spi, etc. While theoretically possible, the user of such an API would still need to understand what the underlying system does and expects. This is exactly what happens with netdev: it cannot abstract a "set of network devices", but only "a technology-specific device". To see this, note that it's not possible to hand netdev an unknown PSDU and expect it to send; you need to know that it's e.g. an Ethernet or a LoRa device. Therefore, let's agree that each technology has its own API and we just use netdev to implement each API.

  3. If the upper layer doesn't know what a netopt is, as the device interprets its meaning, or the upper layer doesn't have any guarantees, how can you implement any upper layer that expects a certain logic from the lower layer? Some examples:

  • For the case of IEEE 802.15.4 devices, there's no way to know whether a transmission uses CSMA/CA and retransmissions, does CCA, or simply sends the frame. In practice, each device implemented a different send function, which led to "device-dependent network stacks" (e.g. not so long ago there was no way to transmit 6LoWPAN fragments on CC2538 and nRF52840).
  • Depending on the netdev implementation, callbacks might be re-entrant or not (depending on where the driver triggers the event callback). In practice, you need to know what the driver is doing in order to safely call a device function.

Let's say we all sit together and fix 3. by actually defining the API and guarantees for each class, so that we end up with different netdev classes, one for each technology. To make them work, we would need at least:

  1. A list of mandatory and optional NETOPTs for each technology.
  2. The semantics for get/set.
  3. The guarantees.

The question is: what do we gain from implementing these APIs with netdev instead of just writing a dedicated API (e.g. like periph_timer)? Implementing the driver logic with get/set is error-prone and hard to test and document. Saving function pointer members by using a generic function does not necessarily reduce memory requirements, as get/set use patterns that are hard to optimize out.

Also note that:

  1. The get function has almost no use case for a device driver, as in practice the MIB/PIB holds the device state. E.g. none of the network stacks needs to read a device register to fetch the channel, as that information already exists in the stack.
  2. The isr function assumes that the transceiver and the device driver map one-to-one, which is not necessarily true. An at86rf215 device exposes two transceivers on the same device; in practice, it needs workarounds to run with netdev, such as having two threads.
  3. The send function is too high-level for a transceiver. For any slotted MAC, you need to a) load data into the frame buffer and b) trigger TX at the right time. To implement that with netdev, we added a NETOPT_PRELOADING that skips TX on send and overloaded NETOPT_STATE_TX. This could have been solved by separate write and transmit functions.

I agree though that the netdev interface definitely improved the situation back then, and that it works well for a set of devices (especially Ethernet). But nowadays we have more complex and powerful network stack components (e.g. slotted MACs, LPWAN), where we definitely benefit from dedicated transceiver APIs. For example, since the IEEE 802.15.4 Radio HAL is there, it feels like for a fraction of the work of implementing a netdev driver you get the same user experience, independent of which board you choose.


Hi again,

If the debugging of the network protocol can be done by subscribing to packets and analyzing them, you might be right, but that basically requires that the network stack is actually functioning. If something is broken in that infrastructure, then it becomes hard. Why is the thing suddenly stalled? Where is the leak? Where is the memory corruption coming from? At that point it's not about looking at packet data anymore, but about looking at all involved (thread) state. And that is incredibly hard with GNRC.

While I agree that having more threads adds some degree of complexity, I couldn't say this makes GNRC necessarily harder to debug compared to alternative network stacks (given that, in general, network stacks are known to be incredibly hard to debug). After all, other network stacks will also show issues if their infrastructure is broken, and in such cases the problem is not solved by just looking at packet data anymore, either.

I'm not aware of any recent infrastructure bug in GNRC, but rather of protocol implementation bugs, where I still think the debug tools play a huge role. I had something similar while working with openDSME, which also experienced stalls and leaks. With OpenWSN the experience was even worse, IMO.

Cheers, José
