How to deal with lost interrupts?

Hi all,

There are several snippets in RIOT that rely on an IRQ event, e.g. radios that don't indicate again if the ISR request is not processed. In that case, the radio would stop receiving packets or raising TX done interrupts until someone calls "dev->driver->isr".

The LoRa SX127x driver [1] has RX and TX timeouts in case interrupts are lost. (I'm making this feature optional [2] since it's not required in all use cases, but that's a separate discussion.)

The main question here is, how should we handle lost interrupts in general? Let's say:

1. Should these timeouts be handled in the driver or upper layers? Or by some module in the OS? (e.g. watchdog)
2. Should the device (or system on top) try to recover? Kernel panic? Leave it to the user?

Cheers, José

[1]: https://github.com/RIOT-OS/RIOT/blob/master/drivers/sx127x/sx127x.c#L324
[2]: https://github.com/RIOT-OS/RIOT/pull/12908

Hey Jose,

Hi Kaspar,

No, I wouldn't consider those lost interrupts (because messages are lost, not IRQs).

I'm referring to interrupts that get lost due to malfunctions in the hardware (or that are never generated where expected). E.g.:
- Device hangs and stops responding to requests (e.g. TX_START not followed by a TX_DONE IRQ).
- Unconnected pins (broken port, damaged cables, etc.).
- Power loss in the device (so no IRQ is generated).

These states can hang certain implementations, so I'm asking how we should deal with them in general.
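
To make the first case concrete, here is a rough sketch of the kind of timeout guard I have in mind. It is not the actual sx127x code: the names (radio_t, radio_tx_start, radio_check_timeout) and the millisecond time base are assumptions made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical radio state, NOT the real sx127x structures. */
typedef enum {
    RADIO_IDLE,
    RADIO_TX_PENDING,
    RADIO_TX_TIMED_OUT
} radio_state_t;

typedef struct {
    radio_state_t state;
    uint32_t tx_deadline;   /* time (ms) by which TX_DONE must have arrived */
} radio_t;

/* Called when a transmission is started (TX_START). */
void radio_tx_start(radio_t *dev, uint32_t now_ms, uint32_t timeout_ms)
{
    dev->state = RADIO_TX_PENDING;
    dev->tx_deadline = now_ms + timeout_ms;
}

/* Called from the IRQ handling path when the TX_DONE IRQ arrives. */
void radio_tx_done(radio_t *dev)
{
    if (dev->state == RADIO_TX_PENDING) {
        dev->state = RADIO_IDLE;
    }
}

/* Called from a timer or periodic task.
 * Returns true if TX_DONE never showed up before the deadline. */
bool radio_check_timeout(radio_t *dev, uint32_t now_ms)
{
    /* The signed difference keeps the comparison valid when the
     * millisecond counter wraps around. */
    if (dev->state == RADIO_TX_PENDING &&
        (int32_t)(now_ms - dev->tx_deadline) >= 0) {
        dev->state = RADIO_TX_TIMED_OUT;
        return true;
    }
    return false;
}
```

What happens after radio_check_timeout() returns true (reset the device, report an error upwards, panic) is exactly the open question.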

Hi,

> There are several snippets in RIOT that rely on an IRQ event. E.g radios that
> don't indicate again if the ISR request is not processed. In that case,
> the radio would stop receiving packets or TX done interrupts until
> someone calls "dev->driver->isr".

Typically, this happens with EDGE-triggered interrupts, where there are multiple sources of interrupt, and the driver fails to check all the sources. Or there is a race condition where a new interrupt occurs while interrupts are masked and the driver has already checked.
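
One common way to avoid missing a source with edge-triggered interrupts is to keep re-reading the status flags until nothing is pending. A minimal sketch, where the device struct, the flag bits and read_and_clear_flags() are all hypothetical stand-ins for a real status register access:

```c
#include <stdint.h>

#define IRQ_RX_DONE  (1u << 0)
#define IRQ_TX_DONE  (1u << 1)

/* Hypothetical device: 'flags' stands in for an IRQ status register
 * that would normally be read over SPI/I2C. */
typedef struct {
    volatile uint8_t flags;
    unsigned rx_count;
    unsigned tx_count;
} irq_dev_t;

/* Stand-in for a read-and-clear status register access. */
static uint8_t read_and_clear_flags(irq_dev_t *dev)
{
    uint8_t pending = dev->flags;
    dev->flags = 0;
    return pending;
}

/* Driver ISR handler: loop until no source is pending. An event that
 * arrives while an earlier one is being handled is picked up by the
 * next iteration instead of waiting for an edge that never comes. */
void dev_isr(irq_dev_t *dev)
{
    uint8_t pending;

    while ((pending = read_and_clear_flags(dev)) != 0) {
        if (pending & IRQ_RX_DONE) {
            dev->rx_count++;    /* handle received frame */
        }
        if (pending & IRQ_TX_DONE) {
            dev->tx_count++;    /* handle finished transmission */
        }
    }
}
```

The loop checks every source on each pass, which covers the "driver fails to check all the sources" case as well.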

With LEVEL-triggered interrupts, this is usually less of an issue, as the IRQ pin stays low until the interrupt is acknowledged, but I'm sure that there are devices which do not include an ACK stage.

> 1. Should these timeouts be handled in the driver or upper layers? Or by
> some module in the OS? (e.g watchdog)

Check all devices once per "major" wake-up cycle.
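
Roughly like this, as a sketch only. The types below are simplified stand-ins for a netdev-style driver table rather than RIOT's real structures, and it assumes each driver's isr is safe to call when nothing is pending:

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for a netdev-style interface. */
typedef struct device device_t;

typedef struct {
    void (*isr)(device_t *dev);
} driver_t;

struct device {
    const driver_t *driver;
    unsigned isr_calls;     /* only used to make the example observable */
};

/* Example isr: a real one would read the device status and handle
 * whatever is pending; this one just counts invocations. */
void example_isr(device_t *dev)
{
    dev->isr_calls++;
}

const driver_t example_driver = { .isr = example_isr };

/* Call once per "major" wake-up cycle: give every registered device
 * a chance to process events whose interrupt may have been lost. */
void poll_all_devices(device_t *devs[], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (devs[i] != NULL && devs[i]->driver != NULL &&
            devs[i]->driver->isr != NULL) {
            devs[i]->driver->isr(devs[i]);
        }
    }
}
```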