Code crashing with DEVELHELP but not without

Carton32 · 30 November 2022 16:21

Hi,

I have a board based on a nrf52840.

My code is running fine since a bit less than an hour performing a test (fast IEEE802.15.4 pings between two boards). I compiled without the DEVELHELP flag in my Makefile. Before this, I was doing the exact same tests but with DEVELHELP = 1. It was stopping pinging after a few seconds to ~1-2 minutes.

GDB was giving me this output (with DEVELHELP = 1) :

panic_arch () at /home/user/firmware/RIOT/cpu/cortexm_common/panic.c:43
43          __asm__("bkpt #0");

What could cause the code to stop with DEVELHELP but not without ? I have a thread checking every 15 seconds that the nrf802154_radio is still able to emit and receive frames and it resets the radio if needed. If I don’t start this thread, no issue. So I think it’s coming from the resets done sometimes because the radio is very busy with my spam of ICMP pings. But the mystery lies in why it’s running fine without DEVELHELP but not with it.

Do you have any clue about it ?

benpicco · 30 November 2022 18:09

If you are in gdb can you do a bt to get the backtrace of the crash. DEVELHELP also enables assertions, so a failed assertion will trigger a panic - are you seeing something like that?

Carton32 · 1 December 2022 07:11

This is what I have with bt :

Program received signal SIGTRAP, Trace/breakpoint trap.
panic_arch () at /home/user/firmware/RIOT/cpu/cortexm_common/panic.c:43
43          __asm__("bkpt #0");
(gdb) bt
#0  panic_arch () at /home/user/firmware/RIOT/cpu/cortexm_common/panic.c:43
#1  0x00001790 in core_panic (crash_code=crash_code@entry=PANIC_ASSERT_FAIL, message=message@entry=0x15d08 "FAILED ASSERTION.") at /home/user/firmware/RIOT/core/lib/panic.c:81
#2  0x00001702 in _assert_panic () at /home/user/firmware/RIOT/core/lib/assert.c:31
#3  0x0000b046 in ieee802154_submac_process_ev (submac=0x15d08, submac@entry=0x20003fac <nrf802154_netdev+36>, ev=ev@entry=IEEE802154_FSM_EV_TX_DONE)
    at /home/user/firmware/RIOT/sys/net/link_layer/ieee802154/submac.c:400
#4  0x0000dbc6 in ieee802154_submac_tx_done_cb (submac=0x20003fac <nrf802154_netdev+36>) at /home/user/firmware/RIOT/sys/include/net/ieee802154/submac.h:593
#5  _isr (netdev=0x20003f88 <nrf802154_netdev>) at /home/user/firmware/RIOT/drivers/netdev_ieee802154_submac/netdev_ieee802154_submac.c:200
#6  0x0000724a in _process_events_await_msg (msg=<optimized out>, netif=<optimized out>) at /home/user/firmware/RIOT/sys/net/gnrc/netif/gnrc_netif.c:1704
#7  _gnrc_netif_thread (args=<optimized out>) at /home/user/firmware/RIOT/sys/net/gnrc/netif/gnrc_netif.c:1872
#8  0x00001f00 in sched_switch (other_prio=10) at /home/user/firmware/RIOT/core/sched.c:298
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

The test without DEVELHELP ran fine until 5:00 this morning when I had to stop it because I had to leave my flat. On the other end, with DEVELHELP = 1, I got this backtrace in after ~5 minutes.

oleg · 1 December 2022 09:28

Hey Carton32,

On Thu, Dec 01, 2022 at 07:17:49AM +0000, Carton32 via RIOT wrote:

This is what I have with bt :

Program received signal SIGTRAP, Trace/breakpoint trap.
panic_arch () at /home/user/firmware/RIOT/cpu/cortexm_common/panic.c:43
43          __asm__("bkpt #0");
(gdb) bt
#0  panic_arch () at /home/user/firmware/RIOT/cpu/cortexm_common/panic.c:43
#1  0x00001790 in core_panic (crash_code=crash_code@entry=PANIC_ASSERT_FAIL, message=message@entry=0x15d08 "FAILED ASSERTION.") at /home/user/firmware/RIOT/core/lib/panic.c:81
#2  0x00001702 in _assert_panic () at /home/user/firmware/RIOT/core/lib/assert.c:31
#3  0x0000b046 in ieee802154_submac_process_ev (submac=0x15d08, submac@entry=0x20003fac <nrf802154_netdev+36>, ev=ev@entry=IEEE802154_FSM_EV_TX_DONE)
    at /home/user/firmware/RIOT/sys/net/link_layer/ieee802154/submac.c:400
#4  0x0000dbc6 in ieee802154_submac_tx_done_cb (submac=0x20003fac <nrf802154_netdev+36>) at /home/user/firmware/RIOT/sys/include/net/ieee802154/submac.h:593
#5  _isr (netdev=0x20003f88 <nrf802154_netdev>) at /home/user/firmware/RIOT/drivers/netdev_ieee802154_submac/netdev_ieee802154_submac.c:200
#6  0x0000724a in _process_events_await_msg (msg=<optimized out>, netif=<optimized out>) at /home/user/firmware/RIOT/sys/net/gnrc/netif/gnrc_netif.c:1704
#7  _gnrc_netif_thread (args=<optimized out>) at /home/user/firmware/RIOT/sys/net/gnrc/netif/gnrc_netif.c:1872
#8  0x00001f00 in sched_switch (other_prio=10) at /home/user/firmware/RIOT/core/sched.c:298
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

The test without DEVELHELP ran fine until 5:00 this morning when I had to stop it because I had to leave my flat. On the other end, with DEVELHELP = 1, I got this backtrace in after ~5 minutes.

that’s interesting.

The assertion that fires is caused by this code:

if (new_state == IEEE802154_FSM_STATE_INVALID) {                                                                    
    _print_debug(submac->fsm_state, new_state, ev);                                                                 
    assert(false);                                                                                                  
}

If this doesn’t happen without DEVELHELP could indicate two things:

The driver still enters the invalid state from time to time but without the assertion it is just ignored without doing any harm.
DEVELHELP actually changes the timings of your code in a way that the device never enters an invalid state.

In order to check which assumption holds true you could either remove the assert and check if the application runs fine with DEVELHELP enabled afterwards or set ENABLE_DEBUG to 1, run without DEVELHELP, and check if you see the output from the according line.

Cheers Oleg

printk(“HPFS: Grrrr… Kernel memory corrupted … going on, but it’ll crash very soon :-(\n”); linux-2.4.3/fs/hpfs/super.c

jia200x · 1 December 2022 10:42

Considering the radio is running a (Sub)MAC layer on top, I’m afraid just resetting the radio may have very bad consequences. In fact, the assertion @oleg pointed out occurs when the state machine of the SubMAC did an invalid state transition. If you set ENABLE_DEBUG in sys/net/link_layer/ieee802154/submac.c, it will print which state transition ocurred. This would help us to understand what went wrong.

I tend to think it’s related to 1. Some invalid transitions may still be OK for a subset of radios (including the nrf52 series).

If you post which transition occurred and how to reproduce the error we can provide a more accurate answer.

Carton32 · 1 December 2022 11:20

Hi,

Thank you 3 for your answers!

I’ve performed the following test following your instructions :

In submac.c, I’ve commented out the assert(false); line and set ENABLE_DEBUG to 1
I’ve set DEVELHELP to 1 in application Makefile
Forced my nrf802154 resets without condition each 14[s]

I’ve got the following output in terminal :

Welcome to pyterm!
Type '/exit' to exit.
2022-12-01 11:44:49,010 # NETOPT_TX_END_IRQ not implemented by driver
2022-12-01 11:45:02,187 # NETOPT_TX_END_IRQ not implemented by driver
2022-12-01 11:45:44,719 # RX--(ACK_TIMEOUT)->INVALID
2022-12-01 11:45:44,721 # INVALID--(RX_DONE)->INVALID
2022-12-01 11:58:55,836 # NETOPT_TX_END_IRQ not implemented by driver
2022-12-01 11:58:55,995 # NETOPT_TX_END_IRQ not implemented by driver
2022-12-01 12:00:48,609 # RX--(ACK_TIMEOUT)->INVALID
2022-12-01 12:00:48,613 # INVALID--(RX_DONE)->INVALID
2022-12-01 12:02:40,611 # RX--(ACK_TIMEOUT)->INVALID
2022-12-01 12:02:40,614 # INVALID--(RX_DONE)->INVALID
2022-12-01 12:13:52,617 # RX--(ACK_TIMEOUT)->INVALID
2022-12-01 12:13:52,621 # INVALID--(RX_DONE)->INVALID
2022-12-01 12:16:12,610 # RX--(ACK_TIMEOUT)->INVALID
2022-12-01 12:16:12,613 # INVALID--(RX_DONE)->INVALID
2022-12-01 12:18:04,613 # RX--(ACK_TIMEOUT)->INVALID
2022-12-01 12:18:04,615 # INVALID--(RX_DONE)->INVALID
2022-12-01 12:34:12,428 # RX--(TX_DONE)->INVALID
2022-12-01 12:34:12,431 # INVALID--(RX_DONE)->INVALID
2022-12-01 12:37:42,425 # RX--(ACK_TIMEOUT)->INVALID
2022-12-01 12:37:42,434 # INVALID--(RX_DONE)->INVALID

The radio stop receiving sometimes but my resets are always putting it back to normal. I do my pings from a board running Linux and a script doing batches of 1000 pings. If the script detect a packet loss of 100% for a batch, it stops. It didn’t stop although I can see abnormal receive rates which means the radio stopped working temporarily.

My reset procedure for the nrf802154 radio is the following :

ieee802154_radio_off(dev);
nrf802154_init();
netdev->driver->init(netdev);

dev is of type ieee802154_dev_t and netdev of type netdev_t.

As my RIOT board can’t be remotely updated, I need some procedures like this to check the health of each peripherals to avoid an end-user has to perform maintenance activities (even as simple as power off → power on).

jia200x · 1 December 2022 13:20

Hi @Carton32

Carton32:

My reset procedure for the nrf802154 radio is the following :
ieee802154_radio_off(dev);
nrf802154_init();
netdev->driver->init(netdev);
dev is of type ieee802154_dev_t and netdev of type netdev_t.

The Radio HAL devices are not thread safe, therefore running such code on a separate thread will eventually corrupt the driver state. On the other side, the snippet you posted will de-synchronize the SubMAC and may cause such unexpected state transitions. AFAIU the code doesn’t fail if you remove the snippet, right? Then this would be the actual problem.

In that case a full reset may simplify things, as we currently don’t have a way to reset the network stack gracefully. Would it be compatible for your use case? Otherwise we might need to think in workarounds.

Carton32 · 1 December 2022 13:55

Exactly

I also have 3 relays on the board. If the MCU is reset, I lose their state before the whole reset and the most critical moment when I should not lose RF communication more than 30[s] is when they are in an active state. That’s why I’m trying to not perform a whole reset. I could save the state of the relay, perform a whole reset and restore their state but it could shorten the life of the relays drastically (even worse if we consider the devices they are switching).

Of course, this reset procedure may never be needed if my board has no EMC issue. It was tested for CE marking and also with IEC 61000-4-4 with special levels up to +/- 5kV without showing any weaknesses. Sadly, it was also the case for its previous version. No issue during 4 years and then it started to show some problems. So this reset procedure is now necessary on the old version. It was equipped with a MRF24J40 (now EOL).

The on-site updates were painful. x) I’m trying to avoid that this situation happens again.

benpicco · 1 December 2022 16:42

The on-site updates were painful. x) I’m trying to avoid that this situation happens again.

Slightly unrelated, but have you considered remote updates with SUIT?

Carton32 · 1 December 2022 18:24

Yes. I was following the discussion while you were working on it. Sadly, the previous version of the board has a small ROM. My application needs about 70kB of ROM and the STM32F302RB used has only 128K. For the new version, it’s something I want to use but I hadn’t enough time to check it seriously.