I am new to RIOT, so I hope this is user error, but I am having some grief inside xtimer. I am running a mesh of nodes with a full RPL stack (in case that’s relevant) and the biggest problem I have at the moment is nodes hanging in _add_timer_to_list in xtimer_core (link)
Using GDB, it seems that one of the nodes in the list points to itself, hence the endless loop.
My first question is: when is this possible? At first glance, all code paths that lead here call remove_timer to prevent this sort of problem. I don’t access the same timer object from two different threads, and my code that uses xtimer functions is not re-entered.
I don’t use that many timer operations in my application code, but I do assume that the following functions don’t require any freeing or removing afterwards, am I wrong?
xtimer_set_msg (the msg is statically allocated)
Thanks for the reply. I am on a platform essentially equal to a samr21xpro.
The short answers:
- Only one declared xtimer_t object is used more than once. I use it with xtimer_set_msg for a thread to send itself a message. Both the timer and the msg object are statically allocated. On the other hand, I have RPL and all sorts of network things going, and I have no doubt there are a ton of timers involved. In terms of ephemeral timers, I call xtimer_usleep a LOT, with intervals between 1 ms and 100 ms, from multiple threads. I also send packets every 200 ms or so and receive them every 500 ms or so.
- The interrupt load might be pretty steep if the radio is interrupting on every packet (promiscuous mode), but I don’t think it is. Otherwise I would imagine that, other than the timers, it is less than ten interrupts per second.
As for memory corruption, that may well be the case. I will double check my code. I thought it was somewhat unusual that multiple boards would all get a timer pointing to itself, but I suppose not all corruption is non-deterministic and they all run identical firmware, so it might be corruption.
One question, in the network stacks, are there ever two threads possibly using the same timer object? I ask because the timer_remove and the insert are in two different critical sections, and if there are concurrent calls with the same timer object then it might be possible to interrupt between the critical sections and insert a timer that is already in the list. What would then happen is that this loop would end with list_head equal to the timer (assuming no other timer has the same time), and then the next two lines would basically link the timer to itself.
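To make the suspected failure concrete, here is a minimal host-side sketch of a sorted singly linked timer list. It is not the xtimer source: `sketch_timer_t`, `sketch_add` and the demo function are hypothetical names, and the insert loop is only my assumption of how a sorted insert like _add_timer_to_list behaves. It inserts the same timer twice without a remove in between.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for xtimer_t: only the fields the list needs. */
typedef struct sketch_timer {
    struct sketch_timer *next;
    uint32_t target;
} sketch_timer_t;

/* Sorted insert: skip every entry whose target is <= the new timer's. */
static void sketch_add(sketch_timer_t **list_head, sketch_timer_t *timer)
{
    while (*list_head && (*list_head)->target <= timer->target) {
        list_head = &((*list_head)->next);
    }
    timer->next = *list_head;
    *list_head = timer;
}

/* Returns 1 if inserting 'a' twice leaves 'a' linked to itself. */
static int sketch_double_insert_self_links(void)
{
    sketch_timer_t a = { NULL, 100 };
    sketch_timer_t b = { NULL, 200 };
    sketch_timer_t *head = NULL;

    sketch_add(&head, &a);
    sketch_add(&head, &b);
    /* 'a' is still in the list: the walk passes 'a' itself and stops at
     * &a.next, so the final store writes a.next = &a. */
    sketch_add(&head, &a);

    return a.next == &a;
}
```

In this sketch the second insert also drops `b` from the list, and any later walk past `a` never terminates, which would match the hang I see in GDB.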
Normally you can guess where the timer came from by looking at the address (or the debugger tells you straight away). Is this somehow possible in your case (i.e. 0x200010a4)? That might be helpful for the timer people.
I am using other IPC messages, yes. There is a thread waiting with xtimer_msg_receive_timeout that gets messages either from xtimer_set_msg or from the network stack on packet reception.
Incidentally, if I decrease the load on the MCU by increasing the sensor sampling interval, the problem seems to go away (or at least it has not shown again in the past day).
I inserted a trap in the timer code that will stop everything and let me debug if a timer is found in the linked list with the same address as the timer to be inserted. I’ll let you know again with more info when I reproduce it.
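For reference, the trap is roughly the following check, shown here as a host-side sketch rather than the actual patch. `sketch_timer_t` and `sketch_list_contains` are hypothetical names, and the node bound is my own addition so that the check cannot hang on a list that is already cyclic.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for xtimer_t: only the fields the list needs. */
typedef struct sketch_timer {
    struct sketch_timer *next;
    uint32_t target;
} sketch_timer_t;

/* Returns 1 if 'timer' is reachable from 'head' within max_nodes steps.
 * The bound keeps the check from looping forever on a corrupted list. */
static int sketch_list_contains(const sketch_timer_t *head,
                                const sketch_timer_t *timer,
                                unsigned max_nodes)
{
    for (unsigned i = 0; head != NULL && i < max_nodes; i++) {
        if (head == timer) {
            return 1; /* this is where the trap would halt the system */
        }
        head = head->next;
    }
    return 0;
}

/* Small self-check: a -> b, then b linked to itself as in the bug. */
static int sketch_trap_demo(void)
{
    sketch_timer_t c = { NULL, 300 };
    sketch_timer_t b = { NULL, 200 };
    sketch_timer_t a = { &b, 100 };

    int ok = sketch_list_contains(&a, &a, 16) == 1
          && sketch_list_contains(&a, &b, 16) == 1
          && sketch_list_contains(&a, &c, 16) == 0;

    b.next = &b; /* simulate the self-linked node */
    return ok && sketch_list_contains(&a, &c, 16) == 0;
}
```

On the target, the check would run just before the insert, and the `return 1` branch would be replaced by a breakpoint or a halt.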
One question, in the network stacks, are there ever two threads possibly
using the same timer object? I ask because the timer_remove and the
insert are in two different critical sections, and if there are
concurrent calls with the same timer object then it might be possible to
interrupt between the critical sections and insert a timer that is
already in the list. What would then happen is that this loop would
end with list_head equal to the timer (assuming no other timer has the
same time), and then the next two lines would basically link the timer
to itself.
I could be wrong though, that is just a guess.
I think your analysis is correct, I managed to create a test case that
shows pretty much the behaviour you're describing.
Guarding most of xtimer_set() (using disableIRQ/restoreIRQ) fixes the
problem, but disables interrupts for the backoff spin loop.
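
As a rough host-side illustration of why guarding helps (not the actual
patch): if remove and insert happen as one unit, setting the same timer
twice leaves the list intact. All sketch_* names are hypothetical, and
the IRQ functions are empty stubs standing in for disableIRQ/restoreIRQ.
The backoff spin loop is left out entirely.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for xtimer_t: only the fields the list needs. */
typedef struct sketch_timer {
    struct sketch_timer *next;
    uint32_t target;
} sketch_timer_t;

/* Stubs: on the MCU these would disable and restore interrupts. */
static unsigned sketch_irq_disable(void) { return 0; }
static void sketch_irq_restore(unsigned state) { (void)state; }

/* Unlink 'timer' if it is in the list; do nothing otherwise. */
static void sketch_remove(sketch_timer_t **list_head, sketch_timer_t *timer)
{
    while (*list_head) {
        if (*list_head == timer) {
            *list_head = timer->next;
            timer->next = NULL;
            return;
        }
        list_head = &((*list_head)->next);
    }
}

/* Sorted insert: skip every entry whose target is <= the new timer's. */
static void sketch_add(sketch_timer_t **list_head, sketch_timer_t *timer)
{
    while (*list_head && (*list_head)->target <= timer->target) {
        list_head = &((*list_head)->next);
    }
    timer->next = *list_head;
    *list_head = timer;
}

/* Remove and insert inside a single critical section. */
static void sketch_set(sketch_timer_t **head, sketch_timer_t *timer,
                       uint32_t target)
{
    unsigned state = sketch_irq_disable();
    sketch_remove(head, timer);
    timer->target = target;
    sketch_add(head, timer);
    sketch_irq_restore(state);
}

/* Returns 1 if setting 'a' twice still yields the list a -> b -> NULL. */
static int sketch_guarded_set_keeps_list_sane(void)
{
    sketch_timer_t a = { NULL, 0 };
    sketch_timer_t b = { NULL, 0 };
    sketch_timer_t *head = NULL;

    sketch_set(&head, &a, 100);
    sketch_set(&head, &b, 200);
    sketch_set(&head, &a, 100); /* second set of the same timer object */

    return head == &a && a.next == &b && b.next == NULL;
}
```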
While hanging within xtimer is probably the worst, I'm not sure what
would be the best semantic for concurrently setting the same timer object:
- return an error (cleanest, but currently xtimer_set() never returns an
error)
- first xtimer_set() wins (easy to implement by somehow tagging the
timer object, but probably unexpected)
- second xtimer_set() wins (very hard to do as xtimer_set() can be
called in ISR context, and there's no way to wait for, e.g., a mutex)
- guard the whole timer setting procedure (would disable interrupts