Odd problems with xtimer

Michael_Andersen1 · 10 February 2016 06:57

Hi

I am new to RIOT, so I hope this is user error, but I am having some grief inside xtimer. I am running a mesh of nodes with a full RPL stack (in case that’s relevant), and the biggest problem I have at the moment is nodes hanging in _add_timer_to_list in xtimer_core.

Using GDB, it seems that one of the nodes in the list points to itself, hence the endless loop.

    (gdb) print/x *list_head
    $3 = 0x200010a4
    (gdb) print/x (*list_head)->next
    $4 = 0x200010a4

My first question is: when is this possible? It seems at first glance that all code paths that lead here call remove_timer to prevent this sort of problem. I don’t access the same timer object from two different threads. My code using xtimer functions is not reentered.

I don’t use that many timer operations in my application code, but I do assume that the following functions don’t require any freeing or removing afterwards, am I wrong?

- xtimer_now
- xtimer_set_msg (the msg is statically allocated)
- xtimer_msg_receive_timeout
- xtimer_usleep_until

Any help would be appreciated.

Kaspar · 10 February 2016 10:45

Hey Michael,

it seems that one of the nodes in the list points to itself, hence the endless loop.

My first question is: when is this possible? It seems at first glance that all code paths that lead here call remove_timer to prevent this sort of problem.

It should not be possible (tm).

I took another look at the code, and it seems to me that timer->next gets overwritten whenever a timer is set, so there can't be some outdated value.

It might be that the list logic has a bug somewhere, but I remember testing it quite rigorously.

I don't access the same timer object from two different threads. My code using xtimer functions is not reentered.

I don't use that many timer operations in my application code, but I do assume that the following functions don't require any freeing or removing afterwards, am I wrong?

Completely right.

Could you tell us more on how you are using timers?

Interesting would be things like

- what platform are you on
- how many timers are simultaneously active
- how are the intervals
- how is the interrupt load

... that might help corner the issue.

You should also consider that xtimer may just be exposing a problem that is caused elsewhere, for example by memory corruption.

Kaspar

Michael_Andersen1 · 10 February 2016 22:08

Hi

Thanks for the reply. I am on a platform essentially equal to a samr21xpro.

The short answers:

- samr21xpro
- Only one declared xtimer_t object that is used more than once. I use it with xtimer_set_msg for a thread to send itself a message. Both the timer and the msg object are statically allocated. On the other hand, I have RPL and all sorts of network things going, and I have no doubt there are a ton of timers involved.
- In terms of ephemeral timers, I call xtimer_usleep a LOT with intervals of between 1ms and 100ms from multiple threads. I also send packets every 200ms or so and receive them every 500ms or so.
- The interrupt load might be pretty steep if the radio is interrupting on every packet (promiscuous mode). I don’t think it is. Otherwise I would imagine that, other than the timers, it is less than ten per second.

As for memory corruption, that may well be the case. I will double check my code. I thought it was somewhat unusual that multiple boards would all get a timer pointing to itself, but I suppose not all corruption is non-deterministic and they all run identical firmware, so it might be corruption.

One question, in the network stacks, are there ever two threads possibly using the same timer object? I ask because the timer_remove and the insert are in two different critical sections, and if there are concurrent calls with the same timer object then it might be possible to interrupt between the critical sections and insert a timer that is already in the list. What would then happen is that this loop would end with list_head equal to the timer (assuming no other timer has the same time), and then the next two lines would basically link the timer to itself.
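The double-insert scenario described above can be reproduced in isolation. The following is a minimal sketch in plain C, not RIOT's actual code: `demo_timer_t`, `demo_list_insert` and `demo_double_insert_self_loops` are hypothetical names, and the insert routine is only assumed to resemble a typical sorted singly-linked-list insert. It shows that inserting a node that is already in the list, without removing it first, can leave the node pointing at itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a timer list node (hypothetical, not xtimer_t). */
typedef struct demo_timer {
    struct demo_timer *next;
    uint32_t target;
} demo_timer_t;

/* Sorted insert that does NOT check whether the timer is already linked. */
static void demo_list_insert(demo_timer_t **list_head, demo_timer_t *timer)
{
    /* Walk past every node whose target is not later than the new one. */
    while (*list_head && (*list_head)->target <= timer->target) {
        list_head = &((*list_head)->next);
    }
    timer->next = *list_head;
    *list_head = timer;
}

/* Returns 1 if inserting the same timer twice makes it point at itself. */
int demo_double_insert_self_loops(void)
{
    demo_timer_t early = { NULL, 100 };
    demo_timer_t t = { NULL, 200 };
    demo_timer_t *head = NULL;

    demo_list_insert(&head, &early);
    demo_list_insert(&head, &t); /* first caller's insert */
    assert(head == &early && early.next == &t && t.next == NULL);

    /* Second caller inserts the same object; nobody removed it in between.
     * The walk ends at &t.next, so "*list_head = timer" writes t.next = &t. */
    demo_list_insert(&head, &t);

    return t.next == &t;
}
```

In this model, any later walk over the list never reaches a NULL terminator once it arrives at `t`, which would match the endless loop seen in the debugger.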

I could be wrong though, that is just a guess.

miri64 · 10 February 2016 22:21

Hi, normally you can guess where the timer came from by looking at the address (or the debugger tells you straight away). Is this somehow possible in your case (i.e. 0x200010a4)? That might be helpful for the timer people.

Regards, Martine

Joakim_Nohlgard · 11 February 2016 10:05

Also, you can use the .map file to find out if there are any buffers or other things nearby which may have overflowed and messed up your state.

Are you using any IPC messages other than the xtimer functions? (I wonder if there might be a race between the timer ISR callbacks and the message reception in xtimer)

Regards, Joakim

Michael_Andersen1 · 13 February 2016 00:28

Hi

I am using other IPC messages, yes. There is a thread waiting with xtimer_msg_receive_timeout that gets messages either from xtimer_set_msg or from the network stack on packet reception.

Incidentally, if I decrease the load on the MCU by increasing the sensor sampling interval, the problem seems to go away (or at least it has not shown again in the past day).

I inserted a trap in the timer code that will stop everything and let me debug if a timer is found in the linked list with the same address as the timer to be inserted. I’ll let you know with more info when I reproduce it.
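A trap of this kind could look roughly like the sketch below. This is an assumption about the general shape of such a check, not the code actually added to xtimer: `demo_timer_t`, `demo_list_contains` and `DEMO_MAX_WALK` are hypothetical names. The walk is bounded so that the check itself cannot hang on a list that already contains a cycle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a timer list node (hypothetical, not xtimer_t). */
typedef struct demo_timer {
    struct demo_timer *next;
    uint32_t target;
} demo_timer_t;

/* Upper bound on nodes visited; an arbitrary value chosen for this sketch. */
#define DEMO_MAX_WALK 64U

/*
 * Returns  1 if `timer` is already linked into the list starting at `head`,
 *          0 if it is not,
 *         -1 if the walk hit DEMO_MAX_WALK nodes (possible cycle).
 */
int demo_list_contains(const demo_timer_t *head, const demo_timer_t *timer)
{
    assert(timer != NULL);

    unsigned steps = 0;
    for (const demo_timer_t *cur = head; cur != NULL; cur = cur->next) {
        if (cur == timer) {
            return 1;
        }
        if (++steps >= DEMO_MAX_WALK) {
            return -1;
        }
    }
    return 0;
}
```

Calling such a check right before the insert and halting (for example with a breakpoint) on any non-zero result would catch both the "already linked" case and an already corrupted list.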

Thanks Michael

Kaspar · 19 February 2016 21:24

Hey Michael,

One question, in the network stacks, are there ever two threads possibly using the same timer object? I ask because the timer_remove and the insert are in two different critical sections, and if there are concurrent calls with the same timer object then it might be possible to interrupt between the critical sections and insert a timer that is already in the list. What would then happen is that this loop (sys/xtimer/xtimer_core.c at master, RIOT-OS/RIOT on GitHub) would end with list_head equal to the timer (assuming no other timer has the same time), and then the next two lines would basically link the timer to itself.

I could be wrong though, that is just a guess.

I think your analysis is correct. I managed to create a test case that shows pretty much the behaviour you're describing.

Guarding most of xtimer_set() (using disableIRQ/restoreIRQ) fixes the problem, but disables interrupts for the backoff spin loop.

While hanging within xtimer is probably the worst outcome, I'm not sure what the best semantics would be for concurrently setting the same timer object:

- return an error (cleanest, but currently xtimer_set() never returns an error)
- first xtimer_set() wins (easy to implement by somehow tagging the timer object, but probably unexpected)
- second xtimer_set() wins (very hard to do as xtimer_set() can be called in ISR context, and there's no way to wait for, e.g., a mutex)
- guard the whole timer setting procedure (would disable interrupts while spinning)
- ?
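To make the "tagging" idea concrete, here is a minimal sketch of how a "first set wins" policy might look, assuming C11 atomics are available and suitable on the target (which is not a given on every MCU). All names (`demo_tagged_timer_t`, `demo_timer_try_claim`, `demo_timer_release`, `demo_timer_set`) are hypothetical and not part of xtimer. The sketch also returns an error code to the losing caller, so it combines the first two options in the list.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical timer object carrying a "set in progress" tag. */
typedef struct {
    atomic_flag busy; /* set while some caller is in the middle of a set */
    uint32_t target;
} demo_tagged_timer_t;

/* Returns true if this caller claimed the timer, false if it was taken. */
bool demo_timer_try_claim(demo_tagged_timer_t *t)
{
    assert(t != NULL);
    /* test_and_set returns the previous value: false means we got it. */
    return !atomic_flag_test_and_set(&t->busy);
}

/* Drops the tag so that later set operations can proceed again. */
void demo_timer_release(demo_tagged_timer_t *t)
{
    assert(t != NULL);
    atomic_flag_clear(&t->busy);
}

/* Returns 0 on success, -1 if another set on the same timer is in flight. */
int demo_timer_set(demo_tagged_timer_t *t, uint32_t target)
{
    if (!demo_timer_try_claim(t)) {
        return -1; /* first caller wins; this caller backs off, no waiting */
    }
    t->target = target; /* a real timer would do remove + insert here */
    demo_timer_release(t);
    return 0;
}
```

Because the losing caller never waits, this shape is also usable from interrupt context. The cost is the one already named in the list: the later request is dropped, which callers may not expect.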

Opinions?

Kaspar