Odd problems with xtimer

Hello Kaspar,
sorry for "out out order" message, but I was not in the list when this thread was started.

Adding my two cents:
The current xtimer is simply not thread safe and a single timer should not be used/set from multiple threads (I hope we agree on this). The question is whether there is somewhere in the RIOT core a single xtimer_t* that is set from multiple threads, or whether this was just userspace misuse...
Hope the second is the case :)
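
For illustration, a minimal sketch of the pattern I mean (the thread functions and the shared_timer name are made up; only xtimer_set() and xtimer_t are the real API), i.e. two threads arming the very same xtimer_t:

#include "xtimer.h"

static xtimer_t shared_timer;        /* one timer object shared by two threads */

static void _cb(void *arg)
{
    (void)arg;
    /* ... */
}

static void *thread_a(void *arg)     /* hypothetical thread A */
{
    (void)arg;
    shared_timer.callback = _cb;
    xtimer_set(&shared_timer, 1000); /* races with thread B below */
    return NULL;
}

static void *thread_b(void *arg)     /* hypothetical thread B */
{
    (void)arg;
    shared_timer.callback = _cb;
    xtimer_set(&shared_timer, 2000); /* same xtimer_t, different offset */
    return NULL;
}

If both calls interleave inside xtimer_set(), the same node can end up in the internal list twice (or linked to itself), which is exactly the kind of list corruption discussed in this thread.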

As for returning an error from xtimer_set - what would you do with the return value?
Maybe something like this?:

while (true)
{
    if (!xtimer_set(...))    /* assuming xtimer_set() would return non-zero on success */
    {
        thread_yield();
    }
    else
    {
        break;
    }
}

Quite ugly...

Sure, there should be a duplicate check in _add_timer_to_long_list and _add_timer_to_list, plus an assert for the case where a pointer that is already in the list is added again - just to warn the user that they are doing something bad.
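
Just to make the idea concrete, a small sketch of such a check, assuming the singly linked list layout from xtimer_core.c (timer->next); the helper name _is_in_list is made up:

#include <assert.h>
#include "xtimer.h"

/* hypothetical helper: return 1 if the pointer is already linked into the list */
static int _is_in_list(xtimer_t *list_head, xtimer_t *timer)
{
    for (xtimer_t *i = list_head; i; i = i->next) {
        if (i == timer) {
            return 1;
        }
    }
    return 0;
}

static void _add_timer_to_list(xtimer_t **list_head, xtimer_t *timer)
{
    /* catch double insertion early instead of silently corrupting the list */
    assert(!_is_in_list(*list_head, timer));

    /* ... existing insertion code ... */
}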
wbr
malo

> https://github.com/RIOT-OS/RIOT/blob/master/sys/xtimer/xtimer_core.c#L209-L211 would
> end with list_head equal to the timer (assuming no other timer has the
> same time), and then the next two lines would basically link the timer
> to itself.
> 
> I could be wrong though, that is just a guess.

I think your analysis is correct, I managed to create a test case that
shows pretty much the behaviour you're describing.

Guarding most of xtimer_set() (using disableIRQ/restoreIRQ) fixes the
problem, but disables interrupts for the backoff spin loop.
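
For reference, a rough sketch of what that guarding looks like, assuming xtimer_set()'s split into a backoff spin path and _xtimer_set_absolute(); this is only an illustration of the problem, not the actual patch:

void xtimer_set(xtimer_t *timer, uint32_t offset)
{
    unsigned state = disableIRQ();       /* guard the whole set operation */

    if (offset < XTIMER_BACKOFF) {
        /* downside: the backoff spin now runs with interrupts disabled */
        xtimer_spin(offset);
        timer->callback(timer->arg);
    }
    else {
        _xtimer_set_absolute(timer, xtimer_now() + offset);
    }

    restoreIRQ(state);
}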

While hanging inside xtimer is probably the worst outcome, I'm not sure what
the best semantics would be for concurrently setting the same timer object:

- return an error (cleanest, but currently xtimer_set() never returns an
error)
- first xtimer_set() wins (easy to implement by somehow tagging the
timer object, but probably unexpected; see the sketch after this list)
- second xtimer_set() wins (very hard to do as xtimer_set() can be
called in ISR context, and there's no way to wait for, e.g., a mutex)
- guard the whole timer setting procedure (would disable interrupts
while spinning)
- ?
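
To make the "first xtimer_set() wins" option a bit more concrete, a minimal sketch of the tagging idea; the "armed" flag and the wrapper name are assumptions, nothing like this exists in xtimer today:

/* hypothetical wrapper: an "armed" flag in xtimer_t, checked with IRQs disabled */
void xtimer_set_first_wins(xtimer_t *timer, uint32_t offset)
{
    unsigned state = disableIRQ();
    if (timer->armed) {
        /* someone already set this timer: first caller wins, drop this request */
        restoreIRQ(state);
        return;
    }
    timer->armed = 1;            /* cleared again when the timer fires or is removed */
    restoreIRQ(state);

    xtimer_set(timer, offset);   /* normal, unchanged set path */
}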

Opinions?

Kaspar

> Hello Kaspar, sorry for the "out of order" message, but I was not on the list when this thread was started.

> Adding my two cents: the current xtimer is simply not thread safe and a single timer should not be used/set from multiple threads (I hope we agree on this). The question is whether there is somewhere in the RIOT core a single xtimer_t* that is set from multiple threads, or whether this was just userspace misuse... Hope the second is the case :)

It was my code that initially discovered it. I was quite deliberate about not accessing the same object from two threads. In fact, the timer that was found to be in the list twice was one created by RIOT to wait for a message with a timeout. Not ruling out user error, but it would have to be a more subtle user error :)

I've been busy on other things, so I have not had time to dig into where this is coming from, but a temporary hotfix that stopped my mesh from crashing was to put a modified version of remove_timer inside the critical section of add timer. At least that way the critical section does not contain the spin wait. Not advocating this as the solution, it's just a hack until I get time to improve it.

Hi!

Hi Kaspar!

Hi Michael!

> The current xtimer is simply not thread safe and a single timer should not be used/set from multiple threads (I hope we agree on this). The question is whether there is somewhere in the RIOT core a single xtimer_t* that is set from multiple threads, or whether this was just userspace misuse... Hope the second is the case :)
>
> It was my code that initially discovered it. I was quite deliberate about not accessing the same object from two threads. In fact, the timer that was found to be in the list twice was one created by RIOT to wait for a message with a timeout. Not ruling out user error, but it would have to be a more subtle user error :)

Do I understand this correctly: your application was not using multiple threads accessing the same timer (i.e. the same xtimer_t struct), but you still had concurrency problems with the timer? Can you elaborate on this?

> I've been busy on other things, so I have not had time to dig into where this is coming from, but a temporary hotfix that stopped my mesh from crashing was to put a modified version of remove_timer inside the critical section of add timer. At least that way the critical section does not contain the spin wait. Not advocating this as the solution, it's just a hack until I get time to improve it.

Would you be willing to share this hack with us? Maybe it gives us more insights.

Cheers, Oleg

Hey,

Hey Kaspar!

> Would you be willing to share this hack with us? Maybe it gives us more insights.

Sure! Here is what I did:

This is the "lightweight" remove +int _xtimer_ensure_remove(xtimer_t *timer) +{ + int res = 0; + if (timer_list_head == timer) { + uint32_t next; + timer_list_head = timer->next; + if (timer_list_head) { + next = timer_list_head->target - XTIMER_OVERHEAD; + } + else { + next = _lltimer_mask(0xFFFFFFFF); + } + _lltimer_set(next); + } + else { + res = _remove_timer_from_list(&timer_list_head, timer) || + _remove_timer_from_list(&overflow_list_head, timer) || + _remove_timer_from_list(&long_list_head, timer); + }

Hello Michael,

this is an ugly hack indeed… could you instead add a check to the _add_timer functions, put a breakpoint there, and send the backtrace back?

Anyway, the majority of the authors are probably still on the list, so everybody can scratch their head and try to remember whether they set a timer from two threads somewhere.

wbr malo

I did actually put a check in, set a breakpoint, and got back the details. I put the stack trace in an issue:

https://github.com/RIOT-OS/RIOT/issues/4841

I have not made a reproducer because I’ve been pretty busy, but also because if the problem is due to the network stack, it might be hard to reproduce without all the other chatty nodes around sending ND, RPL and UDP traffic.

Hello Michael,

that is a perfect bug report - sorry, I missed it.

wbr malo