Hello Kaspar,
sorry for "out out order" message, but I was not in the list when this thread was started.
Adding my two cents:
The current xtimer is "just" not thread safe and a single timer should not be used/set from multiple threads (I hope we agree on this). The question is whether there is, somewhere in the RIOT core, a single xtimer_t* that is set from multiple threads.
Or was it just userspace misuse...
Hope the second is the case :)
As for returning an error from xtimer_set - what would you do with the return value?
Maybe something like this?
while (true) {
    /* hypothetical: assume xtimer_set() returned 0 on success and
     * non-zero if the timer is already pending */
    if (xtimer_set(&timer, offset) != 0) {
        thread_yield();   /* let whoever owns the timer run, then retry */
    }
    else {
        break;
    }
}
quite ugly...
Sure, there should be a check in _add_timer_to_long_list and _add_timer_to_list for duplicates, and an assert for the case that someone tries to add a pointer which is already in the list - just to warn the user that they are doing something bad.
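A minimal sketch of what such a check could look like, assuming the timers are chained via their next pointer (the helper name below is made up, it is not the actual insertion code):

#include <assert.h>

/* illustrative duplicate check, to be called at the top of
 * _add_timer_to_list() / _add_timer_to_long_list() */
static void _assert_not_in_list(xtimer_t *list_head, xtimer_t *timer)
{
    for (xtimer_t *cur = list_head; cur; cur = cur->next) {
        /* adding the same timer object twice corrupts the list */
        assert(cur != timer);
    }
}

That would at least turn the silent list corruption into a loud failure during development.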
wbr
malo
> <https://github.com/RIOT-OS/RIOT/blob/master/sys/xtimer/xtimer_core.c#L209-L211> would
> end with list_head equal to the timer (assuming no other timer has the
> same time), and then the next two lines would basically link the timer
> to itself.
>
> I could be wrong though, that is just a guess.
I think your analysis is correct, I managed to create a test case that
shows pretty much the behaviour you're describing.
Guarding most of xtimer_set() (using disableIRQ/restoreIRQ) fixes the
problem, but disables interrupts for the backoff spin loop.
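For reference, the effect is roughly that of wrapping the call like this (in the actual experiment the guard sits inside xtimer_set() itself; the wrapper name here is made up):

static void xtimer_set_guarded(xtimer_t *timer, uint32_t offset)
{
    unsigned state = disableIRQ();   /* nothing can interleave now */
    xtimer_set(timer, offset);       /* includes the backoff spin */
    restoreIRQ(state);
}

which is exactly why the backoff spin ends up running with interrupts masked.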
While hanging within xtimer is probably the worst, I'm not sure what
would be the best semantic for concurrently setting the same timer object:
- return an error (cleanest, but currently xtimer_set() never returns an
error)
- first xtimer_set() wins (easy to implement by somehow tagging the
timer object, but probably unexpected; see the sketch after this list)
- second xtimer_set() wins (very hard to do as xtimer_set() can be
called in ISR context, and there's no way to wait for, e.g., a mutex)
- guard the whole timer setting procedure (would disable interrupts
while spinning)
- ?
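To make the "first xtimer_set() wins" option above more concrete, a rough sketch (the armed flag and the function name are made up, they are not part of xtimer_t today):

int xtimer_set_checked(xtimer_t *timer, uint32_t offset)
{
    unsigned state = disableIRQ();
    if (timer->armed) {             /* somebody already set this timer */
        restoreIRQ(state);
        return -1;
    }
    timer->armed = 1;               /* claim it; firing/removal would clear it */
    restoreIRQ(state);

    xtimer_set(timer, offset);      /* existing (void) set */
    return 0;
}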
Opinions?
Kaspar
> Hello Kaspar,
> sorry for the "out of order" message, but I was not on the list when this thread was started.
> Adding my two cents:
> The current xtimer is "just" not thread safe and a single timer should not be used/set from multiple threads (I hope we agree on this). The question is whether there is, somewhere in the RIOT core, a single xtimer_t* that is set from multiple threads.
> Or was it just userspace misuse...
> Hope the second is the case :)
It was my code that initially discovered it. I was quite deliberate about
not accessing the same object from two threads. In fact, the timer that was
found to be in the list twice was one created by RIOT to wait for a message
with a timeout. I'm not ruling out user error, but it would have to be a more
subtle user error.
I've been busy on other things, so I have not had time to dig into where
this is coming from, but a temporary hotfix that stopped my mesh from
crashing was to put a modified version of remove_timer inside the critical
section of add timer. At least that way the critical section does not
contain the spin wait. Not advocating this as the solution, it's just a
hack until I get time to improve it.
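Roughly, the shape of it is something like the helper below, called at the start of the interrupt-disabled add-timer path before the normal insertion (a sketch of the idea above, not the exact code; the name is made up and it assumes the timers are chained via their next pointer):

static void _remove_if_present(xtimer_t **list_head, xtimer_t *timer)
{
    /* unlink the timer if it is already on this list, so re-adding it
     * cannot link the node to itself */
    while (*list_head) {
        if (*list_head == timer) {
            *list_head = timer->next;
            return;
        }
        list_head = &((*list_head)->next);
    }
}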
> The current xtimer is "just" not thread safe and a single timer should not be used/set from multiple threads (I hope we agree on this). The question is whether there is, somewhere in the RIOT core, a single xtimer_t* that is set from multiple threads.
> Or was it just userspace misuse...
> Hope the second is the case :)
>
> It was my code that initially discovered it. I was quite deliberate about
> not accessing the same object from two threads. In fact, the timer that was
> found to be in the list twice was one created by RIOT to wait for a message
> with a timeout. I'm not ruling out user error, but it would have to be a more
> subtle user error.
Do I understand this correctly: your application was not using multiple
threads accessing the same timer (i.e. xtimer_t struct), but you still had
concurrency problems with the timer? Can you elaborate on this?
> I've been busy on other things, so I have not had time to dig into where
> this is coming from, but a temporary hotfix that stopped my mesh from
> crashing was to put a modified version of remove_timer inside the critical
> section of add timer. At least that way the critical section does not
> contain the spin wait. Not advocating this as the solution, it's just a
> hack until I get time to improve it.
Would you be willing to share this hack with us? Maybe it will give us more
insight.
This is an ugly hack indeed…
Could you rather add some check to the _add_timer functions, put a breakpoint there and send the backtrace back?
Anyway, the majority of the authors are probably still on the list, so everybody should just scratch their head and try to remember whether they set a timer from two threads somewhere.
I have not made a reproducer because I’ve been pretty busy, but also because if the problem is due to the network stack, it might be hard to reproduce without all the other chatty nodes around sending ND, RPL and UDP traffic.