HW Timer not firing on MSP430

Hello,

I am encountering some problems with the MSP430 ports: when running an application that uses hwtimers to synchronize and execute tasks in a predictable manner, my application eventually stalls (i.e. halts) after an unpredictable and seemingly random delay. This issue arises in simulation (Cooja) as well as on real hardware (Zolertia Z1).

(Note: this is the reason I proposed PR #1002, which has failed to solve the issue. So I'll leave it to your collective wisdom to decide whether it deserves to be merged or should be cancelled.)

After quite a lot of debugging, I found that the problem is caused by the timer_round variable (i.e. the 16 most significant bits of the MSP430 timer counter, which are managed by software): sometimes the timer overflow (TAOV) interrupt evidently fails to fire, and this variable is not incremented correctly; this failure then prevents all timer-related code, hwtimer_spin() included, from working.
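To make the role of timer_round concrete, here is a minimal host-side sketch of how a 32-bit tick count can be built from a 16-bit hardware counter plus a software-managed upper half. It is not RIOT's actual code: sim_tar, sim_tick(), on_timer_overflow() and ticks_now() are hypothetical names used only for illustration, and the hardware counter is simulated by a plain variable.

```c
#include <stdint.h>

/* Hypothetical stand-ins, not RIOT code: sim_tar plays the role of the
 * 16-bit hardware counter, timer_round is the software-managed upper half. */
static volatile uint16_t sim_tar;
static volatile uint16_t timer_round;

/* What the overflow interrupt handler is expected to do on each wrap. */
static void on_timer_overflow(void)
{
    timer_round++;
}

/* Combine both halves into one 32-bit tick count. */
static uint32_t ticks_now(void)
{
    return ((uint32_t)timer_round << 16) | sim_tar;
}

/* Advance the simulated counter by one tick; returns 1 if it wrapped to 0. */
static int sim_tick(void)
{
    sim_tar++;
    return sim_tar == 0;
}
```

If the counter wraps and on_timer_overflow() does not run, ticks_now() jumps backwards by 65536 ticks, so any code waiting for a future tick value waits far longer than intended.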

I wonder how TAOV can be made to fail during RIOT's normal operation. I have found no code in RIOT that would disable the timer overflow interrupt (i.e. the TAIE bit in the TACTL register) after initialization. Consequently, I can only think of the following possible causes:

* The TAOV interrupt occurs while the GIE flag (in SR) is cleared, e.g. while another interrupt is being serviced, and the MSP430 fails to trigger TAOV once interrupts are re-enabled. This is unlikely, since the MSP430 manuals specify the existence of such "delayed interrupt triggering". (If such a bug exists, it would be disastrous, since I can't see any way to fix it in software!)
* TAOV fires, then another timer interrupt (TACCR1 or TACCR2) fires *before* TAOV can be handled, so TAOV is masked; hence the misbehaviour of the timers. The MSP430 manual isn't very clear on that point: it speaks of multiple interrupts being fired one after another when more than one is pending, but the exact order of priority is not really given, so maybe TAOV is always fired after the other interrupts, which would prevent them from being handled correctly.
* Another unexpected bug causes this... But how?

At this point, I need your knowledge to help me understand and solve the problem. Could someone answer the following questions?

* Does the MSP430 *really* handle "delayed interrupts" correctly, that is: does it fire interrupts that were held back by the GIE bit (in R2, the status register) once GIE is set again? Are there known bugs in MSP430 MCUs related to that?
* How exactly does the MSP430 handle multiple interrupts served by the TAIV interrupt handler (that is TACCR1, TACCR2 and Timer A overflow)? By "multiple", I mean that several of these interrupts occur before they can be handled one after another by the TAIV interrupt handler.
* More generally: when multiple interrupts are pending, does the MSP430 fire them in any known order of precedence, in order of occurrence, or just randomly?
* And finally, for the other ports: do you have similar problems on (for example) ARM Cortex-M MCUs? Or are the interrupt and timer subsystems more "robust" on these platforms? (i.e. is this a problem specific to the MSP430 architecture?)

Thanks in advance for your hints,

Also note: I just looked at how things are handled in FreeRTOS (i.e. a system known for its strong real-time features). It seems that on MSP430 they only use Timer A's CCR0, in up mode; that is, the HW timer always fires at a predefined, fixed interval. The various timer-related tasks (including the system's scheduler) are then handled in software, similarly (at least in concept) to what is done in RIOT's vtimer module.
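The general pattern described here (one fixed-rate hardware tick, everything else done in software) can be sketched as follows. This is neither FreeRTOS nor RIOT vtimer code; all names (soft_timer_t, soft_timer_set(), tick_isr(), on_fire()) are hypothetical, and tick_isr() stands in for the body of the periodic CCR0 interrupt.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SOFT_TIMERS 4

typedef struct {
    uint32_t expires;        /* absolute tick at which to fire */
    void (*callback)(void);  /* NULL marks a free slot */
} soft_timer_t;

static volatile uint32_t tick_count;
static soft_timer_t soft_timers[MAX_SOFT_TIMERS];

/* Example callback used to observe expirations. */
static int fired;
static void on_fire(void) { fired++; }

/* Arm a one-shot timer; returns 0 on success, -1 if no slot is free. */
static int soft_timer_set(uint32_t delay_ticks, void (*cb)(void))
{
    for (size_t i = 0; i < MAX_SOFT_TIMERS; i++) {
        if (soft_timers[i].callback == NULL) {
            soft_timers[i].expires = tick_count + delay_ticks;
            soft_timers[i].callback = cb;
            return 0;
        }
    }
    return -1;
}

/* Body of the periodic interrupt: called once per fixed-length tick. */
static void tick_isr(void)
{
    tick_count++;
    for (size_t i = 0; i < MAX_SOFT_TIMERS; i++) {
        if (soft_timers[i].callback != NULL &&
            (int32_t)(tick_count - soft_timers[i].expires) >= 0) {
            void (*cb)(void) = soft_timers[i].callback;
            soft_timers[i].callback = NULL;  /* one-shot: free the slot */
            cb();
        }
    }
}
```

For example, after soft_timer_set(3, on_fire), the callback runs during the third call to tick_isr(). The trade-off is resolution: nothing can fire more precisely than one tick period.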

So it seems that at least one other well-known project has just used the MSP430 timer machinery in a minimal fashion. I just hope this doesn't mean the whole MSP430 timer system has been deliberately avoided because of known flaws... :expressionless:

Best regards,

On 15/04/2014 12:43, ROUSSEL Kévin wrote:

[...]

Hi Kévin,

I encountered a possibly related problem back in January when I was porting the OpenWSN stack to RIOT and the TelosB. I observed that only the highest channel (I think there are 4?) of the timer was behaving correctly, while the other channels somehow didn't fire at all. Sadly, I can't recall exactly what I found out about the source of this misbehaviour. Back then I fell back to using hwtimer_arch*() with the channel I knew to work, which was good enough for me since OpenWSN provides its own timer abstraction, similar to hwtimer. Then I forgot about this until your description rang a bell. I'll try to reproduce the debugging setup I had and could come back to you by tomorrow, hopefully with a refreshed memory and some more insights.

Best, Thomas

Hello Thomas,

Thanks for your input. Indeed, this is consistent with FreeRTOS using only the first comparator of the timer (TACCR0)...

Looks like we're facing another of TI's "design wins"... Great.

Best regards,

On 15/04/2014 14:18, Thomas Eichinger wrote:

Hello Kévin,

I wasn't able to reproduce the problem I had with MSP430's hwtimer. Could you provide me with a test case or project in which you're experiencing this behaviour?

Kind regards, Thomas

Hi Thomas,

Maybe you are confusing this with the bug that was fixed in #909?

Cheers, Ludwig

Hi Ludwig,

You are right; reading the diffs, this has most probably fixed my problem (which I still don't remember in full, though).

Best, Thomas

Hello everyone,

I think I have found why this problem happens: the timer overflow interrupt, in the MSP430 architecture, has the *lowest* priority of all timer-related interrupts!

The result is that when timers are set, or hwtimer_spin() is called, while the timer's counter is about to overflow, there is a non-negligible risk of having the wrong timer_round value when the timers fire (or when hwtimer_spin() should return). This is what happens when I experience failures with my code.
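The effect can be reproduced with a small host-side model of the shared Timer A vector. The TAIV values below (0x02 for TACCR1, 0x04 for TACCR2, 0x0A for the overflow flag TAIFG) are those documented for Timer_A3 in the MSP430x2xx family user's guide; everything else (the pending flags, read_taiv(), timer_a1_isr(), round_seen_by_ccr1) is a hypothetical simulation, not RIOT code.

```c
#include <stdint.h>

/* TAIV values for Timer_A3, per the MSP430x2xx family user's guide. */
#define TAIV_NONE   0x00
#define TAIV_TACCR1 0x02
#define TAIV_TACCR2 0x04
#define TAIV_TAIFG  0x0A

/* Simulated pending flags and state (hypothetical names). */
static int ccr1_pending, ccr2_pending, taifg_pending;
static uint16_t timer_round;
static uint16_t round_seen_by_ccr1;

/* Model of a TAIV read: report and clear the highest-priority pending flag. */
static unsigned read_taiv(void)
{
    if (ccr1_pending)  { ccr1_pending = 0;  return TAIV_TACCR1; }
    if (ccr2_pending)  { ccr2_pending = 0;  return TAIV_TACCR2; }
    if (taifg_pending) { taifg_pending = 0; return TAIV_TAIFG; }
    return TAIV_NONE;
}

/* Shared handler. Real hardware re-enters the ISR once per pending flag;
 * the loop here only reproduces the resulting service order. */
static void timer_a1_isr(void)
{
    unsigned src;
    while ((src = read_taiv()) != TAIV_NONE) {
        switch (src) {
        case TAIV_TACCR1: round_seen_by_ccr1 = timer_round; break;
        case TAIV_TACCR2: break;
        case TAIV_TAIFG:  timer_round++; break;
        }
    }
}
```

With both ccr1_pending and taifg_pending set at the same time, the TACCR1 branch runs first and records the old timer_round; the increment only happens afterwards.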

As I suspected when I saw the FreeRTOS source code for MSP430, we are victims of TI's poor design here.

Alas, the timer-related interrupt with the highest priority on MSP430 is the one for CCR0. This means that the most robust way to handle timer overflow is to use the CCR0 comparator interrupt, that is: undo Oleg's PR #909! This implies that we will only have two available instances of hwtimer on MSP430... It's a real shame, but I couldn't find another robust way to fix that problem.
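A rough host-side model of that idea, assuming TACCR0 is programmed so that its compare event coincides with the counter wrap-around (an assumption of this sketch, not something taken from the code before or after #909). The dispatcher only models the fact that the dedicated TACCR0 vector is served before the shared one; all names are hypothetical.

```c
#include <stdint.h>

/* Simulated pending flags and state (hypothetical names, not RIOT code). */
static int ccr0_pending, ccr1_pending;
static uint16_t timer_round;
static uint16_t round_seen_by_ccr1;

/* Dedicated TACCR0 vector: used here to count counter wrap-arounds. */
static void timer_a0_isr(void)
{
    timer_round++;
}

/* Shared vector, reduced to the TACCR1 source for brevity. */
static void timer_a1_isr(void)
{
    round_seen_by_ccr1 = timer_round;
}

/* Model of the interrupt logic: TACCR0's vector outranks the shared one. */
static void dispatch_pending(void)
{
    if (ccr0_pending) { ccr0_pending = 0; timer_a0_isr(); }
    if (ccr1_pending) { ccr1_pending = 0; timer_a1_isr(); }
}
```

When both flags are pending, the TACCR1 handler now sees the already incremented timer_round. The cost is the one stated above: CCR0 is no longer available as a hwtimer instance.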

Oleg, what is your opinion? Should we revert #909?