Hello all,
Recently we did some AES encryption/decryption optimizations and experiments, and I want to share the results here.
System: STM32L151CCU6 @ 32 MHz, arm-none-eabi-gcc 6.3.1, -O2
What we tried:
- existing option: FULL_UNROLL to unroll loops in aes.c
- new option: AES_CALCULATE_TABLES to calculate the AES T-tables at run time where possible instead of storing all of them in flash memory
- new implementation: AES_ASM (AES-128 only) using the “All the AES You Need on Cortex-M3 and M4” code by Peter Schwabe and Ko Stoffelen, see https://github.com/Ko-/aes-armcortexm and https://eprint.iacr.org/2016/714.pdf
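For readers who have not seen run-time T-table generation before, here is a minimal sketch of the general idea in portable C. It is not the code behind AES_CALCULATE_TABLES; all names (aes_tables_init, sbox, te0, te1..te3) are hypothetical, and it assumes the common Te0 layout where each entry holds the S-box output multiplied by 02, 01, 01, 03 from the most significant byte down. Only one encryption table is built in RAM, and the other three are derived by rotating it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names: tables live in RAM and are filled once at start-up. */
static uint8_t  sbox[256];
static uint32_t te0[256];

static uint8_t rotl8(uint8_t x, unsigned n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

static uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* Multiply by 02 in GF(2^8) with the AES polynomial 0x11B. */
static uint8_t xtime(uint8_t x)
{
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1B : 0x00));
}

void aes_tables_init(void)
{
    uint8_t p = 1, q = 1;

    /* Walk all non-zero field elements: p runs through powers of 03,
     * q through the matching inverses, so q == 1/p in every iteration. */
    do {
        p = (uint8_t)(p ^ xtime(p));            /* p *= 03 */

        q = (uint8_t)(q ^ (q << 1));            /* q /= 03 */
        q = (uint8_t)(q ^ (q << 2));
        q = (uint8_t)(q ^ (q << 4));
        if (q & 0x80) {
            q ^= 0x09;
        }

        /* Affine transformation of the inverse gives the S-box entry. */
        sbox[p] = (uint8_t)(q ^ rotl8(q, 1) ^ rotl8(q, 2) ^
                            rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63);
    } while (p != 1);
    sbox[0] = 0x63;                             /* 0 has no inverse */

    /* Te0[x] = (02*s, s, s, 03*s) with s = S-box(x). */
    for (unsigned x = 0; x < 256; x++) {
        uint8_t s  = sbox[x];
        uint8_t s2 = xtime(s);
        uint8_t s3 = (uint8_t)(s2 ^ s);
        te0[x] = ((uint32_t)s2 << 24) | ((uint32_t)s << 16) |
                 ((uint32_t)s << 8) | s3;
    }

    assert(sbox[0x01] == 0x7c);                 /* cheap sanity check */
}

/* The remaining encryption tables are byte rotations of Te0,
 * so they need no storage at all. */
uint32_t te1(uint8_t x) { return rotr32(te0[x], 8); }
uint32_t te2(uint8_t x) { return rotr32(te0[x], 16); }
uint32_t te3(uint8_t x) { return rotr32(te0[x], 24); }
```

In this sketch the trade-off is about 1.25 KB of RAM and a one-time initialization in exchange for several KB of flash; the decryption tables could be handled the same way with the inverse S-box.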
What we measured:
- firmware size (our firmware is fairly big, but only the AES options were changed between builds)
- time to encrypt/decrypt a single AES-128 block, using xtimer_now (default tick: 1 us)
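If you want to reproduce this kind of measurement, one possible approach is sketched below. It is an assumption on my side rather than the exact procedure used for the numbers in this post: the operation is timed over many calls and the average is reported, which reduces the effect of the 1 us tick. The names time_per_call_us, now_us_fn and block_fn are hypothetical; on RIOT the time source could be a microsecond counter such as xtimer_now_usec, while the fake clock and fake operation here are only host-side stand-ins so the sketch can be tried off-target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types: a microsecond time source and the operation to time. */
typedef uint32_t (*now_us_fn)(void);
typedef void (*block_fn)(void *ctx);

/* Average duration of one call to op, in microseconds (rounded to nearest). */
uint32_t time_per_call_us(now_us_fn now_us, block_fn op, void *ctx,
                          uint32_t iterations)
{
    assert(iterations > 0);

    uint32_t start = now_us();
    for (uint32_t i = 0; i < iterations; i++) {
        op(ctx);
    }
    /* Unsigned subtraction stays correct if the counter wraps once. */
    uint32_t elapsed = now_us() - start;

    return (elapsed + iterations / 2) / iterations;
}

/* Host-side stand-ins: a fake clock and an operation that "takes" 74 us. */
static uint32_t fake_clock_us = 0xFFFFFF00u;    /* close to wrap-around */

uint32_t fake_now_us(void)
{
    return fake_clock_us;
}

void fake_encrypt_block(void *ctx)
{
    (void)ctx;
    fake_clock_us += 74;
}
```

On target, fake_now_us would be replaced by the real timer function and fake_encrypt_block by a small wrapper around the AES call under test.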
OK, so here are the results (encryption / decryption time, firmware size):
- default RIOT AES implementation (no unroll, T-tables in flash): 74 / 131 us, 104.8 KB
- no unroll, run-time T-tables calc: 77 / 139 us, 97.1 KB
- full unroll, T-tables in flash: 64 / 108 us, 110.1 KB
- full unroll, run-time T-tables calc: 56 / 108 us, 102.1 KB
- assembler: 45 / 87 us, 104.6 KB
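Storing all T-tables in flash seems to be of little use, at least on Cortex-M MCUs: it takes a lot of flash, and the performance gain is insignificant or even negative (without unrolling, flash tables save only 3 us per encryption and 8 us per decryption; with full unrolling, they are 8 us slower on encryption and no faster on decryption). Unrolling loops with run-time T-tables cuts the time per block by about 24 % for encryption (74 to 56 us) and about 18 % for decryption (131 to 108 us) compared to the default settings, and reduces firmware size by 2.7 KB at the same time, so if you need a fast and portable implementation, look no further.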
Code: https://github.com/unwireddevices/RIOT/tree/loralan-public/sys/crypto (aes.c, aes_asm_cortexm.c)
I’m not sure about other hardware platforms, but since Cortex-M3/M4 is the most popular processor core now, I think the default AES implementation in RIOT should be changed to run-time T-table calculation to save flash.
P.S. https://docs.google.com/spreadsheets/d/1RwxomeVPoE-SngUHeRgPoI3ObQVgENsQQuIliMXDZNg/edit?usp=sharing