AES crypto optimizations, speed and size

Oleg_Artamonov · 2 January 2018 16:06

Hello all,

We recently ran some AES encryption/decryption optimization experiments, and I want to share the results here.

System: STM32L151CCU6 @ 32 MHz, arm-none-eabi-gcc 6.3.1, -O2

What we tried:

* existing option: FULL_UNROLL to unroll loops in aes.c
* new option: AES_CALCULATE_TABLES to calculate AES T-tables at run time where possible instead of storing everything in flash memory
* new implementation: AES_ASM (AES-128 only) using the “All the AES You Need on Cortex-M3 and M4” code by Peter Schwabe and Ko Stoffelen, see https://github.com/Ko-/aes-armcortexm and https://eprint.iacr.org/2016/714.pdf
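
To illustrate what run-time T-table calculation can look like, here is a minimal sketch that builds the AES S-box and the first encryption table (Te0) in RAM, and derives the other three encryption tables by rotation. It is not necessarily what AES_CALCULATE_TABLES does; the function names are made up for this example, and the word layout assumes the convention of the common rijndael-alg-fst.c reference tables.

```c
#include <stdint.h>

/* Rotate an 8-bit value left by n bits (0 < n < 8). */
static uint8_t rotl8(uint8_t x, unsigned n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* Rotate a 32-bit word right by n bits (0 < n < 32). */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* Multiply by 2 in GF(2^8) modulo the AES polynomial 0x11B. */
static uint8_t xtime(uint8_t x)
{
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1B : 0x00));
}

/* Fill sbox[] without any stored table (hypothetical helper name). */
static void aes_gen_sbox(uint8_t sbox[256])
{
    uint8_t p = 1, q = 1;

    do {
        /* p walks through all non-zero field elements: p = p * 3 */
        p = (uint8_t)(p ^ xtime(p));

        /* q stays the multiplicative inverse of p: q = q / 3 */
        q ^= (uint8_t)(q << 1);
        q ^= (uint8_t)(q << 2);
        q ^= (uint8_t)(q << 4);
        if (q & 0x80) {
            q ^= 0x09;
        }

        /* affine transformation of the inverse */
        sbox[p] = (uint8_t)(q ^ rotl8(q, 1) ^ rotl8(q, 2) ^
                            rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63);
    } while (p != 1);

    sbox[0] = 0x63; /* zero has no inverse, handled separately */
}

/* Te0[x] = (2*S[x], S[x], S[x], 3*S[x]) packed into one 32-bit word. */
static void aes_gen_te0(const uint8_t sbox[256], uint32_t te0[256])
{
    for (unsigned i = 0; i < 256; i++) {
        uint8_t s  = sbox[i];
        uint8_t s2 = xtime(s);
        uint8_t s3 = (uint8_t)(s2 ^ s);

        te0[i] = ((uint32_t)s2 << 24) | ((uint32_t)s << 16) |
                 ((uint32_t)s << 8) | s3;
    }
}

/* Te1..Te3 are byte rotations of Te0, so they need no storage at all. */
static uint32_t aes_te(const uint32_t te0[256], unsigned table, uint8_t x)
{
    return table ? rotr32(te0[x], 8 * table) : te0[x];
}
```

In this sketch the trade-off is about 1.25 KB of RAM (S-box plus Te0) and an extra rotation per lookup for Te1..Te3, in exchange for keeping those tables out of flash.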

What we measured:

* firmware size (our firmware is fairly large, but only the AES options were changed between builds)
* time to encrypt/decrypt a single AES-128 block, using xtimer_now (1 us default tick)
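
A measurement like this can be sketched as a small harness that reads a microsecond clock before and after the operation. The sketch below is an assumption about how one might do it, not the code used for the numbers in this post: the clock is passed in as a callback (on RIOT it could wrap something like xtimer_now_usec(), which is itself an assumption about the xtimer API in use), and averaging over several iterations is my addition to get below the 1 us tick resolution. The sim_* helpers are hypothetical stand-ins so that the harness can be tried on a host.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*now_us_fn)(void);   /* returns a free-running microsecond counter */
typedef void (*block_fn)(void *ctx);   /* processes one AES block */

/* Average time per call of op(), in whole microseconds. */
static uint32_t time_block_us(now_us_fn now_us, block_fn op, void *ctx,
                              uint32_t iterations)
{
    if (iterations == 0) {
        return 0;
    }

    uint32_t start = now_us();
    for (uint32_t i = 0; i < iterations; i++) {
        op(ctx);
    }
    /* unsigned subtraction stays correct across one counter wraparound */
    uint32_t elapsed = now_us() - start;

    return elapsed / iterations;
}

/* Host-side stand-ins (hypothetical): a simulated clock and a fake
 * "encrypt" that advances it by 74 us per block. */
static uint32_t sim_time_us;

static uint32_t sim_now_us(void)
{
    return sim_time_us;
}

static void sim_encrypt_block(void *ctx)
{
    (void)ctx;
    sim_time_us += 74;
}
```

On a target, sim_now_us and sim_encrypt_block would be replaced by the real timer read and a wrapper around the cipher call under test.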

OK, so here are the results (encryption / decryption time, firmware size):

* default RIOT AES implementation (no unroll, T-tables in flash): 74 / 131 us, 104.8 KB
* no unroll, run-time T-tables calc: 77 / 139 us, 97.1 KB
* full unroll, T-tables in flash: 64 / 108 us, 110.1 KB
* full unroll, run-time T-tables calc: 56 / 108 us, 102.1 KB
* assembler: 45 / 87 us, 104.6 KB

Storing all T-tables in flash looks close to useless, at least on Cortex-M MCUs: it takes a lot of flash, and the performance gain is insignificant or even negative. Unrolling loops with run-time T-tables cuts the per-block time by about 24 % for encryption and 18 % for decryption compared with the default settings (74 → 56 us and 131 → 108 us), and shrinks the firmware by 2.7 KB at the same time, so if you need a fast and portable implementation, look no further.
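
The relative gains can be checked directly from the measurements listed above. The helper below is just the arithmetic, with a made-up function name; it expresses the gain as the percentage by which the per-block time drops.

```c
/* Percentage by which the per-block time drops from before_us to after_us. */
static double time_reduction_percent(double before_us, double after_us)
{
    return 100.0 * (before_us - after_us) / before_us;
}
```

With the numbers from this post, full unroll with run-time T-tables against the default gives about 24.3 % for encryption (74 → 56 us) and about 17.6 % for decryption (131 → 108 us); the assembler version gives about 39.2 % and 33.6 % (74 → 45 us and 131 → 87 us).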

Code: https://github.com/unwireddevices/RIOT/tree/loralan-public/sys/crypto (aes.c, aes_asm_cortexm.c)

I’m not sure about other hardware platforms, but since Cortex-M3/M4 is the most popular processor core now, I think the default AES implementation in RIOT should be changed to run-time T-table calculation to save flash.

P.S. https://docs.google.com/spreadsheets/d/1RwxomeVPoE-SngUHeRgPoI3ObQVgENsQQuIliMXDZNg/edit?usp=sharing

Ludwig_Knupfer · 2 January 2018 16:26

Hello,

First of all thank you for sharing your insights!

While I’m not entirely sure I get all the implications right away, I do have the following thoughts:

I assume the results cannot be generalized to the whole CPU architecture, because CPU clock and flash read speed vary independently.

I expect the result depends on the concrete product and configuration you’re looking at. A factor-of-2 difference in memory access speed across Cortex-M products does not seem unlikely to me.

Did you factor this into your conclusion? Any thoughts?

Cheers, Ludwig

Oleg_Artamonov · 2 January 2018 17:55

I believe the results will be even better on most Cortex-M parts, with a nice performance gain from run-time T-tables on high-speed CPUs. As a rule, the MCU core gains speed more easily than flash, so a 168 MHz high-end Cortex-M4F may have the same flash memory as an entry-level Cortex-M0, limited to a 20-30 MHz effective clock speed.

For example, see flash wait cycles vs core clock for the STM32F4: https://yadi.sk/i/SEI7mXtK3RAFxg, where the effective flash clock is 20 MHz. The same holds for the Atmel SAM3U (https://yadi.sk/i/B4iA0ik33RAGVK, 24 MHz effective flash clock), for EFM32, etc.
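
The "effective flash clock" here is simply the core clock divided by the number of core cycles one flash read takes. A rough sketch of that arithmetic follows, with a made-up function name; it ignores prefetch buffers, caches and flash accelerators, which can hide part of the wait states.

```c
#include <stdint.h>

/* A flash read with N wait states occupies N + 1 core cycles, so the
 * rate of uncached flash reads is core clock / (N + 1). */
static uint32_t effective_flash_khz(uint32_t core_khz, uint32_t wait_states)
{
    return core_khz / (wait_states + 1u);
}
```

As an illustrative case (not a datasheet value), a 120 MHz core running with 5 wait states would read flash at 20 MHz, so every table lookup that misses any cache costs 6 core cycles.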

TI CC1310 and CC2650 have an 8 KB flash cache, so they may perform better with in-flash T-tables, but that's a very specific case (not to mention that the flash cache can be disabled and used as regular RAM, and it often is, as there's only 20 KB of regular RAM available). Anyway, they have a hardware AES accelerator too.

On low-end MCUs (20 MHz or less, 8-bit and 16-bit architectures) in-flash T-tables should be faster, but most such MCUs have a very limited amount of flash as well, so wasting 7 KB on T-tables is not an option anyway.