CoAP API revamp

chrysn · 10 September 2021 11:18

Gcoap in its current form has maintainability issues due to its ties to nanocoap as discussed during the summit.

Content of the summit's pad on the topic (because I don't trust these things to persist)

CoAP API (re-)design

Teaser

Gcoap and nanocoap not only exist in parallel, they also share data structures like the pdu_t or the handler. Changes to Gcoap have accumulated Gcoap-specific fields in the pdu_t, but also make it hard to use Gcoap where even larger changes would be needed (handlers being told which transport data arrived from), and to utilize the underlying socket API to its full extent (thoughts of zero-copy access to data). Is the current level of entanglement sustainable? If not, how can we migrate? What’s the fall-out? And once we’ve done that, can we just swap around CoAP transports and implementations like we swap network stacks? Is composability a topic to include (block-wise in userspace)?

Current design

2 APIs

nanocoap:
- Designed for low memory footprint
- nanocoap.h: message parsing and composition
- nanocoap_sock.h: simple client/server functionality
gcoap:
- Designed for user-friendliness
- Uses nanocoap for message parsing and composition
  - currently not top-down, but gCoAP-specific stuff #ifdef'd into nanocoap (see coap_pkt_t e.g.)

Problems and wishlist

Problems

Co-dependency of gcoap and nanocoap requires touching both for even small protocol improvements, even on the API side (handler signatures, member access; case in point: #16827 from yesterday)
Transport pretty much decided at compile-time (see #16688)
Using zero-copy capabilities of sock not possible/used atm
Everyone capitalizes Gcoap/gCoAP/gcoap differently

Wishlist

Transparent swap-out of CoAP transport
- Might need distinction between “message of which I know where it’s from and how transported” and “message received that’ll tell me where it’s from”
- Even with Gcoap this works easily on the Rust side: demos running unmodified on Linux, Gcoap or on RIOT sockets but using a Rust CoAP server.
Expose payload-read-function to userspace / make pkt->payload private
- Could help solving zero-copy problem
- Could simplify block-wise transfer in user space (see #16715)
Identify other direct-to-struct access patterns, build (transport-portable) API for them
- Some can identify non-portable behavior.
- When done, OSCORE.

Challenges

API breakage fall-out
Migration

Possible steps

Survey API use?
Deprecate field access??

Usage examples

Kaspar: nanocoap on minimal network stack through CDC-ECM (4k RAM or less)
- no security needs, no alternative transports
- could also be used for slipmux
- might make sense in separate implementations
  - server-only, “Class-0” environments (RFC7228)
- Any advanced features used? (blockwise, observe etc)
  - stateless: blockwise but not observe
Koen: Updates OTA
- all stateless; POST/GET, RIOT client, some RAM available
- block-wise used; manifest would be nice to have handled but other things callback-per-block
MCR: onboarding API
- identical needs as Updates OTA
- would like either DTLS or EDHOC+OSCORE (not runtime configured)

Approaches forward

benpicco: fork gcoap, break all?
- hcoap?
- Martine: long-term clean gcoap stuff out of nanocoap (or move to extra struct inheriting from nanocoap, maybe already)

Keep using nanocoap for message parsing (move from accessing static fields to inline accessors)
Deprecate member access through documentation

maribu: Careful about “not changing API too often”, not “too much”
- chrysn: experimental for start, but then stable
- Koen: Coccinelle script for simple changes?

Hashing out the API

@miri64
@chrysn
@Kaspar (to keep nanocoap from too much breakage)
Discuss more details at the next VMAs

From there, I’d like to sketch out concrete redesign tasks:

Plan

Introduce a new API that’s conceptually similar to Gcoap but does not promise API compatibility with it. I’ll call it gcoap-bis as a working title until it emerges with a name. In the long run, that will replace Gcoap. (This might also be phrased as an evolution of gcoap that just runs in a different namespace to allow one-time migration rather than forcing users through many small steps).

Sketching this out will raise questions on fundamental limitations, like “We want this to be usable on backends that are arbitrarily scatter-gathery”, and raise questions (like “what do we do when not even the options are contiguous in memory”).
Implement that API with nanocoap as original backend.
Compatibly (with deprecations over release cycles) get rid of direct member access in nanocoap (using static inline accessors instead).
Set nanocoap up to be usable on the options-and-payload parts even of messages that are not coap-over-udp (without bloating anything up for that use case).
Optionally: Provide a simple representation-oriented API on top of the new API. This won’t deal in messages any more, will look very different on the server and client side, and allow for more erbium-style interactions. This might become the easy-to-use end user library for some applications.
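
To illustrate the accessor step of the plan, here is a minimal sketch of what "static inline accessors instead of direct member access" could look like. The type `example_pkt_t` and both function names are hypothetical and only stand in for whatever nanocoap would actually expose; the struct layout is an assumption, not nanocoap's real one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet type; NOT nanocoap's actual coap_pkt_t layout. */
typedef struct {
    uint8_t *payload;   /* to be treated as private once accessors exist */
    size_t payload_len;
} example_pkt_t;

/* Accessor that callers use instead of reading pkt->payload directly. */
static inline const uint8_t *example_pkt_payload(const example_pkt_t *pkt)
{
    assert(pkt != NULL);
    return pkt->payload;
}

/* Accessor that callers use instead of reading pkt->payload_len directly. */
static inline size_t example_pkt_payload_len(const example_pkt_t *pkt)
{
    assert(pkt != NULL);
    return pkt->payload_len;
}
```

Because the accessors are `static inline`, a contiguous backend should compile them down to the same loads as the direct member access, while a later backend is free to change what sits behind them.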

Possible future backends are then CoAP-over-TCP, over BLE, slipmux, but also GNRC (where with a suitable content setting function we might build frame content directly from flash ROM).

Questions

If we allow scatter-gather-in data, payload access will be scatter-gather obviously (already making use harder), but also option access. How do we best deal with that? Ask them to scatter-gather access the option? Always ask them to provide a buffer and copy over the option (causing more instead of less copying)? Leaning towards “yeah it’s scatter-gather too” right now, with good helpers for all kinds of known-structure options.
- Can we assume that at least for outgoing messages, the message is contiguously allocated? (Probably not, because in the end it’d be nice to directly build CoAP messages into lwIP buffers which can be composed from slabs)
Which tools do we want to give users to check for any critical leftover options?
On Rust I have an embedded-usable CoAP API – can we take some from there? (Note that this is not scatter-gather friendly).
- The crate allows working on very constrained (eg. “all you write to the message is immutable from that point on”) backends; I don’t think that we’ll need these here as even nanocoap can implement MutableWritableMessage.
Which information do we need to provide about the transport, or common metadata?
- People currently expect that they can access MIDs (even though I think they never should)
- How do we best abstract over the hints? (Eg. setting a request to be CON is meaningless on CoAP-over-TCP)
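
As a rough idea of what a "good helper for known-structure options" might be, the sketch below decodes a CoAP uint option value (big-endian, 0 to 4 bytes) even when its bytes are split across segments. The segment type `example_seg_t` and the function `example_opt_get_uint()` are made up for illustration; the chain here is assumed to cover exactly the option value and nothing else.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment of a scattered buffer (e.g. one slab of a chain). */
typedef struct example_seg {
    const uint8_t *data;
    size_t len;
    const struct example_seg *next;
} example_seg_t;

/* Decode a CoAP uint option value (big-endian, 0 to 4 bytes) that may be
 * split across segments. Returns 0 on success, -1 if the value is longer
 * than 4 bytes. An empty value decodes to 0. */
static inline int example_opt_get_uint(const example_seg_t *seg, uint32_t *out)
{
    uint32_t value = 0;
    size_t total = 0;

    assert(out != NULL);
    for (; seg != NULL; seg = seg->next) {
        for (size_t i = 0; i < seg->len; i++) {
            if (++total > 4) {
                return -1;  /* too long for a 32-bit uint option */
            }
            value = (value << 8) | seg->data[i];
        }
    }
    *out = value;
    return 0;
}
```

With helpers of this kind, the user never sees whether the option was contiguous, and no intermediate buffer is needed for the common small-integer options.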

miri64 · 14 September 2021 20:48

Another thing that we did not talk about in the meeting, but might also be put on the wishlist is a generic CoAP API (maybe this would go between “hCoAP”/gcoap_ng/whatever and the representation-oriented API). The main thought is: whatever CoAP library is used in the backend, they all use the same generic CoAP API. This would allow for easy hot-swapping of CoAP implementations, sock-style.

jia200x · 15 September 2021 08:10

> Another thing that we did not talk about in the meeting, but might also be put on the wishlist is a generic CoAP API (maybe this would go between “hCoAP”/gcoap_ng/whatever and the representation-oriented API). The main thought is: whatever CoAP library is used in the backend, they all use the same generic CoAP API. This would allow for easy hot-swapping of CoAP implementations, sock-style.

+1

This also allows to use network-stack-embedded CoAP implementation such as the one in OpenThread and even avoid including sock if it’s not needed.

miri64 · 15 September 2021 08:34

I started a tracking issue for that step on Github.

chrysn · 16 September 2021 06:30

Pluggability is great (and I see it as a desirable outcome of this), but we need to be aware that it always comes at the cost of either supporting a small core set of interactions, or ramping up error handling.

For example, if one were to support a struct-based API as a backend (like erbium, which admittedly is at a higher abstraction level and would not fit in here precisely), “set this option” could fail on “that’s not an option this backend supports”. For the purposes here, I’ll try to find some reasonable common ground without including everything.
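
To make the error-handling cost concrete, here is a small sketch of a pluggable backend interface in which "set this option" can fail. Everything here (`example_coap_backend_t`, `example_coap_set_option()`, the minimal backend) is hypothetical and not an existing RIOT API; the only facts relied on are that Content-Format is option number 12 and that `ENOTSUP` comes from `<errno.h>`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical backend interface; all names are illustrative. */
typedef struct {
    /* Returns 0 on success or a negative errno value such as -ENOTSUP. */
    int (*set_option)(void *msg, uint16_t optnum,
                      const uint8_t *val, size_t len);
} example_coap_backend_t;

/* A deliberately limited backend that only knows Content-Format (12). */
static int example_minimal_set_option(void *msg, uint16_t optnum,
                                      const uint8_t *val, size_t len)
{
    (void)msg;
    (void)val;
    (void)len;
    return (optnum == 12) ? 0 : -ENOTSUP;
}

static const example_coap_backend_t example_minimal_backend = {
    .set_option = example_minimal_set_option,
};

/* Generic entry point the application calls, regardless of backend. */
static int example_coap_set_option(const example_coap_backend_t *be,
                                   void *msg, uint16_t optnum,
                                   const uint8_t *val, size_t len)
{
    assert(be != NULL && be->set_option != NULL);
    return be->set_option(msg, optnum, val, len);
}
```

An application written against the generic entry point has to check the return value on every call, which is exactly the ramp-up in error handling that a small common core of interactions would avoid.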

I also do hope that picking the minimal backend means that applications like Kaspar’s can largely use the shared API without loss of coding or execution efficiency, but see that as an ambitious extra goal we may or may not reach.

chrysn · 26 September 2021 12:44

A big question that we’ll need to ponder is how to use encryption (OSCORE but also EDHOC and anything COSEish) with scatter-gather data:

AEAD implementations are extremely wary of scatter-gather data. Even when they support stream processing on a theoretical level, no single one I’ve found offers to do AEAD on noncontiguous data. The rationale given is that once you start decrypting and leave the tentative plaintext in memory, there’s the risk that users will read from there and that’s a big no-go in AEAD before the authentication has been checked (which is at the end). This is primarily about decryption (because with encryption not so much can go bad), but even there memory can be scattered (eg. in lwIP slabs).

It’s already hard enough to find implementations that allow scatter-gather feeding of the additional data (where nothing could go wrong), and there the obstacle is only API complexity, not safety on top of API complexity. In the foreseeable future I don’t expect a sensible portion of the operations we use to be scatter-gather friendly.

So what to do if, for example, in an incoming message the user asks through the API for bytes 20 through 542 of the message payload in memory? (For example, in libOSCORE this is phrased as “get a memory mapped view of the message” – might be pretentious terminology for something that’s really just “give me the address of”, but backends could be diverse). What’d our API prescribe?

That all backends need to be able to do that? Can’t be guaranteed if buffers are full and malloc says no…
That users must expect this to fail and provide buffers of their own to copy into (possibly through a copy_from_payload helper) if data is too scattered?
That application authors must carefully read the docs of both their CoAP applications and their backend and ensure they work? (“may not be able to provide a memory mapped complete view of the message” … “requires that the message can be viewed in memory” … too bad, fails!)
That it’s the general expectation that backends reallocate and copy data if needed (it’s not like that’ll happen on every message!), that applications should not try to work around that (“if it doesn’t work, the system is currently out of memory; if you added a static buffer to work around that, that static buffer might just as well have given lwIP a single larger slab”) and need to handle defeat by taking a smaller bite at the application level?

I’m leaning towards the latter – because CoAP can usually do that well (eg. in OSCORE: indicating that the sender needs to make its inner blocks smaller), and also because it’d still kinda work with my hopes of keeping simple applications simple. (“You know you’re building things nanocoap_sock style? Great, then no error can occur, and you can slop around error handling, or just put an assert there to be safe”).
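
For reference, the copy-out alternative could be as small as the following sketch. Only the helper's name comes from the discussion above; the signature, the slab type `example_slab_t` and the return convention are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical slab of a scattered payload (e.g. one buffer of a chain). */
typedef struct example_slab {
    const uint8_t *data;
    size_t len;
    const struct example_slab *next;
} example_slab_t;

/* Copy up to `len` payload bytes starting at `offset` into `dst`.
 * Returns the number of bytes actually copied, which is less than `len`
 * if the payload ends early. */
static inline size_t example_copy_from_payload(const example_slab_t *slab,
                                               size_t offset,
                                               uint8_t *dst, size_t len)
{
    size_t copied = 0;

    assert(dst != NULL || len == 0);
    for (; slab != NULL && copied < len; slab = slab->next) {
        if (offset >= slab->len) {
            offset -= slab->len;    /* range starts in a later slab */
            continue;
        }
        size_t chunk = slab->len - offset;
        if (chunk > len - copied) {
            chunk = len - copied;   /* do not overrun the destination */
        }
        memcpy(dst + copied, slab->data + offset, chunk);
        copied += chunk;
        offset = 0;                 /* later slabs are read from their start */
    }
    return copied;
}
```

This cannot fail for lack of memory, but it puts the buffer (and the copy) on the application's side, which is the trade-off weighed in the alternatives above.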

mcr · 27 September 2021 17:33

> A big question that we’ll need to ponder is how to use encryption (OSCORE but also EDHOC and anything COSEish) with scatter-gather data:

> AEAD implementations are *extremely* wary of scatter-gather data. Even

I’ve never heard that.

> when they support stream processing on a theoretical level, no single
> one I've found offers to do AEAD on noncontiguous data. The rationale
> given is that once you start decrypting and leave the tentative
> plaintext in memory, there's the risk that users will read from there
> and that's a big no-go in AEAD before the authentication has been
> checked (which is at the end). This is primarily about decryption
> (because with encryption not so much can go bad), but even there memory
> can be scattered (eg. in lwIP slabs).

This seems like a made-up excuse. How do I start to read if the synchronous call hasn’t returned? How would this be ANY different than if the data was in a single buffer?

> I'm leaning towards the latter -- because CoAP can usually do that well
> (eg. in OSCORE: indicating that the sender needs to make its inner
> blocks smaller), and also because it'd still kinda work with my hopes
> of keeping simple applications simple. ("You know you're building
> things nanocoap_sock style? Great, then no error can occur, and you can
> slop around error handling, or just put an assert there to be safe").

Sounds good to me.

chrysn · 28 September 2021 16:33

There’s two kinds of scatter-gather access – using linked lists (ie. control never leaves the user) and using some feeder / callback (eg. an Iterator<Item=&[u8]> in Rust, or in extreme forms even asynchronous streams, possibly realized by the feeder just calling into the AAD state machine from the outside).

The latter is way more flexible, for example if you want to process a single file that can’t even be mapped to contiguous RAM (like, decrypting an entire firmware image in one go when you have less RAM than flash). But that does allow user code to be executed on maybe incorrectly decrypted plaintext.
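
The two styles can be sketched side by side in C. The "state machine" below is just a byte sum standing in for the crypto update step, not a real AEAD, and all names are hypothetical. In the list style the whole chain is handed over in one call; in the feeder style user code runs between chunks, while processing is still in progress.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a crypto state machine; a byte sum, NOT a real AEAD. */
typedef struct {
    uint32_t sum;
} example_state_t;

static void example_update(example_state_t *st, const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        st->sum += data[i];
    }
}

/* Style 1: linked list. The complete chain is passed in a single call and
 * no user callback runs between chunks. */
typedef struct example_chunk {
    const uint8_t *data;
    size_t len;
    const struct example_chunk *next;
} example_chunk_t;

static uint32_t example_process_list(const example_chunk_t *chunk)
{
    example_state_t st = { 0 };

    for (; chunk != NULL; chunk = chunk->next) {
        example_update(&st, chunk->data, chunk->len);
    }
    return st.sum;
}

/* Style 2: feeder callback. User code produces each chunk on demand and
 * returns its length, or 0 when the input is exhausted. */
typedef size_t (*example_feeder_t)(void *ctx, const uint8_t **data);

static uint32_t example_process_feeder(example_feeder_t feed, void *ctx)
{
    example_state_t st = { 0 };
    const uint8_t *data;
    size_t len;

    assert(feed != NULL);
    while ((len = feed(ctx, &data)) > 0) {
        example_update(&st, data, len);
    }
    return st.sum;
}

/* Example feeder that hands out the chunks of a list one by one. Note that
 * a zero-length chunk would end the stream early under this convention. */
static size_t example_list_feeder(void *ctx, const uint8_t **data)
{
    const example_chunk_t **pos = ctx;
    const example_chunk_t *cur = *pos;

    if (cur == NULL) {
        return 0;
    }
    *data = cur->data;
    *pos = cur->next;
    return cur->len;
}
```

The feeder could just as well read from flash or a file, which is where the flexibility comes from, and also where the exposure to unauthenticated plaintext comes in on the decryption side.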

I tried to support my statement on the implementors’ aversion to scatter-gather by citing OpenSSL’s EVP API, but it turns out that’s internally contradictory and may indeed allow this use (they just insist the AAD be in contiguous memory): “you can only call EVP_DecryptUpdate once for AAD and once for the plaintext” vs “Provide the message to be decrypted, and obtain the plaintext output. / EVP_DecryptUpdate can be called multiple times if necessary”. The libsodium API takes all in one go. Some discussion was had around the RustCrypto traits when I looked at them for OSCORE.