Assessing CodeQL from a Free Software point of view

Before picking the FSFE’s brains on whether we want to use CodeQL (as suggested during the last VMA), I’ve had a look at what they actually publish:

  • codeql-action is MIT-licensed, but that appears to be mainly the tooling for using CodeQL. (A distracting practice that has unfortunately become common – slap an OSI-approved license on the thing users see, even though it’s useless on its own. In Debian that’d put it into the “contrib” section rather than the main free repos.)
  • CodeQL itself is non-free, not even source-available, but gratis to use for some sorts of projects, with some parts only allowed “[if] the Open Source Codebase is hosted and maintained on”. It reserves the right to phone home and ask for auto-updates.

There appear to be two aspects to using this: “generating CodeQL databases” and “performing analyses”. The generating part looks like it derives a CodeQL database from your own code, and is only allowed under the abovementioned conditions. Running CodeQL queries looks unproblematic, both because it appears not to take code out, and because it can be done in a self-hosted fashion (even though the binaries are still non-free).

Based on that, I don’t think the FSFE can be of much help – asking them was suggested on the assumption that there’s a license attached to the code.

Looking at how this is used (not that they’d point that out a lot), both generation of databases and analysis are done in the actions.

My interpretation of this as a whole is that as long as we’re hosting on GitHub we can use it (licence-wise), but as soon as we (say) moved our issue tracking off GitHub (and only kept a mirror on GitHub), we’d lose the right to use the software. So far, we pretend to use GitHub voluntarily; if we made this part of our toolchain, it would lock us in.

On the upside, there is no indication that GitHub gains any rights to the code or the generated databases. They acknowledge that the databases carry their own licenses (that would be LGPL in our case).

My suggestion moving forward is to not make this part of our regular processes (as it’d contribute to the GitHub lock-in – “can’t move out without losing X”). But as many of us also “host and maintain” our forks of RIOT on GitHub, I don’t see anything that’d keep such users from running CodeQL on their forked master branches and opening issues if anything shows up.

Thanks, @chrysn for the in-depth analysis! Some thoughts.

Do we really lock ourselves in? As soon as we move to other tools like GitLab, Codeberg, or other such platforms, we can remove the action and the lock would be “opened”. If we move away from GitHub, we probably want to remove the whole .github directory at some point anyway.

Regarding running on forks: that would mean that the contributors always need to be at least one commit ahead of the upstream master in their master branch, as the actions are part of the repo code (residing in .github/workflows/). Not arguing against that, but fearing weird PRs by confused contributors who try to PR their master branch :wink: .

It is a lock-in if we start relying on it, or building on it. In a way, we already have that in all the CI setup – it’s just one more thing on the list of the many things that’ll keep us on GH next time they mess up. To some extent that’s also influenced by how we present and perceive this. If this is framed as “this is a convenient service we use now, but are aware that it can be taken away on a whim (which it can) and we’re prepared for that, it’s just as if someone who does many reviews left the team”, it can be fine. I just don’t want to see this on a list of “reasons why we can’t migrate to {codeberg, gitlab, …}” the next time that discussion starts.

As for the master branch: I’d have hoped that these CI tools allow something like “here I have a branch called with-codeql; test this every 24 hours by merging origin/master and then running…”. If we manage to work on the expectations side and treat this as an experiment-until-whenever-it’s-shut-off, it might just as well be an option with the main repo.

By the way, I now remember what that other GitHub service was that prompted some concerns when the original discussion came up: it was (is? didn’t hear from it again) called Copilot. They train that and distribute the results without treating them as derivative works anyway (citing fair use), so it stands to reason they’d do the same with the security scans, no matter whether we use them or not. That makes this primarily a matter of “let’s not make this a part of our infrastructure that we use so much it’ll keep us at GH even when more good reasons to move appear”.