As a branch from the CI speed goal, I think before we make any big suggestions and/or changes we should try to find out what is actually taking long and where the failures are.
If we find some common errors that take lots of resources to discover, maybe we can prevent them (either by failing faster or with more info on how developers should test before submitting a PR)…
This could start with simple things such as time spent on failed PRs vs. successful ones, then get more detailed by looking at the types of failures. Maybe also look at how many retries it takes until the PR is merged (for example, someone can pass Murdock but still push some changes later).
I’m collecting some statistics on the Murdock queue here. The load is the number of PRs in the queue; the wait time is the time a build has to wait in the queue (I record both the average over the whole queue and the max in the queue).
(I would be happy to move this Grafana instance and the Python scripts to RIOT infrastructure, though.)
That is very nice! I was also thinking about using Grafana. How easy would it be to log the failures over time? (I think the history gets erased when the PR is rerun.)
For this setup I’m using InfluxDB as the storage backend. InfluxDB works very nicely as a time series database. It scales pretty well to the number of records I have so far, although it doesn’t like the number of test/board combinations we have. I think that results in what InfluxDB calls high ‘series cardinality’. It handles this without any errors, but it can slow down slightly because of it.
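To illustrate the cardinality issue (with a hypothetical schema, not necessarily the one running here): when board and test names are stored as tags, InfluxDB creates one series per unique tag combination, so the series count grows with boards × tests.

```python
from influxdb import InfluxDBClient  # pip install influxdb (the 1.x client)

client = InfluxDBClient(host="localhost", port=8086, database="murdock")

# Hypothetical schema: 'board' and 'test' as tags. InfluxDB keeps one
# series per unique tag combination, so e.g. 200 boards x 500 tests
# already means ~100k series -- that is the 'series cardinality' issue.
client.write_points([{
    "measurement": "test_result",
    "tags": {"board": "samr21-xpro", "test": "tests/xtimer_usleep"},
    "fields": {"passed": 1},
}])
```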
In the queue case I’m simply adding a record every N minutes with the queue size and the average and max wait time. This case is a breeze for Influx.
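As a minimal sketch of that recording loop (assuming the 1.x influxdb Python client; the measurement and field names, and the queue-fetching helper, are made up):

```python
import time
from influxdb import InfluxDBClient  # pip install influxdb (the 1.x client)

INTERVAL = 5 * 60  # record a point every N minutes (here N = 5)

client = InfluxDBClient(host="localhost", port=8086, database="murdock")


def fetch_queue_wait_times():
    """Hypothetical: return the current wait time (in seconds) of every
    build in the Murdock queue, e.g. scraped from its status page."""
    return [120.0, 340.0, 55.0]  # placeholder data


while True:
    waits = fetch_queue_wait_times()
    client.write_points([{
        "measurement": "murdock_queue",          # made-up measurement name
        "fields": {
            "load": len(waits),                  # number of PRs in the queue
            "wait_avg": sum(waits) / len(waits) if waits else 0.0,
            "wait_max": max(waits) if waits else 0.0,
        },
    }])
    time.sleep(INTERVAL)
```

No tags are needed here, so every write goes to the same series, which is why this case is cheap for Influx.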
If you want to log builds, we could add a hook to the scripts to submit the results (success, failure, aborted). This would contain the PR number as a field. I don’t know yet how I would add information about the exact failed/succeeded tasks to the database.
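A rough sketch of such a hook could look like the following (the hook name and measurement name are invented; storing the result and PR number as fields rather than tags avoids adding to the series cardinality mentioned above):

```python
from influxdb import InfluxDBClient  # pip install influxdb (the 1.x client)

client = InfluxDBClient(host="localhost", port=8086, database="murdock")


def on_build_finished(pr_number, result):
    """Hypothetical hook the build scripts would call when a job ends.

    `result` is one of "success", "failure" or "aborted". Both values
    are stored as fields (not tags), so no new series is created per PR.
    """
    client.write_points([{
        "measurement": "murdock_build",   # made-up measurement name
        "fields": {"pr": pr_number, "result": result},
    }])


# Example: the CI wrapper would call this after each run
# (PR number here is just an illustration).
on_build_finished(12345, "failure")
```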