Our engineers share the background to our Continuous Integration (CI) process in this design doc. Internally, the team decided to nickname it OurCI and you'll see it referred to as such in the design doc below.
As always, the below is the original design doc our engineering team used to outline their thought process for the development part of Mongoose IoT Platform. Still included are all our internal puns (sorry about those!), and if anything is unclear and you want to dive deeper, head over to the forum and ask your question there.
Design an OurCI service that can distribute builds to remote agents. Some agents will perform builds; others will flash devices and run Hardware (HW) tests. Jobs execute in parallel.
Non-goals: Defining the exact behavior of HW tests.
Our current CI is slow and cannot run HW tests.
Parallelizing builds on a multicore host is not easy in the current setup. It's not just about adding -j32 to our make invocation. We need to run builds with different toolchains that are distributed in different Docker images, and we need to be able to build things in parallel but still have a readable output in case of build failures.
The core of our cloud backend is ready enough to serve as a basis for our continuous integration tool. Using it will offer us the opportunity to dogfood our infrastructure. This will affect some choices here. We also have a drive toward avoiding over-engineering. The proposed solution hopefully contains a balanced tradeoff between using our cloud infrastructure for its own sake and being simple enough so it can be relied upon even while we are doing massive changes on the very system we're testing.
- Implement a generic Pub Sub service modeled after Google Cloud Pub Sub. It was designed some time ago but has not been implemented yet.
- Agents will talk clubby, execute shell scripts (or whatever), and report back output and status.
- The current build scripts will be incrementally adapted to take advantage of the ability to spawn child builds.
- HW tests are just shell scripts that run on agents running on hosts that are connected to devices.
The ability to spawn new build tasks from a shell script allows us to build simple workflows.
There is no support for fancy things such as join nodes, node dependencies, or defining a workflow in a declarative way: the tasks to execute are defined at runtime by a seed master script.
Green boxes here are messages published via Pub Sub. Dotted lines refer to tasks executed by.
Distribution protocol is based on our Pub Sub service.
The OurCI master process will listen to GitHub events and spawn master builds by publishing a build request message.
A master build script will spawn child builds by publishing build request messages to more specific topics.
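As a rough illustration of this fan-out, the sketch below uses an in-memory stand-in for the Pub Sub service. Everything in it is an assumption made for the example: the `BuildRequest` fields, the `PubSub` type, and the `MasterBuild` and `FirmwareBuilt` helpers are hypothetical names, and the real service is reached over clubby rather than through a local map.

```go
package main

import "fmt"

// BuildRequest is a hypothetical message shape; the design doc does not define one.
type BuildRequest struct {
	JobID string // build job this task belongs to
	Kind  string // build request type, e.g. "run v7 tests"
}

// PubSub is an in-memory stand-in for the Pub Sub service: one FIFO queue per topic.
type PubSub struct {
	queues map[string][]BuildRequest
}

func NewPubSub() *PubSub {
	return &PubSub{queues: make(map[string][]BuildRequest)}
}

// Publish appends a request to the queue of the given topic.
func (p *PubSub) Publish(topic string, req BuildRequest) {
	p.queues[topic] = append(p.queues[topic], req)
}

// Pull removes and returns the oldest request on a topic, if there is one.
func (p *PubSub) Pull(topic string) (BuildRequest, bool) {
	q := p.queues[topic]
	if len(q) == 0 {
		return BuildRequest{}, false
	}
	p.queues[topic] = q[1:]
	return q[0], true
}

// MasterBuild plays the role of a master build script: it spawns child
// builds by publishing requests to a more specific topic.
func MasterBuild(p *PubSub, jobID string) {
	p.Publish("build:docker", BuildRequest{JobID: jobID, Kind: "run v7 tests"})
	p.Publish("build:docker", BuildRequest{JobID: jobID, Kind: "build Mongoose IoT Platform ESP8266"})
}

// FirmwareBuilt shows a child build spawning a further task once its
// firmware exists, which is how simple workflows emerge at runtime.
func FirmwareBuilt(p *PubSub, jobID string) {
	p.Publish("test:esp-12e", BuildRequest{JobID: jobID, Kind: "flash and test ESP8266 firmware"})
}

func main() {
	ps := NewPubSub()
	MasterBuild(ps, "job-1")
	FirmwareBuilt(ps, "job-1")
	for _, topic := range []string{"build:docker", "test:esp-12e"} {
		for {
			req, ok := ps.Pull(topic)
			if !ok {
				break
			}
			fmt.Printf("%s <- %q (job %s)\n", topic, req.Kind, req.JobID)
		}
	}
}
```

Because tasks are published at runtime, no workflow description exists anywhere: the set of tasks is simply whatever the scripts decided to publish.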
Agents are subscribers to Pub Sub topics. The generic agent listens to a Pub Sub topic and runs a subprocess (e.g. shell script). The script in the subprocess will perform a task, and its execution status will be sent back to the OurCI master. Script output will be streamed to the Log Store service.
Each agent can perform one task at a time. It will poll for Pub Sub messages and, once it receives one, it will periodically acknowledge it until the subprocess is done or a build timeout occurs. If the Pub Sub message times out before the agent explicitly replies (either with success or failure, including build timeout failures), the Pub Sub service will make the message available for another agent to grab. We thus distinguish between a build script timing out and an agent failing to communicate with the Pub Sub service.
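The agent side of this could look roughly like the sketch below. It is not the actual agent code: `Task` stands in for the build subprocess, `extend` stands in for the periodic acknowledgement sent to Pub Sub, and the `Outcome` values, `sleepTask`, and `runDemo` are names invented for the example.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Outcome is what the agent reports back; the names are illustrative.
type Outcome string

const (
	Success      Outcome = "success"
	Failure      Outcome = "failure"
	BuildTimeout Outcome = "build timeout"
)

// Task stands in for the build subprocess. It should return once ctx is done.
type Task func(ctx context.Context) error

// RunWithKeepAlive runs task and calls extend every ackEvery until the task
// finishes or buildTimeout elapses. extend stands in for the periodic
// acknowledgement that keeps the message from being handed to another agent.
func RunWithKeepAlive(task Task, buildTimeout, ackEvery time.Duration, extend func()) Outcome {
	ctx, cancel := context.WithTimeout(context.Background(), buildTimeout)
	defer cancel()

	done := make(chan error, 1) // buffered so the goroutine never blocks
	go func() { done <- task(ctx) }()

	ticker := time.NewTicker(ackEvery)
	defer ticker.Stop()
	for {
		select {
		case err := <-done:
			switch {
			case err == nil:
				return Success
			case ctx.Err() != nil:
				return BuildTimeout // the task was cut short by the deadline
			default:
				return Failure
			}
		case <-ctx.Done():
			return BuildTimeout // an explicit reply, unlike a silent agent
		case <-ticker.C:
			extend()
		}
	}
}

// sleepTask returns a task that finishes after d unless its context ends first.
func sleepTask(d time.Duration) Task {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// runDemo runs a sleeping task and returns the outcome plus the number of keep-alive acks.
func runDemo(taskMs, timeoutMs, ackMs int) (Outcome, int) {
	ms := func(n int) time.Duration { return time.Duration(n) * time.Millisecond }
	acks := 0
	outcome := RunWithKeepAlive(sleepTask(ms(taskMs)), ms(timeoutMs), ms(ackMs), func() { acks++ })
	return outcome, acks
}

func main() {
	outcome, acks := runDemo(50, 1000, 10)
	fmt.Println(outcome, "after", acks, "keep-alive acks")
	outcome, _ = runDemo(1000, 30, 10)
	fmt.Println(outcome)
}
```

A build timeout is reported explicitly by the agent, while an agent that stops calling `extend` altogether is only noticed by the Pub Sub service when the message lease runs out.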
The number of agent processes running on a given machine defines the parallelism we can achieve on that host. An agent can poll multiple topics but will only execute one task at a time.
Agent scripts can spawn Docker containers that perform the build or execute the Mongoose Flashing Tool or esptool to flash a device.
The agent will communicate with the build script via environment variables, so the build request type will be available to the build script. A single build script can then run the correct build step based on the actual build request, e.g. "run v7 tests" or "build Mongoose IoT Platform ESP8266".
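A minimal sketch of that hand-off, assuming a POSIX `sh` is available on the agent host. The variable name `OURCI_BUILD_REQUEST` and the helpers `TaskEnv` and `RunScript` are made up for the example; the design doc does not name the actual variables.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// RequestVar is a hypothetical variable name chosen for this sketch.
const RequestVar = "OURCI_BUILD_REQUEST"

// DispatchScript is a single build script that picks its step from the request type.
const DispatchScript = `
case "$OURCI_BUILD_REQUEST" in
  "run v7 tests") echo "would run v7 tests" ;;
  "build Mongoose IoT Platform ESP8266") echo "would build ESP8266 firmware" ;;
  *) echo "unknown build request: $OURCI_BUILD_REQUEST"; exit 1 ;;
esac
`

// TaskEnv returns a copy of base with the build request type appended.
// os/exec uses the last value when a key appears more than once.
func TaskEnv(base []string, requestType string) []string {
	env := append([]string{}, base...)
	return append(env, RequestVar+"="+requestType)
}

// RunScript runs a shell script with the request type in its environment
// and returns the combined stdout and stderr of the subprocess.
func RunScript(script, requestType string) (string, error) {
	cmd := exec.Command("sh", "-c", script)
	cmd.Env = TaskEnv(os.Environ(), requestType)
	out, err := cmd.CombinedOutput()
	return string(out), err
}

func main() {
	for _, req := range []string{"run v7 tests", "build Mongoose IoT Platform ESP8266", "something else"} {
		out, err := RunScript(DispatchScript, req)
		fmt.Printf("%q -> %q (err: %v)\n", req, out, err)
	}
}
```

Keeping the dispatch inside the script means the agent never needs to know what a given build request means; it only forwards the request type.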
The Pub Sub topic is used to route the build request to an agent capable of understanding that particular build request.
Some agent scripts will store the build artefacts required by subsequent build steps. For example, a firmware (FW) will be built by a cross compiler running in a Docker container on the cloud, stored in the blobstore, and later fetched by a HW agent and flashed on a device.
Not all build artefacts are necessarily stored in the blobstore. For example, build scripts that create Docker images can publish them to the Docker registry.
Topics and Build Environments
A Pub Sub topic is just a string that names the build environment expected from the agent subscribing to that topic. The environments described below include `build:docker` and `test:esp-12e`.
In future, we can add more build environments, like a native Windows build or OS X build machines.
`build:docker` is our current build environment: a host with srcfs available at /data/srcfs, SSH credentials to pull the latest commits from GitHub, and the possibility to run Docker containers as build steps.
Our entire current `ourci` build script can be run by an agent providing the `build:docker` build environment.
The `test:esp-12e` build environment runs a script on a host with an attached esp-12e device with 4M flash and either Mongoose Flashing Tool or esptool installed.
The serial port and clubby credentials for fetching the FW and test scripts via blobstore are available to the script via environment variables.
Esp-12e devices will come with some hardware attached. For simplicity, we currently provide no way to negotiate the actual set of attached hardware, so the test script and the environment will have to match.
This build environment doesn't come with a git checkout, nor with the credentials to fetch one. It depends entirely on previous build tasks to provide all the files needed to flash the firmware and run a test script.
The downloaded test script will actually perform the flashing, boot the device, wait for WebDAV, upload a JS test script, and listen for test results from the device. Alternatively, the test script will interact with the FW via the serial port.
This setup reduces the need to update the agent every time we improve the test script. The test script will be obtained from the branch being tested, allowing us to test testing scripts in PRs as well.
The cc3200 test environment is like `test:esp-12e` but comes with a cc3200 device and its flashing tool installed.
Workflow Execution and Data Model
Until now, we focused on agents and the environment they have to provide in order to execute individual build tasks. However, a build job is composed of a bunch of build tasks, which are orchestrated in order to achieve a single goal.
Unlike some workflow tools, build tasks are not defined statically, but are instead spawned by other build tasks.
This makes it easy to run builds and delegate build logic to specific build scripts, but how can we gather the results of the build steps?
Each time a build task is spawned, it's associated with a build job. The build job will keep track of each outstanding build task and its completion status. A build job is completed only when all its tasks are completed.
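The bookkeeping described here can be sketched as a small in-memory data model. The type and method names below (`Job`, `Spawn`, `SetStatus`, `Green`) and the four status values are assumptions made for the example, not the data model the OurCI master actually uses.

```go
package main

import "fmt"

// TaskStatus is an illustrative set of states for a build task.
type TaskStatus int

const (
	Pending TaskStatus = iota
	Running
	Succeeded
	Failed
)

// Job tracks every task spawned on behalf of one build job.
type Job struct {
	ID    string
	tasks map[string]TaskStatus
}

func NewJob(id string) *Job {
	return &Job{ID: id, tasks: make(map[string]TaskStatus)}
}

// Spawn registers a new task with the job, initially pending.
func (j *Job) Spawn(taskID string) {
	j.tasks[taskID] = Pending
}

// SetStatus records a status change reported by an agent.
func (j *Job) SetStatus(taskID string, s TaskStatus) error {
	if _, ok := j.tasks[taskID]; !ok {
		return fmt.Errorf("job %s: unknown task %q", j.ID, taskID)
	}
	j.tasks[taskID] = s
	return nil
}

// Outstanding counts tasks that have not finished yet.
func (j *Job) Outstanding() int {
	n := 0
	for _, s := range j.tasks {
		if s == Pending || s == Running {
			n++
		}
	}
	return n
}

// Completed reports whether the job has tasks and all of them have finished.
func (j *Job) Completed() bool {
	return len(j.tasks) > 0 && j.Outstanding() == 0
}

// Green reports whether the job completed with every task succeeding.
func (j *Job) Green() bool {
	if !j.Completed() {
		return false
	}
	for _, s := range j.tasks {
		if s != Succeeded {
			return false
		}
	}
	return true
}

func main() {
	job := NewJob("job-1")
	job.Spawn("build-firmware")
	job.Spawn("test-esp-12e")
	_ = job.SetStatus("build-firmware", Succeeded)
	fmt.Println("completed:", job.Completed(), "outstanding:", job.Outstanding())
	_ = job.SetStatus("test-esp-12e", Succeeded)
	fmt.Println("completed:", job.Completed(), "green:", job.Green())
}
```

One consequence of spawning tasks at runtime is that a task must register its children before it reports its own completion; otherwise the job could briefly look complete.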
An agent can create tasks and report changes in task status (start of execution, completion, etc.) by calling a clubby method on the OurCI master.
All task creation goes through the OurCI master, which keeps its state in memory and restarts all builds for pull requests that are not green (as it does now). Persistence can be added later.
The new UI will be built on top of the cloud/dashboard UI.
The UI will communicate with the OurCI master via JS clubby.
During initial page load, the UI will obtain the initial list of build jobs with a clubby call to the OurCI master. The UI will subscribe to Pub Sub notifications to keep the UI up to date and use the Log Store API to stream the currently selected build logs.
The UI will render a list of build jobs. When clicking on a build job, it will show a flat list of currently scheduled build tasks. No attempt is made to show a tree or a graph of the workflow.
When clicking on a build task, it will show the log for that build task.
Other Execution Modes
The granularity and independence of our build tasks make it possible to use some build agents independently of the main OurCI workflow. We can spawn build tasks from external CI services like CircleCI, provided that the external CI build script can perform a clubby call (perhaps via REST API). To better suit external use cases, we can tailor HW agents by using a separate topic that implements a different contract in its build environment (e.g. a more external-user-friendly way to get build artefacts).
Agents are simple clubby clients that can spawn a subprocess, capture its output, and stream it back via clubby. The easiest way to implement it currently is in Go, as we can take the log tailer code out of the existing OurCI. It's possible to write it using MGIOT POSIX as well, but at the moment, it would be a waste of time to duplicate code and depend on MGIOT to be bug-free; after all, this is a continuous integration system; it should be more stable than what it's testing.
Agents don't have to be registered. All they need to do is poll a subscription. The Pub Sub design doc includes a diagram that illustrates the difference between a topic, a subscription, and a subscriber.
In order for a bunch of agents to share the same task queue, each agent will pull messages from the same subscription. The subscription has a well-known name. For this use case, it can have the same name as the topic itself.
Since agents pull for tasks, tasks will be consumed as soon as agents are ready. The more agents pull from the same queue, the faster we'll progress. Agents can come and go dynamically. If an agent disappears without acknowledging or replying to a Pub Sub message, the Pub Sub service will make the message available again and another agent will pick it up. Messages will have an ETA and will be dropped once expired. The rest of the details are deferred to the document describing the Pub Sub service.
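To make the redelivery behavior concrete, here is a toy model of a shared subscription. It is only a sketch under assumptions: the lease length, the `Extend` and `Ack` method names, and the use of plain integer seconds for time are choices made for the example; the real semantics are defined in the Pub Sub design doc.

```go
package main

import "fmt"

// message is one queued task. Times are plain seconds to keep the sketch deterministic.
type message struct {
	id          string
	expiresAt   int64 // the message is dropped at or after this time
	leasedUntil int64 // an agent holds the message until this time
	acked       bool
}

// Subscription is a toy model of a subscription shared by several agents.
type Subscription struct {
	leaseSecs int64
	msgs      []*message
}

func NewSubscription(leaseSecs int64) *Subscription {
	return &Subscription{leaseSecs: leaseSecs}
}

// Publish queues a message that expires ttlSecs after now.
func (s *Subscription) Publish(id string, now, ttlSecs int64) {
	s.msgs = append(s.msgs, &message{id: id, expiresAt: now + ttlSecs})
}

// Pull leases the first message that is neither done, expired, nor held by another agent.
func (s *Subscription) Pull(now int64) (string, bool) {
	for _, m := range s.msgs {
		if m.acked || now >= m.expiresAt || now < m.leasedUntil {
			continue
		}
		m.leasedUntil = now + s.leaseSecs
		return m.id, true
	}
	return "", false
}

// Extend renews a lease that is still held; it fails once the lease has lapsed.
func (s *Subscription) Extend(id string, now int64) bool {
	for _, m := range s.msgs {
		if m.id == id && !m.acked && now < m.leasedUntil {
			m.leasedUntil = now + s.leaseSecs
			return true
		}
	}
	return false
}

// Ack marks a message as done so it is never delivered again.
func (s *Subscription) Ack(id string) {
	for _, m := range s.msgs {
		if m.id == id {
			m.acked = true
		}
	}
}

func main() {
	sub := NewSubscription(10)
	sub.Publish("task-1", 0, 3600)

	id, _ := sub.Pull(0) // agent A takes the task, then disappears
	fmt.Println("agent A got", id)

	_, ok := sub.Pull(5) // agent B finds nothing while A's lease is live
	fmt.Println("agent B at t=5 got a task:", ok)

	id, ok = sub.Pull(12) // the lease lapsed, so the task is offered again
	fmt.Println("agent B at t=12 got", id, ok)
	sub.Ack(id)
}
```

Since any agent pulling from the same subscription sees the same queue, adding or removing agents needs no registration step.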
The GCE (Google Compute Engine) instance group autoscaler can be configured to add more instances if the load is too high. We can implement better policies by monitoring the Pub Sub queue length.
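One possible queue-length policy is sketched below, purely as an illustration. The function name `DesiredAgents`, the `tasksPerAgent` target, and the bounds are assumptions; the design doc does not specify a policy, and this code does not call any GCE API.

```go
package main

import "fmt"

// DesiredAgents returns how many agents to run for a given queue length,
// aiming for at most tasksPerAgent queued tasks per agent and clamping
// the result to the range [minAgents, maxAgents].
func DesiredAgents(queueLen, tasksPerAgent, minAgents, maxAgents int) int {
	if tasksPerAgent < 1 {
		tasksPerAgent = 1
	}
	n := (queueLen + tasksPerAgent - 1) / tasksPerAgent // ceiling division
	if n < minAgents {
		n = minAgents
	}
	if n > maxAgents {
		n = maxAgents
	}
	return n
}

func main() {
	for _, queued := range []int{0, 9, 100} {
		fmt.Printf("queue length %d -> %d agents\n", queued, DesiredAgents(queued, 4, 1, 10))
	}
}
```

With a target of 4 queued tasks per agent and bounds of 1 to 10, an empty queue keeps 1 agent, 9 queued tasks ask for 3, and 100 queued tasks hit the cap of 10.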
And that's how (and why) OurCI works the way it does. Stay tuned for our next post to see some examples of the system in action.