blog.philz.dev

CI Performance Debugging

A friend of mine asked me to look at why their GitHub Actions CI workflow was slow. The punchline was that their self-hosted GitHub Runner (on AWS EC2) had too few IOPS available to it, and, as a result, was waiting around for the EBS volume quite a bit.

The atop tool showed a highly utilized disk in a nice red color, so, fine, we figured it out. I pointed Bazel's sandbox at /dev/shm, which is backed by RAM rather than the EBS volume (--sandbox_base=/dev/shm), and suddenly we were 3x faster.
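For a rough cross-check without atop, a similar "busy" figure can be derived from /proc/diskstats on Linux. The sketch below is not how atop works internally; it is a minimal approximation that diffs the kernel's "time spent doing I/O" counter between two snapshots. The function names are made up for illustration, and the device name you pass in is whatever your EBS volume shows up as.

```python
def io_ticks_ms(diskstats_text: str, device: str) -> int:
    """Return the 'milliseconds spent doing I/O' counter for one device.

    Each /proc/diskstats line is: major, minor, device name, then the
    per-device counters. The tenth counter (index 12 after splitting)
    is the time the device had I/O in flight.
    """
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise KeyError(f"device {device!r} not found in diskstats")


def busy_percent(before: str, after: str, device: str, elapsed_s: float) -> float:
    """Percentage of the interval during which the device was busy."""
    delta_ms = io_ticks_ms(after, device) - io_ticks_ms(before, device)
    return 100.0 * delta_ms / (elapsed_s * 1000.0)
```

To use it, read /proc/diskstats twice, a second or so apart, and pass both snapshots plus the device name (for example, whatever lsblk lists for the volume). A value near 100 means the device had I/O in flight for almost the whole interval.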

Snippet of atop

tmate

Logging into a far-away machine can be tricky. Turns out the following should work:

- name: Install tmate
  run: |
    sudo apt-get update
    sudo apt-get install -y tmate

- name: Start tmate session
  run: |
    set -x
    # Start a detached session on a per-run socket, wait until tmate
    # has connected, then print the SSH command to the job log.
    tmate -S /tmp/tmate.sock.${GITHUB_RUN_ID} new-session -d
    tmate -S /tmp/tmate.sock.${GITHUB_RUN_ID} wait tmate-ready
    tmate -S /tmp/tmate.sock.${GITHUB_RUN_ID} display -p '#{tmate_ssh}'

Simon Willison has also described this on his blog, and there's a GitHub action called action-tmate that does the same.

In my case, however, this didn't work! It turned out that the GitHub runner's user had /usr/sbin/nologin as its login shell, and tmate was refusing to start. The error confusingly said:

You must specify a socket name with -S. For example:
  tmate -S /tmp/tmate.sock new-session -d
  tmate -S /tmp/tmate.sock wait tmate-ready

Running tmate -F directly was no more helpful. It just spewed out:

Session shell restarted

The underlying problem was that tmate couldn't create a login session because the shell was set to nologin.

To work around this, change the runner user's shell, run tmate as root, or skip tmate and use Tailscale or one of the AWS Connect options...
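Since the error messages don't point at the shell, it can save time to check it up front. Here is a small sketch of such a pre-flight check. The set of "no login" shells is my own assumption about common values, not something tmate defines, and the function names are hypothetical.

```python
import os
import pwd

# Shells that typically refuse interactive sessions (an assumed list).
NOLOGIN_SHELLS = {"/usr/sbin/nologin", "/sbin/nologin", "/bin/false"}


def login_shell(uid=None) -> str:
    """Return the login shell recorded in the passwd database for a user."""
    if uid is None:
        uid = os.getuid()
    return pwd.getpwuid(uid).pw_shell


def is_nologin(shell: str) -> bool:
    """True if the shell is one that would block an interactive session."""
    return shell in NOLOGIN_SHELLS
```

Running is_nologin(login_shell()) as the runner's user before starting tmate tells you whether you are about to hit this problem.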

Bazel Build Profiles

When I went rooting around in the Bazel setup, I found that the somewhat hidden bazel-out/../../../command.profile.gz file is, in fact, in the chrome://tracing/ format. So, if you're using bazel test, you have a nice graph of your CPU usage (as well as a timeline visualization of your tests) already!
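If you'd rather get numbers than a picture, the same file can be read with a few lines of Python. This is a sketch that assumes the profile follows the Trace Event format used by chrome://tracing/: either a JSON object with a "traceEvents" list or a bare list of events, where complete events have "ph" set to "X" and a "dur" in microseconds. The function names are hypothetical, and I haven't checked which event names any particular Bazel version emits.

```python
import gzip
import json
from collections import defaultdict


def load_profile(path):
    """Load a gzipped JSON trace such as command.profile.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def total_duration_by_name(profile, top=10):
    """Sum the duration of complete ("X") events per event name.

    Returns up to `top` (name, total_microseconds) pairs, largest first.
    Counter and metadata events are ignored.
    """
    events = profile["traceEvents"] if isinstance(profile, dict) else profile
    totals = defaultdict(int)
    for event in events:
        if event.get("ph") == "X":
            totals[event.get("name", "?")] += event.get("dur", 0)
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

Calling total_duration_by_name(load_profile(path)) on the profiles from a slow run and a fast run gives a quick way to see where the time went without opening the trace viewer.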

In the first (slower) run, the CPU graph goes up and down. Meanwhile, in the second (faster) run, the CPU graph is pegged once the (parallelized) tests start, and it remains pegged.

Slow CPU Profile

Fast CPU Profile