Introducing GHA-bench
GHA-bench is a benchmark and a set of evals for how well different coding agents author and test GitHub Actions.
How it works
Agents (currently a variety of Anthropic models set to various effort levels, driven by Claude Code) are given set of tasks they must automate using GitHub Actions, either using a particular scripting language or whichever they want.* They must use Test-Driven Development (TDD)– basically “write tests first, and don’t come back until they all pass”.**
A panel of judges (Google Gemini and Claude Haiku) then evaluates the comprehensiveness of the tests and the quality of the code.
Which model, effort level and scripting language should you use?
Adjust the sliders according to your priorities.
* When allowed to choose, the agents always choose Python.
** Agents run their tests locally in a container that leverages nektos act to emulate a GitHub-hosted runner.