Introducing GHA-bench

GHA-bench is a benchmark and a set of evals that measure how well different coding agents author and test GitHub Actions.

How it works

Agents (currently a variety of Anthropic models set to various effort levels, driven by Claude Code) are given a set of tasks they must automate using GitHub Actions, either in a specified scripting language or in one of their own choosing.* They must use Test-Driven Development (TDD), which here means “write the tests first, and don’t come back until they all pass”.**

A panel of judges (Google Gemini and Claude Haiku) then evaluates the comprehensiveness of the tests and the quality of the code.

Which model, effort level and scripting language should you use?

Adjust the sliders according to your priorities.

Duration 17.5%

Cost 17.5%

Tests Quality 40.0%

Code Maintainability 25.0%
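
As a rough illustration of how the weights above could combine into a single ranking score, here is a minimal sketch. It assumes each metric has already been normalized to the range 0 to 1 with higher meaning better (so Duration and Cost would be inverted first). That normalization, the function name `weighted_score`, and the dictionary keys are assumptions for illustration, not the benchmark's actual scoring code.

```python
# Default slider weights as shown on the page (17.5% / 17.5% / 40% / 25%).
DEFAULT_WEIGHTS = {
    "duration": 0.175,
    "cost": 0.175,
    "tests": 0.40,
    "code": 0.25,
}


def weighted_score(metrics, weights=DEFAULT_WEIGHTS):
    """Combine normalized metrics into one score.

    Assumption: every value in `metrics` is in [0, 1] and higher is better.
    Weights are re-normalized so the sliders need not sum to exactly 100%.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(weights[name] * metrics[name] for name in weights) / total


# Hypothetical run: fast and cheap, but with weaker tests.
example = {"duration": 0.9, "cost": 0.8, "tests": 0.5, "code": 0.7}
print(round(weighted_score(example), 4))
```

Shifting weight toward Tests Quality in this sketch would push a run like the hypothetical one above down the ranking, which is the trade-off the sliders are meant to expose.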

Model	Language	Duration	Cost	Tests	Code

* When allowed to choose, the agents always choose Python.

** Agents run their tests locally in a container that uses nektos/act to emulate a GitHub-hosted runner.
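
For readers who want to try the same kind of local run, the sketch below shows one way a test harness might invoke act from Python. It assumes the `act` CLI is installed and on the PATH and uses its `-W` (workflow file) and `-j` (job) options. The helper names and the example workflow path are hypothetical, and this is not the harness GHA-bench itself uses.

```python
import shutil
import subprocess


def build_act_command(event="push", workflow=None, job=None):
    """Build an argument list for the act CLI.

    `workflow` is a path to a workflow file (passed with -W) and `job`
    is a job id within it (passed with -j); both are optional.
    """
    cmd = ["act", event]
    if workflow:
        cmd += ["-W", workflow]
    if job:
        cmd += ["-j", job]
    return cmd


def run_workflow_locally(event="push", workflow=None, job=None):
    """Run the workflow with act and return its exit code."""
    if shutil.which("act") is None:
        raise RuntimeError("act is not installed or not on PATH")
    completed = subprocess.run(build_act_command(event, workflow, job), check=False)
    return completed.returncode


# Hypothetical workflow path, for illustration only.
print(build_act_command("push", ".github/workflows/ci.yml", "test"))
```

In a TDD loop, a non-zero return code from `run_workflow_locally` would be treated as a failing test, and the agent would keep iterating until it returns zero.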