Quoting Anthropic: Opus 4.8 Safety “somewhat less robust”

2026-05-28T19:35:00+00:00

Agentic safety. Although it shows improvements in some areas (such as refusing malicious requests), we found Opus 4.8 to be somewhat less robust than Opus 4.7 in several agentic contexts (such as vulnerability to prompt injection attacks). However, the application of our safeguards closes the gap between the models in practice. […]

- Anthropic, System Card: Claude Opus 4.8

Introducing GHA-bench

2026-05-13T12:51:00+00:00

GHA-bench is a benchmark and a set of evals for how well different coding agents author and test GitHub Actions.

How it works

Agents (currently a variety of Anthropic models set to various effort levels, driven by Claude Code) are given a set of tasks they must automate using GitHub Actions, either in a particular scripting language or in whichever language they choose.* They must use Test-Driven Development (TDD): basically “write tests first, and don’t come back until they all pass”.**

A panel of judges (Google Gemini and Claude Haiku) then evaluates the comprehensiveness of the tests and the quality of the code.

Which model, effort level and scripting language should you use?

Adjust the sliders according to your priorities.

Duration 17.5%

Cost 17.5%

Tests Quality 40.0%

Code Maintainability 25.0%

[Interactive results table. Columns: Model, Language, Duration, Cost, Tests, Code]
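One way to read the sliders is as weights in a simple weighted average. The sketch below is only an illustration of that idea: the post does not say how the page actually combines the four metrics, so the linear formula, the assumption that every metric is already normalized to 0..1 with higher meaning better, and the names `Run`, `weighted_score` and `rank` are all hypothetical.

```python
from dataclasses import dataclass

# Slider weights as listed above, expressed as fractions.
WEIGHTS = {"duration": 0.175, "cost": 0.175, "tests": 0.40, "code": 0.25}


@dataclass
class Run:
    """One (model, language) result. Hypothetical structure, not the site's schema."""
    model: str
    language: str
    # Assumption: each metric is pre-normalized to 0..1, higher is better
    # (so a fast, cheap run has duration and cost scores near 1).
    scores: dict


def weighted_score(scores, weights=WEIGHTS):
    """Weighted average of the metric scores; dividing by the weight total
    keeps the result in 0..1 even if the sliders do not sum to 100%."""
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total


def rank(runs, weights=WEIGHTS):
    """Return runs sorted from best to worst under the given weights."""
    return sorted(runs, key=lambda r: weighted_score(r.scores, weights), reverse=True)
```

Moving a slider corresponds to passing a different `weights` dictionary to `rank`; because the score is divided by the weight total, the weights do not need to be renormalized by hand.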

* When allowed to choose, the agents always choose Python.

** Agents run their tests locally in a container that leverages nektos act to emulate a GitHub-hosted runner.
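For readers who have not used nektos act, a local run of a workflow might look roughly like the sketch below. This is not the benchmark's harness, which the post does not show: the helper names, the default `push` event, and the workflow path in the usage note are assumptions. The `--workflows`, `--job` and `--platform` flags are standard act options.

```python
import subprocess


def act_command(workflow, event="push", job=None, runner_image=None):
    """Build an `act` command line for one workflow file (hypothetical helper)."""
    cmd = ["act", event, "--workflows", workflow]
    if job:
        # Restrict the run to a single job ID from the workflow.
        cmd += ["--job", job]
    if runner_image:
        # Map the `ubuntu-latest` runner label to a specific container image.
        cmd += ["--platform", f"ubuntu-latest={runner_image}"]
    return cmd


def workflow_passes(workflow, **kwargs):
    """Run the workflow locally; True if act exits with status 0.
    Requires act and a container runtime to be installed."""
    result = subprocess.run(act_command(workflow, **kwargs), check=False)
    return result.returncode == 0
```

In a TDD loop, an agent would call something like `workflow_passes(".github/workflows/ci.yml")` after each change and keep iterating until it returns `True`.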

Adam Daniel
