FlareBench
where agents fall short.

A living map of where AI coding agents fall short on Cloudflare. We hand a model a real task; it builds in an agent loop; the harness deploys the result to a real Cloudflare account and grades what actually happened, by hitting the live URL and driving it in a real browser. The ranking is a byproduct. The product is the diagnosis.

Talk to us about FlareBench flarebench.au ↗

116 commits in its first 4 days
Graded on real deploys, not sandbox tests
The model never holds a deploy token

The FlareBench diagnostic map: how AI coding agents score on real Cloudflare tasks

// sound familiar?

You hand a coding task to an AI agent and you cannot tell whether to trust the result or steer it. Sandbox unit tests pass while the deployed thing quietly does the wrong thing.

// AI research

What it does

Real deploys, real grading

Ground truth is a live URL and a behavioural check, not a unit test in a sandbox. The model never holds a deploy token.
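To make the idea concrete, here is a minimal sketch of the HTTP half of such a check. It is not FlareBench's harness: `LiveCheck`, `fetch_live` and `behaves_as_expected` are hypothetical names, and the real-browser step is left out entirely.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class LiveCheck:
    """What one request against a deployed URL returned (hypothetical shape)."""
    url: str
    status: int
    body: str


def fetch_live(url: str, timeout: float = 10.0) -> LiveCheck:
    """Request the deployed URL and record what actually came back."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return LiveCheck(url, resp.status, body)
    except urllib.error.HTTPError as err:
        # A 4xx/5xx is still an observation about the deployment, so record it.
        body = err.read().decode("utf-8", "replace")
        return LiveCheck(url, err.code, body)


def behaves_as_expected(check: LiveCheck, must_contain: str) -> bool:
    """Assumed behavioural assertion: a 200 response carrying the expected text."""
    return check.status == 200 and must_contain in check.body
```

The point of the split is that the verdict depends only on what the deployed URL returned, never on tests that ran inside the agent's sandbox.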

Beyond pass/fail

Outcomes are graded handled-well / workable-but-watch-it / the-kind-of-mistake-that-bites, because real agent work is rarely a clean pass.
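One way to picture the three tiers is as an enum plus a small rubric. The tier labels come from the text above; the rubric itself (a failed core check bites, warnings downgrade a pass) is an assumption made for illustration, not the published grading logic.

```python
from enum import Enum


class Grade(Enum):
    HANDLED_WELL = "handled-well"
    WORKABLE_BUT_WATCH_IT = "workable-but-watch-it"
    MISTAKE_THAT_BITES = "the-kind-of-mistake-that-bites"


def grade(core_checks_passed: bool, warnings: list[str]) -> Grade:
    """Hypothetical rubric mapping check results onto the three tiers."""
    if not core_checks_passed:
        # The deployed thing does the wrong thing: the worst tier.
        return Grade.MISTAKE_THAT_BITES
    if warnings:
        # It works, but something needs a human eye.
        return Grade.WORKABLE_BUT_WATCH_IT
    return Grade.HANDLED_WELL
```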

A growing model board

Claude, GPT, Gemini, DeepSeek, Qwen, Kimi, GLM, MiniMax and more, including Cloudflare’s own Workers AI models, benchmarked on Cloudflare’s own platform.

Stays current by design

Platform quirks and current API shapes are the test, so the map stays meaningful as model training data ages.

You learn when to trust the agent and when to steer
Grading reflects real agent work, not a clean pass or fail
The map stays meaningful as model training data ages

// how it works

01 A model is handed a real, contamination-free Cloudflare task
02 It builds in an agent loop
03 The harness deploys the result to a real Cloudflare account
04 It hits the live URL and drives it in a real browser
05 The outcome is graded beyond pass or fail
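The steps above can be sketched as a single function with the agent, the deployer and the checker passed in as callables. Every name here (`run_task`, `RunResult`, the callable signatures) is hypothetical; the sketch only illustrates the separation the page describes, where the deploy token is passed to the harness-side deploy step and never to the model's build step. Grading is reduced to a boolean to keep it short.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    task_id: str
    live_url: str
    passed: bool


def run_task(
    task_id: str,
    build: Callable[[str], dict[str, str]],        # agent loop: task -> source files
    deploy: Callable[[dict[str, str], str], str],  # harness: files + token -> live URL
    check: Callable[[str], bool],                  # harness: live URL -> behaviour ok?
    deploy_token: str,
) -> RunResult:
    # Steps 01-02: the model sees only the task, never the token.
    files = build(task_id)
    # Step 03: only the harness-side deploy step receives the token.
    live_url = deploy(files, deploy_token)
    # Step 04: the verdict comes from the deployed URL, not from the sandbox.
    passed = check(live_url)
    return RunResult(task_id, live_url, passed)
```

Injecting the three steps as callables keeps the sketch runnable with stubs and makes the token boundary visible in the signatures.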

// is this you?

Built for anyone shipping real work with AI coding agents on Cloudflare who needs to know where they fall short.

Not for you if you want a leaderboard to crown a single best model. The ranking is a byproduct; the product is the diagnosis.

// the build log

● built in the open

116 commits in its first four days

Started 31 May 2026, on the web within the week: hand-written contamination-free tasks, a deploy-and-grade harness, and a public diagnostic map. Built because we needed the answer ourselves: when do you trust the agent, and when do you steer?

// built on

Ground truth: a live URL plus a real-browser behavioural check

Isolation: each build deployed to real Cloudflare isolation

Grading: handled-well / workable-but-watch-it / mistake-that-bites

Model board: Claude, GPT, Gemini, DeepSeek, Qwen, Kimi, GLM, MiniMax, Workers AI

Safety: the model never holds a deploy token

// part of the same Grid

● Open source