FlareBench
where agents fall short.
A living map of where AI coding agents fall short on Cloudflare. We hand a model a real task; it builds in an agent loop; the harness deploys the result to a real Cloudflare account and grades what actually happened, by hitting the live URL and driving it in a real browser. The ranking is a byproduct. The product is the diagnosis.
// sound familiar?
You hand a coding task to an AI agent and you cannot tell whether to trust the result or step in and steer the agent. Sandbox unit tests pass while the deployed thing quietly does the wrong thing.
// AI research
What it does
Real deploys, real grading
Ground truth is a live URL and a behavioural check, not a unit test in a sandbox. The model never holds a deploy token.
Beyond pass/fail
Outcomes are graded handled-well / workable-but-watch-it / the-kind-of-mistake-that-bites, because real agent work is rarely a clean pass.
A growing model board
Claude, GPT, Gemini, DeepSeek, Qwen, Kimi, GLM, MiniMax and more, including Cloudflare’s own Workers AI models, benchmarked on Cloudflare’s own platform.
Stays current by design
Platform quirks and current API shapes are the test, so the map stays meaningful as model training data ages.
- You learn when to trust the agent and when to steer
- Grading reflects how real agent work turns out, which is rarely a clean pass or fail
- The map stays meaningful as model training data ages
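The three grading tiers described above could be modelled roughly as follows. This is a minimal sketch, not FlareBench's actual grader: the `Verdict` values mirror the tier names in the text, but the `CheckResult` fields, the `critical` flag, and the rule that maps check results to a tier are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    """The three tiers named in the text."""
    HANDLED_WELL = "handled-well"
    WORKABLE_BUT_WATCH_IT = "workable-but-watch-it"
    MISTAKE_THAT_BITES = "the-kind-of-mistake-that-bites"


@dataclass(frozen=True)
class CheckResult:
    """One behavioural check against the deployed app (hypothetical shape)."""
    name: str
    passed: bool
    critical: bool  # assumption: a failed critical check counts as a mistake that bites


def grade(results: list[CheckResult]) -> Verdict:
    """Map check results to a tier using an assumed, illustrative rule."""
    failed = [r for r in results if not r.passed]
    if any(r.critical for r in failed):
        return Verdict.MISTAKE_THAT_BITES
    if failed:
        return Verdict.WORKABLE_BUT_WATCH_IT
    return Verdict.HANDLED_WELL
```

Under this assumed rule, a deploy that serves its page but misses a non-critical check lands in the middle tier rather than being flattened into a failure.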
// how it works
- 01 A model is handed a real, contamination-free Cloudflare task
- 02 It builds in an agent loop
- 03 The harness deploys the result to a real Cloudflare account
- 04 It hits the live URL and drives it in a real browser
- 05 The outcome is graded beyond pass or fail
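The five steps above can be sketched as a small pipeline. This is an illustrative outline under stated assumptions, not the real harness: `run_task`, `Observation`, `fetch_live`, and the `build`, `deploy`, and `checks` parameters are hypothetical names, the browser-driving part of step 04 is left out, and only a plain HTTP fetch from the Python standard library is shown.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Mapping
from urllib.request import urlopen


@dataclass(frozen=True)
class Observation:
    """What the harness saw at the live URL (hypothetical shape)."""
    status: int
    body: str


def fetch_live(url: str, timeout: float = 10.0) -> Observation:
    """Step 04 (HTTP part only): fetch the deployed URL.

    Note: urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urlopen(url, timeout=timeout) as resp:
        return Observation(resp.status, resp.read().decode("utf-8", "replace"))


def run_task(
    task: str,
    build: Callable[[str], str],    # steps 01-02: agent loop turns a task into source
    deploy: Callable[[str], str],   # step 03: harness deploys and returns the live URL
    checks: Mapping[str, Callable[[Observation], bool]],  # behavioural checks
    fetch: Callable[[str], Observation] = fetch_live,     # step 04
) -> dict[str, bool]:
    """Run one task end to end and return per-check outcomes for grading (step 05)."""
    source = build(task)
    # The deploy step belongs to the harness, so the model never holds a deploy token.
    url = deploy(source)
    observed = fetch(url)
    return {name: check(observed) for name, check in checks.items()}
```

Passing `build`, `deploy`, and `fetch` in as callables keeps the credentialed deploy step outside the model's reach and lets the sketch be exercised with stubs. The per-check outcomes are returned as-is so a grader can assign a tier instead of collapsing them to pass or fail.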
// is this you?
Built for anyone shipping real work with AI coding agents on Cloudflare who needs to know where they fall short.
Not for you if you want a leaderboard to crown a single best model. The ranking is a byproduct; the product is the diagnosis.
// the build log
116 commits in its first four days
Started 31 May 2026, on the web within the week: hand-written contamination-free tasks, a deploy-and-grade harness, and a public diagnostic map. Built because we needed the answer ourselves: when do you trust the agent, and when do you steer?
Want FlareBench for your business?
Tell us what you need. We’ll give you a straight answer on whether it fits.