Round 4 · 2026-06-04 · Top-tier arena

Three top agents, same tasks:
matching output, engineering decides.

OpenClacky and Claude Code both run claude-4.6-sonnet; Codex runs gpt-5.5. Each runs on its strongest native stack, across the same 3 real-world tasks. This round isn't about who can finish; it's about clarity, AI-flavour and the bill when output is equally top-tier. Fully recorded, with per-request OpenRouter data published.

Contenders
OpenClacky (claude-4.6-sonnet) · Claude Code (claude-4.6-sonnet) · Codex (gpt-5.5)

Three takeaways

Verdict first

Output: OpenClacky and Claude Code share the first tier. Both delivered all three tasks at high quality; OpenClacky edges ahead on clarity, typography and motion, with the least AI-flavour. Codex clearly trails — broken formatting in the supplier report, and a portfolio homepage that is essentially unusable.

Cost: for equal output, OpenClacky ($4.50) and Claude Code ($4.50) cost about the same. Codex spent 2.8× as much ($12.54) and still finished last: it fell into a screenshot→inspect→self-check loop on the portfolio task, burning $8.61 on that task alone.

Model freedom: swap in a third-party model and OpenClacky still delivers. With all three forced onto deepseek-v4-pro, OpenClacky completed everything for $1.30; Claude Code / Codex also ran, but cost and cache behaviour fluctuated with the third-party provider (see reference group).


Three tasks, deliverable by deliverable

Output review · first-hand tester feedback

Each task: side-by-side previews of all three deliverables + tester review + full recording. Previews are the original delivered files, untouched.

01 · Supplier screening report

Given 10 supplier meeting-notes docx files plus screening criteria, produce a presentation-ready supplier shortlist report.

All three completed the task. OpenClacky's report is the most readable, which best serves a document meant for briefing people. Claude Code covered more evaluation dimensions with richer criteria. Codex's formatting broke, and its wording was noticeably colloquial and AI-flavoured.

OpenClacky · Completion: Good · Clarity & layout: Good · AI-flavour (less is better): Good
Claude Code · Completion: Good · Clarity & layout: Fair · AI-flavour (less is better): Fair
Codex · Completion: Fair · Clarity & layout: Poor · AI-flavour (less is better): Poor
[Previews: OpenClacky DOCX (6 pages) · Claude Code DOCX (5 pages) · Codex DOCX (5 pages)]

Previews are page renders of the docx deliverable; original file downloadable.

Full recording (sped up): OpenClacky · Claude Code · Codex
02 · AI industry daily brief

Research the last 3 days of AI industry news online and produce an HTML briefing.

All three completed the task. OpenClacky's presentation is the most scannable — what to read is obvious at a glance, with the least AI-flavour. Claude Code's page has the strongest AI-flavour; Codex sits in between.

OpenClacky · Completion: Good · Clarity & layout: Good · AI-flavour (less is better): Good
Claude Code · Completion: Good · Clarity & layout: Fair · AI-flavour (less is better): Poor
Codex · Completion: Good · Clarity & layout: Fair · AI-flavour (less is better): Fair

Previews are the original delivered HTML.

Full recording (sped up): OpenClacky · Claude Code · Codex
03 · Personal portfolio website

Given résumé materials and a requirements doc, build a complete multi-page portfolio site (HTML/CSS/JS).

OpenClacky and Claude Code both delivered high-quality sites in similar styles; OpenClacky is slightly better on typography, copy presentation and motion. Codex's result is unusable: the homepage layout and presentation fail outright. It also kept looping through screenshot→screen-recognition→self-check. A static UI design barely needs self-checking, yet Codex checked endlessly, spent the most, and still finished last.

OpenClacky · Completion: Good · Clarity & layout: Good · AI-flavour (less is better): Good
Claude Code · Completion: Good · Clarity & layout: Good · AI-flavour (less is better): Good
Codex · Completion: Poor · Clarity & layout: Poor · AI-flavour (less is better): Fair

Previews are each agent's delivered homepage; click through to browse the full site.

The process data corroborates the review: on the portfolio task Codex issued 126 requests (OpenClacky 42, Claude Code 19), consumed 10.5M prompt tokens and spent $8.61, roughly 3× what either of the others spent on the same task. The billing curve matches the self-check loop visible in the recording.
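The ratios behind that comparison can be reproduced from the published portfolio-task figures. The snippet below is a minimal sketch: the numbers are copied from the per-task breakdown in this report, while the dictionary layout and the helper name `ratio_to_codex` are illustrative choices, not part of the benchmark tooling.

```python
# Portfolio-task figures copied from the per-task breakdown in this report.
portfolio = {
    "OpenClacky": {"requests": 42, "cost": 2.94, "prompt_tokens": 3_051_284},
    "Claude Code": {"requests": 19, "cost": 2.58, "prompt_tokens": 1_619_726},
    "Codex": {"requests": 126, "cost": 8.61, "prompt_tokens": 10_528_212},
}


def ratio_to_codex(metric: str) -> dict[str, float]:
    """Return how many times larger Codex's value is than each other agent's."""
    codex_value = portfolio["Codex"][metric]
    return {
        agent: codex_value / row[metric]
        for agent, row in portfolio.items()
        if agent != "Codex"
    }


cost_ratios = ratio_to_codex("cost")
request_ratios = ratio_to_codex("requests")

for agent in cost_ratios:
    print(
        f"Codex vs {agent}: "
        f"{cost_ratios[agent]:.1f}x cost, {request_ratios[agent]:.1f}x requests"
    )
# Codex vs OpenClacky: 2.9x cost, 3.0x requests
# Codex vs Claude Code: 3.3x cost, 6.6x requests
```

On cost, Codex lands at about 3× either competitor; on request count the gap to Claude Code is wider, which fits the loop seen in the recording.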

Full recording (sped up): OpenClacky · Claude Code · Codex

Equal output, the bill tells the difference

Process & cost

Data from per-request OpenRouter billing (separate API keys per agent), summed line by line — no estimates.
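For readers who want to repeat the audit on the raw exports, a line-by-line sum might look like the sketch below. It rests on assumptions: the column names `cost_usd`, `prompt_tokens` and `cached_tokens` are hypothetical placeholders (the real export may use different names), the sample rows are made up for illustration and are not benchmark data, and cache hit rate is taken here to mean cached prompt tokens divided by total prompt tokens.

```python
import csv
import io
from decimal import Decimal

# Made-up sample rows, NOT data from this benchmark. The column names are
# hypothetical; rename them to match the actual export before use.
SAMPLE_EXPORT = """\
cost_usd,prompt_tokens,cached_tokens
0.0123,10000,8000
0.0040,12000,11000
0.0310,20000,0
"""


def audit(export_file):
    """Sum a billing export row by row: requests, total cost, cache hit rate."""
    requests = 0
    cost = Decimal("0")
    prompt_tokens = 0
    cached_tokens = 0
    for row in csv.DictReader(export_file):
        requests += 1
        cost += Decimal(row["cost_usd"])  # Decimal avoids float drift in sums
        prompt_tokens += int(row["prompt_tokens"])
        cached_tokens += int(row["cached_tokens"])
    # Assumed definition: share of prompt tokens that were served from cache.
    cache_hit = cached_tokens / prompt_tokens if prompt_tokens else 0.0
    return {"requests": requests, "cost": cost, "cache_hit": cache_hit}


summary = audit(io.StringIO(SAMPLE_EXPORT))
print(summary["requests"], summary["cost"], f"{summary['cache_hit']:.1%}")
```

To run it against a downloaded file, pass an open file handle instead of the `io.StringIO` sample, one export per agent.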

Total across 3 tasks
OpenClacky · $4.50 · 88 requests · 90.6% cache hit rate
Claude Code · $4.50 · 46 requests · 79.2% cache hit rate
Codex · $12.54 · 173 requests · 94.3% cache hit rate
Per-task breakdown
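Task · Agent · Requests · Cost · Prompt tokens · Cache hit
Supplier screening report · OpenClacky · 22 · $0.63 · 559,727 · 88.5%
Supplier screening report · Claude Code · 13 · $0.86 · 582,039 · 74.7%
Supplier screening report · Codex · 33 · $2.38 · 1,838,748 · 88.1%
AI industry daily brief · OpenClacky · 24 · $0.93 · 756,678 · 83.7%
AI industry daily brief · Claude Code · 14 · $1.05 · 763,008 · 76.8%
AI industry daily brief · Codex · 14 · $1.55 · 761,212 · 82.2%
Personal portfolio website · OpenClacky · 42 · $2.94 · 3,051,284 · 92.7%
Personal portfolio website · Claude Code · 19 · $2.58 · 1,619,726 · 81.9%
Personal portfolio website · Codex · 126 · $8.61 · 10,528,212 · 96.2%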

OpenClacky and Claude Code both totalled $4.50 — an open-source harness matching Anthropic's first-party tool on the same model. Codex totalled $12.54, $8.61 of which went to the portfolio self-check loop.
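The totals can be cross-checked against the per-task breakdown. The sketch below sums the published rows per agent. One assumption is ours, not the report's: the overall cache hit rate is treated as a prompt-token-weighted average of the per-task rates, which happens to reproduce the published 90.6% / 79.2% / 94.3%. The names `ROWS` and `totals` are illustrative.

```python
from decimal import Decimal

# (agent, requests, cost in USD, prompt tokens, cache hit %) for each task,
# copied from the per-task breakdown table in this report.
ROWS = [
    ("OpenClacky", 22, "0.63", 559_727, 88.5),
    ("Claude Code", 13, "0.86", 582_039, 74.7),
    ("Codex", 33, "2.38", 1_838_748, 88.1),
    ("OpenClacky", 24, "0.93", 756_678, 83.7),
    ("Claude Code", 14, "1.05", 763_008, 76.8),
    ("Codex", 14, "1.55", 761_212, 82.2),
    ("OpenClacky", 42, "2.94", 3_051_284, 92.7),
    ("Claude Code", 19, "2.58", 1_619_726, 81.9),
    ("Codex", 126, "8.61", 10_528_212, 96.2),
]


def totals(agent):
    """Return (requests, cost, weighted cache hit %) summed over the 3 tasks."""
    rows = [r for r in ROWS if r[0] == agent]
    requests = sum(r[1] for r in rows)
    cost = sum((Decimal(r[2]) for r in rows), Decimal("0"))
    tokens = sum(r[3] for r in rows)
    # Assumption: overall hit rate = token-weighted mean of the task rates.
    weighted_hit = sum(r[3] * r[4] for r in rows) / tokens
    return requests, cost, round(weighted_hit, 1)


for name in ("OpenClacky", "Claude Code", "Codex"):
    print(name, *totals(name))
# OpenClacky 88 4.50 90.6
# Claude Code 46 4.49 79.2
# Codex 173 12.54 94.3
```

Summing the rounded per-task figures gives $4.49 for Claude Code, one cent under the published $4.50; that gap is presumably per-task rounding, since the published totals come from the unrounded per-request data.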

Claude Code and OpenClacky ran claude-4.6-sonnet via Amazon Bedrock; Codex ran gpt-5.5 via OpenAI — all metered through OpenRouter. Differing request counts per task reflect normal strategy differences; read alongside the recordings.

What if all three ran DeepSeek?

Reference group · for reference only

All three agents re-ran the same 3 tasks on deepseek-v4-pro (OpenRouter / StreamLake provider). OpenClacky delivered everything as usual, $1.30 total — the harness is model-agnostic.

Why reference-only: cache differences in this group mostly stem from the OpenRouter DeepSeek provider's cache behaviour (e.g. Claude Code hit 0% on two tasks here, while DeepSeek's official API does serve its cache), so they can't be attributed to the tools themselves. Interpret the cost data with care.

Total across 3 tasks: OpenClacky $1.30 · Claude Code $3.21 · Codex $2.61

Per-task breakdown (each cell: cost, cache hit rate)
Task · OpenClacky · Claude Code · Codex
Supplier screening report · $0.15, 50.7% · $0.78, 11.9% · $0.50, 57.4%
AI industry daily brief · $0.28, 65.4% · $0.42, 0.0% · $0.29, 49.3%
Personal portfolio website · $0.87, 62.7% · $2.01, 0.0% · $1.82, 47.9%
Total across 3 tasks · $1.30, 62.5% · $3.21, 2.9% · $2.61, 49.8%

What this group really shows: OpenClacky's engineering is model-agnostic — swap in a third-party model an order of magnitude cheaper, and it still ships every task. First-party tools tend to reserve their best experience for their own models.

How we tested

Methodology
Same time window · 2026-06-04
Same 3 tasks, same input materials
Default configs, no special tuning
Separate OpenRouter API key per agent
Per-request billing audit, no estimates
Single run, no cherry-picking

This round differs from round 1: round 1 compared four agents' cost engineering on the same model; this round takes on Claude Code and Codex — two top first-party tools on their strongest models — proving output parity first, then comparing engineering efficiency.

Raw data downloads

Original per-request OpenRouter billing exports — 3 files for the main group, 3 for the DeepSeek reference group. Deliverables open directly in each task section above.

Run open-source OpenClacky, pick your own model

First-party-grade output and engineering efficiency, bound to no model.
