Round 4 · 2026-06-04 · Top-tier arena

Three top agents, same tasks:
matching output, engineering decides.

OpenClacky and Claude Code both run claude-4.6-sonnet; Codex runs gpt-5.5 — each on its strongest native stack, across the same 3 real-world tasks. This round isn't about who can finish; it's about clarity, AI-flavour and the bill at equally top-tier output. Fully recorded, per-request OpenRouter data published.

Contenders

OpenClacky claude-4.6-sonnet Claude Code claude-4.6-sonnet Codex gpt-5.5

Three takeaways

Verdict first

Output: OpenClacky and Claude Code share the first tier. Both delivered all three tasks at high quality; OpenClacky edges ahead on clarity, typography and motion, with the least AI-flavour. Codex clearly trails — broken formatting in the supplier report, and a portfolio homepage that is essentially unusable.

Cost: equal output, OpenClacky $4.50 ≈ Claude Code $4.50. Codex spent 2.8× ($12.54) and still finished last — it fell into a screenshot→inspect→self-check loop on the portfolio task, burning $8.61 on that task alone.

Model freedom: swap in a third-party model and OpenClacky still delivers. With all three forced onto deepseek-v4-pro, OpenClacky completed everything for $1.30; Claude Code / Codex also ran, but cost and cache behaviour fluctuated with the third-party provider (see reference group).

Click any item to jump to its section

Three tasks, deliverable by deliverable

Output review · first-hand tester feedback

Each task: side-by-side previews of all three deliverables + tester review + full recording. Previews are the original delivered files, untouched.

Supplier screening report

Given 10 supplier meeting-notes docx files plus screening criteria, produce a presentation-ready supplier shortlist report.

All three completed the task. OpenClacky's report is the most readable — it best serves the goal of a document meant for human briefing. Claude Code covered more evaluation dimensions with richer criteria. Codex's formatting broke, with noticeably colloquial and AI-flavoured wording.

OpenClacky

Completion Good

Clarity & layout Good

AI-flavour (less is better) Good

Claude Code

Completion Good

Clarity & layout Fair

AI-flavour (less is better) Fair

Codex

Completion Fair

Clarity & layout Poor

AI-flavour (less is better) Poor

OpenClacky DOCX

Claude Code DOCX

Codex DOCX

Previews are page renders of the docx deliverable; original file downloadable.

Full recording (sped up)

OpenClacky

Claude Code

Codex

AI industry daily brief

Research the last 3 days of AI industry news online and produce an HTML briefing.

All three completed the task. OpenClacky's presentation is the most scannable — what to read is obvious at a glance, with the least AI-flavour. Claude Code's page has the strongest AI-flavour; Codex sits in between.

OpenClacky

Completion Good

Clarity & layout Good

AI-flavour (less is better) Good

Claude Code

Completion Good

Clarity & layout Fair

AI-flavour (less is better) Poor

Codex

Completion Good

Clarity & layout Fair

AI-flavour (less is better) Fair

OpenClacky Open original

Open original

Claude Code Open original

Open original

Codex Open original

Open original

Previews are the original delivered HTML.

Full recording (sped up)

OpenClacky

Claude Code

Codex

Personal portfolio website

Given résumé materials and a requirements doc, build a complete multi-page portfolio site (HTML/CSS/JS).

OpenClacky and Claude Code both delivered high quality with similar styles; OpenClacky is slightly better on typography, copy presentation and motion. Codex's result is unusable — the homepage layout and expression fail outright, and it kept looping through screenshot→screen-recognition→self-check: a static UI design barely needs self-checking, yet it checked endlessly, burned the most, and still finished last.

OpenClacky

Completion Good

Clarity & layout Good

AI-flavour (less is better) Good

Claude Code

Completion Good

Clarity & layout Good

AI-flavour (less is better) Good

Codex

Completion Poor

Clarity & layout Poor

AI-flavour (less is better) Fair

OpenClacky Open original

Open original

Claude Code Open original

Open original

Codex Open original

Open original

Previews are each agent's delivered homepage; click through to browse the full site.

The process data corroborates the review: on the portfolio task Codex issued 126 requests (OpenClacky 42, Claude Code 19), 10.5M prompt tokens, $8.61 on a single task — roughly 3× the others. The billing curve matches the self-check loop visible in the recording.

Full recording (sped up)

OpenClacky

Claude Code

Codex

Equal output, the bill tells the difference

Process & cost

Data from per-request OpenRouter billing (separate API keys per agent), summed line by line — no estimates.

Total across 3 tasks

OpenClacky

$4.50

Claude Code

$4.50

Codex

$12.54

OpenClacky

Requests

Cache hit rate

90.6%

Claude Code

Requests

Cache hit rate

79.2%

Codex

Requests

173

Cache hit rate

94.3%

Per-task breakdown

Task	Agent	Requests	Cost	Prompt tokens	Cache hit
Supplier screening report	OpenClacky	22	$0.63	559,727	88.5%
	Claude Code	13	$0.86	582,039	74.7%
	Codex	33	$2.38	1,838,748	88.1%
AI industry daily brief	OpenClacky	24	$0.93	756,678	83.7%
	Claude Code	14	$1.05	763,008	76.8%
	Codex	14	$1.55	761,212	82.2%
Personal portfolio website	OpenClacky	42	$2.94	3,051,284	92.7%
	Claude Code	19	$2.58	1,619,726	81.9%
	Codex	126	$8.61	10,528,212	96.2%

OpenClacky and Claude Code both totalled $4.50 — an open-source harness matching Anthropic's first-party tool on the same model. Codex totalled $12.54, $8.61 of which went to the portfolio self-check loop.

Claude Code and OpenClacky ran claude-4.6-sonnet via Amazon Bedrock; Codex ran gpt-5.5 via OpenAI — all metered through OpenRouter. Differing request counts per task reflect normal strategy differences; read alongside the recordings.

What if all three ran DeepSeek?

Reference group · for reference only

All three agents re-ran the same 3 tasks on deepseek-v4-pro (OpenRouter / StreamLake provider). OpenClacky delivered everything as usual, $1.30 total — the harness is model-agnostic.

Why reference-only: cache differences in this group mostly stem from the OpenRouter DeepSeek provider's cache behaviour (e.g. Claude Code hit 0% on two tasks here, while DeepSeek's official API does serve its cache), so they can't be attributed to the tools themselves. Interpret the cost data with care.

Total across 3 tasks

OpenClacky

$1.30

Claude Code

$3.21

Codex

$2.61

Task	OpenClacky	Claude Code	Codex
Supplier screening report	$0.15 · 50.7%	$0.78 · 11.9%	$0.50 · 57.4%
AI industry daily brief	$0.28 · 65.4%	$0.42 · 0.0%	$0.29 · 49.3%
Personal portfolio website	$0.87 · 62.7%	$2.01 · 0.0%	$1.82 · 47.9%
Total across 3 tasks	$1.30 · 62.5%	$3.21 · 2.9%	$2.61 · 49.8%

Task · $Cost · Cache hit

What this group really shows: OpenClacky's engineering is model-agnostic — swap in a third-party model an order of magnitude cheaper, and it still ships every task. First-party tools tend to reserve their best experience for their own models.

How we tested

Methodology

Same time window · 2026-06-04 Same 3 tasks, same input materials Default configs, no special tuning Separate OpenRouter API key per agent Per-request billing audit, no estimates Single run, no cherry-picking

This round differs from round 1: round 1 compared four agents' cost engineering on the same model; this round takes on Claude Code and Codex — two top first-party tools on their strongest models — proving output parity first, then comparing engineering efficiency.

Raw data downloads

Original per-request OpenRouter billing exports — 3 files for the main group, 3 for the DeepSeek reference group. Deliverables open directly in each task section above.

Main group (Sonnet 4.6 / GPT-5.5)

supplier-openrouter.xlsx ai-daily-openrouter.xlsx portfolio-openrouter.csv

DeepSeek reference group

deepseek-supplier-openrouter.xlsx deepseek-ai-daily-openrouter.xlsx deepseek-portfolio-openrouter.xlsx

Run open-source OpenClacky, pick your own model

First-party-grade output and engineering efficiency, bound to no model.

Download OpenClacky free

Other rounds

Harness benchmark · On par with Claude Code OpenClacky vs Generic Agent · PPT Task OpenClacky vs Pi Agent · PPT task

Three top agents, same tasks:matching output, engineering decides.

Three takeaways

Three tasks, deliverable by deliverable

Supplier screening report

AI industry daily brief

Personal portfolio website

Equal output, the bill tells the difference

What if all three ran DeepSeek?

How we tested

Raw data downloads

Run open-source OpenClacky, pick your own model

Three top agents, same tasks:
matching output, engineering decides.