Three top agents, same tasks:
matching output, engineering decides.
OpenClacky and Claude Code both run claude-4.6-sonnet; Codex runs gpt-5.5 — each on its strongest native stack, across the same 3 real-world tasks. This round isn't about who can finish; it's about clarity, AI-flavour and the bill at equally top-tier output. Fully recorded, per-request OpenRouter data published.
Three takeaways
Verdict firstOutput: OpenClacky and Claude Code share the first tier. Both delivered all three tasks at high quality; OpenClacky edges ahead on clarity, typography and motion, with the least AI-flavour. Codex clearly trails — broken formatting in the supplier report, and a portfolio homepage that is essentially unusable.
Cost: equal output, OpenClacky $4.50 ≈ Claude Code $4.50. Codex spent 2.8× ($12.54) and still finished last — it fell into a screenshot→inspect→self-check loop on the portfolio task, burning $8.61 on that task alone.
Model freedom: swap in a third-party model and OpenClacky still delivers. With all three forced onto deepseek-v4-pro, OpenClacky completed everything for $1.30; Claude Code / Codex also ran, but cost and cache behaviour fluctuated with the third-party provider (see reference group).
Click any item to jump to its section
Three tasks, deliverable by deliverable
Output review · first-hand tester feedbackEach task: side-by-side previews of all three deliverables + tester review + full recording. Previews are the original delivered files, untouched.
Supplier screening report
Given 10 supplier meeting-notes docx files plus screening criteria, produce a presentation-ready supplier shortlist report.
All three completed the task. OpenClacky's report is the most readable — it best serves the goal of a document meant for human briefing. Claude Code covered more evaluation dimensions with richer criteria. Codex's formatting broke, with noticeably colloquial and AI-flavoured wording.
Previews are page renders of the docx deliverable; original file downloadable.
Full recording (sped up)
AI industry daily brief
Research the last 3 days of AI industry news online and produce an HTML briefing.
All three completed the task. OpenClacky's presentation is the most scannable — what to read is obvious at a glance, with the least AI-flavour. Claude Code's page has the strongest AI-flavour; Codex sits in between.
Previews are the original delivered HTML.
Full recording (sped up)
Personal portfolio website
Given résumé materials and a requirements doc, build a complete multi-page portfolio site (HTML/CSS/JS).
OpenClacky and Claude Code both delivered high quality with similar styles; OpenClacky is slightly better on typography, copy presentation and motion. Codex's result is unusable — the homepage layout and expression fail outright, and it kept looping through screenshot→screen-recognition→self-check: a static UI design barely needs self-checking, yet it checked endlessly, burned the most, and still finished last.
Previews are each agent's delivered homepage; click through to browse the full site.
The process data corroborates the review: on the portfolio task Codex issued 126 requests (OpenClacky 42, Claude Code 19), 10.5M prompt tokens, $8.61 on a single task — roughly 3× the others. The billing curve matches the self-check loop visible in the recording.
Full recording (sped up)
Equal output, the bill tells the difference
Process & costData from per-request OpenRouter billing (separate API keys per agent), summed line by line — no estimates.
| Task | Agent | Requests | Cost | Prompt tokens | Cache hit |
|---|---|---|---|---|---|
| Supplier screening report | OpenClacky | 22 | $0.63 | 559,727 | 88.5% |
| Claude Code | 13 | $0.86 | 582,039 | 74.7% | |
| Codex | 33 | $2.38 | 1,838,748 | 88.1% | |
| AI industry daily brief | OpenClacky | 24 | $0.93 | 756,678 | 83.7% |
| Claude Code | 14 | $1.05 | 763,008 | 76.8% | |
| Codex | 14 | $1.55 | 761,212 | 82.2% | |
| Personal portfolio website | OpenClacky | 42 | $2.94 | 3,051,284 | 92.7% |
| Claude Code | 19 | $2.58 | 1,619,726 | 81.9% | |
| Codex | 126 | $8.61 | 10,528,212 | 96.2% |
OpenClacky and Claude Code both totalled $4.50 — an open-source harness matching Anthropic's first-party tool on the same model. Codex totalled $12.54, $8.61 of which went to the portfolio self-check loop.
Claude Code and OpenClacky ran claude-4.6-sonnet via Amazon Bedrock; Codex ran gpt-5.5 via OpenAI — all metered through OpenRouter. Differing request counts per task reflect normal strategy differences; read alongside the recordings.
What if all three ran DeepSeek?
Reference group · for reference onlyAll three agents re-ran the same 3 tasks on deepseek-v4-pro (OpenRouter / StreamLake provider). OpenClacky delivered everything as usual, $1.30 total — the harness is model-agnostic.
Why reference-only: cache differences in this group mostly stem from the OpenRouter DeepSeek provider's cache behaviour (e.g. Claude Code hit 0% on two tasks here, while DeepSeek's official API does serve its cache), so they can't be attributed to the tools themselves. Interpret the cost data with care.
| Task | OpenClacky | Claude Code | Codex |
|---|---|---|---|
| Supplier screening report | $0.15 · 50.7% | $0.78 · 11.9% | $0.50 · 57.4% |
| AI industry daily brief | $0.28 · 65.4% | $0.42 · 0.0% | $0.29 · 49.3% |
| Personal portfolio website | $0.87 · 62.7% | $2.01 · 0.0% | $1.82 · 47.9% |
| Total across 3 tasks | $1.30 · 62.5% | $3.21 · 2.9% | $2.61 · 49.8% |
Task · $Cost · Cache hit
What this group really shows: OpenClacky's engineering is model-agnostic — swap in a third-party model an order of magnitude cheaper, and it still ships every task. First-party tools tend to reserve their best experience for their own models.
How we tested
MethodologyThis round differs from round 1: round 1 compared four agents' cost engineering on the same model; this round takes on Claude Code and Codex — two top first-party tools on their strongest models — proving output parity first, then comparing engineering efficiency.
Raw data downloads
Original per-request OpenRouter billing exports — 3 files for the main group, 3 for the DeepSeek reference group. Deliverables open directly in each task section above.
Run open-source OpenClacky, pick your own model
First-party-grade output and engineering efficiency, bound to no model.
Download OpenClacky free