Harness Benchmark · 2026-04-30

On par with Claude Code,
well ahead of the rest.

Three tasks, four agents, identical conditions.

3-task totals

The chart first

Total cost, request count, and cache hit rate for all four agents under identical prompt, model, and skill. Numbers come from per-request OpenRouter CSV logs.

Total cost

Per-request OpenRouter billing CSV · 3 tasks × 4 agents summed request by request · completed on 2026-04-30 within the same time window
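Each total is a sum over billing rows. A minimal sketch of that aggregation in Python, assuming each agent's export is a CSV with one row per request and a per-request cost column. The column name `cost_usd` and the sample rows are hypothetical; they are not OpenRouter's actual export schema or the benchmark's data.

```python
import csv
import io
from decimal import Decimal


def total_cost(csv_text: str, cost_column: str = "cost_usd") -> tuple[int, Decimal]:
    """Return (request_count, summed_cost) for one agent's billing CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Decimal avoids float rounding drift when summing many small charges.
    total = sum((Decimal(row[cost_column]) for row in rows), Decimal("0"))
    return len(rows), total


# Hypothetical three-request export, for illustration only.
sample = "request_id,cost_usd\nr1,0.1200\nr2,0.0450\nr3,0.0310\n"
count, cost = total_cost(sample)
print(count, cost)
```

Because each agent ran under its own API key, running a function like this once per export would give both the request count and the total cost for that agent.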

OpenClacky (this project): $5.10
Claude Code: $5.49
OpenClaw: $15.70
Hermes: $30.14

Why the 6× gap

The direct cause: cache × request count

Per-prompt pricing is almost identical, since all four agents call the same model. The roughly 6× total-cost gap between Hermes and OpenClacky comes from two factors: OpenClacky finished the same tasks with fewer requests and with a higher cache hit rate.

Cache hit rate

OpenClacky: 90.6%
Claude Code: 95.2%
OpenClaw: 88.7%
Hermes: 60.3%

Request count

OpenClacky: 51
Claude Code
OpenClaw
Hermes: 218

OpenClacky: 51 requests + 90.6% hit rate → $5.10. Hermes: 218 requests + 60.3% hit rate → $30.14.
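The gap can be split into a request-count factor and an average cost-per-request factor. The sketch below uses only the request counts and totals reported on this page. Attributing the per-request difference to cache hit rate follows the page's own explanation and is an assumption here, since context size per request would also affect it.

```python
# Figures reported on this page for the cheapest and the most expensive agent.
agents = {
    "OpenClacky": {"requests": 51, "cost_usd": 5.10},
    "Hermes": {"requests": 218, "cost_usd": 30.14},
}


def cost_per_request(name: str) -> float:
    """Average billed cost of one request for the given agent."""
    agent = agents[name]
    return agent["cost_usd"] / agent["requests"]


# total ratio = (request count ratio) x (cost per request ratio)
request_ratio = agents["Hermes"]["requests"] / agents["OpenClacky"]["requests"]
per_request_ratio = cost_per_request("Hermes") / cost_per_request("OpenClacky")
total_ratio = agents["Hermes"]["cost_usd"] / agents["OpenClacky"]["cost_usd"]

print(f"request count ratio:    {request_ratio:.2f}x")
print(f"cost per request ratio: {per_request_ratio:.2f}x")
print(f"total cost ratio:       {total_ratio:.2f}x")
```

On these figures, Hermes issued about 4.27 times as many requests and paid about 1.38 times as much per request, which multiplies out to the roughly 5.91× total.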

Artifact comparison

Same prompt, four very different outputs.

Marketing / PPT HTML outputs are embedded inline; social-content text outputs are listed as files.

01 · guizang-ppt-skill

10-page Horizontal-Swipe Business Deck (single HTML)

guizang-ppt-skill · AI-Agent industry trend talk

OpenClacky: $1.23
Claude Code: $1.45
OpenClaw: $5.07
Hermes: $10.96

[Embedded HTML outputs: OpenClacky, Claude Code, OpenClaw, Hermes]

02 · marketing-psychology

AI Customer-Service SaaS Marketing Plan + Live Homepage

marketing-psychology skill · dual deliverable

OpenClacky: $1.72
Claude Code: $1.20
OpenClaw: $7.47
Hermes: $4.65

[Embedded HTML outputs: OpenClacky, Claude Code, OpenClaw, Hermes]

03 · social-content

B2B SaaS Competitor Analysis + Week-1 Social Calendar

social-content skill · 6-step pipeline

OpenClacky: $2.14
Claude Code: $2.84
OpenClaw: $3.15
Hermes: $14.53

OpenClacky

15 files

final_report.md flowbase_content_strategy.md flowbase_blog_article.md competitive_gap_analysis.md week1_linkedin.md week1_twitter.md competitor_coda_analysis.md competitor_notion_analysis.md competitor_obsidian_analysis.md competitor_coda_posts.md competitor_notion_posts.md competitor_obsidian_posts.md

Claude Code

8 files

final_report.md flowbase_content_strategy.md competitive_gap_analysis.md week1_linkedin.md week1_twitter.md competitor_coda_analysis.md competitor_notion_analysis.md competitor_obsidian_analysis.md

OpenClaw

10 files

final_report.md flowbase_content_strategy.md competitive_gap_analysis.md week1_linkedin.md week1_twitter.md competitor_coda_analysis.md competitor_notion_analysis.md competitor_obsidian_analysis.md _check.py _check_result.json

Hermes

8 files

_README.md competitive_gap_analysis.md competitor_coda_analysis.md competitor_notion_analysis.md competitor_obsidian_analysis.md competitor_coda_posts.md competitor_notion_posts.md competitor_obsidian_posts.md
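The per-task costs in this section can be checked against the 3-task totals in the total-cost chart. A small sketch of that check, using only the figures printed on this page:

```python
from decimal import Decimal

# Per-task costs as listed on this page, in task order 01, 02, 03.
per_task = {
    "OpenClacky": ["1.23", "1.72", "2.14"],
    "Claude Code": ["1.45", "1.20", "2.84"],
    "OpenClaw": ["5.07", "7.47", "3.15"],
    "Hermes": ["10.96", "4.65", "14.53"],
}
# 3-task totals as listed in the total-cost chart.
reported_total = {
    "OpenClacky": "5.10",
    "Claude Code": "5.49",
    "OpenClaw": "15.70",
    "Hermes": "30.14",
}


def diffs() -> dict[str, Decimal]:
    """Reported total minus the sum of per-task costs, per agent."""
    out = {}
    for agent, costs in per_task.items():
        summed = sum((Decimal(c) for c in costs), Decimal("0"))
        out[agent] = Decimal(reported_total[agent]) - summed
    return out


for agent, diff in diffs().items():
    print(f"{agent}: reported - summed = {diff}")
```

Claude Code and Hermes match exactly; OpenClacky and OpenClaw differ by one cent, which is consistent with each per-task figure having been rounded separately.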

See recording & data

Test conditions

Controlled variables

To keep the results comparable, all four agents were run within the same time window, using identical prompts, the same underlying model, and the same skill versions.

Identical prompt
Identical model (claude-opus-4-7)
Identical skill (same version pre-installed everywhere)
Separate OpenRouter API keys
Per-request CSV reconciliation, no estimates
Single run, no cherry-picking

Want the harness engineering details?

The tech deep dive covers cache design, pipeline decomposition, and every design trade-off.

Download OpenClacky (free)
Read the tech deep dive
