Round 3 · Same model, same prompt, 1v1

OpenClacky vs Pi Agent:
same PPT task, bill at 66%.

Same guizang-ppt-skill, same claude-4.7-opus, same prompt. The only variable is the agent harness itself. Both delivered an 11-page deck; this page lays out the process data so you can judge for yourself.

OpenClacky: clacky-ai/openclacky
Pi Agent: earendil-works/pi

Pi Agent comes from badlogic (creator of libGDX). It's a 46.9k-star agent harness mono-repo on GitHub — coding agent CLI, unified multi-provider LLM API, TUI / web UI libraries, Slack bot, vLLM pods. It emphasizes shareable OSS sessions and self-extensibility — a platform, not a script. Its "lightweight per-turn context, many small steps" stance is a deliberate engineering tradeoff. This page puts that philosophy on the same axes as OpenClacky for one specific PPT task.

Conclusion first · total cost
$1.18 vs $1.79
Same skill, same claude-4.7-opus, same prompt. OpenClacky was 34% cheaper and finished in 0.44× the time (149s vs 342s). Both delivered structurally identical 11-page decks.
Cache hit rate
90.7% vs 75.8%
Requests
14 vs 23
Three plain facts
  • OpenClacky's cache hit rate is very high (90.7%). 12 of 14 requests reused the previous prompt cache (only #1 and #2 show 0% in the timeline). This is the root cause of both the lower bill ($1.18) and the shorter wall-clock (2m29s).
  • Pi has lighter per-turn prompts (−40%). Average prompt is 33.5k vs OpenClacky's 55.6k — direct evidence of its "small steps, lean context" design philosophy. Single-turn cost on the model side is genuinely lower.
  • The price: 7-turn cold start + 6 mid-run cache breaks. The bill ends up 52% higher ($1.79 vs $1.18).
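The headline ratios follow directly from the totals reported on this page ($1.18 vs $1.79, 149s vs 342s). A minimal Python sketch of that arithmetic; the variable names are ours and are not taken from either harness:

```python
# Totals as reported on this page (OpenRouter bill and summed generation time).
openclacky_cost, pi_cost = 1.18, 1.79    # USD
openclacky_secs, pi_secs = 149, 342      # seconds

cost_ratio = openclacky_cost / pi_cost        # OpenClacky bill as a share of Pi's
savings = 1 - cost_ratio                      # how much cheaper OpenClacky was
pi_premium = pi_cost / openclacky_cost - 1    # how much more Pi cost
time_ratio = openclacky_secs / pi_secs        # OpenClacky time as a share of Pi's

print(f"bill ratio : {cost_ratio:.0%}")    # 66%
print(f"cheaper by : {savings:.0%}")       # 34%
print(f"Pi premium : {pi_premium:.0%}")    # 52%
print(f"time ratio : {time_ratio:.2f}x")   # 0.44x
```

"34% cheaper" and "52% more expensive" describe the same gap from opposite baselines, which is why the two percentages differ.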

Experiment setup

The only variable is the agent harness.

Skill: guizang-ppt-skill. Both sides loaded the same skill; outputs are structurally identical.
Model: claude-4.7-opus. Both sides used this model for 100% of requests.
Prompt: 10-page AI Agent trends deck, targeted at executives, dark tech-business style.
Test date: 2026-05-09. Both runs same day, same time window, same model instance.
Prompt summary
Task: use guizang-ppt-skill to produce "2026 AI Agent Industry Trends: Strategic Paradigm for Enterprise Intelligence".
Audience: enterprise leaders / mid-to-senior decision-makers (focused on ROI, efficiency, organizational change).
Length: 20 minutes / ≤10 pages. Style: high-end, minimal, business, tech.
Outline: cover / core thesis (Copilot→Agent) / market drivers / multi-agent collaboration / industry ROI / architectural evolution / governance dividends / risks / 4-quarter roadmap / closing.

Run recordings

Process · Evidence

Watch for: (1) what granularity of decision the agent makes per turn; (2) any moments where the same file is re-read.

OpenClacky: ≈ 3 min
Pi Agent: ≈ 6 min

Recordings produced 2026-05-09; timestamps line up 1:1 with the OpenRouter CSVs.

All key metrics

Numbers come straight from the OpenRouter activity CSV, request by request. No estimates, no averaging tricks.

| Metric | OpenClacky | Pi Agent | Notes |
| --- | --- | --- | --- |
| Requests | 14 | 23 | Total model calls |
| Agent iterations | 11 | 23 | 1:1 with requests for Pi Agent; OpenClacky used 14 requests across 11 iterations |
| Total wall-clock | 149s | 342s | OpenRouter Σ generation_time (first → last) |
| Total cost | $1.18 | $1.79 | OpenRouter bill (cache discount applied) |
| Total prompt tokens | 778,403 | 769,349 | All input tokens sent to model |
| Total cached tokens | 705,787 | 582,868 | Tokens that hit prompt cache |
| Overall cache hit rate | 90.7% | 75.8% | Σcached / Σprompt |
| Cache breaks | 1 | 7 | Requests with hit rate < 50% (incl. cold start) |
| Avg prompt / turn | 55,600 | 33,450 | Per-request input volume |
| Avg completion / turn | 1,054 | 823 | Per-request output volume |
| Avg generation time / turn | 10.6s | 14.9s | Mean of OpenRouter generation_time_ms |
| Length cuts | 0 | 0 | Requests with finish_reason = length |
| Errors | 0 | 0 | finish_reason = error / content_filter |

Green = numerically better on that metric. Gray = no clear "better" direction (e.g. token volume itself is neither good nor bad).
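The derived rows in the table (hit rate, average prompt per turn) can be recomputed from the three raw totals. A small sketch, assuming only the figures printed above; the `RunTotals` class and its property names are our own illustration, not part of the OpenRouter export:

```python
from dataclasses import dataclass


@dataclass
class RunTotals:
    """Raw totals for one harness, as listed in the metrics table."""
    requests: int
    prompt_tokens: int
    cached_tokens: int

    @property
    def cache_hit_rate(self) -> float:
        # Σcached / Σprompt, the definition given in the table notes
        return self.cached_tokens / self.prompt_tokens

    @property
    def avg_prompt(self) -> float:
        return self.prompt_tokens / self.requests

    @property
    def uncached_tokens(self) -> int:
        # Input tokens that did not hit the prompt cache
        return self.prompt_tokens - self.cached_tokens


openclacky = RunTotals(requests=14, prompt_tokens=778_403, cached_tokens=705_787)
pi = RunTotals(requests=23, prompt_tokens=769_349, cached_tokens=582_868)

for name, run in (("OpenClacky", openclacky), ("Pi Agent", pi)):
    print(f"{name:10s} hit={run.cache_hit_rate:.1%} "
          f"avg_prompt={run.avg_prompt:,.0f} uncached={run.uncached_tokens:,}")
```

The two runs sent almost the same total prompt volume; what differs is the uncached remainder (72,616 vs 186,481 tokens, roughly 2.6× more for Pi).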

Per-request timeline · how cache breaks happen

Bar length = prompt token volume. Color = cache hit rate. Red means the agent's context was reset or had not accumulated yet, so the value of tens of thousands of previously cached tokens was lost.

OpenClacky · 14 requests (request #, prompt tokens, cache hit rate)
#1 13 0%
#2 21,363 0%
#3 40,959 52%
#4 55,665 74%
#5 56,470 99%
#6 57,015 99%
#7 57,922 98%
#8 67,509 86%
#9 68,421 99%
#10 69,055 99%
#11 70,068 99%
#12 70,287 100%
#13 71,459 98%
#14 72,197 98%
Smooth ramp from 13 to 72k tokens. After a four-request warm-up (#1 to #4, climbing from 0% to 74%), every turn is at or above 86% hit rate, with no mid-run resets. The cost is heavier per-turn prompts (55.6k on average vs Pi's 33.5k).
Pi Agent · 23 requests (request #, prompt tokens, cache hit rate)
#1 3,249 0%
#2 5,786 0%
#3 5,962 0%
#4 6,137 0%
#5 6,856 0%
#6 7,044 0%
#7 10,735 0%
#8 11,987 90%
#9 12,122 0%
#10 12,376 97%
#11 14,314 84%
#12 27,138 44%
#13 45,614 32%
#14 46,108 31%
#15 47,303 97%
#16 47,915 97%
#17 47,915 99%
#18 67,074 71%
#19 68,111 70%
#20 68,576 98%
#21 68,724 99%
#22 69,090 99%
#23 69,213 99%
First 7 turns are all 0% hit (long cold start). Mid-run, 6 requests fall out of the ≥80% band: #9 drops back to 0%, #12 / #13 / #14 sit between 31% and 44%, and #18 / #19 land at 70–71%. This is the direct cost of the "small steps" philosophy: prompt cache rewards stable, accumulated context.
Legend: ≥ 80% hit · 50–80% hit · < 50% hit (incl. cold start / context reset)
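The cold-start length and the mid-run drops can be read off the per-request hit rates mechanically. A sketch using the percentages listed in the two timelines; the function names and the 80% threshold (taken from the legend's top band) are our own choices:

```python
# Cache hit rate (%) per request, copied from the two timelines.
openclacky_hits = [0, 0, 52, 74, 99, 99, 98, 86, 99, 99, 99, 100, 98, 98]
pi_hits = [0, 0, 0, 0, 0, 0, 0, 90, 0, 97, 84, 44, 32, 31,
           97, 97, 99, 71, 70, 98, 99, 99, 99]


def cold_start_length(hits):
    """Number of leading requests with a 0% hit rate."""
    count = 0
    for hit in hits:
        if hit != 0:
            break
        count += 1
    return count


def mid_run_drops(hits, threshold=80):
    """1-based request numbers below `threshold` after the cache first warms up."""
    first_warm = next((i for i, hit in enumerate(hits) if hit >= threshold), None)
    if first_warm is None:
        return []
    return [i + 1 for i, hit in enumerate(hits) if i > first_warm and hit < threshold]


print(cold_start_length(openclacky_hits), mid_run_drops(openclacky_hits))  # 2 []
print(cold_start_length(pi_hits), mid_run_drops(pi_hits))  # 7 [9, 12, 13, 14, 18, 19]
```

OpenClacky's first request is a 13-token call, so its effective cold start is the single 21k-token request #2.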

Technical traits (each side's tradeoffs)

Two harnesses, two engineering tradeoffs. We compare facts the data supports — no "who's better" verdict.

Context management

Mirror images of "accumulate vs trim per turn" produce two different cost curves.

OpenClacky
Continuous accumulation. Prompt grows monotonically 13 → 72k across 14 turns; overall hit rate 90.7%. Pro: max cache reuse. Con: each turn is heavier than the last.
Pi Agent
Under "small steps", cold start is 7 turns of 0% hit, plus 6 mid-run drops to between 0% and 71%. Pro: per-turn token volume stays low. Con: cache reuse is structurally low.
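One way to see why accumulating and trimming lead to different hit rates: prompt caches generally match on an exact prefix, so any change early in the prompt invalidates everything after it. The toy model below is our own simplification (token lists, longest-common-prefix matching, a hypothetical two-turn sliding window). It does not reproduce either harness's actual context policy or the provider's cache implementation.

```python
def common_prefix_len(a, b):
    """Length of the shared leading run of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def hit_rates(prompts):
    """Share of each prompt that repeats the previous prompt's prefix."""
    rates, prev = [], []
    for prompt in prompts:
        rates.append(common_prefix_len(prev, prompt) / len(prompt))
        prev = prompt
    return rates


def flatten(chunks):
    return [tok for chunk in chunks for tok in chunk]


system = ["sys"] * 4                            # fixed system prompt, 4 tokens
turns = [[f"t{i}"] * 3 for i in range(5)]       # 5 turns of 3 tokens each

# Append-only: every prompt is the previous prompt plus one new turn.
append_only = [system + flatten(turns[:k + 1]) for k in range(len(turns))]
# Hypothetical trimming: system prompt plus only the last two turns.
windowed = [system + flatten(turns[max(0, k - 1):k + 1]) for k in range(len(turns))]

print([round(r, 2) for r in hit_rates(append_only)])  # [0.0, 0.7, 0.77, 0.81, 0.84]
print([round(r, 2) for r in hit_rates(windowed)])     # [0.0, 0.7, 0.4, 0.4, 0.4]
```

In the append-only case the reusable prefix grows with every turn; once the window starts dropping the oldest turn, only the system prompt still matches.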

Tool granularity

OpenClacky has many cohesive purpose-built tools. Pi composes finer-grained light calls across more turns.

OpenClacky
13 tool_use calls across file_reader / edit / glob / grep / write etc. Each call returns more information, so 14 turns suffice.
Pi Agent
19 tool_use calls across 23 turns. Lighter per call, finer-grained decisions. This aligns with Pi's "observable, shareable session" design goal — fine steps make for cleaner postmortems and dataset publication.

Per-turn load

"Light per turn, many turns" vs "Heavy per turn, few turns" — the most visible axis of difference.

OpenClacky
Avg prompt 55.6k, avg generation 10.6s. Heavier per turn, although mean generation time is still below Pi's 14.9s, and there are only 14 turns total.
Pi Agent
Avg prompt 33.5k, 40% lighter than OpenClacky. Friendlier to the model per turn, which is Pi's real strength on this axis. Cost: 23 turns, with a larger share of each prompt billed without the cache discount.

Resilience

Neither side hit any failure path on this task. No quality difference.

OpenClacky
0 length cuts, 0 errors. All 14 turns finished with finish_reason=stop.
Pi Agent
0 length cuts, 0 errors. 22 of 23 turns finished with finish_reason=tool_calls or stop; 1 turn had an empty finish_reason (no impact on the result).
How to read this data: Pi achieves smaller per-turn token volume and finer session observability via "lightweight + small steps". OpenClacky maximizes cache hit rate and per-call information density via "accumulating context + purpose-built tools". Both delivered the deck — but under per-token billing where prompt-cache hit rate dominates the math, OpenClacky's 90.7% hit rate brings the bill down to 66% of Pi's. Which tradeoff is right depends on whether you value "shareable sessions / lighter per-turn" or "cheaper / faster overall".
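To illustrate why hit rate dominates the input side of the bill, here is a deliberately simplified cost sketch. Both the unit price and the cached-token discount are hypothetical placeholders, not the rates OpenRouter charged for these runs, and the model ignores output tokens and any cache-write surcharge, so it is not expected to reproduce the $1.18 / $1.79 totals.

```python
def input_cost(prompt_tokens, cached_tokens, price_per_mtok, cached_discount):
    """Input-side cost when cached tokens are billed at a fraction of list price."""
    uncached = prompt_tokens - cached_tokens
    billable = uncached + cached_tokens * cached_discount
    return billable * price_per_mtok / 1_000_000


PRICE_PER_MTOK = 10.0   # hypothetical USD per million input tokens
CACHED_DISCOUNT = 0.1   # hypothetical: cached tokens cost 10% of list price

# Token totals from the metrics table on this page.
openclacky_in = input_cost(778_403, 705_787, PRICE_PER_MTOK, CACHED_DISCOUNT)
pi_in = input_cost(769_349, 582_868, PRICE_PER_MTOK, CACHED_DISCOUNT)

print(f"OpenClacky input cost: ${openclacky_in:.2f}")
print(f"Pi Agent input cost:   ${pi_in:.2f}")
print(f"ratio: {openclacky_in / pi_in:.2f}")
```

Because the ratio does not depend on the unit price, only the assumed discount changes the outcome; the steeper the discount on cached tokens, the more a high hit rate pays off.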

Final outputs

Both outputs are structurally identical (same skill template: 11 pages, dual canvas backgrounds, same fonts). Click any thumbnail to open the full deck.

Thumbnails are scaled previews. Best viewed in a desktop browser; both decks include WebGL backgrounds and animations.

Wrap-up

Same skill, same claude-4.7-opus, same prompt. Both delivered structurally identical 11-page decks; visible quality difference is minimal.

The difference is in the process: OpenClacky took the accumulating-context + purpose-built-tools route — 14 requests, 90.7% cache hit rate, $1.18, 2m29s. Pi took the lightweight-context + small-steps route — smaller per-turn prompts, ending at 23 requests, $1.79, 5m42s.

Worth respecting: Pi is a 46.9k-star agent harness platform whose design goals include shareable OSS sessions, self-extensibility, and unified multi-provider APIs. "Lightweight + small steps" is a natural consequence of that, not a bug. Under prompt-cache pricing, "lighter per-turn" and "higher cross-turn hit rate" pull against each other.

This is an honest tradeoff comparison: neither side dominates; they fit different scenarios. If you need shareable session datasets, fine-grained step replay, or multi-provider flexibility, Pi's engineering value is hard to replace. If you want maximum cache reuse and cheaper, faster overall runs, OpenClacky's accumulating scheduling wins.

All raw data (OpenRouter CSV, session JSONL, recordings) is on this page. Verify it yourself, and feel free to visit Pi's GitHub for the author's own design exposition.
