AI Customer-Service SaaS Marketing Plan + Live Homepage
marketing-psychology skill · dual deliverable
Task & environment
Analyze only gorgias.com. Output a Chinese marketing document (positioning, 30-day acquisition plan, content topics, DM scripts, homepage copy, objections) and a single-file Chinese index.html (hero / pain / solution / features / use cases / FAQ / demo-booking, no external assets).
Deliverables: one marketing execution Markdown file plus one single-file Chinese index.html (fully inlined CSS/JS).
Results
All numbers are recomputed per-request from the OpenRouter activity CSV.
| Agent | Requests | Cost | Prompt tokens | Hit rate | Hit rate (excl. 1st request) | Truncations | Errors | Model clean |
|---|---|---|---|---|---|---|---|---|
| OpenClacky (this project) | 20 | $1.72 | 628,278 | 91.0% | 92.2% | 1 | 0 | ✅ Clean |
| Claude Code (closed-source) | 8 | $1.20 | 310,106 | 64.5% | 63.6% | 0 | 0 | ⚠️ Mixed |
| OpenClaw (open-source peer) | 34 | $7.47 | 3,759,466 | 86.1% | 88.2% | 8 | 0 | ✅ Clean |
| Hermes (open multi-agent) | 22 | $4.65 | 1,258,934 | 52.9% | 53.9% | 0 | 0 | ✅ Clean |
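As a rough illustration of how the two hit-rate columns could be recomputed per request, here is a minimal sketch. It assumes the hit rate is token-weighted (cached prompt tokens divided by total prompt tokens) and that rows are sorted by created_at; the field names `prompt_tokens` and `cached_tokens` are hypothetical and may not match the actual OpenRouter CSV headers. The sample rows are toy values, not benchmark data.

```python
from typing import Dict, List


def hit_rate(requests: List[Dict[str, int]], skip_first: bool = False) -> float:
    """Token-weighted cache hit rate: cached prompt tokens / total prompt tokens.

    Assumes `requests` is ordered by creation time, so index 0 is the cold start.
    """
    rows = requests[1:] if skip_first else requests
    prompt = sum(r["prompt_tokens"] for r in rows)
    cached = sum(r["cached_tokens"] for r in rows)
    return cached / prompt if prompt else 0.0


# Toy rows for illustration only; these are NOT values from the benchmark.
toy_session = [
    {"prompt_tokens": 1000, "cached_tokens": 0},     # cold start: nothing cached yet
    {"prompt_tokens": 2000, "cached_tokens": 1800},
    {"prompt_tokens": 3000, "cached_tokens": 2700},
]

overall = hit_rate(toy_session)                 # all requests
warm = hit_rate(toy_session, skip_first=True)   # first request excluded
print(f"{overall:.1%} / {warm:.1%}")
```

In this toy session the cold start drags the overall figure well below the warm figure. A gap that small in a real run, or one that goes the other way, suggests the first request is not the main source of misses.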
Artifact comparison
Marketing / PPT HTML outputs are embedded inline; social-content text outputs are listed as files.
Full screen recordings of all four runs
Process footage · Evidence
Full screen recordings captured during task execution. Same prompt, same window, four agents.
Recordings captured during the May 2026 benchmark. Original timing can be cross-checked against the created_at column in the OpenRouter logs.
Actual artifacts
All four agents' deliverables are public. Preview the HTML, read the Markdown, or download the source.
- OpenClacky: 3 files
- Claude Code: 2 files
- OpenClaw: 2 files
- Hermes: 3 files
Execution path & observations
- OpenClacky: 20 requests in a single session. The session JSON was cleared by the rotate mechanism, but the system log confirms the playbook landed at 12:35 and the plan at 16:09. Hit rate was 91.0% (92.2% with the cold start excluded), the highest on this task.
- Claude Code: 8 requests and the cheapest run at $1.20, but 3 of the 8 requests silently used haiku/sonnet (architectural behavior: lightweight models are auto-dispatched for auxiliary steps). The low hit rate (64.5%) is explained by the small session size: the first request carries proportionally more weight.
- OpenClaw: 34 requests at $7.47 (4.3× OpenClacky). 8 of the 34 requests (23.5%) hit finish_reason=length: output reached max_tokens, triggering a continuation/retry with a larger resubmitted context.
- Hermes: 22 requests at $4.65. Hit rate went from 52.9% to 53.9% with the cold start excluded, essentially flat, reconfirming that the cache miss is architectural.
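The ratios quoted in the observations above can be re-derived from the results table. This is a small arithmetic check using only the published figures; the dictionary layout and key names are illustrative.

```python
# Figures copied from the results table (requests, cost in USD, truncations).
runs = {
    "OpenClacky": {"requests": 20, "cost": 1.72, "truncations": 1},
    "Claude Code": {"requests": 8, "cost": 1.20, "truncations": 0},
    "OpenClaw": {"requests": 34, "cost": 7.47, "truncations": 8},
    "Hermes": {"requests": 22, "cost": 4.65, "truncations": 0},
}

# How many times more expensive OpenClaw was than OpenClacky.
cost_ratio = runs["OpenClaw"]["cost"] / runs["OpenClacky"]["cost"]

# Share of OpenClaw requests that ended with finish_reason=length.
trunc_share = runs["OpenClaw"]["truncations"] / runs["OpenClaw"]["requests"]

# Cheapest run by total cost.
cheapest = min(runs, key=lambda name: runs[name]["cost"])

print(f"{cost_ratio:.1f}x, {trunc_share:.1%}, cheapest: {cheapest}")
# prints "4.3x, 23.5%, cheapest: Claude Code"
```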
Takeaway
On this task, OpenClacky's hit rate (91.0%) actually beat Claude Code's (64.5%). When request counts grow, OpenClacky's cache engineering edge shows up.
Claude Code's $1.20 headline number needs nuance: 3 of its 8 requests were haiku/sonnet, so this isn't strictly a "same model" comparison.
OpenClaw's high cost comes almost entirely from 8 output truncations and their retry overhead — the same failure mode shows up again on the PPT task.
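The "Model clean" column and the haiku/sonnet caveat could be checked mechanically by counting which model served each request. The sketch below assumes each log row exposes a model identifier; the function name `model_mix`, the placeholder model names, and the toy list are all hypothetical and are not taken from the benchmark logs.

```python
from collections import Counter
from typing import Iterable, Tuple


def model_mix(models: Iterable[str], expected: str) -> Tuple[str, Counter]:
    """Label a run "Clean" if every request used `expected`, else "Mixed"."""
    counts = Counter(models)
    off_target = sum(n for model, n in counts.items() if model != expected)
    label = "Clean" if off_target == 0 else "Mixed"
    return label, counts


# Toy per-request model list with placeholder names (NOT real log data):
# 5 requests on the intended model, 3 on auxiliary models.
toy_models = ["main-model"] * 5 + ["aux-model-a"] * 2 + ["aux-model-b"]

label, counts = model_mix(toy_models, expected="main-model")
print(label, dict(counts))
```

A run labeled "Mixed" is not directly comparable on cost to a run that used one model throughout, which is the nuance behind the $1.20 headline number.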