03 · social-content · 2026-04-30

B2B SaaS Competitor Analysis + Week-1 Social Calendar

social-content skill · 6-step pipeline

Task & environment

Original prompt

Run in order: read source material → 3 competitor analyses → gap analysis (shared blind spots + differentiation) → FlowBase content strategy → week-1 LinkedIn & Twitter content calendars → final combined report.

Expected artifacts

10 Markdown files: 3× *_posts.md + 3× *_analysis.md + competitive_gap_analysis.md + flowbase_content_strategy.md + week1_linkedin.md + week1_twitter.md + final_report.md
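The artifact list above can be turned into a small completeness check. The following is a minimal sketch, not the benchmark's own tooling: the function name `check_artifacts` and the output directory layout are assumptions, and the competitor file prefixes are left open because the source only gives the `*_posts.md` and `*_analysis.md` patterns. Note that `competitive_gap_analysis.md` also matches `*_analysis.md`, so the fixed names are excluded from the glob counts.

```python
from pathlib import Path

# Fixed file names from the expected-artifacts list.
FIXED = [
    "competitive_gap_analysis.md",
    "flowbase_content_strategy.md",
    "week1_linkedin.md",
    "week1_twitter.md",
    "final_report.md",
]

# Glob patterns and the number of files expected for each.
GLOBS = {"*_posts.md": 3, "*_analysis.md": 3}


def check_artifacts(out_dir: Path) -> dict[str, str]:
    """Return {expectation: problem}; an empty dict means all 10 are present."""
    problems: dict[str, str] = {}
    for name in FIXED:
        if not (out_dir / name).is_file():
            problems[name] = "missing"
    for pattern, want in GLOBS.items():
        # Exclude fixed names so competitive_gap_analysis.md is not
        # counted as one of the three competitor analyses.
        found = [p for p in out_dir.glob(pattern) if p.name not in FIXED]
        if len(found) < want:
            problems[pattern] = f"found {len(found)}, expected {want}"
    return problems
```

Extra files, such as a bonus article or a self-check script, do not affect the result because only the expected names and patterns are inspected.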

Results

All numbers are recomputed per-request from the OpenRouter activity CSV.
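As an illustration of what a per-request recomputation could look like, here is a minimal sketch. It rests on several assumptions: only `created_at` is named in this report, so the column names `cost`, `tokens_prompt`, and `tokens_cached` are hypothetical, timestamps are assumed to be ISO-formatted so they sort as strings, and the hit rate is assumed to be cached prompt tokens divided by total prompt tokens. The sample rows are invented for the example and are not benchmark data.

```python
import csv
import io

# Invented sample rows; column names other than created_at are hypothetical.
SAMPLE = """\
created_at,cost,tokens_prompt,tokens_cached
2026-04-30T21:00:00,0.010,1000,0
2026-04-30T21:00:05,0.002,1000,900
2026-04-30T21:00:09,0.003,2000,1800
"""


def hit_rate(rows):
    """Cached prompt tokens as a share of all prompt tokens (assumed definition)."""
    prompt = sum(int(r["tokens_prompt"]) for r in rows)
    cached = sum(int(r["tokens_cached"]) for r in rows)
    return cached / prompt if prompt else 0.0


def summarize(rows):
    """Aggregate one agent's requests into the columns of the results table."""
    rows = sorted(rows, key=lambda r: r["created_at"])
    return {
        "requests": len(rows),
        "cost": sum(float(r["cost"]) for r in rows),
        "prompt_tokens": sum(int(r["tokens_prompt"]) for r in rows),
        "hit_rate": hit_rate(rows),
        # The first request cannot hit a cache that does not exist yet,
        # so it is dropped for the second hit-rate column.
        "hit_rate_excl_first": hit_rate(rows[1:]),
    }


summary = summarize(list(csv.DictReader(io.StringIO(SAMPLE))))
```

Dropping the first request separates the unavoidable cold-start miss from misses caused by the agent's own behavior.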

| Agent | Requests | Cost | Prompt tokens | Hit rate | Hit rate (excl. 1st request) | Truncations | Errors | Model clean |
|---|---|---|---|---|---|---|---|---|
| OpenClacky (this project) | 21 | $2.14 | 1,008,988 | 90.0% | 91.4% | 0 | 0 | ✅ Clean |
| Claude Code (closed-source) | 43 | $2.84 | 3,204,300 | 97.6% | 98.2% | 0 | 0 | ✅ Clean |
| OpenClaw (open-source peer) | 13 | $3.15 | 1,626,133 | 82.4% | 88.3% | 0 | 0 | ✅ Clean |
| Hermes (open multi-agent) | 145 | $14.53 | 3,850,850 | 47.1% | 47.4% | 0 | 0 | ✅ Clean |

Artifact comparison

Marketing / PPT HTML outputs are embedded inline; social-content text outputs are listed as files.

Full screen recordings of all four runs

Process footage · Evidence

Full screen recordings captured during task execution. Same prompt, same window, four agents.

Recordings (MP4): OpenClacky · Claude Code · OpenClaw · Hermes

Recordings captured during the May 2026 benchmark. Original timing can be cross-checked against the created_at column in the OpenRouter logs.

Execution path & observations

  • OpenClacky: 21 requests, single session. Delivered all 10 expected artifacts plus a bonus flowbase_blog_article.md and a verify.py self-check script.
  • Claude Code: 43 requests, single session. Highest cache hit rate on this task (97.6%). The archive has 7 deliverables; the 3 *_posts.md files were processed in-context and not persisted as separate files.
  • OpenClaw: 13 requests, single session. Fewest requests, but larger per-request prompts and a lower hit rate (82.4%) pushed the total to $3.15, higher than both OpenClacky and Claude Code.
  • Hermes: 145 requests across 9 sessions (1 orchestrator + 8 sub-tasks). Even with the first request removed, the hit rate rises only from 47.1% to 47.4%, because the multi-session architecture rebuilds the cache on every handoff. The orchestrator stopped at 21:08 after the gap analysis; it never ran the strategy, week-1 calendar, or final-report steps.
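The per-request observations above can be checked with simple arithmetic on the results table. The sketch below divides each agent's cost and prompt total by its request count. The figures are copied from the table; reading the Prompt figures as prompt tokens is an assumption, and the names `RUNS` and `per_request` are only illustrative.

```python
# (requests, cost in USD, prompt total) per agent, copied from the results table.
RUNS = {
    "OpenClacky": (21, 2.14, 1_008_988),
    "Claude Code": (43, 2.84, 3_204_300),
    "OpenClaw": (13, 3.15, 1_626_133),
    "Hermes": (145, 14.53, 3_850_850),
}


def per_request(runs):
    """Average cost and prompt size per request for each agent."""
    return {
        agent: {
            "cost_per_request": cost / n,
            "prompt_per_request": prompt / n,
        }
        for agent, (n, cost, prompt) in runs.items()
    }


for agent, avg in per_request(RUNS).items():
    print(
        f"{agent:12s} ${avg['cost_per_request']:.3f}/req  "
        f"{avg['prompt_per_request']:,.0f} prompt tokens/req"
    )
```

On these figures OpenClaw averages roughly 125,000 prompt tokens per request against roughly 48,000 for OpenClacky, which is consistent with its higher total despite the lowest request count.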

Takeaway

On this task, OpenClacky delivered all 10 skill-defined artifacts for $2.14 across 21 requests, the lowest cost and the most complete output of any agent.

Claude Code had the highest cache hit rate (97.6%) but used about twice as many requests (43 vs. 21), ending at $2.84, roughly 33% more than OpenClacky.

Hermes ran a multi-session pipeline that caused heavy cache misses and spent $14.53 without completing the strategy, week-1 calendar, or final-report steps.
