OpenClacky's Harness Engineering

Three generations of Agent architecture over two years — this is what we learned.

Over the past two years, we built three generations of Agent architecture. The first explored RAG. The second went deep on SWE-bench and multi-Agent orchestration in the cloud. We hit a lot of walls, and learned a lot.

The deepest lesson: users just want their tasks done well and fast. The best architecture isn't chasing multi-Agent complexity. It's taking a single Agent and making it as effective and cost-efficient as possible.

That led to the third generation: a full rewrite in Ruby, completed over three months. That's OpenClacky.

This document covers the core engineering decisions in that third-generation architecture, and the trade-offs behind each one.

📊 Want the numbers first? We published a 3-task head-to-head benchmark: OpenClacky / Claude Code / OpenClaw / Hermes, run with an identical prompt, model, and Skill, with every metric recomputed line by line from the OpenRouter CSV.

📝 Founder's essay: Every AI Agent Feature Is a Cache Invalidation Surface — the full narrative behind Decision I: three generations of architecture trade-offs and the benchmarks that validated our final approach.

I. Always Targeting 100% Cache Hit

Prompt Caching is one of the most effective ways to reduce LLM costs. But most Agent implementations leave a lot of savings on the table here — inflated bills and slower inference often trace back to poor cache architecture.

Don't Rebuild the System Prompt

OpenClacky builds the system prompt once at session start, then never modifies it — only appending new messages forward.

"Never modify" doesn't mean "can't update dynamically." We built a [session context] mechanism that inserts dynamic changes (Skill list reloads, model switches) as standalone blocks into the conversation flow, rather than reconstructing the system prompt.

The result: the system prompt cache breakpoint stays valid for the entire session. Most Agents restart the session when they need these kinds of updates — Claude Code and OpenClaw both do. A restart invalidates every cache entry and starts billing from zero. Our design makes that cost zero.
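A minimal sketch of the append-only idea, assuming a simple in-memory session object. The `Session` class, the `push_session_context` method, and the message shape are hypothetical names chosen for illustration, not OpenClacky's actual API.

```ruby
# Hypothetical sketch: class name, method names, and message shape are illustrative.
class Session
  attr_reader :system_prompt, :messages

  def initialize(system_prompt)
    # Built once at session start and frozen so nothing can rewrite it later.
    @system_prompt = system_prompt.dup.freeze
    @messages = []
  end

  def append(role, text)
    @messages << { role: role, content: text }
  end

  # Dynamic changes are appended as a standalone block in the conversation,
  # so the cached prefix (system prompt plus earlier messages) is untouched.
  def push_session_context(kind, detail)
    append("user", "[session context] #{kind}: #{detail}")
  end
end

session = Session.new("You are a coding agent.")
session.append("user", "Refactor lib/foo.rb")
session.push_session_context("skills_reloaded", "code-explorer, recall-memory")
session.push_session_context("model_switched", "model-b")
```

Because every update only grows the message list, a request built from this session always starts with the same bytes as the previous one, which is the property a prefix cache depends on.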

Double Cache Marking

The intuitive approach is to apply cache_control to the last message. But this creates a structural miss pattern:

Round N:   mark messages[-1] → cache established
Round N+1: messages[-1] is now messages[-2], mark is gone → cache miss

We mark the last 2 messages simultaneously:

Round N:   mark messages[-2] and messages[-1]
Round N+1: messages[-2] mark still present → cache hit

This also solves the rollback problem — when LLM retries happen or the agent backtracks after an error, the double mark ensures we hold onto as much cache as possible, instead of paying to rebuild everything.
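The marking scheme above can be sketched as follows. This assumes an Anthropic-style message shape in which `content` is an array of blocks and a block carries a `cache_control` field. The helper name `mark_last_two` is hypothetical.

```ruby
# Hypothetical helper; the message shape (content as an array of blocks) is an assumption.
CACHE_MARK = { type: "ephemeral" }.freeze

def mark_last_two(messages)
  # Deep-copy so marks never accumulate in the stored history.
  marked = Marshal.load(Marshal.dump(messages))
  marked.each { |m| m[:content].each { |b| b.delete(:cache_control) } }
  marked.last(2).each do |m|
    block = m[:content].last
    block[:cache_control] = CACHE_MARK.dup if block
  end
  marked
end

history = [
  { role: "user", content: [{ type: "text", text: "fix the bug" }] },
  { role: "assistant", content: [{ type: "text", text: "reading files" }] }
]
round_n = mark_last_two(history)

history << { role: "user", content: [{ type: "text", text: "tool result" }] }
round_n1 = mark_last_two(history)
# The message that was last in round N is second-to-last in round N+1
# and still carries a mark, so the prefix ending there can be reused.
```

Recomputing the marks on a copy for every request keeps the stored history clean and keeps the number of marked messages at two.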

Insert-then-Compress

Compression is the other cache killer. The common approach is to run compression as a separate LLM call, which shares no prefix with the session and so reuses none of the established cache.

Our approach: insert the compression instruction into the current conversation flow, completing it as part of the next normal request. Cache reuses naturally. Compression cost approaches zero.
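A rough sketch of that flow, under stated assumptions: `llm` is any callable that takes the message array and returns text, and the instruction wording and the shape of the post-compression history are invented for illustration.

```ruby
# Hypothetical sketch. The instruction text and the summary layout are assumptions.
COMPRESS_INSTRUCTION =
  "[compress] Summarize the conversation so far, keeping decisions, file paths, and open tasks."

def compress_in_place(messages, llm)
  # The request is the existing history plus one appended message, so the
  # whole prior prefix is eligible for a cache hit.
  request = messages + [{ role: "user", content: COMPRESS_INSTRUCTION }]
  summary = llm.call(request)
  # The summary replaces the old history for subsequent rounds.
  [{ role: "user", content: "[summary of earlier conversation]\n#{summary}" }]
end

# Usage with a stub in place of a real model call:
stub_llm = ->(_request) { "User wants lib/foo.rb refactored; tests pass." }
compacted = compress_in_place(
  [{ role: "user", content: "Refactor lib/foo.rb" },
   { role: "assistant", content: "Done, tests pass." }],
  stub_llm
)
```

The key property is that the compression request differs from the previous request only by its final message.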

II. Minimal Toolset: Everything Is a Skill

More tools means more noise in model tool selection, longer schemas, and less stable cache. That's the real cost behind tool bloat.

Hermes Agent is a popular framework with 52 built-in tools. It supports lazy loading, but once configured, most tools end up in context. The schema for 52 tools is itself a fixed cost — and overlapping tool descriptions cause more hallucination. The more tools the model sees, the more likely it is to pick the wrong one.

First-Class Architecture Decision: invoke_skill

OpenClacky's core tool list has exactly 16 entries. This wasn't achieved by cutting features — it was achieved by one architectural decision: making invoke_skill a first-class citizen.

invoke_skill is a meta-tool. It lets the Agent delegate any complex capability to an independent Skill, rather than stuffing specialized tools into the core list. This single decision solves several problems at once:

Sub-agent invocation: A Skill is a sub-agent. Calling a complex code-review Skill and calling a shell command are protocol-equivalent. The Agent doesn't need to understand the sub-agent's internals — only that the capability exists.

Composite capabilities: Some operations require multi-step tool coordination — memory recall (file read + semantic judgment + summarization), codebase exploration (glob + grep + multi-file reading + structure analysis), history compression (read + summarize + write back). These generate schema noise as core tools. As Skills, they become a single clean invocation.

Unlimited extensibility, stable core list: Users install new Skills, tool count stays the same, schema doesn't change, cache is unaffected. This is why 16 tools can cover an unbounded capability surface.
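To illustrate why installing Skills leaves the tool schema untouched, here is a small sketch. The `SkillRegistry` class and the schema fields are hypothetical. Only the Skill names are taken from this document.

```ruby
# Hypothetical registry; everything except the Skill names is illustrative.
class SkillRegistry
  def initialize
    @skills = {}
  end

  def install(name, &handler)
    @skills[name] = handler
  end

  def names
    @skills.keys
  end

  def invoke(name, task)
    handler = @skills.fetch(name) { raise ArgumentError, "unknown skill: #{name}" }
    handler.call(task)
  end
end

# The schema the model sees is one fixed entry; installed Skills do not change it.
INVOKE_SKILL_SCHEMA = {
  name: "invoke_skill",
  description: "Delegate a task to an installed Skill.",
  parameters: { skill: "string", task: "string" }
}.freeze

registry = SkillRegistry.new
registry.install("code-explorer") { |task| "explored: #{task}" }
registry.install("recall-memory") { |task| "recalled: #{task}" }
result = registry.invoke("code-explorer", "map the auth module")
```

In this sketch the list of available Skill names would reach the model through the conversation (for example as a session context block) rather than through the tool schema, which is what keeps the schema constant.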

What We Saved

Capability removed	Replacement	Tools saved
Codebase analysis toolset	`code-explorer` Skill	~5
Memory read/write tools	`recall-memory` Skill	~3
Browser automation (split across multiple action tools)	Single `browser` tool covers all	~8
Sub-agent orchestration tools	`invoke_skill` unified entry	~6
Scheduled task management	`cron-task-creator` Skill	~4

Core tool list stays stable. Cache breakpoints don't move. Tool count isn't a competitive advantage. Task completion rate is.

Comparison

All these agents share the same core capabilities. The real differences are tool count and how each one handles tool growth.

Capability	OpenClacky	Claude Code	OpenClaw	Hermes Agent
Built-in tool count	16	40+	23 (with plugins/MCP/channels: 30–50)	52
Sub-agent invocation	✅	✅	✅	✅
Channel integration	✅	✅	✅	✅
Web Search / Fetch	✅	✅	✅	✅
Browser automation	✅	✅	✅	✅
Hot-extend capabilities (no restart)	✅ Skill hot-reload	❌	❌	❌
Handling tool bloat	No bloat by design	ToolSearchTool (auto-enabled beyond ~30)	Lazy-load via extensions	Lazy loading

Claude Code's ToolSearchTool is a clever engineering solution: once tool count exceeds ~30, the model switches to on-demand tool discovery instead of loading all schemas upfront. That avoids context explosion — but it's a fix applied after the bloat already exists. OpenClacky's approach is to never let the tool list bloat in the first place.

III. FuzzySearchReplace

Code editing is the most frequent operation in a Coding Agent, and the most failure-prone. LLM-generated old_string values have an inherent problem: escape characters, indentation, trailing whitespace — these details slip during generation, causing exact match failures and retry loops.

The Problem

Each edit failure requires the agent to re-interpret the error, generate a new old_string, and try again. In long tasks, these failures accumulate into significant token waste and time cost.

Our Decision

StringMatcher implements a 5-layer fallback matching strategy:

Layer	Strategy	Addresses
1	Exact match	Normal case
2	Trim whitespace	Extra newlines or spaces in generated output
3	Unescape escape chars	`\n` `\t` `\uXXXX` not expanded
4	Trim + Unescape combined	Both issues present
5	Line-by-line smart match (Tab/Space tolerance)	Inconsistent indentation style

Most cases resolve at layers 2-3. Despite the name, this isn't fuzzy matching in the approximate-similarity sense: each layer targets a specific LLM output failure mode, and matches are deterministic.

IV. Self-Evolving Parsers & Scripts

PDF, Word, Excel, and image processing are foundational capabilities for any agent. But the diversity of local environments makes this genuinely hard: different operating systems, different versions, different dependencies, and always some unexpected format lurking.

The usual choice is between two options: drop these capabilities entirely, or raise the installation bar by requiring users to pre-install all dependencies. Neither is a good answer.

Our Decision

We took a third path: let the agent evolve its own tools.

The gem ships default parser scripts, which are copied to the user directory ~/.clacky/parsers/ on first run. From that point on, the agent always uses this user-space version.

When parsing fails, the agent locates the relevant script and fixes it directly. The fix takes effect on the next run. Shell scripts in ~/.clacky/scripts/ use the same mechanism.

The result: zero installation friction, and it gets better with use. The first encounter with an unusual format might fail — but once the agent self-repairs, it's fixed permanently. No waiting for a version bump, no manual intervention required.
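A minimal sketch of the copy-once behavior, assuming the bundled parsers sit in a flat directory. The helper name `ensure_user_parsers` is hypothetical, and both directories are passed in as arguments so the sketch has no fixed paths. Unlike a strict first-run copy, it fills in any missing file on every call, which is an assumption of this sketch.

```ruby
require "fileutils"

# Hypothetical helper: copies bundled parser scripts into the user directory
# without ever overwriting a file that already exists there.
def ensure_user_parsers(bundled_dir, user_dir)
  FileUtils.mkdir_p(user_dir)
  Dir.children(bundled_dir).each do |name|
    target = File.join(user_dir, name)
    # Never overwrite: the user-space copy may contain fixes the agent made.
    FileUtils.cp(File.join(bundled_dir, name), target) unless File.exist?(target)
  end
  user_dir
end

# Example call (paths are illustrative):
#   ensure_user_parsers("/path/to/gem/parsers", File.expand_path("~/.clacky/parsers"))
```

The no-overwrite rule is what lets an agent-made fix survive the next run.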

V. Gated Skill Self-Evolution

Automatic Skill creation is a double-edged sword. Without gates, it accumulates low-quality Skills that contradict each other, produce noise during retrieval, and degrade agent judgment — what we call Skill entropy.

Our Decision

Two self-evolution hooks, both with strict trigger conditions:

SkillAutoCreator (distill new Skills from tasks):
- The task was complex: it completed after ≥ 12 iterations
- No existing Skill was invoked during the task (indicating a genuinely new workflow)
- The LLM itself judges the workflow worth capturing as a Skill

SkillReflector (improve existing Skills):
- User explicitly invoked a Skill (not inferred/triggered passively)
- Execution exceeded 5 iterations (indicating the Skill handled a complex task)
- LLM reflects on whether instructions were clear, whether edge cases were missed

Create less, don't create garbage. Every Skill in ~/.clacky/skills/ should be backed by sufficient real-world use.
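The trigger conditions for both hooks reduce to two small predicates. The thresholds below come from the lists above. The `TaskStats` struct, its field names, and the `llm_approves` flag are hypothetical stand-ins for however the real hooks track this state.

```ruby
# Thresholds are from the text; the struct and field names are hypothetical.
TaskStats = Struct.new(:iterations, :skills_invoked, :explicit_skill, keyword_init: true)

# SkillAutoCreator gate: long task, no existing Skill used, and the LLM agrees.
def should_create_skill?(stats, llm_approves:)
  stats.iterations >= 12 && stats.skills_invoked.empty? && llm_approves
end

# SkillReflector gate: the user explicitly invoked a Skill and it ran more than 5 iterations.
def should_reflect_on_skill?(stats)
  !stats.explicit_skill.nil? && stats.iterations > 5
end

long_new_task = TaskStats.new(iterations: 14, skills_invoked: [], explicit_skill: nil)
should_create_skill?(long_new_task, llm_approves: true)
```

The cheap numeric checks run first, so the LLM judgment is only needed for tasks that already passed them.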

VI. Whitelisted Memory Writes

Long-term memory matters for session continuity. But memory writes have two costs: an additional LLM call, and a quality risk, because low-quality memories become context noise in future sessions.

Our Decision

Memory writes require two conditions simultaneously:

- Session turn count ≥ 10 (short conversations rarely produce information worth long-term retention)
- Clear high-value signal present: an explicit user decision, newly established persistent context, or a recurring failure pattern

When conditions aren't met, nothing gets written at session end. When they are, the LLM decides what to retain, which file to update, and how to merge with existing memories.
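The write gate itself is a conjunction of the two conditions. In this sketch the signal names are hypothetical symbols, and how each signal is detected is left out entirely.

```ruby
# Signal names are hypothetical labels for the three signal types named in the text.
HIGH_VALUE_SIGNALS = %i[explicit_user_decision persistent_context recurring_failure].freeze

def memory_write_allowed?(turn_count:, signals:)
  turn_count >= 10 && (signals & HIGH_VALUE_SIGNALS).any?
end

memory_write_allowed?(turn_count: 14, signals: [:explicit_user_decision])
```

Only when this returns true would the extra LLM call that decides what to retain be made.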

Every entry in ~/.clacky/memories/ has passed this filter, so retrieval has a higher signal-to-noise ratio and the agent's judgment is more accurate.

VII. No Gateway Design

Many agent tools keep a background service process running at all times — even after you've stopped using the agent, it's still listening on a port, still reachable, still potentially responding to requests you didn't ask for.

We think that's wrong. When you say stop, it should actually stop.

OpenClacky doesn't keep any extra background processes running. Close the agent, the port is released, nothing remains listening on your system. That's the complete process control users deserve.

Seamless Upgrades

No persistent process usually comes with a cost: upgrades require downtime. If a user clicks upgrade in the Web UI and the service drops out mid-restart, the experience breaks.

Our solution uses the same pattern as nginx and other production-grade servers: Master-Worker separation with socket inheritance.

The Master process holds the TCP port for as long as the agent is running but handles no requests itself. The Worker process inherits the Master's socket and runs the actual HTTP service. On upgrade, the Master starts a new Worker with the updated version, waits for it to become ready, then signals the old Worker to exit gracefully. The port stays bound throughout, so no connections are dropped.

From the user's perspective: click upgrade in the Web UI, the new version takes effect silently, the page never disconnects.
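The handover sequence can be sketched with `fork` on a Unix system. This is a simplified illustration, not OpenClacky's code: the `Master` class is hypothetical, the worker speaks a one-line protocol instead of HTTP, and the "version" is only a label, whereas a real upgrade would start the worker from the newly installed code.

```ruby
require "socket"

# Hypothetical sketch of socket inheritance via fork (Unix only).
class Master
  attr_reader :port

  def initialize(port = 0)
    @server = TCPServer.new("127.0.0.1", port) # bound once, never rebound
    @port = @server.addr[1]
    @worker = nil
  end

  # Fork a worker that inherits @server; return its pid once it reports ready.
  def spawn_worker(version)
    ready_r, ready_w = IO.pipe
    pid = fork do
      ready_r.close
      stop = false
      trap("TERM") { stop = true }
      ready_w.puts("ready")
      ready_w.close
      until stop
        next unless IO.select([@server], nil, nil, 0.1)
        begin
          client = @server.accept_nonblock
        rescue IO::WaitReadable
          next # another worker took this connection
        end
        client.puts("served by #{version}")
        client.close
      end
      exit!(0)
    end
    ready_w.close
    ready_r.gets # block until the new worker is ready
    ready_r.close
    pid
  end

  # Start the new worker first, then retire the old one; the port never closes.
  def upgrade(version)
    new_pid = spawn_worker(version)
    stop_worker
    @worker = new_pid
  end

  def stop_worker
    return unless @worker
    Process.kill("TERM", @worker)
    Process.wait(@worker)
    @worker = nil
  end

  def shutdown
    stop_worker
    @server.close
  end
end

# Example:
#   master = Master.new
#   master.upgrade("v1")   # first start
#   master.upgrade("v2")   # zero-downtime swap
#   master.shutdown        # port released
```

Between the new worker reporting ready and the old worker exiting, both accept from the same listening socket, which is why a connection arriving during the swap is still served.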

Summary

These seven decisions share a common underlying logic: don't pay for capability with complexity; earn efficiency through precision.

Most agent frameworks take an additive approach — more tools, more memory, more orchestration layers — then patch the bloat: add a search layer when tools overflow, add a filter when memory gets noisy, add a compressor when context gets too long. Every patch layer adds token cost and new failure modes.

Our choice was to set explicit boundaries at each layer from the start:

Cache layer: Don't accept default cache invalidation. The system prompt is built once and never rebuilt. The dual-mark sliding window keeps the conversation prefix cached from round to round. Insert-then-Compress ensures compression itself hits the cache. Near 100% hit rate, predictable cost per call.
Tool layer: 16 tools — not from laziness, but because tool count directly affects schema noise and model judgment. Extension happens through invoke_skill, keeping the core tool set clean.
Fault tolerance layer: FuzzySearchReplace with 5 fallback levels. Not a trick — an honest acknowledgment that string matching fails in real codebases, with the failure path designed in from the start.
Knowledge layer: Memory writes require whitelist signals. Skill evolution requires iteration thresholds. Parsers self-repair instead of waiting for a release. Every piece of information that enters the long-term knowledge base has been validated by real use. High signal density, low noise.
Reliability layer: Time Machine rolls back both file state and conversation state together — users can experiment freely. No Gateway means stop is actually stop, no hidden services. Master-Worker separation makes upgrades seamless, port held throughout.

None of these decisions are clever tricks. They're the honest answer to the question "how should an AI Agent be designed" — earned through two years of getting it wrong first.

Fully open source, MIT license: github.com/clacky-ai/openclacky
