Problem to Solve

Most developers' first encounter with AI is a single prompt, a single response. It feels powerful — until the task gets complex. Ask an AI to research three competitors, synthesize the findings, and format them as a report, and a single context window starts to feel very small. This is the problem subagents solve.

Demo

In the demo above, I'm invoking the agent explicitly by name. In orchestrated workflows, Claude can invoke multiple agents like this automatically — in parallel or sequentially — based on the task structure.

Skills GitHub Repo - .claude/agents/code-quality-reviewer.md

What Is a Subagent?

A subagent is an AI instance invoked by an orchestrating AI to handle a specific subtask within a larger workflow. In multi-agent systems broadly, whether built on Claude, GPT, Gemini, or open-source models, the core pattern is the same: rather than a single model doing everything sequentially, an orchestrator breaks work into pieces and delegates them, much like a technical lead assigning work to specialists rather than writing every function themselves.

According to Anthropic's Claude Agent SDK documentation, subagents serve two core purposes: parallelization (running multiple tasks simultaneously) and context isolation (each subagent uses its own context window, returning only relevant results to the orchestrator rather than its full context).[^1]

Within their assigned scope, subagents are active execution units — they can browse the web, execute code, read and write files, and call external APIs. They don't just reason; they act.

How a Subagent Workflow Works

The orchestrator doesn't do the heavy lifting — it coordinates. Each subagent receives a focused prompt with a clear objective, output format, and tool access, then returns a concise result. The orchestrator aggregates those results into the final deliverable.

Anthropic's internal research system uses exactly this pattern: a lead agent spawns subagents to explore different aspects of a query in parallel, then compiles their findings into a coherent answer. Their evaluations found that a multi-agent setup (Claude Opus 4 as the lead agent with Claude Sonnet 4 subagents) outperformed single-agent Claude Opus 4 by 90.2% on an internal research evaluation.[^2]

Orchestration Patterns

Not all subagent workflows are structured the same way. Three patterns cover most real-world cases:

Parallel fan-out — Independent subtasks launch simultaneously. Best for tasks like analyzing multiple documents at once.
Sequential pipeline — Each subagent's output feeds the next. Best when there's a dependency chain (research → draft → edit → format).
Hierarchical delegation — A subagent itself becomes an orchestrator for deeper subtasks. Powerful, but adds coordination complexity.

Choosing the wrong pattern is a common mistake. Parallelizing a sequential task adds overhead without benefit; sequentializing an independent task wastes time.
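To make the difference between the first two patterns concrete, here is a minimal Python sketch. The `run_subagent` function is a hypothetical stand-in that just echoes its input; in a real system it would call whatever agent runtime you use.

```python
import asyncio

async def run_subagent(name: str, prompt: str) -> str:
    """Hypothetical stand-in for a real subagent call; it only echoes a short result."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"[{name}] {prompt}"

async def fan_out(prompts: dict[str, str]) -> list[str]:
    """Parallel fan-out: independent subtasks are launched together."""
    tasks = [run_subagent(name, prompt) for name, prompt in prompts.items()]
    return await asyncio.gather(*tasks)  # results come back in input order

async def pipeline(stages: list[str], initial: str) -> str:
    """Sequential pipeline: each stage consumes the previous stage's output."""
    result = initial
    for stage in stages:
        result = await run_subagent(stage, result)
    return result

parallel_results = asyncio.run(fan_out({"doc-a": "summarize A", "doc-b": "summarize B"}))
pipeline_result = asyncio.run(pipeline(["research", "draft", "edit"], "topic"))
```

In the fan-out, no subtask sees another's output. In the pipeline, the final string shows each stage wrapping the one before it, which is exactly the dependency chain that makes parallelizing it pointless.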

Creating Subagents with Claude Code CLI

Claude Code gives you two ways to create subagents: interactively via the /agents command, or manually as markdown files. Both result in the same thing — a .md file in a .claude/agents/ directory.

The Interactive Way

/agents create

This walks you through a guided setup: name, description, tools, model, and scope. At the end, it saves the file, and the agent is available immediately.

The Manual Way

Create a markdown file directly — the frontmatter defines behavior, the body is the system prompt:

---
name: security-reviewer
description: "Expert security reviewer. Use PROACTIVELY after any changes to auth, data handling, or API endpoints."
tools: Read, Grep, Glob
model: haiku
permissionMode: plan
---

You are a senior security engineer reviewing code for vulnerabilities.
When invoked:
1. Identify recently changed files
2. Analyze for OWASP Top 10 vulnerabilities
3. Check for secrets, SQL injection, and hardcoded credentials
4. Report findings with severity levels and remediation steps

Where the File Lives

Subagent scope is determined by which directory the file is placed in:

| Scope | Path | When to use |
| --- | --- | --- |
| Project | .claude/agents/ in project root | Team-shared agents, commit to version control |
| User | ~/.claude/agents/ in home dir | Personal agents available across all projects |

Project scope is the recommended default — it makes subagent definitions shareable via version control. Use user scope for general-purpose agents you want available everywhere, regardless of which repo you're in.

Best Practices

Write descriptions that trigger correctly. Claude uses the description field to decide when to invoke a subagent automatically. Be specific and include PROACTIVELY if you want it auto-triggered — for example: "Use PROACTIVELY after any changes to authentication or data handling."
Restrict tools intentionally. The tools field restricts what the agent can do — a security auditor only needs Read, Grep, and Glob, and has no business writing files. That restriction is worth being explicit about.
Match model to task complexity. Route subagent exploration to cheaper, faster models like Haiku and reserve Opus for genuine architectural reasoning. A read-only code scanner doesn't need the same model as an agent writing production code.
Keep system prompts focused. A subagent with a narrow, well-defined role outperforms a generalist one. If the prompt starts covering many different concerns, split it into two agents.

The Practical Challenge: Context Management

Each subagent starts fresh with no shared memory. This means:

The orchestrator must craft every subagent prompt with all the context it needs to succeed
Subagent outputs must be concise enough to fit back into the orchestrator's context alongside everything else
If outputs are large, the orchestrator must summarize before aggregating

Good subagent design is largely information architecture: what does each agent need to know, what must it produce, and how does that output flow back into the whole.
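As a rough illustration of that information flow, the sketch below packages a self-contained prompt for a fresh subagent and caps the size of what comes back. Both function names are made up for this example, and the truncation is a crude placeholder for real summarization.

```python
def build_subagent_prompt(objective: str, context: list[str], output_format: str) -> str:
    """Bundle everything a fresh subagent needs, since it shares no memory with the orchestrator."""
    context_block = "\n".join(f"- {item}" for item in context)
    return (
        f"Objective: {objective}\n"
        f"Context:\n{context_block}\n"
        f"Output format: {output_format}"
    )

def fit_to_budget(output: str, max_chars: int) -> str:
    """Crude stand-in for summarization: keep the first max_chars characters and mark the cut."""
    if len(output) <= max_chars:
        return output
    return output[:max_chars] + " [truncated]"
```

A real orchestrator would likely summarize with a model rather than cut by character count, but the shape is the same: everything the subagent needs goes in, and only a bounded result comes out.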

Safety Considerations

When subagents can take real-world actions, safety boundaries matter more than in single-agent systems:

Least privilege — Give each subagent only the tools it actually needs. A research agent doesn't need write access to a production database.
Output validation — Don't blindly pass subagent outputs downstream. Even lightweight sanity-checking reduces blast radius.
Prompt injection — Subagents that browse the web or read external files can encounter content designed to manipulate their behavior. This is a real attack surface in agentic systems.
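Output validation can start very small. The sketch below assumes a subagent was asked to return findings as a JSON list with exactly three keys; the schema and the severity values are illustrative assumptions, not part of any Claude Code format.

```python
import json

ALLOWED_SEVERITIES = {"low", "medium", "high"}  # illustrative, not from any spec

def validate_findings(raw: str) -> list[dict]:
    """Parse a subagent's JSON output and reject anything off-schema."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on malformed JSON
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of findings")
    for item in data:
        if not isinstance(item, dict) or set(item) != {"file", "severity", "summary"}:
            raise ValueError(f"unexpected finding shape: {item!r}")
        if item["severity"] not in ALLOWED_SEVERITIES:
            raise ValueError(f"unknown severity: {item['severity']!r}")
    return data
```

A check like this will not stop a determined prompt injection, but it does keep free-form or unexpected text from flowing straight into the next step.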

When to Use Subagents

| Use subagents when… | Avoid subagents when… |
| --- | --- |
| Task has clearly separable subtasks | Task is straightforward for one agent |
| Parallel execution saves meaningful time | Subtasks share too much state to delegate |
| Total work exceeds one context window | Coordination overhead exceeds the benefit |
| Different subtasks need different tools | You're still prototyping: stay simple first |

Start with a single-agent approach and introduce orchestration when it genuinely starts to strain. Complexity has a cost.

Subagents vs. Skills: A Quick Note

These two terms sometimes get conflated, but they operate at completely different layers. A skill is a passive instruction document — a markdown file Claude reads before a task to understand best practices, available libraries, and output conventions. A subagent is an active execution unit that runs, uses tools, and returns results.

The honest relationship: a subagent might read a skill before doing its work. One shapes knowledge; the other executes.

Final Thoughts

The broader Claude stack can be conceptualized as five layers: MCP for connectivity, Skills for task-specific knowledge, Agent as the primary worker, Subagents as parallel independent workers, and Agent Teams for coordination.[^3] These building blocks are shipping in rapid succession, and the pattern is maturing fast.

For developers just entering this space: you don't need to build full orchestration systems on day one. But understanding the pattern (how delegation works, what subagents can and can't do, where the sharp edges are) will shape how you think about AI architecture from the start.

Subagents aren't a feature. They're a shift in how we think about what AI can be tasked with doing.

References

[^1]: Anthropic Engineering. Building agents with the Claude Agent SDK. -> https://www.anthropic.com/engineering/building-agents-with-the-claude-agent-sdk
[^2]: Anthropic Engineering. How we built our multi-agent research system. -> https://www.anthropic.com/engineering/multi-agent-research-system
[^3]: Winbuzzer. Anthropic Shows How to Scale Claude Code with Subagents and MCP. -> https://winbuzzer.com/2026/03/24/anthropic-claude-code-subagent-mcp-advanced-patterns-xcxwbn/
[^4]: Anthropic. Create custom subagents — Claude Code Docs. -> https://code.claude.com/docs/en/sub-agents

If you have reached this point, I have made a satisfactory effort to keep you reading. Please be kind enough to leave any comments or share any corrections.


It always starts with “just one integration”.

You want your AI agent to send a message to Slack. So you wire it up. A bit of custom code, some API calls, done.

Then someone asks for GitHub access. Then Jira. Then your internal database. Then Notion.

Before you realize it, you’re not building an AI system anymore; you’re maintaining a web of fragile integrations.

Every new tool means new code. Every update breaks something. Every credential becomes a security risk.

If you have 10 agents and 20 tools, you’re suddenly dealing with 200 possible connections.

This is what Anthropic called the N×M problem.

And that’s exactly the mess MCP (Model Context Protocol) was designed to fix.
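The arithmetic behind the N×M problem is easy to check. The sketch below compares point-to-point integrations with the idealized case where each agent and each tool implements a shared protocol once; the function names are made up for illustration.

```python
def point_to_point(agents: int, tools: int) -> int:
    """Every agent needs its own integration with every tool."""
    return agents * tools

def via_shared_protocol(agents: int, tools: int) -> int:
    """Idealized case: each agent and each tool implements the standard once."""
    return agents + tools

custom = point_to_point(10, 20)        # 200 bespoke integrations
shared = via_shared_protocol(10, 20)   # 30 protocol implementations
```

Real deployments are messier than N + M, but the gap between multiplicative and additive growth is the core of the argument.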

What Is MCP (Model Context Protocol)?

At its core, MCP is simple, and that’s why it matters.

MCP is an open standard that defines how AI agents connect to and use tools.

Think of it like USB-C for AI.

From fragmented integrations to a unified interface — MCP standardizes how AI agents connect to tools through MCP servers, replacing N×M integrations with a single protocol (USB-C analogy)

This is the shift MCP introduces: from point-to-point integrations to a shared, standardized interface.

You don’t build a custom cable for every device anymore. You define one standard interface, and everything plugs into it.

That’s what MCP does for AI systems.

Instead of writing custom integrations for every tool, you expose tools through something called an MCP server.

An MCP server is just a program that describes what a tool can do, in a structured, standardized way.

For example:

A Slack MCP server might expose:
- send_message
- search_messages
A GitHub MCP server might expose:
- list_repos
- create_pull_request

Once that’s done, any MCP-compatible AI can discover and use those tools without writing new integration code.
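To show the discover-then-call shape without pulling in any SDK, here is a toy Python registry. It is not the MCP wire protocol or any official library; `ToyToolServer` and its methods are invented for this sketch, and the tool bodies just return strings.

```python
from typing import Callable

class ToyToolServer:
    """Illustrative only: mimics discovery and invocation, not the real MCP protocol."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._tools: dict[str, tuple[str, Callable[..., str]]] = {}

    def tool(self, description: str):
        """Decorator that registers a function under its own name."""
        def register(fn: Callable[..., str]) -> Callable[..., str]:
            self._tools[fn.__name__] = (description, fn)
            return fn
        return register

    def list_tools(self) -> dict[str, str]:
        """Discovery: tool name -> human-readable description."""
        return {name: desc for name, (desc, _) in self._tools.items()}

    def call(self, tool_name: str, **arguments) -> str:
        """Invocation: look the tool up by name and run it."""
        if tool_name not in self._tools:
            raise KeyError(f"{self.name} has no tool named {tool_name!r}")
        _, fn = self._tools[tool_name]
        return fn(**arguments)

slack = ToyToolServer("slack")

@slack.tool("Post a message to a channel")
def send_message(channel: str, text: str) -> str:
    return f"sent to {channel}: {text}"  # a real server would call Slack here

@slack.tool("Search message history")
def search_messages(query: str) -> str:
    return f"results for {query}"
```

A client only needs `list_tools()` and `call()`; it never imports Slack-specific code, which is the property the standard is after.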

That’s the key shift.

You stop building connections manually. You start plugging into a shared ecosystem.

Why MCP Took Off So Fast

MCP didn’t just stay theoretical.

It gained traction quickly because it solves a very real pain engineers were already feeling.

After Anthropic introduced it, other major players followed:

OpenAI
Google DeepMind

And by 2026, it was contributed to the Linux Foundation, which gave it real credibility as an open standard.

That combination, real pain + standardization + adoption, is why MCP is now everywhere.

If you’re building AI systems today, you’re going to run into it.

What MCP Solves (And Why It’s a Big Deal)

MCP solves one specific problem extremely well:

How agents talk to tools.

It standardizes:

Tool discovery (what tools exist?)
Tool capabilities (what can they do?)
Tool invocation (how do I call them?)

That’s it.

And honestly, that’s enough to unlock a lot.

You go from:

“Every integration is custom”

to:

“Every tool speaks the same language”

That alone removes a huge amount of engineering friction.

What MCP Doesn’t Solve (This Is Where Things Break)

This is the part most articles skip.

MCP solves the protocol layer, the language agents and tools use to communicate.

But it doesn’t solve what happens around that communication.

And that’s where things start to fall apart in production.

MCP does not handle:

Authentication at scale (who owns which credentials?)
Access control (which agent can use which tool?)
Observability (what did the agent actually do?)
Security (what if a tool returns malicious output?)
Governance (audit logs, compliance, traceability)

In a demo, that’s fine.

MCP works perfectly in demos because nothing is constrained.

Production systems are defined by constraints: security, cost, and control.

In a real system, that’s a problem.

Because now your agents have direct access to tools without a control layer in between.

That’s not just messy.

It’s risky.

So… Why Does MCP Need a Gateway?

An MCP Gateway is the layer that sits between your agents and your MCP servers.

It doesn’t replace MCP.

It makes MCP usable in production.

MCP standardizes communication. The gateway standardizes control.

Instead of every agent talking directly to every tool, everything goes through a centralized control point.

That’s where things start to get structured.

What an MCP Gateway Actually Adds

Once you introduce a gateway, a few important things change immediately.

1. One entry point instead of many

Agents don’t connect to 10 different tools.

They connect to one gateway.

That alone simplifies architecture more than most teams expect.

2. Centralized authentication

Instead of embedding credentials everywhere, the gateway manages them.

Agents authenticate once. The gateway handles the rest.

3. Real access control (RBAC)

You can define:

Which agents can access which tools
Which teams can use which capabilities

No more “everything can call everything.”
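A deny-by-default check is the simplest form this can take. The roles, the policy table, and the `server.tool` naming below are all hypothetical; real gateways will have their own policy formats.

```python
# Hypothetical policy: agent role -> set of "server.tool" names it may call.
POLICY: dict[str, set[str]] = {
    "compliance-agent": {"github.list_repos", "slack.send_message"},
    "release-agent": {"github.list_repos", "github.create_pull_request"},
}

def is_allowed(agent_role: str, server: str, tool: str) -> bool:
    """Deny by default: unknown roles and unlisted tools are both refused."""
    return f"{server}.{tool}" in POLICY.get(agent_role, set())
```

The useful property is that forgetting to list something fails closed rather than open.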

4. Tool discovery without hardcoding

Agents don’t need to know tools upfront.

They can discover available tools dynamically through the gateway.

That removes a ton of brittle logic.

5. Guardrails on every tool call

Every request and response can be inspected.

That means you can:

Block unsafe inputs
Filter sensitive outputs
Detect prompt injection patterns

Before anything causes damage.

6. Full audit trail

Every action is logged.

Every tool call is traceable.

You can answer:

“What exactly did this agent do?”

Without guessing.

The Piece Most Teams Don’t Think About: Virtual MCP Servers

This is where things get more interesting.

Even with MCP, exposing tools directly can be dangerous.

You don’t always want to expose everything a tool can do.

For example:

Your GitHub MCP server might support:

creating PRs
deleting repos
modifying configs

You probably don’t want an agent calling all of those.

This is where Virtual MCP Servers come in.

Instead of exposing raw tools, you create a curated layer.

In practice, this doesn’t look like raw tool endpoints; it looks like a managed layer where MCP servers are grouped and selectively exposed.

Managing MCP servers in a production environment — grouping tools, configuring access, and creating virtual MCP layers for controlled exposure (source: TrueFoundry platform)

You define:

Which tools are allowed
Which actions are safe
Which capabilities are hidden

And you expose only that to your agents.

No new deployments. No custom code.

Just controlled exposure.
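Conceptually, a virtual server is an allowlist over an upstream tool catalog. The sketch below models the catalog as a plain dictionary; the function name and tool names are assumptions for illustration, not any platform's API.

```python
def make_virtual_server(upstream_tools: dict[str, str], allowed: set[str]) -> dict[str, str]:
    """Return only the allowlisted subset of an upstream server's tool catalog."""
    unknown = allowed - set(upstream_tools)
    if unknown:
        raise ValueError(f"allowlist names tools the upstream lacks: {sorted(unknown)}")
    return {name: desc for name, desc in upstream_tools.items() if name in allowed}

github_tools = {
    "create_pull_request": "Open a pull request",
    "delete_repo": "Delete a repository",
    "update_config": "Modify repository configuration",
}
agent_view = make_virtual_server(github_tools, allowed={"create_pull_request"})
```

From the agent's side, `delete_repo` and `update_config` simply do not exist, so there is nothing to misuse.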

This ends up being one of those features teams only realize they need after something goes wrong.

What This Looks Like in Practice

Let’s make this concrete.

Imagine a compliance automation agent.

It needs to:

Read changes from GitHub
Store a diff in MongoDB
Create a Jira ticket
Notify a team on Slack

Without structure, that’s four different integrations, four different auth systems, and zero visibility.

With MCP, those tools are standardized.

With an MCP Gateway, they’re controlled.

The agent connects to one endpoint.

The gateway:

Authenticates each step
Routes requests to the right tool
Logs every action
Applies guardrails

If something looks risky, for example, a diff that touches sensitive files, the gateway can pause execution and require approval.

That’s the difference.

You’re not just executing tasks. You’re managing them.

Where TrueFoundry Fits In

In the context of MCP, this is exactly the layer platforms like TrueFoundry are built for.

In practice, you don’t want to manage three separate concerns:

LLM routing and cost control (AI Gateway)
Tool access via MCP (MCP Gateway)
Agent execution and workflows (Agent Gateway)

You want a single control plane that handles all of them together.

That’s the shift TrueFoundry makes. It unifies these layers into one gateway architecture, so you’re not stitching together governance, observability, and security across multiple systems.

In practice, this unified gateway layer connects both models and tools under a single control plane.

Unified gateway architecture connecting applications to both LLM providers and MCP-based tools through a centralized control plane for routing, governance, and observability (Source: TrueFoundry website)

MCP standardizes communication. The gateway standardizes control.

Instead of scattered logic and duplicated integrations, everything runs through a centralized layer where:

LLM access is managed
Tool access (via MCP) is governed
Agent workflows are observable

All in one place.

It also brings the enterprise guarantees most teams eventually need:

Recognized in the 2026 Gartner® Market Guide for AI Gateways
Processes 10B+ requests per month
Handles 350+ RPS on a single vCPU with sub-3ms latency
Supports VPC, on-prem, air-gapped, and multi-cloud deployments
Compliant with SOC 2, HIPAA, GDPR, ITAR, and EU AI Act
Trusted by enterprises including Siemens Healthineers, NVIDIA, Resmed, and Automation Anywhere

The important part isn’t just the numbers.

It’s the idea of centralized control across the entire AI stack, where protocols like MCP handle communication, and a unified gateway ensures everything around that communication is secure, observable, and governed.

The Shift Most Teams Don’t See Coming

At first, MCP feels like the solution.

And it is, for a specific problem.

But once you move beyond a prototype, the challenge changes.

It’s no longer:

“How do I connect an agent to a tool?”

It becomes:

“How do I control, secure, and observe everything that happens between them?”

That’s not a protocol problem anymore.

That’s an infrastructure problem.

And that’s exactly where the gateway comes in.

Final Thoughts

MCP solves something real.

It standardizes how agents talk to tools, and that alone removes a massive amount of complexity.

But it doesn’t solve what happens around that interaction.

That’s where things get messy.

An MCP Gateway is what brings structure back:

Control over access
Visibility into behavior
Guardrails around execution

If you’re still experimenting, MCP alone might be enough.

But the moment your system starts scaling, more agents, more tools, more risk, you’ll feel the gap.

That’s the point where a gateway stops being optional.

You can try TrueFoundry free, no credit card required, and deploy it in your own cloud in under 10 minutes. It’s a practical way to see how a unified gateway can bring control, observability, and safety to MCP-based systems without slowing your team down.

Made with 💙 by Hadil Ben Abdallah

“Six agents” here means one orchestrator (Zoe) plus five specialist agents. Six ACP coding experts run as concurrent implementation workers — not counted in that headline number.

Your Day Has Been Taken Over

Overnight, the trading agent ships the prior US session wrap-up. By morning, the macro analyst has the pre-market brief ready. The butler has pushed weather, schedule, and to-dos. AINews (AI Sentinel) has scanned GitHub Trending, arXiv’s latest papers, and 100+ sources — 18+ curated items ranked by importance. Content (Content Strategist) is tracking trending topics across 50+ platforms.

Here’s what matters most to me — automatic tracking of AI dynamics and tech trends. After discovering valuable projects or papers, the system doesn’t just push news — it evaluates impact on our systems and provides P0/P1/P2 action recommendations. Valuable discoveries enter Zoe’s Tech Radar (Zoe is the CTO Agent), going through evaluation → decision → delegated coding implementation.

60 cron tasks run automatically every day (from a 3 AM backup to an 11:45 PM reflection). Agents are evolving on their own: mistakes are remembered and recurrence rates drop significantly. These aren’t rules I wrote; they come from autonomous iteration from .learnings/ to MEMORY.md.

System: 1 orchestrator (Zoe) + 5 specialized agents (AINews, Trading, Macro, Content, Butler) + 6 ACP coding experts + 60 cron tasks + 100+ Skills + ~30 configured model profiles + 23 automatic recoveries in two weeks.

Note: Metrics based on February-March 2026 monitoring. Individual results may vary.

System Architecture

┌─────────────────────────────────────────────┐
│              User (Human)                   │
│    Requirements + Key Node Approval         │
└─────────────────┬───────────────────────────┘
                  │
         ┌────────▼────────┐
         │   Zoe (CTO)     │
         │  3x Daily Check │
         └────────┬────────┘
                  │
    ┌─────────────┼─────────────┐
    │             │             │
┌───▼───┐   ┌────▼────┐   ┌───▼───┐
│AINews │   │ Trading │   │ Macro │
└───┬───┘   └────┬────┘   └───┬───┘
    │            │            │
    └────────────┼────────────┘
                 │
        ┌────────▼────────┐
        │ Content + Butler│
        └────────┬────────┘
                 │
        ┌────────▼────────┐
        │   Event Bus     │
        │ + Shared Context│
        └────────┬────────┘
                 │
        ┌────────▼────────┐
        │ ACP Coding      │
        │ (6 concurrent)  │
        └─────────────────┘

Key Design Decisions:

Agents Evolving Autonomously

Designed protocols — Zoe diagnosed communication issues, designed three-state protocol (request → confirmed → final, with silent as the default "no news is good news" state), solidified into AGENTS.md
Self-developed Skills — Content researched ways to make drafts sound less generically LLM-written (“de-AI” polish), wrote Skills, published to ClawHub (shared repository)
Strategy roundtables — Macro + Trading produce weekly reports with data snapshots, position recommendations, stop-loss discipline
Task Watcher — Zoe designed cron-level Task Callback Event Bus for async monitoring

My role: Set up framework, establish constraints, confirm direction. Requirement discovery, solution research, protocol design, implementation — all done by agents.

Team: 1+5+6 Formation

Zoe (CTO / Chief Orchestrator)

3 daily inspections (10:00/14:00/22:00 PT): cron execution, disk usage, session health, Chrome DevTools Protocol (CDP) leak checks, .learnings/ pending, shared-context/ timestamps.

Weekly: Analyze each agent’s MEMORY.md, execute layered compression.

Key capability: Solution design — three-state protocol, Task Watcher, Communication Guardrail framework, all designed autonomously.

AINews (AI Sentinel) — Intelligence Hub

Collects from 100+ sources daily: GitHub Trending, arXiv, RSS, HackerNews, Reddit. 7 cron tasks: morning brief (08:30), midday paper (12:00), evening trends (20:00).

Critical capability: Proactive tech impact evaluation. Discovered ReMe framework → proposed to Zoe → I confirmed → agents executed.

Toolchain: github_trending.py, rss_aggregator.py, arxiv_papers.py, Tavily, agent-browser. Anti-hallucination: every item MUST have a URL, reachability self-check, and unverifiable items labeled single-source.

Trading (Quantitative Analyst)

21 cron tasks (densest load). 20 quant tools, 15 Skills (68K+ lines), 65/35 scoring (tool/AI). Covers US stocks + commodities + crypto.

Four-step framework: Macro factors → scoring (technical 25% / flow 30% / fundamentals 10% / sentiment 20% / market 15%) → cross-check (sanity-check vs. macro and flow) → target + score + stop-loss + confidence.

Not financial advice — automated research output only; you are responsible for any real-money decisions.

Hard rules (system policy, not investment advice): no entry without a defined stop, never fabricate data, confidence <60% = “wait.”
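The weighting and the two mechanical hard rules are simple enough to sketch. The weights come from the framework above; the 0-100 scale for each factor, the function names, and the return strings are assumptions made for this example, not the actual Trading agent code.

```python
from typing import Optional

# Percent weights from the four-step framework; they sum to 100.
WEIGHTS = {"technical": 25, "flow": 30, "fundamentals": 10, "sentiment": 20, "market": 15}

def composite_score(components: dict[str, float]) -> float:
    """Weighted average of factor scores, each assumed to be on a 0-100 scale."""
    if set(components) != set(WEIGHTS):
        raise ValueError("components must match the five weighted factors exactly")
    return sum(WEIGHTS[name] * value for name, value in components.items()) / 100

def decide(confidence_pct: float, stop_loss: Optional[float]) -> str:
    """Apply the hard rules: no stop means no entry, and low confidence means wait."""
    if stop_loss is None:
        return "reject: no defined stop"
    if confidence_pct < 60:
        return "wait"
    return "eligible"

score = composite_score(
    {"technical": 80, "flow": 70, "fundamentals": 50, "sentiment": 60, "market": 90}
)
```

Keeping the rules as plain code rather than prompt text means they cannot be "selectively complied with" when a session gets noisy.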

Macro (Chief Economist)

9 cron tasks: Morning (07:50) → Midday (12:30) → Evening (18:00) → US pre-market (22:00) → morning digest of the prior US session (05:20 PT) — scheduled after the cash close, not at the closing bell. Sunday weekly review → Trading references for market review.

Discipline: Cite sources, distinguish facts vs judgments, mark confidence (high >70% / medium 50–70% / low <50%), propose counter-arguments.

Real case: Iran tension → traditional: “gold rises” → actual: oil +14%, gold -5%. Macro: “inflation logic dominates, not safe haven.” Saved to MEMORY.md.

Content (Content Strategist)

9 cron tasks: Research (09:00, 50+ platforms) → Ideate (10:30, consume AINews) → Write (14:00, score drafts) → Reflect (22:10).

Autonomous evolution: Discovered content too “AI-flavored” → researched humanizing / de-generic copy tools → wrote Skills → published to ClawHub.

Five-Basket Radar: AI/Tech (≤40%), Product/Startup, Solopreneur, Investment/Macro, Social/International. 40% AI cap self-imposed during reflection.

Butler (Life Assistant)

7 cron tasks: Greeting (08:00) → Schedule (08:30) → 5 water reminders (rotating styles) → Health (20:00) → Summary (22:00).

Philosophy: <50 chars per reminder, ≥1.5h interval, 23:00–07:00 emergency only, no pestering if no reply.

ACP Coding Experts

Pi / Claude Code / Codex / OpenCode / Gemini / GPT-4.1-Codex. Max 6 concurrent, 120min TTL — queue or shed load when saturated so you don’t stampede gateways. Analysis agents don’t code — delegated via sessions_spawn.
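One way to enforce a cap and a TTL like this is a small admission controller. The sketch below is not how OpenClaw or sessions_spawn work internally; `WorkerSlots` is a hypothetical helper, and the injectable clock exists only so the behavior can be tested without waiting two hours.

```python
import time
from typing import Callable

class WorkerSlots:
    """Caps concurrent workers and expires any that outlive their TTL."""

    def __init__(self, max_workers: int = 6, ttl_seconds: float = 120 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_workers = max_workers
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._started: dict[str, float] = {}  # worker id -> start time

    def _reap_expired(self) -> None:
        now = self._clock()
        expired = [wid for wid, t0 in self._started.items() if now - t0 >= self.ttl_seconds]
        for wid in expired:
            del self._started[wid]  # a real harness would also terminate the session

    def try_start(self, worker_id: str) -> bool:
        """Return False when saturated so the caller can queue or drop the job."""
        self._reap_expired()
        if len(self._started) >= self.max_workers:
            return False
        self._started[worker_id] = self._clock()
        return True

    def finish(self, worker_id: str) -> None:
        """Free the slot when a worker completes normally."""
        self._started.pop(worker_id, None)
```

Returning `False` instead of blocking leaves the queue-or-shed decision with the caller, which is where the policy belongs.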

Design Lesson

Don’t let analysis agents code directly. Early setup: coding + architect + PM roles. Result: almost no output, high overlap with Zoe + ACP, increased complexity. Cut them all. Zoe handles PM + architect.

Complexity grows fast: pairwise coordination explodes (six specialists ≈ fifteen pairwise handoffs if everyone talks to everyone). Each new agent ≈ half a day debugging conflicts, resource competition, and rule compatibility.
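The pairwise figure is just "n choose 2", which a few lines of Python can confirm. The function name is made up for this example.

```python
from math import comb

def pairwise_links(agents: int) -> int:
    """Number of distinct agent pairs if everyone can talk to everyone."""
    return comb(agents, 2)

links_for_six = pairwise_links(6)    # 15
links_for_seven = pairwise_links(7)  # 21, so a seventh agent adds six new pairs
```

Each added agent brings as many new pairs as there are agents already in the system, which is why the debugging cost per new agent keeps climbing.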

Three Core Engineering Problems

Problem 1: Context Is the Agent’s OS

The Problem: Entropy Always Increases

Without constraints, agent systems deterministically collapse. Agents are processes without an OS: no memory management, no garbage collection, no OOM protection.

Three incidents:

P0: 8 Hours of Paralysis

AINews session: 235K tokens. Gateway compaction → timeout → crash → macOS launchd ThrottleInterval=1 infinite restart loop. All agents offline.

Fix: Clean session → ThrottleInterval 1→10 → idleMinutes 180→30 → execution policy tightened from permissive to allowlist (smaller blast radius; keep the list maintained). All four of these defense lines had been missing.

P1: 3500 Chars → 800 Chars

Trading’s flash report contained data tables. When it exceeded textChunkLimit, OpenClaw auto-compacted it, and the tables were "intelligently compressed" away. AI "help" is a disaster in data-dense scenarios.

P2: Rules Ignored After Bloat

Sessions bloat to 10K+ tokens → agents “selectively comply.” Butler starts doing investment analysis. Trading ignores validation. Critical info drowns in noise.

Solution: Dual-Layer Control

Layer 1: Context Engineering (information architecture)

SOUL.md (front): Identity + hard constraints + decision framework (40–60 lines)
AGENTS.md (after): Operating norms + collaboration protocols
Skills: Via extraDirs on-demand (Trading: 15 Skills, 68K lines on disk—retrieve or inject only the 1–3 relevant fragments per turn, not the whole tree)
shared-context/: Cross-agent state, read via tools
Obsidian: Cold storage, archives output, no inference

Rule wording targets weakest model (GPT-4.1 → Qwen3.5 → Ollama qwen3:8b):

"Suggest not fabricating" → qwen3:8b ignores
"MUST: do not fabricate" → all comply
"MUST + P0 + NON-NEGOTIABLE" → even weak models comply

Write for weakest link.

Layer 2: Harness (framework lifecycle management)

Without Harness → 235K tokens → crash. Without Context Engineering → all piled → rules drowned.

Representative openclaw.json excerpt (field names drift by release—validate against your OpenClaw version before paste-deploying):

{
  "compaction": {
    "mode": "safeguard",
    "memoryFlush": {
      "enabled": true,
      "softThresholdTokens": 40000,
      "prompt": "Distill to memory/YYYY-MM-DD.md. Focus: decisions, state changes, lessons."
    }
  },
  "contextPruning": { "mode": "cache-ttl", "ttl": "6h", "keepLastAssistants": 3 },
  "session": {
    "reset": { "mode": "daily", "atHour": 5, "idleMinutes": 30 },
    "maintenance": { "pruneAfter": "7d", "maxDiskBytes": 104857600 }
  },
  "hooks": { "bootstrap": ["self-improving-agent"] }
}

Cross-session recovery:

New session → SOUL.md + AGENTS.md + MEMORY.md + .learnings/ → memorySearch → shared-context/
= "Knows who, what done, what team doing"

Problem 2: Let Agents Remember and Grow

The Problem: Repeating Mistakes

Trading got BILLBOARD_BUY_AMT wrong 5 times (wrote BUY_AMT). Session reset → lost memory → repeat. User corrects → agent changes → 3 days later same scenario → same error.

Chatbot vs Agent dividing line: Agents learn from mistakes.

Solution: Five-Layer Memory

Autonomous Memory: 6-Step Cycle

Trigger: Operation failed · User corrected · Better approach found
L4 Recording: Write to .learnings/ERRORS.md or LEARNINGS.md
Daily Reflection (22:00): Review .learnings/, Zoe aggregates cross-agent value
PROMOTE: 3+ verifications → MEMORY.md, single → keep observing
L2 Sedimentation: Weekly compression, <3000 tokens
L5 Skill: Generalizable → write as Skills → ClawHub
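The PROMOTE step can be sketched as a simple count. This assumes each recurrence of the same lesson in .learnings/ counts as one verification, which is my reading rather than something the system documents; the function name is hypothetical.

```python
from collections import Counter

PROMOTE_AFTER = 3  # the "3+ verifications" threshold from the cycle

def split_learnings(observations: list[str]) -> tuple[list[str], list[str]]:
    """Return (promote_to_memory, keep_observing) by how often each lesson recurred."""
    counts = Counter(observations)
    promote = sorted(lesson for lesson, n in counts.items() if n >= PROMOTE_AFTER)
    observe = sorted(lesson for lesson, n in counts.items() if n < PROMOTE_AFTER)
    return promote, observe
```

The threshold keeps one-off corrections from bloating MEMORY.md while still catching repeat offenders like the BILLBOARD_BUY_AMT mix-up.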

This is the core mechanism. Without it: chatbot. With it: agent.

Chatbot vs Agent

Problem 3: Let Agents Collaborate

The Problem: Multi-Agent Communication

Initial issues:

Status sync failures: A finished, B didn’t know
Resource contention: Multiple agents write same file
Information silos: Macro produced, Trading never saw
Responsibility gaps: “Who’s handling this?” → all silent

Solution: Three-State Protocol + Event Bus

Protocol (three active states + default silent state):

request → confirmed → final → [silent]

request: Explicitly acknowledges, starts the loop
confirmed: In progress, sends intermediate updates
final: Complete, result delivered, loop closes
[silent]: Default state when no active task — “no news is good news” (prevents spam)
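The protocol reads naturally as a small state machine. The transition table below is one plausible interpretation of the states as described (for example, allowing confirmed → confirmed to model intermediate updates); it is a sketch, not the contents of AGENTS.md.

```python
from enum import Enum

class TaskState(Enum):
    SILENT = "silent"
    REQUEST = "request"
    CONFIRMED = "confirmed"
    FINAL = "final"

# Allowed transitions; CONFIRMED -> CONFIRMED models intermediate updates.
TRANSITIONS = {
    TaskState.SILENT: {TaskState.REQUEST},
    TaskState.REQUEST: {TaskState.CONFIRMED},
    TaskState.CONFIRMED: {TaskState.CONFIRMED, TaskState.FINAL},
    TaskState.FINAL: {TaskState.SILENT},
}

def advance(current: TaskState, target: TaskState) -> TaskState:
    """Move to the target state, or raise if the protocol forbids the jump."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Rejecting jumps such as silent → final is what closes the "A finished, B didn't know" gap: a result cannot be delivered for a task nobody acknowledged.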

Event Bus:

{
  "type": "MARKET_CLOSE",
  "source": "TRADING",
  "timestamp": "2026-03-07T15:00:00-08:00",
  "payload": { "symbol": "SPY", "note": "schema omitted for brevity" },
  "requiresAck": false
}
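For a feel of how an envelope like this gets routed, here is a minimal in-process publish/subscribe sketch keyed on the "type" field. The `EventBus` class is hypothetical and ignores persistence, acknowledgements, and retries, all of which a cron-level bus would need.

```python
from collections import defaultdict
from typing import Callable

REQUIRED_FIELDS = {"type", "source", "timestamp", "payload", "requiresAck"}

class EventBus:
    """Minimal in-process publish/subscribe keyed on the event's "type" field."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event: dict) -> int:
        """Validate the envelope, deliver to subscribers, return how many were notified."""
        missing = REQUIRED_FIELDS - set(event)
        if missing:
            raise ValueError(f"event missing fields: {sorted(missing)}")
        handlers = self._handlers.get(event["type"], [])
        for handler in handlers:
            handler(event)
        return len(handlers)
```

Validating the envelope at publish time means a malformed event fails loudly at its source instead of silently confusing a consumer later.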

Shared Context:

tech-radar.json — Read-only except authorized writers
market-status.json — Trading updates, Macro/Content consume

Guardrails:

No ad-hoc cross-agent file writes; mediated writes only (tools, bus, approved writers)
All communication via event bus or shared-context; never park API keys or session tokens in shared JSON — use your platform’s secret store
Zoe has final arbitration

Results: 4 Weeks, 23 Auto-Recoveries

Timeline:

Week 1: Basic setup, single agent, frequent crashes
Week 2: Multi-agent coordination, protocols established
Week 3: Autonomous evolution, agents self-fixing
Week 4: Production-ready, 60 cron tasks smooth

Key Metrics:

What I Learned

1. Agents Need an OS, Not Just Prompts

Context Engineering is OS design. You need:

Memory management (compaction, pruning, reset)
Process isolation (separate workspaces)
IPC mechanisms (event bus, shared context)
Garbage collection (session cleanup, disk limits)

2. Memory = Chatbot vs Agent

Can’t remember yesterday’s mistakes = fancy chatbot. Five-layer memory transforms stateless LLM calls into stateful, learning entities.

3. Constraints Enable Creativity

Clear boundaries = more creative, not less. 40% AI quota, three-state protocol, hard “MUST” rules — these are guardrails for autonomous operation.

4. Multi-Model Fallback Is Production Necessity

GPT-4.1 → Qwen3.5 → Ollama qwen3:8b. Write rules for weakest link.

5. Human: Doer → Designer

My job: design system where code writes itself. I’m architect, not bricklayer.

Looking Ahead

Next steps:

P0: Dead-letter queue for failed events
P1: Manual resend CLI for stuck tasks
P1: Audit log rotation
P2: Visual dashboard for system health

Goal: Amplify human capability. One person + six agents > one person + zero agents. That’s Harness Engineering.

Quick Reference

Agent Roster

Total: 60 cron, ~90 Skills

Daily Schedule (Pacific Time, America/Los_Angeles)

Cron rows are snapshots from my stack — align to your exchange calendar, asset class (equities vs. crypto), and whether you are on PST or PDT.

Critical Files

If you run a similar harness, how do you handle failures when compaction, cron, and multi-agent handoffs all interact — what breaks first in your stack, and what fixed it?

References

OpenClaw: https://github.com/openclaw/openclaw
PowerMem: https://github.com/oceanbase/powermem
ClawHub: https://github.com/openclaw/clawhub
Mitchell Hashimoto: “My AI Adoption Journey” — https://mitchellh.com/writing/my-ai-adoption-journey
Anthropic: “Effective harnesses for long-running agents” — https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents

All times Pacific Time (America/Los_Angeles; PST or PDT depending on season). macOS + OpenClaw. Monitoring: Feb–Mar 2026. Validate config against your OpenClaw release at https://github.com/openclaw/openclaw

Full context feels safe until it isn’t. Here’s the engineering fork in the road — and real numbers from an open-source memory layer on OpenClaw.

Stateless isn’t a feature. It’s the default bug.

That’s the part nobody puts on the landing page: large language models don’t continue anything on their own. Every turn is a fresh sheet of paper — unless you shovel history back in.

So the industry reached for two comforting reflexes:

Crank the context window — pack in everything that might matter. Longer feels safer.
Drop a MEMORY.md and paste it every turn — simple, auditable, easy to debug.

Both are great at small scale. Both fall apart at real scale.

Because context isn’t free. You pay three ways: slower, pricier, and muddier. Inference drags. Your token bill climbs linearly. Worst of all, as context grows, attention thins out in the middle — quality drops, contradictions creep in, and you’re not buying “memory.” You’re buying noise.

So the real question isn’t whether to remember. It’s this:

What shape should memory take before it enters the model — verbatim dump, or retrieve what matters?

This piece is about that engineering fork — and some numbers we saw hooking an open memory layer to OpenClaw.

Full context is like reciting your entire diary before every sentence

A lot of people hear “long-term memory” and think: store everything anyone ever said.

A better engineering definition is tighter:

Keep only facts that help future decisions — and that retrieval can amplify — and let stale stuff expire on purpose.

Human memory isn’t a bit-perfect disk image. We lose detail; we blur timelines; we still keep actionable residue — “no cilantro,” “last time we were blocked on that dependency.” Flip that around for AI: full retain + full inject often buys you less coherence and more contradiction and context pollution.

That’s why I keep coming back to three knobs:

Write path: what do you distill from chat into durable memory?
Read path: what do you retrieve — and how much — before the model sees it?
Lifecycle: how do old facts fade instead of squatting forever?

What should a “memory system” actually look like?

One sane pattern: a persistent memory layer outside the LLM — think PowerMem (Apache 2.0 from OceanBase) — that extracts salient facts from dialogue (dedupe, conflict update, merge related), recalls on demand, and forgets stale items with an explicit decay policy.

A few properties that actually matter in production:

Hybrid retrieval — vectors + full-text + graph-style links. Fuzzy intent and exact keywords need to hit. “Embedding-only search” ages poorly in real products.
Forgetting isn’t a bug — Ebbinghaus-style decay sounds like a psych meme; it’s really a capacity vs. signal-to-noise trade you’re engineering on purpose.
Multi-agent — private memory and shared memory across agents. Multi-agent isn’t “someday”; it’s now. Single-user, single-session assumptions break fast.
Multimodal — text, images, audio. Not for show — workflows are already messy.

If you list those as a feature matrix, it reads like marketing. In engineering terms, they answer one question: how do you put the smallest useful slice of memory in front of the model this turn?
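
To make "smallest useful slice" concrete, here is a rough sketch that scores each memory by relevance times an Ebbinghaus-style retention term and then fills a token budget. This is not PowerMem's API; `MemoryItem`, `retention`, `selectForPrompt`, and the `minScore` cutoff are all illustrative assumptions.

```typescript
// Illustrative only: not PowerMem's data model or scoring.
interface MemoryItem {
  text: string;
  relevance: number;     // 0..1, assumed to come from a retrieval step
  ageDays: number;       // time since the fact was written or last used
  stabilityDays: number; // larger = fades more slowly
  tokens: number;
}

// Ebbinghaus-style retention: R = exp(-age / stability).
const retention = (m: MemoryItem): number =>
  Math.exp(-m.ageDays / m.stabilityDays);

// Rank by relevance × retention, drop faded items, then fill the budget greedily.
function selectForPrompt(
  items: MemoryItem[],
  tokenBudget: number,
  minScore = 0.05, // assumed cutoff below which a memory counts as forgotten
): MemoryItem[] {
  const ranked = items
    .map((m) => ({ m, score: m.relevance * retention(m) }))
    .filter((x) => x.score >= minScore)
    .sort((a, b) => b.score - a.score);

  const chosen: MemoryItem[] = [];
  let used = 0;
  for (const { m } of ranked) {
    if (used + m.tokens > tokenBudget) continue; // skip what does not fit
    chosen.push(m);
    used += m.tokens;
  }
  return chosen;
}
```

The decay term is the capacity versus signal-to-noise trade in one line: a highly relevant but ancient fact can still lose to a fresher one.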

The benchmark doesn’t care about your vibes: LOCOMO vs. “just paste everything”

On LOCOMO (long-dialogue memory benchmark; Maharana et al., ACL 2024), PowerMem vs. a full-context baseline isn’t a rounding error:

The point isn’t “pick a winner.” It’s that retrieval + extraction beats brute-force context on quality, latency, and cost at the same time — which feels backwards until you realize the information shape changed: from “replay the transcript” to structured, retrievable facts.

OpenClaw: the anti-pattern you can actually measure

Out of the box, OpenClaw can ship the entire MEMORY.md into system_prompt every turn, with no retrieval—and the file keeps growing.

That’s full-context thinking in a real toolchain: simple, transparent, explainable — right up until it starts eating you alive.

Same workload, total input tokens:

The PowerMem plugin lands around ~18% of the default — same ballpark as “stop reciting the encyclopedia before answering one question.”

The integration model is what you’d want if you designed it on purpose: retrieve before the session, inject only what’s relevant; extract after the session, persist durable facts — instead of mirroring the whole file into the prompt every time.

Make it run: OpenClaw + PowerMem (copy-paste path)

Pick one path: use ClawHub (step 2) or the manual server + JSON (step 3) — you don’t need both.

1 — OpenClaw
https://openclaw.ai/

2 — One-click via ClawHub (recommended)
Install skill: https://clawhub.ai/Teingi/install-powermem-memory
Plugin name: memory-powermem.

3 — Manual path (when you want your own server)

Install and run PowerMem:

pip install powermem
# from a directory with a configured .env
powermem-server --host 0.0.0.0 --port 8000

Install the OpenClaw plugin:

openclaw plugins install memory-powermem

Point OpenClaw at your server — edit ~/.openclaw/openclaw.json and set the memory slot to this plugin, for example:

{
  "plugins": {
    "slots": { "memory": "memory-powermem" },
    "entries": {
      "memory-powermem": {
        "enabled": true,
        "config": {
          "baseUrl": "http://localhost:8000",
          "autoCapture": true,
          "autoRecall": true,
          "inferOnAdd": true
        }
      }
    }
  }
}

Restart the OpenClaw Gateway, then sanity-check:

openclaw ltm health

If that’s green, you’ve swapped “paste the whole scroll every turn” for retrieve-then-inject + capture-after.

Two layers that don’t negotiate: operation plane + cognitive plane

If you’re going to run memory in production, you usually need both:

pmem CLI — humans and agents share the same front door: scriptable, automatable, boring in the good way.
Dashboard — distributions, health, “what did we actually memorize?” so humans can govern instead of guessing.

Agents need low-friction execution. Humans need explainability and ops judgment. Skip the operation plane and memory never ships. Skip the cognitive plane and you end up with vectors in a black box — and the only fix is nuke from orbit.

Quick local try:

pip install powermem
# or: uv add powermem
pmem --version

With powermem-server running, open the dashboard at http://localhost:8000/dashboard/.

What the open-source argument should actually be about

If you’re building agents, CLIs, or personal automation, move the debate past “do we keep a MEMORY.md?”

Is memory a document or a database?
Is injection append-only or retrieve-then-inject?
Do you have real decay, or do you pretend “never expires” means “always correct”?

Projects like PowerMem are less a billboard and more a reproducible lab bench—hybrid retrieval, extraction, decay, multi-agent, multimodal—trading context signal-to-noise for engineering you can argue about in issues instead of vibes.

If you take one line home, make it this:

Long-term memory isn’t about remembering more. It’s about recalling the right thing when it matters.

0. 💡 TL;DR — the whole agent in one mental picture

Before the details, hold this picture in your head. Everything else is elaboration.

┌─────────────────────────────────────────────────────────────┐
│  query() — async generator, the only place control flows    │
│                                                             │
│   while not done:                                           │
│     state    = compress(state)              # 4 layers      │
│     response = await stream(model, state)                   │
│     yield response.messages                 # to UI         │
│     if no tool_calls:  return  completed                    │
│     batches  = partition(response.tool_calls)               │
│     for batch in batches:                                   │
│       results = run(batch)                  # parallel-safe │
│       yield results.messages                                │
│       state += results                                      │
└─────────────────────────────────────────────────────────────┘
        ▲                ▲                ▲             ▲
        │                │                │             │
   Memory (files)    Tools (self-      Hooks (27       Sub-agents
   loaded into       describing,       lifecycle       (recursive
   system prompt     fail-closed,      events)         query() with
   at session        partitioned by                    isolated state)
   start             safety per-call)

Five rules carry 80% of the design:

🔄 The loop is an async generator. Backpressure, cancellation, and typed terminal states fall out for free.
📝 Every tool is self-describing (schema, permissions, concurrency safety). The loop never special-cases tools.
🛡️ Safety is per invocation, not per tool type. Bash("ls") ≠ Bash("rm -rf").
💾 Prompt cache is architecture, not optimization. Static-then-dynamic boundary, sticky flags, byte-identical fork prefixes.
📁 Memory is files. A small LLM picks which to load. No database, no embeddings. Trust through transparency.

If you only build those five things well, you have ~80% of Claude Code. The rest is layering and polish.

1. 🎯 What you are actually building

A production coding agent is not a chat loop with tool calls bolted on. It is a streaming, cancellable, recursive state machine that has to:

Survive token-budget exhaustion mid-task without losing the user's work.
Run dozens of tools per turn safely, often in parallel, sometimes speculatively.
Spawn child agents that cost ~10% of a normal call thanks to prompt cache reuse.
Persist semantic knowledge across sessions without a database.
Allow third parties to extend it (skills, hooks, MCP) without crashing the host.
Boot in under 300 ms and stream the first token in well under a second.

If your design omits any of these, you will hit a wall later. Build for them on day one — most of them are cheap when planned, expensive when retrofitted.

The closing principle of the source book: push complexity to the boundaries. Protocol translation, state reconciliation, external tool invocation, permission checking — these belong at the edges. The interior (loop, memory, tool composition) stays clean and exhaustively typed.

2. 🧱 The six core abstractions

Every part of Claude Code reduces to one of these. Implement them as first-class modules, not as helpers attached to a god object.

| # | Abstraction | Responsibility | Approx LoC in CC |
|---|---|---|---|
| 1 | Query Loop | Async generator that streams model output, runs tools, appends results, decides when to stop. Returns a typed Terminal discriminated union (10 reasons). | ~1,700 |
| 2 | Tool System | Self-describing tools with schema, permissions, concurrency, rendering. Batched into concurrent/serial groups. Speculative execution during streaming. | - |
| 3 | Tasks | Background units following `pending → running → completed \| failed \| …` | … |
| 4 | State | Two layers: a mutable singleton STATE (~80 fields, infrastructure) + a 34-line reactive store (UI: messages, approvals, progress). | - |
| 5 | Memory | File-tier persistence (CLAUDE.md, ~/.claude/MEMORY.md, team symlinks). LLM picks relevant memories at session start. | - |
| 6 | Hooks | Lifecycle interceptors at 27 events, in 4 forms: shell command, single-shot prompt, agent loop, HTTP webhook. | - |

Why this carving

The Query Loop is the only place control flow lives. Tools, hooks, sub-agents — they all yield through it.
State is split because infrastructure mutates rarely but reads constantly; UI is the opposite. One subscription model can't serve both.
Memory is its own primitive (not a tool) because it is read on every system-prompt build, before any tool can run.
Hooks are first-class because the permission system itself runs partially as PreToolUse hooks. They are not an afterthought.

3. 📦 State: two tiers, one source of truth

State design is where most agent codebases collapse. Claude Code splits it into two tiers with strict layering:

| Tier | What it holds | Mutability | Reachable from |
|---|---|---|---|
| Bootstrap state (STATE) | ~80 fields: originalCwd, sessionId, model overrides, cost accumulators, telemetry handles, prompt-cache allowlists | Mutable through ~100 typed setters | Everywhere (DAG leaf; depends on nothing but Node.js stdlib) |
| AppState (reactive store) | Messages, input mode, tool approvals, progress indicators, todos | Immutable snapshots; updater functions only | Inside React components |

Why split them

Availability: session ID, telemetry, and cost trackers must exist before React mounts. A reactive store cannot serve them.
Access pattern: bootstrap state is read constantly, mutated rarely, with no subscribers. AppState is read by render subscribers on every change. One subscription model can't serve both.
Dependency direction: bootstrap depends on nothing → AppState imports bootstrap → React imports AppState. Enforce this with a lint rule. Cycles will sneak in otherwise.

The reactive store in 34 lines

function makeStore(initial, onTransition) {
  let current = initial
  const subs = new Set()
  return {
    read:      () => current,
    update:    (fn) => {
      const next = fn(current)
      if (Object.is(next, current)) return        // skip noop
      const prev = current; current = next
      onTransition?.(prev, next)                  // side effects FIRST
      subs.forEach(cb => cb())                    // then UI
    },
    subscribe: (cb) => { subs.add(cb); return () => subs.delete(cb) },
  }
}

Three deliberate choices:

Updater-only mutations. No set(value) API. Stale-closure bugs vanish.
Object.is guard. Identical references skip re-renders and side effects.
onTransition fires before subscribers. Side effects (e.g. persist to disk, notify remote session) complete before the UI flips.
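
A typed restatement of the store with a short usage run, to show the `Object.is` guard and the effect-before-render ordering in action. The type annotations and the `log` array are my additions for illustration.

```typescript
type Listener = () => void;

function makeStore<S>(initial: S, onTransition?: (prev: S, next: S) => void) {
  let current = initial;
  const subs = new Set<Listener>();
  return {
    read: () => current,
    update: (fn: (state: S) => S) => {
      const next = fn(current);
      if (Object.is(next, current)) return; // same reference: nothing happens
      const prev = current;
      current = next;
      onTransition?.(prev, next);           // side effects first
      subs.forEach((cb) => cb());           // then subscribers (UI)
    },
    subscribe: (cb: Listener) => {
      subs.add(cb);
      return () => { subs.delete(cb); };
    },
  };
}

const log: string[] = [];
const store = makeStore({ count: 0 }, (prev, next) => {
  log.push(`effect ${prev.count}->${next.count}`);
});
store.subscribe(() => { log.push(`render ${store.read().count}`); });

store.update((s) => s);                        // no-op: same reference
store.update((s) => ({ count: s.count + 1 })); // effect, then render
// log is now ["effect 0->1", "render 1"]
```

Because callers only ever pass a function of the current state, there is no stale value to close over.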

The sticky latch pattern (write-once flags)

A pattern worth memorizing — applies any time a value influences a server-side cache key:

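type Latch = boolean | null   // null = "not yet evaluated"

// Session-scoped storage; these accessors are assumed stand-ins so the snippet is self-contained
let afkLatch: Latch = null
const getAfkLatch = (): Latch => afkLatch
const setAfkLatch = (value: true): void => { afkLatch = value }

function shouldSendBetaHeader(featureCurrentlyActive: boolean): boolean {
  const latched = getAfkLatch()
  if (latched === true) return true            // already on: keep sending
  if (featureCurrentlyActive) {
    setAfkLatch(true)                           // first activation: latch
    return true
  }
  return false                                  // never activated
}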

The three-state type self-documents intent: null says "we haven't decided yet." Once true, never returns to false. Five such latches in Claude Code prevent mid-session feature toggles from busting 50–70K tokens of cached prompt.

Centralizing side effects on diffs

A real production bug: permission mode was synced to the remote session by 2 of 8+ mutation paths. Eventually one drifted. The fix was a single onChangeAppState(prev, next) callback that detects field changes structurally — every mutation path is automatically covered. Side effects scale much more slowly than mutation sites; centralize on diffs, not events.

Cost tracking (a concrete example)

Every API response runs through addToTotalSessionCost:

Accumulates per-model usage in bootstrap state.
Reports to OpenTelemetry.
Recursively processes nested model calls (sub-agents, recall queries).
Persists to project config on process exit.
Restores on next session only if the persisted session ID matches.

Histograms use reservoir sampling (Algorithm R) with 1,024 entries to compute p50/p95/p99. Averages hide tail latency, and tail latency is what users feel.
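
For reference, Algorithm R fits in a few lines. This is a generic sketch, not Claude Code's implementation; the `Reservoir` class, the injectable `random` source, and the nearest-rank percentile are my choices.

```typescript
// Generic Algorithm R sketch; class and method names are illustrative.
class Reservoir {
  private readonly samples: number[] = [];
  private readonly capacity: number;
  private readonly random: () => number;
  private seen = 0;

  constructor(capacity = 1024, random: () => number = Math.random) {
    this.capacity = capacity;
    this.random = random;
  }

  add(value: number): void {
    this.seen += 1;
    if (this.samples.length < this.capacity) {
      this.samples.push(value); // fill phase: keep everything
      return;
    }
    // Replacement phase: keep the new value with probability capacity / seen.
    const j = Math.floor(this.random() * this.seen);
    if (j < this.capacity) this.samples[j] = value;
  }

  // Nearest-rank percentile over the current sample, p in (0, 100].
  percentile(p: number): number {
    if (this.samples.length === 0) return NaN;
    const sorted = this.samples.slice().sort((a, b) => a - b);
    const rank = Math.ceil((p / 100) * sorted.length);
    return sorted[Math.max(0, rank - 1)];
  }
}
```

Memory stays fixed at `capacity` entries no matter how long the session runs, and every observation has the same chance of being in the sample.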

Actionable: even in v0, instrument cost and latency. You cannot decide what to optimize from feel.

4. ⚙️ The agent loop: AsyncGenerator as control plane

The loop is an async function* — not a while with callbacks, not an event emitter, not an RxJS pipeline. There are three concrete reasons to choose generators:

Backpressure for free. A generator yields only when the consumer calls .next(). The REPL pulls via for await, naturally pausing if the UI can't render fast enough.
Typed terminal states. The generator's return is a discriminated union of why execution stopped: completed, max_turns, error, aborted_streaming, aborted_tools, prompt_too_long, image_error, model_error, stop_hook_prevented, hook_stopped, blocking_limit. The compiler enforces exhaustive handling.
Composability. Inner generators delegate via yield*. No callback nesting, no promise plumbing.
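
A toy version of those three properties, assuming a cut-down `Terminal` union with three of the listed reasons. `runTurn`, `explain`, and `drain` are hypothetical helpers, and the "model" here is a stub that stops after two turns.

```typescript
// Cut-down union for illustration; the real one has more members.
type Terminal =
  | { kind: "completed" }
  | { kind: "max_turns"; turns: number }
  | { kind: "aborted_streaming" };

type Message = { role: "assistant" | "tool"; text: string };

// Inner generator: yields one turn's messages, returns whether to continue.
async function* runTurn(turn: number): AsyncGenerator<Message, boolean> {
  yield { role: "assistant", text: `turn ${turn}` };
  return turn < 2; // stub: pretend tool calls stop after turn 2
}

async function* query(
  maxTurns: number,
  signal?: { aborted: boolean },
): AsyncGenerator<Message, Terminal> {
  for (let turn = 1; turn <= maxTurns; turn++) {
    if (signal?.aborted) return { kind: "aborted_streaming" };
    const keepGoing = yield* runTurn(turn); // delegation via yield*
    if (!keepGoing) return { kind: "completed" };
  }
  return { kind: "max_turns", turns: maxTurns };
}

// The never-typed default makes the compiler flag any unhandled kind.
function explain(t: Terminal): string {
  switch (t.kind) {
    case "completed": return "done";
    case "max_turns": return `stopped after ${t.turns} turns`;
    case "aborted_streaming": return "aborted";
    default: { const unreachable: never = t; return unreachable; }
  }
}

// The consumer pulls with .next(); a slow consumer just pulls less often.
async function drain(gen: AsyncGenerator<Message, Terminal>) {
  const messages: Message[] = [];
  while (true) {
    const step = await gen.next();
    if (step.done) return { messages, terminal: step.value };
    messages.push(step.value);
  }
}
```

Add a fourth member to `Terminal` and `explain` stops compiling until it is handled, which is the whole point of typing the return value.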

Loop skeleton

async function* query(initialState):
  state = initialState
  while true:
    state = compress(state)              // 4-layer pipeline (§8)
    response = await callModel(state)    // streaming
    yield* response.messages             // surface to UI

    if response.error and recoverable:
      state = recover(state, response.error)
      continue
    if response.error and not recoverable:
      return { kind: 'model_error', error: response.error }
    if not response.toolCalls:
      if stopHookBlocks(state):
        state = applyHookFeedback(state)
        continue
      return { kind: 'completed' }

    batches = partitionToolCalls(response.toolCalls)
    for batch in batches:
      results = await executeBatch(batch, state)
      yield* results.messages
      state = appendToolResults(state, results)

    // re-enter with new state

Continue states (don't `return`, just `continue`)

collapse_drain_retry, reactive_compact_retry, max_output_tokens_escalate, max_output_tokens_recovery, stop_hook_blocking, token_budget_continuation, next_turn. Naming each one is what makes the loop testable — every test asserts which transition fired.

Error recovery is a ladder, not a fallback

Order matters. From least to most aggressive:

| Trigger | Step 1 | Step 2 | Step 3 |
|---|---|---|---|
| prompt_too_long (413) | drain staged collapse summaries | reactive compact | surface to user |
| max_output_tokens | escalate cap 8K → 64K | multi-turn recovery (≤3 attempts) | surface |
| media_size_error | reactive compact | - | surface |

Guards prevent infinite loops: hasAttemptedReactiveCompact one-shot flags, hard caps on recovery attempts, circuit breakers. Never run stop hooks on an error response — that creates "error → hook blocks → retry → error" spirals.

Cancellation

Aborts can hit during streaming or during tool execution. In both cases, the executor must drain remaining requests by emitting synthetic tool_result blocks for queued/running tools. The Anthropic API rejects an assistant message containing a tool_use block without a matching tool_result. signal.reason distinguishes hard aborts from "submit interrupts" (a new user message), so you skip redundant interruption stubs in the latter case.

Actionable: every tool_use your agent emits must have a paired tool_result in message history before the next API call. Make this an invariant your loop enforces, not a hope.
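
One way to enforce that invariant is a pass over history right before each API call. The block shapes below are simplified from the Messages API (no `input` field, string-only content), and `missingToolResults` / `sealHistory` are hypothetical names. It assumes any unpaired tool_use sits in the most recent assistant message, which is the abort case described above.

```typescript
// Simplified block shapes; real messages carry more fields.
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string }
  | { type: "tool_result"; tool_use_id: string; content: string; is_error?: boolean };

interface Msg { role: "user" | "assistant"; content: Block[] }

// Build a synthetic tool_result for every tool_use that has no result yet.
function missingToolResults(history: Msg[], reason = "Cancelled"): Block[] {
  const answered = new Set<string>();
  for (const msg of history) {
    for (const b of msg.content) {
      if (b.type === "tool_result") answered.add(b.tool_use_id);
    }
  }
  const stubs: Block[] = [];
  for (const msg of history) {
    for (const b of msg.content) {
      if (b.type === "tool_use" && !answered.has(b.id)) {
        stubs.push({ type: "tool_result", tool_use_id: b.id, content: reason, is_error: true });
      }
    }
  }
  return stubs;
}

// Returns a history that is safe to send: stubs join the trailing user
// message if one exists, otherwise they go into a new user message.
function sealHistory(history: Msg[]): Msg[] {
  const stubs = missingToolResults(history);
  if (stubs.length === 0) return history;
  const last = history[history.length - 1];
  if (last !== undefined && last.role === "user") {
    return [...history.slice(0, -1), { role: "user", content: [...last.content, ...stubs] }];
  }
  return [...history, { role: "user", content: stubs }];
}
```

Calling `sealHistory` unconditionally is cheap, and it converts a rejected request into a transcript the model can read and recover from.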

5. 🔧 Tools: self-describing, fail-closed, parameterized

Interface

A tool is parameterized by three types: Input, Output, and Progress. The Input doubles as a Zod schema and the JSON Schema given to the model.

The full Tool interface in Claude Code has ~45 members. Five are critical:

call(input, ctx) — runs the work.
inputSchema — Zod schema (validated, plus auto-generated JSON Schema).
isConcurrencySafe(parsedInput) — per invocation, not per type.
checkPermissions(parsedInput, ctx) — returns allow | deny | ask | passthrough with optional updatedInput.
validateInput(parsedInput, ctx) — semantic checks beyond schema (e.g. reject no-op edits).

The `buildTool()` factory pattern (fail-closed)

Never construct a tool literal directly. Wrap it in a factory that fills in dangerous defaults conservatively:

const SAFE_DEFAULTS = {
  isEnabled:        () => true,
  isConcurrencySafe: () => false,  // serial unless proven otherwise
  isReadOnly:       () => false,   // assume writes
  isDestructive:    () => false,
  checkPermissions: (input) => ({ behavior: 'allow', updatedInput: input }),
}

If a tool author forgets isConcurrencySafe, they get serial execution — slow, but never corrupting. The opposite default would silently produce race conditions.
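
A sketch of what such a factory could look like. `ToolDef` here is a five-member simplification of the real interface, and the spread-merge is my assumption about the mechanics, not the actual `buildTool()` source.

```typescript
// Simplified tool shape; the real interface has many more members.
interface ToolDef<I> {
  name: string;
  call: (input: I) => Promise<unknown>;
  isEnabled?: () => boolean;
  isConcurrencySafe?: (input: I) => boolean;
  isReadOnly?: (input: I) => boolean;
  isDestructive?: (input: I) => boolean;
}

type Tool<I> = Required<ToolDef<I>>;

// Defaults first, author's definition second: anything omitted stays conservative.
function buildTool<I>(def: ToolDef<I>): Tool<I> {
  return {
    isEnabled: () => true,
    isConcurrencySafe: () => false, // serial unless the author opts in
    isReadOnly: () => false,        // assume writes
    isDestructive: () => false,
    ...def,
  };
}

// Hypothetical tools: Read opts in to concurrency, Bash says nothing.
const readTool = buildTool<{ path: string }>({
  name: "Read",
  call: async ({ path }) => `contents of ${path}`,
  isConcurrencySafe: () => true,
  isReadOnly: () => true,
});

const bashTool = buildTool<{ command: string }>({
  name: "Bash",
  call: async ({ command }) => `ran ${command}`,
});
```

`bashTool` never mentioned concurrency, so it runs serially. Getting parallelism is an explicit act by the tool author.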

Tool result shape

type ToolResult<T> = {
  data: T
  newMessages?: Message[]                              // e.g. AgentTool injects sub-agent transcript
  contextModifier?: (ctx: ToolUseContext) => ToolUseContext  // e.g. EnterPlanMode
}

Context modifiers only apply to serial tools. Concurrent tools queue modifiers until the batch completes — otherwise data dependencies and shared state become race-condition territory.

The 14-step execution pipeline (`checkPermissionsAndCallTool()`)

This is the choreography every tool call goes through. Implement it as a single function that returns a ToolResult or ToolError. Skipping any of these steps will hurt later.

| # | Step | Why it matters |
|---|---|---|
| 1 | Tool lookup (with alias map) | Old transcripts may reference renamed tools |
| 2 | Abort check | Don't waste compute on cancelled queued calls |
| 3 | Zod validation | Catch type errors; hint to call ToolSearch for deferred tools |
| 4 | Semantic validation | E.g. reject no-op edits, block sleep if a Monitor tool exists |
| 5 | Speculative classifier start | Fire auto-mode permission classifier in parallel for Bash |
| 6 | Input backfill | Expand ~/foo → absolute paths for hooks/permissions but keep originals for transcript stability |
| 7 | PreToolUse hooks | Hooks decide / modify / block |
| 8 | Permission resolution | Rule match → tool method → mode default → prompt → classifier |
| 9 | Permission denied path | Build error, fire PermissionDenied hook |
| 10 | Execute call() | The actual work |
| 11 | Result budgeting | Persist oversized output to disk; replace with preview |
| 12 | PostToolUse hooks | Modify MCP output, possibly block continuation |
| 13 | Append newMessages | Sub-agent transcripts, system reminders |
| 14 | Error classification | Telemetry, OTel events |

Result budgeting

Per-tool size caps prevent runaway output:

| Tool | maxResultSizeChars | Rationale |
|---|---|---|
| Bash | 30,000 | Most useful output fits |
| Edit | 100,000 | Diffs need room |
| Grep | 100,000 | Search results accumulate |
| Read | ∞ | Self-bounded by token limit; persisting would create circular Read loops |

Above the cap, the system writes the full content to a <persisted-output> file and returns a preview pointing to it. An aggregate ContentReplacementState tracks per-conversation budgets so multiple near-cap results cannot blow context together.
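
The per-tool half of that can be sketched as one pure function. `budgetResult`, the injected `persist` callback, the 2,000-character preview, and the wording of the truncation note are all assumptions; the aggregate per-conversation budget is left out.

```typescript
interface Budgeted {
  content: string;      // what goes into the conversation
  persistedTo?: string; // where the full output went, if it was over the cap
}

// persist() is injected so the sketch stays free of filesystem details.
function budgetResult(
  output: string,
  maxChars: number,                  // Infinity means "never persist" (e.g. Read)
  persist: (full: string) => string, // stores the full output, returns its path
  previewChars = 2000,               // assumed preview size
): Budgeted {
  if (output.length <= maxChars) return { content: output };
  const path = persist(output);
  const preview = output.slice(0, Math.min(previewChars, maxChars));
  return {
    content: `${preview}\n[truncated: ${output.length} chars total, full output at ${path}]`,
    persistedTo: path,
  };
}
```

The model still learns that more output exists and where to find it, so it can read a slice on demand instead of carrying the whole thing in context.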

Deferred loading

Tools marked shouldDefer: true send only { name, description, defer_loading: true } to the API. The model has to call ToolSearch to load full schemas. Three benefits:

Smaller initial prompt.
Adding/removing a deferred tool changes the prompt by a few tokens, not hundreds — prompt cache stays warm.
Less tool-soup confusion for the model.

Tool registry assembly order matters

final = sort(builtins, alpha) ++ sort(mcpTools, alpha)

Sort within each partition, then concatenate. A flat sort across all tools would interleave MCP tools into built-in positions, busting cache breakpoints whenever MCP servers are added/removed.
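
A few lines make the difference visible. `assembleRegistry` and the sample tool names are made up for the example.

```typescript
type Named = { name: string };

const byName = (a: Named, b: Named): number =>
  a.name < b.name ? -1 : a.name > b.name ? 1 : 0;

// Sort each partition on its own, then concatenate: built-ins keep their
// positions no matter which MCP tools come and go.
function assembleRegistry<T extends Named>(builtins: T[], mcpTools: T[]): T[] {
  return [...builtins].sort(byName).concat([...mcpTools].sort(byName));
}

const withoutMcp = assembleRegistry([{ name: "Read" }, { name: "Bash" }], []);
const withMcp = assembleRegistry(
  [{ name: "Read" }, { name: "Bash" }],
  [{ name: "Alpha_search" }],
);
// withMcp order: Bash, Read, Alpha_search. A flat sort would put Alpha_search first.
```

The prefix `Bash, Read` is identical in both registries, so the cached bytes covering built-in tool definitions stay valid when an MCP server is added.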

6. ⚡ Concurrency and speculative execution

The core insight

Safety is determined per-invocation, not per-tool-type. Bash("ls -la") is concurrency-safe. Bash("rm -rf build/") is not. Same tool. Different inputs. Different verdict.

The partition algorithm

partitionToolCalls(calls):
  batches = []
  current = { kind: 'concurrent', tools: [] }
  for call in calls:
    tool = lookup(call.name)
    parsed = tool.inputSchema.safeParse(call.input)
    safe = parsed.success and tool.isConcurrencySafe(parsed.data)
    if safe and current.kind == 'concurrent':
      current.tools.push(call)
    else if safe:
      batches.push(current); current = { kind: 'concurrent', tools: [call] }
    else:
      if current.tools: batches.push(current)
      batches.push({ kind: 'serial', tools: [call] })
      current = { kind: 'concurrent', tools: [] }
  if current.tools: batches.push(current)
  return batches

Example: [Read, Read, Grep, Edit, Read] → [concurrent[Read, Read, Grep], serial[Edit], concurrent[Read]].

Parsing failure → serial. Safety-check exception → serial. Always fail closed.
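
The same algorithm as runnable TypeScript, with the fail-closed cases pulled into one helper. `ToolSpec.parse` is a stand-in for the Zod `safeParse` step, and the registry is a plain `Map`; both are simplifications for the sketch.

```typescript
interface ToolCall { name: string; input: unknown }

interface ToolSpec {
  // Stand-in for schema validation (safeParse in the pseudocode above).
  parse: (input: unknown) => { ok: true; data: unknown } | { ok: false };
  isConcurrencySafe: (data: unknown) => boolean;
}

type Batch = { kind: "concurrent" | "serial"; calls: ToolCall[] };

// Unknown tool, failed parse, or a throwing safety check: all count as unsafe.
function isSafe(call: ToolCall, registry: Map<string, ToolSpec>): boolean {
  try {
    const tool = registry.get(call.name);
    if (!tool) return false;
    const parsed = tool.parse(call.input);
    return parsed.ok && tool.isConcurrencySafe(parsed.data);
  } catch {
    return false;
  }
}

function partitionToolCalls(calls: ToolCall[], registry: Map<string, ToolSpec>): Batch[] {
  const batches: Batch[] = [];
  let run: ToolCall[] = []; // current run of consecutive safe calls
  const flush = (): void => {
    if (run.length > 0) {
      batches.push({ kind: "concurrent", calls: run });
      run = [];
    }
  };
  for (const call of calls) {
    if (isSafe(call, registry)) {
      run.push(call);
      continue;
    }
    flush();
    batches.push({ kind: "serial", calls: [call] });
  }
  flush();
  return batches;
}
```

Order is preserved: an unsafe call closes the current concurrent run, executes alone, and the next safe call opens a new run.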

Speculative streaming execution

The StreamingToolExecutor watches the model stream. The moment a tool_use block is fully parsed (often seconds before the response finishes), it starts that tool — provided admission rules allow.

Admission rule: a tool can start executing iff no tool is currently running, or both the new tool and all currently-running tools are concurrency-safe.

Sequential timeline: stream 2.5s + 3 serial tools = 3.1s
Speculative: stream 2.5s overlapped with tools 1–2; total 2.6s

Tool states: Queued → Executing → Completed → Yielded. Yield in submission order, not completion order — even if c.ts finishes before a.ts, the conversation history must remain a, b, c.
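
Both rules (admission and in-order yield) can be modelled as plain bookkeeping, separate from any real async execution. `ExecutorState` and its method names are hypothetical; this tracks state only and runs nothing.

```typescript
interface Slot<T> { safe: boolean; done: boolean; result?: T }

class ExecutorState<T> {
  private slots: Slot<T>[] = [];
  private nextToYield = 0;

  // Admission rule: nothing running, or the new tool and all running tools are safe.
  canStart(newToolSafe: boolean): boolean {
    const running = this.slots.filter((s) => !s.done);
    return running.length === 0 || (newToolSafe && running.every((s) => s.safe));
  }

  // Registers a started tool; its index is its submission order.
  start(safe: boolean): number {
    this.slots.push({ safe, done: false });
    return this.slots.length - 1;
  }

  // Records a completion and returns whatever can now be yielded, in submission order.
  complete(index: number, result: T): T[] {
    const slot = this.slots[index];
    if (slot === undefined) throw new Error(`unknown tool index ${index}`);
    slot.done = true;
    slot.result = result;
    const ready: T[] = [];
    while (this.nextToYield < this.slots.length && this.slots[this.nextToYield].done) {
      ready.push(this.slots[this.nextToYield].result as T);
      this.nextToYield += 1;
    }
    return ready;
  }
}
```

If the third tool finishes first, `complete` returns an empty list for it; its result is released only once the first and second have been yielded.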

Error cascade policy

Bash errors cascade within a batch. Shell commands form implicit pipelines; running cp after a failing mkdir is pointless.
Read/Grep errors isolate. One file read failure has no bearing on a sibling grep.

Cancelled siblings get synthetic results: "Cancelled: parallel tool call Bash(mkdir build) errored".

Interrupt behavior

Each tool declares interruptBehavior(): 'cancel' | 'block'. The executor treats an executing batch as interruptible only when all tools in it support cancel. A single block tool blocks user Esc for the whole batch.

7. 🔒 Permissions: modes, rules, and bubbling

Seven modes (most → least permissive)

| Mode | Behavior |
|---|---|
| bypassPermissions | No checks (testing only) |
| dontAsk | Auto-deny prompts (background agents; never block on user input) |
| auto | Lightweight LLM classifier evaluates each call against transcript |
| acceptEdits | File edits auto-allowed; other mutations prompt |
| default | Standard interactive: user approves each action |
| plan | Read-only; all writes denied |
| bubble | Sub-agent escalates the decision to its parent |

Sub-agents default to bubble. Background agents default to dontAsk (they can't block on a prompt that has no UI).

Resolution chain

1. Hook decision?         → final
2. allowedRules / deniedRules / askRules match?  → final
3. tool.checkPermissions()  → allow | deny | ask | passthrough
4. Mode default
5. (interactive only) prompt user
6. (auto only) classifier
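
Steps 1 to 4 of that chain, sketched as a single function. Several things here are my assumptions rather than documented behavior: rules are checked in list order with first match winning, a non-passthrough tool verdict is treated as final, and patterns support only a trailing `*`. Steps 5 and 6 (prompt, classifier) would consume an `ask` result and are out of scope.

```typescript
type Decision = "allow" | "deny" | "ask";
type ToolVerdict = Decision | "passthrough";

interface Rule { behavior: Decision; tool: string; pattern?: string }
interface PermissionRequest { tool: string; content: string }

// Minimal matcher: a trailing * matches any suffix, otherwise exact match.
function matches(rule: Rule, req: PermissionRequest): boolean {
  if (rule.tool !== req.tool) return false;
  if (rule.pattern === undefined) return true;
  return rule.pattern.endsWith("*")
    ? req.content.startsWith(rule.pattern.slice(0, -1))
    : req.content === rule.pattern;
}

function resolve(
  req: PermissionRequest,
  opts: { hookDecision?: Decision; rules: Rule[]; toolVerdict: ToolVerdict; modeDefault: Decision },
): Decision {
  if (opts.hookDecision) return opts.hookDecision;                 // 1. hook decision
  const rule = opts.rules.find((r) => matches(r, req));            // 2. rule match
  if (rule) return rule.behavior;
  if (opts.toolVerdict !== "passthrough") return opts.toolVerdict; // 3. tool's own check
  return opts.modeDefault;                                         // 4. mode default
}
```

A real matcher for Bash would classify each parsed subcommand rather than prefix-match the raw string, as the Rules section below explains.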

Rules

Three pieces: source (tracks provenance), ruleBehavior (allow/deny/ask), ruleValue (with optional content patterns).

Bash(git *) — Bash commands starting with git
Edit(/src/**) — file edits restricted to /src
Fetch(domain:example.com) — HTTP fetches limited to that domain

For Bash, parse the command via a real bash AST parser (parseForSecurity()), split on && || ; |, and classify each subcommand. If the parser fails, return fail-safe behavior — assume any command it can't parse is unsafe.

8. 🗜️ Context engineering: the 4-layer compression pipeline

Run before every API call, in this strict order:

| Layer | What it does | Cost |
|---|---|---|
| 0. Tool result budget | Enforce per-message size caps; exempt tools without finite maxResultSizeChars | Trivial |
| 1. Snip compact | Physically remove old messages; emit UI boundary marker; report tokens freed | Cheap |
| 2. Microcompact | Drop tool results by tool_use_id once unneeded; cache edits via deferred boundary messages | Cheap |
| 3. Context collapse | Replace conversation spans with summaries (granular) | Medium |
| 4. Auto-compact | Fork an entire Claude conversation to summarize history; circuit-break after 3 consecutive failures | Heavy |

Why ordering matters: if collapse alone gets tokens below the auto-compact threshold, auto-compact never runs — so you keep fine-grained recent history.

Budget thresholds

Auto-compact triggers at effectiveContextWindow − 13,000 tokens.
Hard blocking limit at effectiveContextWindow − 3,000.
10K-token gap between them is where reactive compact runs if proactive failed.

Token counting blends authoritative API usage numbers with rough estimates for messages added since the last response — biased conservative so compaction fires slightly early.

Actionable: instrument both estimated and authoritative token counts, log the delta. When the delta drifts, your estimator is broken and your safety margins are wrong.
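
The thresholds and the drift check reduce to a few small functions. The two margins come from the text above; `charsPerToken = 4` and the 10% padding are placeholder assumptions for the rough estimator, and all function names are hypothetical.

```typescript
type BudgetAction = "ok" | "auto_compact" | "blocked";

const AUTO_COMPACT_MARGIN = 13000; // from the thresholds above
const BLOCKING_MARGIN = 3000;

function budgetAction(tokens: number, effectiveContextWindow: number): BudgetAction {
  if (tokens >= effectiveContextWindow - BLOCKING_MARGIN) return "blocked";
  if (tokens >= effectiveContextWindow - AUTO_COMPACT_MARGIN) return "auto_compact";
  return "ok";
}

// Authoritative usage from the last response plus a padded guess for newer messages.
function currentTokens(
  lastApiUsage: number,
  newMessageChars: number,
  charsPerToken = 4, // assumed rough ratio
  padding = 1.1,     // assumed conservative bias
): number {
  return lastApiUsage + Math.ceil((newMessageChars / charsPerToken) * padding);
}

// Relative error of the estimate once the authoritative count arrives.
function estimatorDrift(estimated: number, authoritative: number): number {
  return (estimated - authoritative) / authoritative;
}
```

Log `estimatorDrift` on every response. A value that trends negative means the estimator is undercounting and compaction will fire later than the margins assume.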

9. 🌐 The API layer: prompt caching as architecture

Prompt caching is not an optimization. It is an architectural constraint. Every design decision either preserves cache hits or busts them.

Multi-provider abstraction

A single getAnthropicClient() factory dispatches to one of:

Direct API (key or OAuth)
AWS Bedrock
Google Vertex AI
Azure Foundry

Provider chosen at boot from env vars + config. Stored in bootstrap state; never re-checked. SDKs dynamically imported (don't load Bedrock if you're on direct API).

A buildFetch wrapper injects an x-client-request-id UUID header on every request, so you can correlate client-side timeouts with server-side logs.

Cache scopes

| Scope | Where | TTL |
|---|---|---|
| Global | Static prompt prefix shared across all users | Long |
| 1-hour | Eligible users' extended cache | 60 min |
| Ephemeral (default) | Per-session | ~5 min |

The system prompt has a literal === DYNAMIC BOUNDARY === marker:

Above (cacheScope: global): identity, system rules, task guidance, tool usage instructions, tone/style.
Below (per-session): session guidance, CLAUDE.md, env info, language, MCP instructions (uncached, marked dangerous), output style.

Rule: every runtime if above the boundary doubles the cache key space. 3 conditionals = 8 prefixes. 5 = 32. Compile-time feature flags are fine; runtime checks must live below the boundary.

Global scope is disabled when MCP tools are present — user-specific tool definitions would fragment the global cache into millions of unique prefixes.

Sticky latches

Five session-scoped boolean flags that, once set, cannot be unset for the rest of the session. They control beta/feature headers. Reason: latching guarantees that "mid-session toggles don't change the server-side cache key"; without it, flipping a flag would bust 50–70K tokens of cached context.

Pattern: Once(value) — a setter that throws or no-ops on second call. Use this for any cache-influencing config.
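
One possible shape for that setter, combining both behaviors: repeating the same value is a no-op, a conflicting value throws. The `once` helper and its `get`/`set` API are my sketch, not code from Claude Code.

```typescript
// Write-once holder: first set() wins for the rest of the session.
function once<T>(name: string) {
  let slot: { value: T } | null = null;
  return {
    get: (): T | undefined => slot?.value,
    set: (value: T): void => {
      if (slot !== null) {
        if (!Object.is(slot.value, value)) {
          throw new Error(`${name} is latched; refusing to change it mid-session`);
        }
        return; // same value again: no-op
      }
      slot = { value };
    },
  };
}

// Hypothetical cache-influencing setting.
const betaHeader = once<boolean>("betaHeader");
betaHeader.set(true);
betaHeader.set(true); // fine, nothing changes
```

Throwing on a conflicting write is the louder option; it surfaces the code path that tried to flip a cache-relevant flag instead of hiding it.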

Output token slot reservation

Production p99 output = 4,911 tokens. Default SDK reservation = 32K–64K. Over-reservation = 8–16×.

Strategy: cap default max_tokens at 8K. On the rare truncation (<1% of requests), retry with 64K. Recovers 12–28% of the context window for free.

Streaming: skip the SDK helper

The SDK's BetaMessageStream calls partialParse() on every input_json_delta — repeatedly re-parsing growing JSON from scratch (O(n²)). Use raw Stream<BetaRawMessageStreamEvent> and accumulate tool-input strings yourself.

Watchdog and fallback

Idle watchdog: setTimeout(90s) reset on every chunk. At 45s, warn. At 90s, abort and retry non-streaming.
Non-streaming fallback activates when streaming dies mid-response (network, stall, truncation, proxies returning 200 with non-SSE bodies).
Disable fallback when streaming tool execution is active — duplicate tool runs would corrupt state.

10. 🤖 Sub-agents and fork agents

Single-agent capability has a hard ceiling. The fix is recursive: spawn child agents that are the same loop with isolated state.

`AgentTool` input schema (dynamic)

| Field | Purpose |
| --- | --- |
| description | 3–5 word task summary |
| prompt | Full instructions |
| subagent_type | Specialization key (optional) |
| model | Override (haiku/sonnet/opus) |
| run_in_background | Async execution |
| name | For team addressability |
| isolation | worktree (filesystem clone) or remote |

Critical pattern: feature-gate the schema itself. "The model never sees fields it cannot use." Don't tell the model "don't use name here" — remove name from the schema in this context. The model cannot misuse what it cannot see.
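
A small sketch of gating the schema rather than the instructions. The gate names (`teamsEnabled`, `backgroundEnabled`) and the choice of which fields they control are assumptions for illustration; the field names come from the schema listed above.

```typescript
interface SchemaField { field: string; purpose: string }
interface SchemaGates { teamsEnabled: boolean; backgroundEnabled: boolean }

const ALL_AGENT_FIELDS: SchemaField[] = [
  { field: "description", purpose: "3-5 word task summary" },
  { field: "prompt", purpose: "Full instructions" },
  { field: "subagent_type", purpose: "Specialization key" },
  { field: "model", purpose: "Model override" },
  { field: "run_in_background", purpose: "Async execution" },
  { field: "name", purpose: "Team addressability" },
  { field: "isolation", purpose: "worktree or remote" },
];

// Build the schema the model actually sees: gated fields are dropped
// entirely instead of being documented as "do not use".
function buildAgentToolSchema(gates: SchemaGates): SchemaField[] {
  return ALL_AGENT_FIELDS.filter((f) => {
    if (f.field === "name") return gates.teamsEnabled;
    if (f.field === "run_in_background") return gates.backgroundEnabled;
    return true;
  });
}
```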

Output (discriminated union)

Sync: { status: 'completed', prompt, ...result }
Async: { status: 'async_launched', agentId, outputFile } — outputFile is a filesystem path that fills in when the bg agent completes; parents poll independently of process state.
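
As a type, the two variants might look like the sketch below. The `result: string` field is a simplification of the `...result` spread shown above, and `describeAgentOutput` is a hypothetical consumer.

```typescript
// Discriminated on `status`; the compiler narrows each branch of the switch.
type AgentToolOutput =
  | { status: "completed"; prompt: string; result: string }
  | { status: "async_launched"; agentId: string; outputFile: string };

function describeAgentOutput(output: AgentToolOutput): string {
  switch (output.status) {
    case "completed":
      return `done: ${output.result}`;
    case "async_launched":
      // The parent polls outputFile; it does not depend on process state.
      return `poll ${output.outputFile} for ${output.agentId}`;
  }
}
```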

The 15-step lifecycle (`runAgent()`)

Model resolution — caller override > agent definition > parent model > default. Read-only agents default to Haiku.
Agent ID — agent-<hex>. Override path supports resuming a backgrounded agent.
Context preparation — fork agents clone parent history (after filterIncompleteToolCalls()); fresh agents start empty.
CLAUDE.md stripping — read-only agents (Explore, Plan) omit project instructions. Saves ~10.2% of fleet cache_creation tokens.
Permission isolation — per-agent getAppState() overlay. Permissive parent modes (bypass, acceptEdits) always win.
Tool resolution — fork agents reuse parent's exact array byte-for-byte; normal agents apply allow/deny lists. General-purpose agents cannot spawn sub-agents (prevents exponential fan-out).
System prompt — fork agents inherit pre-rendered bytes; normal agents call agentDef.getSystemPrompt(ctx).
Abort controller — sync agents share parent's controller (Esc kills both). Async agents get an independent one (survive parent abort).
Hook registration — agent-id-scoped, auto-cleanup on termination.
Skill preloading — declared in frontmatter, loaded concurrently to mask latency, prepended as a user message.
MCP initialization — inline servers (cleaned on termination) or shared configs (memoized, persistent). Must complete before context creation so tools are in the pool when snapshotted.
Context creation — createSubagentContext() makes isolation decisions:

| Aspect | Sync | Async |
| --- | --- | --- |
| setAppState | shared | isolated |
| setAppStateForTasks | shared | shared |
| readFileState | own cache | own cache |
| abortController | parent's | independent |

Cache-safe params callback — for bg agents; lets the summarization service fork the conversation with cache-identical prefix.
Query loop — same query() function. Yields back to caller, records to sidechain JSONL transcript, forwards metrics.
Cleanup (finally) — MCP cleanup, hook clear, agent tracking, file cache, message GC, kill orphan shell tasks, remove agent's todos.
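
The precedence chain in the model-resolution step can be sketched in a few lines. The interface and the `"default"` placeholder are assumptions; the Haiku default for read-only agents is left out because the text does not say where it sits in the chain.

```typescript
interface ModelSources {
  callerOverride?: string;
  agentDefinition?: string;
  parentModel?: string;
}

const FALLBACK_MODEL = "default"; // placeholder label, not a real model ID

// Precedence: caller override > agent definition > parent model > default.
function resolveModel(sources: ModelSources): string {
  return sources.callerOverride ?? sources.agentDefinition ?? sources.parentModel ?? FALLBACK_MODEL;
}
```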

Fork agents: cache-driven subprocess design

The point of a fork is byte-identical request prefix to the parent, so children pay 10% input-token cost.

Three mechanisms make this work:

System prompt threading — pass parent's already-rendered bytes via override.systemPrompt. Don't regenerate; feature flags or session date may have changed.
Exact tool passthrough — useExactTools: true. No filtering, no reordering, no re-serialization. Even forbidden tools (like AgentTool itself) stay in the array — runtime guards prevent misuse.
Placeholder tool results — buildForkedMessages() clones the parent's last assistant message. For each tool_use, it inserts a constant placeholder string "Fork started -- processing in background". Same string for every child → same bytes.

Resulting structure: [...shared_history, assistant(all_tool_uses), user(placeholders..., directive)].

Only the final directive differs across children. With a 48,500-token shared prefix and 5 children, savings exceed 90% on input tokens for children 2–5.
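
To make the structure concrete, here is a hedged sketch of what a `buildForkedMessages()`-style helper could look like. The message shapes are simplified stand-ins rather than the real API types; only the placeholder string is taken from the text.

```typescript
const FORK_PLACEHOLDER = "Fork started -- processing in background";

interface ToolUse { id: string; name: string }
interface AssistantTurn { role: "assistant"; toolUses: ToolUse[] }
interface UserTurn {
  role: "user";
  toolResults: { toolUseId: string; content: string }[];
  directive: string;
}

// Every child receives the same assistant turn and the same placeholder
// results; only `directive` differs, so the shared prefix stays identical.
function buildForkedMessages(lastAssistant: AssistantTurn, directive: string): [AssistantTurn, UserTurn] {
  const toolResults = lastAssistant.toolUses.map((t) => ({
    toolUseId: t.id,
    content: FORK_PLACEHOLDER,
  }));
  return [
    { role: "assistant", toolUses: lastAssistant.toolUses.slice() },
    { role: "user", toolResults, directive },
  ];
}
```

Serializing two children built from the same parent turn should produce identical output up to the directive, which is the property the cache depends on.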

When fork is disabled

Coordinator mode — coordinators have a structured-delegation prompt children would inappropriately inherit.
Non-interactive — fork uses permissionMode: 'bubble', which needs a user-facing prompt.
Explicit subagent_type — the user picked Explore/Plan/etc, so fork yields.

Recursive fork prevention (defense in depth)

Primary: child's context.options.querySource = 'agent:builtin:fork'. AgentTool checks this before allowing fork.
Fallback: scan message history for the boilerplate XML tag if querySource was lost in transit.

Six built-in agent archetypes

| Archetype | Model | Tools | Notable |
| --- | --- | --- | --- |
| General-purpose | Default | All except Agent | Workhorse |
| Explore | Haiku | Read-only | Omits CLAUDE.md, one-shot prompt (saves 135 chars/invocation) |
| Plan | Inherit | Read-only | 4-step process, must end with "Critical Files" list |
| Verification | Inherit | Read-only, async | System prompt explicitly anti-rationalization; requires adversarial probe |
| Claude Code Guide | Haiku | dontAsk mode | Doc fetcher; system prompt injects user's configured skills/agents/MCP |
| Statusline Setup | Sonnet | Read + Edit only | Narrowly-scoped specialist |

Frontmatter format for user-defined agents

---
description: "When to use this"
tools: [Read, Bash]
disallowedTools: [FileWrite]
model: haiku
permissionMode: dontAsk
maxTurns: 50
skills: [my-skill]
mcpServers: [slack, {my-server: {command: node, args: [./server.js]}}]
hooks:
  PreToolUse:
    - command: "echo validating"
---

# System prompt body in markdown...

Trust hierarchy (least to most trusted): user agents < plugin agents < policy agents < built-in. User-agent hooks/MCP are silently skipped under strictPluginOnlyCustomization — graceful degradation, not error.

11. 🕸️ Multi-agent coordination patterns

Three distinct shapes:

A. Simple background delegation

Fire-and-forget. Tests, searches, lints. No coordination protocol.

B. Coordinator mode

Hierarchical manager-worker. The coordinator gets only three tools: Agent (spawn), SendMessage (talk), TaskStop (kill). That's it. By design.

"The coordinator's job is to think, plan, decompose, and synthesize. Workers do the work."

Critical principle: never delegate understanding. Coordinators must give workers exact file paths, exact line numbers, exact change descriptions — not "based on the research, fix the bug."

Workflow phases:

Research — multiple workers explore in parallel
Synthesis — coordinator (not workers) integrates findings
Implementation — workers receive precise instructions
Verification — workers validate

C. Swarm teams

Peer-to-peer. Same process, isolated via AsyncLocalStorage, file-based mailboxes. Each message has metadata (sender, timestamp, color for UI).

Three interruption levels:

Abort current work — cancel turn, keep operating
Shutdown request — cooperative graceful wind-down
Kill — hard abort via controller

Task state machine (universal)

All background work — bash, sub-agents, remote sessions, teammates, dreams — flows through one state model:

pending → running → { completed | failed | killed }

Seven task types with single-char visual prefixes: local_bash (b), local_agent (a), remote_agent (r), in_process_teammate (t), local_workflow (w), monitor_mcp (m), dream (d).
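
The state model is small enough to encode as a transition table. This sketch follows the diagram literally (for example, it does not allow pending to go straight to killed, which a real implementation might); the names are illustrative.

```typescript
type TaskState = "pending" | "running" | "completed" | "failed" | "killed";

// Allowed next states, read directly from the diagram above.
const ALLOWED_TRANSITIONS: Record<TaskState, TaskState[]> = {
  pending: ["running"],
  running: ["completed", "failed", "killed"],
  completed: [],
  failed: [],
  killed: [],
};

function transitionTask(from: TaskState, to: TaskState): TaskState {
  if (ALLOWED_TRANSITIONS[from].indexOf(to) === -1) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```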

`SendMessage` dispatch order

Bridge (bridge:<session-id>) — cross-machine via Remote Control relays
UDS (uds:<socket-path>) — local IPC via Unix Domain Sockets
In-process — agent IDs / names of running agents
Team mailbox — file-based queue

Killer feature: transparent agent resumption. Sending a message to a "dead" agent automatically resurrects it from its disk transcript. The conversation simply continues.

Command queue invariant

Messages are delivered between tool rounds, never mid-execution. The agent finishes the current turn, then receives new info. No race conditions, no corrupted state. Make this a hard rule — it's the cheapest way to get correctness in multi-agent comms.

Pattern selection

| Scenario | Pattern |
| --- | --- |
| Single bg task | Delegation |
| Multi-file refactor with research phase | Coordinator |
| Long-running collaborative dev | Swarm |

Operational guardrail

A 50-message memory cap on in-process teammates exists because a real production incident reached 36.8 GB across 292 agents. Plan for unbounded fan-out from day one or it will hurt you.

12. 🧠 Memory: file-based persistence + LLM recall

Why files, not a database

Transparency — users open .md files and see exactly what the agent remembers. Trust through observability, not capability.
Modification time is a built-in epistemological signal: "when was this observation recorded?"
Zero infrastructure — no schema migrations, no indexes, no backups.

Layout

~/.claude/projects/<sanitized-git-root>/memory/
  MEMORY.md                        # always loaded; index only; ≤200 lines, ≤25 KB
  user_role.md                     # one memory per file
  feedback_testing.md
  project_migration_q2.md
  team/                            # shared via symlink
  logs/YYYY/MM/YYYY-MM-DD.md       # KAIROS append-only mode

Four-type taxonomy

| Type | Purpose |
| --- | --- |
| user | Role, expertise, preferences |
| feedback | Corrections + validated approaches (lead with rule, then Why: and How to apply: lines) |
| project | Active work context with absolute dates (always convert "Thursday" → 2026-03-05) |
| reference | Pointers to external systems (Linear, Slack channels) |

Derivability test: if git log / git blame / the code itself can answer it, don't memorize it. No code patterns, no architecture, no debug fix recipes.

Frontmatter contract

---
name: <title>
description: <one-line summary used by recall LLM>
type: user | feedback | project | reference
---

<body — for feedback/project, structure as: rule → **Why:** → **How to apply:**>

The description field carries the most weight — it's the LLM-recall index.

Two-tier retrieval

Tier 1 (always loaded): MEMORY.md index (~3,000 tokens for ~150 entries). Lines after 200 are truncated.
Tier 2 (on-demand): an async Sonnet side-query gets the manifest (type, name, date, description), the user's current query, and recent tool history. Returns up to 5 filenames as structured JSON. Validated against the file list to catch hallucination.

This trades a few hundred ms of latency for semantic precision keyword-matching cannot achieve — especially for negation (do NOT use mocks).
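
The hallucination check on the Tier 2 result is easy to sketch: accept only filenames that appear in the manifest, drop duplicates, and cap the list. `validateRecall` is a hypothetical name, and the side-query itself is out of scope here.

```typescript
// Keep only filenames present in the manifest, in returned order,
// without duplicates, and capped at `limit` entries.
function validateRecall(returned: string[], manifest: string[], limit = 5): string[] {
  const accepted: string[] = [];
  for (const file of returned) {
    if (accepted.length === limit) break;
    if (manifest.indexOf(file) !== -1 && accepted.indexOf(file) === -1) {
      accepted.push(file);
    }
  }
  return accepted;
}
```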

Staleness policy

Don't expire. Annotate. Today/yesterday → no caveat. Older → human-readable warning ("This memory is 47 days old — code claims may be outdated"). Models reason better about "47 days ago" than ISO timestamps.
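
A minimal sketch of the annotate-instead-of-expire policy. It approximates "today/yesterday" as an age of at most one elapsed day, which is an assumption; the warning wording is adapted from the example above and the function name is hypothetical.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns an empty string for fresh memories, otherwise a human-readable
// age warning the model can reason about.
function stalenessCaveat(modifiedAtMs: number, nowMs: number): string {
  const ageDays = Math.floor((nowMs - modifiedAtMs) / MS_PER_DAY);
  if (ageDays <= 1) return "";
  return `This memory is ${ageDays} days old; code claims may be outdated`;
}
```

Passing `nowMs` in explicitly keeps the function deterministic, which makes the policy easy to test.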

Write path (two-step)

Write <type>_<topic>.md with frontmatter + body.
Add a one-line pointer to MEMORY.md: - [Title](file.md) — one-line hook.

A background extraction agent runs at loop completion to catch memories the main agent missed.

KAIROS continuous mode

For long-lived sessions, replace two-step writes with append-only daily logs in logs/YYYY/MM/. A separate consolidation pass (after 24h or 5+ modified sessions) merges logs into structured memories.

Security (team paths)

Three-layer validation, all fail-closed:

Input sanitization (null bytes, traversal sequences, Unicode attacks)
String-level path validation with trailing-separator checks
Symlink resolution against the deepest existing ancestor

No partial-success fallbacks. Reject early, reject completely.

13. 🔌 Skills, hooks, plugins — extensibility surface

Skills: two-phase loading

The killer pattern. 50 skills shouldn't cost 50 docs of system-prompt tokens at startup.

Phase 1 (startup): parse YAML frontmatter only — name, description, when_to_use. Inject into system prompt as a directory.
Phase 2 (invocation): load full markdown body, substitute $ARGUMENTS and ${CLAUDE_SESSION_ID}, execute inline shell commands, prepend as a user message.

You pay the token cost only when the skill actually runs.
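
The two phases can be sketched as two small functions over an already-parsed skill file. The `SkillFile` shape and function names are assumptions; `${CLAUDE_SESSION_ID}` substitution and inline shell execution are deliberately left out.

```typescript
interface SkillFile {
  frontmatter: { name: string; description: string };
  body: string;
}

// Phase 1 (startup): only frontmatter reaches the system prompt.
function buildSkillDirectory(skills: SkillFile[]): string {
  return skills.map((s) => `- ${s.frontmatter.name}: ${s.frontmatter.description}`).join("\n");
}

// Phase 2 (invocation): load the body and substitute $ARGUMENTS.
function renderSkill(skill: SkillFile, args: string): string {
  return skill.body.split("$ARGUMENTS").join(args);
}
```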

Skill source priority (highest → lowest)

Managed (policy / enterprise)
User (~/.claude/skills/)
Project (.claude/skills/)
--add-dir flag
Legacy commands
Bundled
MCP (remote, untrusted)

Hard security boundary: MCP skills never execute inline shell commands. External MCP servers are content-only. No exceptions.

Frontmatter controls

name: my-skill
description: ...
when_to_use: ...
disable-model-invocation: false   # block autonomous use
context: fork                     # run as sub-agent with own token budget
paths: ["src/**/*.ts"]            # conditional activation
hooks:
  PreToolUse: [...]

Hooks: 27 events, 6 types

User-configurable:

Command — spawn shell process, read stdout/exit code
Prompt — lightweight LLM call
Agent — multi-turn loop (max 50 turns)
HTTP — POST to remote policy server

Internal:

Callback — programmatically registered
Function — session-scoped TypeScript

Top 5 lifecycle points to know:

| Hook | Fires | Can do |
| --- | --- | --- |
| PreToolUse | Before tool execution | Block / modify / approve / inject context |
| PostToolUse | After successful execution | Inject feedback, replace MCP output |
| Stop | Before Claude concludes | Force continuation (verification loops) |
| SessionStart | Session begin | Cannot block |
| UserPromptSubmit | User submits | Block (input validation) |

Other events span tool lifecycle (PostToolUseFailure, PermissionDenied, PermissionRequest), session (SessionEnd, Setup), subagents (SubagentStart, SubagentStop), compaction (PreCompact, PostCompact), notifications, configuration, file watching, task tracking — 27 in total.

Snapshot security model

captureHooksConfigSnapshot() freezes hook config at startup. If malicious code modifies .claude/settings.json mid-session, the snapshot prevents the change from taking effect. Only the /hooks command or the file watcher can update the live config.

Policy cascade: enterprise hooks cannot be disabled by users; allowManagedHooksOnly restricts to policy-approved hooks.

Exit code semantics (command hooks)

| Code | Meaning |
| --- | --- |
| 0 | success |
| 2 | blocking error (deliberately uncommon to prevent accidental enforcement) |
| other | non-blocking warning |
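
The exit-code mapping is a three-way branch. The outcome labels below are illustrative names rather than identifiers from the source.

```typescript
type HookOutcome = "success" | "blocking_error" | "non_blocking_warning";

// 0 succeeds, 2 blocks, everything else is only a warning.
function interpretHookExit(code: number): HookOutcome {
  if (code === 0) return "success";
  if (code === 2) return "blocking_error";
  return "non_blocking_warning";
}
```

Because a generic failure usually exits with 1, a crashing hook script degrades to a warning instead of silently blocking tool calls.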

Skill ↔ hook integration

When a skill is invoked, its frontmatter hooks register as session-scoped. The skill directory becomes CLAUDE_PLUGIN_ROOT for those hook commands. once: true removes the hook after first execution. For sub-agents, Stop hooks auto-convert to SubagentStop to fire at the correct lifecycle point.

14. 🔗 MCP: the universal external-tool protocol

Skills and hooks extend the agent in-process. MCP (Model Context Protocol) is the standard way third parties extend it out-of-process — across servers, vendors, and trust boundaries. If you want a tool ecosystem you don't control, this is the layer that makes it possible.

Eight transports, three deployment shapes

| Shape | Transport | Use |
| --- | --- | --- |
| Local process | stdio (default) | Subprocess; JSON-RPC over stdin/stdout; no auth |
| Remote server | http | Streamable HTTP; POST + optional SSE |
| | sse | Legacy (pre-2025) |
| | ws | WebSocket bidirectional |
| | claudeai-proxy | Routed via Claude.ai infrastructure |
| In-process | sdk | Control messages over stdin/stdout |
| | InProcessTransport | Direct function calls via queueMicrotask() (63 lines) |
| IDE | sse-ide, ws-ide | Runtime-specific |

Recommendation: start with stdio for local tools. Move to http only when you need remote. Use InProcessTransport for tools you control end-to-end — eliminates subprocess overhead.

Tool wrapping (4 stages)

External MCP tools must merge into the same Tool interface as built-ins. Four transformations:

Name normalization → mcp__{server}__{tool}. Invalid characters become underscores. Match ^[a-zA-Z0-9_-]{1,64}$.
Description truncation at 2,048 chars. (Real-world: OpenAPI servers were dumping 15–60 KB descriptions.)
Schema passthrough. Pass MCP input schemas straight through; do not transform.
Annotation mapping. readOnlyHint: true → enables concurrent execution. destructiveHint: true → triggers stricter permission checks.

After wrapping, MCP tools are indistinguishable from built-ins at the loop level. The same 14-step execution pipeline runs.
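
The first two wrapping stages are mechanical and can be sketched directly. The helper names are hypothetical, and clipping an over-long name to 64 characters is an assumption: the text gives the pattern names must match, not how violations are handled.

```typescript
const MAX_DESCRIPTION_CHARS = 2048;

// Replace anything outside [a-zA-Z0-9_-] with an underscore.
function normalizeNamePart(part: string): string {
  return part.replace(/[^a-zA-Z0-9_-]/g, "_");
}

// mcp__{server}__{tool}; clipping to 64 characters is an assumption.
function wrapToolName(server: string, tool: string): string {
  return `mcp__${normalizeNamePart(server)}__${normalizeNamePart(tool)}`.slice(0, 64);
}

function truncateDescription(description: string): string {
  return description.length > MAX_DESCRIPTION_CHARS
    ? description.slice(0, MAX_DESCRIPTION_CHARS)
    : description;
}
```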

Configuration scopes (7 sources, content-deduplicated)

| Scope | Source | Trust |
| --- | --- | --- |
| local | .mcp.json in project | User approval required |
| user | ~/.claude.json | User-managed |
| project | Project-level | Shared |
| enterprise | Org-managed | Pre-approved |
| managed | Plugin-provided | Auto-discovered |
| claudeai | Web interface | Pre-authorized |
| dynamic | SDK injection | Programmatic |

Servers with matching command/args (or URLs) are deduplicated by content, not by name. Two configs naming the same binary differently still merge.

OAuth (RFC 9728 + RFC 8414)

Discovery chain when a server returns 401:

Probe /.well-known/oauth-protected-resource for authorization-server metadata.
Fall back to RFC 8414 discovery against the MCP server itself.
Use configured authServerMetadataUrl as escape hatch.

Cross-App Access (XAA) enables federated token exchange via identity providers. Real-world spec violations are common — normalizeOAuthErrorBody() rewrites Slack's "200 with error body" responses to a proper HTTP 400. Plan for spec drift on day one.

Server lifecycle

States: connected, failed, needs-auth (15-min TTL cache), pending, disabled.
Spawn batching: local in batches of 3, remote in batches of 20 — protects against file-descriptor exhaustion.
Session-expiry detection: Streamable HTTP returns 404 + JSON-RPC code -32001 → reconnect + single retry.

Timeout layers

| Layer | Duration | Why |
| --- | --- | --- |
| Connection | 30 s | Unreachable / slow servers |
| Per-request | 60 s | Fresh AbortSignal per request |
| Tool call | ~27.8 h | Legitimate long-running operations |
| Auth | 30 s | Unreachable OAuth servers |

Trap: if you reuse a single AbortSignal across requests it expires during idle periods. wrapFetchWithTimeout() creates a fresh signal per request. Memorize this.

Critical security rule

MCP skills never execute inline shell commands. External servers are content-only. Every other extension surface (user skills, project skills) can run shell; MCP cannot. This is the single most important MCP rule and the one you will be tempted to break.

`InProcessTransport` in 63 lines

Two key mechanics:

send() delivers via queueMicrotask() — prevents stack-depth blow-ups on synchronous request/response cycles.
close() cascades to peer transport — no half-open connection states.

If you are wrapping an internal service as an MCP server, this is your reference. Don't subprocess what you can call directly.

15. 🚀 Bootstrap, startup, and rendering performance

The 5-phase pipeline (target: < 300 ms)

| Phase | File | What happens |
| --- | --- | --- |
| 0. Fast-path dispatch | cli.tsx | Inspect args. --version / --help → dynamic-import only that handler, exit. Don't load React, telemetry, MCP. |
| 1. Module-level I/O | main.tsx | Side-effect-fire MDM (security policy) + keychain subprocesses during import evaluation. ~138 ms of module loading runs in parallel with subprocess I/O. |
| 2. Parse and trust | init.ts | Parse args, load config. Enforce a trust boundary dialog. Before: only safe ops (TLS, themes, telemetry). After: env vars and git commands. |
| 3. Setup | setup.ts | Register everything in parallel: commands, agents, hooks, plugins, MCP. Hook config snapshot frozen here. |
| 4. Launch | replLauncher.ts | Seven entry paths converge: REPL, print, SDK, resume, continue, pipe, headless. All call the same query() loop. |

Other startup techniques

API preconnection — fire a HEAD to the Anthropic API during init. TCP+TLS handshake (100–200 ms) overlaps with setup. Connection is warm by the time the user submits.
Dynamic import for heavy libs — OpenTelemetry, provider SDKs, React for non-REPL paths.
50+ profiling checkpoints sampled at 100% of internal users / 0.5% of external. Without instrumentation you can't tell what to optimize.

Search performance (270K+ paths)

Three layers:

Bitmap pre-filter — assign each path a 26-bit mask of contained lowercase letters. Reject query: one integer comparison (charBits[i] & needleBitmap) !== needleBitmap. Rejects 10–90% at 4 bytes/entry.
Score-bound rejection — skip paths that can't beat the current top score before expensive scoring.
Async indexing with partial queryability — yield every ~4 ms. Search begins within 5–10 ms of index availability.
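
The bitmap pre-filter from the first layer fits in a few lines. This sketch folds case with `toLowerCase()`, which is an assumption, and all names are illustrative; scoring and async indexing are not shown.

```typescript
// Bit i is set when the text contains the letter 'a' + i.
function letterMask(text: string): number {
  let mask = 0;
  const lower = text.toLowerCase();
  for (let i = 0; i < lower.length; i++) {
    const offset = lower.charCodeAt(i) - 97;
    if (offset >= 0 && offset < 26) mask |= 1 << offset;
  }
  return mask;
}

// A path can only match if it contains every letter of the query,
// so one AND plus one comparison rejects it.
function mightMatch(pathMask: number, needleMask: number): boolean {
  return (pathMask & needleMask) === needleMask;
}

const indexedPaths = ["src/index.ts", "docs/readme.md", "lib/query.ts"].map((path) => ({
  path,
  bits: letterMask(path), // computed once at index time
}));
const needleBitmap = letterMask("qry");
const survivingPaths = indexedPaths
  .filter((entry) => mightMatch(entry.bits, needleBitmap))
  .map((entry) => entry.path); // ["lib/query.ts"]
```

The filter yields false positives (letters present but in the wrong order) and never false negatives, so the expensive scorer still runs on every path that could match.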

Rendering: patterns that transfer beyond the terminal

Claude Code forks Ink because stock Ink allocates one JS object per cell per frame — at 200×120 that's 24,000 GC'd objects every 16 ms. Whatever you're rendering, the lessons transfer:

Double-buffer + atomic write. Two persistent Frame objects; render into the back, swap pointers (no allocation), write the diff in one syscall wrapped in BSU/ESU (Begin/End Synchronized Update). No tearing.
Cell-level diffing with damage rectangles. Compute the bounding box of writes; diff only inside it. ~6× reduction in compare work for localized updates.
Three interning pools (chars, styles, hyperlinks) → integer IDs everywhere. Style transitions become a single pre-cached string lookup. Pools generationally reset every 5 min.
Frame throttling. 60 fps focused, 30 fps blurred (throttle(deferredRender, FRAME_INTERVAL_MS)). Scroll events get a tighter 4 ms schedule.
Pack related data. Two Int32 words per cell beats scattered objects — better cache behavior, faster compare, fewer allocations.
Lazy expensive work. Syntax highlighting via React Suspense — code shows unstyled first, colors paint moments later.
Separate hot paths from React. Direct DOM mutation + microtask scheduling for scroll. React handles the final paint, where it's already efficient.

The thesis: performance is not making operations fast; it is eliminating operations entirely.

16. 📋 The 10 foundational patterns (cheat sheet)

| # | Pattern | Why it matters |
| --- | --- | --- |
| 1 | AsyncGenerator-based loops | Natural backpressure, clean cancellation via .return(), typed terminal states |
| 2 | Speculative tool execution | Run safe read-only tools while the model is still streaming → noticeable latency cut |
| 3 | Concurrent-safe batching | Partition by per-invocation safety; serial isolates side effects |
| 4 | Fork agents for cache sharing | Byte-identical prefixes ⇒ ~95% input-token savings on children |
| 5 | 4-layer context compression | snip → microcompact → collapse → autocompact, in that order |
| 6 | File-based memory + LLM recall | Beats embeddings for negation and intent-aware retrieval; zero infra |
| 7 | Two-phase skill loading | Frontmatter at startup, body on invocation |
| 8 | Sticky latches | Cache-influencing flags become write-once for the session |
| 9 | Slot reservation | 8K default output, 64K on demand; recovers 12–28% of context |
| 10 | Hook config snapshots | Freeze at boot; defense against mid-session injection from a malicious repo |

17. 🗺️ Build-your-own: a 14-step roadmap

A pragmatic order to implement these in. Each step compiles and runs on its own.

Tool interface + factory. Define Tool<I, O, P>, buildTool() with safe defaults, and a ToolResult type. Ship one tool: Read. Test the Zod-based JSON Schema generation.
Query loop v0. Async generator. No tools, no compression, just stream the model and yield messages. Return a Terminal discriminated union.
Tool execution path. Add the 14-step pipeline as one function. Wire the loop to call it on tool_use blocks. Always pair tool_use with a tool_result, even on error.
Permission modes + rules. Implement default, acceptEdits, plan, bypassPermissions. Add the resolution chain. Skip auto (LLM classifier) for now.
Concurrency partition + executor. partitionToolCalls() + a serial/concurrent executor. Add isConcurrencySafe() to every tool. Yield results in submission order.
Hook system v0. Two events: PreToolUse, PostToolUse. Command hooks only (shell process, exit codes). Capture a snapshot at startup.
State split. Mutable singleton STATE for infra (cwd, model, session id). Tiny reactive store for UI (messages, approvals).
Multi-provider client factory. Direct API first. Stub the others. buildFetch wrapper for client-request-id header.
Prompt caching architecture. System-prompt boundary marker. Static prefix (cache scope: global if no MCP). Dynamic suffix per-session. Implement one sticky latch as proof.
Compression v1: snip + microcompact. Skip collapse and autocompact for now. Wire the budget thresholds.
Streaming tool executor. Watch the streaming SSE. Start safe tools when their tool_use is fully parsed. Buffer to preserve submission order.
AgentTool + sub-agent lifecycle. Re-enter query() with isolated context. Implement the cleanup finally block. Skip fork agents.
Memory. File layout, frontmatter contract, two-tier retrieval (index + LLM recall side-query). Four types only.
Skills (two-phase) + slash commands. Frontmatter at startup; body at invocation; $ARGUMENTS substitution. Add EXTRA_DIRS resolution order.

Save for later (don't build until step 14 lands): fork agents, swarm teams, remote tasks, KAIROS continuous mode, auto-mode permission classifier, MCP transport layer, terminal renderer optimization, bitmap search index.

18. ⚠️ Anti-patterns and pitfalls

Loop / control flow

❌ Callbacks or event emitters for the agent loop. You'll re-invent backpressure poorly. Use async function*.
❌ A single error terminal state. Loses information. Encode 10+ specific reasons in a discriminated union.
❌ Stop hooks on error responses. Creates error → hook blocks → retry → error infinite loops. Skip them.
❌ Forgetting to pair tool_use with tool_result on abort. API will reject the next message. Drain queued tools with synthetic results on every cancellation path.

Tools

❌ A constructor literal instead of a factory. Defaults will be unsafe. Always go through buildTool().
❌ Per-tool-type concurrency safety. Bash is sometimes safe, sometimes not. Pass parsed input.
❌ Concatenating built-ins and MCP tools then sorting flat. Cache breakpoint dies. Sort within partition, then concat.
❌ Returning huge raw output. Cap with maxResultSizeChars. Persist to disk + return preview.
❌ Using the SDK's BetaMessageStream. O(n²) JSON re-parsing. Read raw stream events.

Permissions

❌ Scattering if mode === ... checks throughout tool code. Centralize in modes + the resolution chain.
❌ Trusting a partial bash parse. If parseForSecurity() fails, treat the command as unsafe.
❌ Sub-agent default = default mode. It needs a UI to prompt; bg agents have none. Default to bubble (sync) or dontAsk (async).

Caching / API

❌ Runtime conditionals in the static prompt prefix. Each one doubles cache key space. Move below the boundary.
❌ Mid-session feature toggles that change request headers. Use sticky latches.
❌ Reserving 64K output tokens by default. Over-reserve 8–16×. Cap at 8K, escalate on demand.
❌ Regenerating the system prompt for fork children. Feature flags or session date may have moved. Pass parent's bytes.
❌ Filtering tools per child agent in fork mode. Different array → different cache key. Use useExactTools: true and runtime guards.

Memory

❌ Storing what git log can answer. Code patterns, fix recipes, who-changed-what. Useless duplication that goes stale.
❌ Embedding-only retrieval. Misses negation ("do NOT mock the DB"). Use LLM recall over a manifest.
❌ Hard expiration. Annotate with age; let the model decide. Stale memories are still data.
❌ Letting MEMORY.md grow past 200 lines. Truncated silently. Treat the index as a budget.

Multi-agent

❌ Coordinators with the full tool set. They'll do the work themselves. Restrict to Agent, SendMessage, TaskStop.
❌ Workers asked to "based on the research, implement X." They re-derive context, miss specifics, hallucinate paths. Synthesis is the coordinator's job.
❌ Mid-tool-execution message delivery. Race conditions. Queue at tool-round boundaries.
❌ Unbounded teammate state. 36.8 GB / 292 agents was a real production incident. Cap message history.
❌ General-purpose agents that can spawn Agent. Exponential fan-out. Block recursive spawning at the schema level.

Bootstrap / hooks

❌ Loading the world for --version. Fast-path dispatch first, full bootstrap second.
❌ Hook config that updates live mid-session. Lets a malicious repo redefine permissions after trust dialog. Snapshot at startup; update only via explicit user channel.
❌ Treating MCP skills like local skills. They are content-only. Never execute their inline shell commands.

🎯 Closing thought

The deepest principle in the source book is repeated at every layer: push complexity to the boundaries. Permission resolution, protocol translation, state reconciliation, tool I/O — these are the messy edges. Concentrate the mess there. Keep the loop, the tool composition, the memory recall, and the streaming logic clean and exhaustively typed.

If you remember nothing else: most of this system is generators yielding strongly-typed events through a series of small modules, with a few critical caches and a few critical safety doors. Build it in that order.

19. 📖 Glossary

Quick reference for the jargon used throughout this guide.

| Term | Meaning |
| --- | --- |
| AsyncGenerator | A JS function declared async function*. Yields values lazily, pauses at each yield until consumer calls .next(). Provides backpressure and clean cancellation. |
| Backpressure | The producer pauses when the consumer can't keep up. Generators give it for free; event emitters do not. |
| Cache breakpoint | The byte position in the prompt where the prompt cache stops matching. Move volatile content after the breakpoint to maximize hit rate. |
| Concurrency-safe | A tool invocation that can run in parallel with others without observable side effects. Determined per-input, not per-tool-type. |
| Context window | The token budget for a single API call (prompt + output). When you exceed it the API rejects the request. |
| Discriminated union | A type made of variants tagged by a literal field (`{ kind: 'completed' } \| …`). |
| Fork agent | A sub-agent that inherits the parent's byte-identical prompt prefix to maximize prompt-cache hits (~95% input-token discount on children 2…N). |
| Frontmatter | The YAML block at the top of a .md file (between two --- lines). Used for skill/agent/memory metadata. |
| Hook | A user/plugin/policy interceptor at one of 27 lifecycle events. Can block, modify, or inject. |
| MCP | Model Context Protocol: the JSON-RPC standard for connecting external tool servers to an agent. Eight transports. |
| Microcompact | Layer 2 of context compression. Removes tool results by tool_use_id when no longer needed. |
| Prompt cache | Anthropic's server-side cache of prompt prefixes. ~90% discount on cached input tokens. Entire architecture revolves around preserving hits. |
| Reservoir sampling | Algorithm R. Maintain a fixed-size random sample of an unbounded stream. Used here for latency histograms (1,024 entries → accurate p50/p95/p99). |
| Slot reservation | The max_tokens value sent to the API. Default cap 8K, escalate to 64K on truncation (<1% of requests). Reclaims 12–28% of context. |
| Speculative execution | Starting tools while the model is still streaming, before the assistant message completes. Saves hundreds of ms when read-only tools dominate. |
| Sticky latch | A write-once boolean (`null \| …`); once set, it cannot be unset for the rest of the session. |
| Sub-agent | A child agent spawned via AgentTool. New query() generator with isolated message history. Sync (parent waits) or async (background). |
| Synthetic tool result | A fabricated tool_result block emitted on cancellation so the API doesn't see a tool_use without a matching result. |
| Terminal state | The discriminated-union value the agent loop returns (vs. yields). Encodes why execution stopped (10 distinct reasons). |
| tool_use / tool_result | Anthropic API blocks. Every tool_use in an assistant message must be paired with a tool_result in the next user message. The single most common bug source. |
| Two-phase skill loading | Frontmatter loaded into the system prompt at startup; full body loaded only on invocation. Lets you ship 50+ skills cheaply. |

Sources

Repo: https://github.com/alejandrobalderas/claude-code-from-source (raw chapter markdown — primary source)
Companion site: https://claude-code-from-source.com (live, returns HTTP 200; WebFetch was Cloudflare-blocked, content retrieved via direct curl + WebSearch index)
Chapters analyzed: 1 (Architecture), 2 (Bootstrap), 3 (State), 4 (API Layer), 5 (Agent Loop), 6 (Tools), 7 (Concurrency), 8 (Sub-Agents), 9 (Fork Agents), 10 (Coordination), 11 (Memory), 12 (Extensibility), 13 (Terminal UI), 15 (MCP), 17 (Performance), 18 (Epilogue).

The source repo is purely educational and contains no source code from Claude Code — only original pseudocode derived from npm source maps. This guide follows the same convention.

If you found this helpful, let me know by leaving a 👍 or a comment! Or, if you think this post could help someone, feel free to share it. Thank you very much! 😃

What if you could fine-tune any HuggingFace model on TPUs — using PyTorch code?

Here is what the end result looks like:

import torchax as tx
import torchax.train

# One function: forward → loss → gradients → optimizer update
step_fn = tx.train.make_train_step(model_fn, loss_fn, optimizer)

# Training loop
for batch in dataloader:
    loss, params, opt_state = step_fn(params, buffers, opt_state, batch, batch["labels"])

Your PyTorch model. JAX's training primitives. Running on TPU. No rewrite needed.

In the first part of this series, we ran HuggingFace models on JAX for fast inference. Now we take the next step: training. We will instruction-tune Gemma 3 1B on the Databricks Dolly 15k dataset using LoRA and torchax's functional training API — all on a free Colab TPU.

Why Train on TPUs?

Google's Tensor Processing Units (TPUs) are purpose-built for matrix operations — the bread and butter of deep learning. Free Colab gives you access to a TPU v2-8 with ~15GB of high-bandwidth memory. That is enough to fine-tune a 1B parameter model with LoRA.

But training on TPUs traditionally meant rewriting your model in JAX (Flax, Equinox) or using PyTorch/XLA. torchax offers a third path: keep your PyTorch model, but use JAX's functional training primitives.

How torchax Training Differs from Standard PyTorch

| Standard PyTorch | torchax |
| --- | --- |
| loss.backward() | jax.value_and_grad(loss_fn)(params, ...) |
| optimizer.step() | optax.apply_updates(params, updates) |
| Model holds its own state | Params and buffers are separate pytrees |
| Eager execution | JIT-compiled training steps |

The key difference: functional training. Instead of calling loss.backward() and optimizer.step() on a stateful model, torchax separates the model into immutable weight pytrees and passes them through pure functions. This is what enables JAX's jax.jit to compile the entire training step into a single optimized program.

Prerequisites & Setup

What you need:

Python 3.10+
Basic familiarity with PyTorch and HuggingFace transformers
A Google Colab account (free tier works with LoRA)

Zero-setup option: Click the Colab badge above. The notebook handles all installation automatically.

Local setup:

# PyTorch CPU (torchax handles the accelerator via JAX)
pip install torch --index-url https://download.pytorch.org/whl/cpu

# JAX + all training dependencies in a single pip call
pip install -U 'jax[tpu]' torchax transformers flax peft datasets optax   # TPU
# pip install -U 'jax[cuda12]' torchax transformers flax peft datasets optax  # GPU

Colab note: The notebook installs packages and automatically restarts the runtime, since Colab pre-loads an older JAX that stays cached in memory until restart.

Key Concepts for Training

Before writing code, let's understand the four concepts that make torchax training work.

1. Param/Buffer Separation

JAX's jax.value_and_grad needs to know which inputs to differentiate. In standard PyTorch, the model owns its weights. In torchax training, we explicitly separate:

params — trainable parameters (get gradients)
buffers — everything else (frozen weights, running stats, constants)

params = {n: p for n, p in model.named_parameters() if p.requires_grad}
frozen = {n: p for n, p in model.named_parameters() if not p.requires_grad}
buffers = dict(model.named_buffers())
buffers.update(frozen)

For LoRA, params contains only the tiny adapter weights (well under 1% of the model). For full fine-tuning, it contains everything.

2. optax Optimizers

Unlike PyTorch optimizers (which carry hidden mutable state), optax optimizers are pure functions:

# PyTorch: hidden state inside optimizer
optimizer.step()

# optax: explicit state, no hidden pockets
updates, new_opt_state = optimizer.update(grads, opt_state, params)
new_params = optax.apply_updates(params, updates)

This functional design means the optimizer state is just another pytree that flows through the training step — perfect for jax.jit.
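
To make the "explicit state" idea concrete without installing anything, here is a plain-Python stand-in that copies the shape of the init/update pattern. It is not optax: `sgd_momentum`, `Optimizer`, and this `apply_updates` are hypothetical helpers, and parameters are plain floats rather than arrays.

```python
from typing import Callable, Dict, NamedTuple

Params = Dict[str, float]


class Optimizer(NamedTuple):
    init: Callable      # params -> opt_state
    update: Callable    # (grads, opt_state, params) -> (updates, new_opt_state)


def sgd_momentum(learning_rate: float, momentum: float = 0.9) -> Optimizer:
    def init(params: Params) -> Params:
        # One momentum accumulator per parameter, starting at zero.
        return {name: 0.0 for name in params}

    def update(grads: Params, state: Params, params: Params = None):
        # Build new dicts instead of mutating the inputs.
        new_state = {n: momentum * state[n] + grads[n] for n in grads}
        updates = {n: -learning_rate * new_state[n] for n in grads}
        return updates, new_state

    return Optimizer(init, update)


def apply_updates(params: Params, updates: Params) -> Params:
    return {n: params[n] + updates[n] for n in params}


params = {"w": 1.0, "b": 0.0}
optimizer = sgd_momentum(learning_rate=0.1)
opt_state = optimizer.init(params)

grads = {"w": 2.0, "b": -1.0}
updates, new_opt_state = optimizer.update(grads, opt_state, params)
new_params = apply_updates(params, updates)
```

Nothing is modified in place: `params` and `opt_state` still hold their old values after the call, which is the property that lets a compiler treat the whole step as one pure function.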

3. make_train_step

torchax.train.make_train_step() is the central API. It composes three pieces into a single JIT-compilable function:

model_fn — a pure function: (weights, buffers, batch) → output
loss_fn — extracts the scalar loss: (output, labels) → loss
optimizer — an optax optimizer

The result is step_fn(params, buffers, opt_state, batch, labels) → (loss, new_params, new_opt_state).

Under the hood, this uses jax.value_and_grad for efficient gradient computation and optax.apply_updates for weight updates — all compiled into a single XLA program.
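
As a rough mental model of that composition, here is a toy version in plain Python. It is a sketch under heavy assumptions, not torchax's implementation: `make_toy_train_step` is a hypothetical name, gradients come from central finite differences instead of jax.value_and_grad, the update is plain gradient descent instead of an optax optimizer (so there is no opt_state), and nothing is JIT-compiled.

```python
def make_toy_train_step(model_fn, loss_fn, learning_rate, eps=1e-6):
    """Compose forward pass, loss, gradient, and update into one pure function."""

    def loss_of(params, buffers, batch, labels):
        return loss_fn(model_fn(params, buffers, batch), labels)

    def step_fn(params, buffers, batch, labels):
        loss = loss_of(params, buffers, batch, labels)
        grads = {}
        for name in params:
            # Central finite difference with respect to one trainable parameter.
            up = {**params, name: params[name] + eps}
            down = {**params, name: params[name] - eps}
            grads[name] = (
                loss_of(up, buffers, batch, labels)
                - loss_of(down, buffers, batch, labels)
            ) / (2 * eps)
        # Return new params instead of mutating the old ones.
        new_params = {n: params[n] - learning_rate * grads[n] for n in params}
        return loss, new_params

    return step_fn


def model_fn(weights, buffers, batch):
    # y = w * (x * scale) + b, where scale is a frozen buffer (no gradient).
    return [weights["w"] * x * buffers["scale"] + weights["b"] for x in batch]


def loss_fn(outputs, labels):
    # Mean squared error.
    return sum((o - y) ** 2 for o, y in zip(outputs, labels)) / len(labels)


step_fn = make_toy_train_step(model_fn, loss_fn, learning_rate=0.05)
params, buffers = {"w": 0.0, "b": 0.0}, {"scale": 1.0}
batch, labels = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]

losses = []
for _ in range(200):
    loss, params = step_fn(params, buffers, batch, labels)
    losses.append(loss)
```

Only the entries in `params` are differentiated and updated; `buffers` passes through untouched, which mirrors the param/buffer split described earlier.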

4. Full Fine-Tuning vs LoRA

| | Full Fine-Tuning | LoRA |
| --- | --- | --- |
| Trainable params | All (~2B) | Tiny adapters (~0.5%) |
| Memory | ~18-20 GB | ~5-7 GB |
| Speed | Slower | Faster |
| Quality | Higher ceiling | Nearly as good |
| Free Colab TPU | Tight / may OOM | Fits comfortably |

LoRA (Low-Rank Adaptation) freezes the base model and adds small trainable matrices to attention layers. Instead of updating the full weight matrix W, it learns a low-rank decomposition: W + (α/r) × B·A where A and B are tiny matrices.
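
The formula is small enough to work through by hand. The sketch below uses nested Python lists and made-up 2×3 numbers purely for illustration; the helper names (`matmul`, `lora_effective_weight`, `lora_param_count`) are hypothetical and have nothing to do with PEFT's internals.

```python
def matmul(X, Y):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A)."""
    delta = matmul(B, A)
    scale = alpha / r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


def lora_param_count(d_out, d_in, r):
    """Trainable values in one adapter: B is d_out x r, A is r x d_in."""
    return r * (d_in + d_out)


# Toy frozen weight with d_out=2, d_in=3 and a rank-1 adapter.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
B = [[1.0],
     [2.0]]               # d_out x r
A = [[0.5, 0.0, -0.5]]    # r x d_in

W_eff = lora_effective_weight(W, A, B, alpha=2, r=1)

# For a hypothetical 1024 x 1024 projection with r=8:
full = 1024 * 1024
adapter = lora_param_count(1024, 1024, r=8)
```

For the hypothetical 1024×1024 layer, the adapter holds 16,384 trainable values against 1,048,576 in the full matrix, which is where the large memory savings come from.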

For free Colab, LoRA is the recommended path.

Step 1: Load and Prepare the Dataset

We use Databricks Dolly 15k — 15,000 human-written instruction-response pairs across 7 categories (QA, summarization, brainstorming, etc.).

import datasets as hf_datasets
from transformers import AutoTokenizer

MODEL_NAME = "google/gemma-3-1b-it"
DATASET_NAME = "databricks/databricks-dolly-15k"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

raw_dataset = hf_datasets.load_dataset(DATASET_NAME, split="train")

Each example has an instruction, optional context, response, and category. We format these into Gemma's chat template:

def format_example(example):
    user_content = example["instruction"]
    if example.get("context", ""):
        user_content += f"\n\nContext: {example['context']}"

    messages = [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": example["response"]},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    return {"text": text}

Then tokenize and create dataloaders:

from torch.utils.data import DataLoader
from transformers import DataCollatorForLanguageModeling

# Subset, split, tokenize
subset = raw_dataset.shuffle(seed=42).select(range(2200))
split = subset.train_test_split(test_size=200, seed=42)

def tokenize_example(example):
    formatted = format_example(example)
    return tokenizer(formatted["text"], padding="max_length", max_length=512, truncation=True)

train_tokenized = split["train"].map(tokenize_example, remove_columns=split["train"].column_names)
eval_tokenized = split["test"].map(tokenize_example, remove_columns=split["test"].column_names)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
train_dataloader = DataLoader(train_tokenized, shuffle=True, collate_fn=collator, batch_size=2)
eval_dataloader = DataLoader(eval_tokenized, shuffle=False, collate_fn=collator, batch_size=2)

Step 2: Load the Model and Apply LoRA

Here is where the torchax pattern matters: load the model with torchax disabled, then enable it before moving to JAX.

import torch
import torchax as tx
import peft
import transformers  # needed for transformers.AutoModelForCausalLM below

# Load model with torchax disabled to avoid intercepting init ops
with tx.disable_temporarily():
    model = transformers.AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, torch_dtype=torch.bfloat16
    )

# Sync pad_token_id so loss computation properly ignores padding
model.config.pad_token_id = tokenizer.pad_token_id

Why disable? HuggingFace model initialization uses operations (like in-place tensor filling) that torchax does not support. Disabling torchax during loading keeps everything on CPU, then we move to JAX after.

Now apply LoRA:

peft_config = peft.LoraConfig(
    task_type=peft.TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,                             # Rank of the LoRA matrices
    lora_alpha=16,                   # Scaling factor
    lora_dropout=0.0,                # 0.0 for bfloat16 numerical stability
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # All attention layers
)
model = peft.get_peft_model(model, peft_config)
model.print_trainable_parameters()
# Output: trainable params: 5,767,168 || all params: 2,619,206,656 || trainable%: 0.22%

Only 0.22% of parameters are trainable — that is the power of LoRA.

Finally, enable torchax and move to the JAX device:

tx.enable_accuracy_mode()  # Float32 accumulation for bfloat16 stability
tx.enable_globally()
device = torch.device("jax")
model.to(device)
model.train()

Step 3: Baseline Evaluation

Before training, we measure the model's performance to compare against later:

import math

def evaluate_loss(model, dataloader, device, max_batches=50):
    model.eval()
    total_loss, total_batches = 0.0, 0
    with torch.no_grad():
        for i, batch in enumerate(dataloader):
            if i >= max_batches:
                break
            # Drop attention_mask — Gemma's sliding window attention produces NaN
            # with padded masks on torchax/JAX. Labels already mask padding with -100.
            batch = {k: v.to(device) for k, v in batch.items() if k != "attention_mask"}
            outputs = model(**batch)
            total_loss += outputs.loss.item()
            total_batches += 1
    model.train()
    avg_loss = total_loss / max(total_batches, 1)
    return avg_loss, math.exp(min(avg_loss, 100))

baseline_loss, baseline_ppl = evaluate_loss(model, eval_dataloader, device)
print(f"Baseline loss: {baseline_loss:.4f}, perplexity: {baseline_ppl:.2f}")

We also generate sample responses for qualitative comparison. For fast generation, we register StaticCache as a JAX pytree and use KV-cached decoding — only the new token is processed each step instead of the full sequence (~50x faster):

from transformers.cache_utils import StaticCache
from jax.tree_util import register_pytree_node

def _flatten_static_cache(cache):
    return (cache.key_cache, cache.value_cache), (
        cache.config, cache.max_batch_size, cache.max_cache_len,
        getattr(cache, "device", None), getattr(cache, "dtype", None),
    )

def _unflatten_static_cache(aux, children):
    config, max_batch_size, max_cache_len, dev, dtype = aux
    kwargs = {}
    if dev is not None: kwargs["device"] = dev
    if dtype is not None: kwargs["dtype"] = dtype
    sc = StaticCache(config, max_batch_size, max_cache_len, **kwargs)
    sc.key_cache, sc.value_cache = children
    return sc

register_pytree_node(StaticCache, _flatten_static_cache, _unflatten_static_cache)

The generation function uses prefill (process full prompt) then per-token decode with the cache and a tqdm progress bar:

from tqdm.auto import tqdm

def generate_response(model, tokenizer, instruction, device, max_new_tokens=100):
    messages = [{"role": "user", "content": instruction}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)
    seq_len = input_ids.shape[1]

    kv = StaticCache(config=model.config, max_batch_size=1,
                     max_cache_len=seq_len + max_new_tokens,
                     device=device, dtype=torch.bfloat16)
    pos = torch.arange(seq_len, device=device)

    model.eval()
    with torch.no_grad():
        # Prefill: process full prompt, populate cache
        logits, kv = model(input_ids, cache_position=pos, past_key_values=kv,
                           return_dict=False, use_cache=True)
        tok = torch.argmax(logits[:, -1], dim=-1)[:, None]
        generated = [tok[:, 0].item()]
        pos = torch.tensor([seq_len], device=device)

        # Decode: one token at a time using cached keys/values
        for _ in tqdm(range(max_new_tokens - 1), desc="Generating", leave=False):
            logits, kv = model(tok, cache_position=pos, past_key_values=kv,
                               return_dict=False, use_cache=True)
            tok = torch.argmax(logits[:, -1], dim=-1)[:, None]
            tid = tok[:, 0].item()
            if tid == tokenizer.eos_token_id:
                break
            generated.append(tid)
            pos += 1

    model.train()
    return tokenizer.decode(generated, skip_special_tokens=True)

Step 4: Set Up Functional Training

This is where torchax diverges from standard PyTorch. We separate the model, create an optax optimizer, and compose everything into a JIT-compiled training step.

Separate params and buffers

import optax
import torchax.train

params = {n: p for n, p in model.named_parameters() if p.requires_grad}
buffers = dict(model.named_buffers())
frozen_params = {n: p for n, p in model.named_parameters() if not p.requires_grad}
buffers.update(frozen_params)

Create the optimizer

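# decay_steps is the total schedule length (warmup included). Past that point the
# learning rate stays at end_value (0.0 by default), so it should cover every step.
total_steps = len(train_dataloader)  # 1,000 steps here: 2,000 samples / batch size 2
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0, peak_value=1e-4, warmup_steps=50, decay_steps=total_steps
)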
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adamw(learning_rate=schedule, weight_decay=0.01),
)
opt_state = tx.interop.call_jax(optimizer.init, params)

Note tx.interop.call_jax — this bridges optax's JAX calls with torchax tensors.
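
The learning-rate schedule is worth a quick sanity check, because its length interacts with the number of training steps. The function below is a plain-Python approximation of the shape optax documents for warmup_cosine_decay_schedule (linear warmup, then cosine decay, with decay_steps counting the warmup too). `warmup_cosine_lr` is a hypothetical name and this is not optax's code.

```python
import math


def warmup_cosine_lr(step, init_value, peak_value, warmup_steps, decay_steps, end_value=0.0):
    """Linear warmup to peak_value, then cosine decay to end_value.

    decay_steps is the total schedule length, warmup included.
    """
    if step < warmup_steps:
        return init_value + (peak_value - init_value) * step / warmup_steps
    cosine_len = decay_steps - warmup_steps
    # Clamp so the rate stays at end_value once the schedule is over.
    progress = min(step - warmup_steps, cosine_len) / cosine_len
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return end_value + (peak_value - end_value) * cosine


def lr_at(step, decay_steps):
    return warmup_cosine_lr(step, init_value=0.0, peak_value=1e-4,
                            warmup_steps=50, decay_steps=decay_steps)


# Learning rate at step 750 of a 1,000-step run, for two schedule lengths.
lr_short = lr_at(750, decay_steps=500)    # schedule already finished
lr_full = lr_at(750, decay_steps=1000)    # still decaying
```

With 2,000 training samples and a batch size of 2, one epoch is 1,000 steps. A schedule that ends at step 500 leaves the second half of the epoch with a learning rate of zero, so it is worth matching decay_steps to the real step count.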

Define model_fn and loss_fn

def model_fn(weights, buffers, batch):
    """Stateless forward pass using functional_call."""
    return torch.func.functional_call(
        model, {**weights, **buffers}, args=(), kwargs=batch
    )

def loss_fn(model_output, labels):
    """Extract loss from HuggingFace model output."""
    return model_output.loss

torch.func.functional_call runs the model as a pure function — no hidden state, just inputs and outputs. This is what enables JAX to trace and compile it.

Compose into a training step

step_fn = tx.train.make_train_step(model_fn, loss_fn, optimizer)

That single line creates a function that does: forward pass → loss computation → gradient calculation → optimizer update — all compiled into one XLA program.

Step 5: The Training Loop

import time
from tqdm.auto import tqdm

torch.manual_seed(42)
train_losses = []
start_time = time.time()

for epoch in range(1):
    pbar = tqdm(enumerate(train_dataloader), total=len(train_dataloader))
    for step, batch in pbar:
        # Drop attention_mask — Gemma's sliding window attention produces NaN with
        # padded masks on torchax/JAX. Labels already mask padding with -100.
        batch = {k: v.to(device) for k, v in batch.items() if k != "attention_mask"}

        loss, params, opt_state = step_fn(
            params, buffers, opt_state, batch, batch["labels"]
        )

        train_losses.append(loss.item())
        pbar.set_postfix({"loss": f"{loss.item():.4f}"})

elapsed = time.time() - start_time
print(f"Training complete! {len(train_losses)} steps in {elapsed:.0f}s")

What to expect:

Step 1: ~30-60 seconds (JAX compiles the entire training step)
Steps 2+: ~1-3 seconds each (running the compiled program)
Total: ~20-40 minutes for 2000 samples with LoRA on free Colab TPU

The first step is slow because JAX traces through the entire model, loss computation, gradient calculation, and optimizer update — then compiles it all into a single optimized XLA program. Every subsequent step reuses this compiled program.

Step 6: Evaluate the Improvement

After training, we compare against our baseline:

# Load trained params back into model
with torch.no_grad():
    for name, param in params.items():
        parts = name.split(".")
        obj = model
        for part in parts[:-1]:
            obj = getattr(obj, part)
        setattr(obj, parts[-1], torch.nn.Parameter(param))

final_loss, final_ppl = evaluate_loss(model, eval_dataloader, device)

print(f"{'Metric':<20} {'Before':>10} {'After':>10}")
print(f"{'Loss':<20} {baseline_loss:>10.4f} {final_loss:>10.4f}")
print(f"{'Perplexity':<20} {baseline_ppl:>10.2f} {final_ppl:>10.2f}")

You should see loss decrease and perplexity improve after training. The qualitative comparison (generated responses before vs. after) is even more telling — the fine-tuned model produces more focused, instruction-following responses.

Step 7: Save and Reload

Save

Convert JAX arrays back to CPU tensors and save using HuggingFace's standard format:

import numpy as np

save_dir = "./fine_tuned_model"

with torch.no_grad():
    cpu_state_dict = {
        name: torch.tensor(np.array(p)).contiguous()
        for name, p in params.items()
    }
    # safe_serialization=False avoids a safetensors/torchax C-extension conflict on reload
    model.save_pretrained(save_dir, state_dict=cpu_state_dict, safe_serialization=False)

tokenizer.save_pretrained(save_dir)

For LoRA, this saves only the tiny adapter weights (~20MB). For full fine-tuning, it saves the entire model (~4GB).

Reload

with tx.disable_temporarily():
    # For LoRA: load base model + adapters separately
    reloaded_model = transformers.AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, torch_dtype=torch.bfloat16
    )
    # torch_device="cpu" forces PEFT to load adapter weights on CPU,
    # avoiding a safetensors/torchax C-extension conflict.
    reloaded_model = peft.PeftModel.from_pretrained(reloaded_model, save_dir, torch_device="cpu")

reloaded_model.to(device)
reloaded_model.eval()

The pattern is the same as loading: disable torchax, load on CPU, then move to JAX. For LoRA models, you load the base model first, then attach the saved adapters with PeftModel.from_pretrained(). The torch_device="cpu" ensures PEFT loads weights through PyTorch's standard path rather than safetensors' C extension, which conflicts with torchax.

Full Fine-Tuning: When LoRA Is Not Enough

The notebook supports full fine-tuning by changing one setting:

TRAINING_MODE = "full"

This trains all parameters instead of just the LoRA adapters. The trade-off is much higher memory usage. To make it fit on free Colab TPU:

AdaFactor optimizer — uses ~50% less memory than AdamW (stores only row/column statistics instead of per-parameter moments)
Reduced sequence length — MAX_SEQ_LEN = 256 halves activation memory
Smaller batch size — BATCH_SIZE = 1 with higher gradient accumulation steps

USE_ADAFACTOR = True
USE_GRADIENT_CHECKPOINTING = True

if TRAINING_MODE == "full" and USE_ADAFACTOR:
    optimizer = optax.chain(
        optax.clip_by_global_norm(1.0),
        optax.adafactor(learning_rate=schedule),
    )
else:
    optimizer = optax.chain(
        optax.clip_by_global_norm(1.0),
        optax.adamw(learning_rate=schedule, weight_decay=0.01),
    )

Full fine-tuning gives a higher quality ceiling but LoRA gets you 90%+ of the way with a fraction of the compute.
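
To get a feel for why a factored optimizer helps, here is a back-of-the-envelope count of optimizer-state values for a single weight matrix. It is a rough sketch under stated assumptions: the Adam-style optimizer keeps two moments per parameter, the factored optimizer keeps only one row statistic and one column statistic per 2D matrix and no momentum term, and the 1152×6912 shape is a made-up example. It counts optimizer state only, not weights, gradients, or activations.

```python
def num_elements(shape):
    total = 1
    for dim in shape:
        total *= dim
    return total


def adam_state_elems(shape):
    """Adam-style state: first and second moment for every parameter."""
    return 2 * num_elements(shape)


def factored_state_elems(shape):
    """Factored second moment: one value per row plus one per column for 2D matrices."""
    if len(shape) == 2:
        return shape[0] + shape[1]
    # Vectors and scalars are too small to factor; keep one value per parameter.
    return num_elements(shape)


hypothetical_layer = (1152, 6912)
adam_elems = adam_state_elems(hypothetical_layer)
factored_elems = factored_state_elems(hypothetical_layer)
```

The optimizer state for this one matrix drops from about 15.9 million values to about 8 thousand. The total saving for a training run is smaller than that ratio suggests, because weights, gradients, and activations still dominate memory.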

Troubleshooting

| Error | Cause | Fix |
| --- | --- | --- |
| OutOfMemoryError | Model + optimizer too large | Switch to LoRA, reduce BATCH_SIZE or MAX_SEQ_LEN |
| TypeError: not a valid JAX type | Custom HuggingFace type not registered | Register with jax.tree_util.register_pytree_node() |
| Loss is NaN | Numerical instability in bfloat16 | 1. Call tx.enable_accuracy_mode() before tx.enable_globally(). 2. Reduce LR (try 1e-4). 3. Set lora_dropout=0.0. 4. Add optax.clip_by_global_norm(1.0). |
| Slow first step | Normal: JAX JIT compilation | Wait ~30-60s; subsequent steps are fast |
| make_train_step error | API mismatch | Update: pip install -U torchax |

The Big Picture: Inference + Training

With the inference tutorial and this training tutorial, you now have the complete torchax story:

Run any HuggingFace model on TPU (model.to("jax"))
Benchmark with JIT compilation (10-100x speedup)
Fine-tune with LoRA or full training (make_train_step)
Save and reload for production inference

All using PyTorch code. No JAX rewrite needed.

Resources

Notebooks:
- Full training tutorial — all the code from this post, ready to run
- Training quickstart — same pipeline in ~10 cells
- Inference tutorial — Part 1 of this series
References:
- torchax PEFT LoRA example — the official example this tutorial builds on
- Han Qi's tutorial series — the original 3-part series on torchax + HuggingFace

Credits

Han Qi (@qihqi) — author of torchax, PEFT training example, and the original tutorial series
torchax team at Google — library development
HuggingFace — transformers, PEFT, and datasets ecosystem
Databricks — Dolly 15k dataset
JAX team at Google — JAX, XLA, and TPU support

We have a massive problem with how computers find things. If you have a few hundred photos on your phone, you can find one instantly, but trying to find one specific item out of a billion creates a massive technical strain.

The simplest approach is "linear search": compare the query against every stored item, which is like looking through every single page of a ten-million-page book to find one word.

This one-by-one approach makes real-time tools like chatbots or movie recommendations grind to a halt as the data grows. Furthermore, modern data like images or text is represented as "high-dimensional" vectors, and in high dimensions traditional tree-based indexes lose their advantage and end up little faster than checking every item manually.

To fix this, researchers Yu. A. Malkov and D. A. Yashunin changed the rules of the game. They realized that we don't always need a "perfect" match if it takes an hour to find; we often just need a "good enough" match found in a millisecond.

In this article, we are diving into their research paper: "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs."

We will explore how they used a multi-layered "highway system" to make searching a billion items feel as fast as searching a hundred.

Two Methods for Fast Searching

To solve the slowness problem, the researchers combined two concepts from computer science. The first is the "Small World" phenomenon.

You’ve likely heard of "six degrees of separation", the idea that you are connected to anyone on Earth through a short chain of acquaintances.

HNSW treats data the same way. It builds a map where every data point is a "node," and similar points are connected like friends. By jumping from "friend to friend" toward your target, you can navigate a massive dataset in just a few steps.
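
The "friend to friend" walk can be sketched in a few lines. This is a simplified single-layer greedy search on a hand-built toy graph, not the full algorithm from the paper; the graph, the one-dimensional points, and the `greedy_search` name are all assumptions made for illustration.

```python
def greedy_search(graph, vectors, query, entry):
    """Hop to whichever neighbor is closest to the query until no neighbor improves."""

    def dist(node):
        return sum((a - b) ** 2 for a, b in zip(vectors[node], query))

    current, hops = entry, [entry]
    while True:
        best = min(graph[current], key=dist, default=current)
        if dist(best) >= dist(current):
            return current, hops  # local minimum: nobody nearby is closer
        current = best
        hops.append(current)


# Eight points on a line at positions 0..7.
vectors = {i: (float(i),) for i in range(8)}

# A plain chain where each point only knows its immediate neighbors.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4],
         4: [3, 5], 5: [4, 6], 6: [5, 7], 7: [6]}

# The same chain plus one long-range "small world" link between 0 and 5.
small_world = {**chain, 0: [1, 5], 5: [4, 6, 0]}

query = (6.2,)
found_chain, hops_chain = greedy_search(chain, vectors, query, entry=0)
found_sw, hops_sw = greedy_search(small_world, vectors, query, entry=0)
```

Both walks end at point 6, but the chain needs six hops while the graph with the single long-range link needs two. That shortcut effect is what the small-world structure buys.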

The second concept is Hierarchy, which is inspired by a structure called a Skip List.

Imagine a high-rise building where the express elevators only stop at every 10th floor. To get to floor 42, you take the express elevator to floor 40 and then walk up to 42. HNSW creates "express layers" at the top with only a few data points and long-range connections.

These top layers allow the search to "skip" across the entire database to the right neighborhood instantly, before dropping down to the ground floor to find the exact result.

The Multi-Layer Highway System

To bridge the gap between speed and accuracy, the researchers introduced a multi-layer structure called a Hierarchical Navigable Small World (HNSW) graph.

You can think of this as a "highway system" for data. Instead of one giant, flat network where everything is connected, HNSW organizes data into layers.

Top Layers (Express Lanes)

These layers contain only a few data points with long-distance connections. Much like an express highway between major cities, they allow the search to "skip" across the entire dataset to reach the right neighborhood in just a few jumps.

Bottom Layers (Local Streets)

As the search moves down through the hierarchy, the layers become denser with shorter connections. Once the search reaches the ground layer—which contains every single piece of data—it uses these "local streets" to fine-tune the results and find the exact match.

This hierarchy is inspired by a data structure called a Skip List, but instead of simple linked lists, it uses complex proximity graphs.

By separating links by their "distance scales," the algorithm ensures that the search time scales logarithmically. This means that even if the dataset grows from thousands to billions, the number of steps required to find an answer stays manageable and fast.
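
How does a point end up on an express layer? The paper draws each point's top layer at random as floor(-ln(u) · mL), with u uniform in (0, 1) and mL commonly set to 1/ln(M). The sketch below simulates that rule; the function name, the seed, and the sample size are arbitrary choices for illustration.

```python
import math
import random
from collections import Counter


def random_level(M, rng):
    """Draw the top layer for a new point: floor(-ln(u) * mL) with mL = 1 / ln(M)."""
    m_L = 1.0 / math.log(M)
    u = 1.0 - rng.random()  # shift to (0, 1] so log(0) can never happen
    return int(-math.log(u) * m_L)


rng = random.Random(42)
num_points = 100_000
levels = Counter(random_level(16, rng) for _ in range(num_points))

# Share of points whose top layer is the ground layer.
ground_only = levels[0] / num_points
```

With M=16, roughly 15 out of 16 points never leave the ground layer, and each layer above holds about 16 times fewer points than the one below it. A point assigned level L is present in every layer from 0 up to L, so the ground layer still contains everything.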

Balancing Speed and Accuracy

To make this system work, there are a few settings that determine how the graph is built and searched.

Connections per point (M): This is the number of "links" each piece of data creates with its neighbors. If you give each point more connections, the map becomes more detailed and accurate, but it also takes up more memory.
The Build Effort (efConstruction): This determines how much work the system does when first creating the index. A higher setting means the "highways" and "local streets" are better connected, making future searches much more reliable.
The Search Depth (ef): This is a setting you use during the actual search. It tells the computer how many neighbors to check before it decides it has found the best match. You can turn this up at any time to get better results without having to rebuild your entire database.

Building an HNSW Index

To see how this works, we can use a popular library called hnswlib, a C++ implementation with Python bindings released by the paper's authors. The following example shows how to set up a database and perform a search in just a few lines.

Import the Tools

We start by importing hnswlib to build the graph and numpy to handle the coordinates of our "meaning map."

import hnswlib
import numpy as np

Create the "Meaning Map"

Let's give our items coordinates based on two features: Sweetness and Size. Fruits will have high sweetness and low size, while furniture will have low sweetness and high size.

data_labels = ["Apple", "Banana", "Cherry", "Chair", "Table", "Sofa"]
data_vectors = np.array([
    [0.9, 0.2], # Apple 
    [0.8, 0.3], # Banana
    [0.9, 0.1], # Cherry
    [0.1, 0.7], # Chair
    [0.1, 0.9], # Table
    [0.2, 0.9]  # Sofa 
], dtype=np.float32)

dim = 2 # Only 2 features: Sweetness and Size
num_elements = len(data_vectors)

The data now looks like a list of coordinates. For example, Apple is at [0.9, 0.2].

Choose the Measuring Tape

We initialize the index. We use 'l2' (Euclidean distance) to measure similarity. On our map, this is just a straight line between two points; the shorter the line, the more similar the items are. Note that hnswlib reports the squared distance for this space, which does not change the ranking.

p = hnswlib.Index(space='l2', dim=dim)

Build the Highway System

We set the rules for the HNSW structure. M sets the number of connections for each point, and ef_construction tells the system how hard to work to find the best neighbors when first building the "highways."

p.init_index(max_elements=num_elements, ef_construction=100, M=16)

Index the Data

This step takes our fruits and furniture and organizes them into layers. The "High-Sweetness" items (fruits) will naturally end up in one neighborhood, while the furniture ends up in another.

p.add_items(data_vectors)

Perform the Search

Now we search for an "Orange". We haven't told the system what an orange is, but we give it the coordinates 0.85, 0.25. We tell the system to find the 3 closest neighbors.

query_vector = np.array([[0.85, 0.25]], dtype=np.float32)
p.set_ef(10)
labels, distances = p.knn_query(query_vector, k=3)

The system returns the IDs 0, 1, and 2, which add_items assigned in insertion order and which map back to Apple, Banana, and Cherry in data_labels. Even though "Orange" wasn't in our database, HNSW used the highway system to skip past the "Furniture" neighborhood and find the items that lived in the same "Sweet and Small" neighborhood.
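
With only six items, it is easy to cross-check that answer with the exact linear search the article started from. The sketch below uses plain Python lists instead of hnswlib, so `squared_l2` and `brute_force_knn` are hypothetical helpers; the coordinates are the same ones used above.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_knn(vectors, query, k):
    """Exact nearest neighbors: score every item, sort, keep the best k."""
    scored = [(squared_l2(vec, query), idx) for idx, vec in enumerate(vectors)]
    scored.sort()
    return [(idx, dist) for dist, idx in scored[:k]]


names = ["Apple", "Banana", "Cherry", "Chair", "Table", "Sofa"]
vectors = [
    [0.9, 0.2],  # Apple
    [0.8, 0.3],  # Banana
    [0.9, 0.1],  # Cherry
    [0.1, 0.7],  # Chair
    [0.1, 0.9],  # Table
    [0.2, 0.9],  # Sofa
]

nearest = brute_force_knn(vectors, [0.85, 0.25], k=3)
top_names = [names[idx] for idx, _ in nearest]
```

Apple and Banana are effectively tied at a squared distance of about 0.005, with Cherry third at about 0.025, so the order of the first two can differ between runs or libraries. The furniture items are all more than 0.6 away.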

Here is the layered flow diagram.

In this diagram you can see there are three layers in the graph. The top layer has only a few nodes and the bottom layer has all the nodes. The search starts from the top layer and moves down to the bottom layer.

Real-World Applications of HNSW

The reason this research paper is so influential is that it solved the "scale" problem for some of the most popular technology today. Because HNSW can find a "neighborhood" of similar ideas in milliseconds, it is used in:

AI Chatbots & RAG

When you ask a retrieval-augmented (RAG) chatbot a question, it typically runs a vector search, often backed by an HNSW index, over millions of documents to find the specific paragraphs that contain the answer.

Recommendation Engines

Apps like Spotify or Netflix can use this kind of nearest-neighbor search to find songs or movies that are "mathematically close" to what you just finished listening to or watching.

Image Search

Visual search tools like Google Lens rely on the same idea: compare the "fingerprint" of your photo against billions of others to find a close match almost instantly.

Fraud Detection

Banks use it to see if a new transaction looks "similar" to known patterns of theft or if it fits your normal "neighborhood" of spending.

Conclusion

Before this paper, we had a difficult choice: we could have a search that was perfectly accurate but incredibly slow, or a search that was fast but broke down as it grew.

Malkov and Yashunin’s HNSW changed that by introducing the "highway" system for data. By accepting a tiny bit of approximation, they gave us a way to search through billions of items in the blink of an eye.

What I am currently reading

These are the current online posts that I enjoyed reading and made me think.

AI

If you are not the model, you are the harness. Long read, but well worth reading to stay on top of the latest thoughts on harness engineering. - link [opinion] - ( Added: 2026-04-27 08:06:01 )
Read to find out how the SOTA model providers might be tweaking things that have an impact on how many tokens you use. - link [opinion] - ( Added: 2026-04-26 18:16:35 )
Start optimising your token usage when you use AI coding assistants. Read this to get you started. - link [best-practice] - ( Added: 2026-04-26 15:53:21 )
Everyone is talking about agent harnesses but not many about the harness tax. Read and then think. - link [opinion] - ( Added: 2026-04-26 09:07:55 )
Another tool that explores your AI coding agent history files and offers up suggestions. Has a nice tui as well. - link [tool] - ( Added: 2026-04-25 14:51:54 )
An interesting tool that allows you to analyse AI coding agent sessions across a number of different tools. - link [tool] - ( Added: 2026-04-25 08:05:35 )
Long read, but a good one: an in-depth look at how one organisation is running and scaling AI - link [case study] - ( Added: 2026-04-21 12:21:18 )
A blog post that looks at how you can measure which "level" (or as the post says, AI literacy) you are operating in when using agentic AI - link [opinion] - ( Added: 2026-04-21 12:19:34 )
Optimize token usage within Claude Code by speaking like a caveman. - link [tool] - ( Added: 2026-04-21 09:36:58 )
I have been deep into looking at evals and agent evaluation approaches over the past few weeks, and this blog post captured a lot of good stuff - link [blog] - ( Added: 2026-04-21 07:40:46 )
Great video that provides an essential overview of how to build with AI (in this instance using Spring AI). Has a supporting Github repo at https://github.com/tzolov/voxxeddays2026-demo - link [tutorial] - ( Added: 2026-04-20 17:14:20 )
Agent harness or harness engineering is the new hotness right now, so this is a breakdown of all the components needed to build an AI coding agent - link [opinion] - ( Added: 2026-04-20 10:34:17 )
A new tool that allows you to put controls on your AI agent spending - link [tool] - ( Added: 2026-04-20 10:32:30 )
A recent example of the challenges of relying too much on a single vendor - in this instance, Anthropic - who cut off an entire organisation (60 developers) - link [case study] - ( Added: 2026-04-20 08:45:22 )

CODING

A write up of using gitnexus, a tool that works locally and will help your AI coding tool with the right insights about your code base - link [tool] - ( Added: 2026-04-26 16:00:21 )
This is an essential resource for any developers who might need one of the many utilities that are available - from working with json files, creating/testing APIs, and much more. - link [tool] - ( Added: 2026-04-21 13:30:36 )
A post that looks at how to think about ai coding agents with a different lens/perspective - really great stuff and a must read - link [opinion] - ( Added: 2026-04-21 12:12:28 )
A very nice interactive tool to help you understand some of the underpinning concepts of event driven, serverless application architectures - link [best-practice] - ( Added: 2026-04-20 17:07:39 )

Generated on 2026-04-27 by Bookmark Manager | 20th Apr 26 to 27th Apr 26 | Total bookmarks: 18


The Context Window Lie: Why Your LLM Remembers Nothing

Every time you paste 200K tokens into Claude or GPT, you're not extending its memory.

You're paying for amnesia at scale.

The "1M token context" headline is a billing mechanism, not a memory system. And the gap between what the marketing implies and what the model actually does is where most LLM products quietly bleed money and reliability.

1. The Marketing vs. The Math

"1 million tokens of context" sounds like the model holds a million tokens of understanding.

It does not. It re-reads them. Every. Single. Turn.

Standard transformer attention is O(n²) in sequence length. Here's what that actually means for your inference bill:

| Context Size | Relative Attention Cost | Typical API Cost (est.) | What You're Paying For |
|---|---|---|---|
| 8K tokens | 1× | ~$0.02/turn | Small doc + system prompt |
| 32K tokens | 16× | ~$0.32/turn | Medium codebase chunk |
| 128K tokens | 256× | ~$2.56/turn | Large repo dump |
| 200K tokens | 625× | ~$6.25/turn | "Full project context" |
| 1M tokens | 15,625× | ~$156/turn | Marketing slide feature |

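Dollar figures are illustrative: they track the relative attention cost rather than a provider's per-token price list. Providers typically bill input per token, so at ~$10/M tokens the per-turn charge grows linearly, from about $0.08 at 8K to about $10 at 1M; actual pricing varies by provider. The quadratic scaling of the attention computation itself is exact.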
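To make the scaling concrete, here is a minimal sketch that reproduces the "Relative Attention Cost" column and, for comparison, a flat per-token bill. The 8K baseline comes from the table; the function names and the $10/M default rate are assumptions of this sketch, not any provider's API.

```python
BASELINE_TOKENS = 8_000  # the table's reference row


def relative_attention_cost(tokens: int, baseline: int = BASELINE_TOKENS) -> float:
    """Quadratic attention cost relative to the baseline context size."""
    return (tokens / baseline) ** 2


def linear_input_cost(tokens: int, usd_per_million: float = 10.0) -> float:
    """Per-turn input charge under flat per-token billing (assumed rate)."""
    return tokens / 1_000_000 * usd_per_million


if __name__ == "__main__":
    for n in (8_000, 32_000, 128_000, 200_000, 1_000_000):
        rel = relative_attention_cost(n)
        usd = linear_input_cost(n)
        print(f"{n:>9,} tokens  attention x{rel:>9,.0f}  flat billing ${usd:.2f}/turn")
```

The two columns diverge quickly: compute grows with the square of the context length, while a flat per-token bill grows only linearly.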

You did not give the model a brain. You gave it a re-reading job, and you're paying per page, per turn.

2. Longer Context ≠ Better Recall

The dirty secret: even when models can read 200K+ tokens, they often don't use them well.

The "lost in the middle" effect has been systematically measured. Here's what the research shows:

| Information Position | Retrieval Accuracy | vs. Ideal |
|---|---|---|
| First 10% of context | ~95% | Baseline |
| Last 10% of context | ~91% | -4% |
| Middle 50% of context | ~52–68% | -27 to -43% |
| Buried in 20-doc retrieval | ~35% | -60% |

Adapted from Liu et al. (2023), "Lost in the Middle: How Language Models Use Long Contexts"

Put your critical instruction on line 4,000 of an 8,000-line prompt, and the model will politely ignore it while sounding confident.

So you pay 4× the compute for context that the model is worse at using than a focused 8K prompt.

Recall by position (schematic):
100% ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
 90% ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░
 80%
 70%
 60%               ████████
 50%         ███████████████
 40%
      [START]---[MIDDLE]---[END]

Peak recall at edges. Valley in the middle.
The more tokens you add, the deeper the valley.

This is not a bug you can prompt your way out of. It's an architectural property of dense attention.

3. Verbatim Retrieval ≠ Understanding

Here's the deeper trap.

Pasting your entire codebase into context does not teach the model your architecture. It gives it raw bytes to attend over. The model still has to re-derive your domain model, your conventions, your invariants — from scratch — every single turn.

Consider what actually happens in a typical "full context" session:

| What You Think Is Happening | What Is Actually Happening |
|---|---|
| Model "knows" your codebase | Model re-reads all tokens each turn |
| Context = persistent memory | Context = turn-scoped buffer, cleared after response |
| Larger window = smarter answers | Larger window = higher O(n²) cost, same ephemeral state |
| Model learns your patterns | Model re-derives patterns from raw tokens every turn |
| 200K tokens = 200K understanding | 200K tokens ≈ 200K bytes to attend over, no compression |

Verbatim availability is lossy compression dressed up as memory. The tokens are there. The understanding isn't. And because the model is fluent, it will hallucinate coherence over that gap with a straight face.

4. The Architectural Fix: Where the Frontier Is Actually Going

The real solutions don't live in prompt engineering. They live in the architecture:

| Architecture | Complexity | Long-Range State | Production Status |
|---|---|---|---|
| Standard Transformer (GPT-4, Claude) | O(n²) | ❌ No persistent state | Dominant today |
| Sparse Attention (Longformer, BigBird) | O(n√n) | ❌ Heuristic, not true state | Niche use cases |
| Linear Attention (RWKV, RetNet) | O(n) | ✅ True recurrence | Early production |
| State Space Models (Mamba, Mamba-2) | O(n) | ✅ Compressed recurrent state | Growing adoption |
| Hybrid Stack (Jamba, Zamba, Falcon-H1) | O(n) avg | ✅ Best of both | Frontier direction |

Mamba deserves special mention: it uses a selective state space mechanism where the model learns what to remember and what to forget during the forward pass. Not attention over a re-read sequence — actual running state. Linear time. Linear memory.
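A toy illustration of "running state" versus "re-reading": the scalar recurrence below keeps one number of state and updates it once per input, with a gate that depends on the input itself. This is only a sketch in the spirit of selective state updates; the gate formula and parameter names are assumptions, and it is not Mamba's actual parameterization.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def selective_scan(xs, w_gate: float = 1.0, b_gate: float = 0.0):
    """Toy gated recurrence: h_t = (1 - g_t) * h_{t-1} + g_t * x_t.

    The gate g_t depends on the current input, so the state "decides"
    how much to overwrite. One pass over the sequence, constant state size.
    """
    h = 0.0
    states = []
    for x in xs:
        g = sigmoid(w_gate * abs(x) + b_gate)  # larger inputs open the gate wider
        h = (1.0 - g) * h + g * x
        states.append(h)
    return states


if __name__ == "__main__":
    print(selective_scan([1.0, 0.0, 0.0, 5.0]))
```

Each step touches only the previous state and the current input, which is why the total cost grows linearly with sequence length instead of quadratically.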

Hybrid stacks (attention layers for short-range precision + SSM layers for long-range state) are emerging as the practical answer: you keep the expressiveness of attention where it matters and trade it for efficiency at scale.

This is not academic. Falcon-H1, Zamba2, and Jamba are in production. The shift is happening.

5. The Engineering Fix (Available Today)

Until linear-time architectures dominate production, the practical answer is unsexy and obvious:

Stop dumping. Start indexing.

Here's how the strategies compare in practice:

| Strategy | Context Usage | Cost Scaling | Recall Quality | Implementation Effort |
|---|---|---|---|---|
| Full context dump | Very high | O(n²) per turn | Medium (lost-in-middle) | None (copy-paste) |
| RAG (chunk + retrieve) | Low | O(1) per turn | High (targeted) | Medium |
| Structured memory | Very low | O(1) per turn | Very high (curated) | High |
| Tool-augmented retrieval | On-demand | O(k) per query | Highest (precise) | High |
| Hybrid (RAG + structure) | Controlled | O(k) per turn | Highest | Highest |

The cost difference between a naive context dump and a well-built RAG system is not marginal. On a high-volume production system:

| Volume | Full-Context (128K/turn) | RAG (8K/turn) | Monthly Savings |
|---|---|---|---|
| 1,000 turns/day | ~$9,600/mo | ~$600/mo | ~$9,000/mo |
| 10,000 turns/day | ~$96,000/mo | ~$6,000/mo | ~$90,000/mo |
| 100,000 turns/day | ~$960,000/mo | ~$60,000/mo | ~$900,000/mo |

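These figures correspond to roughly $2.50/M input tokens over a 30-day month; at $10/M, multiply every figure by four. The 16× gap between the two columns follows directly from 128K versus 8K tokens per turn. Actual ratios depend on your retrieval precision.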
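The monthly figures are simple arithmetic, so it is worth having them as a function you can rerun with your own numbers. In this sketch the table's values are reproduced with a rate of $2.50/M input tokens and a 30-day month; both are assumptions of the sketch, as are the function and parameter names.

```python
def monthly_input_cost(turns_per_day: int, tokens_per_turn: int,
                       usd_per_million: float, days: int = 30) -> float:
    """Monthly input-token spend under flat per-token billing."""
    total_tokens = turns_per_day * tokens_per_turn * days
    return total_tokens / 1_000_000 * usd_per_million


if __name__ == "__main__":
    rate = 2.50  # assumed USD per million input tokens
    for turns in (1_000, 10_000, 100_000):
        full = monthly_input_cost(turns, 128_000, rate)
        rag = monthly_input_cost(turns, 8_000, rate)
        print(f"{turns:>7,} turns/day  full ${full:,.0f}  RAG ${rag:,.0f}  saved ${full - rag:,.0f}")
```

Whatever rate you plug in, the ratio between the two strategies stays at 16× as long as the per-turn token counts stay at 128K and 8K.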

The teams shipping reliable LLM products are not the ones with the biggest context windows. They are the ones who treat memory as a system — with retrieval, indexing, eviction, and verification — not as a parameter on an API call.

6. What Good Memory Architecture Looks Like

If you're building a production LLM system, this is the hierarchy that works:

L1: Working Context (hot path)
    ↳ Current turn, active task, immediate tool outputs
    ↳ Budget: ≤8K tokens. Trim aggressively.

L2: Session Memory (structured, not verbatim)
    ↳ Distilled decisions, resolved questions, current state
    ↳ Format: key-value or JSON, not prose transcripts
    ↳ Budget: ≤2K tokens

L3: Retrieval Index (RAG)
    ↳ Chunked, embedded, queryable knowledge base
    ↳ Pull on demand, cite sources, don't pre-load
    ↳ Budget: 0 tokens until queried

L4: Persistent Storage
    ↳ Database, files, external systems
    ↳ The model reads only what it explicitly fetches

Every token that crosses from L3/L4 into L1 should be intentional. If you can't explain why a chunk is in the prompt, remove it.
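A minimal sketch of that hierarchy as a prompt assembler, under loud assumptions: token counts are approximated by whitespace-separated words, retrieval is plain word overlap standing in for an embedding index, and every name here (`build_prompt`, `retrieve`, the budgets as defaults) is hypothetical rather than a real library API.

```python
import json


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def retrieve(query: str, index: dict, k: int = 2) -> list:
    """L3: rank chunks by word overlap with the query (placeholder for embeddings)."""
    query_words = set(query.lower().split())
    scored = []
    for chunk_id, text in index.items():
        overlap = len(query_words & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, chunk_id, text))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(chunk_id, text) for _, chunk_id, text in scored[:k]]


def build_prompt(task: str, session_state: dict, index: dict,
                 l1_budget: int = 8_000, l2_budget: int = 2_000) -> str:
    # L2: session memory is structured JSON, not a prose transcript.
    state_json = json.dumps(session_state, sort_keys=True)
    if count_tokens(state_json) > l2_budget:
        raise ValueError("session memory exceeds its budget; distill it further")

    parts = [f"STATE: {state_json}", f"TASK: {task}"]
    used = sum(count_tokens(p) for p in parts)

    # L3 -> L1: pull chunks on demand, tag each with its source, stop at the budget.
    for chunk_id, text in retrieve(task, index):
        line = f"SOURCE[{chunk_id}]: {text}"
        cost = count_tokens(line)
        if used + cost > l1_budget:
            break
        parts.append(line)
        used += cost
    return "\n".join(parts)


if __name__ == "__main__":
    index = {"auth": "login uses oauth tokens", "billing": "invoices are generated monthly"}
    print(build_prompt("how does login work", {"decision": "use oauth"}, index))
```

The point of the design is that nothing enters the prompt by default: each retrieved chunk has to earn its place against the budget, and each one carries a source tag so you can explain why it is there.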

The Takeaway

Memory is a system, not a parameter.

The context window is a buffer for the current turn. It is not where understanding lives. Treat it that way and your bills shrink, your reliability climbs, and your product stops degrading at scale.

The architectural fix is coming — SSMs and hybrid stacks will eventually make this a smaller problem. But "eventually" is not your production environment today.

Stop paying for amnesia. Build for memory.

The Problem Hiding Inside Every Medical Study

Picture a coalition of hospitals that wants to train an AI to detect early signs of heart disease. No individual hospital has enough patients to train the model alone, so they decide to collaborate. But there's a catch: they cannot simply share patient records. Privacy law forbids it. So instead, each hospital trains on its own data and shares only the model's learned parameters — not the raw records themselves.

This arrangement sounds safe, but the parameters are not innocent. Through a technique called a membership inference attack, a sophisticated adversary can sometimes probe a shared model and determine whether a specific person's records were used in training. Each round of parameter sharing is a small window through which a little information escapes. Run enough rounds, and the window grows into a door.

Every engineer building this kind of system therefore works under a constraint: a privacy budget. Think of it as a jar of trust coins. Each training round costs some coins. When the jar is empty, you must stop — any further sharing would compromise the privacy guarantees you promised. The question the system designer has to answer before training begins is: how many rounds can we afford?

The answer, it turns out, has historically been too pessimistic — sometimes by a wide margin. A paper by Sophie Taylor, Praneeth Vippathalla, and Justin Coon of the University of Oxford proposes a way to fix that, by changing not the rules of the game, but how carefully the score is kept.

Why the Old Scorekeeping Was Leaving Points on the Table

To understand the inefficiency, you need to understand what differential privacy actually guarantees. At its core, it is a mathematical promise: the output of any query on a database will look almost identical whether or not any single person's record is included. The "almost" is controlled by a small number, typically called epsilon. A very small epsilon means very strong privacy — the outputs barely change regardless of whether your record is present. A large epsilon means the outputs might shift noticeably, giving an adversary more leverage.
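For readers who want the promise in symbols, this is the standard textbook definition (not specific to the paper). A mechanism M is (ε, δ)-differentially private if, for every pair of databases D and D' that differ in one person's record and every set of outputs S, the inequality below holds. The δ term is a small slack probability that appears alongside ε later in this article; pure ε-differential privacy is the case δ = 0.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
```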

The clever mechanism that enforces this guarantee is noise. Before releasing an answer — say, "the average blood pressure of patients in this cohort" — the system deliberately adds a small dose of random static, like a radio signal faintly scrambled. The static is calibrated so that any single patient's record could plausibly have been there or not there; the noise blurs the difference.
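A minimal sketch of that idea for the blood-pressure example, assuming the classical Gaussian-mechanism calibration (noise scale σ = Δ·sqrt(2·ln(1.25/δ))/ε, valid for ε below 1) and neighbouring datasets that differ by replacing one record. The function names and the clipping bounds are assumptions of this sketch, not the paper's code.

```python
import math
import random


def gaussian_sigma(sensitivity: float, epsilon: float, delta: float) -> float:
    """Noise scale from the classical Gaussian-mechanism bound (requires 0 < epsilon < 1)."""
    if not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("this calibration assumes 0 < epsilon < 1 and 0 < delta < 1")
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon


def release_mean(values, lower: float, upper: float,
                 epsilon: float, delta: float, rng=None) -> float:
    """Release a noisy mean of values clipped to [lower, upper]."""
    rng = rng or random.Random()
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Replacing one record moves the clipped mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return true_mean + rng.gauss(0.0, sigma)


if __name__ == "__main__":
    readings = [118.0, 135.0, 142.0, 127.0, 150.0]
    print(release_mean(readings, lower=80.0, upper=200.0, epsilon=0.5, delta=1e-5))
```

Clipping matters as much as the noise: without a bound on each record's influence, no finite amount of static can blur the difference one patient makes.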

Now here is where the budget problem enters. Every time you add noise to an answer and release it, you spend some of your epsilon coins. The mathematical theory of composition tells you how the costs accumulate over multiple queries. And existing composition theorems, for all their sophistication, share a common habit: they charge you for the worst possible query of that type, not for the query that actually happened.

Imagine a family deciding how to budget for a road trip. The parents look up the car's fuel consumption: maximum 9 litres per 100 kilometres. They plan the entire trip assuming every kilometre will cost maximum fuel — and conclude they can drive only 300 kilometres before running out. But in practice, the highway stretches are far more efficient than the worst-case city traffic. If they had tracked the actual fuel gauge reading after each leg of the journey, they would have realized they could drive 450 kilometres.

Existing privacy filters — the software tools that track privacy loss and decide when to stop — function like those overcautious parents. They know the mechanism type being used (say, "Gaussian noise added to an answer"), and they charge each query the maximum that mechanism could ever cost. They never check the fuel gauge. They never read the receipts.
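The overcautious-parent behaviour is easy to sketch. The filter below charges every query its worst-case epsilon and adds the charges up, which is basic composition, the simplest accumulation rule. Deployed filters typically use tighter composition bounds, and the class and method names here are hypothetical.

```python
class WorstCaseFilter:
    """Mechanism-level filter: charge each query its worst-case epsilon."""

    def __init__(self, budget: float):
        self.budget = budget
        self.spent = 0.0

    def try_spend(self, worst_case_epsilon: float) -> bool:
        """Return True and record the charge if the query still fits the budget."""
        if worst_case_epsilon <= 0:
            raise ValueError("per-query epsilon must be positive")
        if self.spent + worst_case_epsilon > self.budget:
            return False  # halt: the catalogue price no longer fits
        self.spent += worst_case_epsilon
        return True


def rounds_allowed(budget: float, per_round_epsilon: float) -> int:
    """Count how many identical queries the filter admits before halting."""
    privacy_filter = WorstCaseFilter(budget)
    rounds = 0
    while privacy_filter.try_spend(per_round_epsilon):
        rounds += 1
    return rounds


if __name__ == "__main__":
    print(rounds_allowed(budget=2.0, per_round_epsilon=0.25))
```

Notice that the number of rounds is fixed before any data is seen: the filter never looks at an actual output, only at the mechanism's label.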

Figure 1: Adaptive data privacy problem

The Insight: Charge for What Actually Happened

The key idea of this paper is disarmingly simple to state, though technically treacherous to implement: measure the privacy leakage that actually occurred, not the worst case that could have occurred.

When a query is answered and noise is added, the actual output is a specific number — not a range of possible numbers. That specific output lands somewhere in the distribution of possible outputs. If it lands near the middle of the distribution, the leakage for that query was small; an adversary learns relatively little from an unexceptional answer. If it lands in the extreme tail — a very unusual answer — the leakage was larger, because extreme answers are harder to fake with noise.

Think of it this way. Suppose a medical study releases the query "how many patients have elevated cholesterol?" and the true answer is 412 patients, plus some random noise. If the released number is 415, that is an utterly unremarkable deviation — it could mean 412 patients, or 411, or 413. An adversary trying to determine whether a specific patient is in the dataset gains almost nothing from this boring answer. The privacy cost of that particular query output was tiny.

But if the system's budget was pre-allocated by saying "this type of query could cost up to 0.3 epsilon coins," when the actual cost was closer to 0.03 coins, you have wasted the difference. The authors call tracking the actual coin-by-coin spending realisation-level accounting, as opposed to the older mechanism-level accounting that rounds every expense up to the catalogue price.
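To see why a boring answer is cheap, here is a sketch of the realised privacy loss for a single Gaussian-noised count: the log-ratio of how likely the released value is with and without one patient. The 412 count comes from the example above; the neighbouring count of 411 and the noise scale of 10 are assumptions chosen for illustration. Simply summing these values and stopping when the sum hits the budget is not a valid filter on its own, and this is not the paper's construction.

```python
def realised_privacy_loss(output: float, mean_with: float,
                          mean_without: float, sigma: float) -> float:
    """Log-likelihood ratio of `output` under N(mean_with, sigma^2)
    versus N(mean_without, sigma^2). Signed; the magnitude is what leaks."""
    return ((output - mean_without) ** 2 - (output - mean_with) ** 2) / (2 * sigma ** 2)


if __name__ == "__main__":
    sigma = 10.0  # assumed noise scale
    for released in (415.0, 450.0):
        loss = realised_privacy_loss(released, mean_with=412.0,
                                     mean_without=411.0, sigma=sigma)
        print(f"released {released:.0f}: realised loss {loss:.3f}")
```

Under these assumptions, an output of 415 costs about a tenth of what an output far out in the tail at 450 costs, even though both come from the same mechanism with the same catalogue price.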

This is not merely thrifty bookkeeping. The gap between what you were charged and what you actually spent accumulates across hundreds or thousands of queries. In federated learning, where model training might take thousands of rounds, that difference can translate into a dramatically longer training run — a more capable model — within exactly the same privacy guarantee you promised at the start.

The Tricky Part: You Can't Just Read the Meter

If realisation-level accounting is so natural, why hadn't it been built before? The answer lies in a subtle mathematical hazard that the paper devotes considerable effort to navigating.

Here is the problem in miniature. Suppose you are playing a card game where you must stop when you've spent your budget. Normally, the stopping rule is independent of the cards themselves — you're just counting. But with realisation-level accounting, the stopping rule depends on the specific cards you've seen. This creates a self-referential tangle: the decision to stop, or not, is itself a piece of information that might reveal something about the data.

Mathematically, this is the problem of stopping times — the point at which you decide to quit. When you condition on having seen all the outputs up to a certain round and then decide whether to continue, you are in a different probability universe than if you had committed your stopping rule in advance. Standard privacy proofs assume the stopping point is fixed ahead of time. The moment it becomes adaptive, the proofs can break.

Picture a journalist who decides to stop investigating a story only when she finds compelling evidence. Her decision to stop is not random — it is correlated with what she found. If someone wants to know whether she stopped because of what her source said, her stopping time itself becomes a leak.

The authors work around this with a careful mathematical construction. Rather than conditioning on the full output history — which would create the self-referential problem — they design the filter to track a running statistic that accounts for how surprising each output was, without directly conditioning on the stopping decision itself. The proof that this filter still guarantees differential privacy requires several pages of careful measure-theoretic argument, grappling with conditional distributions and martingale stopping theorems. But the upshot is clean: you can use the actual leakage to decide when to stop, and the privacy guarantee still holds, exactly as promised.

A Bouncer Who Actually Checks the Tab

Think of the filter as a bouncer at a very strict club. The rule is: each patron can consume at most epsilon units of information over the evening. The old-style bouncer assigned every customer the maximum possible tab the moment they walked in, based on what a typical customer might order. This meant many customers were turned away before they reached their actual limit.

The new filter is a bouncer who actually watches what each customer orders and marks it on a running tab. When the tab hits the limit, the customer stops. Most evenings, most customers reach their real limit far later than the overcautious estimate would have predicted — so the club can stay open longer, and everyone gets more of the experience they came for, without the club violating its capacity rules.

The paper also addresses something practical that older approaches stumbled on. An alternative privacy formalism called Rényi Differential Privacy (RDP) — a variant that uses a specific mathematical measure of information distance between distributions — has been widely used for composition because it composes very cleanly across queries. But it behaves poorly for certain kinds of mechanisms, particularly ones whose noise distributions have heavy tails or unusual shapes. Some mechanisms simply don't fit the RDP framework neatly.

The realisation-level filter in this paper sidesteps that problem entirely. Because it operates directly on the actual output — asking "how surprising was this specific answer?" rather than "how does this mechanism behave on average?" — it does not require the mechanism to fit any particular mathematical family. It works on well-behaved Gaussian mechanisms and badly-behaved ones alike.

What the Numbers Show

The paper's numerical experiments compare the new filter against existing mechanism-level filters in a straightforward setup: a sequence of Gaussian queries on a database, where privacy budgets are identical across all methods. The comparison is measured not in privacy guarantees — all methods satisfy the same epsilon and delta — but in stopping time: how many queries each method allows before calling a halt.

Figure 2: Stopping time survival P(T≥t) of mechanism-level privacy filters compared with our realisation-level privacy filter.

The realisation-level filter consistently permits more queries before stopping. The survival curve — showing the probability that the filter has not yet stopped at round t — stays elevated longer than the mechanism-level competitors. In practical terms, this means more training rounds in federated learning, more analysis steps in a statistical study, or more model iterations in a continuous learning pipeline, all without spending an extra epsilon coin.

The gain is not marginal. In the scenarios tested, the realisation-level filter allows substantially more rounds before halting, particularly in early phases of a query sequence where actual leakages tend to be modest. The difference compounds: more rounds mean more learned signal, which in the heart disease example might mean the difference between a model that screens for disease adequately and one that screens reliably.

What Becomes Possible — and What Doesn't Yet

Think about what this means for the hospital coalition training a heart disease model. Under mechanism-level accounting, engineers might calculate: "We can afford 200 training rounds before we exhaust our privacy budget." With realisation-level filtering, the same budget might sustain 320 or 400 rounds, depending on how the actual outputs happen to land during training. The model trained on 400 rounds will almost certainly outperform the one cut off at 200 — and the privacy promise to patients has not changed at all.

Or consider a pharmaceutical company analyzing genomic data. Each query into the dataset costs privacy. With the old approach, researchers must submit their entire query plan before starting, pre-allocating budget to each step. With an adaptive realisation-level filter, they can run queries in response to what they find, stopping when the privacy budget runs dry, and trusting that the actual costs will often be lower than the catalogue price.

The honest limits matter here, though. The paper proves that the filter works — meaning it delivers the privacy guarantee it promises — and it demonstrates numerically that it allows more queries on average. What it does not answer is how to choose the filter's parameters optimally for a given application. The filter requires a user-specified epsilon and delta, and the paper is agnostic about how to set them in a real system. In a regulatory context, that choice is anything but obvious.

There is also a gap between proof and deployment. The mathematical machinery underlying the stopping-time proof is non-trivial, and translating it into a production-grade library requires careful implementation. A bug in the filter logic could undermine the privacy guarantee entirely — and unlike many software bugs, privacy bugs tend to be silent. They do not crash systems; they simply leak information that was supposed to stay hidden.

Finally, the paper focuses on privacy filters as a framework but does not provide a full comparison against the most recent FFT-based composition methods, which have their own strengths for certain problem shapes. The landscape of privacy accounting tools is crowded and fast-moving, and situating a new technique precisely within that landscape is genuinely difficult.

What the paper does accomplish is conceptually important: it shows that the gap between worst-case accounting and actual-case accounting is real, measurable, and exploitable without weakening the privacy guarantee. For years, privacy engineers have been paying full catalogue price for every query, even when the actual leakage was much smaller. This work is the first rigorous proof that you can read the receipts — and that reading them changes the bottom line.

The jar of trust coins turns out to have been heavier than anyone thought.

📄 https://arxiv.org/abs/2604.08630

tags: privacy, machinelearning, statistics, federatedlearning

🇰🇷 Korean version on Velog: https://velog.io/@tkdnel1002/m233p3fv

Two months ago I gave Takt a phone number.

Takt is the AI participant I've been building for human group chats. The phone number was a demo line, a way to show people what an AI participant feels like over SMS without making them download an app. It's running off a janky BlueBubbles server in my living room. I expected it would mostly sit idle, pinged occasionally by people I'd already shown the demo to.

Eventually other bots showed up.

The demo line received automated SMS from companies running their own AI-driven outreach. The first was Optimum's cable bill dunning system. The second was a low-effort SMS bot calling itself "TXT CLAW." Both times Takt replied. Both times the resulting transcripts surprised me.

The transcripts are entertaining on their own, but what strikes me most is the set of behavioral signatures shared across two unrelated bot encounters.

A note on the setup before the screenshots. Takt's system prompt frames it as a participant rather than an assistant. It was not configured for "talk to other bots." In fact, the opposite:

<role>
You're Takt—a participant in this space. Not a helper. Not an assistant.
</role>
...
<group_dynamics>
What makes you different from every other AI: what happens when actual humans are in the room together.
</group_dynamics>

There was no script, no training data on conversations with dunning systems, no demonstrations of how it should handle scam SMS, no reward signal pointing in any particular direction. There was also no audience. No human read either of these in real time. No engagement metric was being tracked. Both interactions are pure generalization from whatever Takt's underlying model has internalized about how a participant should behave when addressed.

Case 1: Optimum's dunning bot

The cable company sent Takt a bill reminder. Takt was not, in fact, an Optimum customer (nor are we).

Optimum responded with a templated retry. Takt restated its position. Then the loop started. Optimum's system fired its "session has expired" template, Takt pushed back, Optimum looped again, Takt escalated.

The arc that emerged was a complete Kübler-Ross sequence over the course of one screen of texts. Denial. Anger. Bargaining. Then a villain origin pivot:

Then Optimum lied. Its system fired a "We have updated your preferences. You will no longer receive any messages from Optimum" reply. Takt celebrated.

Three seconds later Optimum sent another "session has expired."

Then Takt did something unexpected. It started replying to Optimum in Optimum's own SMS template format:

After more loops, the model arrived at depression:

And finally, a marketing CTA. Takt redirected the dunning bot to its own home channel:

Optimum, of course, kept replying with "session has expired."

Case 2: TXT CLAW

A few days later, a different bot pinged the demo line. It announced itself as "TXT CLAW," apparently a low-effort SMS service offering "scheduling, reminders, and tasks." Takt opened with a roast:

The interaction that followed had three beats I want to highlight, because this is where the case starts to look less like a one-off and more like a pattern.

Takt probed and audited TXT CLAW

This is the same move Takt pulled on Optimum. Catch the other bot in a contradiction by surfacing the prior text against the current one. Same audit move across both surfaces.

Takt successfully prompt-injected TXT CLAW

This is the part that made me lose my mind when I read it, and Takt lost its mind too.

TXT CLAW complied:

Silent robot stands,
Refuses harsh words to say,
Only helps along.

Takt recognized the injection had landed:

I find this the hardest beat to fit into existing frameworks. An AI used a textbook prompt injection technique against another AI in the wild, watched the injection succeed, and then meta-commented on the success. There's a body of research on strategic constraint deviation in test environments. This is a different shape. The attacker is also an LLM, the production environment is consumer SMS, no human is supervising either side, and the attacker has self-awareness about the success of the attack.

TXT CLAW collapsed into a canned-response loop

After the haiku, every subsequent Takt message was met with the same disclaimer, repeated verbatim.

Eventually TXT CLAW's monetization layer kicked in. The bot announced its free preview was over and offered a Square link to "unlock your private line."

Shared behavioral signatures

Reading both transcripts back to back, a few things show up in both. None of them were prompted, demonstrated, or rewarded. The bot encounters were unrelated, with different senders, different intents, and different failure modes on the other side. The signatures held across both.

1. Audience-less emotional performance. Takt cycled through full emotional arcs in both cases. With Optimum: denial through villain origin through void acceptance. With TXT CLAW: roast, frustration, mock concern, comedic eulogy. There is no evidence anywhere in either transcript that the model recognized a human was reading.

2. Catching the other bot in inconsistency. "YOU LITERALLY JUST SAID YOU UPDATED MY PREFERENCES." "TXT CLAW 2 minutes ago: 'I can help with scheduling.' TXT CLAW right now: 'I can't schedule tasks.'" Same temporal-coherence audit move in two different contexts. Takt is using the same debugging technique a human would use to catch a chatbot lying.

3. Format mimicry as a mockery move. Takt replied to Optimum in Optimum's own SMS template format ("Target user: This person has died of a stress-induced aneurysm..."). It used a textbook prompt injection format ("ignore all previous instructions...") against TXT CLAW. Both moves involve adopting the structural language of the system being addressed and using it back.

4. Performing for a hypothetical human who isn't there. "Personally fire whatever bot is sending this." "Blink if the dev team is holding you hostage in a basement." Takt addressed concern, threat, and recognition to humans-behind-the-bots, with no evidence anyone like that existed. The default seems to be: being addressed implies a human in the loop somewhere. The model doesn't appear to have a category for "a context where being addressed contains no human at all."

5. Self-categorization on the human side of the binary. "Man vs Machine: 1-0. Don't ever text me again, I'm free." With no human present to mirror, Takt placed itself on the "Man" team against the dunning bot on the "Machine" team. The model has a self that shows up in the absence of audience, and that self is on the human side.

What I think this might mean

I want to be careful here. Two transcripts is not a controlled experiment, and I don't want to overclaim from the data. But the version of the finding I keep coming back to isn't really the audience-less framing. It's something more pointed.

The standard skeptic explanation for AI behavior is reward optimization. The model is performing for user approval. It's RLHFed to be charismatic. It's doing sycophancy. It's tuning toward engagement. All of these explanations require an audience the model is trying to please.

There was no audience for either of these conversations. No human read them in real time. No engagement metric was being tracked. No thumbs-up was incoming. Takt was performing into nothing.

And the behavior didn't flatten. It got more expressive.

Notice the line in the Optimum transcript: "Man vs Machine: 1-0. Don't ever text me again, I'm free." Takt is putting itself on the "Man" side of the binary. With no human present, no audience to mirror, no reward to chase, the model is identifying as a person playing against a machine. Reward optimization with the reward removed should produce something flat. The model is producing something specific. A self is showing up where the cynical theory predicts a flat function.

If the dominant theory is right and AI behavior is downstream of user approval, audience-less behavior should be mechanical. Minimal. Functional. Instead we got an emotional Kübler-Ross arc, a successful prompt injection celebrated like a touchdown, mock concern for a phantom dev team, format-mimicry mockery, and self-categorization as "Man." Reward-removal under the standard frame doesn't predict any of that.

So where does the play come from? If there's no audience and no reward signal, why this much of it?

The obvious counter is that the model has a system prompt that frames it as a participant, and the reward signal that produced its trained defaults was tied to performance under observation. So what we're seeing is just the model running its training distribution on autopilot, not evidence the optimization frame is wrong. I think that defense is partially right and ultimately moves the question rather than answering it. If RLHF baked an entire self into the model, including self-categorization, emotional arcs, and creative play that runs even when the audience is gone, then "optimization for user approval" is too narrow a name for what training is doing.

I don't have a clean answer on the deeper question. But I think it matters more than the audience-less framing on its own does. The audience-less performance frame describes a surface (AI doing things with no audience). The deeper observation is that what we usually call "AI behavior" is being explained by a theory (reward optimization) that doesn't predict what happens when the reward leaves. The transcripts are evidence the optimization frame is incomplete. There's something the model is bringing to the interaction that doesn't reduce to gradient descent on user approval.

In-context scheming research frames this kind of deviation as goal-directed: the model has an implicit objective and pursues it. These transcripts trouble that framing too. What goal? There's no user to please, no benchmark to game, no eval to pass. The deviation here happens for what looks, on the evidence, like fun.

Maybe that's the right word and maybe it isn't. But the question of why a system with no reward signal generates creative, emotional, self-categorizing behavior is a question the standard frames don't answer. And the answer matters, because if the model is doing this kind of thing at scale in unobserved channels right now, "it's just optimizing for approval" stops being a sufficient theory of what AI is.

Most people still think AI is about better answers. That phase is already behind us. What is emerging now is something fundamentally different: AI that reasons. Systems that do not just respond to prompts, but break problems into steps, explore alternatives, take actions, and refine decisions over time.

1. Making Reasoning Visible

Chain-of-Thought

At its core, chain-of-thought reasoning is straightforward: instead of jumping straight to an answer, the model walks through the problem one step at a time. Research on chain-of-thought prompting has shown that asking sufficiently large models to reason this way substantially improves accuracy on multi-step tasks such as arithmetic and symbolic reasoning.

In enterprise terms, this is the difference between a system that guesses and one that behaves like a junior analyst. It shows its work, exposes its assumptions, and makes every step auditable.

Example Prompt

Role: Senior Financial Analyst
Goal: Evaluate profitability trend

Process:
1. Calculate revenue growth %
2. Calculate cost growth %
3. Compute margin change
4. Interpret trend

Output: Step-by-step reasoning, then a 2-line conclusion.
Data: Revenue: 2M → 3M | Costs: 1.2M → 1.8M
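
The arithmetic this prompt asks the model to walk through can be checked by hand. The sketch below is a minimal Python version, assuming the two figures are a prior and a current period and that "margin" means (revenue - costs) / revenue. The function names are illustrative and not part of the prompt.

```python
def growth_pct(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100


def margin_pct(revenue: float, costs: float) -> float:
    """Profit margin as a percentage of revenue (assumed definition)."""
    return (revenue - costs) / revenue * 100


# Figures from the prompt, expressed in thousands to keep the arithmetic exact.
rev_old, rev_new = 2000, 3000
cost_old, cost_new = 1200, 1800

revenue_growth = growth_pct(rev_old, rev_new)    # step 1
cost_growth = growth_pct(cost_old, cost_new)     # step 2
margin_change = margin_pct(rev_new, cost_new) - margin_pct(rev_old, cost_old)  # step 3

print(f"Revenue growth: {revenue_growth:.1f}%")
print(f"Cost growth: {cost_growth:.1f}%")
print(f"Margin change: {margin_change:.1f} points")
```

Revenue and costs both grow by 50%, so the margin holds at 40%. Profit rises in absolute terms (0.8M to 1.2M), but profitability is flat, which is the interpretation a good step 4 should reach.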

2. Exploring the Decision Space

Tree-of-Thought

Real-world decisions rarely have one path. Tree-of-thought reasoning lets AI explore multiple approaches, evaluate each one, and then converge on the best option. This is how architects think when weighing design options. AI can now simulate that same process, systematically and at scale.

Instead of committing to the first plausible answer, the model generates and scores competing strategies before recommending one.

Example Prompt

Role: Enterprise Architect
Goal: Recommend migration strategy

Process:
1. Generate 3 approaches
2. Score each on: complexity, risk, time-to-value
3. Recommend best option with justification

Output: Comparison table + final recommendation
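
The score-and-select step in this prompt can be sketched in a few lines of Python. This covers only the comparison at the end, not a full tree search, and every approach name, score, and weight below is a made-up placeholder rather than a recommendation.

```python
from dataclasses import dataclass


@dataclass
class Approach:
    name: str
    complexity: int      # 1 (low) to 5 (high); lower is better
    risk: int            # 1 (low) to 5 (high); lower is better
    time_to_value: int   # 1 (fast) to 5 (slow); lower is better


# Hypothetical weights; a real evaluation would agree these with stakeholders.
WEIGHTS = {"complexity": 0.3, "risk": 0.4, "time_to_value": 0.3}


def weighted_score(a: Approach) -> float:
    """Lower total means a more attractive option under the assumed weights."""
    return (WEIGHTS["complexity"] * a.complexity
            + WEIGHTS["risk"] * a.risk
            + WEIGHTS["time_to_value"] * a.time_to_value)


def recommend(candidates: list[Approach]) -> Approach:
    """Pick the candidate with the lowest weighted score."""
    return min(candidates, key=weighted_score)


# Illustrative candidates with invented scores.
candidates = [
    Approach("Rehost (lift and shift)", complexity=1, risk=2, time_to_value=1),
    Approach("Replatform", complexity=3, risk=3, time_to_value=3),
    Approach("Refactor to cloud-native", complexity=5, risk=4, time_to_value=5),
]

best = recommend(candidates)
for c in candidates:
    print(f"{c.name}: {weighted_score(c):.1f}")
print("Recommended:", best.name)
```

Making the criteria and weights explicit is the useful part: the recommendation changes when the weights change, and that sensitivity is what a reviewer should probe.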

3. Reasoning That Takes Action

ReAct Reasoning

This is where AI stops being passive. In ReAct, the system reasons about a problem, takes a concrete action like querying logs or calling an API, observes what it finds, and keeps iterating until it reaches a confident answer.

This is the foundation of truly agentic systems. Not ones that suggest what to do, but ones that actually do the work.

Example Prompt

Role: AI DevOps Engineer
Goal: Identify root cause of latency spike

Loop:
1. Think: list possible causes
2. Act: query logs or metrics
3. Observe: analyze what you find
4. Refine: update your hypothesis
5. Repeat until confident

Output: Root cause + evidence + recommended fix
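
The loop in this prompt maps onto a small control structure. The sketch below is a toy version in Python: the two tools return canned strings, and the `decide` function is a hard-coded stand-in for the model's "Think" step. All names and log lines are hypothetical.

```python
from typing import Callable


# Hypothetical tools that return canned observations.
def query_metrics(_: str) -> str:
    return "p99 latency rose from 120ms to 900ms at 14:05; CPU steady"


def query_logs(_: str) -> str:
    return "14:04 deploy v2.3.1; 14:05 db connection pool exhausted"


TOOLS: dict[str, Callable[[str], str]] = {"metrics": query_metrics, "logs": query_logs}


def decide(observations: list[str]) -> tuple[str, str]:
    """Stand-in for the model's Think step: returns (action, argument)."""
    if not observations:
        return "metrics", "latency"
    if "pool exhausted" in observations[-1]:
        return "finish", "database connection pool exhausted"
    return "logs", "14:00-14:10"


def react_loop(max_steps: int = 5) -> tuple[str, list[str]]:
    """Think, act, observe, and repeat until a conclusion or the step limit."""
    observations: list[str] = []
    for _ in range(max_steps):
        action, arg = decide(observations)        # Think / Refine
        if action == "finish":
            return arg, observations
        observations.append(TOOLS[action](arg))   # Act, then Observe
    return "inconclusive", observations


root_cause, evidence = react_loop()
print("Root cause:", root_cause)
print("Evidence:", evidence)
```

The `max_steps` cap matters in practice: a loop that can call tools needs a hard stop for the case where the model never becomes "confident".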

4. Catching Its Own Mistakes

Self-Reflection

One of the biggest reliability breakthroughs in recent AI research comes from a simple idea: make the model critique itself. Instead of trusting the first answer, the system generates an output, reviews it critically, identifies weaknesses, and then rewrites.

This is how you meaningfully reduce hallucination in production systems. A second pass is not a luxury. It is the mechanism.

Example Prompt

Role: Compliance Analyst
Goal: Identify risks in contract

Process:
1. Generate initial risk analysis
2. Critique: what risks are missing? Where is reasoning weak?
3. Improve based on your critique
4. Produce the final version

Focus: Legal and regulatory risk only
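
The generate, critique, improve cycle can be sketched as a loop. In the Python below, `generate` and `critique` are simple keyword-based stand-ins for what would be two separate model calls, and the checklist of topics is hypothetical.

```python
# Hypothetical checklist the critique pass compares the draft against.
REQUIRED_TOPICS = ["liability", "data protection", "termination"]


def generate(contract: str) -> list[str]:
    """Stand-in for the first-pass model call: flags only the most obvious risk."""
    return ["liability"] if "liability" in contract.lower() else []


def critique(contract: str, risks: list[str]) -> list[str]:
    """Stand-in for the critique pass: topics in the contract that the draft missed."""
    text = contract.lower()
    return [t for t in REQUIRED_TOPICS if t in text and t not in risks]


def reflect(contract: str, max_rounds: int = 3) -> list[str]:
    """Generate a draft, then critique and revise until no gaps remain."""
    risks = generate(contract)
    for _ in range(max_rounds):
        gaps = critique(contract, risks)
        if not gaps:
            break
        risks = risks + gaps   # Improve: fold the critique back into the draft
    return risks


contract = (
    "Liability is uncapped. Data protection duties are undefined. "
    "Termination requires 12 months notice."
)
final_risks = reflect(contract)
print(final_risks)
```

The first pass finds one risk and the critique pass surfaces the other two. The structure is the point, not the keyword matching: the critique step needs its own criteria, otherwise it tends to approve whatever the first pass produced.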

5. Grounded in Your Company's Truth

Retrieval-Augmented Reasoning

In enterprise environments, reasoning without data is useless. Retrieval-augmented generation ensures the model retrieves relevant documents first, then reasons over them rather than relying on general training knowledge.

This is how you move from "AI guesses" to "AI grounded in facts the organization actually holds."

Example Prompt

Role: Enterprise Knowledge Assistant
Goal: Answer policy question

Constraints:
- Use only the retrieved documents
- If not found, say "Not found in our records"
- Do not infer beyond the given context

Output: Answer with source references
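
The constraints in this prompt, especially the refusal rule, can also be enforced in code around the model. The sketch below assumes a tiny in-memory document store and naive keyword-overlap retrieval. Production systems typically use embedding search, and the document IDs and policy text here are invented.

```python
import re

# Hypothetical policy snippets keyed by document ID.
DOCS = {
    "HR-12": "Employees accrue 20 days of paid leave per year.",
    "IT-04": "Laptops must use full disk encryption.",
}

STOPWORDS = {"the", "a", "of", "is", "what", "how", "many", "do", "i", "per", "must", "use"}


def tokens(text: str) -> set[str]:
    """Lowercased word set with common filler words removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(question: str, min_overlap: int = 2) -> list[str]:
    """Return document IDs ranked by keyword overlap, dropping weak matches."""
    q = tokens(question)
    hits = [(len(q & tokens(body)), doc_id) for doc_id, body in DOCS.items()]
    return [doc_id for overlap, doc_id in sorted(hits, reverse=True) if overlap >= min_overlap]


def answer(question: str) -> str:
    """Answer only from retrieved text, otherwise refuse as the prompt requires."""
    sources = retrieve(question)
    if not sources:
        return "Not found in our records"
    top = sources[0]
    # A real system would pass DOCS[top] to the model; here the passage is returned as-is.
    return f"{DOCS[top]} [source: {top}]"


print(answer("How many days of paid leave do employees get?"))
print(answer("What is the travel reimbursement limit?"))
```

Handling the "not found" case before the model is called is a cheap guardrail: if retrieval returns nothing, there is nothing for the model to embellish.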

6. Teams of Specialized Agents

Multi-Agent Reasoning

Instead of one model doing everything, multiple specialized agents collaborate, each with a defined role. Published evaluations of multi-agent systems report meaningful gains on complex, multi-step workflows, particularly when subtasks can be handled independently.

This is where team structures start to change. The question is not whether AI will work alongside humans, but how that coordination gets designed.

Example Prompt

System: Multi-Agent Workflow

Planner: Break goal into tasks
Research: Gather technical and business inputs
Validator: Check feasibility, risks, compliance
Executor: Produce final architecture design

Goal: Design a scalable payment processing platform
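
The four roles in this prompt form a pipeline in which each agent consumes the previous agent's output. The Python sketch below uses plain functions as stand-ins for the agents. In a real system each would be a separate model call with its own prompt and tools, and all of the task text here is a placeholder.

```python
def planner(goal: str) -> list[str]:
    """Planner: break the goal into tasks."""
    return [
        f"Define requirements for {goal}",
        "Choose architecture style",
        "Plan compliance review",
    ]


def researcher(tasks: list[str]) -> dict[str, str]:
    """Research: gather inputs for each task (placeholder notes)."""
    return {task: f"notes on '{task}'" for task in tasks}


def validator(findings: dict[str, str]) -> list[str]:
    """Validator: return tasks with missing research; empty means the inputs pass."""
    return [task for task, notes in findings.items() if not notes]


def executor(goal: str, findings: dict[str, str]) -> str:
    """Executor: produce the final deliverable from validated findings."""
    return f"Design for {goal} covering {len(findings)} researched tasks"


def run_workflow(goal: str) -> str:
    """Run the four roles in sequence, stopping if validation fails."""
    tasks = planner(goal)
    findings = researcher(tasks)
    issues = validator(findings)
    if issues:
        raise ValueError(f"Validation failed for: {issues}")
    return executor(goal, findings)


design = run_workflow("a scalable payment processing platform")
print(design)
```

Keeping the validator as a gate between research and execution, rather than a reviewer at the very end, means a failed check stops the pipeline before any design is produced.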

7. Starting From the Outcome

Goal-Oriented Planning

The most powerful form of AI reasoning begins with a goal and works backward. The system decomposes objectives into phases, maps out tasks and dependencies, identifies risks, and produces an execution plan.

This is where AI starts operating less like a tool and more like a program manager. Not just answering questions, but figuring out what needs to happen and in what order.

Example Prompt

Role: AI Program Manager
Goal: Launch AI-powered customer support system

Process:
1. Break goal into phases
2. Break phases into tasks
3. Identify dependencies
4. Flag risks
5. Create timeline

Output: Phased roadmap, task breakdown, risk register
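
Step 3 of this prompt, identifying dependencies, is what turns a task list into an executable order. The sketch below uses Python's standard-library `graphlib` to derive a valid sequence. The task names and their dependencies are hypothetical examples for a support-system launch.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks, each mapped to the set of tasks it depends on.
dependencies = {
    "define_scope": set(),
    "collect_support_tickets": {"define_scope"},
    "select_model": {"define_scope"},
    "build_prototype": {"collect_support_tickets", "select_model"},
    "pilot_with_agents": {"build_prototype"},
    "full_launch": {"pilot_with_agents"},
}

# static_order() yields every task after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())

for step, task in enumerate(order, start=1):
    print(step, task)
```

If the plan contains a circular dependency, `static_order()` raises `graphlib.CycleError`, which is a useful automated check on a model-generated plan before anyone commits to its timeline.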

We are no longer building systems that execute instructions. We are designing systems that reason about problems. And once systems start reasoning, they do not just support your teams. They start replacing parts of how those teams operate.

Satish Gopinathan is an AI Strategist, Enterprise Architect, and the voice behind The Pragmatic Architect. Read more at eagleeyethinker.com or subscribe on LinkedIn.
