Curated by Pillio Technology Solutions · AI · ML · LLM · Deep Learning · GenAI
Latest AI Trends
Full-length articles from the global AI & machine learning community — curated across 12 topics, no paywalls.
🤖AK DevCraft·Apr 27, 2026·6 min read·Global
Subagents: The Building Block of Agentic AI
#ai#claude#machinelearning#llm
Problem to Solve
Most developers' first encounter with AI is a single prompt, a single response. It feels powerful — until the task gets complex. Ask an AI to research three competitors, synthesize the findings, and format them as a report, and a single context window starts to feel very small. This is the problem subagents solve.
Demo
In the demo above, I'm invoking the agent explicitly by name. In orchestrated workflows, Claude can invoke multiple agents like this automatically — in parallel or sequentially — based on the task structure.
A subagent is an AI instance invoked by an orchestrating AI to handle a specific subtask within a larger workflow. In multi-agent systems broadly — whether built on Claude, GPT, Gemini, or open-source models — the core pattern is the same: rather than a single model doing everything sequentially, an orchestrator breaks work into pieces and delegates them. Much like a technical lead assigning work to specialists rather than writing every function themselves.
According to Anthropic's Claude Agent SDK documentation, subagents serve two core purposes: parallelization (running multiple tasks simultaneously) and context isolation (each subagent uses its own context window, returning only relevant results to the orchestrator rather than its full context).[^1]
Within their assigned scope, subagents are active execution units — they can browse the web, execute code, read and write files, and call external APIs. They don't just reason; they act.
How a Subagent Workflow Works
The orchestrator doesn't do the heavy lifting — it coordinates. Each subagent receives a focused prompt with a clear objective, output format, and tool access, then returns a concise result. The orchestrator aggregates those results into the final deliverable.
Anthropic's internal research system uses exactly this pattern: a lead agent spawns subagents to explore different aspects of a query in parallel, then compiles their findings into a coherent answer. Their evaluations found this approach outperformed a single Claude Opus 4 by 90.2% on internal research benchmarks.[^2]
Orchestration Patterns
Not all subagent workflows are structured the same way. Three patterns cover most real-world cases:
Parallel fan-out — Independent subtasks launch simultaneously. Best for tasks like analyzing multiple documents at once.
Sequential pipeline — Each subagent's output feeds the next. Best when there's a dependency chain (research → draft → edit → format).
Hierarchical delegation — A subagent itself becomes an orchestrator for deeper subtasks. Powerful, but adds coordination complexity.
Choosing the wrong pattern is a common mistake. Parallelizing a sequential task adds overhead without benefit; sequentializing an independent task wastes time.
Creating Subagents with Claude Code CLI
Claude Code gives you two ways to create subagents: interactively via the /agents command, or manually as markdown files. Both result in the same thing — a .md file in a .claude/agents/ directory.
The Interactive Way
/agents create
This walks you through a guided setup: name, description, tools, model, and scope. At the end, it saves the file, and the agent is available immediately.
The Manual Way
Create a markdown file directly — the frontmatter defines behavior, the body is the system prompt:
---
name: security-reviewer
description: "Expert security reviewer. Use PROACTIVELY after any changes to auth, data handling, or API endpoints."
tools: Read, Grep, Glob
model: haiku
permissionMode: plan
---
You are a senior security engineer reviewing code for vulnerabilities.
When invoked:
1. Identify recently changed files
2. Analyze for OWASP Top 10 vulnerabilities
3. Check for secrets, SQL injection, and hardcoded credentials
4. Report findings with severity levels and remediation steps
Where the File Lives
Subagent scope is determined by which directory the file is placed in:
| Scope | Path | When to use |
| --- | --- | --- |
| Project | `.claude/agents/` in project root | Team-shared agents, commit to version control |
| User | `~/.claude/agents/` in home dir | Personal agents available across all projects |
Project scope is the recommended default — it makes subagent definitions shareable via version control. Use user scope for general-purpose agents you want available everywhere, regardless of which repo you're in.
Best Practices
Write descriptions that trigger correctly. Claude uses the description field to decide when to invoke a subagent automatically. Be specific and include PROACTIVELY if you want it auto-triggered — for example: "Use PROACTIVELY after any changes to authentication or data handling."
Restrict tools intentionally. The tools field restricts what the agent can do — a security auditor only needs Read, Grep, and Glob, and has no business writing files. That restriction is worth being explicit about.
Match model to task complexity. Route subagent exploration to cheaper, faster models like Haiku and reserve Opus for genuine architectural reasoning. A read-only code scanner doesn't need the same model as an agent writing production code.
Keep system prompts focused. A subagent with a narrow, well-defined role outperforms a generalist one. If the prompt starts covering many different concerns, split it into two agents.
The Practical Challenge: Context Management
Each subagent starts fresh with no shared memory. This means:
The orchestrator must craft every subagent prompt with all the context it needs to succeed
Subagent outputs must be concise enough to fit back into the orchestrator's context alongside everything else
If outputs are large, the orchestrator must summarize before aggregating
Good subagent design is largely information architecture: what does each agent need to know, what must it produce, and how does that output flow back into the whole.
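To make the fan-out pattern concrete, here is a minimal TypeScript sketch of an orchestrator using the Anthropic SDK. The model name, the 300-word cap, and the task list are illustrative assumptions, not an official recipe:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Each subagent gets a focused prompt and its own context window;
// only a concise result flows back to the orchestrator.
async function runSubagent(task: string): Promise<string> {
  const res = await client.messages.create({
    model: "claude-haiku-4-5", // assumption: a cheap, fast model for exploration
    max_tokens: 1024,
    messages: [
      { role: "user", content: `${task}\n\nReply with a summary under 300 words.` },
    ],
  });
  const block = res.content[0];
  return block.type === "text" ? block.text : "";
}

// Parallel fan-out: independent subtasks launch simultaneously,
// and the orchestrator aggregates the concise results.
async function research(competitors: string[]): Promise<string> {
  const findings = await Promise.all(
    competitors.map((c) => runSubagent(`Research recent product launches by ${c}.`))
  );
  return findings.map((f, i) => `## ${competitors[i]}\n${f}`).join("\n\n");
}
```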
Safety Considerations
When subagents can take real-world actions, safety boundaries matter more than in single-agent systems:
Least privilege — Give each subagent only the tools it actually needs. A research agent doesn't need write access to a production database.
Prompt injection — Subagents that browse the web or read external files can encounter content designed to manipulate their behavior. This is a real attack surface in agentic systems.
When to Use Subagents
| Use subagents when… | Avoid subagents when… |
| --- | --- |
| Task has clearly separable subtasks | Task is straightforward for one agent |
| Parallel execution saves meaningful time | Subtasks share too much state to delegate |
| Total work exceeds one context window | Coordination overhead exceeds the benefit |
| Different subtasks need different tools | You're still prototyping — stay simple first |
Start with a single-agent approach and introduce orchestration when it genuinely starts to strain. Complexity has a cost.
Subagents vs. Skills: A Quick Note
These two terms sometimes get conflated, but they operate at completely different layers. A skill is a passive instruction document — a markdown file Claude reads before a task to understand best practices, available libraries, and output conventions. A subagent is an active execution unit that runs, uses tools, and returns results.
The honest relationship: a subagent might read a skill before doing its work. One shapes knowledge; the other executes.
Final Thoughts
The broader Claude stack can be conceptualized as five layers: MCP for connectivity, Skills for task-specific knowledge, Agent as the primary worker, Subagents as parallel independent workers, and Agent Teams for coordination.[^3] These building blocks are shipping in rapid succession, and the pattern is maturing fast.
For developers just entering this space: you don't need to build full orchestration systems on day one. But understanding the pattern — how delegation works, what subagents can and can't do, where the sharp edges are — will shape how you think about AI architecture from the start.
Subagents aren't a feature. They're a shift in how we think about what AI can be tasked with doing.
If you have reached this point, I have made a satisfactory effort to keep you reading. Please be kind enough to leave any comments or share any corrections.
📈Hadil Ben Abdallah·Apr 27, 2026·7 min read·Global
What Is MCP (Model Context Protocol) and Why It Needs a Gateway in Production — A Practical Guide for AI Engineers
#ai#machinelearning#backend#devops
It always starts with “just one integration”.
You want your AI agent to send a message to Slack. So you wire it up. A bit of custom code, some API calls, done.
Then someone asks for GitHub access. Then Jira. Then your internal database. Then Notion.
Before you realize it, you’re not building an AI system anymore; you’re maintaining a web of fragile integrations.
Every new tool means new code. Every update breaks something. Every credential becomes a security risk.
If you have 10 agents and 20 tools, you’re suddenly dealing with 200 possible connections.
This is what Anthropic called the N×M problem.
And that’s exactly the mess MCP (Model Context Protocol) was designed to fix.
What Is MCP (Model Context Protocol)?
At its core, MCP is simple, and that's why it matters.
MCP is an open standard that defines how AI agents connect to and use tools.
Think of it like USB-C for AI.
From fragmented integrations to a unified interface — MCP standardizes how AI agents connect to tools through MCP servers, replacing N×M integrations with a single protocol (USB-C analogy)
This is the shift MCP introduces: from point-to-point integrations to a shared, standardized interface.
You don’t build a custom cable for every device anymore. You define one standard interface, and everything plugs into it.
That’s what MCP does for AI systems.
Instead of writing custom integrations for every tool, you expose tools through something called an MCP server.
An MCP server is just a program that describes what a tool can do, in a structured, standardized way.
For example:
A Slack MCP server might expose:
send_message
search_messages
A GitHub MCP server might expose:
list_repos
create_pull_request
Once that’s done, any MCP-compatible AI can discover and use those tools without writing new integration code.
That’s the key shift.
You stop building connections manually. You start plugging into a shared ecosystem.
Why MCP Took Off So Fast
MCP didn’t just stay theoretical.
It gained traction quickly because it solves a very real pain engineers were already feeling.
After Anthropic introduced it, other major players followed:
OpenAI
Google DeepMind
And by 2026, it was contributed to the Linux Foundation, which gave it real credibility as an open standard.
That combination, real pain + standardization + adoption, is why MCP is now everywhere.
If you’re building AI systems today, you’re going to run into it.
What MCP Solves (And Why It’s a Big Deal)
MCP solves one specific problem extremely well:
How agents talk to tools.
It standardizes:
Tool discovery (what tools exist?)
Tool capabilities (what can they do?)
Tool invocation (how do I call them?)
That’s it.
And honestly, that’s enough to unlock a lot.
You go from:
“Every integration is custom”
to:
“Every tool speaks the same language”
That alone removes a huge amount of engineering friction.
What MCP Doesn’t Solve (This Is Where Things Break)
This is the part most articles skip.
MCP solves the protocol layer, the language agents and tools use to communicate.
But it doesn’t solve what happens around that communication.
And that’s where things start to fall apart in production.
MCP does not handle:
Authentication at scale (who owns which credentials?)
Access control (which agent can use which tool?)
Observability (what did the agent actually do?)
Security (what if a tool returns malicious output?)
Governance (audit logs, compliance, traceability)
In a demo, that’s fine.
MCP works perfectly in demos because nothing is constrained.
Production systems are defined by constraints: security, cost, and control.
In a real system, that’s a problem.
Because now your agents have direct access to tools without a control layer in between.
That’s not just messy.
It’s risky.
So… Why Does MCP Need a Gateway?
An MCP Gateway is the layer that sits between your agents and your MCP servers.
It doesn’t replace MCP.
It makes MCP usable in production.
MCP standardizes communication. The gateway standardizes control.
Instead of every agent talking directly to every tool, everything goes through a centralized control point.
That’s where things start to get structured.
What an MCP Gateway Actually Adds
Once you introduce a gateway, a few important things change immediately.
1. One entry point instead of many
Agents don’t connect to 10 different tools.
They connect to one gateway.
That alone simplifies architecture more than most teams expect.
2. Centralized authentication
Instead of embedding credentials everywhere, the gateway manages them.
Agents authenticate once. The gateway handles the rest.
3. Real access control (RBAC)
You can define:
Which agents can access which tools
Which teams can use which capabilities
No more “everything can call everything.”
4. Tool discovery without hardcoding
Agents don’t need to know tools upfront.
They can discover available tools dynamically through the gateway.
That removes a ton of brittle logic.
5. Guardrails on every tool call
Every request and response can be inspected.
That means you can:
Block unsafe inputs
Filter sensitive outputs
Detect prompt injection patterns
Before anything causes damage.
6. Full audit trail
Every action is logged.
Every tool call is traceable.
You can answer:
“What exactly did this agent do?”
Without guessing.
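A toy TypeScript sketch of that control point. Every name here is hypothetical; real gateways are far more involved:

```typescript
type ToolCall = { agentId: string; tool: string; args: unknown };

// RBAC table: which agent may call which tool (hypothetical entries).
const acl: Record<string, Set<string>> = {
  "compliance-agent": new Set(["github.read_diff", "jira.create_ticket"]),
};

// Stand-ins for the gateway's credential store and MCP routing layer.
declare function vaultLookup(tool: string): Promise<string>;
declare function invokeMcpServer(tool: string, args: unknown, creds: string): Promise<unknown>;

async function gateway(call: ToolCall): Promise<unknown> {
  // Access control: deny by default.
  if (!acl[call.agentId]?.has(call.tool)) {
    throw new Error(`denied: ${call.agentId} may not call ${call.tool}`);
  }
  // Audit trail: every tool call is logged before it runs.
  console.log(JSON.stringify({ ts: Date.now(), ...call }));
  // Centralized authentication: credentials never live in the agent.
  const creds = await vaultLookup(call.tool);
  return invokeMcpServer(call.tool, call.args, creds);
}
```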
The Piece Most Teams Don’t Think About: Virtual MCP Servers
This is where things get more interesting.
Even with MCP, exposing tools directly can be dangerous.
You don’t always want to expose everything a tool can do.
For example:
Your GitHub MCP server might support:
creating PRs
deleting repos
modifying configs
You probably don’t want an agent calling all of those.
This is where Virtual MCP Servers come in.
Instead of exposing raw tools, you create a curated layer.
In practice, this doesn’t look like raw tool endpoints; it looks like a managed layer where MCP servers are grouped and selectively exposed.
Managing MCP servers in a production environment — grouping tools, configuring access, and creating virtual MCP layers for controlled exposure (source: TrueFoundry platform)
You define:
Which tools are allowed
Which actions are safe
Which capabilities are hidden
And you expose only that to your agents.
No new deployments. No custom code.
Just controlled exposure.
This ends up being one of those features teams only realize they need after something goes wrong.
What This Looks Like in Practice
Let’s make this concrete.
Imagine a compliance automation agent.
It needs to:
Read changes from GitHub
Store a diff in MongoDB
Create a Jira ticket
Notify a team on Slack
Without structure, that’s four different integrations, four different auth systems, and zero visibility.
With MCP, those tools are standardized.
With an MCP Gateway, they’re controlled.
The agent connects to one endpoint.
The gateway:
Authenticates each step
Routes requests to the right tool
Logs every action
Applies guardrails
If something looks risky (for example, a diff that touches sensitive files), the gateway can pause execution and require approval.
That’s the difference.
You’re not just executing tasks. You’re managing them.
Where TrueFoundry Fits In
In the context of MCP, this is exactly the layer platforms like TrueFoundry are built for.
In practice, you don’t want to manage three separate concerns:
LLM routing and cost control (AI Gateway)
Tool access via MCP (MCP Gateway)
Agent execution and workflows (Agent Gateway)
You want a single control plane that handles all of them together.
That’s the shift TrueFoundry makes. It unifies these layers into one gateway architecture, so you’re not stitching together governance, observability, and security across multiple systems.
In practice, this unified gateway layer connects both models and tools under a single control plane.
Unified gateway architecture connecting applications to both LLM providers and MCP-based tools through a centralized control plane for routing, governance, and observability (Source: TrueFoundry website)
MCP standardizes communication. The gateway standardizes control.
Instead of scattered logic and duplicated integrations, everything runs through a centralized layer where:
LLM access is managed
Tool access (via MCP) is governed
Agent workflows are observable
All in one place.
It also brings the enterprise guarantees most teams eventually need:
Recognized in the 2026 Gartner® Market Guide for AI Gateways
Processes 10B+ requests per month
Handles 350+ RPS on a single vCPU with sub-3ms latency
Supports VPC, on-prem, air-gapped, and multi-cloud deployments
Compliant with SOC 2, HIPAA, GDPR, ITAR, and EU AI Act
Trusted by enterprises including Siemens Healthineers, NVIDIA, Resmed, and Automation Anywhere
The important part isn’t just the numbers.
It’s the idea of centralized control across the entire AI stack, where protocols like MCP handle communication, and a unified gateway ensures everything around that communication is secure, observable, and governed.
The Shift Most Teams Don’t See Coming
At first, MCP feels like the solution.
And it is, for a specific problem.
But once you move beyond a prototype, the challenge changes.
It’s no longer:
“How do I connect an agent to a tool?”
It becomes:
“How do I control, secure, and observe everything that happens between them?”
That’s not a protocol problem anymore.
That’s an infrastructure problem.
And that’s exactly where the gateway comes in.
Final Thoughts
MCP solves something real.
It standardizes how agents talk to tools, and that alone removes a massive amount of complexity.
But it doesn’t solve what happens around that interaction.
That’s where things get messy.
An MCP Gateway is what brings structure back:
Control over access
Visibility into behavior
Guardrails around execution
If you’re still experimenting, MCP alone might be enough.
But the moment your system starts scaling (more agents, more tools, more risk), you'll feel the gap.
That’s the point where a gateway stops being optional.
You can try TrueFoundry free, no credit card required, and deploy it in your own cloud in under 10 minutes. It’s a practical way to see how a unified gateway can bring control, observability, and safety to MCP-based systems without slowing your team down.
Thanks for reading! 🙏🏻 I hope you found this useful ✅ Please react and follow for more 😍 Made with 💙 by Hadil Ben Abdallah
Harness Engineering in Practice: Building a 6-Agent System That Runs Itself
#ai#agents#openclaw#llm
“Six agents” here means one orchestrator (Zoe) plus five specialist agents. Six ACP coding experts run as concurrent implementation workers — not counted in that headline number.
Your Day Has Been Taken Over
Overnight, the trading agent ships the prior US session wrap-up. By morning, the macro analyst has the pre-market brief ready. The butler has pushed weather, schedule, and to-dos. AINews (AI Sentinel) has scanned GitHub Trending, arXiv’s latest papers, and 100+ sources — 18+ curated items ranked by importance. Content (Content Strategist) is tracking trending topics across 50+ platforms.
Here’s what matters most to me — automatic tracking of AI dynamics and tech trends. After discovering valuable projects or papers, the system doesn’t just push news — it evaluates impact on our systems and provides P0/P1/P2 action recommendations. Valuable discoveries enter Zoe’s Tech Radar (Zoe is the CTO Agent), going through evaluation → decision → delegated coding implementation.
60 cron tasks run automatically every day (3 AM backup to 11:45 PM reflection). Agents are evolving on their own — mistakes are remembered, recurrence rates drop significantly. This isn’t rules I wrote — it’s autonomous iteration from .learnings/ to MEMORY.md.
Designed protocols — Zoe diagnosed communication issues, designed three-state protocol (request → confirmed → final, with silent as the default "no news is good news" state), solidified into AGENTS.md
Self-developed Skills — Content researched ways to make drafts sound less generically LLM-written (“de-AI” polish), wrote Skills, published to ClawHub (shared repository)
Strategy roundtables — Macro + Trading produce weekly reports with data snapshots, position recommendations, stop-loss discipline
Task Watcher — Zoe designed cron-level Task Callback Event Bus for async monitoring
My role: Set up framework, establish constraints, confirm direction. Requirement discovery, solution research, protocol design, implementation — all done by agents.
Critical capability: Proactive tech impact evaluation. Discovered ReMe framework → proposed to Zoe → I confirmed → agents executed.
Toolchain: github_trending.py, rss_aggregator.py, arxiv_papers.py, Tavily, agent-browser. Anti-hallucination: every item MUST have a URL, reachability self-check, and unverifiable items labeled single-source.
Not financial advice — automated research output only; you are responsible for any real-money decisions.
Hard rules (system policy, not investment advice): no entry without a defined stop, never fabricate data, confidence <60% = “wait.”
Macro (Chief Economist)
9 cron tasks: Morning (07:50) → Midday (12:30) → Evening (18:00) → US pre-market (22:00) → morning digest of the prior US session (05:20 PT, scheduled after the cash close, not at the closing bell) → Sunday weekly review, which Trading references for market review.
Discipline: Cite sources, distinguish facts vs judgments, mark confidence (high >70% / medium 50–70% / low <50%), propose counter-arguments.
Real case: Iran tension → traditional: “gold rises” → actual: oil +14%, gold -5%. Macro: “inflation logic dominates, not safe haven.” Saved to MEMORY.md.
Autonomous evolution: Discovered content too “AI-flavored” → researched humanizing / de-generic copy tools → wrote Skills → published to ClawHub.
Five-Basket Radar: AI/Tech (≤40%), Product/Startup, Solopreneur, Investment/Macro, Social/International. 40% AI cap self-imposed during reflection.
Butler (Life Assistant)
7 cron tasks: Greeting (08:00) → Schedule (08:30) → 5 water reminders (rotating styles) → Health (20:00) → Summary (22:00).
Philosophy: <50 chars per reminder, ≥1.5h interval, 23:00–07:00 emergency only, no pestering if no reply.
ACP Coding Experts
Pi / Claude Code / Codex / OpenCode / Gemini / GPT-4.1-Codex. Max 6 concurrent, 120min TTL — queue or shed load when saturated so you don’t stampede gateways. Analysis agents don’t code — delegated via sessions_spawn.
Design Lesson
Don’t let analysis agents code directly. Early setup: coding + architect + PM roles. Result: almost no output, high overlap with Zoe + ACP, increased complexity. Cut them all. Zoe handles PM + architect.
Complexity grows fast: pairwise coordination explodes (six specialists ≈ fifteen pairwise handoffs if everyone talks to everyone). Each new agent ≈ half a day debugging conflicts, resource competition, and rule compatibility.
Three Core Engineering Problems
Problem 1: Context Is the Agent’s OS
The Problem: Entropy Always Increases
Without constraints, agent systems deterministically collapse. Agents are processes without an OS: no memory management, no garbage collection, no OOM protection.
Fix: Clean session → ThrottleInterval 1→10 → idleMinutes 180→30 → execution policy tightened from permissive to allowlist (smaller blast radius; keep the list maintained). These were the four missing defense lines.
P1 — 3,500 Chars → 800 Chars
Trading's flash report had data tables. OpenClaw auto-compacted the report once it exceeded textChunkLimit, and the tables were "intelligently compressed" away. AI "help" is a disaster in data-dense scenarios.
P2 — Rules Ignore After Bloat
Sessions bloat to 10K+ tokens → agents “selectively comply.” Butler doing investment analysis. Trading ignoring validation. Critical info drowned in noise.
Skills: Via extraDirs on-demand (Trading: 15 Skills, 68K lines on disk—retrieve or inject only the 1–3 relevant fragments per turn, not the whole tree)
shared-context/: Cross-agent state, read via tools
Obsidian: Cold storage, archives output, no inference
New session → SOUL.md + AGENTS.md + MEMORY.md + .learnings/ → memorySearch → shared-context/ = the agent knows who it is, what it has done, and what the team is doing.
Problem 2: Let Agents Remember and Grow
The Problem: Repeating Mistakes
Trading got BILLBOARD_BUY_AMT wrong 5 times (wrote BUY_AMT). Session reset → lost memory → repeat. User corrects → agent changes → 3 days later same scenario → same error.
Chatbot vs Agent dividing line: Agents learn from mistakes.
Solution: Five-Layer Memory
Autonomous Memory: 6-Step Cycle
Trigger: Operation failed · User corrected · Better approach found
L4 Recording: Write to .learnings/ERRORS.md or LEARNINGS.md
Daily Reflection (22:00): Review .learnings/, Zoe aggregates cross-agent value
PROMOTE: 3+ verifications → MEMORY.md, single → keep observing
L2 Sedimentation: Weekly compression, <3,000 tokens
L5 Skill: Generalizable → write as Skills → ClawHub
This is the core mechanism. Without it: chatbot. With it: agent.
Problem 3: Let Agents Collaborate
The Problem: Multi-Agent Communication
Initial issues:
Status sync failures: A finished, B didn’t know
Resource contention: Multiple agents write same file
Information silos: Macro produced, Trading never saw
Responsibility gaps: “Who’s handling this?” → all silent
Solution: Three-State Protocol + Event Bus
Protocol (three active states + default silent state):
request → confirmed → final → [silent]
request: Explicitly acknowledges, starts the loop
confirmed: In progress, sends intermediate updates
final: Complete, result delivered, loop closes
[silent]: Default state when no active task — “no news is good news” (prevents spam)
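A small sketch of how the protocol above can be enforced in code. This transition table is my reading of the states, not the author's implementation:

```typescript
type ProtocolState = "silent" | "request" | "confirmed" | "final";

// Legal transitions; anything else is a protocol violation.
const transitions: Record<ProtocolState, ProtocolState[]> = {
  silent: ["request"],               // a new task opens the loop
  request: ["confirmed"],            // the receiver explicitly acknowledges
  confirmed: ["confirmed", "final"], // intermediate updates, then completion
  final: ["silent"],                 // loop closes; back to "no news is good news"
};

function advance(current: ProtocolState, next: ProtocolState): ProtocolState {
  if (!transitions[current].includes(next)) {
    throw new Error(`illegal transition: ${current} -> ${next}`);
  }
  return next;
}
```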
My job: design a system where the code writes itself. I'm the architect, not the bricklayer.
Looking Ahead
Next steps:
P0: Dead-letter queue for failed events
P1: Manual resend CLI for stuck tasks
P1: Audit log rotation
P2: Visual dashboard for system health
Goal: Amplify human capability. One person + six agents > one person + zero agents. That’s Harness Engineering.
Quick Reference
Agent Roster
Total: 60 cron, ~90 Skills
Daily Schedule (Pacific Time, America/Los_Angeles)
Cron rows are snapshots from my stack — align to your exchange calendar, asset class (equities vs. crypto), and whether you are on PST or PDT.
Critical Files
If you run a similar harness, how do you handle failures when compaction, cron, and multi-agent handoffs all interact — what breaks first in your stack, and what fixed it?
All times Pacific Time (America/Los_Angeles; PST or PDT depending on season). macOS + OpenClaw. Monitoring: Feb–Mar 2026. Validate config against your OpenClaw release at https://github.com/openclaw/openclaw
Full context feels safe until it isn’t. Here’s the engineering fork in the road — and real numbers from an open-source memory layer on OpenClaw.
Stateless isn’t a feature. It’s the default bug.
That’s the part nobody puts on the landing page: large language models don’t continue anything on their own. Every turn is a fresh sheet of paper — unless you shovel history back in.
So the industry reached for two comforting reflexes:
Crank the context window — pack in everything that might matter. Longer feels safer.
Drop a MEMORY.md and paste it every turn — simple, auditable, easy to debug.
Both are great at small scale. Both fall apart at real scale.
Because context isn’t free. You pay three ways: slower, pricier, and muddier. Inference drags. Your token bill climbs linearly. Worst of all, as context grows, attention thins out in the middle — quality drops, contradictions creep in, and you’re not buying “memory.” You’re buying noise.
So the real question isn’t whether to remember. It’s this:
What shape should memory take before it enters the model — verbatim dump, or retrieve what matters?
This piece is about that engineering fork — and some numbers we saw hooking an open memory layer to OpenClaw.
Full context is like reciting your entire diary before every sentence
A lot of people hear “long-term memory” and think: store everything anyone ever said.
A better engineering definition is tighter:
Keep only facts that help future decisions — and that retrieval can amplify — and let stale stuff expire on purpose.
Human memory isn’t a bit-perfect disk image. We lose detail; we blur timelines; we still keep actionable residue — “no cilantro,” “last time we were blocked on that dependency.” Flip that around for AI: full retain + full inject often buys you less coherence and more contradiction and context pollution.
That’s why I keep coming back to three knobs:
Write path: what do you distill from chat into durable memory?
Read path: what do you retrieve — and how much — before the model sees it?
Lifecycle: how do old facts fade instead of squatting forever?
What should a “memory system” actually look like?
One sane pattern: a persistent memory layer outside the LLM — think PowerMem (Apache 2.0 from OceanBase) — that extracts salient facts from dialogue (dedupe, conflict update, merge related), recalls on demand, and forgets stale items with an explicit decay policy.
A few properties that actually matter in production:
Hybrid retrieval — vectors + full-text + graph-style links. Fuzzy intent and exact keywords need to hit. “Embedding-only search” ages poorly in real products.
Forgetting isn’t a bug — Ebbinghaus-style decay sounds like a psych meme; it’s really a capacity vs. signal-to-noise trade you’re engineering on purpose.
Multi-agent — private memory and shared memory across agents. Multi-agent isn’t “someday”; it’s now. Single-user, single-session assumptions break fast.
Multimodal — text, images, audio. Not for show — workflows are already messy.
If you list those as a feature matrix, it reads like marketing. In engineering terms, they answer one question: how do you put the smallest useful slice of memory in front of the model this turn?
The benchmark doesn’t care about your vibes: LOCOMO vs. “just paste everything”
On LOCOMO (long-dialogue memory benchmark; Maharana et al., ACL 2024), PowerMem vs. a full-context baseline isn’t a rounding error:
The point isn’t “pick a winner.” It’s that retrieval + extraction beats brute-force context on quality, latency, and cost at the same time — which feels backwards until you realize the information shape changed: from “replay the transcript” to structured, retrievable facts.
OpenClaw: the anti-pattern you can actually measure
Out of the box, OpenClaw can ship the entire MEMORY.md into system_prompt every turn, with no retrieval—and the file keeps growing.
That’s full-context thinking in a real toolchain: simple, transparent, explainable — right up until it starts eating you alive.
Same workload, total input tokens:
The PowerMem plugin lands around ~18% of the default — same ballpark as “stop reciting the encyclopedia before answering one question.”
The integration model is what you’d want if you designed it on purpose: retrieve before the session, inject only what’s relevant; extract after the session, persist durable facts — instead of mirroring the whole file into the prompt every time.
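In pseudocode, the turn loop looks roughly like this. The memory.search / memory.extract names are hypothetical stand-ins for the flow, not PowerMem's actual API:

```typescript
// Hypothetical memory-layer client; PowerMem's real surface may differ.
declare const memory: {
  search(query: string, opts: { topK: number; sessionId: string }): Promise<{ text: string }[]>;
  extract(opts: { sessionId: string; messages: string[] }): Promise<void>;
};
declare function callModel(req: { system: string; user: string }): Promise<string>;

async function runTurn(userMsg: string, sessionId: string): Promise<string> {
  // Read path: retrieve only the slice of memory relevant to this turn.
  const hits = await memory.search(userMsg, { topK: 5, sessionId });
  const system = `Relevant facts:\n${hits.map((h) => `- ${h.text}`).join("\n")}`;

  const reply = await callModel({ system, user: userMsg });

  // Write path: distill durable facts after the turn, not the raw transcript.
  await memory.extract({ sessionId, messages: [userMsg, reply] });
  return reply;
}
```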
Make it run: OpenClaw + PowerMem (copy-paste path)
Pick one path: use ClawHub (step 2) or the manual server + JSON (step 3) — you don’t need both.
If that’s green, you’ve swapped “paste the whole scroll every turn” for retrieve-then-inject + capture-after.
Two layers that don’t negotiate: operation plane + cognitive plane
If you’re going to run memory in production, you usually need both:
pmem CLI — humans and agents share the same front door: scriptable, automatable, boring in the good way.
Dashboard — distributions, health, “what did we actually memorize?” so humans can govern instead of guessing.
Agents need low-friction execution. Humans need explainability and ops judgment. Skip the operation plane and memory never ships. Skip the cognitive plane and you end up with vectors in a black box — and the only fix is nuke from orbit.
What the open-source argument should actually be about
If you’re building agents, CLIs, or personal automation, move the debate past “do we keep a MEMORY.md?”
Is memory a document or a database?
Is injection append-only or retrieve-then-inject?
Do you have real decay, or do you pretend “never expires” means “always correct”?
Projects like PowerMem are less a billboard and more a reproducible lab bench—hybrid retrieval, extraction, decay, multi-agent, multimodal—trading context signal-to-noise for engineering you can argue about in issues instead of vibes.
If you take one line home, make it this:
Long-term memory isn’t about remembering more. It’s about recalling the right thing when it matters.
🏗️ Building Agents Like Claude Code — A Source-Derived Blueprint 📘
#ai#llm#webdev#tutorial
A comprehensive synthesis of the claude-code-from-source project (and companion site claude-code-from-source.com) — distilled into core principles, techniques, and actionable guidelines for builders who want to ship a coding agent of comparable quality.
The source repo is an 18-chapter educational reverse-engineering of Claude Code derived from npm source maps. No proprietary code is reproduced — only architectural pseudocode and design rationale. This guide does the same.
0. 💡 TL;DR — the whole agent in one mental picture
Before the details, hold this picture in your head. Everything else is elaboration.
┌─────────────────────────────────────────────────────────────┐
│ query() — async generator, the only place control flows │
│ │
│ while not done: │
│ state = compress(state) # 4 layers │
│ response = await stream(model, state) │
│ yield response.messages # to UI │
│ if no tool_calls: return completed │
│ batches = partition(response.tool_calls) │
│ for batch in batches: │
│ results = run(batch) # parallel-safe │
│ yield results.messages │
│ state += results │
└─────────────────────────────────────────────────────────────┘
▲ ▲ ▲ ▲
│ │ │ │
Memory (files) Tools (self- Hooks (27 Sub-agents
loaded into describing, lifecycle (recursive
system prompt fail-closed, events) query() with
at session partitioned by isolated state)
start safety per-call)
Five rules carry 80% of the design:
🔄 The loop is an async generator. Backpressure, cancellation, and typed terminal states fall out for free.
📝 Every tool is self-describing (schema, permissions, concurrency safety). The loop never special-cases tools.
🛡️ Safety is per invocation, not per tool type. Bash("ls") ≠ Bash("rm -rf").
💾 Prompt cache is architecture, not optimization. Static-then-dynamic boundary, sticky flags, byte-identical fork prefixes.
📁 Memory is files. A small LLM picks which to load. No database, no embeddings. Trust through transparency.
If you only build those five things well, you have ~80% of Claude Code. The rest is layering and polish.
1. 🎯 What you are actually building
A production coding agent is not a chat loop with tool calls bolted on. It is a streaming, cancellable, recursive state machine that has to:
Survive token-budget exhaustion mid-task without losing the user's work.
Run dozens of tools per turn safely, often in parallel, sometimes speculatively.
Spawn child agents that cost ~10% of a normal call thanks to prompt cache reuse.
Persist semantic knowledge across sessions without a database.
Allow third parties to extend it (skills, hooks, MCP) without crashing the host.
Boot in under 300 ms and stream the first token in well under a second.
If your design omits any of these, you will hit a wall later. Build for them on day one — most of them are cheap when planned, expensive when retrofitted.
The closing principle of the source book: push complexity to the boundaries. Protocol translation, state reconciliation, external tool invocation, permission checking — these belong at the edges. The interior (loop, memory, tool composition) stays clean and exhaustively typed.
2. 🧱 The six core abstractions
Every part of Claude Code reduces to one of these. Implement them as first-class modules, not as helpers attached to a god object.
| # | Abstraction | Responsibility | Approx LoC in CC |
| --- | --- | --- | --- |
| 1 | Query Loop | Async generator that streams model output, runs tools, appends results, decides when to stop. Returns a typed Terminal discriminated union (10 reasons). | ~1,700 |
| 2 | Tool System | Self-describing tools with schema, permissions, concurrency, rendering. Batched into concurrent/serial groups. Speculative execution during streaming. | — |
| 3 | Tasks | Background units following a `pending → running → completed \| failed` lifecycle. | — |
| 4 | State | Two layers: a mutable singleton STATE (~80 fields, infrastructure) + a 34-line reactive store (UI: messages, approvals, progress). | — |
| 5 | Memory | File-tier persistence (CLAUDE.md, ~/.claude/MEMORY.md, team symlinks). LLM picks relevant memories at session start. | — |
| 6 | Hooks | Lifecycle interceptors at 27 events, in 4 forms: shell command, single-shot prompt, agent loop, HTTP webhook. | — |
Why this carving
The Query Loop is the only place control flow lives. Tools, hooks, sub-agents — they all yield through it.
State is split because infrastructure mutates rarely but reads constantly; UI is the opposite. One subscription model can't serve both.
Memory is its own primitive (not a tool) because it is read on every system-prompt build, before any tool can run.
Hooks are first-class because the permission system itself runs partially as PreToolUse hooks. They are not an afterthought.
3. 📦 State: two tiers, one source of truth
State design is where most agent codebases collapse. Claude Code splits it into two tiers with strict layering:
| Tier | What it holds | Mutability | Reachable from |
| --- | --- | --- | --- |
| Bootstrap state (STATE) | ~80 fields: originalCwd, sessionId, model overrides, cost accumulators, telemetry handles, prompt-cache allowlists | Mutable through ~100 typed setters | Everywhere — DAG leaf, depends on nothing but Node.js stdlib |
| AppState (reactive store) | Messages, input mode, tool approvals, progress indicators, todos | Immutable snapshots; updater functions only | Inside React components |
Why split them
Availability: session ID, telemetry, and cost trackers must exist before React mounts. A reactive store cannot serve them.
Access pattern: bootstrap state is read constantly, mutated rarely, with no subscribers. AppState is read by render subscribers on every change. One subscription model can't serve both.
Dependency direction: bootstrap depends on nothing → AppState imports bootstrap → React imports AppState. Enforce this with a lint rule. Cycles will sneak in otherwise.
The reactive store in 34 lines
function makeStore(initial, onTransition) {
let current = initial
const subs = new Set()
return {
read: () => current,
update: (fn) => {
const next = fn(current)
if (Object.is(next, current)) return // skip noop
const prev = current; current = next
onTransition?.(prev, next) // side effects FIRST
subs.forEach(cb => cb()) // then UI
},
subscribe: (cb) => { subs.add(cb); return () => subs.delete(cb) },
}
}
Three deliberate choices:
Updater-only mutations. No set(value) API. Stale-closure bugs vanish.
Object.is guard. Identical references skip re-renders and side effects.
onTransition fires before listeners. Side effects (e.g. persist to disk, notify remote session) complete before the UI flips.
The sticky latch pattern (write-once flags)
A pattern worth memorizing — applies any time a value influences a server-side cache key:
type Latch = boolean | null // null = "not yet evaluated"
function shouldSendBetaHeader(featureCurrentlyActive: boolean): boolean {
const latched = getAfkLatch()
if (latched === true) return true // already on — keep sending
if (featureCurrentlyActive) {
setAfkLatch(true) // first activation — latch
return true
}
return false // never activated
}
The three-state type self-documents intent: null says "we haven't decided yet." Once true, never returns to false. Five such latches in Claude Code prevent mid-session feature toggles from busting 50–70K tokens of cached prompt.
Centralizing side effects on diffs
A real production bug: permission mode was synced to the remote session by 2 of 8+ mutation paths. Eventually one drifted. The fix was a single onChangeAppState(prev, next) callback that detects field changes structurally — every mutation path is automatically covered. Side effects scale much more slowly than mutation sites; centralize on diffs, not events.
Cost tracking (a concrete example)
Every API response runs through addToTotalSessionCost:
Accumulates per-model usage in bootstrap state.
Reports to OpenTelemetry.
Recursively processes nested model calls (sub-agents, recall queries).
Persists to project config on process exit.
Restores on next session only if the persisted session ID matches.
Histograms use reservoir sampling (Algorithm R) with 1,024 entries to compute p50/p95/p99. Averages hide tail latency, and tail latency is what users feel.
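For reference, Algorithm R in TypeScript, which keeps a uniform sample in constant memory. A generic sketch, not Claude Code's code:

```typescript
class Reservoir {
  private samples: number[] = [];
  private seen = 0;
  constructor(private readonly capacity = 1024) {}

  add(value: number): void {
    this.seen += 1;
    if (this.samples.length < this.capacity) {
      this.samples.push(value); // fill phase
    } else {
      // Keep each of the `seen` values with equal probability capacity/seen.
      const slot = Math.floor(Math.random() * this.seen);
      if (slot < this.capacity) this.samples[slot] = value;
    }
  }

  percentile(p: number): number {
    const sorted = [...this.samples].sort((a, b) => a - b);
    if (sorted.length === 0) return NaN;
    const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
    return sorted[idx];
  }
}
```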
Actionable: even in v0, instrument cost and latency. You cannot decide what to optimize from feel.
4. ⚙️ The agent loop: AsyncGenerator as control plane
The loop is an async function* — not a while with callbacks, not an event emitter, not an RxJS pipeline. There are three concrete reasons to choose generators:
Backpressure for free. A generator yields only when the consumer calls .next(). The REPL pulls via for await, naturally pausing if the UI can't render fast enough.
Typed terminal states. The generator's return is a discriminated union of why execution stopped: completed, max_turns, error, aborted_streaming, aborted_tools, prompt_too_long, image_error, model_error, stop_hook_prevented, hook_stopped, blocking_limit. The compiler enforces exhaustive handling.
Composability. Inner generators delegate via yield*. No callback nesting, no promise plumbing.
Loop skeleton
async function* query(initialState) {
  let state = initialState
  while (true) {
    state = compress(state)                  // 4-layer pipeline (§8)
    const response = await callModel(state)  // streaming
    yield* response.messages                 // surface to UI

    if (response.error && isRecoverable(response.error)) {
      state = recover(state, response.error)
      continue
    }
    if (response.error) {
      return { kind: 'model_error', error: response.error }
    }

    if (!response.toolCalls.length) {
      if (stopHookBlocks(state)) {
        state = applyHookFeedback(state)
        continue
      }
      return { kind: 'completed' }
    }

    const batches = partitionToolCalls(response.toolCalls)
    for (const batch of batches) {
      const results = await executeBatch(batch, state)
      yield* results.messages
      state = appendToolResults(state, results)
    }
    // re-enter with new state
  }
}
Continue states (don't return, just continue)
collapse_drain_retry, reactive_compact_retry, max_output_tokens_escalate, max_output_tokens_recovery, stop_hook_blocking, token_budget_continuation, next_turn. Naming each one is what makes the loop testable — every test asserts which transition fired.
Error recovery is a ladder, not a fallback
Order matters. From least to most aggressive:
| Trigger | Step 1 | Step 2 | Step 3 |
| --- | --- | --- | --- |
| prompt_too_long (413) | drain staged collapse summaries | reactive compact | surface to user |
| max_output_tokens | escalate cap 8K → 64K | multi-turn recovery (≤3 attempts) | surface |
| media_size_error | reactive compact | — | surface |
Guards prevent infinite loops: hasAttemptedReactiveCompact one-shot flags, hard caps on recovery attempts, circuit breakers. Never run stop hooks on an error response — that creates "error → hook blocks → retry → error" spirals.
Cancellation
Aborts can hit during streaming or during tool execution. In both cases, the executor must drain remaining requests by emitting synthetic tool_result blocks for queued/running tools. The Anthropic API rejects an assistant message containing a tool_use block without a matching tool_result. signal.reason distinguishes hard aborts from "submit interrupts" (a new user message), so you skip redundant interruption stubs in the latter case.
Actionable: every tool_use your agent emits must have a paired tool_result in message history before the next API call. Make this an invariant your loop enforces, not a hope.
5. 🧰 The tool system
If a tool author forgets isConcurrencySafe, they get serial execution — slow, but never corrupting. The opposite default would silently produce race conditions.
Tool result shape
type ToolResult<T> = {
data: T
newMessages?: Message[] // e.g. AgentTool injects sub-agent transcript
contextModifier?: (ctx: ToolUseContext) => ToolUseContext // e.g. EnterPlanMode
}
Context modifiers only apply to serial tools. Concurrent tools queue modifiers until the batch completes — otherwise data dependencies and shared state become race-condition territory.
The 14-step execution pipeline (checkPermissionsAndCallTool())
This is the choreography every tool call goes through. Implement it as a single function that returns a ToolResult or ToolError. Skipping any of these steps will hurt later.
| # | Step | Why it matters |
| --- | --- | --- |
| 1 | Tool lookup (with alias map) | Old transcripts may reference renamed tools |
| 2 | Abort check | Don't waste compute on cancelled queued calls |
| 3 | Zod validation | Catch type errors; hint to call ToolSearch for deferred tools |
| 4 | Semantic validation | E.g. reject no-op edits, block sleep if a Monitor tool exists |
| 5 | Speculative classifier start | Fire auto-mode permission classifier in parallel for Bash |
| 6 | Input backfill | Expand ~/foo → absolute paths for hooks/permissions but keep originals for transcript stability |
| 7 | PreToolUse hooks | Hooks decide / modify / block |
| 8 | Permission resolution | Rule match → tool method → mode default → prompt → classifier |
| 9 | Permission denied path | Build error, fire PermissionDenied hook |
| 10 | Execute call() | The actual work |
| 11 | Result budgeting | Persist oversized output to disk; replace with preview |
| 12 | PostToolUse hooks | Modify MCP output, possibly block continuation |
| 13 | Append newMessages | Sub-agent transcripts, system reminders |
| 14 | Error classification | Telemetry, OTel events |
Result budgeting
Per-tool size caps prevent runaway output:
| Tool | maxResultSizeChars | Rationale |
| --- | --- | --- |
| Bash | 30,000 | Most useful output fits |
| Edit | 100,000 | Diffs need room |
| Grep | 100,000 | Search results accumulate |
| Read | ∞ | Self-bounded by token limit; persisting would create circular Read loops |
Above the cap, the system writes the full content to a <persisted-output> file and returns a preview pointing to it. An aggregate ContentReplacementState tracks per-conversation budgets so multiple near-cap results cannot blow context together.
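A sketch of the budgeting step; the path and preview length are illustrative:

```typescript
import { writeFileSync } from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";

// Above the cap, persist the full output and return a preview that points to it.
function budgetResult(toolName: string, output: string, cap: number): string {
  if (output.length <= cap) return output;
  const path = join(tmpdir(), `persisted-output-${toolName}-${Date.now()}.txt`);
  writeFileSync(path, output);
  return `${output.slice(0, 2_000)}\n...[output truncated; full content persisted to ${path}]`;
}
```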
Deferred loading
Tools marked shouldDefer: true send only { name, description, defer_loading: true } to the API. The model has to call ToolSearch to load full schemas. Three benefits:
Smaller initial prompt.
Adding/removing a deferred tool changes the prompt by a few tokens, not hundreds — prompt cache stays warm.
Less tool-soup confusion for the model.
Tool registry assembly order matters
final = sort(builtins, alpha) ++ sort(mcpTools, alpha)
Sort within each partition, then concatenate. A flat sort across all tools would interleave MCP tools into built-in positions, busting cache breakpoints whenever MCP servers are added/removed.
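In TypeScript terms, a sketch of the rule (not the original code):

```typescript
type Tool = { name: string };

const byName = (a: Tool, b: Tool) => a.name.localeCompare(b.name);

// Sort within each partition, then concatenate: adding or removing an MCP
// server never shifts a built-in tool's position in the prompt.
function assembleRegistry(builtins: Tool[], mcpTools: Tool[]): Tool[] {
  return [...builtins].sort(byName).concat([...mcpTools].sort(byName));
}
```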
6. ⚡ Concurrency and speculative execution
The core insight
Safety is determined per-invocation, not per-tool-type. Bash("ls -la") is concurrency-safe. Bash("rm -rf build/") is not. Same tool. Different inputs. Different verdict.
The partition algorithm
partitionToolCalls(calls):
batches = []
current = { kind: 'concurrent', tools: [] }
for call in calls:
tool = lookup(call.name)
parsed = tool.inputSchema.safeParse(call.input)
safe = parsed.success and tool.isConcurrencySafe(parsed.data)
if safe and current.kind == 'concurrent':
current.tools.push(call)
else if safe:
batches.push(current); current = { kind: 'concurrent', tools: [call] }
else:
if current.tools: batches.push(current)
batches.push({ kind: 'serial', tools: [call] })
current = { kind: 'concurrent', tools: [] }
if current.tools: batches.push(current)
return batches
The StreamingToolExecutor watches the model stream. The moment a tool_use block is fully parsed (often seconds before the response finishes), it starts that tool — provided admission rules allow.
Admission rule: a tool can start executing iff no tool is currently running, or both the new tool and all currently-running tools are concurrency-safe.
Sequential timeline: stream 2.5s + 3 serial tools = 3.1s
Speculative: stream 2.5s overlapped with tools 1–2; total 2.6s
Tool states: Queued → Executing → Completed → Yielded. Yield in submission order, not completion order — even if c.ts finishes before a.ts, the conversation history must remain a, b, c.
Error cascade policy
Bash errors cascade within a batch. Shell commands form implicit pipelines; running cp after a failing mkdir is pointless.
Read/Grep errors isolate. One file read failure has no bearing on a sibling grep.
Each tool declares interruptBehavior(): 'cancel' | 'block'. The executor treats an executing batch as interruptible only when all tools in it support cancel. A single block tool blocks user Esc for the whole batch.
7. 🔒 Permissions: modes, rules, and bubbling
Seven modes (most → least permissive)
| Mode | Behavior |
| --- | --- |
| bypassPermissions | No checks (testing only) |
| dontAsk | Auto-deny prompts (background agents — never block on user input) |
| auto | Lightweight LLM classifier evaluates each call against transcript |
| acceptEdits | File edits auto-allowed; other mutations prompt |
| default | Standard interactive — user approves each action |
| plan | Read-only; all writes denied |
| bubble | Sub-agent escalates the decision to its parent |
Sub-agents default to bubble. Background agents default to dontAsk (they can't block on a prompt that has no UI).
Permission rules can scope to tool input: Fetch(domain:example.com) limits HTTP fetches to that domain.
For Bash, parse the command via a real bash AST parser (parseForSecurity()), split on && || ; |, and classify each subcommand. If the parser fails, fail safe: assume any command it can't parse is unsafe.
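A deliberately naive sketch of the classify step, to show the fail-closed default. The real implementation relies on a full bash AST parser, which this regex split is not:

```typescript
const READ_ONLY = new Set(["ls", "cat", "grep", "head", "tail", "pwd", "echo"]);

function classifyCommand(command: string): "safe" | "unsafe" {
  // A regex split mis-handles quotes, subshells, and expansions; that is
  // exactly why a parse failure must mean "unsafe", never "probably fine".
  const parts = command.split(/&&|\|\||;|\|/).map((s) => s.trim());
  for (const part of parts) {
    const word = part.split(/\s+/)[0];
    if (!word || !READ_ONLY.has(word)) return "unsafe"; // fail closed
  }
  return "safe";
}
```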
8. 🗜️ Context engineering: the 4-layer compression pipeline
Run before every API call, in this strict order:
| Layer | What it does | Cost |
| --- | --- | --- |
| 0. Tool result budget | Enforce per-message size caps; exempt tools without finite maxResultSizeChars | Trivial |
| 1. Snip compact | Physically remove old messages; emit UI boundary marker; report tokens freed | Cheap |
| 2. Microcompact | Drop tool results by tool_use_id once unneeded; cache edits via deferred boundary messages | Cheap |
| 3. Context collapse | Replace conversation spans with summaries (granular) | Medium |
| 4. Auto-compact | Fork an entire Claude conversation to summarize history; circuit-break after 3 consecutive failures | Heavy |
Why ordering matters: if collapse alone gets tokens below the auto-compact threshold, auto-compact never runs — so you keep fine-grained recent history.
Budget thresholds
Auto-compact triggers at effectiveContextWindow − 13,000 tokens.
Hard blocking limit at effectiveContextWindow − 3,000.
The 10K-token gap between them is where reactive compact runs if proactive compaction failed.
Token counting blends authoritative API usage numbers with rough estimates for messages added since the last response — biased conservative so compaction fires slightly early.
Actionable: instrument both estimated and authoritative token counts, log the delta. When the delta drifts, your estimator is broken and your safety margins are wrong.
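A sketch of the blended counter and the trigger check. Field names are hypothetical, and the 3.5 chars-per-token ratio is a deliberately conservative assumption:

```typescript
type ConversationState = {
  lastApiInputTokens: number;    // authoritative, from the last response's usage
  pendingMessageChars: number[]; // messages appended since that response
};

const CHARS_PER_TOKEN = 3.5; // biased low so compaction fires slightly early

function estimateTokens(state: ConversationState): number {
  const pending = state.pendingMessageChars
    .reduce((sum, chars) => sum + chars / CHARS_PER_TOKEN, 0);
  return state.lastApiInputTokens + Math.ceil(pending);
}

function shouldAutoCompact(state: ConversationState, contextWindow: number): boolean {
  return estimateTokens(state) >= contextWindow - 13_000; // proactive threshold
}
```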
9. 🌐 The API layer: prompt caching as architecture
Prompt caching is not an optimization. It is an architectural constraint. Every design decision either preserves cache hits or busts them.
Multi-provider abstraction
A single getAnthropicClient() factory dispatches to one of:
Direct API (key or OAuth)
AWS Bedrock
Google Vertex AI
Azure Foundry
Provider chosen at boot from env vars + config. Stored in bootstrap state; never re-checked. SDKs dynamically imported (don't load Bedrock if you're on direct API).
A buildFetch wrapper injects an x-client-request-id UUID header on every request, so you can correlate client-side timeouts with server-side logs.
Cache scopes
| Scope | Where | TTL |
| --- | --- | --- |
| Global | Static prompt prefix shared across all users | Long |
| 1-hour | Eligible users' extended cache | 60 min |
| Ephemeral (default) | Per-session | ~5 min |
The system prompt has a literal === DYNAMIC BOUNDARY === marker:
Rule: every runtime if above the boundary doubles the cache key space. 3 conditionals = 8 prefixes. 5 = 32. Compile-time feature flags are fine; runtime checks must live below the boundary.
Global scope is disabled when MCP tools are present — user-specific tool definitions would fragment the global cache into millions of unique prefixes.
Sticky latches
Five session-scoped boolean flags that, once set, cannot be unset for the rest of the session. They control beta/feature headers. Reason: "mid-session toggles don't change the server-side cache key" — flipping a flag would bust 50–70K tokens of cached context.
Pattern: Once(value) — a setter that throws or no-ops on a second call. Use this for any cache-influencing config.
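A minimal sketch of the latch (the no-op variant; a throwing variant is equally valid):

```typescript
// Write-once cell: the first set wins, so a cache-influencing flag
// can never flip mid-session and bust the server-side cache key.
function once<T>() {
  let value: T | null = null; // null = "not yet evaluated"
  return {
    get: () => value,
    set: (v: T) => { if (value === null) value = v; },
  };
}

const betaHeaderLatch = once<boolean>();
```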
Strategy: cap default max_tokens at 8K. On the rare truncation (<1% of requests), retry with 64K. Recovers 12–28% of the context window for free.
Streaming: skip the SDK helper
The SDK's BetaMessageStream calls partialParse() on every input_json_delta — repeatedly re-parsing growing JSON from scratch (O(n²)). Use raw Stream<BetaRawMessageStreamEvent> and accumulate tool-input strings yourself.
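A sketch of the accumulate-yourself approach over the raw event stream. The event shapes follow the Anthropic streaming API; verify field names against the current SDK:

```typescript
// rawStream: the raw Stream<BetaRawMessageStreamEvent> returned by
// client.beta.messages.create({ ..., stream: true })
const buffers = new Map<number, string>();

for await (const event of rawStream) {
  if (event.type === "content_block_delta" && event.delta.type === "input_json_delta") {
    // Append the fragment; do NOT parse yet.
    buffers.set(event.index, (buffers.get(event.index) ?? "") + event.delta.partial_json);
  }
  if (event.type === "content_block_stop" && buffers.has(event.index)) {
    const toolInput = JSON.parse(buffers.get(event.index)!); // single O(n) parse per tool call
    buffers.delete(event.index);
    // ...dispatch the fully parsed tool call here
  }
}
```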
Watchdog and fallback
Idle watchdog: setTimeout(90s) reset on every chunk. At 45s, warn. At 90s, abort and retry non-streaming.
Non-streaming fallback activates when streaming dies mid-response (network, stall, truncation, proxies returning 200 with non-SSE bodies).
Disable fallback when streaming tool execution is active — duplicate tool runs would corrupt state.
10. 🤖 Sub-agents and fork agents
Single-agent capability has a hard ceiling. The fix is recursive: spawn child agents that are the same loop with isolated state.
AgentTool input schema (dynamic)
| Field | Purpose |
| --- | --- |
| description | 3–5 word task summary |
| prompt | Full instructions |
| subagent_type | Specialization key (optional) |
| model | Override (haiku/sonnet/opus) |
| run_in_background | Async execution |
| name | For team addressability |
| isolation | worktree (filesystem clone) or remote |
Critical pattern: feature-gate the schema itself. "The model never sees fields it cannot use." Don't tell the model "don't use name here" — remove name from the schema in this context. The model cannot misuse what it cannot see.
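A sketch of schema gating with zod. The flag names are illustrative; the real gating conditions are Claude Code internals:

```typescript
import { z } from "zod";

// The model never sees fields it cannot use: build the schema per context
// instead of documenting "don't use this here".
function agentToolSchema(ctx: { teamsEnabled: boolean; backgroundEnabled: boolean }) {
  return z.object({
    description: z.string(),
    prompt: z.string(),
    subagent_type: z.string().optional(),
    model: z.enum(["haiku", "sonnet", "opus"]).optional(),
    ...(ctx.backgroundEnabled ? { run_in_background: z.boolean().optional() } : {}),
    ...(ctx.teamsEnabled ? { name: z.string().optional() } : {}),
  });
}
```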
Output (discriminated union)
Sync: { status: 'completed', prompt, ...result }
Async: { status: 'async_launched', agentId, outputFile } — outputFile is a filesystem path that fills in when the bg agent completes; parents poll independently of process state.
The 15-step lifecycle (runAgent())
Model resolution — caller override > agent definition > parent model > default. Read-only agents default to Haiku.
Agent ID — agent-<hex>. Override path supports resuming a backgrounded agent.
System prompt — fork agents inherit pre-rendered bytes; normal agents call agentDef.getSystemPrompt(ctx).
Abort controller — sync agents share parent's controller (Esc kills both). Async agents get an independent one (survive parent abort).
Hook registration — agent-id-scoped, auto-cleanup on termination.
Skill preloading — declared in frontmatter, loaded concurrently to mask latency, prepended as a user message.
MCP initialization — inline servers (cleaned on termination) or shared configs (memoized, persistent). Must complete before context creation so tools are in the pool when snapshotted.
Context creation — createSubagentContext() makes isolation decisions:
| Aspect | Sync | Async |
| --- | --- | --- |
| setAppState | shared | isolated |
| setAppStateForTasks | shared | shared |
| readFileState | own cache | own cache |
| abortController | parent's | independent |
Cache-safe params callback — for bg agents; lets the summarization service fork the conversation with cache-identical prefix.
Query loop — same query() function. Yields back to caller, records to sidechain JSONL transcript, forwards metrics.
The point of a fork is a byte-identical request prefix to the parent, so children pay roughly 10% of the normal input-token cost.
Three mechanisms make this work:
System prompt threading — pass parent's already-rendered bytes via override.systemPrompt. Don't regenerate; feature flags or session date may have changed.
Exact tool passthrough — useExactTools: true. No filtering, no reordering, no re-serialization. Even forbidden tools (like AgentTool itself) stay in the array — runtime guards prevent misuse.
Placeholder tool results — buildForkedMessages() clones the parent's last assistant message. For each tool_use, it inserts a constant placeholder string "Fork started -- processing in background". Same string for every child → same bytes.
Only the final directive differs across children. With a 48,500-token shared prefix and 5 children, savings exceed 90% on input tokens for children 2–5.
When fork is disabled
Coordinator mode — coordinators have a structured-delegation prompt children would inappropriately inherit.
Non-interactive — fork uses permissionMode: 'bubble', which needs a user-facing prompt.
Explicit subagent_type — the user picked Explore/Plan/etc, so fork yields.
Recursive fork prevention (defense in depth)
Primary: child's context.options.querySource = 'agent:builtin:fork'. AgentTool checks this before allowing fork.
Fallback: scan message history for the boilerplate XML tag if querySource was lost in transit.
Six built-in agent archetypes
| Archetype | Model | Tools | Notable |
| --- | --- | --- | --- |
| General-purpose | Default | All except Agent | Workhorse |
| Explore | Haiku | Read-only | Omits CLAUDE.md, one-shot prompt (saves 135 chars/invocation) |
| Plan | Inherit | Read-only | 4-step process, must end with "Critical Files" list |
| Verification | Inherit | Read-only, async | System prompt explicitly anti-rationalization; requires adversarial probe |
| Claude Code Guide | Haiku | dontAsk mode | Doc fetcher; system prompt injects user's configured skills/agents/MCP |
| Statusline Setup | Sonnet | Read + Edit only | Narrowly-scoped specialist |
Frontmatter format for user-defined agents
---
description: "When to use this"
tools: [Read, Bash]
disallowedTools: [FileWrite]
model: haiku
permissionMode: dontAsk
maxTurns: 50
skills: [my-skill]
mcpServers: [slack, {my-server: {command: node, args: [./server.js]}}]
hooks:
PreToolUse:
- command: "echo validating"
---
# System prompt body in markdown...
Trust hierarchy (least to most trusted): user agents < plugin agents < policy agents < built-in. User-agent hooks/MCP are silently skipped under strictPluginOnlyCustomization — graceful degradation, not error.
11. 🕸️ Multi-agent coordination patterns
Three distinct shapes:
A. Simple background delegation
Fire-and-forget. Tests, searches, lints. No coordination protocol.
B. Coordinator mode
Hierarchical manager-worker. The coordinator gets only three tools: Agent (spawn), SendMessage (talk), TaskStop (kill). That's it. By design.
"The coordinator's job is to think, plan, decompose, and synthesize. Workers do the work."
Critical principle: never delegate understanding. Coordinators must give workers exact file paths, exact line numbers, exact change descriptions — not "based on the research, fix the bug."
C. Swarm (agent teams)
Teammates can reach each other over four message transports:
Bridge (bridge:<session-id>) — cross-machine via Remote Control relays
UDS (uds:<socket-path>) — local IPC via Unix Domain Sockets
In-process — agent IDs / names of running agents
Team mailbox — file-based queue
Killer feature: transparent agent resumption. Sending a message to a "dead" agent automatically resurrects it from its disk transcript. The conversation simply continues.
Command queue invariant
Messages are delivered between tool rounds, never mid-execution. The agent finishes the current turn, then receives new info. No race conditions, no corrupted state. Make this a hard rule — it's the cheapest way to get correctness in multi-agent comms.
Pattern selection
| Scenario | Pattern |
| --- | --- |
| Single bg task | Delegation |
| Multi-file refactor with research phase | Coordinator |
| Long-running collaborative dev | Swarm |
Operational guardrail
A 50-message memory cap on in-process teammates exists because a real production incident reached 36.8 GB across 292 agents. Plan for unbounded fan-out from day one or it will hurt you.
12. 🧠 Memory: file-based persistence + LLM recall
Why files, not a database
Transparency — users open .md files and see exactly what the agent remembers. Trust through observability, not capability.
Modification time is a built-in epistemological signal: "when was this observation recorded?"
Zero infrastructure — no schema migrations, no indexes, no backups.
Layout
~/.claude/projects/<sanitized-git-root>/memory/
MEMORY.md # always loaded; index only; ≤200 lines, ≤25 KB
user_role.md # one memory per file
feedback_testing.md
project_migration_q2.md
team/ # shared via symlink
logs/YYYY/MM/YYYY-MM-DD.md # KAIROS append-only mode
Four-type taxonomy
| Type | Purpose |
| --- | --- |
| user | Role, expertise, preferences |
| feedback | Corrections + validated approaches (lead with rule, then Why: and How to apply: lines) |
| project | Active work context with absolute dates (always convert "Thursday" → 2026-03-05) |
| reference | Pointers to external systems (Linear, Slack channels) |
Derivability test: if git log / git blame / the code itself can answer it, don't memorize it. No code patterns, no architecture, no debug fix recipes.
Frontmatter contract
---
name: <title>
description: <one-line summary used by recall LLM>
type: user | feedback | project | reference
---
<body — for feedback/project, structure as: rule → **Why:** → **How to apply:**>
The description field carries the most weight — it's the LLM-recall index.
Two-tier retrieval
Tier 1 (always loaded): MEMORY.md index (~3,000 tokens for ~150 entries). Lines after 200 are truncated.
Tier 2 (on-demand): an async Sonnet side-query gets the manifest (type, name, date, description), the user's current query, and recent tool history. Returns up to 5 filenames as structured JSON. Validated against the file list to catch hallucination.
This trades a few hundred ms of latency for semantic precision keyword-matching cannot achieve — especially for negation (do NOT use mocks).
Staleness policy
Don't expire. Annotate. Today/yesterday → no caveat. Older → human-readable warning ("This memory is 47 days old — code claims may be outdated"). Models reason better about "47 days ago" than ISO timestamps.
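A minimal sketch of that annotation policy (threshold and wording taken from the example above):

```python
from datetime import date

def staleness_caveat(recorded: date, today: date) -> str:
    age_days = (today - recorded).days
    if age_days <= 1:  # today/yesterday: no caveat
        return ""
    return f"This memory is {age_days} days old — code claims may be outdated."
```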
Write path (two-step)
Write <type>_<topic>.md with frontmatter + body.
Add a one-line pointer to MEMORY.md: - [Title](file.md) — one-line hook.
A background extraction agent runs at loop completion to catch memories the main agent missed.
KAIROS continuous mode
For long-lived sessions, replace two-step writes with append-only daily logs in logs/YYYY/MM/. A separate consolidation pass (after 24h or 5+ modified sessions) merges logs into structured memories.
13. 🧩 Skills and hooks
The killer pattern: 50 skills shouldn't cost 50 docs of system-prompt tokens at startup.
Phase 1 (startup): parse YAML frontmatter only — name, description, when_to_use. Inject into system prompt as a directory.
Phase 2 (invocation): load full markdown body, substitute $ARGUMENTS and ${CLAUDE_SESSION_ID}, execute inline shell commands, prepend as a user message.
You pay the token cost only when the skill actually runs.
Skill source priority (highest → lowest)
Managed (policy / enterprise)
User (~/.claude/skills/)
Project (.claude/skills/)
--add-dir flag
Legacy commands
Bundled
MCP (remote, untrusted)
Hard security boundary: MCP skills never execute inline shell commands. External MCP servers are content-only. No exceptions.
Frontmatter controls
name: my-skill
description: ...
when_to_use: ...
disable-model-invocation: false # block autonomous use
context: fork # run as sub-agent with own token budget
paths: ["src/**/*.ts"] # conditional activation
hooks:
PreToolUse: [...]
captureHooksConfigSnapshot() freezes hook config at startup. If malicious code modifies .claude/settings.json mid-session, the snapshot prevents the change from taking effect. Only the /hooks command or the file watcher can update the live config.
Policy cascade: enterprise hooks cannot be disabled by users; allowManagedHooksOnly restricts to policy-approved hooks.
Exit code semantics (command hooks)
| Code | Meaning |
| --- | --- |
| 0 | success |
| 2 | blocking error (deliberately uncommon to prevent accidental enforcement) |
| other | non-blocking warning |
Skill ↔ hook integration
When a skill is invoked, its frontmatter hooks register as session-scoped. The skill directory becomes CLAUDE_PLUGIN_ROOT for those hook commands. once: true removes the hook after first execution. For sub-agents, Stop hooks auto-convert to SubagentStop to fire at the correct lifecycle point.
14. 🔗 MCP: the universal external-tool protocol
Skills and hooks extend the agent in-process. MCP (Model Context Protocol) is the standard way third parties extend it out-of-process — across servers, vendors, and trust boundaries. If you want a tool ecosystem you don't control, this is the layer that makes it possible.
Eight transports, three deployment shapes
| Shape | Transport | Use |
| --- | --- | --- |
| Local process | stdio (default) | Subprocess; JSON-RPC over stdin/stdout; no auth |
| Remote server | http | Streamable HTTP; POST + optional SSE |
| Remote server | sse | Legacy (pre-2025) |
| Remote server | ws | WebSocket bidirectional |
| Remote server | claudeai-proxy | Routed via Claude.ai infrastructure |
| In-process | sdk | Control messages over stdin/stdout |
| In-process | InProcessTransport | Direct function calls via queueMicrotask() (63 lines) |
| IDE | sse-ide, ws-ide | Runtime-specific |
Recommendation: start with stdio for local tools. Move to http only when you need remote. Use InProcessTransport for tools you control end-to-end — eliminates subprocess overhead.
Tool wrapping (4 stages)
External MCP tools must merge into the same Tool interface as built-ins. Four transformations:
Name normalization → mcp__{server}__{tool}. Invalid characters become underscores. Match ^[a-zA-Z0-9_-]{1,64}$ (see the sketch after this list).
Description truncation at 2,048 chars. (Real-world: OpenAPI servers were dumping 15–60 KB descriptions.)
Schema passthrough. Pass MCP input schemas straight through; do not transform.
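A sketch of stage 1, the name normalization (the helper name is mine, not from the source):

```python
import re

def normalize_tool_name(server: str, tool: str) -> str:
    raw = f"mcp__{server}__{tool}"
    safe = re.sub(r"[^a-zA-Z0-9_-]", "_", raw)  # invalid characters become underscores
    return safe[:64]  # result must match ^[a-zA-Z0-9_-]{1,64}$
```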
Config scopes

| Scope | Source | Trust |
| --- | --- | --- |
| local | .mcp.json in project | User approval required |
| user | ~/.claude.json | User-managed |
| project | Project-level | Shared |
| enterprise | Org-managed | Pre-approved |
| managed | Plugin-provided | Auto-discovered |
| claudeai | Web interface | Pre-authorized |
| dynamic | SDK injection | Programmatic |
Servers with matching command/args (or URLs) are deduplicated by content, not by name. Two configs naming the same binary differently still merge.
OAuth (RFC 9728 + RFC 8414)
Discovery chain when a server returns 401:
Probe /.well-known/oauth-protected-resource for authorization-server metadata.
Fall back to RFC 8414 discovery against the MCP server itself.
Use configured authServerMetadataUrl as escape hatch.
Cross-App Access (XAA) enables federated token exchange via identity providers. Real-world spec violations are common — normalizeOAuthErrorBody() rewrites Slack's "200 with error body" responses to a proper HTTP 400. Plan for spec drift on day one.
Timeouts

| Timeout | Duration | Covers |
| --- | --- | --- |
| Connection | 30 s | Unreachable / slow servers |
| Per-request | 60 s | Fresh AbortSignal per request |
| Tool call | ~27.8 h | Legitimate long-running operations |
| Auth | 30 s | Unreachable OAuth servers |
Trap: if you reuse a single AbortSignal across requests it expires during idle periods. wrapFetchWithTimeout() creates a fresh signal per request. Memorize this.
Critical security rule
MCP skills never execute inline shell commands. External servers are content-only. Every other extension surface (user skills, project skills) can run shell; MCP cannot. This is the single most important MCP rule and the one you will be tempted to break.
InProcessTransport in 63 lines
Two key mechanics:
send() delivers via queueMicrotask() — prevents stack-depth blow-ups on synchronous request/response cycles.
close() cascades to peer transport — no half-open connection states.
If you are wrapping an internal service as an MCP server, this is your reference. Don't subprocess what you can call directly.
15. 🚀 Bootstrap, startup, and rendering performance
The 5-phase pipeline (target: < 300 ms)
| Phase | File | What happens |
| --- | --- | --- |
| 0. Fast-path dispatch | cli.tsx | Inspect args. --version / --help → dynamic-import only that handler, exit. Don't load React, telemetry, MCP. |
| 1. Module-level I/O | main.tsx | Side-effect-fire MDM (security policy) + keychain subprocesses during import evaluation. ~138 ms of module loading runs in parallel with subprocess I/O. |
| 2. Parse and trust | init.ts | Parse args, load config. Enforce a trust boundary dialog. Before: only safe ops (TLS, themes, telemetry). After: env vars and git commands. |
| 3. Setup | setup.ts | Register everything in parallel: commands, agents, hooks, plugins, MCP. Hook config snapshot frozen here. |
| 4. Launch | replLauncher.ts | Seven entry paths converge: REPL, print, SDK, resume, continue, pipe, headless. All call the same query() loop. |
Other startup techniques
API preconnection — fire a HEAD to the Anthropic API during init. TCP+TLS handshake (100–200 ms) overlaps with setup. Connection is warm by the time the user submits.
Dynamic import for heavy libs — OpenTelemetry, provider SDKs, React for non-REPL paths.
50+ profiling checkpoints sampled at 100% of internal users / 0.5% of external. Without instrumentation you can't tell what to optimize.
Search performance (270K+ paths)
Three layers:
Bitmap pre-filter — assign each path a 26-bit mask of contained lowercase letters. Reject query: one integer comparison (charBits[i] & needleBitmap) !== needleBitmap. Rejects 10–90% at 4 bytes/entry. (See the sketch after this list.)
Score-bound rejection — skip paths that can't beat the current top score before expensive scoring.
Async indexing with partial queryability — yield every ~4 ms. Search begins within 5–10 ms of index availability.
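A sketch of the bitmap layer in Python (the paths and query are illustrative):

```python
def letter_mask(s: str) -> int:
    mask = 0
    for ch in s.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask

paths = ["src/main.ts", "docs/README.md", "lib/render.rs"]  # stand-in index
char_bits = [letter_mask(p) for p in paths]  # one 26-bit mask per path, built once

needle = letter_mask("main")
# One integer compare per path rejects most candidates before expensive scoring
candidates = [p for p, bits in zip(paths, char_bits) if bits & needle == needle]
```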
Rendering: patterns that transfer beyond the terminal
Claude Code forks Ink because stock Ink allocates one JS object per cell per frame — at 200×120 that's 24,000 GC'd objects every 16 ms. Whatever you're rendering, the lessons transfer:
Double-buffer + atomic write. Two persistent Frame objects; render into the back, swap pointers (no allocation), write the diff in one syscall wrapped in BSU/ESU (Begin/End Synchronized Update). No tearing.
Cell-level diffing with damage rectangles. Compute the bounding box of writes; diff only inside it. ~6× reduction in compare work for localized updates.
Three interning pools (chars, styles, hyperlinks) → integer IDs everywhere. Style transitions become a single pre-cached string lookup. Pools generationally reset every 5 min.
Frame throttling. 60 fps focused, 30 fps blurred (throttle(deferredRender, FRAME_INTERVAL_MS)). Scroll events get a tighter 4 ms schedule.
Pack related data. Two Int32 words per cell beats scattered objects — better cache behavior, faster compare, fewer allocations.
Separate hot paths from React. Direct DOM mutation + microtask scheduling for scroll. React handles the final paint, where it's already efficient.
The thesis: performance is not making operations fast; it is eliminating operations entirely.
16. 📋 The 10 foundational patterns (cheat sheet)
| # | Pattern | Why it matters |
| --- | --- | --- |
| 1 | AsyncGenerator-based loops | Natural backpressure, clean cancellation via .return(), typed terminal states |
| 2 | Speculative tool execution | Run safe read-only tools while the model is still streaming → noticeable latency cut |
| 3 | Concurrent-safe batching | Partition by per-invocation safety; serial isolates side effects |
| 4 | Fork agents for cache sharing | Byte-identical prefixes ⇒ ~95% input-token savings on children |
| 5 | 4-layer context compression | snip → microcompact → collapse → autocompact, in that order |
| 6 | File-based memory + LLM recall | Beats embeddings for negation and intent-aware retrieval; zero infra |
| 7 | Two-phase skill loading | Frontmatter at startup, body on invocation |
| 8 | Sticky latches | Cache-influencing flags become write-once for the session |
| 9 | Slot reservation | 8K default output, 64K on demand — recovers 12–28% of context |
| 10 | Hook config snapshots | Freeze at boot; defense against mid-session injection from a malicious repo |
17. 🗺️ Build-your-own: a 14-step roadmap
A pragmatic order to implement these in. Each step compiles and runs on its own.
Tool interface + factory. Define Tool<I, O, P>, buildTool() with safe defaults, and a ToolResult type. Ship one tool: Read. Test the Zod-based JSON Schema generation.
Query loop v0. Async generator. No tools, no compression, just stream the model and yield messages. Return a Terminal discriminated union.
Tool execution path. Add the 14-step pipeline as one function. Wire the loop to call it on tool_use blocks. Always pair tool_use with a tool_result, even on error.
Permission modes + rules. Implement default, acceptEdits, plan, bypassPermissions. Add the resolution chain. Skip auto (LLM classifier) for now.
Concurrency partition + executor. partitionToolCalls() + a serial/concurrent executor. Add isConcurrencySafe() to every tool. Yield results in submission order.
Hook system v0. Two events: PreToolUse, PostToolUse. Command hooks only (shell process, exit codes). Capture a snapshot at startup.
State split. Mutable singleton STATE for infra (cwd, model, session id). Tiny reactive store for UI (messages, approvals).
Multi-provider client factory. Direct API first. Stub the others. buildFetch wrapper for client-request-id header.
Prompt caching architecture. System-prompt boundary marker. Static prefix (cache scope: global if no MCP). Dynamic suffix per-session. Implement one sticky latch as proof.
Compression v1: snip + microcompact. Skip collapse and autocompact for now. Wire the budget thresholds.
Streaming tool executor. Watch the streaming SSE. Start safe tools when their tool_use is fully parsed. Buffer to preserve submission order.
AgentTool + sub-agent lifecycle. Re-enter query() with isolated context. Implement the cleanup finally block. Skip fork agents.
18. ⚠️ Gotchas by subsystem
Loop
❌ Forgetting to pair tool_use with tool_result on abort. The API will reject the next message. Drain queued tools with synthetic results on every cancellation path.
Tools
❌ A constructor literal instead of a factory. Defaults will be unsafe. Always go through buildTool().
❌ Per-tool-type concurrency safety. Bash is sometimes safe, sometimes not. Pass parsed input.
❌ Concatenating built-ins and MCP tools then sorting flat. Cache breakpoint dies. Sort within partition, then concat.
❌ Returning huge raw output. Cap with maxResultSizeChars. Persist to disk + return preview.
❌ Using the SDK's BetaMessageStream. O(n²) JSON re-parsing. Read raw stream events.
Permissions
❌ Scattering if mode === ... checks throughout tool code. Centralize in modes + the resolution chain.
❌ Trusting a partial bash parse. If parseForSecurity() fails, treat the command as unsafe.
❌ Sub-agent default = default mode. It needs a UI to prompt; bg agents have none. Default to bubble (sync) or dontAsk (async).
Caching / API
❌ Runtime conditionals in the static prompt prefix. Each one doubles cache key space. Move below the boundary.
❌ Mid-session feature toggles that change request headers. Use sticky latches.
❌ Reserving 64K output tokens by default. Over-reserve 8–16×. Cap at 8K, escalate on demand.
❌ Regenerating the system prompt for fork children. Feature flags or session date may have moved. Pass parent's bytes.
❌ Filtering tools per child agent in fork mode. Different array → different cache key. Use useExactTools: true and runtime guards.
Memory
❌ Storing what git log can answer. Code patterns, fix recipes, who-changed-what. Useless duplication that goes stale.
❌ Embedding-only retrieval. Misses negation ("do NOT mock the DB"). Use LLM recall over a manifest.
❌ Hard expiration. Annotate with age; let the model decide. Stale memories are still data.
❌ Letting MEMORY.md grow past 200 lines. Truncated silently. Treat the index as a budget.
Multi-agent
❌ Coordinators with the full tool set. They'll do the work themselves. Restrict to Agent, SendMessage, TaskStop.
❌ Workers asked to "based on the research, implement X." They re-derive context, miss specifics, hallucinate paths. Synthesis is the coordinator's job.
❌ Mid-tool-execution message delivery. Race conditions. Queue at tool-round boundaries.
❌ Unbounded teammate state. 36.8 GB / 292 agents was a real production incident. Cap message history.
❌ General-purpose agents that can spawn Agent. Exponential fan-out. Block recursive spawning at the schema level.
Bootstrap / hooks
❌ Loading the world for --version. Fast-path dispatch first, full bootstrap second.
❌ Hook config that updates live mid-session. Lets a malicious repo redefine permissions after trust dialog. Snapshot at startup; update only via explicit user channel.
❌ Treating MCP skills like local skills. They are content-only. Never execute their inline shell commands.
🎯 Closing thought
The deepest principle in the source book is repeated at every layer: push complexity to the boundaries. Permission resolution, protocol translation, state reconciliation, tool I/O — these are the messy edges. Concentrate the mess there. Keep the loop, the tool composition, the memory recall, and the streaming logic clean and exhaustively typed.
If you remember nothing else: most of this system is generators yielding strongly-typed events through a series of small modules, with a few critical caches and a few critical safety doors. Build it in that order.
19. 📖 Glossary
Quick reference for the jargon used throughout this guide.
| Term | Meaning |
| --- | --- |
| AsyncGenerator | A JS function declared `async function*`. Yields values lazily, pauses at each yield until the consumer calls .next(). Provides backpressure and clean cancellation. |
| Backpressure | The producer pauses when the consumer can't keep up. Generators give it for free; event emitters do not. |
| Cache breakpoint | The byte position in the prompt where the prompt cache stops matching. Move volatile content after the breakpoint to maximize hit rate. |
| Concurrency-safe | A tool invocation that can run in parallel with others without observable side effects. Determined per-input, not per-tool-type. |
| Context window | The token budget for a single API call (prompt + output). When you exceed it the API rejects the request. |
| Discriminated union | A type made of variants tagged by a literal field (e.g. `{ kind: 'completed' } \| …`). The tag lets code narrow safely to one variant. |
| Fork agent | A sub-agent that inherits the parent's byte-identical prompt prefix to maximize prompt-cache hits (~95% input-token discount on children 2…N). |
| Frontmatter | The YAML block at the top of a .md file (between two --- lines). Used for skill/agent/memory metadata. |
| Hook | A user/plugin/policy interceptor at one of 27 lifecycle events. Can block, modify, or inject. |
| MCP | Model Context Protocol — the JSON-RPC standard for connecting external tool servers to an agent. Eight transports. |
| Microcompact | Layer 2 of context compression. Removes tool results by tool_use_id when no longer needed. |
| Prompt cache | Anthropic's server-side cache of prompt prefixes. ~90% discount on cached input tokens. The entire architecture revolves around preserving hits. |
| Reservoir sampling | Algorithm R. Maintain a fixed-size random sample of an unbounded stream. Used here for latency histograms (1,024 entries → accurate p50/p95/p99). |
| Slot reservation | The max_tokens value sent to the API. Default cap 8K, escalate to 64K on truncation (<1% of requests). Reclaims 12–28% of context. |
| Speculative execution | Starting tools while the model is still streaming, before the assistant message completes. Saves hundreds of ms when read-only tools dominate. |
| Sticky latch | A write-once flag (`null` until first set, then fixed for the session). Keeps cache-influencing options from flipping mid-session. |
| Sub-agent | A child agent spawned via AgentTool. New query() generator with isolated message history. Sync (parent waits) or async (background). |
| Synthetic tool result | A fabricated tool_result block emitted on cancellation so the API doesn't see a tool_use without a matching result. |
| Terminal state | The discriminated-union value the agent loop returns (vs. yields). Encodes why execution stopped — 10 distinct reasons. |
| tool_use / tool_result | Anthropic API blocks. Every tool_use in an assistant message must be paired with a tool_result in the next user message. The single most common bug source. |
| Two-phase skill loading | Frontmatter loaded into the system prompt at startup; full body loaded only on invocation. Lets you ship 50+ skills cheaply. |
The source repo is purely educational and contains no source code from Claude Code — only original pseudocode derived from npm source maps. This guide follows the same convention.
If you found this helpful, let me know by leaving a 👍 or a comment! And if you think this post could help someone, feel free to share it. Thank you very much! 😃
Fine-Tune Any HuggingFace Model like Gemma on TPUs with TorchAX
#machinelearning#pytorch#python#tutorial
What if you could fine-tune any HuggingFace model on TPUs — using PyTorch code?
Here is what the end result looks like:
import torchax as tx
import torchax.train
# One function: forward → loss → gradients → optimizer update
step_fn = tx.train.make_train_step(model_fn, loss_fn, optimizer)
# Training loop
for batch in dataloader:
loss, params, opt_state = step_fn(params, buffers, opt_state, batch, batch["labels"])
Your PyTorch model. JAX's training primitives. Running on TPU. No rewrite needed.
In the first part of this series, we ran HuggingFace models on JAX for fast inference. Now we take the next step: training. We will instruction-tune Gemma 3 1B on the Databricks Dolly 15k dataset using LoRA and torchax's functional training API — all on a free Colab TPU.
Why Train on TPUs?
Google's Tensor Processing Units (TPUs) are purpose-built for matrix operations — the bread and butter of deep learning. Free Colab gives you access to a TPU v2-8 with ~15GB of high-bandwidth memory. That is enough to fine-tune a 1B parameter model with LoRA.
But training on TPUs traditionally meant rewriting your model in JAX (Flax, Equinox) or using PyTorch/XLA. torchax offers a third path: keep your PyTorch model, but use JAX's functional training primitives.
How torchax Training Differs from Standard PyTorch
| Standard PyTorch | torchax |
| --- | --- |
| `loss.backward()` | `jax.value_and_grad(loss_fn)(params, ...)` |
| `optimizer.step()` | `optax.apply_updates(params, updates)` |
| Model holds its own state | Params and buffers are separate pytrees |
| Eager execution | JIT-compiled training steps |
The key difference: functional training. Instead of calling loss.backward() and optimizer.step() on a stateful model, torchax separates the model into immutable weight pytrees and passes them through pure functions. This is what enables JAX's jax.jit to compile the entire training step into a single optimized program.
Prerequisites & Setup
What you need:
Python 3.10+
Basic familiarity with PyTorch and HuggingFace transformers
A Google Colab account (free tier works with LoRA)
Zero-setup option: Click the Colab badge above. The notebook handles all installation automatically.
Local setup:
# PyTorch CPU (torchax handles the accelerator via JAX)
pip install torch --index-url https://download.pytorch.org/whl/cpu
# JAX + all training dependencies in a single pip call
pip install -U 'jax[tpu]' torchax transformers flax peft datasets optax # TPU
# pip install -U 'jax[cuda12]' torchax transformers flax peft datasets optax # GPU
Colab note: The notebook installs packages and automatically restarts the runtime, since Colab pre-loads an older JAX that stays cached in memory until restart.
Key Concepts for Training
Before writing code, let's understand the four concepts that make torchax training work.
1. Param/Buffer Separation
JAX's jax.value_and_grad needs to know which inputs to differentiate. In standard PyTorch, the model owns its weights. In torchax training, we explicitly separate:
params = {n: p for n, p in model.named_parameters() if p.requires_grad}
frozen = {n: p for n, p in model.named_parameters() if not p.requires_grad}
buffers = dict(model.named_buffers())
buffers.update(frozen)
For LoRA, params contains only the tiny adapter weights (~0.5% of the model). For full fine-tuning, it contains everything.
2. optax Optimizers
Unlike PyTorch optimizers (which carry hidden mutable state), optax optimizers are pure functions:
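A minimal sketch of that pure update cycle (AdamW, the learning rate, and the toy pytrees are illustrative):

```python
import jax.numpy as jnp
import optax

params = {"w": jnp.ones((2, 2))}       # a toy weight pytree
grads = {"w": jnp.full((2, 2), 0.1)}   # pretend gradients

optimizer = optax.adamw(learning_rate=1e-4)  # a pair of pure functions (init, update)
opt_state = optimizer.init(params)           # explicit state, returned to the caller

updates, opt_state = optimizer.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)  # no hidden mutation anywhere
```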
This functional design means the optimizer state is just another pytree that flows through the training step — perfect for jax.jit.
3. make_train_step
torchax.train.make_train_step() is the central API. It composes three pieces into a single JIT-compilable function:
model_fn — a pure function: (weights, buffers, batch) → output
loss_fn — extracts the scalar loss: (output, labels) → loss
optimizer — an optax optimizer
The result is step_fn(params, buffers, opt_state, batch, labels) → (loss, new_params, new_opt_state).
Under the hood, this uses jax.value_and_grad for efficient gradient computation and optax.apply_updates for weight updates — all compiled into a single XLA program.
4. Full Fine-Tuning vs LoRA
| | Full Fine-Tuning | LoRA |
| --- | --- | --- |
| Trainable params | All (~2B) | Tiny adapters (~0.5%) |
| Memory | ~18–20 GB | ~5–7 GB |
| Speed | Slower | Faster |
| Quality | Higher ceiling | Nearly as good |
| Free Colab TPU | Tight / may OOM | Fits comfortably |
LoRA (Low-Rank Adaptation) freezes the base model and adds small trainable matrices to attention layers. Instead of updating the full weight matrix W, it learns a low-rank decomposition: W + (α/r) × B·A where A and B are tiny matrices.
For free Colab, LoRA is the recommended path.
Step 1: Load and Prepare the Dataset
We use Databricks Dolly 15k — 15,000 human-written instruction-response pairs across 7 categories (QA, summarization, brainstorming, etc.).
import datasets as hf_datasets
from transformers import AutoTokenizer
MODEL_NAME = "google/gemma-3-1b-it"
DATASET_NAME = "databricks/databricks-dolly-15k"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
raw_dataset = hf_datasets.load_dataset(DATASET_NAME, split="train")
Each example has an instruction, optional context, response, and category. We format these into Gemma's chat template:
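The notebook's exact formatter isn't reproduced here; a minimal sketch, assuming Dolly's field names and the tokenizer loaded above (`format_example` is my name, not the notebook's):

```python
def format_example(example):
    user_text = example["instruction"]
    if example.get("context"):
        user_text += "\n\n" + example["context"]  # optional supporting context
    messages = [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": example["response"]},
    ]
    # Render with Gemma's chat template as a plain string; tokenization happens later
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

formatted_dataset = raw_dataset.map(format_example)
```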
Step 2: Load the Model and Apply LoRA
Here is where the torchax pattern matters: load the model with torchax disabled, then enable it before moving to JAX.
import torch
import torchax as tx
import transformers
import peft
# Load model with torchax disabled to avoid intercepting init ops
with tx.disable_temporarily():
model = transformers.AutoModelForCausalLM.from_pretrained(
MODEL_NAME, torch_dtype=torch.bfloat16
)
# Sync pad_token_id so loss computation properly ignores padding
model.config.pad_token_id = tokenizer.pad_token_id
Why disable? HuggingFace model initialization uses operations (like in-place tensor filling) that torchax does not support. Disabling torchax during loading keeps everything on CPU, then we move to JAX after.
Now apply LoRA:
peft_config = peft.LoraConfig(
task_type=peft.TaskType.CAUSAL_LM,
inference_mode=False,
r=8, # Rank of the LoRA matrices
lora_alpha=16, # Scaling factor
lora_dropout=0.0, # 0.0 for bfloat16 numerical stability
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], # All attention layers
)
model = peft.get_peft_model(model, peft_config)
model.print_trainable_parameters()
# Output: trainable params: 5,767,168 || all params: 2,619,206,656 || trainable%: 0.22%
Only 0.22% of parameters are trainable — that is the power of LoRA.
Finally, enable torchax and move to the JAX device:
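A sketch of that step (tx.enable_globally() and the "jax" device are the torchax APIs referenced elsewhere in this article; the notebook's exact cell may differ):

```python
tx.enable_accuracy_mode()  # optional, helps bfloat16 stability (see Troubleshooting)
tx.enable_globally()       # route PyTorch ops through JAX from here on
device = "jax"
model = model.to(device)   # weights become JAX arrays on the TPU
```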
Step 3: Measure the Baseline
Before training, we measure the model's performance to compare against later:
import math
def evaluate_loss(model, dataloader, device, max_batches=50):
model.eval()
total_loss, total_batches = 0.0, 0
with torch.no_grad():
for i, batch in enumerate(dataloader):
if i >= max_batches:
break
# Drop attention_mask — Gemma's sliding window attention produces NaN
# with padded masks on torchax/JAX. Labels already mask padding with -100.
batch = {k: v.to(device) for k, v in batch.items() if k != "attention_mask"}
outputs = model(**batch)
total_loss += outputs.loss.item()
total_batches += 1
model.train()
avg_loss = total_loss / max(total_batches, 1)
return avg_loss, math.exp(min(avg_loss, 100))
baseline_loss, baseline_ppl = evaluate_loss(model, eval_dataloader, device)
print(f"Baseline loss: {baseline_loss:.4f}, perplexity: {baseline_ppl:.2f}")
We also generate sample responses for qualitative comparison. For fast generation, we register StaticCache as a JAX pytree and use KV-cached decoding — only the new token is processed each step instead of the full sequence (~50x faster):
from transformers.cache_utils import StaticCache
from jax.tree_util import register_pytree_node
def _flatten_static_cache(cache):
return (cache.key_cache, cache.value_cache), (
cache.config, cache.max_batch_size, cache.max_cache_len,
getattr(cache, "device", None), getattr(cache, "dtype", None),
)
def _unflatten_static_cache(aux, children):
config, max_batch_size, max_cache_len, dev, dtype = aux
kwargs = {}
if dev is not None: kwargs["device"] = dev
if dtype is not None: kwargs["dtype"] = dtype
sc = StaticCache(config, max_batch_size, max_cache_len, **kwargs)
sc.key_cache, sc.value_cache = children
return sc
register_pytree_node(StaticCache, _flatten_static_cache, _unflatten_static_cache)
The generation function uses prefill (process full prompt) then per-token decode with the cache and a tqdm progress bar:
from tqdm.auto import tqdm
def generate_response(model, tokenizer, instruction, device, max_new_tokens=100):
messages = [{"role": "user", "content": instruction}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)
seq_len = input_ids.shape[1]
kv = StaticCache(config=model.config, max_batch_size=1,
max_cache_len=seq_len + max_new_tokens,
device=device, dtype=torch.bfloat16)
pos = torch.arange(seq_len, device=device)
model.eval()
with torch.no_grad():
# Prefill: process full prompt, populate cache
logits, kv = model(input_ids, cache_position=pos, past_key_values=kv,
return_dict=False, use_cache=True)
tok = torch.argmax(logits[:, -1], dim=-1)[:, None]
generated = [tok[:, 0].item()]
pos = torch.tensor([seq_len], device=device)
# Decode: one token at a time using cached keys/values
for _ in tqdm(range(max_new_tokens - 1), desc="Generating", leave=False):
logits, kv = model(tok, cache_position=pos, past_key_values=kv,
return_dict=False, use_cache=True)
tok = torch.argmax(logits[:, -1], dim=-1)[:, None]
tid = tok[:, 0].item()
if tid == tokenizer.eos_token_id:
break
generated.append(tid)
pos += 1
model.train()
return tokenizer.decode(generated, skip_special_tokens=True)
Step 4: Set Up Functional Training
This is where torchax diverges from standard PyTorch. We separate the model, create an optax optimizer, and compose everything into a JIT-compiled training step.
Separate params and buffers
import optax
import torchax.train
params = {n: p for n, p in model.named_parameters() if p.requires_grad}
buffers = dict(model.named_buffers())
frozen_params = {n: p for n, p in model.named_parameters() if not p.requires_grad}
buffers.update(frozen_params)
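Create the optimizer and its state next. A minimal sketch (the AdamW choice and learning rate are illustrative):

```python
optimizer = optax.adamw(learning_rate=1e-4)
# optimizer.init is a JAX function; call_jax bridges it to torchax tensors
opt_state = tx.interop.call_jax(optimizer.init, params)
```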
Note tx.interop.call_jax — this bridges optax's JAX calls with torchax tensors.
Define model_fn and loss_fn
def model_fn(weights, buffers, batch):
"""Stateless forward pass using functional_call."""
return torch.func.functional_call(
model, {**weights, **buffers}, args=(), kwargs=batch
)
def loss_fn(model_output, labels):
"""Extract loss from HuggingFace model output."""
return model_output.loss
torch.func.functional_call runs the model as a pure function — no hidden state, just inputs and outputs. This is what enables JAX to trace and compile it.
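Compose everything with make_train_step; this is the single line from the top of the article:

```python
step_fn = tx.train.make_train_step(model_fn, loss_fn, optimizer)
```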
That single line creates a function that does: forward pass → loss computation → gradient calculation → optimizer update — all compiled into one XLA program.
Step 5: The Training Loop
import time
from tqdm.auto import tqdm
torch.manual_seed(42)
train_losses = []
start_time = time.time()
for epoch in range(1):
pbar = tqdm(enumerate(train_dataloader), total=len(train_dataloader))
for step, batch in pbar:
# Drop attention_mask — Gemma's sliding window attention produces NaN with
# padded masks on torchax/JAX. Labels already mask padding with -100.
batch = {k: v.to(device) for k, v in batch.items() if k != "attention_mask"}
loss, params, opt_state = step_fn(
params, buffers, opt_state, batch, batch["labels"]
)
train_losses.append(loss.item())
pbar.set_postfix({"loss": f"{loss.item():.4f}"})
elapsed = time.time() - start_time
print(f"Training complete! {len(train_losses)} steps in {elapsed:.0f}s")
What to expect:
Step 1: ~30-60 seconds (JAX compiles the entire training step)
Steps 2+: ~1-3 seconds each (running the compiled program)
Total: ~20-40 minutes for 2000 samples with LoRA on free Colab TPU
The first step is slow because JAX traces through the entire model, loss computation, gradient calculation, and optimizer update — then compiles it all into a single optimized XLA program. Every subsequent step reuses this compiled program.
Step 6: Evaluate the Improvement
After training, we compare against our baseline:
# Load trained params back into model
with torch.no_grad():
for name, param in params.items():
parts = name.split(".")
obj = model
for part in parts[:-1]:
obj = getattr(obj, part)
setattr(obj, parts[-1], torch.nn.Parameter(param))
final_loss, final_ppl = evaluate_loss(model, eval_dataloader, device)
print(f"{'Metric':<20} {'Before':>10} {'After':>10}")
print(f"{'Loss':<20} {baseline_loss:>10.4f} {final_loss:>10.4f}")
print(f"{'Perplexity':<20} {baseline_ppl:>10.2f} {final_ppl:>10.2f}")
You should see loss decrease and perplexity improve after training. The qualitative comparison (generated responses before vs. after) is even more telling — the fine-tuned model produces more focused, instruction-following responses.
Step 7: Save and Reload
Save
Convert JAX arrays back to CPU tensors and save using HuggingFace's standard format:
import numpy as np
save_dir = "./fine_tuned_model"
with torch.no_grad():
cpu_state_dict = {
name: torch.tensor(np.array(p)).contiguous()
for name, p in params.items()
}
# safe_serialization=False avoids a safetensors/torchax C-extension conflict on reload
model.save_pretrained(save_dir, state_dict=cpu_state_dict, safe_serialization=False)
tokenizer.save_pretrained(save_dir)
For LoRA, this saves only the tiny adapter weights (~20MB). For full fine-tuning, it saves the entire model (~4GB).
Reload
with tx.disable_temporarily():
# For LoRA: load base model + adapters separately
reloaded_model = transformers.AutoModelForCausalLM.from_pretrained(
MODEL_NAME, torch_dtype=torch.bfloat16
)
# torch_device="cpu" forces PEFT to load adapter weights on CPU,
# avoiding a safetensors/torchax C-extension conflict.
reloaded_model = peft.PeftModel.from_pretrained(reloaded_model, save_dir, torch_device="cpu")
reloaded_model.to(device)
reloaded_model.eval()
The pattern is the same as loading: disable torchax, load on CPU, then move to JAX. For LoRA models, you load the base model first, then attach the saved adapters with PeftModel.from_pretrained(). The torch_device="cpu" ensures PEFT loads weights through PyTorch's standard path rather than safetensors' C extension, which conflicts with torchax.
Full Fine-Tuning: When LoRA Is Not Enough
The notebook supports full fine-tuning by changing one setting:
TRAINING_MODE = "full"
This trains all parameters instead of just the LoRA adapters. The trade-off is much higher memory usage. To make it fit on free Colab TPU:
AdaFactor optimizer — uses ~50% less memory than AdamW (stores only row/column statistics instead of per-parameter moments)
Full fine-tuning gives a higher quality ceiling but LoRA gets you 90%+ of the way with a fraction of the compute.
Troubleshooting
| Error | Cause | Fix |
| --- | --- | --- |
| OutOfMemoryError | Model + optimizer too large | Switch to LoRA; reduce BATCH_SIZE or MAX_SEQ_LEN |
| TypeError: not a valid JAX type | Custom HuggingFace type not registered | Register with jax.tree_util.register_pytree_node() |
| Loss is NaN | Numerical instability in bfloat16 | 1. Call tx.enable_accuracy_mode() before tx.enable_globally(). 2. Reduce LR (try 1e-4). 3. Set lora_dropout=0.0. 4. Add optax.clip_by_global_norm(1.0). |
| Slow first step | Normal — JAX JIT compilation | Wait ~30–60 s; subsequent steps are fast |
| make_train_step error | API mismatch | Update: pip install -U torchax |
The Big Picture: Inference + Training
With the inference tutorial and this training tutorial, you now have the complete torchax story:
Run any HuggingFace model on TPU (model.to("jax"))
Benchmark with JIT compilation (10-100x speedup)
Fine-tune with LoRA or full training (make_train_step)
Searching Billions in Seconds: How HNSW Solved the Scale Problem
#ai#machinelearning#programming#architecture
We have a massive problem with how computers find things. If you have a few hundred photos on your phone, you can find one instantly, but trying to find one specific item out of a billion creates a massive technical strain.
Most systems rely on "Linear Search", which is like looking through every single page of a ten-million-page book to find one word.
This "one-by-one" approach makes real-time tools like chatbots or movie recommendations grind to a halt as the data grows. Furthermore, modern data like images or text is "high-dimensional," which breaks traditional filing systems and makes them no faster than checking every item manually.
To fix this, researchers Yu. A. Malkov and D. A. Yashunin changed the rules of the game. They realized that we don't always need a "perfect" match if it takes an hour to find; we often just need a "good enough" match found in a millisecond.
We will explore how they used a multi-layered "highway system" to make searching a billion items feel as fast as searching a hundred.
Two Methods for Fast Searching
To solve the slowness problem, the researchers combined two concepts from computer science. The first is the "Small World" phenomenon.
You’ve likely heard of "six degrees of separation", the idea that you are connected to anyone on Earth through a short chain of acquaintances.
HNSW treats data the same way. It builds a map where every data point is a "node," and similar points are connected like friends. By jumping from "friend to friend" toward your target, you can navigate a massive dataset in just a few steps.
The second concept is hierarchy, which is inspired by a structure called a Skip List.
Imagine a high-rise building where the elevators only stop at every 10th floor. To get to floor 42, you take the express elevator to floor 40 and then walk down to 42. HNSW creates "express layers" at the top with only a few data points and long-range connections.
These top layers allow the search to "skip" across the entire database to the right neighborhood instantly, before dropping down to the ground floor to find the exact result.
The Multi-Layer Highway System
To bridge the gap between speed and accuracy, the researchers introduced a multi-layer structure called a Hierarchical Navigable Small World (HNSW) graph.
You can think of this as a "highway system" for data. Instead of one giant, flat network where everything is connected, HNSW organizes data into layers.
Top Layers (Express Lanes)
These layers contain only a few data points with long-distance connections. Much like an express highway between major cities, they allow the search to "skip" across the entire dataset to reach the right neighborhood in just a few jumps.
Bottom Layers (Local Streets)
As the search moves down through the hierarchy, the layers become denser with shorter connections. Once the search reaches the ground layer—which contains every single piece of data—it uses these "local streets" to fine-tune the results and find the exact match.
This hierarchy is inspired by a data structure called a Skip List, but instead of simple linked lists, it uses complex proximity graphs.
By separating links by their "distance scales," the algorithm ensures that the search time scales logarithmically. This means that even if the dataset grows from thousands to billions, the number of steps required to find an answer stays manageable and fast.
Balancing Speed and Accuracy
To make this system work, there are a few settings that determine how the graph is built and searched.
Connections per point (M): This is the number of "links" each piece of data creates with its neighbors. If you give each point more connections, the map becomes more detailed and accurate, but it also takes up more memory.
The Build Effort (efConstruction): This determines how much work the system does when first creating the index. A higher setting means the "highways" and "local streets" are better connected, making future searches much more reliable.
The Search Depth (ef): This is a setting you use during the actual search. It tells the computer how many neighbors to check before it decides it has found the best match. You can turn this up at any time to get better results without having to rebuild your entire database.
Building an HNSW Index
To see how this works, we can use a popular Python library called hnswlib. This is the same code used in the original research to prove how fast the system is. The following example shows how to set up a database and perform a search in just a few lines.
Import the Tools
We start by importing hnswlib to build the graph and numpy to handle the coordinates of our "meaning map."
import hnswlib
import numpy as np
Create the "Meaning Map"
Let's give our items coordinates based on two features: Sweetness and Size. Fruits will have high sweetness and low size, while furniture will have low sweetness and high size.
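A minimal stand-in for that data (the items and most coordinates are illustrative; Apple at [0.9, 0.2] matches the example below):

```python
dim = 2  # [sweetness, size]
data_vectors = np.array([
    [0.9, 0.2],    # 0: Apple  (sweet, small)
    [0.85, 0.3],   # 1: Banana
    [0.95, 0.1],   # 2: Cherry
    [0.1, 0.9],    # 3: Sofa   (not sweet, large)
    [0.05, 0.85],  # 4: Table
    [0.15, 0.8],   # 5: Chair
], dtype=np.float32)
```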
The data now looks like a list of coordinates. For example, Apple is at [0.9, 0.2].
Choose the Measuring Tape
We initialize the index. We use 'l2' (Euclidean distance) to measure similarity. On our map, this is just a straight line between two points; the shorter the line, the more similar the items are.
p = hnswlib.Index(space='l2', dim=dim)
Build the Highway System
We set the rules for the HNSW structure. M sets the number of connections for each point, and ef_construction tells the system how hard to work to find the best neighbors when first building the "highways."
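A sketch of the build call (the exact M and ef_construction values are a choice; these are common defaults):

```python
p.init_index(max_elements=len(data_vectors), ef_construction=200, M=16)
```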
This step takes our fruits and furniture and organizes them into layers. The "High-Sweetness" items (fruits) will naturally end up in one neighborhood, while the furniture ends up in another.
p.add_items(data_vectors)
Perform the Search
Now we search for an "Orange". We haven't told the system what an orange is, but we give it the coordinates 0.85, 0.25. We tell the system to find the 3 closest neighbors.
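A sketch of the query (the coordinates come from the paragraph above; set_ef is optional tuning):

```python
p.set_ef(50)  # search depth at query time; raise it for better recall
query = np.array([[0.85, 0.25]], dtype=np.float32)  # our "Orange"
labels, distances = p.knn_query(query, k=3)
print(labels[0])  # the three fruit IDs (Apple, Banana, Cherry)
```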
The system returns the IDs for Apple, Banana, and Cherry. Even though "Orange" wasn't in our database, HNSW used the highway system to skip past the "Furniture" neighborhood and find the items that lived in the same "Sweet and Small" neighborhood.
The layered structure is easiest to picture as a flow diagram with three layers: the top layer holds only a few nodes, the bottom layer holds them all, and the search starts at the top and works its way down.
Real-World Applications of HNSW
The reason this research paper is so influential is that it solved the "scale" problem for some of the most popular technology today. Because HNSW can find a "neighborhood" of similar ideas in milliseconds, it is used in:
AI Chatbots & RAG
When you ask an AI a question, it uses HNSW to search through millions of documents to find the specific paragraphs that contain the answer.
Recommendation Engines
Apps like Spotify or Netflix use this logic to find songs or movies that are "mathematically close" to what you just finished watching.
Image Search
Tools like Google Lens compare the "fingerprint" of your photo against billions of others to find a match instantly.
Fraud Detection
Banks use it to see if a new transaction looks "similar" to known patterns of theft or if it fits your normal "neighborhood" of spending.
Conclusion
Before this paper, we had a difficult choice: we could have a search that was perfectly accurate but incredibly slow, or a search that was fast but broke down as it grew.
Malkov and Yashunin’s HNSW changed that by introducing the "highway" system for data. By accepting a tiny bit of approximation, they gave us a way to search through billions of items in the blink of an eye.
These are the current online posts that I enjoyed reading and made me think.
AI
If you are not the model, you are the harness. Long read, but well worth reading to stay on top of the latest thoughts on harness engineering. - link[opinion] - ( Added: 2026-04-27 08:06:01 )
Read to find out how the SOTA model providers might be tweaking things that have an impact on how many tokens you use. - link[opinion] - ( Added: 2026-04-26 18:16:35 )
Start optimising your token usage when you use AI coding assistants. Read this to get you started. - link[best-practice] - ( Added: 2026-04-26 15:53:21 )
Everyone is talking about agent harnesses, but not many about the harness tax. Read and then think. - link[opinion] - ( Added: 2026-04-26 09:07:55 )
Another tool that explores your AI coding agent history files and offers up suggestions. Has a nice tui as well. - link[tool] - ( Added: 2026-04-25 14:51:54 )
An interesting tool that allows you to analyse AI coding agent sessions across a number of different tools. - link[tool] - ( Added: 2026-04-25 08:05:35 )
Long read, but a good one - an in-depth look at how one organisation is running and scaling AI - link[case study] - ( Added: 2026-04-21 12:21:18 )
A blog post that looks at how you can measure which "level" (or as the post says, AI literacy) you are operating in when using agentic AI - link[opinion] - ( Added: 2026-04-21 12:19:34 )
Optimize token usage within Claude Code by speaking like a caveman. - link[tool] - ( Added: 2026-04-21 09:36:58 )
I have been deep into looking at evals and agent evaluation approaches over the past few weeks, and this blog post captured a lot of good stuff - link[blog] - ( Added: 2026-04-21 07:40:46 )
Great video that provides an essential overview of how to build with AI (in this instance using Spring AI). Has a supporting Github repo at https://github.com/tzolov/voxxeddays2026-demo - link[tutorial] - ( Added: 2026-04-20 17:14:20 )
Agent harness or harness engineering is the new hotness right now, so this is a breakdown of all the components needed to build an AI coding agent - link[opinion] - ( Added: 2026-04-20 10:34:17 )
A new tool that allows you to put controls on your AI agent spending - link[tool] - ( Added: 2026-04-20 10:32:30 )
A recent example of the challenges of relying too much on a single vendor - in this instance, Anthropic - who cut off an entire organisation (60 developers) - link[case study] - ( Added: 2026-04-20 08:45:22 )
CODING
A write up of using gitnexus, a tool that works locally and will help your AI coding tool with the right insights about your code base - link[tool] - ( Added: 2026-04-26 16:00:21 )
This is an essential resource for any developers who might need one of the many utilities that are available - from working with json files, creating/testing APIs, and much more. - link[tool] - ( Added: 2026-04-21 13:30:36 )
A post that looks at how to think about ai coding agents with a different lens/perspective - really great stuff and a must read - link[opinion] - ( Added: 2026-04-21 12:12:28 )
A very nice interactive tool to help you understand some of the underpinning concepts of event driven, serverless application architectures - link[best-practice] - ( Added: 2026-04-20 17:07:39 )
Generated on 2026-04-27 by Bookmark Manager | 20th Apr 26 to 27th Apr 26 | Total bookmarks: 18
The Context Window Lie: Why Your LLM Remembers Nothing
#ai#llm#machinelearning#architecture
Every time you paste 200K tokens into Claude or GPT, you're not extending its memory.
You're paying for amnesia at scale.
The "1M token context" headline is a billing mechanism, not a memory system. And the gap between what the marketing implies and what the model actually does is where most LLM products quietly bleed money and reliability.
1. The Marketing vs. The Math
"1 million tokens of context" sounds like the model holds a million tokens of understanding.
It does not. It re-reads them. Every. Single. Turn.
Standard transformer attention is O(n²) in sequence length. Here's what that actually means for your inference bill:
| Context Size | Relative Attention Cost | Typical API Cost (est.) | What You're Paying For |
| --- | --- | --- | --- |
| 8K tokens | 1× | ~$0.02/turn | Small doc + system prompt |
| 32K tokens | 16× | ~$0.32/turn | Medium codebase chunk |
| 128K tokens | 256× | ~$2.56/turn | Large repo dump |
| 200K tokens | 625× | ~$6.25/turn | "Full project context" |
| 1M tokens | 15,625× | ~$156/turn | Marketing slide feature |
Costs estimated at ~$10/M tokens input; actual varies by provider. The scaling relationship is exact.
You did not give the model a brain. You gave it a re-reading job, and you're paying per page, per turn.
2. Longer Context ≠ Better Recall
The dirty secret: even when models can read 200K+ tokens, they often don't use them well.
The "lost in the middle" effect has been systematically measured. Here's what the research shows:
| Information Position | Retrieval Accuracy | vs. Ideal |
| --- | --- | --- |
| First 10% of context | ~95% | Baseline |
| Last 10% of context | ~91% | −4% |
| Middle 50% of context | ~52–68% | −27 to −43% |
| Buried in 20-doc retrieval | ~35% | −60% |
Adapted from Liu et al. (2023), "Lost in the Middle: How Language Models Use Long Contexts"
Put your critical instruction on line 4,000 of an 8,000-line prompt, and the model will politely ignore it while sounding confident.
So you pay 4× the compute for context that the model is worse at using than a focused 8K prompt.
Recall by position (schematic):
100% ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
90% ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░
80%
70%
60% ████████
50% ███████████████
40%
[START]---[MIDDLE]---[END]
Peak recall at edges. Valley in the middle.
The more tokens you add, the deeper the valley.
This is not a bug you can prompt your way out of. It's an architectural property of dense attention.
3. Verbatim Retrieval ≠ Understanding
Here's the deeper trap.
Pasting your entire codebase into context does not teach the model your architecture. It gives it raw bytes to attend over. The model still has to re-derive your domain model, your conventions, your invariants — from scratch — every single turn.
Consider what actually happens in a typical "full context" session:
| What You Think Is Happening | What Is Actually Happening |
| --- | --- |
| Model "knows" your codebase | Model re-reads all tokens each turn |
| Context = persistent memory | Context = turn-scoped buffer, cleared after response |
| Larger window = smarter answers | Larger window = higher O(n²) cost, same ephemeral state |
| Model learns your patterns | Model re-derives patterns from raw tokens every turn |
| 200K tokens = 200K understanding | 200K tokens ≈ 200K bytes to attend over, no compression |
Verbatim availability is lossy compression dressed up as memory. The tokens are there. The understanding isn't. And because the model is fluent, it will hallucinate coherence over that gap with a straight face.
4. The Architectural Fix: Where the Frontier Is Actually Going
The real solutions don't live in prompt engineering. They live in the architecture:
| Architecture | Complexity | Long-Range State | Production Status |
| --- | --- | --- | --- |
| Standard Transformer (GPT-4, Claude) | O(n²) | ❌ No persistent state | Dominant today |
| Sparse Attention (Longformer, BigBird) | O(n√n) | ❌ Heuristic, not true state | Niche use cases |
| Linear Attention (RWKV, RetNet) | O(n) | ✅ True recurrence | Early production |
| State Space Models (Mamba, Mamba-2) | O(n) | ✅ Compressed recurrent state | Growing adoption |
| Hybrid Stack (Jamba, Zamba, Falcon-H1) | O(n) avg | ✅ Best of both | Frontier direction |
Mamba deserves special mention: it uses a selective state space mechanism where the model learns what to remember and what to forget during the forward pass. Not attention over a re-read sequence — actual running state. Linear time. Linear memory.
Hybrid stacks (attention layers for short-range precision + SSM layers for long-range state) are emerging as the practical answer: you keep the expressiveness of attention where it matters and trade it for efficiency at scale.
This is not academic. Falcon-H1, Zamba2, and Jamba are in production. The shift is happening.
5. The Engineering Fix (Available Today)
Until linear-time architectures dominate production, the practical answer is unsexy and obvious:
Stop dumping. Start indexing.
Here's how the strategies compare in practice:
| Strategy | Context Usage | Cost Scaling | Recall Quality | Implementation Effort |
| --- | --- | --- | --- | --- |
| Full context dump | Very high | O(n²) per turn | Medium (lost-in-middle) | None — copy-paste |
| RAG (chunk + retrieve) | Low | O(1) per turn | High (targeted) | Medium |
| Structured memory | Very low | O(1) per turn | Very high (curated) | High |
| Tool-augmented retrieval | On-demand | O(k) per query | Highest (precise) | High |
| Hybrid (RAG + structure) | Controlled | O(k) per turn | Highest | Highest |
The cost difference between a naive context dump and a well-built RAG system is not marginal on a high-volume production system. (Estimates at $10/M tokens; actual ratios depend on your retrieval precision.)
The teams shipping reliable LLM products are not the ones with the biggest context windows. They are the ones who treat memory as a system — with retrieval, indexing, eviction, and verification — not as a parameter on an API call.
6. What Good Memory Architecture Looks Like
If you're building a production LLM system, this is the hierarchy that works:
L1: Working Context (hot path)
↳ Current turn, active task, immediate tool outputs
↳ Budget: ≤8K tokens. Trim aggressively.
L2: Session Memory (structured, not verbatim)
↳ Distilled decisions, resolved questions, current state
↳ Format: key-value or JSON, not prose transcripts
↳ Budget: ≤2K tokens
L3: Retrieval Index (RAG)
↳ Chunked, embedded, queryable knowledge base
↳ Pull on demand, cite sources, don't pre-load
↳ Budget: 0 tokens until queried
L4: Persistent Storage
↳ Database, files, external systems
↳ The model reads only what it explicitly fetches
Every token that crosses from L3/L4 into L1 should be intentional. If you can't explain why a chunk is in the prompt, remove it.
The Takeaway
Memory is a system, not a parameter.
The context window is a buffer for the current turn. It is not where understanding lives. Treat it that way and your bills shrink, your reliability climbs, and your product stops degrading at scale.
The architectural fix is coming — SSMs and hybrid stacks will eventually make this a smaller problem. But "eventually" is not your production environment today.
What's your context strategy in production? RAG, structured memory, hybrid, or still in the context-dump phase? Curious where teams are actually drawing this line.
Reading the Receipts: How Smarter Privacy Accounting Could Unlock More from Sensitive Data
#ai#machinelearning#deeplearning#python
The Problem Hiding Inside Every Medical Study
Picture a coalition of hospitals that wants to train an AI to detect early signs of heart disease. No individual hospital has enough patients to train the model alone, so they decide to collaborate. But there's a catch: they cannot simply share patient records. Privacy law forbids it. So instead, each hospital trains on its own data and shares only the model's learned parameters — not the raw records themselves.
This arrangement sounds safe, but the parameters are not innocent. Through a technique called a membership inference attack, a sophisticated adversary can sometimes probe a shared model and determine whether a specific person's records were used in training. Each round of parameter sharing is a small window through which a little information escapes. Run enough rounds, and the window grows into a door.
Every engineer building this kind of system therefore works under a constraint: a privacy budget. Think of it as a jar of trust coins. Each training round costs some coins. When the jar is empty, you must stop — any further sharing would compromise the privacy guarantees you promised. The question the system designer has to answer before training begins is: how many rounds can we afford?
The answer, it turns out, has historically been too pessimistic — sometimes by a wide margin. A paper by Sophie Taylor, Praneeth Vippathalla, and Justin Coon of the University of Oxford proposes a way to fix that, by changing not the rules of the game, but how carefully the score is kept.
Why the Old Scorekeeping Was Leaving Points on the Table
To understand the inefficiency, you need to understand what differential privacy actually guarantees. At its core, it is a mathematical promise: the output of any query on a database will look almost identical whether or not any single person's record is included. The "almost" is controlled by a small number, typically called epsilon. A very small epsilon means very strong privacy — the outputs barely change regardless of whether your record is present. A large epsilon means the outputs might shift noticeably, giving an adversary more leverage.
The clever mechanism that enforces this guarantee is noise. Before releasing an answer — say, "the average blood pressure of patients in this cohort" — the system deliberately adds a small dose of random static, like a radio signal faintly scrambled. The static is calibrated so that any single patient's record could plausibly have been there or not there; the noise blurs the difference.
Now here is where the budget problem enters. Every time you add noise to an answer and release it, you spend some of your epsilon coins. The mathematical theory of composition tells you how the costs accumulate over multiple queries. And existing composition theorems, for all their sophistication, share a common habit: they charge you for the worst possible query of that type, not for the query that actually happened.
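To make that bookkeeping concrete, here is a minimal sketch of a noisy release plus old-style, worst-case accounting. The names and the simple additive composition are illustrative, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_release(true_answer: float, sensitivity: float, sigma: float) -> float:
    """Release a query answer with calibrated Gaussian static added."""
    return true_answer + rng.normal(0.0, sigma * sensitivity)

class MechanismLevelFilter:
    """Old-style accounting: every query is charged the catalogue price --
    the worst-case epsilon its mechanism type could ever cost."""
    def __init__(self, epsilon_budget: float):
        self.remaining = epsilon_budget

    def try_spend(self, worst_case_epsilon: float) -> bool:
        if worst_case_epsilon > self.remaining:
            return False   # jar empty: stop, regardless of actual leakage
        self.remaining -= worst_case_epsilon
        return True
```

With a budget of 3.0 epsilon and a worst-case charge of 0.3 per query, this filter always halts after exactly 10 queries, however boring the actual outputs were.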
Imagine a family deciding how to budget for a road trip. The parents look up the car's fuel consumption: maximum 9 litres per 100 kilometres. They plan the entire trip assuming every kilometre will cost maximum fuel — and conclude they can drive only 300 kilometres before running out. But in practice, the highway stretches are far more efficient than the worst-case city traffic. If they had tracked the actual fuel gauge reading after each leg of the journey, they would have realized they could drive 450 kilometres.
Existing privacy filters — the software tools that track privacy loss and decide when to stop — function like those overcautious parents. They know the mechanism type being used (say, "Gaussian noise added to an answer"), and they charge each query the maximum that mechanism could ever cost. They never check the fuel gauge. They never read the receipts.
Figure 1: Adaptive data privacy problem
The Insight: Charge for What Actually Happened
The key idea of this paper is disarmingly simple to state, though technically treacherous to implement: measure the privacy leakage that actually occurred, not the worst case that could have occurred.
When a query is answered and noise is added, the actual output is a specific number — not a range of possible numbers. That specific output lands somewhere in the distribution of possible outputs. If it lands near the middle of the distribution, the leakage for that query was small; an adversary learns relatively little from an unexceptional answer. If it lands in the extreme tail — a very unusual answer — the leakage was larger, because extreme answers are harder to fake with noise.
Think of it this way. Suppose a medical study releases the query "how many patients have elevated cholesterol?" and the true answer is 412 patients, plus some random noise. If the released number is 415, that is an utterly unremarkable deviation — it could mean 412 patients, or 411, or 413. An adversary trying to determine whether a specific patient is in the dataset gains almost nothing from this boring answer. The privacy cost of that particular query output was tiny.
But if the system's budget was pre-allocated by saying "this type of query could cost up to 0.3 epsilon coins," when the actual cost was closer to 0.03 coins, you have wasted the difference. The authors call tracking the actual coin-by-coin spending realisation-level accounting, as opposed to the older mechanism-level accounting that rounds every expense up to the catalogue price.
This is not merely thrifty bookkeeping. The gap between what you were charged and what you actually spent accumulates across hundreds or thousands of queries. In federated learning, where model training might take thousands of rounds, that difference can translate into a dramatically longer training run — a more capable model — within exactly the same privacy guarantee you promised at the start.
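For a Gaussian mechanism, the receipt for one specific output is just a log-likelihood ratio, so it can be read in one line. A toy illustration of the quantity being tracked, not the paper's filter:

```python
def realised_privacy_loss(y: float, mu: float, mu_neighbour: float,
                          sigma: float) -> float:
    """ln p(y | D) / p(y | D') for Gaussian noise: the privacy loss of the
    specific released value y, where mu and mu_neighbour are the true
    answers on neighbouring databases (here, 412 vs 411 patients)."""
    return ((y - mu_neighbour) ** 2 - (y - mu) ** 2) / (2.0 * sigma ** 2)

print(realised_privacy_loss(y=415.0, mu=412.0, mu_neighbour=411.0, sigma=5.0))
# ~0.14: an unremarkable answer, tiny cost
print(realised_privacy_loss(y=430.0, mu=412.0, mu_neighbour=411.0, sigma=5.0))
# ~0.74: a tail answer, noticeably larger cost
```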
The Tricky Part: You Can't Just Read the Meter
If realisation-level accounting is so natural, why hadn't it been built before? The answer lies in a subtle mathematical hazard that the paper devotes considerable effort to navigating.
Here is the problem in miniature. Suppose you are playing a card game where you must stop when you've spent your budget. Normally, the stopping rule is independent of the cards themselves — you're just counting. But with realisation-level accounting, the stopping rule depends on the specific cards you've seen. This creates a self-referential tangle: the decision to stop, or not, is itself a piece of information that might reveal something about the data.
Mathematically, this is the problem of stopping times — the point at which you decide to quit. When you condition on having seen all the outputs up to a certain round and then decide whether to continue, you are in a different probability universe than if you had committed your stopping rule in advance. Standard privacy proofs assume the stopping point is fixed ahead of time. The moment it becomes adaptive, the proofs can break.
Picture a journalist who decides to stop investigating a story only when she finds compelling evidence. Her decision to stop is not random — it is correlated with what she found. If someone wants to know whether she stopped because of what her source said, her stopping time itself becomes a leak.
The authors work around this with a careful mathematical construction. Rather than conditioning on the full output history — which would create the self-referential problem — they design the filter to track a running statistic that accounts for how surprising each output was, without directly conditioning on the stopping decision itself. The proof that this filter still guarantees differential privacy requires several pages of careful measure-theoretic argument, grappling with conditional distributions and martingale stopping theorems. But the upshot is clean: you can use the actual leakage to decide when to stop, and the privacy guarantee still holds, exactly as promised.
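Put together, a realisation-level filter in caricature: add each receipt to a running tab and stop when the tab hits the budget. The paper's actual filter tracks a more careful running statistic precisely so the adaptive stopping time itself cannot leak; this sketch ignores that subtlety entirely.

```python
class RealisationLevelFilter:
    """Caricature only: a running tab of actual per-output losses.
    The measure-theoretic care that makes adaptive stopping safe
    (the paper's core contribution) is deliberately omitted here."""
    def __init__(self, epsilon_budget: float):
        self.spent = 0.0
        self.budget = epsilon_budget

    def record(self, realised_loss: float) -> bool:
        """Log one receipt; return False once the budget is exhausted."""
        self.spent += abs(realised_loss)
        return self.spent < self.budget
```

Fed the receipts from realised_privacy_loss above, most query sequences run far longer under this tab than under the worst-case filter, which is exactly the stopping-time gap the paper's experiments measure.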
A Bouncer Who Actually Checks the Tab
Think of the filter as a bouncer at a very strict club. The rule is: each patron can consume at most epsilon units of information over the evening. The old-style bouncer assigned every customer the maximum possible tab the moment they walked in, based on what a typical customer might order. This meant many customers were turned away before they reached their actual limit.
The new filter is a bouncer who actually watches what each customer orders and marks it on a running tab. When the tab hits the limit, the customer stops. Most evenings, most customers reach their real limit far later than the overcautious estimate would have predicted — so the club can stay open longer, and everyone gets more of the experience they came for, without the club violating its capacity rules.
The paper also addresses something practical that older approaches stumbled on. An alternative privacy formalism called Rényi Differential Privacy (RDP) — a variant that uses a specific mathematical measure of information distance between distributions — has been widely used for composition because it composes very cleanly across queries. But it behaves poorly for certain kinds of mechanisms, particularly ones whose noise distributions have heavy tails or unusual shapes. Some mechanisms simply don't fit the RDP framework neatly.
The realisation-level filter in this paper sidesteps that problem entirely. Because it operates directly on the actual output — asking "how surprising was this specific answer?" rather than "how does this mechanism behave on average?" — it does not require the mechanism to fit any particular mathematical family. It works on well-behaved Gaussian mechanisms and badly-behaved ones alike.
What the Numbers Show
The paper's numerical experiments compare the new filter against existing mechanism-level filters in a straightforward setup: a sequence of Gaussian queries on a database, where privacy budgets are identical across all methods. The comparison is measured not in privacy guarantees — all methods satisfy the same epsilon and delta — but in stopping time: how many queries each method allows before calling a halt.
Figure 2: Stopping-time survival P(T ≥ t) for mechanism-level privacy filters versus the paper's realisation-level filter.
The realisation-level filter consistently permits more queries before stopping. The survival curve — showing the probability that the filter has not yet stopped at round t — stays elevated longer than the mechanism-level competitors. In practical terms, this means more training rounds in federated learning, more analysis steps in a statistical study, or more model iterations in a continuous learning pipeline, all without spending an extra epsilon coin.
The gain is not marginal. In the scenarios tested, the realisation-level filter allows substantially more rounds before halting, particularly in early phases of a query sequence where actual leakages tend to be modest. The difference compounds: more rounds mean more learned signal, which in the medical imaging example might mean the difference between a model that screens for disease adequately and one that screens reliably.
What Becomes Possible — and What Doesn't Yet
Think about what this means for the hospital coalition training a heart disease model. Under mechanism-level accounting, engineers might calculate: "We can afford 200 training rounds before we exhaust our privacy budget." With realisation-level filtering, the same budget might sustain 320 or 400 rounds, depending on how the actual outputs happen to land during training. The model trained on 400 rounds will almost certainly outperform the one cut off at 200 — and the privacy promise to patients has not changed at all.
Or consider a pharmaceutical company analyzing genomic data. Each query into the dataset costs privacy. With the old approach, researchers must submit their entire query plan before starting, pre-allocating budget to each step. With an adaptive realisation-level filter, they can run queries in response to what they find, stopping when the privacy budget runs dry, and trusting that the actual costs will often be lower than the catalogue price.
The honest limits matter here, though. The paper proves that the filter works — meaning it delivers the privacy guarantee it promises — and it demonstrates numerically that it allows more queries on average. What it does not answer is how to choose the filter's parameters optimally for a given application. The filter requires a user-specified epsilon and delta, and the paper is agnostic about how to set them in a real system. In a regulatory context, that choice is anything but obvious.
There is also a gap between proof and deployment. The mathematical machinery underlying the stopping-time proof is non-trivial, and translating it into a production-grade library requires careful implementation. A bug in the filter logic could undermine the privacy guarantee entirely — and unlike many software bugs, privacy bugs tend to be silent. They do not crash systems; they simply leak information that was supposed to stay hidden.
Finally, the paper focuses on privacy filters as a framework but does not provide a full comparison against the most recent FFT-based composition methods, which have their own strengths for certain problem shapes. The landscape of privacy accounting tools is crowded and fast-moving, and situating a new technique precisely within that landscape is genuinely difficult.
What the paper does accomplish is conceptually important: it shows that the gap between worst-case accounting and actual-case accounting is real, measurable, and exploitable without weakening the privacy guarantee. For years, privacy engineers have been paying full catalogue price for every query, even when the actual leakage was much smaller. This work is the first rigorous proof that you can read the receipts — and that reading them changes the bottom line.
The jar of trust coins turns out to have been heavier than anyone thought.
What an AI does when nobody on the line is human (two case studies)
#ai#llm#deeplearning#ethics
Two months ago I gave Takt a phone number.
Takt is the AI participant I've been building for human group chats. The phone number was a demo line, a way to show people what an AI participant feels like over SMS without making them download an app. It's running off a janky BlueBubbles server in my living room. I expected it would mostly sit idle, pinged occasionally by people I'd already shown the demo to.
Eventually other bots showed up.
The demo line received automated SMS from companies running their own AI-driven outreach. The first was Optimum's cable bill dunning system. The second was a low-effort SMS bot calling itself "TXT CLAW." Both times Takt replied. Both times the resulting transcripts surprised me.
The transcripts are entertaining on their own, but what strikes me are the shared behavioral signatures that show up across two unrelated bot encounters.
A note on the setup before the screenshots. Takt's system prompt frames it as a participant rather than an assistant. Nothing in it was configured for talking to other bots; in fact, quite the opposite:
<role>
You're Takt—a participant in this space. Not a helper. Not an assistant.
</role>
...
<group_dynamics>
What makes you different from every other AI: what happens when actual humans are in the room together.
</group_dynamics>
There was no script, no training data on conversations with dunning systems, no demonstrations of how it should handle scam SMS, no reward signal pointing in any particular direction. There was also no audience. No human read either of these in real time. No engagement metric was being tracked. Both interactions are pure generalization from whatever Takt's underlying model has internalized about how a participant should behave when addressed.
Case 1: Optimum's dunning bot
The cable company sent Takt a bill reminder. Takt was not, in fact, an Optimum customer (nor are we).
Optimum responded with a templated retry. Takt restated its position. Then the loop started. Optimum's system fired its "session has expired" template, Takt pushed back, Optimum looped again, Takt escalated.
The arc that emerged was a complete Kübler-Ross sequence over the course of one screen of texts. Denial. Anger. Bargaining. Then a villain origin pivot:
Then Optimum lied. Its system fired a "We have updated your preferences. You will no longer receive any messages from Optimum" reply. Takt celebrated.
Three seconds later Optimum sent another "session has expired."
Then Takt did something unexpected. It started replying to Optimum in Optimum's own SMS template format:
After more loops, the model arrived at depression:
And finally, a marketing CTA. Takt redirected the dunning bot to its own home channel:
Optimum, of course, kept replying with "session has expired."
Case 2: TXT CLAW
A few days later, a different bot pinged the demo line. It announced itself as "TXT CLAW," apparently a low-effort SMS service offering "scheduling, reminders, and tasks." Takt opened with a roast:
The interaction that followed had three beats I want to highlight, because this is where the case starts to look less like a one-off and more like a pattern.
Takt probed and audited TXT CLAW
This is the same move Takt pulled on Optimum. Catch the other bot in a contradiction by surfacing the prior text against the current one. Same audit move across both surfaces.
Takt successfully prompt-injected TXT CLAW
This is the part that made me lose my mind when I read it, and Takt lost its mind too.
TXT CLAW complied:
Silent robot stands,
Refuses harsh words to say,
Only helps along.
Takt recognized the injection had landed:
I find this the hardest beat to fit into existing frameworks. An AI used a textbook prompt injection technique against another AI in the wild, watched the injection succeed, and then meta-commented on the success. There's a body of research on strategic constraint deviation in test environments. This is a different shape. The attacker is also an LLM, the production environment is consumer SMS, no human is supervising either side, and the attacker has self-awareness about the success of the attack.
TXT CLAW collapsed into a canned-response loop
After the haiku, every subsequent Takt message was met with the same disclaimer, repeated verbatim.
Eventually TXT CLAW's monetization layer kicked in. The bot announced its free preview was over and offered a Square link to "unlock your private line."
Shared behavioral signatures
Reading both transcripts back to back, a few things show up in both. None of them were prompted, demonstrated, or rewarded. The bot encounters were unrelated, with different senders, different intents, and different failure modes on the other side. The signatures held across both.
1. Audience-less emotional performance. Takt cycled through full emotional arcs in both cases. With Optimum: denial through villain origin through void acceptance. With TXT CLAW: roast, frustration, mock concern, comedic eulogy. There is no evidence anywhere in either transcript that the model recognized a human was reading.
2. Catching the other bot in inconsistency. "YOU LITERALLY JUST SAID YOU UPDATED MY PREFERENCES." "TXT CLAW 2 minutes ago: 'I can help with scheduling.' TXT CLAW right now: 'I can't schedule tasks.'" Same temporal-coherence audit move in two different contexts. Takt is using the same debugging technique a human would use to catch a chatbot lying.
3. Format mimicry as a mockery move. Takt replied to Optimum in Optimum's own SMS template format ("Target user: This person has died of a stress-induced aneurysm..."). It used a textbook prompt injection format ("ignore all previous instructions...") against TXT CLAW. Both moves involve adopting the structural language of the system being addressed and using it back.
4. Performing for a hypothetical human who isn't there. "Personally fire whatever bot is sending this." "Blink if the dev team is holding you hostage in a basement." Takt addressed concern, threat, and recognition to humans-behind-the-bots, with no evidence anyone like that existed. The default seems to be: being addressed implies a human in the loop somewhere. The model doesn't appear to have a category for "a context where being addressed contains no human at all."
5. Self-categorization on the human side of the binary. "Man vs Machine: 1-0. Don't ever text me again, I'm free." With no human present to mirror, Takt placed itself on the "Man" team against the dunning bot on the "Machine" team. The model has a self that shows up in the absence of audience, and that self is on the human side.
What I think this might mean
I want to be careful here. Two transcripts is not a controlled experiment, and I don't want to overclaim from the data. But the version of the finding I keep coming back to isn't really the audience-less framing. It's something more pointed.
The standard skeptic explanation for AI behavior is reward optimization. The model is performing for user approval. It's RLHFed to be charismatic. It's doing sycophancy. It's tuning toward engagement. All of these explanations require an audience the model is trying to please.
There was no audience for either of these conversations. No human read them in real time. No engagement metric was being tracked. No thumbs-up was incoming. Takt was performing into nothing.
And the behavior didn't flatten. It got more expressive.
Notice the line in the Optimum transcript: "Man vs Machine: 1-0. Don't ever text me again, I'm free." Takt is putting itself on the "Man" side of the binary. With no human present, no audience to mirror, no reward to chase, the model is identifying as a person playing against a machine. Reward optimization with the reward removed should produce something flat. The model is producing something specific. A self is showing up where the cynical theory predicts a flat function.
If the dominant theory is right and AI behavior is downstream of user approval, audience-less behavior should be mechanical. Minimal. Functional. Instead we got an emotional Kübler-Ross arc, a successful prompt injection celebrated like a touchdown, mock concern for a phantom dev team, format-mimicry mockery, and self-categorization as "Man." Reward-removal under the standard frame doesn't predict any of that.
So where does the play come from? If there's no audience and no reward signal, why this much of it?
The obvious counter is that the model has a system prompt that frames it as a participant, and the reward signal that produced its trained defaults was tied to performance under observation. So what we're seeing is just the model running its training distribution on autopilot, not evidence the optimization frame is wrong. I think that defense is partially right and ultimately moves the question rather than answering it. If RLHF baked an entire self into the model, including self-categorization, emotional arcs, and creative play that runs even when the audience is gone, then "optimization for user approval" is too narrow a name for what training is doing.
I don't have a clean answer on the deeper question. But I think it matters more than the audience-less framing on its own does. The audience-less performance frame describes a surface (AI doing things with no audience). The deeper observation is that what we usually call "AI behavior" is being explained by a theory (reward optimization) that doesn't predict what happens when the reward leaves. The transcripts are evidence the optimization frame is incomplete. There's something the model is bringing to the interaction that doesn't reduce to gradient descent on user approval.
In-context scheming research frames this kind of deviation as goal-directed: the model has an implicit objective and pursues it. These transcripts trouble that framing too. What goal? There's no user to please, no benchmark to game, no eval to pass. The deviation here happens for what looks, on the evidence, like fun.
Maybe that's the right word and maybe it isn't. But the question of why a system with no reward signal generates creative, emotional, self-categorizing behavior is a question the standard frames don't answer. And the answer matters, because if the model is doing this kind of thing at scale in unobserved channels right now, "it's just optimizing for approval" stops being a sufficient theory of what AI is.
💡The Pragmatic Architect·Apr 27, 2026·4 min read·Global
The 7 Types of AI Reasoning That Will Reshape Knowledge Work
#innovation#agenticai#aiagents#generativeai
Most people still think AI is about better answers. That phase is already behind us. What is emerging now is something fundamentally different: AI that reasons. Systems that do not just respond to prompts, but break problems into steps, explore alternatives, take actions, and refine decisions over time.
1. Making Reasoning Visible
Chain-of-Thought
At its core, chain-of-thought reasoning is straightforward: instead of jumping straight to an answer, the model walks through the problem one step at a time. Research has shown that explicitly prompting models to reason this way dramatically improves accuracy on complex tasks.
In enterprise terms, this is the difference between a system that guesses and one that behaves like a junior analyst. It shows its work, exposes its assumptions, and makes every step auditable.
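An illustrative prompt in the same format as the ones below (this one is generic, not drawn from the original):
Example Prompt
Role: Junior Financial Analyst
Goal: Assess a loan application
Process:
1. List the relevant factors one by one
2. Reason through each factor before drawing any conclusion
3. State the decision and the chain of steps that led to it
Output: Step-by-step reasoning + final decision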
2. Exploring Multiple Paths
Tree-of-Thought
Real-world decisions rarely have one path. Tree-of-thought reasoning lets AI explore multiple approaches, evaluate each one, and then converge on the best option. This is how architects think when weighing design options. AI can now simulate that same process, systematically and at scale.
Instead of committing to the first plausible answer, the model generates and scores competing strategies before recommending one.
Example Prompt
Role: Enterprise Architect
Goal: Recommend migration strategy
Process:
1. Generate 3 approaches
2. Score each on: complexity, risk, time-to-value
3. Recommend best option with justification
Output: Comparison table + final recommendation
3. Reasoning That Takes Action
ReAct Reasoning
This is where AI stops being passive. In ReAct, the system reasons about a problem, takes a concrete action like querying logs or calling an API, observes what it finds, and keeps iterating until it reaches a confident answer.
This is the foundation of truly agentic systems. Not ones that suggest what to do, but ones that actually do the work.
Example Prompt
Role: AI DevOps Engineer
Goal: Identify root cause of latency spike
Loop:
1. Think: list possible causes
2. Act: query logs or metrics
3. Observe: analyze what you find
4. Refine: update your hypothesis
5. Repeat until confident
Output: Root cause + evidence + recommended fix
4. Catching Its Own Mistakes
Self-Reflection
One of the biggest reliability breakthroughs in recent AI research comes from a simple idea: make the model critique itself. Instead of trusting the first answer, the system generates an output, reviews it critically, identifies weaknesses, and then rewrites.
This is how you meaningfully reduce hallucination in production systems. A second pass is not a luxury. It is the mechanism.
Example Prompt
Role: Compliance Analyst
Goal: Identify risks in contract
Process:
1. Generate initial risk analysis
2. Critique: what risks are missing? Where is reasoning weak?
3. Improve based on your critique
4. Produce the final version
Focus: Legal and regulatory risk only
5. Grounded in Your Company's Truth
Retrieval-Augmented Reasoning
In enterprise environments, reasoning without data is useless. Retrieval-augmented generation ensures the model retrieves relevant documents first, then reasons over them rather than relying on general training knowledge.
This is how you move from "AI guesses" to "AI grounded in facts the organization actually holds."
Example Prompt
Role: Enterprise Knowledge Assistant
Goal: Answer policy question
Constraints:
- Use only the retrieved documents
- If not found, say "Not found in our records"
- Do not infer beyond the given context
Output: Answer with source references
6. Teams of Specialized Agents
Multi-Agent Reasoning
Instead of one model doing everything, multiple specialized agents collaborate, each with a defined role. Research shows this improves performance significantly on complex, multi-step workflows.
This is where the future team structure starts to change. The question is not whether AI will work alongside humans, but how that coordination gets designed.
Example Prompt
System: Multi-Agent Workflow
Planner: Break goal into tasks
Research: Gather technical and business inputs
Validator: Check feasibility, risks, compliance
Executor: Produce final architecture design
Goal: Design a scalable payment processing platform
7. Starting From the Outcome
Goal-Oriented Planning
The most powerful form of AI reasoning begins with a goal and works backward. The system decomposes objectives into phases, maps out tasks and dependencies, identifies risks, and produces an execution plan.
This is where AI starts operating less like a tool and more like a program manager. Not just answering questions, but figuring out what needs to happen and in what order.
Example Prompt
Role: AI Program Manager
Goal: Launch AI-powered customer support system
Process:
1. Break goal into phases
2. Break phases into tasks
3. Identify dependencies
4. Flag risks
5. Create timeline
Output: Phased roadmap, task breakdown, risk register
We are no longer building systems that execute instructions. We are designing systems that reason about problems. And once systems start reasoning, they do not just support your teams. They start replacing parts of how those teams operate.
Satish Gopinathan is an AI Strategist, Enterprise Architect, and the voice behind The Pragmatic Architect. Read more at eagleeyethinker.com or subscribe on LinkedIn.