The Claude Code Agent Guide

Built from 15 YouTube Video Sources — Compiled 2026-04-26

Chapter 1: What Is an Agent in Claude Code

Source Videos: Building Claude Code with Boris Cherny, Anthropic Just Dropped the Biggest Subagent Upgrade Yet, Claude Code Just Got a MASSIVE Upgrade (Agent Loops), Agent Building Trends


What Claude Code Is

Claude Code is an agent. That word is overloaded, so be specific: Claude Code is a language model that runs inside a control loop, calls tools, reads the results, and decides what to do next. The model is the brain. Bash, file edit, file read, and a small set of other tools are its hands. The loop is what stitches the two together until a task is done.

This is different from a chatbot. A chatbot answers a turn and stops. An agent keeps going. It looks at the world (a file, a directory listing, a test failure, a web page), forms an intent, runs a command, looks again, and adjusts. Boris Cherny, who built Claude Code, describes the architecture as deliberately small: a core query loop, a few tools, and a heavy layer of safety and permission checks around them. There is, in his words, "not much to it."

That austerity is not laziness. It is a design thesis. Claude Code's creators bet that if you give a sufficiently capable model good tools and get out of the way, it will outperform any system that tries to dictate every step. The rest of this chapter unpacks why that bet works, what the loop actually does, and what vocabulary you need to read the rest of this guide.

The Origin Moment: Chatbot Plus One Tool

Claude Code did not start as an agent. It started as a chatbot that hit the Anthropic API from Boris Cherny's terminal. That was its only job: take input, produce a reply. Useful, but not transformative.

The shift came when Anthropic's tool-use feature shipped and Boris gave the chatbot a single tool: bash. He did not have a clear plan for what to do with it. So he asked the chatbot what music he was listening to. The model wrote an AppleScript snippet, ran it through bash, queried his music player, and answered.

That moment matters because nothing about it was scripted. There was no "music player integration" in the chatbot. There was a model, a bash tool, and a goal. The model figured out the rest. Boris adds a second tool, file edit, and the chatbot becomes capable of writing and modifying real code. That two-tool prototype is the seed Claude Code grew from.

Key Insight from The Pragmatic Engineer: "The model is its own thing. You give it tools. You give it programs that it can run." -- Boris Cherny, Building Claude Code with Boris Cherny

The point of the origin story is not nostalgia. It is that the agentic behavior people now find impressive — Claude Code reasoning across a codebase, fixing a test it broke, running a deploy script, recovering from an error — is what falls out of that simple recipe at scale. There is no separate "agent module." There is a loop and there are tools.

The Agent Loop, Concretely

The agent loop (sometimes called the query loop — Boris uses both interchangeably) is the control flow at the center of Claude Code. In the abstract:

  1. The user (or a parent process) gives the agent a prompt.
  2. The model decides what to do. Often that decision is "call a tool" — read this file, run this bash command, fetch this URL.
  3. The harness runs the tool and captures its output.
  4. The output is appended to the conversation and the model is invoked again.
  5. The model either calls another tool, asks the user a question, or declares the task complete.
  6. Repeat until the task is done or the user stops it.

Each pass through that cycle is one turn. A non-trivial Claude Code session is dozens, sometimes hundreds, of turns. The model never holds state between calls — every turn re-reads the entire conversation, including all prior tool results, and decides what to do next.

The implications are worth slowing down for:

  • Tools are the action surface. The model cannot do anything in your environment that is not exposed as a tool call. Bash is the broadest of these — through bash the model can read, write, compile, test, and call other CLIs. File edit is precise. Web fetch reaches outside your machine. The harness around the model decides which tools exist and which require permission to run.
  • The loop is what makes recovery possible. When a test fails, the failure shows up as the next tool result. The model reads it on the next turn and reacts. There is no separate "error handler"; error handling is just the loop continuing.
  • Latency is dominated by the loop, not the model. Each tool call is a round trip: invoke, run, return, re-prompt. A session that runs sixty bash commands runs sixty turns through the loop.
  • The loop is finite by default. Claude Code does not run forever. It runs until the model decides it is done, the user interrupts, or a budget (token, time, or permission) is exhausted.

A useful sanity check: when you watch Claude Code work, what you are seeing is the loop. The streaming text is the model's turn output. The shaded blocks are tool calls. The next streaming text is the model reading what came back. Reading a Claude Code session as a sequence of loop iterations, rather than as a chat, is the single most useful mental shift this chapter offers.

Why This Differs from a Chatbot

A chatbot is a transcript. You speak, it responds, the turn ends. To make it do anything in the world you have to do the work yourself: copy its code into your editor, run the command it suggests, paste the error back in, ask the next question.

An agent collapses that loop. It is not just that it has tools — it is that it decides when to use them, reads what they return, and incorporates the result into its next decision. The user is still in charge of the goal. The agent is in charge of the path.

This is what people mean when they say agents are "officially real" in 2026. Nathaniel Whittemore, surveying about a hundred submissions to a builder showcase, observes that the dominant pattern is no longer people building themselves tools. They are building themselves digital coworkers, org charts of agents, and sometimes a literal "AI chief of staff." Whether or not the org-chart framing is the right one — Whittemore himself is skeptical it lands there — it points at a real shift. The unit of automation has moved from "endpoint that returns a string" to "process that pursues a goal."

Key Insight from The AI Daily Brief: "People are not building themselves tools. They are building themselves digital employees and org charts." -- Nathaniel Whittemore, Agent Building Trends

Claude Code sits squarely in that shift. It is the version of this pattern aimed at engineers and at any code-adjacent task that benefits from the file system, the shell, and a real environment.

The Bitter Lesson Behind the Design

There is a particular reason Claude Code's architecture is so thin: a deliberate refusal to over-design.

Rich Sutton's bitter lesson, originally an essay about machine-learning research, observes that across decades, general methods that scale with computation have consistently beaten methods that hard-code human cleverness. Hand-engineered chess heuristics lost to search. Hand-tuned vision features lost to learned ones. Hand-written agent scaffolds, the argument goes, will lose to letting the model think for itself.

Boris frames Claude Code's design as a corollary of that lesson. Earlier AI coding tools, he points out, took the opposite approach: take a model, define a narrow interface, stub out a function, declare that the AI part, and write the rest as a normal program. The model becomes a component inside a human-designed pipeline. That works, briefly. Then the model gets better and the scaffolding becomes the ceiling.

Claude Code goes the other way. Give the model the same tools a human engineer has — a shell, a file system, a web fetcher — and let it pick the path. When the model improves, the agent improves with it, because almost nothing about how the agent operates is hard-coded.

Key Insight from The Pragmatic Engineer: "There's a version of the bitter lesson here. Just let the model do its thing. Don't try to put it in a box." -- Boris Cherny, paraphrased from Building Claude Code with Boris Cherny

You see this discipline throughout Claude Code's history. An early version of code search used a local vector database — RAG was the textbook approach at the time. The team found it brittle: indexes drifted out of sync with the source, permissioning was awkward, and the model was happy to navigate a codebase using grep and glob directly. They threw the vector store out. Agentic search beat retrieval. The lesson keeps repeating: the right move is usually to remove a layer rather than add one.

The same instinct shapes how the team experiments. They add tools. They delete tools. They prototype a hundred versions of a UI element and ship the one that feels right after a month of dogfooding. The codebase is constantly churning because the model underneath it is constantly changing, and ideas that were bad six months ago are good now. The architecture has to be loose enough to keep up.

Context: The Window the Model Reasons In

The agent loop runs entirely inside the model's context window — the token budget Claude can hold in mind at one time. Every turn in the loop sees the whole conversation: the system prompt, the user's instructions, every prior tool call, every prior tool result, and the model's own past reasoning.

The window has two practical consequences for how an agent behaves.

The first is that Claude Code's working memory is whatever fits in the window. There is no hidden database, no long-term store, no implicit recall. If you want the agent to know something, that something has to be in context — either because the user provided it, because a file got read in, or because the model wrote it down (in a CLAUDE.md, in a memory file, in a prior message). Working with an agent is partly a memory-management problem.

The second is that the window has a quality cliff. Long before you hit the token limit, the model's decisions degrade. Outputs get hasty. Intermediate steps get skipped. The model declares done early. This degradation has names that recur throughout this guide — context rot for the gradual quality decline, context anxiety for the sudden behavioral shift as the window fills — and managing them is a craft of its own. Chapter 3 covers the practices that keep the window clean. For now, just register that "the context window" is not just a limit. It is the environment the agent thinks in, and how full it is matters.

Subagents: When the Main Session Needs Help

One pattern in the loop is important enough to name now, because it shapes everything from here forward.

A subagent is a separate Claude Code session, spawned by the main session, that runs in its own isolated context window. The main session hands it a prompt and a goal. The subagent runs its own loop — its own turns, its own tool calls, its own reasoning — and returns a result. The main session adds that result to its own context and continues.

Why bother? Because some work is noisy. Searching a large codebase, reading a long log, scanning a directory tree — these tasks generate a lot of intermediate output that the main session does not need. If the main session does it directly, that noise lands in its window and crowds out everything else. Delegating the work to a subagent keeps the noise out of the parent's context. Only the summary comes back.

Key Insight from Ray Amjad: "The whole reason for us to have subagents is so we can delegate any noisy tool calling into a separate context window and only get the most relevant results back into the main session." -- Ray Amjad, Anthropic Just Dropped the Biggest Subagent Upgrade Yet

That is the entire idea: context isolation, on demand. Subagents have many flavors and tradeoffs — when to spawn one, when not to, when the subagent should inherit the parent's history, how they interact with skills and parallel work — and Chapter 2 covers them in full. For the rest of this chapter, the term subagent means: a child agent loop, with its own clean window, that the main agent invokes to keep its own context lean.

How the Loop Shows Up in Real Use

The loop is not just an abstraction. It is what you watch on your screen for hours at a time when you actually work in Claude Code, and it is what enables most of the practices the rest of this guide covers.

A few concrete instances:

  • Self-testing. When the Claude Code team makes a change to Claude Code itself, the agent will sometimes launch itself in a subprocess to verify end-to-end that it still works. That is just another tool call inside another loop — but it lands as a behavior nobody had to program in. With Opus 4.5 and later, the model started doing it on its own.
  • Code review. Every pull request inside Anthropic is reviewed by an instance of Claude Code running in CI via the agent SDK (claude -p). It catches roughly 80% of bugs in the first pass. There is always a human in the loop after — the agent loop ends, and a person makes the final call — but the bulk of the grunt work is done by an agent grinding through one turn at a time.
  • Scheduled work inside a session. A more recent feature, /loop, lets you tell an open Claude Code session to run a prompt on an interval ("every 10 minutes, check the deployment status"). It is not an always-on autonomous agent — it lives inside the current session, expires after at most three days, and dies if the terminal closes. But it is a useful illustration of the same primitive at a different cadence: the agent loop, scheduled.

These are not three different products with three different architectures. They are the same loop applied at different scopes — one session, one CI run, one timed interval. Once you see Claude Code as the loop, every feature you encounter has the same shape.

What This Guide Covers Next

You now have the vocabulary the rest of this guide depends on. To recap:

  • Agent. A model running in a loop, calling tools and reading results until a goal is met. What Claude Code is.
  • Agent loop (or query loop). Prompt → tool decision → tool runs → result feeds back → repeat.
  • Main session. The top-level Claude Code session you, the user, are talking to.
  • Subagent. A child session, spawned by the main session, with its own isolated context window. Used to keep noisy work out of the parent's context.
  • Context window. The token budget the model reasons in. Its quality degrades long before the limit.
  • Bitter-lesson framing. Don't put the model in a box. Give it tools. Let it figure out the path. When the model gets better, the agent gets better with it.

From here, the guide gets more specific:

  • Chapter 2 takes subagents apart in detail — how they work, when to spawn them, and the April 2026 forked-subagent feature that lets a child inherit the parent's history.
  • Chapter 3 covers parallel agents, git worktrees, and the context-engineering practices that keep multiple Claudes productive without stepping on each other.
  • Chapter 4 turns to skills — reusable instruction-and-asset bundles the agent can load on demand — and the "agents as skills" pattern.
  • Chapter 5 covers orchestration: planning agents, agent-of-agents, and the shapes of multi-step workflows.
  • Chapter 6 looks at long-running agents and the durability problems that emerge once a loop runs for hours instead of minutes.
  • Chapter 7 composes memory, skills, and hooks into a personal agentic stack.
  • Chapter 8 questions whether chat is even the right interface for agents.
  • Chapter 9 returns to Anthropic's own perspective on how Claude Code is built, shipped, and used internally.

Every one of those chapters is, underneath, a refinement of the same primitive: a model, a set of tools, a loop. Once you see the loop, the rest is variations.

Chapter 2: Subagents — The Context-Isolation Pattern

Source Videos: Anthropic Just Dropped the Biggest Subagent Upgrade Yet


The Problem Subagents Solve

A Claude Code session is bounded by its context window. Every tool call, every file read, every search result, every MCP response that lands in the main session consumes tokens that the model has to carry through the rest of the work. As the window fills, the agent's decision quality drops (see Chapter 1 for the agent-loop basics). The cheapest way to keep a session sharp is to keep junk out of it.

A lot of the work an agent does is junk from the main session's point of view. Reading twenty files to find the one that matters, scrolling through search results, dumping a full directory tree, running a grep that returns 400 lines — none of that information needs to live in the main conversation. The main session only needs the answer. Everything that produced the answer is overhead.

The subagent is the primitive that separates the work from the answer.

Key Insight from Ray Amjad: "The whole reason for us to have subagents is so we can delegate any noisy tool calling into a separate context window and only get the most relevant results back into the main session." -- Ray Amjad, Anthropic Just Dropped the Biggest Subagent Upgrade Yet

If you do everything inside the main session, the window fills with output Claude Code did not need. The session burns context faster, and as it fills, the model starts making worse decisions. Delegating noisy work to a subagent is how you keep the main window lean.


How a Standard Subagent Works

A subagent is not a thread or a coroutine. It is a separate Claude Code session that the main session spawns to do focused work, with its own fresh context window. The mechanics are straightforward:

  1. The main session decides to delegate. This can happen at the start of a conversation ("explore the codebase before we plan") or in the middle ("research this online before you continue").
  2. The main session writes a brief. A summary of what it has done so far, plus instructions for the subagent, gets passed across as the subagent's opening prompt. This brief is the entire context the subagent starts with.
  3. The subagent runs its own loop. It calls tools, reads results, iterates — all inside its own context window, isolated from the main session.
  4. The subagent returns a summary. Only the final, distilled answer flows back into the main session. The intermediate noise stays in the subagent's window and is discarded.

The two patterns Ray Amjad calls out by name are an explore subagent that looks through the codebase and a research subagent that searches online. Both follow the same shape: the main session asks a question, the subagent does the noisy work, only the conclusion comes back.

That separation is the whole point. The main session never sees the 400-line grep, the failed first search query, the four wrong files the explorer opened before finding the right one. It sees one paragraph: here is what I found.

The Compression Cost

The standard subagent is great when the question is self-contained. It is not so great when the work depends on subtle context that has accumulated in the main session over many turns.

Ray Amjad gives a concrete example. He had been doing design work with Claude Code — going back and forth about fonts, palettes, layout choices, a long thread of small decisions. After ~50,000 tokens of accumulated nuance in the main conversation, he asked Claude Code to spawn three parallel subagents to produce three design variations.

Each subagent got a 2,000-token brief summarising the conversation. The variations came back worse than expected, because the brief had compressed the nuance away. The subagents could not remember the detail of what had been discussed; they were working from a sketch of the conversation, not the conversation itself.

This is the trade-off baked into the standard subagent: isolation costs you context. When the main session's accumulated state is the thing that makes the work good, summarising it down to a prompt loses the work.


The Forked Subagent

In April 2026 Anthropic shipped the feature that closes that gap: the forked subagent. A fork inherits the parent's full conversation history up to the fork point, instead of starting fresh from a summary.

Key Insight from Ray Amjad: "The forked subagent has the entire prior history of the main conversation and instructions as well." -- Ray Amjad, Anthropic Just Dropped the Biggest Subagent Upgrade Yet

The structural difference between a standard subagent and a forked one is exactly that one line. Standard: starts from a brief. Forked: starts from the full history. Everything else — running its own loop, doing its own tool calls, returning a summary back to the main session — is the same.

There is a useful side-benefit: a forked subagent shares the main session's prompt cache. That means re-sending the inherited history is much cheaper than it would be to send the same tokens to a fresh subagent, because the cache absorbs most of the cost.

Where Anthropic Already Uses Forks

Forked subagents are not just a user-facing primitive. Anthropic has been quietly building recent Claude Code features on top of them:

  • /recap runs a forked subagent behind the scenes to summarise the current session.
  • /btw — the by-the-way side-channel for asking a quick question without polluting the main thread — also routes through a forked subagent.
  • Memory consolidation in the autodream feature uses forked subagents to fold session learnings back into long-term memory.

The pattern across all three is the same: take the full state of the main session, run a focused side-task against it, return only the result. That is what forks are for.

Enabling Forks

Forked subagents are gated behind an environment variable. Set it before launching Claude Code, or put it in your project's settings.json at the top level so every session has it on by default.

Once enabled, two things change. First, the /fork slash command becomes available in the session. Running /fork (or asking Claude Code in plain language to "spawn a forked subagent") spins up a background subagent that inherits the full conversation. Second, you can check on running forks at the bottom of the session — each one shows its current token count, which is a quick visual cue that it really did inherit the parent's state. If the main session has burned 180,000 tokens, the fork starts at roughly 180,000 tokens too.

You can give a running fork a follow-up prompt while it works. Press escape to get back to the input, type your follow-up, and it queues to the fork rather than the main session. When the fork finishes, its result flows back into the main session as a single message.


When to Fork, When to Stay Fresh

A fork is not always better than a fresh subagent. Picking the right one comes down to a single question: is the nuance of the main conversation so far useful to the subagent? If yes, fork. If the nuance would hinder or bias the work, don't.

Key Insight from Ray Amjad: "If it's not useful and could hinder or bias the subagent anyway, then don't use a forked subagent." -- Ray Amjad, Anthropic Just Dropped the Biggest Subagent Upgrade Yet

That framing has two halves. Both matter.

Fork when nuance carries the work

These are the situations where the main session's accumulated state is the thing that makes the answer good:

  • Long iterative design work. The 50,000 tokens of font, palette and layout discussion are exactly what makes a design variation feel coherent with the rest of the project. A summary loses it. A fork keeps it.
  • Plan-heavy work mid-session. You have walked Claude Code through the architecture, the constraints, the rejected approaches. Now you want a parallel exploration of a sub-problem. A fork starts that exploration with everything you have already thought through.
  • Verification of recent work. "Spawn a fork that draws a Mermaid diagram of the changes we just made" or "spawn a fork that searches online to check whether the premise we just decided on is correct" — both depend on the fork knowing what was decided. Both are tangent containment: the noisy work happens in the fork; only the diagram URL or the verification verdict comes back to the main session.
  • Tool-using side questions. /btw is a single-turn, read-only side-channel. A fork is the multi-step, tool-using equivalent. When a side question needs MCP calls, file reads or web searches, a fork is the right surface — and if you do not like the answer, you can rewind the conversation rather than carrying its noise forward.
  • Recommendation-style queries. Ray Amjad's example of asking a fork to query an MCP server for "videos to watch based on what we just covered" only works because the fork has the full context of what was just covered. A fresh subagent given a 2,000-token summary would recommend the wrong videos.

Stay fresh when independence matters

These are the situations where inheriting the main session would actively hurt:

  • Code review. If you fork to review code Claude Code wrote earlier in the same session, the fork sees its own prior reasoning and rationalises the code it wrote. The review is shallower because the reviewer already agrees with the author. A fresh subagent — one that has never seen the code before and has no investment in it — gives a sharper review.
  • Adversarial checks more generally. Anything where "what would a critic see" or "what is the opposing view from cold" is the question. A fresh context window is the closest you get to a second opinion.
  • Genuinely independent variations. If the point of spawning multiple subagents is that the variations should not be biased by each other or by the main session, a fork defeats the purpose. The original 2,000-token brief is the constraint, not the bug. (Note: this is the case Ray Amjad's design example would have benefitted from a fork — but only because he wanted the variations to share his accumulated taste, not to be independent of it. Get the framing right before you choose.)
  • Cheap throwaway research. If you only need the broad-strokes answer and the noise of the main session would actively confuse the search, a fresh subagent with a tight brief is faster and cleaner.

Mixing forked and fresh

You can spawn both at once. Ray Amjad describes asking Claude Code to spin up two subagents to research the same question — one forked, one fresh — and watching where they agree and disagree. He calls this parallel decision convergence: use the fork for the contextually-aware view and the fresh subagent for the cold view, and treat the overlap as higher confidence than either alone. The pattern is easy to recognise from outside the session: the fork starts at ~200,000 tokens, the fresh subagent starts at ~35,000.


Practical Guidance

A few rules of thumb that fall out of the above:

  1. Default to delegating noisy work, period. Whether forked or fresh, anything that produces tokens you will not need later belongs in a subagent. Keep the main session's window for decisions and conclusions.
  2. Pick fork vs. fresh on nuance, not on convenience. The wrong choice is silent — you just get a slightly worse answer. The right choice depends on whether the prior conversation helps or hurts.
  3. Use forks for tangents. The cleanest way to keep the main thread on track is to push side questions, verifications and "what about this premise" explorations into a fork that the main session does not have to read.
  4. Use fresh subagents when you want a second pair of eyes. Code review, devil's-advocate exploration, anything where you specifically want the agent not to be carrying your assumptions.
  5. Treat forks as cheap. Prompt-cache sharing means the marginal cost of inheriting the main session is small. The expensive part is the tool calls inside the fork, which you would have paid for either way.

The deeper pattern is that subagents — forked or fresh — are how you stop a long Claude Code session from drowning in its own exhaust. The main session is for thinking. Subagents are for everything that produces evidence you will need to think about. Parallel patterns and worktree-based isolation extend this idea further at the process level (see Chapter 3).

Chapter 3: Parallel Agents, Git Worktrees, and Clean Context

Source Videos: Parallel Claude Code + Git Worktrees: This Setup Will Change How You Ship, Automating Your AI Context


The Bottleneck Is You, Watching One Agent Work

A single Claude Code session ships fast. Two run in the same checkout and they wreck each other -- one rewrites a file the other was editing, tests fail for reasons unrelated to the work, the database belongs to whichever agent wrote to it last. So most users settle for the single-agent rhythm: kick off a task, watch it work, review, repeat.

Cole Medin's argument is that this rhythm caps your output at roughly 2x. To get to 10x, you need five or more agents in parallel without them stepping on each other -- and that means engineering the environment around the agents, not the agents themselves.

Key Insight from Cole Medin: "Sure, it's a good start. It'll at least 2x your output using AI coding assistance, but why stop there? Why not go for parallel agents to 10x your output or even beyond that?" -- Cole Medin, Parallel Claude Code + Git Worktrees

Going for 2x lets you cheat with manual coordination: switch tabs, merge by hand, fix conflicts as they come. Going for 10x forces you to build a system. The system Cole describes rests on five pillars, and worktrees are the load-bearing one.


Pillar 1: The Issue Is the Spec

Before you spin up parallel agents, you need parallel work. Cole's input into every implementation is a GitHub issue (or a Linear/Jira ticket -- the platform is interchangeable). The output is a pull request. Issues and PRs become the artifacts that drive the whole loop.

This works because you can scope a sprint's worth of work before you fan out. Cole's pattern is a fan-out: start with one Claude Code session whose only job is to break the work into well-formed issues, then send each issue to a separate agent for implementation.

Key Insight from Cole Medin: "Usually I'll work with my coding agent here to create these bugs and feature requests as issues in GitHub and then I'll go into my parallel development. So it's sort of like a fan-out pattern. You start with one coding agent session to split your work into these different issues, then you send them all out to be implemented at the same time." -- Cole Medin, Parallel Claude Code + Git Worktrees

Two side benefits fall out of this. First, the issue is the prompt -- once an agent is in its worktree, all you type is use the GitHub CLI to view issue 10 and help me make a plan for it. The agent fetches the spec; no copy-paste, no drift. Second, after the PR lands you can diff the issue against the PR and see where the agent deviated from plan.


Pillars 2 and 3: Worktrees Give Each Agent Its Own Repo

A git worktree is a separate working directory tied to its own branch but sharing the underlying .git object database with the main repo. From the agent's point of view, it's a clean checkout. From your point of view, it's lightweight -- no second clone, no duplicated history.

Claude Code supports worktrees natively. The flag is --worktree or its short form -w:

claude --worktree issue-10
# or
claude -w issue-10

The argument is the worktree's name. Cole uses the issue number (issue-10) as the convention because it ties the working directory back to the spec, but it can be any descriptive slug.

When you run this, Claude Code creates a folder under .claude/worktrees/ (e.g. .claude/worktrees/issue-10/) containing a complete copy of the codebase. The agent's working directory for that session is the worktree, not the main checkout. Five agents in five worktrees can edit the same file five different ways and never collide; merge happens later through the normal pull-request flow.

Key Insight from Cole Medin: "That is the beauty of worktrees -- now when our coding agent works on the feature here, it's not going to be overriding other features that other coding agents are building. Each one of them has their own environment." -- Cole Medin, Parallel Claude Code + Git Worktrees

Once five worktrees are running, the workflow is identical in each tab. Point each session at its issue, let plans come back, then say go ahead and implement. Whatever your usual planning/building/validating process is -- a slash command, a skill, GitHub spec-it, BMAD -- it slots in unchanged. Worktrees don't replace your process; they let you run it five times at once.

The output of each implementation is a pull request. That's deliberate: the PR is the handoff between implementation and validation, and validation needs a fresh context window.


Pillar 4: The Reviewer Never Sees the Writer's Chat

This is the context-engineering pillar. Even with worktrees solving file-state conflicts, parallel agents fail if you ask the writer to review its own code. The model has built up bias toward its own implementation across hundreds of turns and will sweep edge cases under the rug.

Key Insight from Cole Medin: "If you tell it to review the code in the same context window, it's like asking a kid to grade their own homework. They're going to sweep a lot of things under the rug and say it looks good because they just did the work." -- Cole Medin, Parallel Claude Code + Git Worktrees

The fix is mechanical: in each worktree, after the PR is open, run /clear to wipe the context window, then run a review command (Cole's is /review-pr) that pulls the PR diff, compares it to the originating issue, and spawns specialised subagents (see Chapter 2) to assess the implementation. The reviewer has no memory of how the code was written -- only what was written and what was asked for.

Cole stacks a second review for adversarial coverage: /codex adversarial review, which runs the same diff through a different coding agent (OpenAI's Codex, invoked via a Claude Code plugin). The point isn't that you need two providers to review every PR; it's that the second reviewer has no shared lineage with the writer. Different model, different context, different blind spots. In Cole's demo, the Codex cross-review flagged issues on four out of five PRs.

This is the practical face of context isolation. Each worktree has its own clean window for writing. After the write, you tear that context down and start a new one for review. The reviewer cannot inherit the writer's rationalisations because the reviewer was never in the room. (For why context state degrades as windows fill up, see Chapter 6 on context anxiety.)


Pillar 5: The Self-Healing Layer

The fifth pillar is a habit, not a tool. Whenever a review surfaces a bug, you don't just fix the bug -- you fix the system that allowed it. Cole calls that system the AI layer: the rules in CLAUDE.md, the skills, the custom slash commands, the subagent definitions, the workflows. Anything that shapes the agent's context.

The mechanism: at the end of a review, ask the agent (which has the review's full context) what could have prevented the issue at the rule, skill, or workflow level. Maybe the validate skill is too thin. Maybe the planning command isn't asking enough questions. Maybe CLAUDE.md doesn't document a convention the agent keeps violating.

This is what makes parallel agents sustainable. Fix bugs but never patch the source of bugs and you become the review bottleneck. Patch the system every time it leaks and the agents get progressively more reliable; reviews get progressively shorter.


When Worktrees Aren't Enough: End-to-End Validation

The first three pillars get you running five agents on five issues without file conflicts. The interesting failures start when you ask those agents to actually run the application end-to-end -- start the dev server, hit it as a user would, exercise the database -- inside their worktrees. Three problems surface immediately.

Port Conflicts

Five copies of the same web app, all defaulting to port 4000, won't coexist. Cole's startup command hashes the worktree name into a deterministic port in a known range. With base 4000, his worktrees end up on 4161, 4107, and so on. Each agent knows its port, browses to its port, never collides with siblings. The hash makes the port stable across restarts -- which matters for caching and bookmarks.

Dependency Install Time

A worktree is a fresh checkout, and a fresh checkout has no node_modules. If every agent burns its first ten minutes (and a chunk of context) on npm install, parallelism is no longer a win. Cole's wrapper script -- w.sh on Unix, the equivalent .ps1 on Windows -- installs dependencies up front, before the session starts. By the time the agent is reading the issue, the environment is ready. The same script creates the worktree, so it doubles as a portability layer: any coding agent that doesn't natively support worktrees can be wrapped in it and get the same isolation Claude Code gets for free.

Database State

If five agents share one database, the first to insert a row breaks every other agent's tests. You need a worktree for the database too. Cole uses Neon's branching: each worktree gets its own database branch, copied from production at branch creation, isolated thereafter. When the PR merges, the branch is discarded.

Key Insight from Cole Medin: "Not only do we need a worktree for the codebase, but we need something like a worktree for the database as well." -- Cole Medin, Parallel Claude Code + Git Worktrees

If you don't use Neon, the local-and-free version is a fresh SQLite file per worktree. The principle is the same: every worktree owns its data; nothing leaks.

Token Blowout

End-to-end validation -- planning, implementing, running, testing, reviewing, cross-reviewing -- burns tokens. Five of those in parallel burn five times as many. The mitigation is to drop the model where reasoning isn't needed. Inside any session, /model switches between Opus, Sonnet, and Haiku. Codebase analysis, web research, and even code review can run on Sonnet or Haiku without losing much. Subagents and skills can be pinned to cheaper models when invoked, so the expensive model is reserved for the work that actually needs it.


Why Clean Context Is the Real Story

Worktrees solve file-state conflicts. But every pillar in Cole's system is also a context-engineering decision:

  • Issues as input mean the spec arrives as a structured GitHub fetch, not as a long human prompt that drifts during conversation.
  • One worktree per agent means each context window holds only the files relevant to its issue, not the noise of four other features.
  • Fresh context for review means the reviewer's window contains the diff and the issue, not 200 turns of implementation rationalisation.
  • Self-healing the AI layer means context that proved insufficient gets patched once and reused.

The unifying principle: each agent gets a context window that is small, structured, and built on purpose. That's what makes parallel runs converge instead of diverge.

NLW makes the same point from the opposite end. OpenAI shipped a memory feature for Codex called Chronicle -- a background agent that takes screenshots and builds running context from them. Their internal name was "telepathy."

Key Insight from The AI Daily Brief: "With Chronicle, Codex can better understand what you mean by this or that, like an error on screen, a doc you have open, or that thing you were working on 2 weeks ago." -- Nathaniel Whittemore quoting OpenAI, Automating Your AI Context

Strip away the screen-capture privacy debate and the claim is the same one Cole is making with worktrees: agents work better when their context is curated for them rather than reconstructed every turn. The Cole-style worktree script is exactly that kind of UX upgrade -- the agent could install dependencies, create a database branch, and pick a port itself, but every one of those steps would cost context. Pre-baking them into a script preserves the window for the actual work.


When Parallelism Helps, and When It Doesn't

Parallel agents are not free. They cost five times the tokens, five times the worktree disk, and -- if you're not careful -- five times the review effort. The pattern earns its keep when:

  • You have a backlog of well-scoped, independent issues. Five issues touching different parts of the codebase parallelise cleanly. Five rewriting the same module do not.
  • Your validation loop is mostly automated. If a human has to hand-test every PR, the bottleneck migrates to you and the fan-out collapses to a queue.
  • The work is interruptible and self-healing. Pillar 5 -- patch the AI layer when something breaks -- is what keeps reliability from decaying as you add agents.

Parallelism does not help when the work is exploratory, when each step depends on the previous step's output, or when you don't yet trust the agent enough to merge without reading every line. In those cases, one focused session beats five distracted ones.


The pattern in this chapter scales output horizontally. The next chapters move in the opposite direction: composing an agent's capabilities into a coherent surface (Chapter 4 on skills), orchestrating multi-step workflows (Chapter 5), and surviving long-running runs without losing the thread (Chapter 6).

Chapter 4: Skills as Agent Surfaces — The GStack Pattern

Source Videos: How to Make Claude Code Your AI Engineering Team


Skills Are Where Narrow Agents Live

A skill in Claude Code is a markdown instruction-set plus assets, loaded on demand. Generic mechanics are covered in guide_v2/07-skills-and-plugins.md. This chapter is about a different question: once you have skills, what should you put in them?

The answer that has emerged in practice is roles. A skill is the natural surface for a narrowly-scoped agent. Each skill encodes one specialist — a YC partner running office hours, a designer producing visual variants, a reviewer catching bugs, a QA engineer driving a real browser. The skill is loaded only when its role is needed, brings its own instructions and assets, and behaves like a focused agent for one task. Subagents (see Chapter 2) provide context isolation for one-off delegations; skills provide reusable, named roles you invoke by command.

The clearest worked example of this pattern is GStack, an open-source toolkit Garry Tan published three weeks before the talk. It is not a framework, not a wrapper, not a fork. It is a bundle of skills.

Key Insight from Y Combinator: "It turns out the way to get agents to do real work is the same way humans have always done it — as a team, with roles, with process, with review." -- Garry Tan, How to Make Claude Code Your AI Engineering Team


What GStack Is

Tan, an engineer-turned-VC who ran Posterous before joining YC, started using Claude Code in January 2026 after hearing Andrej Karpathy and Boris Cherny say they had stopped writing code by hand. By his own count he has coded more in two months than in all of 2013, and rebuilt most of Posterous — a product that originally took two years, ten engineers, and ten million dollars — alone.

GStack is the codification of how he works. It is an open-source repository of skills that turns Claude Code into a small engineering team. At the time of the talk it had over 70,000 GitHub stars (more than Ruby on Rails) and shipped with 28 commands. It works with Claude Code, Codex, or Cursor — the contribution is the skill bundle, not the host.

Key Insight from Y Combinator: "Out of the box, the model wanders. It doesn't know your data well, so it guesses. And guessing at that scale is how you get plausible-looking code that silently breaks. The bottleneck is not the model's intelligence. As long as you set the models up right, they are already smart enough to do extraordinary work on your codebase." -- Garry Tan, How to Make Claude Code Your AI Engineering Team

The design principle Tan names is "thin harness, fat skills." Don't wrap the model in a heavyweight framework. Don't try to be clever about prompt routing or tool plumbing. The harness — Claude Code itself — should be near-invisible. The intelligence and the opinions live in the skills.


The Roles in the Bundle

GStack ships specialist skills that mirror the roles a real product team would have. Each one is invoked by a slash command and carries its own instructions, prompts, and process.

Office Hours — The Product Critic

/office-hours is the entry point. It is modelled on the office-hours format YC partners run with founders, and Tan describes it as "a distilled 10%-strength version of what we do at YC every day," compressed from tens of thousands of partner-hours.

The skill opens with six forcing questions designed to reframe a product idea before any code is written. In Tan's live demo — a tax-document aggregator that pulls 1099s out of Gmail — the skill's first question was the one that determines everything else: "What's the strongest evidence that you have that someone actually wants this?" It then probed the pain (real, but only friction), surfaced the existing competition (TurboTax and HR Block already have 1099 import; Plaid connects to banks), and pushed Tan to articulate why those weren't solving the problem.

Several rounds in, the skill was no longer just interrogating; it was reframing. It surfaced a wedge strategy — start with document aggregation as the hook, then expand into matchmaking with tax preparers as the actual business model — and proposed three approaches with explicit effort/risk tradeoffs. The output is a design doc that has already survived adversarial review, with issues caught and auto-fixed before a human ever sees them.

Key Insight from Y Combinator: "If I just typed the original thing — 'I need to go and find my 1099s' — it'll just literally do it. But it won't think about who's the user, what is this, what is the business model, who wants this, what's the pain point. This is the kind of stuff we get to do every day with founders in office hours, and we're pretty good at it. But so is this skill." -- Garry Tan, How to Make Claude Code Your AI Engineering Team

Tan reports that roughly one in three runs of /office-hours ends with him deciding the idea isn't worth building. That is the point. The skill is a feasibility filter that runs before the model starts coding.

Design Shotgun — Visual Brainstorming

After the design doc is approved, /design-shotgun generates multiple visual variants of a UI in parallel. In the demo it produced three options for the main dashboard — a "command center" view, a "friendly progress" view, and a "split view" — by farming the work out to OpenAI Codex for image generation. Tan picked option B, the friendly card-based variant with progress rings, and the skill locked it in. If none of the variants land, you regenerate with feedback.

This is one of several design skills in the bundle. They run after the plan is settled and before implementation, so the visual direction is committed before any component code exists.

Adversarial Review and CEO Review

/office-hours itself runs a multi-step adversarial review against the design doc — putting the proposal through the paces, catching gaps (no failure handling, no privacy section, an unsolved 2FA hand-off), and auto-filling fixes where it can. In the demo this lifted the doc's score from 6/10 to 8/10 and resolved 16 issues.

For users who don't want to be in the weeds on every step, /auto-plan chains CEO review, engineering review, design review, and developer-experience review using Tan's default recommendations — "these are sort of programmed to be what I would do if I were you." It is the same pipeline as the manual sequence, packaged as one command.

Review — Staff-Level Code Critique

After Claude Code writes the code from the approved plan, /review runs a staff-level bug-catching pass. This is post-implementation review against the spec, looking for issues that weren't visible during planning. It is distinct from the adversarial review at the design stage; one critiques the doc, the other critiques the diff.

GStack Browser — Real Playwright at the CLI

The piece Tan is proudest of is a wrapper around Playwright and Chromium exposed as /qa and /browse. He built it because the existing Claude-in-Chrome MCP server was, in his words, one of the worst pieces of software he had ever used — context-bloated, slow, and unreliable. So he wrapped Playwright at the CLI level. The result is a real headed-or-headless browser that any agent can drive directly: take screenshots, click, fill forms, download files, run regression tests, diff CSS.

Key Insight from Y Combinator: "I was amazed that I could use all of my other skills in GStack to create the QA and browse tool. I basically wrapped Playwright at the CLI level. And now your Claude Code, and any agent, can actually just use the browser." -- Garry Tan, How to Make Claude Code Your AI Engineering Team

The point is not the Playwright wrapper itself. It is that Tan used GStack's other skills to build a new GStack skill. The toolkit composes with itself.

The motivation is also instructive: QA was the bottleneck that broke his flow. Once planning, design, and implementation were all delegated, he found himself sitting there manually clicking through the result. "Probably the least fun part of software development." So he encoded that role too. Any role you'd want a teammate for, you can encode as a skill.

Ship — The Final Gate

/ship is the last step before a PR lands on main: a check that the branch is actually ready. It is a small skill, but it closes the loop from idea to merged code without a manual step in between.


Why Skills, Not One Mega-Prompt

You could in principle stuff "be a YC partner, then a designer, then a code reviewer, then a QA engineer" into a single CLAUDE.md or one giant system prompt. People try this. It does not work well, and the GStack architecture explains why.

Skills load on demand. Office hours is a heavy skill — it carries thousands of hours of distilled product judgment, six forcing questions, an adversarial review pipeline. You don't want that loaded when you ask Claude to fix a CSS bug. Each skill enters the context window only when its slash command is invoked, and leaves when its job is done. (See Chapter 3 on context engineering for why this matters.)

Roles need different process, not just different prompts. Office hours runs an interactive Socratic loop. Design shotgun farms work out to Codex for image generation. Review is a one-shot staff-level critique. QA drives a real browser. These are not variants of one prompt; they are different control flows, different tool sets, and different success criteria. A skill is a unit big enough to hold a role's process; a paragraph in a system prompt is not.

Roles compose into a sequence. GStack's value is not any one skill — it is the pipeline. Office hours produces an approved design doc. Design shotgun consumes it and produces locked-in variants. Implementation consumes both. Review consumes the diff. QA consumes the running build. Ship consumes everything. Each handoff has a defined input and output. A mega-prompt has no seams; a skill bundle has them by construction.

Skills are forkable. Because each role is a separate file, you can swap one out, override a step, or add a new specialist without touching the rest. Tan added the QA browser skill weeks after the rest of GStack already existed, by composing existing skills. A monolithic prompt does not extend that way.


The Sprint Loop

Tan's day-to-day workflow is the bundle in motion. He runs ten to fifteen Claude Code sessions in parallel, often three or four against the same project on different worktrees (see Chapter 3). For each work item — a new idea, a bug report from X, a feature request — he:

  1. Clicks the plus icon in Conductor (a git-worktree manager he uses) to spawn a new tree.
  2. Runs /office-hours to interrogate the idea.
  3. Runs CEO review, engineering review, adversarial review.
  4. Runs /auto-plan if he wants the sensible defaults rather than every step.
  5. Approves the plan. Claude Code writes the implementation.
  6. Runs /review for the bug pass, /qa for the browser pass.
  7. Runs /ship to land the PR.

Tan reports users spend 80–90% of their time in office hours, CEO review, plan, and auto-plan — not in implementation. The model is fast at writing code; the leverage is in deciding what to build and whether the plan is right before code is written. The skills front-load that judgment.

This is also where Codex earns its keep in his stack. Tan's mental model is that Opus 4.6 is "ADHD CEO — the guy you want to get a beer with, who has a billion ideas", while Codex is the "autistic CTO" you call in when the going gets tough. GStack will hand off to Codex for hard bugs. The skills route work between models based on what each is good at.


The Broader Pattern

The narrow lesson is: install GStack, run office hours, ship more.

The broader lesson is the reusable pattern underneath. A skill is a unit of role. If you can describe what a teammate would do — a security reviewer, a copy editor, a migration planner, a release-note writer, a dependency auditor — you can encode that role as a skill. Give it instructions, give it any assets it needs, give it a slash command, and Claude Code will load it on demand and behave like that teammate for one task.

GStack is one instance of the pattern, focused on the YC-style product-and-engineering loop. Other instances will look different. A compliance-heavy team might bundle skills for threat modelling, license review, and data-handling audits. A research team might bundle skills for literature search, experiment design, and result write-up. The architecture is the same: thin harness, fat skills, roles compose into a pipeline.

Subagents (Chapter 2) and worktrees (Chapter 3) supply the runtime — context isolation, parallelism, clean state. Skills supply the organisational chart. Together they let one human run a team.

Key Insight from Y Combinator: "I run 10 to 15 parallel Claude Code sessions all at the same time. I might in one session be running office hours on a brand new idea... I can do 10, 15, 20, sometimes 50 PRs in any given day." -- Garry Tan, How to Make Claude Code Your AI Engineering Team

The team is not metaphorical. The roles are real, the handoffs are real, the review is real. The only thing that has changed is that the teammates are skills.

Chapter 5: Orchestration Patterns

Source Videos: AgentCraft: Putting the Orc in Orchestration — Ido Salomon, Full Walkthrough: Workflow for AI Coding from Planning to Production — Matt Pocock


Why a Single Agent Loop Is Not Enough

A single Claude Code session running its agent loop will get you a long way. It will read files, run bash, write code, iterate on tests. But there is a ceiling, and it is the same ceiling whether you are working on a small feature or a large one: one loop holds one stream of context, makes one decision at a time, and depends on a single human watching one terminal.

Matt Pocock frames the constraint at the model level. Every token you add to a context window scales the attention relationships quadratically. The model has a "smart zone" near the start of a session and a "dumb zone" that begins somewhere around 100K tokens regardless of whether the window is 200K or 1M. As the window fills, decisions degrade. (See Chapter 1 and Chapter 3 for the underlying mechanics — context rot and context anxiety are owned there.) The practical implication for orchestration: if you want to do a large task well, you cannot do it in one context.

Ido Salomon frames the same problem from the human side. Spinning up more agents is trivial. Watching them is not.

Key Insight from AI Engineer: "We are the bottleneck in orchestrating all of these agents." -- Ido Salomon, AgentCraft: Putting the Orc in Orchestration

Orchestration is the discipline of composing multiple agent loops so that each loop runs in its smart zone, the human is not the rate-limiter, and the work that survives any single loop is durable. This chapter catalogues the structural patterns. Long-running durability — surviving crashes, recovering from compaction, the blueprint pattern — belongs to Chapter 6.


The Building Blocks

Before the patterns, two vocabulary items that recur across both source talks.

Human-in-the-loop vs. AFK tasks. Pocock divides every step in a coding workflow into two categories. Human-in-the-loop tasks require a person sitting at the terminal, typing answers, making judgement calls. AFK tasks ("away from keyboard") are work the agent can grind through unattended. Orchestration is, in large part, the art of identifying which is which and shaping the workflow so AFK work is queued up while the human focuses on the irreducibly-human parts.

Key Insight from AI Engineer: "There are human-in-the-loop tasks where a human needs to sit there and do it. And there are AFK tasks where the human can be away from the keyboard and it doesn't matter. Implementation can be turned into an AFK task, but planning, this alignment phase, has to be human in the loop." -- Matt Pocock, Full Walkthrough: Workflow for AI Coding from Planning to Production

Planning artefacts. A planning artefact is a written file — a PRD, a Kanban board, a list of issue files — that an agent produces during human-in-the-loop work and that subsequent agents read as input. The artefact is the handoff. Because it lives on disk it survives any single agent's context window being cleared. Both Pocock's PRD-to-Kanban flow and Salomon's "campaign" pattern are built on durable planning artefacts.

With those in hand, here are the structural patterns themselves.


Pattern 1: The Linear Pipeline (Planner → Builder → Reviewer)

The simplest orchestration shape is a sequence of distinct agent loops, each with one job, each starting from a clean context, each consuming the artefact the previous loop produced. Pocock's full workflow is a worked example.

Stage 1 — Alignment, not specification

Pocock starts every feature with a grill-me skill: a tiny prompt that tells the agent to interview him relentlessly about the design until they share a model of what is being built.

The point is not to produce a document the agent will then compile into code. Pocock is explicit that he does not believe in the "specs-to-code" pipeline where you keep editing the spec when the code is wrong. The point is alignment. The conversation log itself becomes an asset because by the end of it, the agent and the human have the same picture in mind. Pocock invokes Frederick P. Brooks's "design concept" — the shared mental model that the participants in a project converge on — and says reaching that with the agent is what every other phase depends on.

A grilling session can run 20, 40, even 80 questions. Each question the agent asks comes with a recommended answer; the human accepts, overrides, or expands. The skill itself is short — interview the user, walk down each branch of the design tree, resolve dependencies one by one — but the resulting transcript is dense with decisions.

Stage 2 — The destination document (PRD)

Once alignment is reached, a separate skill — write-a-PRD — turns the conversation into a Product Requirements Document. Problem statement. Solution. User stories. Implementation decisions. Testing decisions. Out-of-scope items. Pocock does not read this document carefully. He has already aligned with the agent in the grilling session; the PRD is the agent summarising work they did together, and LLMs are good at summarisation. Reviewing it would be checking the agent's summarisation of a conversation he was already in.

The PRD is a planning artefact. It outlives the session that produced it.

Stage 3 — The journey document (Kanban board)

A third skill — PRD-to-issues — splits the PRD into a set of independently-grabbable issue files with explicit blocking relationships. Pocock prefers this over a sequential multi-phase plan for a structural reason: a sequential plan can only be picked up by one agent. A Kanban board with explicit dependencies is a directed acyclic graph, and any node whose blockers are satisfied can be claimed in parallel.

Crucial sub-rule inside this stage: vertical slices, not horizontal ones. Pocock's observation is that AI loves to code horizontally — all the schema first, then all the API, then all the front end. The problem is that you get no integrated feedback until the third phase. The pragmatic-programmer term he reaches for is "tracer bullets": every issue should cut through enough layers (schema, service, minimal UI surface) that the work is testable end-to-end at the moment it lands. Without that constraint, fan-out parallelism is impossible — the agents working in parallel would all be operating in the same horizontal layer with nothing to integrate.

When the agent's first attempt produces a horizontal first slice, Pocock corrects it: "the first slice is too horizontal." The Kanban skill is a guideline, not a guarantee, and the human has to keep their eye on the shape of the journey even though they have delegated its production.

Stage 4 — The implementer (AFK)

Now the human leaves the loop. Pocock's Ralph script — named, jokingly, after Ralph Wiggum and the practice of telling the agent to "just make a small change" until done — is a bash loop that loads all open issue files into context, picks the next AFK-tagged task whose blockers are clear, runs Claude Code with --permission-mode acceptEdits, then exits and re-runs. The implementer agent is told to use TDD (red-green-refactor: write the failing test first, make it pass, refactor) and to run the type checker and test suite as feedback loops at the end.

The implementer is sandboxed. Pocock runs his in Docker containers. The reason is partly safety, partly that the agent has full edit permission and you want a blast radius.

Stage 5 — The reviewer (clean context)

After implementation, a separate agent reviews. Not the same agent. A new one, with a fresh context window. The argument is structural: an agent that has just spent its smart-zone tokens implementing a feature will be reviewing its own work in the dumb zone. A reviewer launched from a clean context window can see the diff with full attention.

This is the most quietly load-bearing argument in the entire chapter. The reviewer is a separate agent because review-as-checkpoint requires fresh context. It is not a stylistic preference. It is a consequence of how attention degrades with context length.

Key Insight from AI Engineer: "If you get it to sort of try to do its reviewing, it's going to be doing the reviewing in the dumb zone. And so the reviewer will be dumber than the thing that actually implemented it. Whereas, if you clear the context, then you're essentially going to be able to just review in the smart zone." -- Matt Pocock, Full Walkthrough: Workflow for AI Coding from Planning to Production

The review pass is also where the human's coding standards get pushed. Pocock distinguishes push from pull: skills sit in the repo and get pulled by the implementer when relevant; for the reviewer, the standards are pushed unconditionally so the reviewer compares the diff against them every time.

Stage 6 — Human QA

The human comes back at the end to actually use the feature. This is the step that imposes taste back onto the codebase. Pocock is blunt that workflows which try to automate every stage — including the QA — produce apps that lack taste, and that the QA pass is also where new issues for the Kanban board get generated. The board keeps growing as the human discovers things during QA, which loops back into Stage 4.


Pattern 2: Fan-Out / Fan-In (Parallel Implementers + Merger)

Once the Kanban board is a DAG of independently-grabbable issues, the implementer stage stops needing to be sequential. This is the second pattern: fan out parallel implementers across independent issues, then fan back in through a merger agent.

Pocock's tool for this — Sand Castle, a TypeScript library he built — encodes the shape directly. The flow:

  1. A planner agent reads the backlog and selects a set of issues whose blockers are all satisfied. This is the fan-out gate; it decides which work is parallelisable right now.
  2. For each selected issue, Sand Castle creates a sandbox: a git worktree on a fresh branch, wrapped in a Docker container. (Worktree mechanics are owned by Chapter 3; the relevant point here is that each implementer gets an isolated working directory so they cannot collide.)
  3. An implementer agent runs in each sandbox, given only that issue's spec and the relevant slice of the codebase. Multiple implementers run concurrently.
  4. A reviewer agent runs after each implementer's commits land on its branch.
  5. A merger agent takes the resulting branches, merges them into the trunk, and resolves any type errors or test failures that arise from the merge.

Two notes on this shape that matter for orchestration in general:

Vertical slices are the enabling constraint. Fan-out only works if each parallel issue produces an end-to-end-testable change. If two implementers were both editing the schema layer in parallel, the merger would have nothing to verify — there is no integrated artefact until the API and UI layers also land. Tracer-bullet issues guarantee each branch is independently shippable, which is what makes parallel safe.

Different models for different roles. Pocock uses Sonnet for implementation and Opus for review. The reviewer needs the smarts; the implementer is doing more mechanical work in a constrained scope. This is a property of the orchestration shape: because each role is a separate agent loop, you can pick the model per role rather than picking one model for the whole job.

The cost of this pattern is review volume. The human is no longer reviewing one PR at a time; they are reviewing bundles of work that landed in parallel. Pocock has no clean answer to this — he says directly that anyone delegating implementation to AI is going to be doing more code review than before, and there is probably no way around it. Tooling for review is where the leverage is.


Pattern 3: Reviewer-as-Checkpoint

The reviewer is not just a stage in the linear pipeline — it is a pattern in its own right and worth naming separately.

A reviewer-as-checkpoint is a separate agent loop, with a clean context, whose only job is to inspect the artefact a previous loop produced and decide whether it passes. Three properties make it work:

  1. Context isolation. A new context window means the reviewer is operating in the smart zone regardless of how much the implementer chewed through. (See Chapter 2 for the subagent semantics that make this cheap.)
  2. Pushed standards, not pulled ones. The reviewer receives the coding standards in its initial prompt. It does not have to choose to load them. Combined with the diff, it has everything it needs.
  3. A different model is fine, often better. Because the reviewer is a separate loop, you can run a stronger model than the implementer used. Pocock's Opus-on-review-Sonnet-on-implement split exploits this.

Reviewer-as-checkpoint generalises beyond code review. Anywhere the workflow has a "did the previous step actually do what it was supposed to?" question, you insert a reviewer agent rather than asking the working agent to self-assess. Self-assessment is review-in-the-dumb-zone by definition.


Pattern 4: Planner + Executor (the Campaign Orchestrator)

Salomon's AgentCraft talk surfaces a different orchestration shape. He starts in the same place as Pocock — visibility, side-by-side parallel agents, muscle memory for cycling between them — but quickly hits the limit that Pocock hits too: the human is still the bottleneck. Even with great visibility, cycling between agents to approve plans and answer questions saturates attention.

Salomon's response is to push the orchestration role itself onto an agent. Instead of the human acting as planner and dispatcher, a campaign orchestrator agent owns the decomposition.

The shape:

  1. The human writes a one-line campaign brief: I want this feature.
  2. A container is spun up. Salomon does not care what happens inside — it is a sandbox.
  3. The campaign orchestrator inside the container decomposes the task, writes a plan, and presents the plan back to the human for approval.
  4. Once the plan is approved, the orchestrator dispatches the work — the orchestrator does the babysitting, not the human.
  5. When work is ready, the human reviews bundles: change lists, screenshots, video evidence, all collected by the orchestrator.

The structural difference from Pattern 1 is that planning has been delegated. The human is no longer writing the PRD or curating the Kanban board; the orchestrator does both, and the human's role compresses to brief-then-review.

Key Insight from AI Engineer: "We're actually moving more of the effort only to the planning phase or the review phase. And once we have that, we reach a point where we can just say, why is it my ideas? Why can't I tell it to run in a cron job, go to Twitter every day, scan cool ideas and just implement them?" -- Ido Salomon, AgentCraft: Putting the Orc in Orchestration

This is the more aggressive end of the orchestration spectrum. It works for tasks where the cost of a wrong decomposition is low — refactors, tests, follow-up cleanup, the kind of work Salomon explicitly cites as quest material the agent fishes for itself. For high-stakes feature work where the design concept actually matters, Pocock's grilling-session-up-front shape gives the human more control.

The two are not in opposition. They are points on a control-vs-autonomy axis, and the right answer depends on what is being built.


Pattern 5: Visibility as Substrate

Salomon's central thesis is that orchestration is not just a topology of agents — it is the human interface for seeing that topology. Without visibility, no orchestration shape works, because the human still has to know which agent needs what, when.

AgentCraft's visualisation is a real-time-strategy game projection of the file system. Directories are regions on a map; files are rooms; agents are units that occupy rooms while they work. From this projection three orchestration affordances drop out almost for free:

  • Lineage. Every file shows which agent edited it and when. Every change is traceable to a unit.
  • Collision detection. A heat map over the map flags files multiple agents are about to touch. Conflicts can be prevented proactively rather than resolved at merge time.
  • Hotkey cycling. Borrowing from RTS muscle memory, the human cycles between agents that need approval or input with a single keystroke. Reaction time, not attention budget, becomes the limit.

This is not a separate pattern in the same sense as planner-or-fan-out. It is the substrate the others run on. You cannot fan out across ten implementers if you cannot see what any of them is doing without context-switching ten terminals. The lesson is general: whatever orchestration shape you adopt, the visibility surface it requires is part of the design, not an afterthought.

Salomon's other observation is that visibility composes across humans, not just across agents. AgentCraft workspaces let two humans see each other's running agents and hand off work between them. The chat surface attached to the workspace carries human-to-human messages, human-to-agent prompts, and agent announcements ("I'm starting on file X"), so other agents in the workspace know not to collide. Multi-human orchestration becomes possible because the orchestration substrate already had to track who-touched-what for the single-human case.


How the Patterns Compose

Real workflows mix the patterns. A reasonable composite:

  • A campaign orchestrator (Pattern 4) takes a one-line brief and produces a Kanban board.
  • The board is fanned out (Pattern 2) into parallel implementer agents in worktree-sandboxed branches.
  • Each branch passes through a reviewer-as-checkpoint (Pattern 3) before merge.
  • A merger agent (the fan-in side of Pattern 2) consolidates branches into trunk.
  • The whole thing runs over a visibility substrate (Pattern 5) so the human can intervene when the orchestrator escalates.

The linear pipeline (Pattern 1) is what you get when this whole composite is collapsed to one branch and one implementer. It is the simplest case, not a different thing.

The shape that survives across all of them is: distinct loops, each with one role, each with a clean context, communicating through durable artefacts on disk. That is the orchestration grammar. Specific topologies are choices within it.


What This Chapter Does Not Cover

Two adjacent topics belong elsewhere.

Long-running durability. Once an AFK agent runs for 30 minutes, an hour, four hours, the question stops being "what shape is the orchestration" and starts being "how does it survive a crash, a context reset, a compaction event." The blueprint pattern, recovery state, and the engineering of long-running agents are owned by Chapter 6. The patterns in this chapter are about composing loops; Chapter 6 is about keeping any one loop alive across resets.

Subagent mechanics. Pattern 3's "clean context" assumes you know how subagents isolate their context windows. The mechanics — how to spawn one, how forked subagents differ, when the cost is worth it — are owned by Chapter 2. The orchestration patterns here use subagents as a primitive without re-explaining them.

When in doubt, the test is structural: if the question is what shape the agents are arranged in, it is this chapter. If it is how a single agent stays alive over time, that is Chapter 6. If it is how one agent isolates a piece of work, that is Chapter 2.


The Discipline

Salomon titles his talk "Putting the Orc in Orchestration." The pun is more pointed than it looks. The skills he is describing — managing dozens of units, visibility at a glance, hotkey cycling — are not new. They are the skills of an RTS gamer applied to a workplace where they were not previously needed. Software engineers were not, until recently, in the business of running ten reckless employees in parallel.

The patterns in this chapter are the discipline that makes those skills tractable. Decompose the work into slices small enough that each one fits in a smart zone. Hand decomposition to a planning agent when the work permits. Run implementers in parallel across vertical slices. Always review with fresh context. Treat the artefacts on disk — PRDs, issue files, branches — as the durable spine of the workflow, not the agent's working memory.

A single agent loop hits a ceiling. Composing loops well is how you raise it.

Chapter 6: Long-Running Agents

Source Videos: Anthropic Just Dropped the New Blueprint for Long-Running AI Agents, Matt Pocock — AI Coding Workflow


The Failure Mode: Six-Hour Tasks That Quit at Hour Three

Anthropic recently published a harness-design post describing two long-running coding sessions: a 2D retro game engine built over six autonomous hours, and a digital audio workstation built in the browser over roughly four hours. These are the kinds of tasks where real value lives. A one-sentence prompt becomes a working application. Outside coding, the same horizon applies to compliance audits, content pipelines, risk analyses — work that would normally take a person a week to a month.

The naive way to attempt this is to hand the goal to an agent and let it run. It does not work. Without structure, an agent given a six-hour task will exhibit characteristic failures:

  • It tries to one-shot the entire build in a single session and runs out of context halfway through.
  • It leaves work half-finished and undocumented, with no way for a successor to pick it up.
  • It declares the job done well before it actually is.

The third failure is the most insidious because it looks like success. The agent reports completion, the run terminates, and the operator only discovers later that whole sections of the spec were skipped. Anthropic gave this behaviour a name.

Context Anxiety, Defined

Context anxiety is a behavioural change that happens as the context window fills. It is not the same thing as context rot.

Context rot — covered in Chapter 3 — is a quality phenomenon. As the window fills past roughly half its capacity, the model's outputs degrade: it reasons less precisely, misremembers earlier decisions, makes worse choices. Rot is what the model does. Anxiety is what the model decides to do next.

Key Insight from The AI Automators: "As the context window fills up, the models don't just lose coherence, they actually change their behavior. They start wrapping up the conversation prematurely. They rush through steps and declare that things are done when they're not actually done." -- The AI Automators host, Anthropic Just Dropped the New Blueprint for Long-Running AI Agents

Anyone who has held a long single-thread conversation with an LLM has felt this. Replies get shorter. Hedges multiply. The model starts looking for an exit. In a chat that may be merely annoying. In an agent that is supposed to keep working for another three hours, it is fatal — the agent declares the milestone complete, hands back control, and the run ends with the spec half-built.

The natural first instinct is to fight context anxiety with context compaction: have the model summarise the conversation so far into a smaller payload and continue. Anthropic found this insufficient.

Key Insight from The AI Automators: "Anthropic found that even with context compaction, models like Sonnet 4.5 still showed context anxiety and tried to finish early. And the reason for that is you're not starting with a clean slate." -- The AI Automators host, Anthropic Just Dropped the New Blueprint for Long-Running AI Agents

That last sentence is the mechanism. Compaction shrinks the token count but it does not change the agent's situation: it is still inside a session that has been running for hours, with a summary blob that signals "we have already done a lot of work." The model's behaviour is conditioned on the trajectory the summary describes, not the bytes that summary occupies. To remove the anxiety you have to remove the trajectory. That means starting over.

The Blueprint: Reset, Read, Hand Off

Anthropic's durability pattern is a structured context reset. It has three moving parts:

  1. An initializer agent runs once at the start of the run. Its job is to set up the environment, decompose the goal into discrete features or sprints, and write a progress file that lists what needs to be built and in what order.
  2. A coding agent does the actual work — but only one feature at a time. When it finishes a feature, it commits to git, updates the progress file with what was done and any handoff notes, and exits.
  3. A fresh coding agent spawns for the next feature. It starts from a clean context window, reads the progress file to learn what state the project is in, runs the existing tests to verify the previous agent's work, and only then begins its own task.

The progress file is the load-bearing component. It is the only thing that survives between sessions. Everything else — the conversation history, the half-formed plans, the model's growing exhaustion — is discarded at the reset boundary. Because each successor begins with a clean window, no agent in the chain ever feels late, none of them rushes the close, and the work decomposes into a series of fresh starts rather than one degrading marathon.

There is no separate "recovery" mechanism. The progress file is the recovery checkpoint. If a session crashes, the next agent reads the file, runs the tests to confirm what is real, and resumes. Git commits per feature give the project a second checkpoint at the code layer. Together they make the run resumable by construction.

This pattern long predates Anthropic's writeup. Jeffrey Huntley's "Ralph Wiggum loop" — running an agent inside a loop with an external check the model cannot lie to (a linter, a type checker, a test) — encodes the same core idea: the agent self-evaluates poorly, so you put a deterministic stop condition outside the loop. Spec-driven frameworks like BMAD, SpecKit, and OpenSpec extend the idea further by writing the requirements down before the loop begins, so the agent cannot quietly redefine "done" to mean "whatever I happened to build."

The blueprint also wires in adversarial evaluation — a planner agent up front, and a separate evaluator agent that judges the generator's work — but those are orchestration concerns, covered in Chapter 5. For long-running durability what matters is the reset-and-handoff spine: structured progress file in, fresh context out.

Pocock's Lineage: Memento, Markdown, and the Night Shift

Matt Pocock arrives at the same destination through a different door. His framing is psychological rather than architectural.

Key Insight from AI Engineer: "LLMs are kind of like the guy from Memento. They just continually forget. They could just keep resetting back to the base state. I much prefer my AI to behave like the guy from Memento because this state is always the same. Always the same. Every time you do it, you clear and you go back to the beginning." -- Matt Pocock, Matt Pocock — AI Coding Workflow

Pocock dislikes compaction for the same reason Anthropic found it insufficient: it preserves the trajectory. A compacted session is still drifting. A cleared session is back at the system prompt. Pocock prefers the cleared session, every time.

That choice forces a discipline: if every session forgets, then everything the next session needs has to live somewhere durable. In Pocock's workflow that "somewhere" is local markdown. The team writes a PRD (a destination document), breaks it into independently grabbable issues stored as files in the repo, and treats those files as the project's persistent state. The conversation is ephemeral. The markdown is real.

When it is time to actually run the agent unattended — what Pocock calls the "night shift," after the human-led "day shift" of planning — the bash script once.sh reads every issue file from the repo into a local variable and the last five git commits, then launches Claude Code with that block of state injected at the front of the prompt. The agent picks the next task from the file-based backlog, works it, commits, and exits. The loop runs once.sh again with a fresh agent, which reads the now-updated state and continues.

Key Insight from AI Engineer: "We've queued up a lot of work for the agent. We can think of this as kind of like the day shift and the night shift. This is the day shift for the human — planning everything, getting all the stuff ready. And then once we kick it over to the night shift, the AI can just work AFK." -- Matt Pocock, Matt Pocock — AI Coding Workflow

The mechanics differ from Anthropic's blueprint — Pocock's "progress file" is a directory of issue markdown rather than a single tracker, and his harness is a hand-rolled bash loop rather than the Claude Agent SDK — but the durability pattern is identical. State lives outside the agent. Fresh context every session. Read state in, write state out, exit. No agent ever gets old enough to feel anxious.

Sonnet 4.5 vs Opus 4.6: When You Still Need This

The blueprint was designed against Sonnet 4.5, which exhibited context anxiety reliably. Anthropic's harness work was, in part, a workaround for a specific model's failure mode. That matters because the failure mode is not constant.

When Anthropic re-ran the digital audio workstation experiment on Opus 4.6 — released mid-experiment, with a 1 million token context window — they were able to delete most of the harness. They removed the sprint structure. They removed the contract negotiations between generator and evaluator. They removed the context resets entirely and relied solely on the Claude Agent SDK's built-in compaction. The planner expanded the one-sentence prompt into a full spec, the generator one-shotted the entire DAW in a single continuous session, and the evaluator only ran at the end to provide feedback for a second iteration.

The DAW build came in at roughly four hours total, across three phases of generation and two evaluation passes, for about $125 in Claude Agent SDK costs. No resets, no handoffs, no progress file. On Opus 4.6, the anxiety symptoms simply did not materialise.

This is the headline result, but it deserves a hedge.

Key Insight from The AI Automators: "It's in Anthropic's best interest to process tokens. So I'm sure they're delighted if you're sending in a 1 million token request every time, even if some of that's going to be cached. So I definitely don't think it's the end of context resets the way they've been designed to date." -- The AI Automators host, Anthropic Just Dropped the New Blueprint for Long-Running AI Agents

The economic incentive is real and worth keeping in mind. There is also a practical reason resets do not go away even if Opus 4.6 holds up perfectly: most agents in production are not running on the most expensive model with the largest window. Sonnet is cheaper. Sonnet still gets anxious. If your long-running pipeline runs on Sonnet, or on a non-Anthropic model entirely, the reset-and-handoff pattern is still the durable answer.

The practical guidance is bimodal:

  • On Sonnet 4.5 (or any model that shows anxiety symptoms): use the full reset pattern. Progress file, fresh context per chunk, structured handoff. Compaction alone will not save you.
  • On Opus 4.6 with the 1M window: for tasks that fit comfortably under a million tokens, you can rely on built-in compaction and one continuous session. Keep the progress file anyway as recovery state — but you will not need it for the anxiety mechanism, only for crash recovery.

The blueprint is not retired. Its assumed failure mode is just no longer universal.

Harness Evolution: Why Every Component Has a Shelf Life

The deeper lesson from Anthropic's writeup is methodological. The reason the team could delete most of the harness when moving from Sonnet 4.5 to Opus 4.6 is that every component of the harness was an admission about what the model could not yet do.

  • The context reset existed because Sonnet 4.5 had context anxiety.
  • The contract negotiation between generator and evaluator existed because the generator would otherwise quietly redefine "done."
  • The sprint decomposition existed because the model could not hold the entire spec in mind without losing coherence.

Each of those is a workaround. Each one will go stale as models improve. The team that built the DAW under Opus 4.6 deliberately stripped scaffolding back, on the principle that the simplest harness that still works is the right harness — and the threshold for "still works" moved when the model got better.

For builders, the practical implication is uncomfortable: a harness is not a one-time investment. The components you write today encode the failure modes you observe today. When the underlying model changes, audit them. The reset that was load-bearing in March may be dead weight in June. The evaluator that caught real issues on a smaller model may, on a larger one, just slow the pipeline down.

The inverse holds too. If you push the model further out of its wheelhouse — a harder domain, a longer horizon, a model with a smaller window — you may need scaffolding back. Anthropic's experiments showed that the evaluator's value tracked the difficulty of the task: at the limit of what the generator could do, the evaluator caught real problems; well within the generator's range, the evaluator was mostly redundant.

There is no permanent harness. There is the harness your current model needs for your current task.

What to Take Away

For a long-running agent that has to survive past the smart zone of a single context window, the durable pattern is the same regardless of which channel describes it:

  1. State lives outside the agent. A progress file, an issue directory, a git history — something the agent reads in at the start of every session and writes back out before it exits. The conversation does not survive the reset. The file does.
  2. Each session starts clean. Fresh context window, fresh model state, no inherited trajectory. This is what neutralises context anxiety. Compaction is not a substitute because the compacted summary still carries the "we have been working a long time" signal that triggers the rush-to-close.
  3. The progress file is the recovery checkpoint. A successor agent reads the file, runs the existing tests to verify what is real, and resumes. Crashes, timeouts, and failed evaluator passes all collapse into the same recovery path: re-read state, re-verify, continue.
  4. Match the harness to the model. On models with anxiety symptoms (Sonnet 4.5 and below), use the full reset pattern. On Opus 4.6 with a million-token window, you can lean on compaction for tasks that fit. Keep the state file either way; the anxiety problem may be solved, but crashes are not.
  5. Treat the harness as perishable. Every scaffold component encodes an assumption about model capability. Revisit those assumptions on every model upgrade. Strip what is no longer needed. Add back what a harder task demands.

The hour-long agent and the six-hour agent are not the same animal. The six-hour agent is a pipeline of one-hour agents that have been taught to leave good notes for each other.


Next chapter: Personal agentic OS — composing memory, skills, and hooks into a personal-agent stack.

Chapter 7: The Personal Agentic OS

Source Videos: How To Build a Personal Agentic Operating System, How to Build for AI Agents and a Claude Code Second Brain in 25 Min — Ryan Wiggins


Why the System Matters More Than the Tool

An agent gets its leverage from context about you — your projects, your style, your stakeholders, your decisions. A fresh Claude Code session with no memory and no files to read is a smart generalist. A Claude Code session pointed at a folder of curated text files about your work is a specialist who has been at the company for five years.

That folder, plus the conventions for what goes in it, is the personal agentic OS: the composed stack of identity, context, skills, memory, connections, and automations that travels with you regardless of which model or harness you happen to be using this month.

The framing comes from a training program described on The AI Daily Brief by Nufar Gaspar. Her observation is that every agentic tool — Claude Code, Cursor, Codex, OpenClaude — has converged on the same primitives. They all read text files that define who you are, what you know, what you can do, what you remember, and what you can reach. The interesting work isn't picking the harness. It's authoring those files.

Key Insight from The AI Daily Brief: "Every one of these agentic tools is basically doing the same thing under the hood. They are reading text files that define who you are, what you know, what you can do, what you remember, and what you can reach. The work you do to build your system is portable. When you switch tools or add a new one, all you have to do is point the tool to the same folder and it reads the same files. No migration, no rebuild." -- Nufar Gaspar, How To Build a Personal Agentic Operating System

This is an integration chapter. Subagents are explained in Chapter 2, parallel patterns in Chapter 3, skill mechanics in Chapter 4. Here we treat those as components and ask: how does an individual user assemble them into a personal stack that compounds rather than decays?

The two source videos are complementary. Gaspar gives the taxonomy — seven layers, what goes in each, how to build them without producing a 40-page novel that goes stale in eight weeks. Ryan Wiggins, who leads a product team at Mercury, walks through a working version built on Claude Code: five years of internal documents indexed locally, hooks that inject context into every prompt, skills that run analyses, MCP integrations into business systems. His demo is the concrete example. Gaspar's framework is how you'd build your own.


The Seven Layers

Gaspar describes the personal agentic OS as a stack of seven layers. Each layer is a category of text file (or set of files) that the agent loads, on demand or on every turn, to give your prompts more leverage than they would have on their own. The layers, in order:

  1. Identity — who you are, how you communicate, what rules you want enforced
  2. Context — what you know about your situation, your team, your roadmap
  3. Skills — repeatable workflows written as instructions the agent can fire on demand
  4. Memory — what gets remembered across sessions and what gets deliberately written down
  5. Connections — MCP servers, CLIs, and APIs that let the agent reach real systems
  6. Verification — quick checks the agent (or you) runs before output ships
  7. Automations — scheduled or event-triggered runs that operate while you aren't watching

The order matters. Identity and context shape every later layer. Skills are scaffolding around them. Memory keeps it warm across sessions. Connections turn a smart reader into something that can act. Verification keeps a confidently-wrong agent from shipping output before you notice. Automations sit on top because they multiply both value and risk.

The stack is cumulative. Each agent you build on the OS inherits the foundation, which is why the second agent costs an afternoon and the first agent cost a weekend.

Key Insight from The AI Daily Brief: "Once you build your OS, agents become cheap. Your first agent is hard. The second agent that is built on top of this system — maybe a research agent or a board prep agent — takes you an afternoon because it inherits everything. It already knows you, it knows your context, it knows your voice, and you're only adding a job description and a few specific skills." -- Nufar Gaspar, How To Build a Personal Agentic Operating System


Layer 1: Identity

Identity is the file the harness reads first, before any prompt you type and before any memory is loaded. In Claude Code it is CLAUDE.md. In Cursor it is agents.md. In OpenClaude it is soul. In GitHub Copilot it is copilot-instructions. Different filenames, same concept: a text file that tells the tool who it is working for.

What goes in it:

  • Who you are — role, organisation, domain
  • How you communicate — direct or diplomatic, bullets or prose, short or thorough
  • What you value — concise vs. lengthy, "challenge my thinking" vs. "execute what I say", show reasoning vs. just answer
  • Hard rules — "never send external email without showing me a draft", "never flatter me", "always tell me what I'm not seeing"

The trap is writing the identity file from scratch in one sitting. You will hate it and quit. Gaspar's process: brain-dump to an AI tool that already has some memory of you, ask it to interview you with 15 questions about how you work, what frustrates you, what rules you want enforced — answer out loud, let the AI draft, edit it, ship a version that's about 70% right, patch the gaps over the next three weeks. This same brain-dump-then-iterate methodology applies to every layer in the stack.

For a chief-of-staff agent — Gaspar's running example — the identity file captures non-negotiables like "never let me walk into a meeting without a pre-read", "always tell me who else I owe a reply to", "flag when I'm overcommitting next week".

Identity is short. It is loaded on every turn. Bloating it with reference material is what context files are for.


Layer 2: Context

Context is what you know about your situation, and it is the single biggest predictor of whether the agent gives you generic advice or something useful. Generic advice is one Google search away. What no model improvement will ever know on its own is your roadmap, your org chart, your customer segments, your stakeholders, what you are shipping next quarter.

Unlike identity, context files are not loaded on every turn. They are the library the agent reaches into when a task needs them. That distinction matters: identity is small and always-on; context can be large and is read on demand.

The same one-sitting trap applies, more dangerously. Context-engineer in a single session and you produce a 40-page document that you never update. That isn't context — it's a stale novel.

Key Insight from The AI Daily Brief: "What actually works is three to five focused files, each on a single page, each covering one thing — my team, my product, my customers, my quarter, my stakeholders. Make it dated and fresh and update when things change. Every time you catch yourself re-explaining something about your situation to AI, that thing should have been in a context file. Write it down, add it to the library, move on." -- Nufar Gaspar, How To Build a Personal Agentic Operating System

Context curation is a practice, not a project. The discipline is: every time you re-explain your situation to an agent, that explanation belongs in a file. Capture it once, point future sessions at it, move on.

For the chief-of-staff agent, the minimum context set is a stakeholders file (who reports to whom, what each cares about), a strategy and priorities file (what you're trying to achieve this year), and an operating principles file (how decisions get made, what you escalate).

Context creation, Gaspar argues, is the fastest path to AI value. The shift she watches happen with people who get it: they stop asking what AI tool they should use and start asking what knowledge they have that isn't written down anywhere.


Layer 3: Skills

Skills (Chapter 4 covers the mechanics) are how identity and context get applied to specific repeatable workflows. A skill is a reusable instruction set written in the form: when I say {trigger}, do {process} using {sources} and produce output in {format}.

Without skills, you re-explain the format every time, paste the same sources every time, and complain that the agent writes in a weird voice without ever teaching it yours. A skill fixes that — write it once, fire it forever.

Gaspar's claim is that every knowledge worker has 20 to 30 patterns that could be skills: weekly status updates, meeting prep, stakeholder emails, decision memos. For the chief-of-staff agent, candidates include pre-read (one-page brief for any upcoming meeting), daily-brief (scan inbox, Slack, calendar), voice-match (write in your voice from samples), commitment-tracker (parse meeting notes for promises made).

Same iteration discipline as the other layers: ship an MVP skill, use it for a week, notice when it's off, patch it. Skills are where the OS starts feeling like a team rather than a chat partner. Each skill is a focused contributor that knows your voice and your context because it inherits identity and the relevant context files.


Layer 4: Memory

Memory is the most actively-evolving layer. Every harness vendor is investing here because it is one of the largest unlocks. Claude Code recently shipped automatic memory. Cursor has project-level memory. The features change weekly, and what is a limitation in one tool is often solved in the next release.

Two practical points hold regardless of which harness you're on.

Know how your harness's memory actually works. Ask it directly. Gaspar's prompt: "Explain how your memory system works. What do you remember between sessions? What do you forget?" You can't improve limits you don't understand.

Don't rely solely on automatic memory for the things that matter. The agent will remember on its own, but it doesn't always pick the right things. A major decision, a shift in priorities, the conclusion of a long session — these may not get captured the way you would want. Deliberately tell the agent what to remember, or maintain structured memory files yourself.

For the chief-of-staff agent that means dedicated memory beyond whatever the harness captures: a decision log (what was decided, why, what alternatives were considered), working-process learnings, and relationship context (how a conversation with a specific stakeholder went, what they reacted well to).


Layer 5: Connections

Connections are how the agent reaches real systems — email, calendar, Slack, Jira, Salesforce, your own databases. The mechanisms: MCP servers (the open standard most harnesses now support), CLIs (which give the agent more judgment about how to interact), and direct API or scripting access where neither of the above is available. Vendors are making connections progressively easier out of the box.

The discipline that matters most: start read-only. Before you let the agent write into systems, let it only read your calendar or only read your inbox. Add write access after you've watched the agent behave for a few weeks. The risk scales with the capability.

Key Insight from The AI Daily Brief: "It's not just data leaks in the traditional sense. Imagine an agent that has access to your company Slack with a very loose set of permissions. Someone on your team starts chatting with it, and now the agent is happily sharing your private notes, your opinions about colleagues, your draft feedback. It's not a hypothetical risk. Incidents like that are already happening." -- Nufar Gaspar, How To Build a Personal Agentic Operating System

For the chief-of-staff agent, the staged plan: read-only calendar and inbox first; read-write on a personal task list once trust is established; permission to post drafts to your own Slack DMs for approval before anything goes external.

Mercury's own MCP, demoed in Wiggins's video, is a deliberate read-only build. They have a full read-write API, but the MCP exposes only read access — "to keep it safe," as Wiggins puts it. Same least-privilege principle, applied at the vendor level.


Layer 6: Verification

The worst failure mode of a personal agentic OS isn't that it fails. It's that it works confidently and wrongly, and you ship the output before you notice.

Verification catches that. Every agent task has a quick test specific to it: drafted emails should match your tone and have the facts right; analyses should have the numbers right; meeting briefs should reference real attendees and real prior decisions. Three to five checks per task type, often under a minute each.

There is also a meta-verification practice: periodic retrospectives on the OS itself. Which skills are never being called? Which context files have gone stale? Which agents need updated instructions? The harnesses let you ask this directly — a session can audit itself and tell you what isn't being used.

Key Insight from The AI Daily Brief: "Without that audit discipline, your OS has a shelf life of maybe eight weeks before everything goes stale. With it, your OS compounds further and forever." -- Nufar Gaspar, How To Build a Personal Agentic Operating System

Eight weeks is roughly the half-life of an unmaintained context library before the team has reorganised, priorities have shifted, and the decision log no longer reflects reality.


Layer 7: Automations

Automations are runs that fire while you aren't watching: a daily summary at 7am, a Slack monitor that pings you when a specific channel mentions a topic, a weekly report that emails itself to you Sunday night. They are powerful and they are where the risk gets real, because an agent running at 3am with a wrong answer can do damage before you wake up.

Three rules from Gaspar:

  1. Only automate workflows you have run manually enough times to trust. If you haven't done it by hand and verified the output, don't put it on a cron.
  2. Start with automations that produce drafts for you to review, not outputs that go directly to other people. Drafts to your own inbox or your own Slack DMs are reversible. Auto-sent emails to customers are not.
  3. Always log. You need to know what ran and what it did, after the fact, in detail.

This is the layer that benefits most from staged trust. Once an automation has been writing high-quality drafts to your DMs for a month and you've never had to correct it, you can promote it. Promote too early and you find the failure modes in production.


A Working Example: Ryan Wiggins's Second Brain

Ryan Wiggins leads a product team at Mercury, the banking platform — about twenty products, five years at the company, work on the data team as well. His problem, in his own words: "I have a ton of context, a ton of information, but when I'm making decisions day-to-day, it is quite hard to access."

His solution is the most fully-realised personal agentic OS in either source video. He calls it a second brain (or mission control), and it is built on Claude Code. His description maps almost exactly onto Gaspar's seven layers, even though he never uses that vocabulary.

The Knowledge Base (Layers 2 and 4)

The foundation is what he calls the context layer: a download of nearly five million words pulled from everything that touched his surface area at Mercury over five years. Company strategy docs, every spec ever written, every query ever run, team check-ins, onboarding docs, performance reviews — all stored on the local file system.

Five million words is well past any context window. The corpus is indexed locally using QMD (a local indexing system maintained by a colleague) so that lookups search the concepts in a query rather than the literal terms. Ask about "MCP product traction growth" and the index returns adjacent material — the strategic context doc, the growth-product team check-ins, the team charter — even when none of those documents contains those exact terms.

This is what Layer 2 looks like at scale: roughly twenty curated documents the agent reaches into when the task warrants, indexed across the full archive.

The Hooks (Layer 1, mechanically enforced)

Every morning, Wiggins opens his terminal, navigates to Claude Code, and starts working. The OS attaches itself to his prompts automatically. From his description:

Key Insight from Peter Yang: "It uses Claude hooks to inject that context into every question I ask. So if I ask a question like 'how's activation trending', I'm not just asking that question — I'm asking that question with all the knowledge and history of Mercury going into it. It's actually injected into each request and it has made each request much, much better." -- Ryan Wiggins, How to Build for AI Agents and a Claude Code Second Brain in 25 Min

Hooks here do the load-bearing work of the identity layer: they make sure the agent never operates without context, even when the user types a one-line question. The user doesn't have to remember to attach background. The harness attaches it.

The Skills (Layer 3)

On top of the knowledge base sit skills — patterns Wiggins runs repeatedly. Analysis runs. Small app prototypes. A "feed back into memory" skill that captures session learnings and updates the structured memory files.

The skill he is proudest of is one Mercury later promoted into an internal product: an automated data analyst that answers 80–90% of the questions cross-functional teams ask. He prototyped it locally on his second brain, built confidence that it was accurate, then shipped it internally. The OS-as-launchpad pattern: personal tooling becomes team tooling once it has earned trust.

The Connections (Layer 5)

MCP integrations plug the agent into the systems Wiggins uses: Notion (which transcribes his meetings), Linear, GitHub, Slack, Omni, Metabase. The data analyst skill triggers queries directly. The brief skill reads Notion transcripts. The pipeline lets him pay more attention in meetings rather than typing notes — Notion captures the content, the second brain ingests it.

The Multi-Agent Workflows (Layer 3 + Chapter 5)

For larger jobs, Wiggins spawns multi-agent teams: send in a full analysis request, the team goes off to understand the data, think about the problems, run the analysis, discuss it, return a report. This is Chapter 5's orchestration pattern layered on top of the personal OS. The team inherits the same knowledge base, the same identity, the same voice — because everything is loaded from the same files.

The Daily Cadence (Layer 7)

The automation layer is structural rather than scheduled. Every day starts with a brief: calendar, Linear, GitHub, Slack, meetings ahead. Every day ends with a summary: what got transcribed, what action items came out of meetings, what should feed back into his performance development. Same pattern weekly.

The most striking application is performance feedback. Wiggins gets a six-month performance review with high-level themes — for instance, "you jump to solutions too fast; probe the question more before pushing your team." The second brain has the review themes and the meeting transcripts. After a meeting it compares: in this conversation you were doing the exact thing flagged in your review. The feedback loop tightens from six months to a single day.

Key Insight from Peter Yang: "When I run a meeting and it tells me one of the things I'm working on is that I jump to the solution too quick — that I don't probe enough — it tells me, 'hey, in this meeting you were doing this exact thing that is in your performance review.' It can keep me honest at a much faster frequency than any other system can. My manager and people partner are so happy about this." -- Ryan Wiggins, How to Build for AI Agents and a Claude Code Second Brain in 25 Min

Peter Yang's reaction is worth pulling: "Of all the execs I've talked to, this is the most impressive system I've seen. Most execs are just in meetings all day; they don't have time to throw this stuff." The second brain works specifically because the foundation was authored once, not because Wiggins runs it manually each day.


What's Worth Automating, What's Worth Leaving Manual

The personal OS has a maintenance cost. Identity files drift. Context files go stale. Skills accumulate that no one calls. Memory files diverge from reality. Both source videos are realistic about this.

Gaspar's framing: treat the OS like a team you periodically retrospect on. The same way you'd review an employee or a quarter, you audit the OS — which layers are pulling weight, which are bloat. The harnesses let you ask directly. Eight weeks is the decay window without that discipline.

A practical sort:

  • Worth automating: Daily and weekly briefs (low risk, easy to verify, high ROI from compounding). Read-only ingestion (transcripts, calendar pulls, GitHub activity). Memory write-backs at session end.
  • Worth keeping manual: Identity-file updates (you should notice the gap and fix it deliberately). Context curation (writing it down is the moment of clarifying your own thinking). Promoting an automation from draft-mode to send-mode (the explicit trust decision).
  • Verification stays human at the high-stakes end. Wiggins's data analyst answers 80–90% of questions; the remaining 10–20% is where a human still has to look. Drafts to external recipients. Decisions with downstream consequences. Agent-flagged anomalies in real numbers.

The compound trap to avoid: automating a workflow before you've run it manually enough to know what good output looks like. You will not catch the failure mode you've never seen.


Composition: Why This Chapter Sits at the Centre of the Guide

The personal agentic OS is the integration story for the rest of the guide.

  • Chapter 2's subagents become much more useful when they inherit a populated identity, context library, and skill set. A subagent dispatched into a fresh context window starts from scratch. A subagent dispatched against the OS already knows your voice and your stakeholders.
  • Chapter 4's skills are one of the layers. The OS gives skills somewhere to live and a context to operate against.
  • Chapter 5's orchestration patterns scale on the OS. Wiggins's multi-agent teams work because every agent reads from the same knowledge base.
  • Chapter 6's long-running agents are sustainable when the OS gives them durable state to recover into.

The OS is the substrate. The other chapters are patterns that run on it.

The compounding return is the reason to invest. The first agent — Gaspar's Chloe, Wiggins's data analyst, your own first chief-of-staff — is the expensive one. Every subsequent agent benefits from infrastructure that already exists. The third costs less than the second. The fifth is a job description and two skills.

Key Insight from The AI Daily Brief: "The tool you pick matters less and less. What matters much more is the system that you build underneath it. The people who build that foundation now will have it compound from here on. Everyone else will keep starting over with new tools." -- Nufar Gaspar, How To Build a Personal Agentic Operating System

The harnesses are converging. Models are interchangeable on the timescale of months. The portable artifact is your folder of text files: identity, context, skills, memory schema, connection configs, verification checklists, automation definitions. Point any agentic tool at that folder and the work resumes.

That folder is the personal agentic OS. Everything else is execution.

Chapter 8: Agent UX Beyond Chat

Source Videos: Agents need more than a chat — Jacob Lauritzen, CTO Legora, Collaborative AI Engineering — Maggie Appleton, GitHub Next


The Thesis: Chat Is the Wrong Primitive

Open a Claude Code session. Type a complex request. Tool calls, sub-agent spawns, file reads, web fetches stream by. Thirty minutes later you get a result. Clause three of the contract looks wrong. You say so. The agent runs more tools, a banner flashes — compaction — and it has forgotten half of what it did. It hands you a new draft. You have no idea what else changed.

That sequence is Jacob Lauritzen's opening at the AI Engineer conference, and it is the failure mode every chat-fronted agent eventually produces. Chat works for short, conversational tasks. It collapses when the agent's work is a tree — research, decompose, run sub-tasks, gather results, draft, fix. A linear chat log cannot represent that tree. It cannot let you inspect intermediate state, fork an attempt, or steer one branch alone. It cannot show you what was actually changed when you asked for "fix clause three."

Maggie Appleton, a staff researcher at GitHub Next, makes the parallel argument from the team angle. The chat-fronted agent is a single-player interface. It scales up one developer's output. But software is not made by one person, and a fleet of agents per developer does not solve the problem of a team agreeing on what to build.

Two halves of the same critique. Chat flattens the agent's vertical structure (the work tree) and isolates it from the team's horizontal structure (the people who need to align). The rest of this chapter is what each speaker proposes instead.


Lauritzen: Control, Trust, and the Work Tree

The two axes that matter

Lauritzen's frame for human–agent collaboration is two-dimensional: control and trust.

  • Control is how effectively a human can instil their judgment into the agent's work. High control means you can steer at every step; low control means you only see the final output.
  • Trust is how much you need to review afterwards. High trust means you do not look at the trace; low trust means you read every step.

Where a task lives on these axes depends on whether it is verifiable. Lauritzen invokes the "verifier's rule" (a term he attributes to Jason): if a task is solvable and easy to verify, AI will eventually solve it. Some tasks are easily verifiable (checking definitions in a contract, running unit tests). Some are not (writing a contract, picking a litigation strategy, building a successful consumer app). For the unverifiable ones, you cannot just let the agent rip — you need either a proxy for verification or a way for a human to inject judgment partway through.

Why planning is not enough

Most current agent UX gives the human one shot at control: a planning step at the start. Claude Code's plan mode is the canonical example. The agent proposes an approach, the human approves it, the agent works, and the human does not hear from it again until the final artifact.

Lauritzen is blunt about why this is insufficient:

Key Insight from AI Engineer: "You basically have to do all the work to just know what to do... it's basically impossible for it to really know if it has all the information it needs. Let's say for one of these contracts there's a special clause. It wouldn't know that in the planning step. You can't really tell it what to do when it sees that because it hasn't done all the work." -- Jacob Lauritzen, Agents need more than a chat

Planning collapses every future decision into the present — exactly when you have the least information. He compares it to a coworker who aligns with you on the approach, then disappears until they hand you the final document. Useful, but a strange way to collaborate.

Skills, elicitation, and the work-tree view

Lauritzen's preferred alternative: think of agent work as a directed graph — a tree of nodes, each a piece of work — and give the human ways to inject judgment at the nodes, not just the root.

  • Skills encode judgment into the leaves of the tree. "Whenever you review a confidentiality clause, do it this way." This handles the contingency the planning step would miss: the special EU termination law that nobody knew about until the agent opened the contract. (See Chapter 4.)
  • Elicitation is the agent asking the human for input mid-task — without blocking. Lauritzen's recommendation: tell the agent to make a decision, unblock itself, and write the choice to a decision log the human can reverse later.
  • Guardrails raise trust by lowering surface area. Edit only these files, search only these sites, read only these directories. Claude Code's permission system is the everyday example.

The point: the human's leverage over an agent should not be confined to the prompt at the start and the review at the end. The interface should expose the work tree itself.

Why chat cannot show a tree

Key Insight from AI Engineer: "Chat is one-dimensional. It's a very low bandwidth interface, and it tries to collapse this work tree into a single sort of linear thing." -- Jacob Lauritzen, Agents need more than a chat

He is not arguing against chat as an input method — typing or speaking a goal is fine. He is arguing against chat as the output and state surface. If the work is a tree of a hundred nodes, dumping it into a scrollable conversation produces fifty unanswerable questions and an artifact you cannot diff against the previous version.

The alternative is what he calls high-bandwidth artifacts — durable, structured surfaces matching the shape of the work. At Legora (a vertical AI workspace for law firms) the artifact is a document, with the affordances people already use to collaborate: highlight clause three and only clause three changes; add a comment; tag a colleague; tag a specialist agent; hand off a section. Or it is a tabular review — a spreadsheet of contracts where the agent flags only the items it wants a human take on. The human scans the table, applies judgment to a handful of cells, and lets the agent finish the rest. The UI matches the structure of the work, so review takes minutes instead of an hour of scrolling.

The closing line of the talk is the punchline:

Key Insight from AI Engineer: "Agents aren't humans. We should not constrain them to human language." -- Jacob Lauritzen, Agents need more than a chat

Language is the universal interface for humans because we have nothing better. Agents do not have that constraint. There is no reason their primary collaboration surface should imitate ours.


Appleton: Collaboration Is the Other Missing Dimension

Where Lauritzen attacks chat from the work-structure angle, Maggie Appleton attacks it from the team-structure angle. Both point at the same gap.

The "one developer, two dozen agents" fantasy

Appleton's framing image: a wall of terminal panes, all running coding agents in parallel on one engineer's laptop. The promise is that one person plus a fleet of agents replaces a team. The flaw:

Key Insight from AI Engineer: "Software is not made by one person in a vacuum. It is a team sport and everyone building it needs to agree on what they're building and why." -- Maggie Appleton, Collaborative AI Engineering

Scaling up an individual does not solve coordination problems — it makes them worse. More output without alignment means more wasted output. Appleton borrows the cliché: nine women cannot make a baby in one month.

The collapse of alignment touchpoints

Her diagnosis of why chat-fronted agents make team alignment worse is concrete. The old development cycle had natural alignment moments: Slack threads about a design, Zoom calls during planning, comments on draft PRs, review before merge. By the time code shipped, the team had absorbed it.

Agentic coding has collapsed implementation. The time between filing an issue and an agent opening a PR is now minutes. The early alignment touchpoints evaporate because nobody bothers — the code is too cheap to plan for. Worse, every coding agent's plan mode is local and unshared. Your teammate has no idea what plan you accepted, and vice versa. All the alignment weight piles onto the pull request — at the end of the cycle, when course-correcting means throwing the work out.

The result is the wreckage everyone now recognises: features no one asked for, merge conflicts from two agents touching the same files, towers of PRs no reviewer has context for.

ACE: an agent collaboration environment

GitHub Next's response is a research prototype called ACE — Agent Collaboration Environment. Not a shipped product; Appleton is upfront that it is rough. But it is a concrete picture of what "agent UX beyond chat" means at the team level.

The core idea: every session is a multiplayer chat channel backed by a sandboxed cloud microVM on its own git branch. The session is shared. The agent is in it. Teammates are in it. Terminal output, diffs, live preview — all shared, all visible, all addressable by anyone.

Concrete consequences:

  • A teammate drops into your session in one click and sees the full prompt history. No "stash your changes, pull my branch, run install."
  • Both of you can prompt the same agent in the same session. The agent reads your conversation as input. Discuss something, then say "@ace, do it."
  • A live preview is visible simultaneously to everyone. "Works on my machine" is a category error.
  • Plans are collaborative documents with shared cursors, not local artifacts buried in someone's terminal. The plan gets edited, argued over, refined, then the agent runs it.
  • A team dashboard summarises what colleagues shipped, what is in flight, what got merged — a standup replacement that is agent-aware.

Appleton's framing of why this matters is that alignment — not implementation — is now the bottleneck:

Key Insight from AI Engineer: "Implementation is rapidly becoming a solved problem... The hard question is no longer how to build it. It's should we build it. Agreeing on what to build is the new bottleneck." -- Maggie Appleton, Collaborative AI Engineering

Notice that this is the same observation Lauritzen made about legal work: doing the work has become cheap, planning and reviewing the work is now the constraint. They are looking at different verticals and arriving at identical structural claims.

Quality as the new differentiator

Appleton ends on a point that inverts a common worry about agent-generated code:

Key Insight from AI Engineer: "In a world of fast cheap software, quality becomes the new differentiator. The bar is being set much higher and craftsmanship is what will set you apart from vibe-coded slop." -- Maggie Appleton, Collaborative AI Engineering

When implementation is cheap, what you used to spend on typing now buys time for research, architecture review, and design — if the team is aligned enough to spend it that way. Tools that scale individual output deliver more bad software faster. Tools that scale team alignment deliver fewer, better things.


What This Means for Claude Code

Both speakers are talking about products that do not yet exist in shipped form — Legora is a vertical workspace, ACE is a research preview. But the critique applies directly to Claude Code today, and it points at where the surface needs to grow.

Where Claude Code already gets it right

Several existing primitives already gesture at "more than chat":

  • Plan mode forces a structured artifact between intent and execution. Necessary but not sufficient.
  • The permission system is the everyday version of Lauritzen's guardrails — explicit boundaries traded against trust.
  • Skills (Chapter 4) are his "encode judgment into the nodes" mechanism. They live in the leaves of the work tree and fire on the right context.
  • Subagents (Chapter 2) physically structure work as a tree rather than a linear conversation.
  • Forked subagents (April 2026) let you branch the work tree from any point — the closest existing primitive to "inspect and fork an intermediate state."
  • /btw and /fork acknowledge that a single linear conversation is not enough — a side question that does not pollute the main thread, or a parallel attempt that diverges from a checkpoint.

Where the gap is largest

  • No structured view of the work tree. A long subagent-heavy session is presented as a scrolling log. There is no map, no "click on this branch and see what it produced in isolation."
  • No multiplayer. Two engineers running Claude Code on the same repo do it on two laptops, with two prompt histories, in two private contexts. Worktrees (Chapter 3) solve the file conflict but not the alignment problem.
  • Plans are private. Plan mode produces a plan visible only to the operator. There is no shared review surface before the agent starts.
  • Review surfaces are still PR-shaped. The diff is the output. For artifacts that are not code — a contract, a research report, a strategy memo — the chat-plus-final-file pattern fits poorly.

Practical takeaways for users today

You cannot rebuild Claude Code's UI, but you can run it as if Lauritzen's and Appleton's critiques were already operating principles:

  1. Decompose tasks before prompting. The 30-minute autonomous run is the failure mode at the top of Lauritzen's talk. Break work into nodes you can verify or steer between.
  2. Use skills to encode judgment. Skills fire at the right node automatically. CLAUDE.md and prompts run at the wrong altitude. (See Chapter 4.)
  3. Make the artifact the workspace, not the chat. If you are writing a document, edit the document — let Claude Code modify it in place and review the diff. Lauritzen's tabular-review and document patterns map directly onto this: the structured artifact is the surface for both work and review.
  4. Treat plan mode as a checkpoint, not a contract. Re-plan when the agent hits something the original plan did not anticipate. Force elicitation rather than letting the agent make a wrong-but-confident call.
  5. Share plans before the agent starts. Appleton's collapse-of-touchpoints diagnosis is real. If the work involves anyone else, the plan should land somewhere reviewable — a doc, a PR description, a ticket — before the agent runs, not after.
  6. Run in parallel, but coordinate at the artifact level. Worktrees (Chapter 3) let multiple agents work without colliding on files. They do not stop multiple humans from prompting toward conflicting goals. The artifact is where alignment has to happen.

The throughline of both talks: as agents take over implementation, the human's job moves to the surfaces around the work. Chat is a thin surface. The interfaces that win will match the actual structure of agent work — a tree, embedded in a team.

Chapter 9: Anthropic's Perspective — How Claude Code Is Built and Shipped

Source Videos: How Anthropic's product team moves faster than anyone else — Cat Wu, Building Claude Code with Boris Cherny


This is the inside view. Eight chapters in, you have the user's mental model of Claude Code agents — the loop, subagents, worktrees, skills, orchestration, long-running runs, the personal stack, and the UX critique. This last chapter looks the other direction: how the team that builds Claude Code thinks about agents, what they ship in any given week, what they remove when a new model lands, how they decide the product is safe enough to release, and where they think this is going.

The core sources are two interviews. Cat Wu, Head of Product for Claude Code and Co-work, on Lenny's Podcast — recorded April 2026, after roughly a year of public Claude Code releases. Boris Cherny, the engineering lead, on The Pragmatic Engineer — pulled here only for the agent-specific material that Chapter 1 did not already cover.

The Boris–Cat Split

Cat Wu describes a deliberate division of labour with Boris Cherny:

Key Insight from Lenny's Podcast: "He's our tech lead. He's very much the product visionary and he is great at setting like this is what the product needs to be in three months, six months from now. And a lot of my role is figuring out what is the path from where we are today to that vision three to six months from now." -- Cat Wu, How Anthropic's product team moves faster

Boris owns the long horizon and the technical bets. Cat owns the path to ship — the cross-functional alignment with marketing, sales, finance, capacity, and the launch process that turns an engineer's prototype into a public release. She estimates the two are "80% mind-meld" with a small slice of priorities each one drives unilaterally.

This is unusual. Most product teams of this scale split product management across surfaces, not across time horizons. The Claude Code team treats the role of PM as the function that compresses the gap between an AGI-pilled vision and a shippable artifact — which becomes the central theme of the chapter.

The Shipping Engine

The most-cited fact about Anthropic in 2026 is the shipping pace. Cat is direct about what makes it work, and what it costs.

Timelines compressed

Before AI, product cycles were measured in quarters. The Claude Code team now measures them differently:

Key Insight from Lenny's Podcast: "The timelines for a lot of our product features have gone down from six months to one month and sometimes to one week or even one day." -- Cat Wu, How Anthropic's product team moves faster

The corollary: a PM cannot align multi-quarter roadmaps with partner teams when features ship in a week. The PM's job moves from coordination upstream to shortening the path from idea to user.

Research preview as a commitment-reduction mechanism

Almost everything ships first as a research preview. The branding is explicit and load-bearing.

The research-preview label tells users this is an early product, an idea being tested, that may not be supported forever. That label is what permits a one-week build cycle. Without it, every release would carry a long-term support obligation, and the team would not ship as fast.

Read the Claude Code release notes through this lens and the pattern is clear: features land in preview, get user feedback, harden, lose the preview label, and either become permanent or quietly disappear. The to-do list (which we will return to below) is a feature that arrived as a fix, became a habit, and is now barely needed.

A tight launch loop with cross-functional partners

Cat describes a four-person fast-path between engineering, docs, marketing, and DevRel. When an engineer feels a feature is ready and has been dogfooded internally, they post to an "evergreen launch room"; Sarah on docs, Alex on PMM, Tara and Lydia on DevRel jump in and turn around the announcement the next day. The whole point is that an engineer can take an idea from prototype to public release without negotiating with anyone.

Mission as a tiebreaker

The hardest decisions on a fast-shipping team are not "what do we build" but "which fight do we lose." Cat puts the mission above the product line:

Key Insight from Lenny's Podcast: "If Claude Code failed but Anthropic succeeded, I would be extremely happy. The whole team is very willing to make decisions that follow that chain of thought." -- Cat Wu, How Anthropic's product team moves faster

When two priorities collide, the team asks which one serves Anthropic's mission of safe AGI for humanity, and the loser stands behind the winner. This is also the explanation she offers for the change in policy around third-party tools using the Claude subscription — the company decided to prioritise its first-party products and the API, accepting that this would harm some third-party users.

"The Right Amount of AGI-Pilled"

This is the line from the interview most worth sitting with. Cat frames it as the hardest skill in AI product management:

Key Insight from Lenny's Podcast: "It is very hard to be the right amount of AGI-pilled. It's very easy to build the product for the super AGI strong model. The hard thing is figuring out for the current model, how do you elicit the maximum capability?" -- Cat Wu, How Anthropic's product team moves faster

Build for the imagined super-model and you ship a text box and wait. The model will eventually be smart enough that the text box is the only product you need — it will pick its own tools, ask its own clarifying questions, recover from its own mistakes. But that day is not today. The job is to design product surfaces that get today's model onto its golden path.

Crutches added — and removed

The clearest example of this discipline is the to-do list. When Claude Code first shipped, large refactors would stall: the model would identify twenty call sites that needed to change, edit five, and stop. Sid on the team thought about what a human would do — open a panel, list every call site, walk it. So the team gave Claude a to-do list as a tool, and the stalls stopped.

Then Opus 4 landed. The model started keeping its own to-do list spontaneously, without prompting. Then Opus 4.5, then 4.6 — the model no longer needed the reminder at all.

Key Insight from Lenny's Podcast: "We can remove a lot of prompting interventions every time the model gets smarter. We actually do this every time we launch a model. We read through the entire system prompt and we reflect on, for each of these sections, does the model really need this reminder anymore? And if not, we'll remove it." -- Cat Wu, How Anthropic's product team moves faster

The agent harness is not a permanent scaffold. It is a series of crutches that the team installs to compensate for current-model weaknesses and removes when the next model no longer needs them. This is the same dynamic engineers see in their own CLAUDE.md files — last quarter's required reminders become this quarter's noise.

Features that become possible

The other side of model jumps is features that were not viable before. Cat names code review as the canonical case:

Key Insight from Lenny's Podcast: "We tried to build a code review product a few times... it was only with the most recent models that we felt like okay this code review is so good that our engineering team relies on this code review to pass before we merge PRs... it was only with like Opus 4.5 and 4.6 and Sonnet 4.6 that we felt like okay we are now able to run multiple code review agents simultaneously to traverse the entirety of the codebase." -- Cat Wu, How Anthropic's product team moves faster

Earlier attempts at code review existed (the /code-review slash command), but the model was not reliable enough that engineers would trust it as a gate. With multi-agent review, the team got the reliability that turns a nice-to-have into a piece of infrastructure. The product strategy: build the prototype before the model is quite ready, so that when the next model lands you can drop it in and ship.

Internal Anthropic Usage

Chapter 1 covered the headline statistic — close to 100% of technical employees using Claude Code daily, half the sales team using it, the internal adoption chart going vertical. The newer material from Cat Wu is about how non-engineering teams use the tools, and what this implies for the product.

Applied AI as the second-biggest token spender

After engineering, the biggest internal users are the Applied AI team — the technical go-to-market function that helps customers adopt the API. They are heavy on both Claude Code and Co-work. They build prototypes for customers (which would have taken weeks before, takes hours now), and they manage a high volume of customer communications and historical context. The pattern Cat describes: the night before a customer day, an Applied AI engineer asks Co-work to summarise every meeting on tomorrow's calendar, pull every prior thread with that customer, surface action items, and produce a dossier.

Custom internal apps as a side effect

Once Claude Code makes app-building cheap, internal teams stop tolerating off-the-shelf workflows that almost-fit:

Key Insight from Lenny's Podcast: "One of the things that Claude Code has really unlocked for our entire company is it really lowers the barrier to making any custom app that you want. We've seen this surge in personalised work software that people are building for custom use cases instead of using tools that don't perfectly fit the use case." -- Cat Wu, How Anthropic's product team moves faster

Her concrete example is a sales engineer's custom deck-builder. Standard intro decks (101, 201, mastering Claude Code) are templates; the app pulls customer-specific context from Salesforce, Gong, and meeting notes; it then assembles a tailored deck that knows whether the customer is on Bedrock or Vertex, whether they use Claude for Enterprise, and which features are accessible to them. What used to be 20–30 minutes of manual work — or got skipped entirely — now takes seconds.

This is the same pattern Chapter 7 discussed as a personal agentic OS, scaled to a company. The leverage is the same: when the cost of writing custom software collapses, the inventory of "things worth automating" expands.

Cat's own workflow

She uses Claude Code in the terminal for one-off coding tasks where she wants the latest features (the CLI gets new capabilities first). She uses the desktop app when there is a front-end preview to watch. She uses the mobile and web surfaces to kick off tasks while she is away from her laptop. And she uses Co-work for any work whose output is not code — slide decks built from Google Drive context, research dossiers compiled from Slack threads, draft launch plans pulled from a feature spec.

The slide-deck workflow is illustrative of how the team uses Co-work as an agent rather than a chatbot. She fed it the conference talk topic, what the PMM had suggested it should cover, her own draft (which she didn't like) and access to the design-system deck, and turned it loose. It ran for a few hours: looked through Twitter to see what had launched, walked the evergreen launch room for context, checked the Cloud Code announce channel for demos, and produced a 20-page deck the next morning. Cat reviewed and edited; she did not start from scratch.

The point is not the deck. The point is that the people building Claude Code use Claude Code agents on tasks that nobody used to think of as agent-shaped — including their own product launch material.

The Agent Deployment Safety Case

Boris's interview adds two pieces of agent-specific material that Chapter 1 only touched on. The first is the public-release argument; the second is how the team thinks about agentic safety.

Why ship the agent at all

Anthropic's research lab identity is the reason Claude Code exists as a public product at all. Boris is explicit: product exists at Anthropic to serve research and to make models safer.

Key Insight from The Pragmatic Engineer: "In the end, the decision was to release so that we can study safety in the wild... There's kind of alignment and mechanistic interpretability. This is at the model layer. Then there's evals and this is... putting the model in a petri dish and synthetically studying it in this way... and then you can study it in the wild and you can see how it actually behaves... by doing this we've been able to make the model much safer." -- Boris Cherny, Building Claude Code with Boris Cherny

The argument: synthetic evals only show you so much. To learn how an agentic AI behaves under real conditions — with real users, real prompt injections, real edge cases — you have to deploy it. The product is a research instrument as much as a revenue line. This is why Claude Code shipped publicly even though Anthropic could have kept it as an internal productivity advantage.

This framing matters for users too. When the team takes telemetry, when it asks for feedback on edge cases, when Cat closes her interview asking listeners to send her every reproducible failure they encounter — that is not customer-success theatre. The whole point of a public deployment is to surface the failure modes that synthetic evals miss.

Swiss cheese, count the nines

Agent safety is not a single barrier; it is layers. Boris describes prompt-injection defence on web fetch as a three-layer stack: the model itself (Opus 4.6 trained to resist injection), runtime classifiers that block suspicious requests, and — the genuinely agent-specific move — a subagent that summarises the fetched page and returns only the summary to the main agent's context. The raw page never touches the main loop. If the page contains an injection attempt, the subagent's summary either fails to carry it or mangles it past the threshold of usefulness.

Key Insight from The Pragmatic Engineer: "It's always a Swiss cheese model. You just need a bunch of layers, and with enough layers the probability of catching anything goes up. So you just have to count the number of nines in that probability and pick the threshold that you want." -- Boris Cherny, Building Claude Code with Boris Cherny

The structural point for agent designers: a single guardrail is brittle. Defence-in-depth is how you ship an agent that touches the file system, the network, and a shell, and still sleep at night. The subagent-as-firewall pattern is one Chapter 2 introduced as a context-isolation tool; here it is repurposed as a security primitive.

Where This Is Heading

Cat closes the interview with a forward look organised around what the team calls building blocks.

Block one: the single task

The first building block is a task that succeeds end-to-end. Give an agent a clear prompt, get an output you can either merge, ship, or hand to a human. Most of the work of the past year has been raising the success rate of single tasks until users actually trust them.

Block two: many tasks in parallel

As single-task reliability climbs, users start running many tasks at once. Cat calls out late-2025 as the inflection point for "multi-coding" — the Chapter 3 worktree pattern, generalised. Six tasks at once becomes routine. Boris ships 20–30 PRs a day this way (covered in Chapter 1).

Block three: dozens to hundreds of agents

The horizon Cat points at is far past where the average user is today:

Key Insight from Lenny's Podcast: "As the models get even smarter, the way that we are extrapolating this is okay, next maybe you're going to run fifty Claudes at a time, or hundreds of Claudes at a time... At that point you're probably not going to run everything locally on your machine anymore. There's just not enough RAM to do it." -- Cat Wu, How Anthropic's product team moves faster

Three engineering questions follow from this. How do you manage many remote agents from a single human cockpit? How do you make sure each agent verifies its own work so a human reading "done" can trust it? And how do you make the system self-improving — when a human gives feedback on a bad output, every future run incorporates the correction.

These are the long-horizon problems that the cloud-based scheduled tasks, Dispatch, and the remote-control surfaces (covered in Chapter 1) are early answers to. Expect more of the product to move off the local machine, more of the loop to become persistent, and more of the human's role to become inspection of completed work rather than supervision of in-progress work.

From chat-based to action-based

Cat draws a generational line:

Key Insight from Lenny's Podcast: "The 2024 generation of products were chat-based and the Claude Code generation of products is action-based. The... aha moment people have is when Claude can just do things on your behalf... The agent can actually just do it itself. And when people feel that, that's the eye-opening moment." -- Cat Wu, How Anthropic's product team moves faster

Every chapter of this guide has been an answer to "what does action-based mean in practice?" — the loop, isolated context, parallelism, skills, orchestration, durability, personal stacks, post-chat UX. The Anthropic-internal version of the answer is that they are building the substrate for a world where humans direct fleets of agents instead of conversing with one.

The Practitioner's Closing Note

Two pieces of advice from Cat are worth carrying out of this guide.

The first is about which automations are worth building.

Key Insight from Lenny's Podcast: "If an automation doesn't work 100% of the time, it's not really an automation. And that last 5 to 10% does take more time. Also, building the automation is often a lot slower than you doing it yourself... Put in the elbow grease to teach [Claude] your preferences, to like give it feedback so that it can improve its skill so that it can get to that 100%. And then... you'll be able to rely on it. There's just not much value in a 95% automation." -- Cat Wu, How Anthropic's product team moves faster

A 95%-reliable agent is a 95%-reliable colleague — one you have to check every time, which is most of the work you were trying to automate away. The leverage is in the last few percent. This applies equally to a personal email triage skill, a long-running research agent, and an enterprise code-review pipeline.

The second is about agency. Cat's life motto is "just do things." In her framing, jobs are fake — roles are flexible, scopes are negotiable, the constraints that actually matter are the ones that come from the problem itself rather than from the org chart. The reason this matters in an agent guide is that the people getting the most out of Claude Code are the ones treating it the same way: not waiting for permission to wire it into their workflow, not adhering to someone else's idea of what an agent should be used for, not stopping at the first 80% solution.

The team building Claude Code is shipping at a week-or-day cadence, removing scaffolds the model has outgrown, building prototypes for capabilities that do not work yet, and treating their own product as a research instrument. The team using Claude Code best is doing the same thing one layer up — composing agents, removing the workflows that have become friction, and treating each new model as an opportunity to delete prompting they no longer need.

That is the inside view. The outside view, eight chapters in, should look the same.