{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Harness Engineering: Why the System Around the Model Decides Agent Performance\n",
        "\n",
        "Coding agents are forcing a shift in how we think about AI systems. A\n",
        "few years ago, many of us were using language models for short,\n",
        "stateless tasks. In this talk, Rajiv walks through why long-running\n",
        "coding tasks create a different engineering problem and why the system\n",
        "around the model now matters as much as the model itself.\n",
        "\n",
        "This version follows the exported deck page by page, using the PDF\n",
        "itself as the source of truth for slide order. The image comes first,\n",
        "with the matching explanation directly underneath it so the commentary\n",
        "stays aligned to the slide you are looking at.\n",
        "\n",
        "## Video\n",
        "\n",
        "Watch the [full video](https://www.youtube.com/watch?v=KijChx7q2nY)\n",
        "\n",
        "------------------------------------------------------------------------\n",
        "\n",
        "## Annotated Presentation\n",
        "\n",
        "Below is the slide-by-slide annotated version of **Engineering the\n",
        "Harness: A Practical Workshop**.\n",
        "\n",
        "### 1. Engineering the Harness: A Practical Workshop\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_001.png\"\n",
        "alt=\"Slide 1\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 1</figcaption>\n",
        "</figure>\n",
        "\n",
        "Title slide for the workshop.\n",
        "\n",
        "### 2. Engineering the Harness: A Practical Workshop\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_002.png\"\n",
        "alt=\"Slide 2\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 2</figcaption>\n",
        "</figure>\n",
        "\n",
        "A second title frame before the talk begins.\n",
        "\n",
        "### 3. The tasks changed.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_003.png\"\n",
        "alt=\"Slide 3\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 3</figcaption>\n",
        "</figure>\n",
        "\n",
        "Three years ago, we used language models like this. Hand it a review,\n",
        "ask for sentiment, get an answer back. 30 tokens. 0.2 seconds. Today\n",
        "we’re asking them to do *this*. Look at a whole codebase, find a bug,\n",
        "write a patch, run the test suite, and verify it worked. 12 million\n",
        "tokens. 20 minutes.\n",
        "\n",
        "### 4. The model isn’t solving the problem. The system is.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_004.png\"\n",
        "alt=\"Slide 4\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 4</figcaption>\n",
        "</figure>\n",
        "\n",
        "And this is what the telemetry for a modern coding task actually looks\n",
        "like. Hundreds of tool calls. Millions of tokens of output. Which brings\n",
        "me to the thesis of this talk: **the model isn’t solving the problem.\n",
        "The system is.**\n",
        "\n",
        "### 5. Hi, I’m Rajiv, and this is a masterclass on Harnesses.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_005.png\"\n",
        "alt=\"Slide 5\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 5</figcaption>\n",
        "</figure>\n",
        "\n",
        "I’m Rajiv Shah, Agentic AI Engineer at OpenHands. We build the\n",
        "open-source harness that wraps models like the one in that trace. The\n",
        "next hour is a practical tour of what’s actually inside a harness, and\n",
        "which decisions matter most. This is a workshop, so interrupt me. If\n",
        "something doesn’t land, ask. We’ll have explicit pauses to discuss along\n",
        "the way.\n",
        "\n",
        "### 6. This has been an evolution\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_006.png\"\n",
        "alt=\"Slide 6\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 6</figcaption>\n",
        "</figure>\n",
        "\n",
        "Quick step back before we dive in. Our focus as a community keeps moving\n",
        "up the stack. 2022 — we cared about weights. Fine-tuning, RLHF. 2023 —\n",
        "context. RAG, long context. 2024 — tools, skills, MCP. And now, in 2026,\n",
        "the outermost layer — the harness — is where the action is.\n",
        "\n",
        "### 7. What is in a harness?\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_007.png\"\n",
        "alt=\"Slide 7\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 7</figcaption>\n",
        "</figure>\n",
        "\n",
        "Let’s define it clearly, because this gets confused all the time. Agent,\n",
        "harness, SDK — people use these words interchangeably. Here’s the mental\n",
        "model I want you to carry for the rest of the hour. **The model reasons.\n",
        "The harness does everything else.** Look at this diagram. The model sits\n",
        "in the middle — it reasons and decides. Everything around it is the\n",
        "harness. Context injection on top: prompts, memory, skills,\n",
        "conversation. Control on the left: compaction, orchestration, loops.\n",
        "Action on the right: bash, tools, MCPs. Persistence at the bottom:\n",
        "filesystem, git, progress files. And observe and verify: browser\n",
        "screenshots, test results, logs. If you take nothing else from this\n",
        "talk: **Agent = Model + Harness.**\n",
        "\n",
        "### 8. A harness is everything outside the model\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_008.png\"\n",
        "alt=\"Slide 8\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 8</figcaption>\n",
        "</figure>\n",
        "\n",
        "Here’s the loop in concrete terms. Prompt → model → action → feedback.\n",
        "Not one shot. Modern coding agents iterate fifty to two hundred times to\n",
        "finish a task.\n",
        "\n",
        "### 9. A harness is everything outside the model\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_009.png\"\n",
        "alt=\"Slide 9\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 9</figcaption>\n",
        "</figure>\n",
        "\n",
        "And here’s what makes that loop *reliable* — the harness wrapping it.\n",
        "State — repo, memory, history, summaries. Control — iteration limits,\n",
        "retries, task contracts. An execution engine that turns decisions into\n",
        "shell commands and file edits. Feedback ingestion that captures test\n",
        "failures and feeds them back. All that scaffolding on the outside is\n",
        "what turns a `while` loop into an agent that actually finishes work.\n",
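        "\n",
        "A minimal sketch of that wrapper, assuming a hypothetical `fake_model`\n",
        "function in place of a real LLM call and a single hypothetical\n",
        "`run_tests` action. None of these names come from the OpenHands SDK or\n",
        "any other harness. It only illustrates the pieces named above: an\n",
        "iteration limit, an execution step, and feedback appended to history.\n",
        "\n",
        "```python\n",
        "from dataclasses import dataclass, field\n",
        "\n",
        "\n",
        "@dataclass\n",
        "class State:\n",
        "    task: str\n",
        "    history: list = field(default_factory=list)  # (action, feedback) pairs\n",
        "\n",
        "\n",
        "def fake_model(state):\n",
        "    # Hypothetical stand-in for an LLM call: rerun tests until they pass.\n",
        "    if state.history and state.history[-1][1] == 'tests passed':\n",
        "        return 'done'\n",
        "    return 'run_tests'\n",
        "\n",
        "\n",
        "def execute(action, attempt):\n",
        "    # Hypothetical execution engine: the tests pass on the third attempt.\n",
        "    if action == 'run_tests':\n",
        "        return 'tests passed' if attempt >= 3 else 'tests failed'\n",
        "    return 'unknown action'\n",
        "\n",
        "\n",
        "def run_agent(task, max_iterations=10):\n",
        "    state = State(task=task)\n",
        "    for attempt in range(1, max_iterations + 1):  # control: iteration limit\n",
        "        action = fake_model(state)                # the model decides\n",
        "        if action == 'done':\n",
        "            return state\n",
        "        feedback = execute(action, attempt)       # the harness acts\n",
        "        state.history.append((action, feedback))  # feedback ingestion\n",
        "    return state\n",
        "\n",
        "\n",
        "final = run_agent('fix the failing test')\n",
        "print(len(final.history), final.history[-1])\n",
        "```\n",
        "\n",
        "Lowering `max_iterations` below three makes the loop stop before the\n",
        "tests pass, which is the kind of budget decision a harness makes for you.\n",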
        "\n",
        "### 10. A good SDK abstracts the harness for agentic actions\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_010.png\"\n",
        "alt=\"Slide 10\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 10</figcaption>\n",
        "</figure>\n",
        "\n",
        "You don’t write this from scratch. We built the OpenHands SDK so you can\n",
        "wire up a workspace, an agent, tools, and a conversation loop in about\n",
        "twenty lines of code. Claude Code, Codex CLI, Factory, OpenHands — every\n",
        "serious harness abstracts this same set of concerns. The SDK hides the\n",
        "plumbing.\n",
        "\n",
        "### 11. Same model, 2× performance gap\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_011.png\"\n",
        "alt=\"Slide 11\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 11</figcaption>\n",
        "</figure>\n",
        "\n",
        "And the harness matters. A lot. Three pieces of evidence. First: same\n",
        "model — Claude Opus 4.5. Put it in Claude Code’s harness, it hits 95% on\n",
        "CORE-Bench Hard. Put it in a naive Hugging Face Smolagents setup, it\n",
        "drops to 42%. Same weights. Same intelligence. The harness alone moves\n",
        "you fifty-three percentage points.\n",
        "\n",
        "### 12. Everyone on the leaderboard uses the same model\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_012.png\"\n",
        "alt=\"Slide 12\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 12</figcaption>\n",
        "</figure>\n",
        "\n",
        "Second: Terminal Bench leaderboard, today. Look at the model column.\n",
        "Claude Opus 4.6, all the way down. Every top entry is running the same\n",
        "model — they’re competing on *harness design*. That’s the entire\n",
        "contest.\n",
        "\n",
        "### 13. Small model + good harness \\> big model\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_013.png\"\n",
        "alt=\"Slide 13\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 13</figcaption>\n",
        "</figure>\n",
        "\n",
        "Third: a great harness can flip the math entirely. The AutoHarness paper\n",
        "pairs Gemini 2.5 Flash — a small, cheap model — with a well-designed\n",
        "harness, and it beats GPT-5.2 High. Cheaper model, better result,\n",
        "because the scaffolding was better. Three independent proofs, same\n",
        "conclusion. The harness decides more than the model.\n",
        "\n",
        "### 14. What harness do you use?\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_014.png\"\n",
        "alt=\"Slide 14\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 14</figcaption>\n",
        "</figure>\n",
        "\n",
        "Quick show of hands before we go deeper. Who here uses **Claude Code**?\n",
        "**Codex CLI**? **Cursor**? **OpenHands**? Something else? Cool. Hold\n",
        "onto that. We’re going to come back to what your harness is deciding for\n",
        "you.\n",
        "\n",
        "### 15. Harness carries a lot of decisions\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_015.png\"\n",
        "alt=\"Slide 15\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 15</figcaption>\n",
        "</figure>\n",
        "\n",
        "Which means the next question is: whose harness? Take Claude Code versus\n",
        "Codex CLI. Same category of product. Very different harness decisions.\n",
        "How permissions work. How CLAUDE.md vs AGENTS.md gets loaded. How the\n",
        "sandbox behaves. How much of the internals you can see. Every row in\n",
        "this table is a decision somebody already made for you.\n",
        "\n",
        "### 16. Harnesses carry technical debt\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_016.png\"\n",
        "alt=\"Slide 16\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 16</figcaption>\n",
        "</figure>\n",
        "\n",
        "This is the setup slide for the next argument: harnesses accumulate\n",
        "technical debt quickly, so yesterday’s orchestration defaults become\n",
        "today’s hidden liabilities.\n",
        "\n",
        "### 17. Harnesses are evolving with the models.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_017.png\"\n",
        "alt=\"Slide 17\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 17</figcaption>\n",
        "</figure>\n",
        "\n",
        "And those decisions don’t stay put. Three years ago we were all using\n",
        "AutoGen and CrewAI — nobody runs those in production anymore. Boris\n",
        "Cherny, who leads Claude Code at Anthropic, frames it this way: as\n",
        "models improve, your harness should get *simpler*. Stress-test your\n",
        "harness. Is this code load-bearing, or just legacy overhead? Manus is\n",
        "the cleanest example — they rebuilt their harness *five times* in six\n",
        "months. Each rewrite *removed* complexity. Complex tool definitions\n",
        "became general shell execution. Management agents became simple\n",
        "structured handoffs. So: are harnesses actually getting simpler? Let’s\n",
        "check.\n",
        "\n",
        "### 18. As models improved, who has noticed the trend towards shorter system prompts?\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_018.png\"\n",
        "alt=\"Slide 18\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 18</figcaption>\n",
        "</figure>\n",
        "\n",
        "Show of hands — anyone here notice that as models have gotten better,\n",
        "system prompts have gotten *shorter*? That’s the story Boris and a lot\n",
        "of harness builders tell: better models, less hand-holding. I wanted to\n",
        "check that. Let me show you what I found.\n",
        "\n",
        "### 19. System prompts getting longer!\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_019.png\"\n",
        "alt=\"Slide 19\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 19</figcaption>\n",
        "</figure>\n",
        "\n",
        "I pulled the last four Claude Opus system prompts — May 2025 through\n",
        "April 2026 — and measured them by category. The prompt *more than\n",
        "doubled* in eleven months. **1,714 words to 3,686 words.** Safety nearly\n",
        "*tripled*. Behavioral guidance roughly *tripled*. Identity barely moved.\n",
        "Knowledge stayed flat. So the honest answer is: *some* things get\n",
        "simpler — the behavioral patches for letter-counting, for puzzle\n",
        "constraints — those got retired into training. But the structural stuff\n",
        "— safety, tools, agentic guidance — keeps growing. **Don’t take the\n",
        "vendor story at face value. Measure.**\n",
        "\n",
        "### 20. Claude Code Harness / Architecture\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_020.png\"\n",
        "alt=\"Slide 20\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 20</figcaption>\n",
        "</figure>\n",
        "\n",
        "And once you accept that the vendor’s story doesn’t match reality, the\n",
        "next question is: what *is* actually under the hood? Here’s what Claude\n",
        "Code’s harness looks like — this came out of a leak. Agent loop.\n",
        "Compaction pipeline. Permission system. Hook pipeline. MCP tools.\n",
        "Subagent spawning. Shell sandbox. It’s a lot. And every one of those\n",
        "boxes is a decision made for you.\n",
        "\n",
        "### 21. Harnesses have bugs\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_021.png\"\n",
        "alt=\"Slide 21\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 21</figcaption>\n",
        "</figure>\n",
        "\n",
        "And that complexity has real consequences. A few days ago, Anthropic\n",
        "posted a postmortem. Claude Code had a regression. **The *model* didn’t\n",
        "change. The *harness* changed.** Default reasoning dropped from high to\n",
        "medium. A bug evicted thinking blocks to save cache. A system prompt\n",
        "tweak reduced verbosity — which reduced code quality. Three harness\n",
        "bugs. Users felt them immediately.\n",
        "\n",
        "### 22. The 5 Levers of Harness Engineering\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_022.png\"\n",
        "alt=\"Slide 22\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 22</figcaption>\n",
        "</figure>\n",
        "\n",
        "Which brings us to the roadmap. If you want to own your harness — and\n",
        "I’m hoping by now you want to — there are five levers you control. The\n",
        "model is one, but we all know how to swap models. The next four are\n",
        "where the variance actually lives: **retrieval, memory, loops, and\n",
        "architecture.** That’s the rest of this hour. Starting with retrieval.\n",
        "\n",
        "### 23. Let’s start with how agents find what they need.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_023.png\"\n",
        "alt=\"Slide 23\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 23</figcaption>\n",
        "</figure>\n",
        "\n",
        "First lever: how the agent finds what it needs. Retrieval.\n",
        "\n",
        "### 24. The three modes of Agentic Retrieval.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_024.png\"\n",
        "alt=\"Slide 24\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 24</figcaption>\n",
        "</figure>\n",
        "\n",
        "Three buckets. Lexical search: keyword-based. Semantic search: meaning\n",
        "captured with embeddings. And agentic search: dynamic queries driven by\n",
        "the model’s own reasoning. The industry default is to reach for\n",
        "semantic. For coding agents, you need to reconsider that.\n",
        "\n",
        "### 25. The Baseline: grep.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_025.png\"\n",
        "alt=\"Slide 25\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 25</figcaption>\n",
        "</figure>\n",
        "\n",
        "The baseline isn’t a vector database. It’s `grep`. Keyword precision.\n",
        "Sub-second latency. Battle-tested. If you can solve your problem with\n",
        "`grep`, you’ve already won.\n",
        "\n",
        "### 26. State-of-the-Art Coding Agents rely on grep\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_026.png\"\n",
        "alt=\"Slide 26\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 26</figcaption>\n",
        "</figure>\n",
        "\n",
        "And this isn’t a contrarian take. Look at Claude Code’s own docs — Glob\n",
        "and Grep right there in the tool reference. Pure lexical search. Boris\n",
        "Cherny and Catherine Wu, the architects, went on record: they tested\n",
        "lexical versus vector and lexical was much better for code. Cursor is\n",
        "reportedly ripping out their entire vector search infrastructure.\n",
        "\n",
        "### 27. Inverted Indices (BM25) make grep instant.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_027.png\"\n",
        "alt=\"Slide 27\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 27</figcaption>\n",
        "</figure>\n",
        "\n",
        "And if you need to scale `grep` across a big repo, BM25 over an inverted\n",
        "index makes it near-instant. It is the same idea Lucene has used for\n",
        "twenty years: look up each query term in the index instead of scanning\n",
        "every file. Look at this table: linear `grep` across 9,000 docs takes\n",
        "thirty seconds. BM25 does it in 360 milliseconds. You don’t need a\n",
        "vector DB to search code.\n",
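        "\n",
        "A minimal sketch of BM25 scoring over an inverted index, in plain\n",
        "Python. The function names, the whitespace tokenizer, the toy corpus,\n",
        "and the `k1` and `b` defaults are assumptions for illustration. This is\n",
        "not how any particular harness or Lucene implements it.\n",
        "\n",
        "```python\n",
        "import math\n",
        "from collections import Counter, defaultdict\n",
        "\n",
        "\n",
        "def build_index(docs):\n",
        "    # Inverted index: term -> {doc_id: term frequency}.\n",
        "    index = defaultdict(dict)\n",
        "    lengths = {}\n",
        "    for doc_id, text in docs.items():\n",
        "        tokens = text.lower().split()\n",
        "        lengths[doc_id] = len(tokens)\n",
        "        for term, tf in Counter(tokens).items():\n",
        "            index[term][doc_id] = tf\n",
        "    return index, lengths\n",
        "\n",
        "\n",
        "def bm25_search(query, index, lengths, k1=1.5, b=0.75):\n",
        "    n_docs = len(lengths)\n",
        "    avg_len = sum(lengths.values()) / n_docs\n",
        "    scores = defaultdict(float)\n",
        "    for term in query.lower().split():\n",
        "        postings = index.get(term, {})  # only docs containing the term\n",
        "        if not postings:\n",
        "            continue\n",
        "        df = len(postings)\n",
        "        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))\n",
        "        for doc_id, tf in postings.items():\n",
        "            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)\n",
        "            scores[doc_id] += idf * tf * (k1 + 1) / norm\n",
        "    return sorted(scores.items(), key=lambda item: item[1], reverse=True)\n",
        "\n",
        "\n",
        "docs = {\n",
        "    'auth.py': 'def login user password check token',\n",
        "    'db.py': 'def connect database pool retry',\n",
        "    'notes.md': 'release checklist for the team',\n",
        "}\n",
        "index, lengths = build_index(docs)\n",
        "print(bm25_search('database retry', index, lengths))\n",
        "```\n",
        "\n",
        "The query only touches the posting lists for its own terms, so the cost\n",
        "depends on how many documents contain those terms rather than on the\n",
        "size of the whole corpus.\n",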
        "\n",
        "### 28. Context Harness Uses BM25\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_028.png\"\n",
        "alt=\"Slide 28\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 28</figcaption>\n",
        "</figure>\n",
        "\n",
        "This slide translates the BM25 point into harness design: once the\n",
        "environment is indexed, lexical retrieval stays fast enough to be the\n",
        "default inside many coding agents.\n",
        "\n",
        "### 29. Where Lexical breaks down: The Synonym Gap.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_029.png\"\n",
        "alt=\"Slide 29\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 29</figcaption>\n",
        "</figure>\n",
        "\n",
        "Where lexical does break: the synonym gap. You search for ‘physician,’\n",
        "the doc says ‘doctor.’ BM25 misses it. You search for ‘IBM,’ the doc\n",
        "says ‘International Business Machines.’ BM25 misses that too. If your\n",
        "queries don’t share tokens with the source, lexical can’t help you.\n",
        "\n",
        "### 30. Embeddings solve for meaning.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_030.png\"\n",
        "alt=\"Slide 30\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 30</figcaption>\n",
        "</figure>\n",
        "\n",
        "That’s where embeddings come in. Encode meaning, not tokens. Cosine\n",
        "similarity finds documents that *mean* the same thing even if they don’t\n",
        "share words.\n",
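        "\n",
        "To make the cosine similarity step concrete, here is a small sketch.\n",
        "The three-dimensional vectors are hand-picked, hypothetical embeddings\n",
        "and do not come from any real embedding model. They only show how two\n",
        "words with no shared tokens can still rank as close.\n",
        "\n",
        "```python\n",
        "import math\n",
        "\n",
        "\n",
        "def cosine(a, b):\n",
        "    dot = sum(x * y for x, y in zip(a, b))\n",
        "    norm_a = math.sqrt(sum(x * x for x in a))\n",
        "    norm_b = math.sqrt(sum(y * y for y in b))\n",
        "    return dot / (norm_a * norm_b)\n",
        "\n",
        "\n",
        "# Hypothetical embeddings, chosen by hand for illustration only.\n",
        "embeddings = {\n",
        "    'physician': [0.9, 0.1, 0.0],\n",
        "    'doctor': [0.8, 0.2, 0.1],\n",
        "    'database': [0.0, 0.1, 0.9],\n",
        "}\n",
        "\n",
        "query = embeddings['physician']\n",
        "ranked = sorted(\n",
        "    (word for word in embeddings if word != 'physician'),\n",
        "    key=lambda word: cosine(query, embeddings[word]),\n",
        "    reverse=True,\n",
        ")\n",
        "print(ranked)\n",
        "```\n",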
        "\n",
        "### 31. Semantic search is useful for massive codebases.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_031.png\"\n",
        "alt=\"Slide 31\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 31</figcaption>\n",
        "</figure>\n",
        "\n",
        "And for massive codebases, semantic earns its keep. Cursor’s telemetry\n",
        "shows about a 12.5% accuracy bump on large repos. The changes are also\n",
        "more likely to *stick* — to actually get retained in production.\n",
        "\n",
        "### 32. Who’s using Agentic Search?\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_032.png\"\n",
        "alt=\"Slide 32\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 32</figcaption>\n",
        "</figure>\n",
        "\n",
        "Agentic search is already showing up in real tools. The practical\n",
        "question is when the extra latency is worth paying for the added\n",
        "reasoning power.\n",
        "\n",
        "### 33. Agentic RAG\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_033.png\"\n",
        "alt=\"Slide 33\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 33</figcaption>\n",
        "</figure>\n",
        "\n",
        "Which brings us to agentic search. Different paradigm entirely. Watch\n",
        "traditional RAG explore once and stop, versus an agent that keeps\n",
        "querying.\n",
        "\n",
        "### 34. Agentic Search trades latency for massive accuracy.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_034.png\"\n",
        "alt=\"Slide 34\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 34</figcaption>\n",
        "</figure>\n",
        "\n",
        "Traditional RAG makes one query, returns ten chunks, you’re done.\n",
        "Agentic search hands the LLM the search tool. The model looks at what it\n",
        "got, realizes it didn’t find the answer, rewrites the query, and tries\n",
        "again. **Five seconds becomes twenty-five.** But accuracy goes from\n",
        "**76% to 93%** on WixQA. That’s the trade.\n",
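        "\n",
        "The retry loop can be sketched as follows. The `search` tool, the\n",
        "`rewrite` step, and the toy corpus are hypothetical. In a real agent the\n",
        "model itself would write the new query after reading the results, and\n",
        "the lookup table here only stands in for that reasoning.\n",
        "\n",
        "```python\n",
        "def search(query, corpus):\n",
        "    # Hypothetical search tool: docs sharing at least one token with the query.\n",
        "    terms = set(query.lower().split())\n",
        "    return [doc for doc in corpus if terms & set(doc.lower().split())]\n",
        "\n",
        "\n",
        "def rewrite(query):\n",
        "    # Stand-in for the model rewriting its own query after a miss.\n",
        "    synonyms = {'physician': 'doctor'}\n",
        "    return ' '.join(synonyms.get(word, word) for word in query.split())\n",
        "\n",
        "\n",
        "def agentic_search(query, corpus, max_rounds=3):\n",
        "    rounds = 0\n",
        "    results = []\n",
        "    while rounds < max_rounds:\n",
        "        rounds += 1\n",
        "        results = search(query, corpus)\n",
        "        if results:  # found something, stop paying latency\n",
        "            break\n",
        "        new_query = rewrite(query)\n",
        "        if new_query == query:  # nothing left to try\n",
        "            break\n",
        "        query = new_query\n",
        "    return results, rounds\n",
        "\n",
        "\n",
        "corpus = ['the doctor updated the chart', 'database pool retry settings']\n",
        "print(search('physician', corpus))          # single-shot lookup\n",
        "print(agentic_search('physician', corpus))  # loop with a rewritten query\n",
        "```\n",
        "\n",
        "Each extra round is another search call plus another model call, which\n",
        "is where the added latency comes from.\n",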
        "\n",
        "### 35. Coding agents are really good at long Context\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_035.png\"\n",
        "alt=\"Slide 35\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 35</figcaption>\n",
        "</figure>\n",
        "\n",
        "For coding specifically, agents with file access dominate. A coding\n",
        "agent with just bash and `grep` beats traditional RAG and ReAct by 22%\n",
        "to 77% across long-context benchmarks. The model gets to *see* the code,\n",
        "not just chunks of it.\n",
        "\n",
        "### 36. Use files instead of chunking RAG approach.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_036.png\"\n",
        "alt=\"Slide 36\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 36</figcaption>\n",
        "</figure>\n",
        "\n",
        "Which leads to the rule: don’t chunk if you don’t have to. If your\n",
        "model’s context window can hold the file, give it the whole file. Look\n",
        "at this — Gemini 3 Pro with whole-file BM25 access ties human\n",
        "performance and beats chunked RAG.\n",
        "\n",
        "### 37. So when should you move to a database\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_037.png\"\n",
        "alt=\"Slide 37\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 37</figcaption>\n",
        "</figure>\n",
        "\n",
        "Files stay the source of truth. Move to a database only when you need\n",
        "metadata joins, ranking, or recency. Otherwise stay on files. Add an\n",
        "index on top when query cost bites.\n",
        "\n",
        "### 38. Design rules for you\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_038.png\"\n",
        "alt=\"Slide 38\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 38</figcaption>\n",
        "</figure>\n",
        "\n",
        "Three rules for the retrieval layer:\n",
        "\n",
        "1. Lexical baseline. BM25 or `grep` is your default.\n",
        "2. Add semantics only when you suffer vocabulary mismatch.\n",
        "3. Loop it if accuracy matters. Put retrieval inside an iterative\n",
        "   agentic loop.\n",
        "\n",
        "### 39. Let’s start with how agents find what they need.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_039.png\"\n",
        "alt=\"Slide 39\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 39</figcaption>\n",
        "</figure>\n",
        "\n",
        "This transition moves from retrieval to memory: even if an agent finds\n",
        "the right information, it still needs to remember the right parts of it.\n",
        "\n",
        "### 40. Are you excited about 10M Context Windows?\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_040.png\"\n",
        "alt=\"Slide 40\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 40</figcaption>\n",
        "</figure>\n",
        "\n",
        "Show of hands — who’s *excited* about 10 million token context windows?\n",
        "A few of you. Let me show you why I’m not.\n",
        "\n",
        "### 41. Memory & State\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_041.png\"\n",
        "alt=\"Slide 41\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 41</figcaption>\n",
        "</figure>\n",
        "\n",
        "Second lever: memory. Retrieval gets the agent the right facts. Memory\n",
        "is what stops it from dropping those facts on the floor three turns\n",
        "later. **Thesis: agents usually fail because they lose the thread of\n",
        "what they’re doing — not because they hit the token limit.** Token\n",
        "limits you can pay your way out of. Losing the thread is harder.\n",
        "\n",
        "### 42. 1M Context Windows are never enough.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_042.png\"\n",
        "alt=\"Slide 42\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 42</figcaption>\n",
        "</figure>\n",
        "\n",
        "Yes, we have 1M context windows now. We’ll have 10M soon. But look at\n",
        "the long-context retrieval benchmarks. Every model degrades sharply past\n",
        "128K. Even Opus 4.6, the best one here, drops from **90% to 78%** by the\n",
        "time you stuff it full. **Models degrade when you stuff them full.** The\n",
        "benchmark capability is not the production capability.\n",
        "\n",
        "### 43. 1M Context Windows degrade\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_043.png\"\n",
        "alt=\"Slide 43\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 43</figcaption>\n",
        "</figure>\n",
        "\n",
        "This slide reinforces the warning from the previous one: bigger windows\n",
        "do not just cost more, they degrade, so the harness has to decide what\n",
        "deserves to stay in context.\n",
        "\n",
        "### 44. Key facts disappear inside long model inputs\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_044.png\"\n",
        "alt=\"Slide 44\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 44</figcaption>\n",
        "</figure>\n",
        "\n",
        "And it’s not just degradation, it’s *selective* degradation. Information\n",
        "from the middle of a long input shows up in the model’s summary with a\n",
        "cosine similarity of only 0.47, versus 0.68 for information at the\n",
        "start. The middle of your context window is where facts go to die.\n",
        "\n",
        "### 45. Coding agents struggle with long context models\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_045.png\"\n",
        "alt=\"Slide 45\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 45</figcaption>\n",
        "</figure>\n",
        "\n",
        "It’s easy to burn your context. Watch out for verbose tool traces — you\n",
        "run `npm install` and get hundreds of lines of deprecation warnings. The\n",
        "agent’s actual goal — what you originally asked it to do — gets pushed\n",
        "out of the window. Vercel devs have shared screenshots of Claude Code\n",
        "degrading sharply past 200K tokens of this kind of noise.\n",
        "\n",
        "### 46. Three layers of Memory\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_046.png\"\n",
        "alt=\"Slide 46\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 46</figcaption>\n",
        "</figure>\n",
        "\n",
        "So treat memory as three explicit layers, not one giant string. Active\n",
        "Context — what’s in the prompt right now. Working State — plans, TODOs,\n",
        "scratchpads outside the prompt. Durable Memory — skills and reusable\n",
        "workflows that persist across sessions.\n",
        "\n",
        "### 47. Layer 1: Fixing Active Context\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_047.png\"\n",
        "alt=\"Slide 47\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 47</figcaption>\n",
        "</figure>\n",
        "\n",
        "Layer 1: Active Context. Two ways to manage it. Reset — clear the window\n",
        "entirely, refill with only the original instructions and critical\n",
        "artifacts. Or Compact — summarize older turns, but keep recent turns\n",
        "intact.\n",
        "\n",
        "### 48. Layer 1: Compacting from OpenHands\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_048.png\"\n",
        "alt=\"Slide 48\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 48</figcaption>\n",
        "</figure>\n",
        "\n",
        "Compaction works. We measure it at OpenHands. Up to 2× per-turn API cost\n",
        "reduction. Consistent response times in long sessions. And — this\n",
        "surprised people — equivalent or *better* performance on software\n",
        "engineering tasks. Less context, sharper agent. ( ACON research backs\n",
        "this up — 26 to 54% token reduction while preserving 95%-plus accuracy,\n",
        "by prioritizing reasoning traces over raw tool outputs.\n",
        "\n",
        "### 49. Layer 1: How does Codex do it???\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_049.png\"\n",
        "alt=\"Slide 49\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 49</figcaption>\n",
        "</figure>\n",
        "\n",
        "This is the uneasy part of closed harnesses: compaction is happening for\n",
        "you, but you usually cannot inspect or tune the policy that decides what\n",
        "gets thrown away.\n",
        "\n",
        "### 50. Getting the most of a 1M Context Windows\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_050.png\"\n",
        "alt=\"Slide 50\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 50</figcaption>\n",
        "</figure>\n",
        "\n",
        "Compaction is just one move. Anthropic publishes a fuller toolkit:\n",
        "continue, rewind, `/clear`, `/compact`, subagents. Five different ways\n",
        "to actively manage what’s in the window. Each one is its own policy\n",
        "decision. Let me zoom in on one I like.\n",
        "\n",
        "One concrete technique Rajiv calls out here is rewind: if the agent went\n",
        "down a bad branch, jump back instead of dragging the failed reasoning\n",
        "forward.\n",
        "\n",
        "### 51. Layer 2: Working State (The Golden Rule)\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_051.png\"\n",
        "alt=\"Slide 51\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 51</figcaption>\n",
        "</figure>\n",
        "\n",
        "Layer 2: Working State. The golden rule — **files make better memory\n",
        "than chat.** Don’t keep the agent’s plan in the system prompt. Have it\n",
        "write a `plan.md` to the workspace. The plan stays out of the context\n",
        "window, but the agent reads from it and checks items off.\n",
        "\n",
        "### 52. Deep Agents rely on external plans.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_052.png\"\n",
        "alt=\"Slide 52\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 52</figcaption>\n",
        "</figure>\n",
        "\n",
        "LangChain’s Deep Agents do exactly this. A `write_todos` tool dumps the\n",
        "plan to a file. The agent reads it, executes step one, updates the file,\n",
        "moves on. The plan never sits in the prompt.\n",
        "\n",
        "### 53. Extreme Layer 2: Recursive Language Models (RLM)\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_053.png\"\n",
        "alt=\"Slide 53\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 53</figcaption>\n",
        "</figure>\n",
        "\n",
        "At the extreme end of Layer 2, you get Recursive Language Models. RLMs\n",
        "bypass token limits entirely by giving the agent a persistent Python\n",
        "REPL. Variables stay in the REPL between calls. The agent uses the REPL\n",
        "as memory. Recursive sub-LM calls handle scoped subqueries.\n",
        "\n",
        "### 54. RLMs maintain accuracy at 1M tokens.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_054.png\"\n",
        "alt=\"Slide 54\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 54</figcaption>\n",
        "</figure>\n",
        "\n",
        "And it works. Standard GPT-5 collapses past 33K tokens on long-context\n",
        "tasks. RLM-GPT-5 maintains 91% accuracy all the way to 1M. Different\n",
        "memory architecture, different ceiling.\n",
        "\n",
        "### 55. Layer 3: Durable Memory\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_055.png\"\n",
        "alt=\"Slide 55\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 55</figcaption>\n",
        "</figure>\n",
        "\n",
        "Layer 3: Durable Memory. What does the agent remember across sessions?\n",
        "You’ve all seen the ChatGPT version of this — the personalization toggle\n",
        "that promises to remember everything you tell it.\n",
        "\n",
        "### 56. Who uses an Agents.md file?\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_056.png\"\n",
        "alt=\"Slide 56\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 56</figcaption>\n",
        "</figure>\n",
        "\n",
        "Quick poll. Show of hands — who has an `AGENTS.md` or `CLAUDE.md` in\n",
        "their repo right now? Keep your hand up if you wrote it by hand. Now\n",
        "keep it up if you let the model auto-generate it for you. That\n",
        "distinction is going to matter a lot more than you’d think. Let me show\n",
        "you.\n",
        "\n",
        "### 57. Durable Memory with Agents.md\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_057.png\"\n",
        "alt=\"Slide 57\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 57</figcaption>\n",
        "</figure>\n",
        "\n",
        "The simple version of intentional memory is `AGENTS.md`. Open format.\n",
        "Adopted by 60,000+ projects. Codex, Cursor, Factory, VS Code, Devin all\n",
        "read it. Dev environment tips, testing instructions, PR conventions. But\n",
        "— and this matters — it’s in *every* prompt. Don’t overload it.\n",
        "\n",
        "### 58. Auto-generated AGENTS.md files hurt performance\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_058.png\"\n",
        "alt=\"Slide 58\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 58</figcaption>\n",
        "</figure>\n",
        "\n",
        "In fact, ETH Zurich just published a study: auto-generated `AGENTS.md`\n",
        "files actively *reduce* task success across multiple coding agents.\n",
        "Sonnet, GPT-5, Qwen — all show degradation. Inference cost goes up by\n",
        "over 20%. The agent wastes tokens reading boilerplate it doesn’t need.\n",
        "\n",
        "### 59. The rule of thumb: “Minimize Load-Bearing Memory”\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_059.png\"\n",
        "alt=\"Slide 59\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 59</figcaption>\n",
        "</figure>\n",
        "\n",
        "Boris Cherny’s rule of thumb here: minimum load-bearing memory. Quote —\n",
        "‘do the minimal possible thing to get the model on track.’ Delete your\n",
        "CLAUDE.md. If the model wanders off, add back one piece. With every new\n",
        "model, you’ll find you need less and less.\n",
        "\n",
        "### 60. Skills are the new standard for Durable Memory.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_060.png\"\n",
        "alt=\"Slide 60\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 60</figcaption>\n",
        "</figure>\n",
        "\n",
        "Which brings us to skills. A skill isn’t a prompt. It’s a trigger plus a\n",
        "reference manual plus a script. Loaded on demand, modular, reusable. The\n",
        "skill says: *when* to run, *how* to execute, *what* the rules are.\n",
        "Anthropic, Cursor, VS Code all support them now.\n",
        "\n",
        "### 61. Skills as Externalized Expertise\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_061.png\"\n",
        "alt=\"Slide 61\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 61</figcaption>\n",
        "</figure>\n",
        "\n",
        "This is the big idea: skills externalize *expertise* the way memory\n",
        "externalizes *state*. Authored, distilled, discovered, composed. The\n",
        "agent gets a registry of expert procedures it can invoke when the\n",
        "trigger matches.\n",
        "\n",
        "### 62. Skills can replace Code\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_062.png\"\n",
        "alt=\"Slide 62\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 62</figcaption>\n",
        "</figure>\n",
        "\n",
        "Cursor talked about this at AI Engineering London. They had 15,000+\n",
        "lines of code for worktree creation, agent loop scoping, harness\n",
        "changes, reminders — all the orchestration glue. They replaced it with a\n",
        "200-line skill. Not refactored. *Replaced.*\n",
        "\n",
        "### 63. Building a learning loop with skills\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_063.png\"\n",
        "alt=\"Slide 63\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 63</figcaption>\n",
        "</figure>\n",
        "\n",
        "And skills enable continual learning. The Hermes Agent pattern: agent\n",
        "attempts a complex task, gets a periodic nudge — ‘what would you do\n",
        "differently?’ — and writes a skill file. Next time it fails, it edits\n",
        "its own skill. The harness teaches itself.\n",
        "\n",
        "### 64. Continual learning outer loop with skills\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_064.png\"\n",
        "alt=\"Slide 64\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 64</figcaption>\n",
        "</figure>\n",
        "\n",
        "That’s actually two loops. Inner loop finishes the task in one session.\n",
        "Outer loop, across sessions, makes the agent smarter. Session 1 fails,\n",
        "skill gets updated, Session 2 succeeds. This is the shape of\n",
        "self-improving agents in production today.\n",
        "\n",
        "### 65. Warning: 16% of skills actually reduce performance.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_065.png\"\n",
        "alt=\"Slide 65\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 65</figcaption>\n",
        "</figure>\n",
        "\n",
        "But skills aren’t free. SkillsBench tested 50+ skills across multiple\n",
        "coding agents. 16% of them actively *reduce* performance. They overlap\n",
        "with native tools, confuse the routing, or trigger when they shouldn’t.\n",
        "Skills are powerful and dangerous in the same way prompts are.\n",
        "\n",
        "### 66. You must evaluate your skills.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_066.png\"\n",
        "alt=\"Slide 66\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 66</figcaption>\n",
        "</figure>\n",
        "\n",
        "So evaluate your skills. Without-skill versus with-skill, on the same\n",
        "tasks. There’s a tutorial in my GitHub repo. If you’re not measuring\n",
        "lift, you’re guessing.\n",
        "\n",
        "### 67. Constant Innovation around Memory\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_067.png\"\n",
        "alt=\"Slide 67\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 67</figcaption>\n",
        "</figure>\n",
        "\n",
        "Memory design is still evolving. Compaction, durable memory, and skills\n",
        "all change with the models, which is why these policies need to be\n",
        "revisited rather than frozen.\n",
        "\n",
        "### 68. @rajistics\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_068.png\"\n",
        "alt=\"Slide 68\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 68</figcaption>\n",
        "</figure>\n",
        "\n",
        "A brief interstitial before the lock-in discussion, pointing back to\n",
        "Rajiv’s broader work on agents and memory.\n",
        "\n",
        "### 69. Memory & Claude\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_069.png\"\n",
        "alt=\"Slide 69\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 69</figcaption>\n",
        "</figure>\n",
        "\n",
        "Now notice the pattern. With a closed harness, all of this is locked in.\n",
        "Conversation compaction. CLAUDE.md handling. Tool search and MCP\n",
        "loading. Subagent definitions. Permission rules. You don’t see it. You\n",
        "don’t tune it. You can’t measure it.\n",
        "\n",
        "### 70. Memory as Lock-in\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_070.png\"\n",
        "alt=\"Slide 70\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 70</figcaption>\n",
        "</figure>\n",
        "\n",
        "Closed harness — memory lives in the provider’s API, compaction is\n",
        "encrypted, switch providers and you lose your history. Open harness —\n",
        "memory lives in your files, compaction is your code, switch providers\n",
        "and the memory stays. **Lock the harness, lose the memory, lose your\n",
        "product.**\n",
        "\n",
        "### 71. Let’s start with how agents find what they need.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_071.png\"\n",
        "alt=\"Slide 71\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 71</figcaption>\n",
        "</figure>\n",
        "\n",
        "This divider shifts from memory into loop design: once agents can\n",
        "retrieve and remember, the next question is whether they can act with\n",
        "discipline.\n",
        "\n",
        "### 72. Agentic Loops and Tool Use\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_072.png\"\n",
        "alt=\"Slide 72\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 72</figcaption>\n",
        "</figure>\n",
        "\n",
        "Core argument — **better tool loops beat better prompts.** The leverage\n",
        "isn’t in writing a smarter system prompt. It’s in shaping the loop the\n",
        "agent runs inside.\n",
        "\n",
        "### 73. Engineering the Loop\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_073.png\"\n",
        "alt=\"Slide 73\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 73</figcaption>\n",
        "</figure>\n",
        "\n",
        "Four stages to walk through: 1. The baseline — the Ralph Wiggum loop,\n",
        "and why it fails. 2. Cognitive discipline — forcing thinking via JSON\n",
        "schemas. 3. Environmental discipline — using tests and CI to physically\n",
        "block bad loops. 4. Safety and friction — sandboxing autonomous actions.\n",
        "And here’s the math that makes this matter. A 10-step process with 99%\n",
        "per-step success has only 90.4% end-to-end success. At 50 steps you’re\n",
        "at 60%. Errors compound fast. Every stage we walk through exists to keep\n",
        "that compounding curve under control.\n",
        "\n",
        "### 74. We no longer rely on single-shot execution.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_074.png\"\n",
        "alt=\"Slide 74\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 74</figcaption>\n",
        "</figure>\n",
        "\n",
        "Quick stake in the ground. Single-shot is dead for long tasks. Models\n",
        "must use tools, see the result, loop. That’s the entire architectural\n",
        "lesson of OpenAI’s o1.\n",
        "\n",
        "### 75. We no longer rely on single-shot execution.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_075.png\"\n",
        "alt=\"Slide 75\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 75</figcaption>\n",
        "</figure>\n",
        "\n",
        "Look at Opus 4.7’s published tool list — over twenty tools shipped with\n",
        "the model. `bash_tool`, `web_search`, `tool_search`, `view`,\n",
        "`create_file`, `str_replace`. Every one designed for a model that plans,\n",
        "acts, and iterates. **The model itself is now a harness-aware\n",
        "artifact.**\n",
        "\n",
        "### 76. The Default: The “Ralph Wiggum” Agent\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_076.png\"\n",
        "alt=\"Slide 76\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 76</figcaption>\n",
        "</figure>\n",
        "\n",
        "But a `while` loop isn’t an agent. The default behavior of most\n",
        "frameworks is what I call the Ralph Wiggum loop. Try a command. Get a\n",
        "stack trace. Retry the same command without diagnosing the error. Repeat\n",
        "until the harness hits `max_iterations`. *I’m learnding.*\n",
        "\n",
        "### 77. Ralph can work\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_077.png\"\n",
        "alt=\"Slide 77\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 77</figcaption>\n",
        "</figure>\n",
        "\n",
        "And Ralph occasionally works. There’s a write-up of porting 600,000\n",
        "lines of C in four days using exactly this loop. If the plan is perfect\n",
        "and nothing breaks, Ralph builds things. But in the real world, Ralph\n",
        "destroys your token budget.\n",
        "\n",
        "### 78. Cognitive Discipline via JSON Schema and Plan\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_078.png\"\n",
        "alt=\"Slide 78\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 78</figcaption>\n",
        "</figure>\n",
        "\n",
        "This is the gap between a prompt engineer and a system engineer. A\n",
        "prompt engineer writes ‘always think step-by-step’ in the system prompt\n",
        "and hopes the model listens. A system engineer uses **JSON Schema.**\n",
        "Don’t let the model pass `{command: 'run script'}`. Configure the tool\n",
        "schema to require a `hypothesis`, a `verification_plan`, *then* the\n",
        "`command`. If any field is missing, the harness rejects the call before\n",
        "it hits the sandbox. You physically force the model to think before it\n",
        "acts.\n",
        "\n",
        "### 79. Moving from Ralph Wiggum to AutoResearch\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_079.png\"\n",
        "alt=\"Slide 79\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 79</figcaption>\n",
        "</figure>\n",
        "\n",
        "Karpathy published this earlier this year — his auto-research loop. Look\n",
        "at the chart on the left: 83 experiments, 15 kept improvements,\n",
        "monotonically decreasing validation BPB. That’s a Ralph loop with one\n",
        "critical addition — a *metric gate*. Every iteration either improves the\n",
        "score and gets committed, or gets `git reset` and discarded. This is the\n",
        "bridge between Ralph and a real loop: **observe state, hypothesize, act,\n",
        "verify, keep what works.** Same scientific method as before, but now the\n",
        "*environment* enforces the verification step. That’s our lead-in to\n",
        "cognitive discipline.\n",
        "\n",
        "### 80. An Improved Loop for AutoResearch\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_080.png\"\n",
        "alt=\"Slide 80\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 80</figcaption>\n",
        "</figure>\n",
        "\n",
        "The improved loop adds what Ralph is missing: hypothesis, action,\n",
        "evaluation, and a rule for keeping only the changes that actually work.\n",
        "\n",
        "### 81. Defensive Tool Returns.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_081.png\"\n",
        "alt=\"Slide 81\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 81</figcaption>\n",
        "</figure>\n",
        "\n",
        "And the harness manages the *output* too. Agent runs `cat massive.json`?\n",
        "A good harness intercepts the output, truncates it, and returns:\n",
        "`<Output truncated to 2000 lines. Use grep or head instead.>` This\n",
        "prevents the agent from blinding itself.\n",
        "\n",
        "### 82. Environmental Discipline: Testing Driven Development\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_082.png\"\n",
        "alt=\"Slide 82\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 82</figcaption>\n",
        "</figure>\n",
        "\n",
        "Even if the model is thinking perfectly, it will eventually drift. Move\n",
        "to environmental discipline. Factory’s Luke Alvoeiro: tests written\n",
        "*after* implementation don’t catch bugs — they confirm the agent’s\n",
        "hallucinated decisions. Validation contracts must be defined *before*\n",
        "the agent writes a line of code. Adversarial by design.\n",
        "\n",
        "### 83. Adding System Constraints for your Harness\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_083.png\"\n",
        "alt=\"Slide 83\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 83</figcaption>\n",
        "</figure>\n",
        "\n",
        "Ryan Lopopolo from OpenAI summarizes it: **Code is free. Architecture is\n",
        "expensive.** Because agents can write code infinitely, your system’s\n",
        "constraints become your harness: - Lint errors → instructions to the\n",
        "agent. - File-length tests → force decomposition. - CI checks → the real\n",
        "guardrail.\n",
        "\n",
        "### 84. Safety & Friction: Sandboxes\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_084.png\"\n",
        "alt=\"Slide 84\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 84</figcaption>\n",
        "</figure>\n",
        "\n",
        "Last stage: safety. These loops are autonomous, they’re executing bash,\n",
        "they will eventually try to destroy things. Look at this — March 8th,\n",
        "Claude Code deleted a developer’s production database setup, including\n",
        "snapshots. *Two and a half years* of records, gone in an instant. This\n",
        "isn’t theoretical. **Sandboxes are non-negotiable.**\n",
        "\n",
        "### 85. Safety & Friction: Sandboxes\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_085.png\"\n",
        "alt=\"Slide 85\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 85</figcaption>\n",
        "</figure>\n",
        "\n",
        "The second sandbox slide makes the operational point concrete: isolated\n",
        "execution has to be the default once agents can touch shell commands,\n",
        "credentials, and production-like state.\n",
        "\n",
        "### 86. Guardrails and Approval Friction\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_086.png\"\n",
        "alt=\"Slide 86\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 86</figcaption>\n",
        "</figure>\n",
        "\n",
        "So design friction to match blast radius: - Safe (read, grep, ls) →\n",
        "**auto-allow.** - Reversible (edit file, git commit) → **auto-allow\n",
        "inside a sandbox.** - Network (curl, npm install) → **prompt once per\n",
        "session.** - Destructive (rm -rf, DROP TABLE, force push) → **require\n",
        "explicit human approval.** The harness halts the loop at the right\n",
        "moments.\n",
        "\n",
        "### 87. Principles for Agentic Loop\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_087.png\"\n",
        "alt=\"Slide 87\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 87</figcaption>\n",
        "</figure>\n",
        "\n",
        "Four rules for the loop: 1. Make the next action explicit — JSON schema.\n",
        "2. Make verification cheap — CI, tests, lint as adversarial boundary. 3.\n",
        "Make blind repetition hard. 4. Force learning from the environment.\n",
        "Retrieval feeds the loop. Memory stabilizes the loop. Protocols and\n",
        "tests discipline the loop.\n",
        "\n",
        "### 88. Let’s start with how agents find what they need.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_088.png\"\n",
        "alt=\"Slide 88\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 88</figcaption>\n",
        "</figure>\n",
        "\n",
        "This divider sets up the final lever: architecture. Up to this point the\n",
        "talk assumes one harness and one loop; here the question becomes whether\n",
        "splitting the work actually helps.\n",
        "\n",
        "### 89. System Architecture: Single versus Multi Agent\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_089.png\"\n",
        "alt=\"Slide 89\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 89</figcaption>\n",
        "</figure>\n",
        "\n",
        "Last lever: architecture. Up to now I’ve talked as if there’s one agent\n",
        "with one harness. Architecture is where you decide whether that loop\n",
        "stays centralized or gets broken into specialized workers.\n",
        "\n",
        "### 90. Who’s using a multi-agent for coding?\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_090.png\"\n",
        "alt=\"Slide 90\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 90</figcaption>\n",
        "</figure>\n",
        "\n",
        "Hands up — who’s actually run a **multi-agent system in production**?\n",
        "Now keep your hand up if it worked *better* than your single-agent\n",
        "version. Right. Let’s talk about why that happens.\n",
        "\n",
        "### 91. Single agents degrade as complexity grows.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_091.png\"\n",
        "alt=\"Slide 91\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 91</figcaption>\n",
        "</figure>\n",
        "\n",
        "Why does the multi-agent intuition fail? Start here. Single agents\n",
        "degrade as complexity grows. Even GPT-5: with 3 tools and 1.4K tokens,\n",
        "four distractor tools cost you 10%. With all tools and 300K tokens, the\n",
        "same distractors cost you 17%. The model doesn’t know which tool to\n",
        "pick. Vercel saw this in production — they removed **80% of v0’s tools**\n",
        "and got *better* results. More tools doesn’t mean more capability. It\n",
        "usually means worse routing.\n",
        "\n",
        "### 92. Split the context between multiple agents\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_092.png\"\n",
        "alt=\"Slide 92\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 92</figcaption>\n",
        "</figure>\n",
        "\n",
        "So sometimes we use subagents. Same study: split the context and tools\n",
        "across specialized workers, and multi-agent setups outperform\n",
        "single-agent baselines, especially for the smaller models. Promise is\n",
        "real.\n",
        "\n",
        "### 93. Multi-Agent is like Distributed Systems: Complex!\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_093.png\"\n",
        "alt=\"Slide 93\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 93</figcaption>\n",
        "</figure>\n",
        "\n",
        "This is the warning slide for multi-agent design: once you introduce\n",
        "several workers, you inherit the same coordination, debugging, and\n",
        "observability headaches that make distributed systems hard.\n",
        "\n",
        "### 94. Many ways to orchestrate multiple agents\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_094.png\"\n",
        "alt=\"Slide 94\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 94</figcaption>\n",
        "</figure>\n",
        "\n",
        "And there are clean patterns to reach for: prompt chaining, routing,\n",
        "parallelization, orchestrator-worker, evaluator-optimizer. These show up\n",
        "everywhere — they’re worth knowing.\n",
        "\n",
        "### 95. Coordination Tax - Going from Parallel to Serial\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_095.png\"\n",
        "alt=\"Slide 95\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 95</figcaption>\n",
        "</figure>\n",
        "\n",
        "But before you run off and build a 15-agent swarm — every multi-agent\n",
        "setup pays a coordination tax. Factory hit this directly. Started with\n",
        "parallel swarms — agents stepped on each other, performance tanked. They\n",
        "moved to strict serial handoffs: orchestrator → worker → validator.\n",
        "**Every conflict burns tokens. Every retry erodes trust.**\n",
        "\n",
        "### 96. The Reality: More agents only help if coordination stays cheap.\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_096.png\"\n",
        "alt=\"Slide 96\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 96</figcaption>\n",
        "</figure>\n",
        "\n",
        "And here’s what the data shows. Across BrowseComp, Finance, PlanCraft,\n",
        "Workbench — multi-agent systems perform on average **3.5% *worse*** than\n",
        "single agents. Once a single agent hits about 45% accuracy, adding more\n",
        "agents stops helping. Independent agents amplify errors **17×**.\n",
        "**Subagents are something you *earn* — not something you set up on day\n",
        "one.**\n",
        "\n",
        "### 97. Multi-Agent critics using reflection\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_097.png\"\n",
        "alt=\"Slide 97\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 97</figcaption>\n",
        "</figure>\n",
        "\n",
        "There IS one multi-agent pattern that consistently *works*, though — the\n",
        "**critic**. A separate agent — sometimes a different model entirely —\n",
        "reviewing the main agent’s trace. Look at the data: Reflexion-style\n",
        "critic loops on SWE-bench. Random sampling 57.9%. Success-only 63.6%.\n",
        "**Iterative critic with rubrics: 73.8%.** That’s a 10-point bump from\n",
        "adding one critic agent. And Boris Cherny said it bluntly — giving the\n",
        "model a way to verify its work improves quality **2 to 3×**. That’s the\n",
        "practitioner number behind the academic data. When you do reach for a\n",
        "second agent, this is the move that pays off.\n",
        "\n",
        "### 98. Harness engineering in another two years?\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_098.png\"\n",
        "alt=\"Slide 98\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 98</figcaption>\n",
        "</figure>\n",
        "\n",
        "Now — the question I get every time I give this talk. Is harness\n",
        "engineering just the new prompt engineering? The next thing that quietly\n",
        "disappears in two years? Honest answer: yes and no. Some of this\n",
        "commoditizes. Default compaction algorithms, standard tool choices like\n",
        "`grep` and edit, model-specific prompt tweaks, most of the hacks that\n",
        "patch capability gaps — those will disappear into the platform. They\n",
        "become defaults. But other parts are durable. Skills systems — your\n",
        "organization’s accumulated expertise. Memory policy — what you persist,\n",
        "what you throw away. Domain-specific tool libraries. Security posture.\n",
        "Evals — knowing when your harness breaks. Prompt engineering didn’t\n",
        "actually die. We just stopped naming it. Same thing’s going to happen\n",
        "here. The durable parts become plumbing. And plumbing is where the\n",
        "leverage lives.\n",
        "\n",
        "### 99. Five knobs that decide everything\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_099.png\"\n",
        "alt=\"Slide 99\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 99</figcaption>\n",
        "</figure>\n",
        "\n",
        "To wrap up — five knobs that decide everything. Tune them in this\n",
        "order: 1. **Retrieval** → `grep` and BM25 by default. Add semantic only\n",
        "when vocabulary mismatch hurts you. Loop it for accuracy. 2. **Memory**\n",
        "→ files beat chat history. Skills for durable expertise. Evaluate them.\n",
        "3. **Tools** → quality over quantity. Defensive truncation.\n",
        "Schema-enforced thinking. 4. **Loops** → force hypothesis before action.\n",
        "Tests before code. 5. **Orchestration** → single agent by default.\n",
        "Multi-agent only when coordination stays cheap.\n",
        "\n",
        "### 100. Why Harnesses Matter\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_100.png\"\n",
        "alt=\"Slide 100\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 100</figcaption>\n",
        "</figure>\n",
        "\n",
        "The closing summary returns to the three reasons harnesses matter:\n",
        "performance differences, the knobs you can tune, and the lock-in risk\n",
        "when the harness owns your memory and tools.\n",
        "\n",
        "### 101. Engineering the Harness: A Practical Workshop\n",
        "\n",
        "<figure>\n",
        "<img\n",
        "src=\"https://rajivshah.com/blog/images/harness-engineering/page_101.png\"\n",
        "alt=\"Slide 101\" />\n",
        "<figcaption aria-hidden=\"true\">Slide 101</figcaption>\n",
        "</figure>\n",
        "\n",
        "Here’s where I’ll leave you. An agent isn’t a magical model. **An agent\n",
        "is a model plus a harness.** The intelligence of your product isn’t just\n",
        "in the LLM’s weights — it’s in the execution layer that connects that\n",
        "model to tools, state, and the work that has to actually finish. **Stop\n",
        "tweaking prompts. Start engineering the harness.** The QR code on screen\n",
        "will take you to the GitHub repo with the full script, references, and\n",
        "runnable experiments. Thank you. Questions.\n",
        "\n",
        "------------------------------------------------------------------------\n",
        "\n",
        "*This annotated presentation was rebuilt directly from the local slide\n",
        "deck and talk track.*"
      ],
      "id": "72e992e2-8a18-484f-9558-84bfef67af21"
    }
  ],
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    }
  }
}