19 Feb 2026

Updated: May 22, 2026

Build the harness, not the code: a staff/principal engineer's guide to AI-agent systems

A team at OpenAI shipped a production product with zero lines of manually-written code. Empty git repo in August 2025. One million lines five months later. Three engineers initially, averaging 3.5 PRs per person per day. And throughput increased as the team grew to seven.

That’s not a research demo. It’s a product with hundreds of daily internal users and external alpha testers.

The constraint was intentional: humans steer, agents execute. Every line - application logic, tests, CI config, docs, internal tooling - was written by Codex. They estimate they built this in about 1/10th the time it would have taken to write the code by hand.

So what did the humans actually do?

They built the harness.

And you can build the same harness with Claude Code.

What's in this guide

3 core harness lessons from OpenAI’s internal story, with parallel Claude Code implementation patterns
10 operational tips for skills, shell execution, and context compaction (OpenAI + Claude paths)
15 hard-won lessons from building ChatGPT Apps, adapted for Claude Desktop and MCP servers
Architecture enforcement patterns with mechanical linting via hooks (OpenAI + Claude)
11-step autonomy ladder showing the path from bug reproduction to autonomous PR merge
Quarterly implementation blueprint for both Codex and Claude Code
Metrics, failure scenarios, and tool-agnostic adaptation proven across ecosystems

This guide shows you both paths. Whether you’re using OpenAI Codex, Claude Code, or planning to switch between them, you’ll learn the underlying patterns that work everywhere.

This is a companion to Build a code review operating system. This article builds the harness. That one closes the feedback loop through review.

What “harness engineering” actually means

You know how it goes - you’re a staff engineer, you spend your time reviewing code, fixing architectural drift, making sure junior devs follow conventions. Now imagine that “junior dev” is an AI agent that can write code 10x faster than any human but has no institutional memory, no taste, and no understanding of why you chose PostgreSQL over DynamoDB.

The harness is everything you build around the agent to make its output reliable. Instructions, tools, checks, context, feedback loops. It’s the environment design that turns “agent can write code” into “agent can ship production features.”

OpenAI learned this the hard way. Early progress was slower than expected - not because Codex was incapable, but because the environment was underspecified. The agent lacked tools, abstractions, and internal structure to make progress toward high-level goals. The lack of hands-on human coding introduced a different kind of engineering work, focused on systems, scaffolding, and compounding returns.

In practice, this meant working depth-first: breaking down larger goals into smaller building blocks (design, code, review, test), prompting the agent to construct those blocks, and using them to unlock more complex tasks. When something failed, the fix was almost never “try harder.” It was always: what capability is missing, and how do I make it both legible and enforceable for the agent?

That reframe is not philosophical. It’s operational. And it’s exactly what this article teaches you to do.

I’ve been applying these patterns in my own project - flowforge, a compile-time data contracts framework. Throughout this article, I’ll occasionally show how I applied a pattern in practice. But the teaching comes from four critical sources:

Harness engineering - OpenAI’s internal story of shipping 1M lines with zero manual code
Skills + shell + compaction - operational primitives for long-running agents (OpenAI and Claude)
15 lessons building ChatGPT Apps - hard-won lessons adapted for Claude Desktop and MCP
Platform shifts in 2025 - what changed across OpenAI and Anthropic ecosystems

The mental model

flowchart LR
    A[Business intent] --> B[Harness design]
    B --> C[Agent execution]
    C --> D[Validation and evals]
    D --> E[Auto-remediation and learning]
    E --> B

You design the map (instructions, boundaries, tools), not a giant manual. You encode quality as checks, not tribal review comments. You treat context artifacts as infrastructure.

Here’s the heuristic I trust most: anything that helps humans ship better software faster usually helps agents do the same.

Skills make this obvious. If you’ve worked with internal engineering wikis, you’ve already seen this movie: distilled playbooks for recurring situations, easy to find, scoped to a concrete job. That pattern worked for humans for years. Agent skills are the same pattern, now executable.

The same logic carries over to compilers and type systems, dependency injection, errors-as-values, and effect systems. These abstractions didn’t survive because they’re fashionable. They survived because they improve reliability and change velocity in real codebases. Agents benefit from those constraints too.

I still keep seeing the “language won’t matter, agents will just emit assembly” take. I don’t buy it. If a workflow is hard for humans to reason about, review, and maintain, it’ll usually break down for agents at scale too. Honestly, I’m tired of the “AI is alien magic” framing. It isn’t. AI doesn’t erase engineering fundamentals; it magnifies them. Strong systems get stronger. Fragile systems fail faster.

This pattern is tool-agnostic. Whether you’re building with Codex, Claude Code, Gemini, or Grok, the loop stays the same.

Lesson 1: give agents a map, not a manual

Both OpenAI and Anthropic teams learned the same lesson: the “one big instruction file” approach fails predictably.

Why the giant instruction file fails

Context is a scarce resource. A giant instruction file crowds out the task, the code, and the relevant docs - so the agent either misses key constraints or starts optimizing for the wrong ones.
Too much guidance becomes non-guidance. When everything is “important,” nothing is. Agents end up pattern-matching locally instead of navigating intentionally.
It rots instantly. A monolithic manual turns into a graveyard of stale rules. Agents can’t tell what’s still true, humans stop maintaining it, and the file quietly becomes an attractive nuisance.
It’s hard to verify. A single blob doesn’t lend itself to mechanical checks (coverage, freshness, ownership, cross-links), so drift is inevitable.

What should actually go into AGENTS.md

There is now benchmark evidence that this is not just a style preference. In an analysis highlighted by Addy Osmani (covering SWE-bench style agent runs), adding an auto-generated AGENTS.md summary reduced task success by roughly 2-3 percentage points while increasing token cost by about 20-30%.

The failure mode is simple: the agent can already inspect your tree, infer your stack, and read module docs on demand. If you preload that same information as prose, you create context anchoring noise. The model keeps looking at the pink elephant in the room instead of the concrete diff it should make.

Use a strict filter for what earns a line in AGENTS.md:

Undiscoverable operational constraints: setup and tooling gotchas that are not visible from source code alone
Operational landmines: risky areas where “cleanup” can break production behavior
Non-obvious conventions: intentional local patterns that look wrong generically but are right for this codebase

Everything else should be linked, not duplicated. If an agent can discover it by reading the repository, cut it from the instruction file.

Treat AGENTS.md as a temporary smell tracker, not a permanent knowledge dump. If agents repeatedly misuse a dependency, write to the wrong folder, or keep violating architecture boundaries, the real fix is usually structural: rename modules, improve folder semantics, add linters/hooks, strengthen tests. Then remove the compensating prose.

The solution: index file as table of contents

OpenAI’s approach: AGENTS.md

Instead of treating AGENTS.md as the encyclopedia, OpenAI treats it as the table of contents. The repository’s knowledge base lives in a structured docs/ directory. A short AGENTS.md (roughly 100 lines) is injected into context and serves primarily as a map, with pointers to deeper sources of truth.

OpenAI's actual repository structure

AGENTS.md              (100 lines, injected into context as the map)
ARCHITECTURE.md        (top-level map of domains and package layering)
docs/
├── design-docs/
│   ├── index.md
│   ├── core-beliefs.md   (agent-first operating principles)
│   └── ...
├── exec-plans/
│   ├── active/
│   ├── completed/
│   └── tech-debt-tracker.md
├── generated/
│   └── db-schema.md
├── product-specs/
│   ├── index.md
│   ├── new-user-onboarding.md
│   └── ...
├── references/
│   ├── design-system-reference-llms.txt
│   ├── nixpacks-llms.txt
│   └── ...
├── DESIGN.md
├── FRONTEND.md
├── QUALITY_SCORE.md
└── SECURITY.md

Plans are first-class artifacts. Active plans, completed plans, and known technical debt are all versioned and co-located, allowing agents to operate without relying on external context.

Claude Code’s approach: CLAUDE.md + .claude/ directory

Claude Code uses CLAUDE.md as the agent’s “constitution” - its primary source of truth for how your specific repository works. Like AGENTS.md, CLAUDE.md is loaded automatically at the start of each session and holds the project-specific instructions you’d otherwise repeat in every prompt.

The filename is case-sensitive and must be exactly CLAUDE.md (uppercase CLAUDE, lowercase .md).

Claude Code's repository structure

CLAUDE.md              (50-120 lines, auto-loaded every session)
.claude/
├── agents/            (specialized subagents)
│   ├── reviewer.md
│   ├── security.md
│   └── test-writer.md
├── commands/          (custom slash commands)
│   └── deploy.md
├── hooks/             (lifecycle automation)
│   ├── PreToolUse.sh
│   ├── PostToolUse.sh
│   └── AfterCompaction.sh
└── skills/            (versioned procedure bundles)
    ├── api-design.md
    └── schema-migration.md
docs/
├── architecture/
│   └── decisions.md  (ADRs)
├── evidence/
│   └── current-state.md
└── runbooks/
    └── deployment.md

Key differences from OpenAI’s approach:

.claude/agents/ holds specialized subagents that run in isolated contexts
.claude/hooks/ enforces quality mechanically at 15 lifecycle events
CLAUDE.md can stay short (50-120 lines, versus roughly 100 for AGENTS.md) because hooks handle enforcement
MCP servers provide external tool access via standardized protocol

Progressive disclosure in both ecosystems

flowchart LR
    subgraph openai ["OpenAI: AGENTS.md"]
        A1["AGENTS.md ~100 lines"] -->|points to| A2["docs/design-docs/"]
        A1 -->|points to| A3["docs/exec-plans/"]
        A2 -->|deep dive| A4["core-beliefs.md"]
    end
    subgraph claude ["Claude: CLAUDE.md + .claude/"]
        C1["CLAUDE.md 50-120 lines"] -->|delegates to| C2[".claude/agents/"]
        C1 -->|links to| C3["docs/architecture/"]
        C2 -->|runs| C4["reviewer.md security.md"]
    end
    style openai fill:#d4edda,stroke:#28a745
    style claude fill:#e8f4fd,stroke:#0d6efd

The agent reads the index file first (AGENTS.md or CLAUDE.md). From there, it follows links to whichever deep doc is relevant to the current task. It never loads the whole tree at once.

Mechanical enforcement

OpenAI: Dedicated linters and CI jobs validate that the knowledge base is up to date, cross-linked, and structured correctly. A recurring “doc-gardening” agent scans for stale or obsolete documentation and opens fix-up pull requests.
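To make "validate that the knowledge base is cross-linked" concrete, here is a minimal sketch of the kind of check a CI job could run against the index file. It is not OpenAI's linter. The 120-line budget, the function name `lint_index`, and the choice to check only relative Markdown links are my assumptions for illustration.

```python
import re
from pathlib import Path

MAX_INDEX_LINES = 120  # assumed budget for the index file
# Matches [text](target) and [text](target#anchor); pure "#anchor" links are ignored.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#\s]+)(?:#[^)]*)?\)")
SCHEME_RE = re.compile(r"^[a-z][a-z0-9+.-]*:", re.IGNORECASE)


def lint_index(index_path: Path, max_lines: int = MAX_INDEX_LINES) -> list[str]:
    """Return prescriptive problem messages for an index file such as AGENTS.md."""
    problems: list[str] = []
    lines = index_path.read_text(encoding="utf-8").splitlines()

    # Check 1: the index must stay a map, not grow into a manual.
    if len(lines) > max_lines:
        problems.append(
            f"{index_path.name}: {len(lines)} lines exceeds budget of {max_lines}; "
            "move detail into docs/ and link to it"
        )

    # Check 2: every relative link must point at a file that exists.
    for lineno, line in enumerate(lines, start=1):
        for target in LINK_RE.findall(line):
            if SCHEME_RE.match(target):  # skip http:, https:, mailto:, ...
                continue
            if not (index_path.parent / target).exists():
                problems.append(
                    f"{index_path.name}:{lineno}: broken link '{target}'; "
                    "fix the path or delete the pointer"
                )
    return problems
```

In CI you would fail the build when `lint_index(Path("AGENTS.md"))` returns a non-empty list, and print the messages so the agent sees how to fix each one.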

Claude Code: Hooks provide deterministic enforcement. PreToolUse hooks block dangerous operations before they run. PostToolUse hooks enforce formatting immediately after code changes. AfterCompaction hooks inject critical context back after summarization.

The key distinction from Claude Code best practices: If it’s a suggestion, use CLAUDE.md. If it’s a requirement, use hooks.

How to implement this yourself

For OpenAI Codex:

Start with AGENTS.md as an index (under 120 lines)
Split docs by concern into docs/ subdirectories
Add ADRs for decisions with dated, status-marked records
Create evidence docs with file paths and quantified claims
Enforce mechanically with CI jobs for freshness and cross-links
Run a doc-gardening agent on a schedule

For Claude Code:

Create CLAUDE.md in repo root (50-120 lines, project-specific instructions only)
Structure .claude/agents/ for specialized subagents (review, security, test-writing)
Define .claude/hooks/ for enforcement (PreToolUse blocks, PostToolUse formats)
Link from CLAUDE.md to docs/architecture/ for deep dives
Use MCP servers for external tool access
Let automatic compaction manage context with AfterCompaction hooks to pin critical state

Universal patterns (both ecosystems):

Keep the index file minimal (suggestions only)
Use deep-linked docs for detailed context
Enforce quality mechanically (linters/hooks, not prose)
Version all decisions and plans in the repo

In flowforge, I use both patterns: AGENTS.md for Codex compatibility with links to docs/adr/INDEX.md (25+ decisions), and .claude/hooks/PostToolUse.sh that runs scalafix after every code change, ensuring golden principles are enforced regardless of which agent I’m using.

Lesson 2: automate cleanup, or drown in it

Full agent autonomy introduces a specific failure mode. Agents replicate patterns that already exist in the repository - even uneven or suboptimal ones. Over time, this inevitably leads to drift.

The cost of skipping this

OpenAI’s team spent every Friday - 20% of their week - cleaning up “AI slop.” That didn’t scale. If you don’t encode cleanup rules mechanically, you’ll spend a fifth of your week doing it by hand.

Golden principles: encoding taste as enforcement

Both ecosystems learned to encode “golden principles” - opinionated, mechanical rules that keep the codebase legible and consistent for future agent runs.

OpenAI’s production examples:

Prefer shared utility packages over hand-rolled helpers to keep invariants centralized.
Parse data shapes at the boundary. Don’t probe data “YOLO-style” - validate boundaries or rely on typed SDKs. Follow the “parse, don’t validate” principle.
Statically enforce structured logging, naming conventions for schemas and types, file size limits, and platform-specific reliability requirements with custom lints.
Custom lint error messages inject remediation instructions into agent context. When a lint fails, the error message tells the agent how to fix it.

On a regular cadence, background Codex tasks scan for deviations, update quality grades, and open targeted refactoring PRs. Most can be reviewed in under a minute and automerged.
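Here is a minimal sketch of what "lint errors that carry remediation instructions" can look like. The two rules, the 400-line limit, and the names `Rule` and `lint_source` are hypothetical and exist only to illustrate the pattern. Real rules should come from the principles your team already enforces by hand.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern[str]
    remediation: str  # tells the agent HOW to fix, not just what is wrong


# Hypothetical rules for illustration only.
RULES = [
    Rule(
        name="structured-logging",
        pattern=re.compile(r"\bconsole\.log\("),
        remediation="Replace console.log(...) with logger.info(...) from the shared logging package.",
    ),
    Rule(
        name="no-hand-rolled-retry",
        pattern=re.compile(r"\bfunction\s+retry\w*\("),
        remediation="Delete the local helper and import retry from the shared utility package.",
    ),
]

MAX_FILE_LINES = 400  # assumed file size limit


def lint_source(path: str, source: str) -> list[str]:
    """Return one message per violation, each ending in a concrete fix."""
    messages: list[str] = []
    lines = source.splitlines()
    if len(lines) > MAX_FILE_LINES:
        messages.append(
            f"{path}: file-size: {len(lines)} lines > {MAX_FILE_LINES}. "
            "FIX: split the file along its exported responsibilities."
        )
    for lineno, line in enumerate(lines, start=1):
        for rule in RULES:
            if rule.pattern.search(line):
                messages.append(f"{path}:{lineno}: {rule.name}. FIX: {rule.remediation}")
    return messages
```

The design choice that matters is the `remediation` field: the message lands in the agent's context when the lint fails, so the fix travels with the failure instead of living in a wiki.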

Community validation: Karpathy’s four principles

When Karpathy published his LLM coding pitfalls in January 2026, the community responded with andrej-karpathy-skills - a single CLAUDE.md encoding four principles, now over 143,000 stars. It’s installable from the Claude Code marketplace. They map directly onto what golden-principle engineering requires:

Think before coding. Ask clarifying questions. Surface assumptions. Don’t let the model run with the wrong premise.
Simplicity first. The model’s default instinct is overcomplication. Golden principles push back against that mechanically.
Surgical changes. Minimal diffs. Localized edits. “Don’t touch what isn’t broken” encoded as a rule.
Goal-driven execution. Give the agent success criteria, not a prescription. Let it find the path.

The repo’s popularity isn’t about the principles being novel. It’s about them being installable. That’s the harness insight: good engineering taste is worthless unless it’s enforced.

Claude Code’s approach: hooks for deterministic enforcement

Claude Code uses hooks at 15 lifecycle events to enforce golden principles. The critical insight from production engineering teams:

CLAUDE.md rules are suggestions. Hooks are enforcement. CLAUDE.md saying “don’t edit .env” → parsed by LLM → weighed against other context → maybe followed. PreToolUse hook blocking .env edits → always runs → returns exit code 2 → operation blocked.
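The ".env" case above can be sketched as a small decision function. This is an illustration under assumptions, not the official hook contract: I assume the hook receives the tool call as JSON on stdin with the target path at `tool_input.file_path`, and the names `decide` and `main` are mine. Check the hook documentation for the exact payload before relying on it.

```python
import json
import sys
from pathlib import PurePosixPath

BLOCK = 2  # the exit code the article describes as "operation blocked"
ALLOW = 0


def decide(payload: dict) -> tuple[int, str]:
    """Return (exit_code, message) for one tool call described by payload."""
    # Assumed payload shape: {"tool_name": ..., "tool_input": {"file_path": ...}}
    file_path = payload.get("tool_input", {}).get("file_path", "")
    name = PurePosixPath(file_path).name
    if name == ".env" or name.startswith(".env."):
        return BLOCK, f"Blocked: {file_path} may hold secrets; ask a human to change it."
    return ALLOW, ""


def main() -> int:
    """Read one tool call from stdin and report the decision."""
    code, message = decide(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)  # the message is how the agent learns why
    return code
```

A script entry point would end with `sys.exit(main())`. Keeping `decide` pure means the rule itself can be unit-tested without spawning an agent session.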

Example .claude/hooks/PostToolUse.sh enforcing golden principles:

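#!/bin/bash
# Runs after every file write.
# Assumption: the hook payload arrives as JSON on stdin with the edited path
# at .tool_input.file_path, and jq is installed. Verify against your version.
FILE_PATH=$(jq -r '.tool_input.file_path // empty')
if [[ -z "$FILE_PATH" || ! -f "$FILE_PATH" ]]; then
  exit 0  # Nothing to check
fi

# Golden principle 1: no raw console.log in production code
if [[ ! "$FILE_PATH" =~ test ]] && grep -q "console\.log" "$FILE_PATH"; then
  # Write to stderr so the message is reported back to the agent
  echo "❌ Blocked: Use structured logging (logger.info) instead of console.log" >&2
  exit 2  # Signal the violation
fi

# Golden principle 2: auto-format all code changes
if [[ "$FILE_PATH" =~ \.(ts|js|tsx|jsx)$ ]]; then
  npx prettier --write "$FILE_PATH" > /dev/null 2>&1
fi

# Golden principle 3: run linter with auto-fix
if [[ "$FILE_PATH" =~ \.(ts|tsx)$ ]]; then
  npx eslint --fix "$FILE_PATH" > /dev/null 2>&1
fi

exit 0  # Allow operation after enforcement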

Additional Claude Code patterns from production repos:

PreToolUse hooks prevent dangerous operations (deleting production DBs, exposing secrets)
AfterCompaction hooks inject critical context back (active plans, golden principles)
SessionEnd hooks generate summaries and cleanup artifacts

Agents favor boring technology

Both OpenAI and Anthropic teams favored dependencies and abstractions that could be fully internalized and reasoned about in-repo. Technologies often described as “boring” tend to be easier for agents to model due to composability, API stability, and representation in the training set.

OpenAI’s example: rather than pulling in a generic p-limit-style package, they implemented their own map-with-concurrency helper - tightly integrated with their OpenTelemetry instrumentation, with 100% test coverage, behaving exactly the way their runtime expected.
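To show how small such a helper can be, here is a minimal Python sketch of a map-with-concurrency function. It is not OpenAI's implementation: theirs is tied to their OpenTelemetry instrumentation, which I leave out, and the name `map_with_concurrency` is my own.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def map_with_concurrency(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    limit: int,
) -> list[R]:
    """Apply worker to every item, running at most `limit` calls at once.

    Results come back in input order. The first exception propagates.
    """
    if limit < 1:
        raise ValueError("limit must be >= 1")
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: T) -> R:
        # The semaphore is the whole concurrency policy: wait for a free slot.
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(run_one(item) for item in items))
```

About twenty lines, no dependency, and every behavior (ordering, error propagation, the limit) is visible in-repo where an agent can read and test it.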

The feedback loop

flowchart LR
    A[Agent output] --> B[Golden principles scanner]
    B -->|pass| C[merge]
    B -->|fixable| D[auto-fix PR]
    B -->|not fixable| E[human escalation]
    E --> F[update golden principles]
    F --> B
    D --> G[quality grade updated]
    G --> B

OpenAI: Review comments, refactoring PRs, and user-facing bugs are captured as documentation updates or encoded directly into tooling. When documentation falls short, the rule gets promoted into code.

Claude Code: Failed hook enforcement triggers human escalation. The fix becomes a new hook or updated CLAUDE.md rule. Over time, the harness learns from failures. This is what turns it into a code review operating system, not just a linter.

How to implement this yourself

For OpenAI Codex:

Define 5-10 golden principles (mechanical rules your team enforces manually today)
Encode them as custom linters or structural tests
Make lint errors prescriptive (tell agents how to fix violations)
Run drift scans daily, auto-open PRs for low-risk violations
Track manual cleanup time (goal: trending to zero)

For Claude Code:

Define golden principles in .claude/hooks/PostToolUse.sh
Use exit code 2 to block violations, exit 0 after auto-fixes
Add AfterCompaction hooks to preserve principles across compaction
Use PreToolUse hooks to prevent dangerous operations
Track hook trigger rate and human escalations as metrics

Universal patterns:

Start with 5-10 rules, expand as patterns emerge
Favor boring, composable technology (agents model it reliably)
Make error messages prescriptive (not just “what” but “how”)
Update principles based on real failures, not hypothetical risks

In flowforge, I use .scalafix.conf (banning var, asInstanceOf, wildcard imports) for Codex compatibility, plus .claude/hooks/PostToolUse.sh that runs scalafix automatically. I also maintain compile-fail tests that must fail to compile - proving the core promise that pipelines won’t compile when contracts drift. These work identically across both ecosystems.

Lesson 3: context is infrastructure, not documentation

From OpenAI’s harness engineering post:

From the agent’s point of view, anything it can’t access in-context while running effectively doesn’t exist.

Knowledge that lives in Google Docs, chat threads, or people’s heads is not accessible to the system. Repository-local, versioned artifacts (code, markdown, schemas, executable plans) are all the agent can see.

This is equally true for Claude Code. Claude’s automatic compaction shows the same constraint: if context isn’t in the conversation, repo files, or MCP-accessible sources, it doesn’t exist.

flowchart LR
    subgraph visible ["What agents CAN see"]
        direction TB
        A["Code + tests"]
        B["Markdown docs: ADRs, specs, plans"]
        C["Schemas + configs"]
        D["Exec plans with decision logs"]
        E["Logs, metrics (via tools/MCP)"]
    end
    subgraph invisible ["What agents CANNOT see"]
        direction TB
        F["Slack threads"]
        G["Google Docs"]
        H["People's heads"]
        I["Verbal agreements"]
        J["Undocumented conventions"]
    end
    visible -->|" agent works here "| K["Reliable output"]
    invisible -->|" effectively doesn't exist "| L["Missed constraints, wrong assumptions"]
    style visible fill:#d4edda,stroke:#28a745
    style invisible fill:#f8d7da,stroke:#dc3545

If a decision isn’t in the repo, it doesn’t exist for the agent. That Slack discussion that aligned the team on an architectural pattern? The agent is in the same position as a new hire who joins three months later: they’ll never know unless you write it down.

Agent legibility is the goal

OpenAI’s approach: Because the repository is entirely agent-generated, it’s optimized first for Codex’s legibility. In the same way teams aim to improve navigability of their code for new engineering hires, the human engineers’ goal was making it possible for an agent to reason about the full business domain directly from the repository itself.

Claude Code’s equivalent: As software engineering shifts to agent orchestration, the folder and file structure becomes a form of context engineering. The best practices for Claude Code emphasize treating repository structure as the agent’s primary navigation system.

Making the application itself legible to agents

OpenAI’s production setup:

Per-worktree app instances: app bootable per git worktree, so Codex can launch and drive one instance per change.
Chrome DevTools Protocol: wired into the agent runtime with skills for working with DOM snapshots, screenshots, and navigation.
Ephemeral observability per worktree: logs, metrics, and traces exposed via a local observability stack that’s ephemeral for any given worktree.
Agents can query logs with LogQL and metrics with PromQL. With this context available, prompts like “ensure service startup completes in under 800ms” become tractable.

flowchart TD
    A["Agent gets task"] --> B["git worktree created"]
    B --> C["App boots in isolated instance"]
    C --> D["Agent drives app (CDP for Codex, MCP for Claude)"]
    D --> E{"Bug reproduced?"}
    E -->|yes| F["Implement fix"]
    E -->|no| G["Query logs/metrics (LogQL/PromQL for Codex, MCP server for Claude)"]
    G --> D
    F --> H["Validate fix by driving app again"]
    H --> I["Ephemeral observability torn down"]
    style C fill:#d4edda,stroke:#28a745
    style G fill:#fff3cd,stroke:#ffc107

Claude Code’s equivalent:

MCP servers for observability: Claude Code connects to tools via MCP - the Model Context Protocol is an open standard for AI-tool integrations. Instead of baking observability into Codex’s runtime, you expose it via MCP servers.
Example MCP servers from the Claude Code ecosystem: filesystem access, Git operations, database queries, HTTP APIs, browser automation.
Per-worktree isolation: Same pattern works with Claude Code. The agent creates a worktree, boots the app, drives tests via MCP-connected tools, validates, tears down.
Artifacts for state tracking: Claude artifacts (up to 20MB) store structured data across sessions, which is useful for test results, metrics history, and multi-session workflows.
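The per-worktree pattern is easier to reason about once the plan is written down as data. Below is a minimal sketch that derives a branch, a directory, and a stable port for each task. The `agent/` branch prefix, the `../worktrees` root, the port range, and the `./scripts/dev-server` command are all hypothetical placeholders. Only the `git worktree add` and `git worktree remove` invocations are real git commands.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class WorktreePlan:
    branch: str
    path: str
    port: int
    commands: tuple[tuple[str, ...], ...]


def plan_worktree(
    task_id: str,
    root: str = "../worktrees",
    base_port: int = 4000,
    port_range: int = 1000,
) -> WorktreePlan:
    """Derive an isolated branch, directory, and port for one agent task."""
    slug = "".join(c if c.isalnum() else "-" for c in task_id.lower()).strip("-")
    if not slug:
        raise ValueError("task_id must contain at least one letter or digit")

    # Hash the task id so the same task always lands on the same port.
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    port = base_port + int.from_bytes(digest[:4], "big") % port_range

    branch = f"agent/{slug}"
    path = f"{root}/{slug}"
    commands = (
        ("git", "worktree", "add", "-b", branch, path),
        # Hypothetical boot command; run it with the worktree as working directory.
        ("./scripts/dev-server", "--port", str(port)),
        ("git", "worktree", "remove", "--force", path),
    )
    return WorktreePlan(branch=branch, path=path, port=port, commands=commands)
```

Two tasks can still hash to the same port, so a real setup should probe the port before booting. The point of the sketch is that isolation is derived mechanically from the task id rather than chosen by the agent.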

OpenAI’s team regularly sees single Codex runs work on one task for upwards of six hours - often while the humans are sleeping. With Claude Code’s conversation compacting, long-running sessions stay coherent through automatic summarization at token thresholds.

Technology choices favor agent comprehension

Both OpenAI and Anthropic teams favored dependencies and abstractions that could be fully internalized and reasoned about in-repo. Pulling more of the system into a form the agent can inspect, validate, and modify directly multiplies your output - not just for Codex, but for other agents (like Aardvark for OpenAI, or Claude Code subagents for Anthropic) working on the codebase.

This is the same mental model in practical form: choices that improve human legibility and maintainability usually improve agent outcomes too. Chasing assembly-level generation over useful abstractions isn’t acceleration; it’s throwing away leverage we already spent years building.

How to implement this yourself

Universal patterns (both ecosystems):

Push decisions from chat to repo. Every Slack alignment discussion becomes an ADR.
Create evidence docs, not aspirational docs. Document current state with file paths and line numbers.
Write unvarnished reviews. Brutal honesty about gaps. Agents work better with truth than marketing.
Define acceptance criteria as measurable checks. If an agent can’t verify it, it’s not a criterion.
Make the app bootable per worktree. Each agent task gets an isolated instance.
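As an example of an acceptance criterion an agent can verify on its own, here is a minimal sketch that turns the "startup completes in under 800ms" prompt into a check. The log format (an ISO-8601 timestamp followed by an event name) and the event names `boot_start` and `ready` are assumptions for illustration. Adapt the parsing to whatever your app actually emits.

```python
from datetime import datetime

STARTUP_BUDGET_MS = 800  # the budget from the example prompt


def startup_duration_ms(log_lines: list[str]) -> float:
    """Milliseconds between the boot_start and ready events in a log."""
    # Assumed (hypothetical) line format: "<ISO-8601 timestamp> <event name>"
    stamps: dict[str, datetime] = {}
    for line in log_lines:
        stamp, _, event = line.strip().partition(" ")
        if event in ("boot_start", "ready"):
            stamps[event] = datetime.fromisoformat(stamp)
    if "boot_start" not in stamps or "ready" not in stamps:
        raise ValueError("log must contain both boot_start and ready events")
    return (stamps["ready"] - stamps["boot_start"]).total_seconds() * 1000


def check_startup(
    log_lines: list[str], budget_ms: float = STARTUP_BUDGET_MS
) -> tuple[bool, str]:
    """Return (passed, message) so the result is usable by CI and by the agent."""
    elapsed = startup_duration_ms(log_lines)
    passed = elapsed < budget_ms
    verdict = "PASS" if passed else "FAIL"
    return passed, f"{verdict}: startup took {elapsed:.0f} ms (budget {budget_ms} ms)"
```

The criterion is now a function with a boolean result, which is exactly what "if an agent can’t verify it, it’s not a criterion" asks for.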

For OpenAI Codex:

Wire Chrome DevTools Protocol into Codex runtime
Expose logs/metrics via ephemeral per-worktree observability
Enable LogQL/PromQL query access for agents

For Claude Code:

Build MCP servers for observability access (logs, metrics, traces)
Use Claude artifacts (up to 20MB) to store test results and metrics history across sessions
Enable automatic compaction with AfterCompaction hooks to preserve critical context
Connect Claude Code to your observability stack via MCP (Datadog, Grafana, CloudWatch)

In flowforge, I maintain docs/evidence/unvarnished-review.md - quotes like “Hand-rolled codegen JSON parser… will explode on unions, nested records” - and 25 measurable acceptance criteria for v1.0 in docs/plan/v1.0-readiness.md. These work identically for both Codex and Claude Code because they’re plain markdown in the repo.

Knowledge as infrastructure: the LLM wiki pattern

The three lessons above frame harness engineering around code. In April 2026, Karpathy’s LLM Knowledge Bases post pointed at the same infrastructure problem from a different angle.

His observation: a large fraction of productive agent work isn’t code manipulation. It’s knowledge manipulation - reading source documents, building structured summaries, answering complex questions across a growing corpus. The same harness discipline that makes code agents reliable applies directly here.

The pattern:

flowchart LR
    A["raw/\n(immutable sources)"] --> B["LLM compiler"]
    B --> C["wiki/\n(agent-maintained .md)"]
    C --> D["query / lint / ingest"]
    D --> C
    style A fill:#fff3cd,stroke:#ffc107
    style C fill:#d4edda,stroke:#28a745
    style B fill:#e8f4fd,stroke:#0d6efd

Raw: source documents go into raw/ unchanged - articles, papers, repos, datasets. Immutable, like your git history.

Wiki: the LLM compiles sources into structured .md files: summaries, backlinks, concept articles, cross-links. The agent writes and maintains all of it. You rarely touch it directly.

Operations: once the wiki reaches roughly 100 articles, three operations become practical:

Ingest: "file this new doc to our wiki: (path)". After the early bootstrap phase, the LLM gets the pattern and each addition is incremental.
Query: ask complex questions across the full corpus. The LLM auto-maintains index files and brief per-document summaries, so you don’t need RAG at this scale.
Lint: health checks that find inconsistent data, impute missing values via web search, surface new article candidates.
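The Lint operation is the easiest of the three to make mechanical. Here is a minimal sketch of one health check: finding broken links and orphaned pages in a wiki folder. It assumes Obsidian-style `[[wikilinks]]` and treats a page named `index` as the entry point. Both are my assumptions, and this covers only the structural part of linting, not the web-search imputation the pattern describes.

```python
import re
from pathlib import Path

# Captures the page name in [[Page]], [[Page|alias]], and [[Page#section]].
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)")


def lint_wiki(wiki_dir: Path) -> dict[str, list[str]]:
    """Report links to missing pages and pages nothing links to."""
    pages = {p.stem: p for p in wiki_dir.rglob("*.md")}
    inbound = {name: 0 for name in pages}
    broken: list[str] = []

    for name, path in sorted(pages.items()):
        text = path.read_text(encoding="utf-8")
        for raw_target in WIKILINK_RE.findall(text):
            target = raw_target.strip()
            if target not in pages:
                broken.append(f"{name} -> {target}")
            elif target != name:  # self-links do not count as inbound
                inbound[target] += 1

    # The index page is the entry point, so it is allowed to have no inbound links.
    orphans = sorted(n for n, count in inbound.items() if count == 0 and n != "index")
    return {"broken_links": broken, "orphans": orphans}
```

Feed the report back to the agent as the next Ingest task ("link or delete these orphans") and the wiki maintains itself the same way the codebase does.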

Output goes back into Obsidian - formatted files, Marp slides, matplotlib plots - viewable alongside the raw sources that produced them.

The harness parallels are direct. Raw is your immutable source of truth. Wiki is agent-owned state - like .claude/, maintained by the agent, not manually edited. Operations are golden principles for knowledge work: defined, repeatable procedures applied consistently.

Vannevar Bush described the Memex in 1945: a desk that stored and linked all your documents, let you trace associative trails, and built a record of your thinking over time. His unsolved problem was maintenance - who keeps the index current, who updates the cross-links as the corpus grows? The LLM handles that. The harness is what makes it reliable.

The harness isn’t specific to code.

The three operational primitives: skills, shell, compaction

We’re shifting from single-turn assistants to long-running agents that handle real knowledge work: reading large datasets, updating files, and writing apps. Based on developer feedback and their own experience building internal agents, both OpenAI and Anthropic released three primitives that make long-horizon work practical.

The pattern is identical across ecosystems. The implementations differ.

Skills: procedures agents load on demand

OpenAI’s approach:

A skill is a bundle of files plus a SKILL.md manifest containing frontmatter and instructions. Think: a versioned playbook the model can consult when it’s time to do real work. Skills are aligned with the Agent Skills open standard.

When skills are available, the platform exposes each skill’s name, description, and path to the model. The model uses that metadata to decide whether to invoke a skill. If it does, it reads SKILL.md for the full workflow.
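The metadata-first lookup described above can be sketched in a few lines. This is an illustration, not either platform's loader: I assume each skill lives in its own folder with a SKILL.md whose frontmatter is flat `key: value` pairs between two `---` lines, and the function names are mine. Real frontmatter may need a YAML parser.

```python
from pathlib import Path


def read_frontmatter(skill_md: Path) -> dict[str, str]:
    """Parse flat `key: value` frontmatter between two '---' lines."""
    lines = skill_md.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter; the body is not read here
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def skill_index(skills_root: Path) -> list[dict[str, str]]:
    """Build the name/description/path list a model would see up front."""
    index: list[dict[str, str]] = []
    for manifest in sorted(skills_root.glob("*/SKILL.md")):
        meta = read_frontmatter(manifest)
        # Skills without both fields are skipped: the model cannot route to them.
        if "name" in meta and "description" in meta:
            index.append(
                {
                    "name": meta["name"],
                    "description": meta["description"],
                    "path": str(manifest),
                }
            )
    return index
```

Only the index goes into context. The full SKILL.md body is read later, and only for the skill the model decides to invoke, which is the same map-not-manual idea applied to procedures.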

Claude Code’s approach:

Claude Code skills work identically. A skill is a markdown file with frontmatter (name, description, version) plus step-by-step instructions. Store skills in .claude/skills/ or install community skills via the Claude Skills Library.

Key difference: Claude Code’s skills ecosystem integrates with MCP servers for external tool access, while OpenAI’s skills use the shell tool for execution.

The andrej-karpathy-skills repo - 143,000+ stars, installable from the Claude Code marketplace - shows what this looks like at its simplest: four engineering principles in a single CLAUDE.md, available as a per-project copy or an installable plugin.

flowchart LR
    subgraph openai ["OpenAI Skills"]
        O1["SKILL.md manifest"] --> O2["Agent reads procedure"]
        O2 --> O3["Executes via shell tool"]
    end
    subgraph claude ["Claude Code Skills"]
        C1[".claude/skills/ manifest"] --> C2["Agent reads procedure"]
        C2 --> C3["Executes via MCP servers"]
    end
    style openai fill:#d4edda,stroke:#28a745
    style claude fill:#e8f4fd,stroke:#0d6efd

Shell: execution for agents

OpenAI’s approach:

The shell tool lets models work inside a real terminal environment - either hosted containers managed by OpenAI, or a local shell runtime you execute yourself (same tool semantics, but you control the machine). Hosted shell runs through the Responses API, which means your requests come with stateful work, tool calls, multi-turn continuation, and artifacts.

Claude Code’s approach:

Claude Code runs in your local terminal by default - no hosted containers. You control the machine. The CLI integrates directly with your development environment, maintaining persistent awareness of your entire project.

For remote/cloud execution, you can:

Run the Claude Code CLI in CI/CD pipelines
Use Claude Desktop for GUI-based workflows
Connect to remote machines via SSH + the Claude Code CLI

Compaction: keep long runs moving

OpenAI’s approach:

As workflows get longer, they run into context window limits. Server-side compaction keeps long runs moving by managing the context window and compressing conversation history automatically.

Two modes:

Server-side compaction: when context crosses the threshold, compaction runs automatically in-stream.
Standalone /responses/compact endpoint: use when you want explicit control over when compaction happens.

Claude Code’s approach:

Automatic conversation compacting works the same way. When a conversation approaches the token limit (configurable threshold), Claude automatically summarizes the conversation, creates a compaction block, and continues with the compacted context.

Key additions for Claude Code:

Customize compaction in CLAUDE.md: “When compacting, always preserve the full list of modified files and any test commands”
AfterCompaction hooks: inject critical context back after summarization (active plans, golden principles)
Artifacts don’t count against token limits: Claude artifacts (up to 20MB) store code and outputs separately from conversation context
Manual trigger: run /compact with focus instructions (for example, “/compact Focus on the API changes”) for explicit control

Claude Code’s automatic compaction continues to improve, both in memory usage during long sessions and in how well it preserves critical state.

Why they’re better together

flowchart LR
    subgraph skills ["Skills = the HOW"]
        S1["SKILL.md manifest"]
        S2["Templates + examples"]
        S3["Guardrails + routing"]
    end
    subgraph shell ["Shell/MCP = the DO"]
        SH1["Install dependencies"]
        SH2["Run scripts + tools"]
        SH3["Write artifacts"]
    end
    subgraph compaction ["Compaction = the CONTINUITY"]
        C1["Auto-compress when context fills"]
        C2["Pin immutable constraints"]
        C3["Multi-hour runs stay coherent"]
    end
    skills -->|" model loads procedure "| shell
    shell -->|" long run hits limit "| compaction
    compaction -->|" resumes with full context "| shell
    style skills fill: #e8f4fd, stroke: #0d6efd
    style shell fill: #d4edda, stroke: #28a745
    style compaction fill: #fff3cd, stroke: #ffc107

Skills reduce prompt spaghetti by moving stable procedures and examples into a reusable bundle.
Shell/MCP provides a full execution environment, letting you install code, run scripts, and write outputs.
Compaction preserves continuity on long runs, so the same workflow can keep executing without manual context surgery.

The philosophy, summarized: Use skills to encode the how (procedures, templates, guardrails). Use shell/MCP to execute the do (install, run, write artifacts). Use compaction to keep long runs coherent (without hand-managing context).

The 10 tips that actually matter

These come from OpenAI’s developer blog, informed by their work building Codex and production experience from Glean. Each tip below shows OpenAI implementation patterns and Claude Code adaptations where they differ.

Applicability to Claude Code: Tips 1-5 (skill design) and Tip 10 (local/cloud parity) apply directly with the same patterns. Tips 6-9 (security, networking, credentials) use different mechanisms (MCP permission tiers vs OpenAI allowlists) but follow the same principles.

Tips 1-5: skill design and routing

Tip 1: Write skill descriptions like routing logic, not marketing copy.

Your skill’s description is effectively the model’s decision boundary. It should answer: when should I use this? When should I not? What are the outputs and success criteria?

OpenAI + Claude: Include a short “Use when vs. don’t use when” block directly in the front matter description.

Tip 2: Add negative examples and edge cases to reduce misfires.

A surprising failure mode is that making skills available can initially reduce correct triggering. Glean saw skill-based routing drop by about 20% in targeted evals, then recover after they added negative examples and edge case coverage.

OpenAI + Claude: Write explicit “Don’t call this skill when...” cases in the skill body and suggest what to do instead.
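Tips 1 and 2 are easy to check mechanically. The sketch below is one possible lint, assuming skills spell out their routing with the literal phrases "Use when" and "Don't use when"; the function name check_skill_routing and those phrases are conventions chosen for this example, not something either ecosystem requires.

```bash
#!/bin/bash
# Hypothetical lint: a skill should say both when to use it and when not to.
# Prints what is missing and returns non-zero if either statement is absent.
check_skill_routing() {
  local file="$1" text positive_only status=0
  # Lowercase the file so both checks are case-insensitive.
  text="$(tr '[:upper:]' '[:lower:]' < "$file")"
  # Remove the negative phrase first; otherwise "don't use when" would also
  # satisfy the positive check. ".{1,3}" accepts straight or curly apostrophes.
  positive_only="$(printf '%s\n' "$text" | sed -E 's/don.{1,3}t use when//g')"
  if [[ "$positive_only" != *"use when"* ]]; then
    echo "$file: missing a 'Use when' statement"
    status=1
  fi
  # If nothing was removed, the negative phrase never appeared.
  if [[ "$text" == "$positive_only" ]]; then
    echo "$file: missing a 'Don't use when' statement"
    status=1
  fi
  return "$status"
}
```

Run it over every skill file in CI, or from a hook, so a skill without negative cases never reaches the agent.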

Tip 3: Put templates and examples inside the skill, not the system prompt.

Templates inside skills have two advantages: they’re available exactly when needed (when the skill is invoked), and they don’t inflate tokens for unrelated queries. Glean reported this pattern drove some of their biggest quality and latency gains in production.

OpenAI: Include templates in SKILL.md under an “Examples” section.
Claude: Include templates in .claude/skills/*.md with concrete examples inline.

Tip 4: Design for long runs early with container reuse and compaction.

Long-horizon agents rarely succeed as one-shot prompts. Plan for continuity from the start:

Reuse the same container/session across steps when you want stable dependencies, cached files, and intermediate outputs.
Pass previous_response_id (OpenAI) or maintain the conversation thread (Claude) so the model can continue work.
Use compaction as a default long-run primitive, not an emergency fallback.

Claude-specific: Use artifacts (up to 20MB) to persist structured data across compaction events.

Tip 5: For determinism, just tell the model to use the skill.

“Use the <skill-name> skill.” That’s it. Simplest reliability lever you can pull. Turns fuzzy routing into an explicit contract.

OpenAI + Claude: Works identically in both ecosystems.

Tips 6-10: security, networking, and portability

Tip 6: Treat skills plus networking as a high-risk combo.

Security posture

Combining skills with open network access creates a high-risk path for data exfiltration. If you use networking, keep network allowlists strict, assume tool output is untrusted, and avoid open internet plus unrestricted procedures in consumer-facing flows.

OpenAI: Use org-level and request-level network_policy allowlists.
Claude: Use MCP server vetting and permission tiers (Deny/Ask/Allow rules).

Strong default posture for both:

Skills: allowed
Shell/MCP: allowed
Network: enabled only with a minimal allowlist, per request, for narrowly scoped tasks
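To make "minimal allowlist, per request" concrete, here is a small deny-by-default check. It is only a sketch: host_allowed is a hypothetical helper, the domains in the usage lines are placeholders, and real enforcement belongs in the platform's own network policy or permission rules rather than in a script the agent can edit.

```bash
#!/bin/bash
# Hypothetical per-request check: allow a URL only if its host is listed.
# Usage: host_allowed <url> <allowed-host>...
host_allowed() {
  local url="$1"; shift
  local host="${url#*://}"   # strip the scheme
  host="${host%%/*}"         # strip the path
  host="${host%%:*}"         # strip the port
  local allowed
  for allowed in "$@"; do
    if [[ "$host" == "$allowed" ]]; then
      return 0
    fi
  done
  # Nothing matched, or the allowlist was empty: deny.
  return 1
}

# Example with placeholder domains.
host_allowed "https://api.example.com/v1/items" api.example.com && echo "allowed"
host_allowed "https://evil.example.net/" api.example.com || echo "blocked"
```

Exact host matching is deliberate: a suffix or substring match would let api.example.com.evil.net through.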

Tip 7: Make a standard handoff boundary for artifacts.

OpenAI: Treat /mnt/data as the standard place to write outputs you’ll retrieve, review, or pass back into subsequent steps.

Claude: Use Claude artifacts for structured outputs that persist across sessions. For file-based outputs, define a standard path like ./artifacts/ in CLAUDE.md.

Mental model: tools write to disk, models reason over disk, developers retrieve from disk.
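A handoff boundary can be as small as one helper that every tool writes through. The sketch below assumes a hypothetical write_artifact function and an ARTIFACT_DIR variable that defaults to ./artifacts; neither name comes from OpenAI or Anthropic tooling.

```bash
#!/bin/bash
# Hypothetical helper: write stdin to the agreed handoff directory and print
# the resulting path so the next step (or a human) knows where to look.
# Usage: some_command | write_artifact <file-name>
write_artifact() {
  local name="$1"
  local dir="${ARTIFACT_DIR:-./artifacts}"   # assumed default location
  mkdir -p "$dir"
  cat > "$dir/$name"
  printf '%s\n' "$dir/$name"
}
```

For example, printf '# Weekly report\n' | write_artifact report.md writes ./artifacts/report.md and prints that path. Setting ARTIFACT_DIR=/mnt/data would point the same helper at the OpenAI convention.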

Tip 8: Understand allowlists as a two-layer system.

OpenAI: Networking is controlled via org-level allowlist (admin-configured) and request-level network_policy (must be subset of org allowlist).

Claude: MCP security uses permission tiers: Deny rules block operations, Ask rules require user approval, Allow rules auto-approve. Hook controls let you disable all hooks between sessions to prevent persistent malicious code.

flowchart TD
    A["Agent request with network/tool access"] --> B{"In allowlist (OpenAI) or permission tier (Claude)?"}
    B -->|no| C["Blocked"]
    B -->|yes| D{"Needs auth?"}
    D -->|no| E["Request proceeds"]
    D -->|yes| F["Inject credentials (domain_secrets for OpenAI, MCP auth for Claude)"]
    F --> E
    style C fill: #f8d7da, stroke: #dc3545
    style E fill: #d4edda, stroke: #28a745
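The "must be a subset" rule from the OpenAI side is simple to validate before a request is ever sent. This is a sketch of the idea only: request_policy_valid is a hypothetical function, and it assumes both lists are space-separated plain domain names without wildcards.

```bash
#!/bin/bash
# Hypothetical validation: every domain in the request-level policy must
# already appear in the org-level allowlist.
# Usage: request_policy_valid "<org domains>" "<request domains>"
request_policy_valid() {
  local org_list="$1" request_list="$2"
  local domain status=0
  # Unquoted on purpose: split the request list on whitespace.
  for domain in $request_list; do
    case " $org_list " in
      *" $domain "*) ;;   # present at org level
      *) echo "not in org allowlist: $domain"; status=1 ;;
    esac
  done
  return "$status"
}
```

An empty request list passes, which matches the default posture above: no network unless a task explicitly asks for a domain the org has already approved.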

Tip 9: Use secure credential injection for authenticated calls.

OpenAI: Use domain_secrets so the model never sees raw credentials. At runtime, the model sees placeholders (e.g., $API_KEY), and a sidecar injects real values only for approved destinations.

Claude: MCP servers handle authentication separately from the model. Credentials live in MCP server config, never in conversation context.
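The placeholder idea can be simulated in a few lines. This is a toy model of the pattern, not how domain_secrets or any MCP server is implemented: inject_secret, its argument order, and the example hosts are all made up for illustration.

```bash
#!/bin/bash
# Toy model of a credential sidecar: the model only ever writes the
# placeholder; the real value is substituted just before the call, and only
# when the destination host is approved.
# Usage: inject_secret <host> <header-with-placeholder> <secret> <approved-host>...
inject_secret() {
  local host="$1" header="$2" secret="$3"; shift 3
  local placeholder='$API_KEY' approved resolved
  for approved in "$@"; do
    if [[ "$host" == "$approved" ]]; then
      # Quoting pattern and replacement makes bash treat both literally.
      resolved=${header//"$placeholder"/"$secret"}
      printf '%s\n' "$resolved"
      return 0
    fi
  done
  # Unapproved destination: leave the placeholder untouched and signal failure.
  printf '%s\n' "$header"
  return 1
}
```

The property worth preserving in a real setup is the one this sketch shows: the secret never appears in anything the model reads or writes, and an unapproved host only ever receives the literal placeholder.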

Tip 10: Use the same APIs in the cloud and locally.

OpenAI: Skills work with hosted shell and local shell mode. Shell has a local execution mode where you execute shell_call yourself and return shell_call_output back to the model.

Claude: Claude Code runs locally by default. For cloud execution, run the Claude Code CLI in CI/CD or use Claude Desktop for GUI workflows. Skills and MCP servers work identically across local/remote.

Practical dev loop (both ecosystems):

Start local (fast iteration, access to internal tooling, easy debugging).
Move to hosted/CI when you want repeatability, isolation, and deployment consistency.
Keep skills the same across both modes (the workflow stays stable even when execution moves).

Three build patterns

flowchart LR
    subgraph A ["Pattern A: Artifact"]
        direction TB
        A1["Install libs"] --> A2["Fetch data / call API"] --> A3["Write to standard path (/mnt/data or ./artifacts/)"]
    end
    subgraph B ["Pattern B: Repeatable"]
        direction TB
        B1["Load skill"] --> B2["Execute in shell/MCP"] --> B3["Produce artifact deterministically"]
    end
    subgraph C ["Pattern C: Enterprise"]
        direction TB
        C1["Skill = living SOP"] --> C2["Multi-tool orchestration"] --> C3["Consistent execution across org"]
    end
    A -->|" add skills for consistency "| B
    B -->|" scale across teams "| C
    style A fill: #d4edda, stroke: #28a745
    style B fill: #e8f4fd, stroke: #0d6efd
    style C fill: #fff3cd, stroke: #ffc107

Each pattern builds on the previous one. Start with A for quick wins, graduate to B for repeatable workflows, scale to C across the org.

Pattern A: Install, fetch, write artifact.

OpenAI: Install libraries, call API, write report to /mnt/data/report.md.
Claude: Install libraries, call API via MCP server, write to Claude artifact or ./artifacts/report.md.

This creates a clean review boundary - your app can show the artifact to the user, log it, diff it, or feed it into a later step.

Pattern B: Skills + shell/MCP for repeatable workflows.

OpenAI: Encode workflow in SKILL.md, mount into shell environment, execute deterministically.
Claude: Encode workflow in .claude/skills/*.md, execute via MCP servers for external tool access.

Particularly effective for spreadsheet analysis, dataset cleaning, and standardized report generation.

Pattern C: Skills as enterprise workflow carriers.

One pattern that shows up early: accuracy drops in the gap between single-tool invocation and multi-tool orchestration. Skills close that gap.

Glean’s Salesforce-oriented skill (works for both OpenAI and Claude): increased eval accuracy from 73% to 85% and reduced time-to-first-token by 18.1%. Practical tactics: careful routing, negative examples, embedding templates inside the skill.

Skills become living SOPs (standard operating procedures): updated as your org evolves, executed consistently by agents across both ecosystems.

Architecture enforcement: boundaries, not micromanagement

Agents are most effective in environments with strict boundaries and predictable structure. OpenAI built their application around a rigid architectural model. Each business domain is divided into a fixed set of layers, with strictly validated dependency directions.

Claude Code teams are doing the same thing. As agentic coding becomes standard in 2026, the consensus is clear: constraints enable speed without decay.

The rule (enforced mechanically):

Within each business domain (for example, App Settings), code can only depend “forward” through a fixed set of layers. Cross-cutting concerns (auth, connectors, telemetry, feature flags) enter through a single explicit interface: Providers. Anything else is disallowed and enforced mechanically.

flowchart LR
    subgraph domain ["Business domain (e.g. App Settings)"]
        direction LR
        T["Types"] --> CF["Config"] --> R["Repo"] --> S["Service"] --> RT["Runtime"] --> U["UI"]
    end
    subgraph providers ["Providers (single entry point)"]
        direction TB
        P1["Auth"]
        P2["Connectors"]
        P3["Telemetry"]
        P4["Feature flags"]
    end
    providers -->|" only allowed cross-cutting edge "| S
    T -.->|" backward dependency BLOCKED by linter "| U
    style domain fill: #e8f4fd, stroke: #0d6efd
    style providers fill: #fff3cd, stroke: #ffc107
    style T fill: #d4edda, stroke: #28a745

Dependencies flow left to right only. The linter blocks any backward edge. Providers are the only way cross-cutting concerns get in.
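The ordering rule reduces to comparing two positions in a list. The sketch below encodes one reading of it, consistent with the hook example in this section: a layer may import from itself or from any layer to its left, never from one to its right. The function names are hypothetical, and the Providers exception is not modeled.

```bash
#!/bin/bash
# Layer order taken from the diagram, left to right.
LAYERS=(types config repo service runtime ui)

# Print the position of a layer in LAYERS; non-zero exit for unknown layers.
layer_index() {
  local i
  for i in "${!LAYERS[@]}"; do
    if [[ "${LAYERS[$i]}" == "$1" ]]; then
      echo "$i"
      return 0
    fi
  done
  return 1
}

# Assumed reading of the rule: importing from the same layer or from a layer
# further left is fine; importing from a layer further right is a violation.
# Returns 0 (allowed), 1 (violation), or 2 (unknown layer name).
import_allowed() {
  local importer imported
  importer="$(layer_index "$1")" || return 2
  imported="$(layer_index "$2")" || return 2
  (( imported <= importer ))
}

import_allowed ui service && echo "ok: ui -> service"
import_allowed service ui || echo "blocked: service -> ui"
```

A structural test can walk every import in the repo, map both files to their layers, and fail the build on the first pair that import_allowed rejects.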

Why this matters early

This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it’s an early prerequisite: the constraints are what allow speed without decay or architectural drift.

Enforcement through custom lints and structural tests

OpenAI: Custom linters and structural tests (Codex-generated), plus “taste invariants.” They statically enforce structured logging, naming conventions, file size limits. Because the lints are custom, they write error messages that inject remediation instructions into agent context.

Claude Code: Hooks provide deterministic enforcement. PostToolUse hooks run linters immediately after code changes. PreToolUse hooks block violations before they happen. Example from production repos:

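#!/bin/bash
# .claude/hooks/PostToolUse.sh
# Register this script as a PostToolUse hook in your Claude Code settings.
# Claude Code passes the hook payload as JSON on stdin (requires jq).

# Path of the file the tool just wrote or edited
FILE_PATH="$(jq -r '.tool_input.file_path // empty')"
[[ -z "$FILE_PATH" ]] && exit 0

# Enforce layered architecture (paths may be absolute, so match anywhere)
LAYER=""
if [[ "$FILE_PATH" =~ (^|/)src/.+/types/ ]]; then
  LAYER="types"
elif [[ "$FILE_PATH" =~ (^|/)src/.+/service/ ]]; then
  LAYER="service"
fi

# Check for backward dependencies (service importing from UI).
# This is a crude text match; tighten the pattern to your import style.
if [[ "$LAYER" == "service" ]] && grep -q "from.*UI" "$FILE_PATH"; then
  # Exit code 2 reports a blocking error; stderr is fed back to the agent
  echo "❌ Blocked: Service layer cannot import from UI layer" >&2
  echo "Fix: Move shared logic to Types or create a Provider" >&2
  exit 2
fi

# Run language-specific linter (auto-fix only; lint failures do not block)
npx eslint --fix "$FILE_PATH"
exit 0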

The philosophy: enforce boundaries centrally, allow autonomy locally. You care deeply about boundaries, correctness, and reproducibility. Within those boundaries, you allow teams - or agents - significant freedom in how solutions are expressed.

The resulting code doesn’t always match human stylistic preferences, and that’s okay. As long as the output is correct, maintainable, and legible to future agent runs, it meets the bar.

In FlowForge, I enforce this through sbt module definitions (dependencies only flow one direction), .scalafix.conf rules (no var, no asInstanceOf, no println), and per-module coverage thresholds (core 90%, connectors 80%). These rules work identically for both Codex and Claude Code because they’re language-level enforcement, not agent-specific.

The autonomy ladder: from drafting code to merging PRs

As more of the development loop was encoded directly into the system - testing, validation, review, feedback handling, and recovery - OpenAI’s repository crossed a meaningful threshold where Codex can end-to-end drive a new feature.

Claude Code teams are reaching the same milestone. Building agents with the Claude Agent SDK shows similar autonomous workflows becoming production-ready.

Given a single prompt, the agent can now drive a feature end-to-end:

flowchart TD
    A["1. Validate codebase state"] --> B["2. Reproduce reported bug"]
    B --> C["3. Record evidence\n(video for Codex, artifact for Claude)"]
    C --> D["4. Implement fix"]
    D --> E["5. Validate fix by driving app"]
    E --> F["6. Record verification (video/artifact)"]
    F --> G["7. Open PR"]
    G --> H["8. Respond to agent + human feedback"]
    H --> I{"9. Build passes?"}
    I -->|no| J["Remediate build failure"]
    J --> H
    I -->|yes| K{"10. Judgment required?"}
    K -->|yes| L["Escalate to human"]
    K -->|no| M["11. Merge"]
    style A fill: #e8f4fd, stroke: #0d6efd
    style M fill: #d4edda, stroke: #28a745
    style L fill: #fff3cd, stroke: #ffc107

That’s 11 steps from a single prompt. The agent loops on feedback until reviewers are satisfied (the Ralph Wiggum Loop - agent-to-agent review iteration). Humans only step in when actual judgment is needed.
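Steps 9 through 11 amount to a loop with two exits. The sketch below is a simplified control-flow illustration, not either team's implementation: build_passes, remediate_build, and judgment_required are hypothetical commands you would supply, and the attempt cap is an added assumption so the loop cannot spin forever.

```bash
#!/bin/bash
# Simplified tail of the ladder: retry until the build passes, then either
# escalate to a human or merge. Prints the chosen outcome.
# Usage: drive_pr [max-attempts]
drive_pr() {
  local max_attempts="${1:-5}" attempt=1
  until build_passes; do
    if (( attempt >= max_attempts )); then
      # Assumed safety valve: give up and hand over to a human.
      echo "escalate"
      return 0
    fi
    remediate_build
    attempt=$(( attempt + 1 ))
  done
  if judgment_required; then
    echo "escalate"
  else
    echo "merge"
  fi
}
```

In a real harness the three hooks would wrap CI status checks, an agent run, and whatever policy decides that a change needs human judgment. The point of the shape is that "escalate" is a normal return value, not an error path.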

Implementation differences:

OpenAI Codex:

Uses Chrome DevTools Protocol for step 2 (reproduce bug) and step 5 (validate fix)
Records videos as evidence artifacts
Reviews own changes locally, requests agent reviews in cloud
Uses gh CLI for PR operations

Claude Code:

Uses MCP servers for app interaction (browser automation, API testing)
Records evidence in Claude artifacts (structured test results, screenshots)
Subagents handle specialized review (security, test coverage, performance)
Uses gh CLI for PR operations (identical to Codex)

Caveat

This behavior depends heavily on the specific structure and tooling of the repository and should not be assumed to generalize without similar investment - at least, not yet. Both OpenAI and Anthropic teams spent months building the harness infrastructure.

How agents interact with the system

OpenAI: Humans interact almost entirely through prompts: describe a task, run the agent, allow it to open a pull request. To drive a PR to completion, Codex reviews its own changes locally, requests additional agent reviews, responds to feedback, and iterates until all reviewers are satisfied. Codex uses standard development tools directly (gh, local scripts, repository-embedded skills).

Claude Code: The workflow is iterative and conversational, feeling like you have a junior developer sitting next to you. Claude Code integrates directly with your development environment, maintaining persistent awareness of your entire project. For review, specialized subagents handle security scans, test coverage, and code quality checks.

Human review is optional, not required. Over time, both teams have pushed almost all review effort towards being handled agent-to-agent.

Throughput changes the merge philosophy

As agent throughput increased, many conventional engineering norms became counterproductive. Both OpenAI and Claude Code teams operate with minimal blocking merge gates. Pull requests are short-lived. Test flakes are often addressed with follow-up runs rather than blocking progress indefinitely.

In a system where agent throughput far exceeds human attention, corrections are cheap, and waiting is expensive.

This would be irresponsible in a low-throughput environment. Here, it’s often the right tradeoff.

What “agent-generated” actually means

When they say the codebase is generated by agents, they mean everything: product code and tests, CI configuration and release tooling, internal developer tools, documentation and design history, evaluation harnesses, review comments and responses, scripts that manage the repository itself, and production dashboard definition files.

Humans always remain in the loop, but work at a different layer of abstraction. They prioritize work, translate user feedback into acceptance criteria, and validate outcomes. When the agent struggles, they treat it as a signal: identify what is missing - tools, guardrails, documentation - and feed it back into the repository, always by having the agent itself write the fix.

15 lessons from building ChatGPT apps (adapted for Claude ecosystem)

Alpic built two dozen ChatGPT apps over three months. Their core insight is what they call the three body problem: traditional web apps have two actors (user, UI), but AI apps add a third (model). The hard part is managing context asymmetry - each body has partial knowledge, and no single one has the full picture.

These patterns apply to Claude Desktop apps and MCP servers. The three-body problem exists whether you’re building for ChatGPT or Claude.

flowchart TD
    subgraph traditional ["Traditional web app"]
        direction LR
        U1["User"] <-->|" clicks, sees "| UI1["UI"]
    end
    subgraph ai ["AI app (ChatGPT or Claude)"]
        direction TB
        U2["User"] <-->|" types, sees "| UI2["Widget/UI/Desktop"]
        UI2 <-->|" tools/MCP "| M["Model"]
        M <-->|" tool results "| U2
    end
    style traditional fill: #f0f0f0, stroke: #999
    style ai fill: #e8f4fd, stroke: #0d6efd

Three actors, each with partial knowledge. The user sees the chat and the UI. The UI sees its own state and can push context to the model. The model sees tool outputs but not UI internals unless you explicitly expose them. Managing who knows what - that’s the whole game.

Lessons 1-4: system and architecture (ChatGPT + Claude adaptations)

Lesson 1: Not all context should be shared.

Different parts need intentionally different views of state. In a murder mystery game, the model needs to know who the killer is to roleplay correctly, while the UI and user must not. In a Time’s Up game, reversed: the UI shows the secret word, the model must stay unaware.

ChatGPT: Use structuredContent for data both model and widget need. Use _meta for response metadata visible only to the widget, hidden from the model.

Claude: Use MCP resources to expose only what the model needs. Use Claude artifacts for client-side state that stays hidden from the model.

Lesson 2: Lazy-loading doesn’t translate well to AI apps.

In AI apps, tool calls imply delays - often several seconds due to security sandboxing and model reasoning. Front-load aggressively: send as much data as possible into the initial tool response.

ChatGPT: Hydrate the widget via window.openai.toolOutput. If the widget can safely fetch from a public API without sharing info with the model, classic XHR calls work.

Claude: Use MCP servers for data fetching. Pre-load data into artifacts or MCP resources. For public APIs, direct HTTP works (model doesn’t see the internals).

Lesson 3: The model needs visibility.

When a user interacts with a widget (selecting a product) then asks a question in chat, the model has no idea what they’re referring to.

ChatGPT: Use window.openai.setWidgetState(state) for imperative updates. Attach data-llm attributes directly to components for declarative context. Alpic built a Vite plugin that scrapes these attributes and automatically updates widgetState.

Claude: For Claude Desktop, push state updates via MCP tools. For Claude Code, maintain state in artifacts (up to 20MB) or .claude/context/ files that persist across sessions.

Lesson 4: Different interactions require different APIs.

Widget-to-server, model-to-server, widget-to-model - each path exists to support a different kind of interaction. Make these communication paths explicit and be intentional about which mechanism handles which part of the experience.

ChatGPT + Claude: Identical principle. Define clear boundaries: what’s HTTP (widget ↔ server), what’s tool calls (model ↔ server), what’s state updates (widget → model via MCP or widgetState).

Lessons 5-8: UI and product behavior (ChatGPT + Claude adaptations)

Lesson 5: UI must adapt to multiple display modes.

ChatGPT: Apps can appear inline (stays in conversation history), fullscreen (takes entire screen with chat bar at bottom), or picture-in-picture (floats on top). Account for device-specific safe zones.

Claude Desktop: Similar modes exist. Claude Desktop’s visual diffs show side-by-side comparisons. Artifacts can display inline or in separate pane. Design for both.

Claude Code CLI: No UI modes (terminal-only), but artifacts can be opened in external viewers. Design for text-first presentation.

Lesson 6: UI consistency matters in an embedded environment.

ChatGPT: Use the OpenAI Apps SDK UI Kit for ready-to-use components that align with ChatGPT’s design system.

Claude Desktop: No official UI kit yet, but MCP-connected apps should follow system UI conventions (match macOS/Windows design language). Use native components where possible.

Lesson 7: Language-first filtering beats traditional UI controls.

When users can say “Sunny destinations in Europe for under $200,” forcing them through checkboxes and range sliders adds friction. Provide the model with a List of Values (LOV) for tool parameters so it maps natural language directly to backend API requirements.

ChatGPT + Claude: Identical pattern. Define LOVs in tool/MCP schemas. Let the model do natural language → structured parameters mapping.
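On the server side, an LOV is just a closed set you validate against before touching the backend. The sketch below assumes a hypothetical destination-search tool with climate, region, and max_price parameters; the value lists and the query-string format are invented for the example.

```bash
#!/bin/bash
# Hypothetical lists of values the tool schema would advertise to the model.
CLIMATE_LOV=(sunny mild snowy)
REGION_LOV=(europe asia americas)

# Return 0 if the value is one of the allowed values.
in_lov() {
  local value="$1"; shift
  local candidate
  for candidate in "$@"; do
    if [[ "$value" == "$candidate" ]]; then
      return 0
    fi
  done
  return 1
}

# Turn model-produced parameters into a backend query string, rejecting
# anything outside the LOVs or any price that is not a whole number.
build_query() {
  local climate="$1" region="$2" max_price="$3"
  if ! in_lov "$climate" "${CLIMATE_LOV[@]}"; then
    echo "invalid climate: $climate" >&2
    return 1
  fi
  if ! in_lov "$region" "${REGION_LOV[@]}"; then
    echo "invalid region: $region" >&2
    return 1
  fi
  if ! [[ "$max_price" =~ ^[0-9]+$ ]]; then
    echo "invalid max_price: $max_price" >&2
    return 1
  fi
  printf 'climate=%s&region=%s&max_price=%s\n' "$climate" "$region" "$max_price"
}
```

With this in place, "Sunny destinations in Europe for under $200" becomes build_query sunny europe 200 once the model has done the mapping, and a hallucinated value such as tropical fails fast with a message the model can act on.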

Lesson 8: Files unlock richer interactions.

Users upload a photo of a product, model identifies it, widget continues into product matching and discovery.

ChatGPT: Tools consume files via openai/fileParams. Widgets handle files via window.openai.uploadFile and window.openai.getFileDownloadUrl.

Claude: Claude Desktop handles files natively. MCP servers can expose file upload endpoints. Artifacts store up to 20MB including images, PDFs, and structured data.

Lessons 9-10: production readiness (ChatGPT + Claude adaptations)

Lesson 9: CSPs are the new CORS.

ChatGPT: OpenAI renders Apps inside a double-nested iframe. Content Security Policies are strictly enforced. Declare connectDomains, resourceDomains, frameDomains, redirectDomains in app manifest.

Claude Desktop: MCP servers run with permission tiers. Define allowed operations in MCP manifest. Deny rules block dangerous operations, Ask rules require user approval, Allow rules auto-approve.

Security considerations:

ChatGPT: CSP violations = blocked iframe operations
Claude: MCP permission violations = blocked tool calls

Both require explicit allowlisting of external domains/operations.

Lesson 10: Small flags have outsized impact.

ChatGPT: widgetDomain is required for submission. widgetAccessible controls whether widget can call tools on its own via callTool. Tool annotations (readOnlyHint, destructiveHint, openWorldHint) are required.

Claude: MCP tool definitions require similar metadata: operation safety level, required parameters, auth requirements. Hook configurations define which lifecycle events trigger which automation.

Lessons 11-14: iteration velocity (ChatGPT + Claude adaptations)

Lesson 11: Fast iteration requires hot reload.

ChatGPT: Long-TTL resource caching makes standard HMR incompatible. Alpic built a Vite plugin that intercepts resource requests and injects real-time updates into the ChatGPT iframe.

Claude Desktop: MCP servers support hot reload via server restart. Use file watchers to auto-reload MCP configs on change.

Claude Code CLI: Supports live file watching. Changes to .claude/ directory (skills, hooks, agents) take effect immediately or on next session start.

Lesson 12: Not every test belongs in the production environment.

ChatGPT: Build a lightweight local emulator mocking the ChatGPT host environment. Reserve real ChatGPT tests for validating model interactions.

Claude: Claude Code runs locally by default - your dev environment IS the test environment. Use MCP server mocks for integration tests. Reserve Claude Desktop tests for end-to-end workflows.

Lesson 13: Mobile testing requires explicit support.

ChatGPT: Vite’s default localhost makes tunnelled URLs inaccessible from other devices. Extend with domain forwarding on tunnelled ports.

Claude Desktop: Desktop app is macOS/Windows only (no mobile). Claude mobile apps exist but focus on conversational use cases. Design for desktop-first, mobile web as fallback.

Lesson 14: Familiar abstractions speed delivery.

ChatGPT: The Apps SDK exposes low-level JavaScript APIs. Introduce React-friendly abstractions - hooks like useCallTool, useWidgetState, useLocale.

Claude: MCP SDKs exist for Python and TypeScript. Build higher-level abstractions around common patterns (auth, caching, rate limiting). Reuse across projects.

Codification and reuse (lesson 15)

Lesson 15: Turn lessons into reusable tooling.

ChatGPT ecosystem: Alpic created the Skybridge Framework (open-source React framework with hooks, dev tools, data-llm attribute) and a Codex Skill covering the full lifecycle.

Claude ecosystem: Community is building similar patterns. Production-tested configs from everything-claude-code, comprehensive examples from claude-code-showcase. Claude Skills Library has 50+ ready-to-use MCP servers and skills.

The pattern

That’s harness engineering applied to app development. Don’t keep rediscovering the same issues - encode them into reusable skills, MCP servers, and hooks. Share across the ecosystem.

Your operating model: staff vs principal split

Operating principle: engineers should optimize the agentic system (instructions + tools + checks + context), not only the immediate code artifact.

This is tool-agnostic. Whether you’re using Codex or Claude Code, the split stays the same.

Level	Primary ownership	Success signal	Failure signal
Staff engineer	Team harness implementation	Agent tasks reliable for team workflows	Repeated manual cleanup, inconsistent outputs
Principal engineer	Org-wide harness standards	Cross-team reuse, fewer regressions, faster onboarding	Fragmented conventions, model-specific drift

Software engineering track

Architecture maps (AGENTS.md or CLAUDE.md + deep docs)
Typed contracts and policy checks
CI-integrated drift scanners (or hooks for Claude)
Tool traces and remediation loops

Data engineering track

Data contracts and schema policies (compile-time or CI-time)
Deterministic transformation runbooks (purity enforcement)
Data quality and lineage checks in harness
Incident playbooks for failed pipelines and late data

The blueprint you can implement this quarter

gantt
    title Harness implementation timeline
    dateFormat YYYY-MM-DD
    axisFormat %b W%W
    section Foundation
        Instruction architecture: a1, 2026-03-02, 2w
        Quality enforcement: a2, after a1, 2w
    section Execution
        Skills and procedures: a3, after a2, 2w
        Shell/MCP and execution: a4, after a3, 2w
    section Lifecycle
        Compaction and context: a5, after a4, 4w

Five layers, each building on the last. You can run layers 1-2 in parallel if you have the team.

Layer 1: instruction architecture (weeks 1-2)

For OpenAI Codex:

Ship AGENTS.md as an index (under 120 lines)
Structure docs/ directory by concern
Create ADR log in docs/decisions/

For Claude Code:

Create CLAUDE.md in repo root (50-120 lines, project-specific only)
Structure .claude/agents/ for specialized subagents
Link from CLAUDE.md to docs/architecture/ for deep dives

Universal:

Define evidence docs with file paths and line numbers
Version all decisions and plans in the repo

Layer 2: quality enforcement (weeks 3-4)

For OpenAI Codex:

Define golden principles (5-10 rules)
Deploy daily drift scans
Make lint errors prescriptive

For Claude Code:

Define .claude/hooks/PostToolUse.sh for auto-fixes
Define .claude/hooks/PreToolUse.sh for blocking violations
Track hook trigger rate as a metric

Universal:

Auto-fix low-risk violations
Track manual cleanup time (goal: trending to zero)

Layer 3: executable procedures / skills (weeks 5-6)

For OpenAI Codex:

Convert top 3-5 workflows into SKILL.md files
Add templates, examples, negative examples
Store in skills/ directory with versioning

For Claude Code:

Convert workflows into .claude/skills/*.md files
Include routing logic (when to use, when not to use)
Install community skills from the Claude Skills Library

Universal:

Write skill descriptions like routing logic, not marketing copy
Include negative examples to reduce misfires
Test skill invocation accuracy (target >= 90%)

Layer 4: execution substrate / shell or MCP (weeks 7-8)

For OpenAI Codex:

Enable shell-backed execution (hosted or local)
Define standard artifact path (/mnt/data)
Instrument traces and incident metrics

For Claude Code:

Build MCP servers for external tools
Define standard artifact path (Claude artifacts or ./artifacts/)
Connect observability stack via MCP (logs, metrics, traces)

Universal:

Start with local mode (fast iteration)
Move to hosted/CI when ready for repeatability
Keep skills the same across local/remote

Layer 5: context lifecycle / compaction (ongoing)

For OpenAI Codex:

Enable server-side compaction (auto or manual /responses/compact)
Pass previous_response_id for multi-turn continuity
Pin immutable constraints in system message

For Claude Code:

Enable automatic compaction with custom CLAUDE.md instructions
Use .claude/hooks/AfterCompaction.sh to inject critical context after summarization
Store long-lived state in Claude artifacts (up to 20MB)

Universal:

Implement periodic context hygiene
Track context hit rate (target >= 90%)
Measure long-run coherence (target: 6+ hour sessions without drift)

Metrics that actually matter

The shift: from measuring lines of code written to measuring harness effectiveness.

Metric	Target	What it tells you
Agent task success rate (successful runs / total)	>= 85%	Harness is well-specified
Auto-remediation rate (auto-fixed / total issues)	>= 60%	Quality enforcement is mechanical
Context hit rate (resolved lookups / total)	>= 90%	Docs structure works
Manual cleanup time (weekly hours)	Trending to zero	Golden principles coverage
Skill invocation accuracy	>= 90%	Descriptions + negative examples work
Human escalation rate	< 15%	Boundaries are clear

Leading indicators (measure first): context hit rate, skill invocation accuracy, manual cleanup time.

Lagging indicators (improve as harness matures): agent task success rate, auto-remediation rate, human escalation rate.
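The table reduces to a handful of ratios checked against thresholds. A minimal sketch of a weekly scorecard follows, assuming you can count the numerators and denominators from your own logs. The metric keys and the sample numbers are invented for illustration; only the thresholds come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    name: str
    threshold: float
    higher_is_better: bool = True

    def met(self, value: float) -> bool:
        return value >= self.threshold if self.higher_is_better else value < self.threshold


# Thresholds copied from the metrics table. Manual cleanup time is a trend
# rather than a ratio, so it is left out of this sketch.
TARGETS = [
    Target("agent_task_success_rate", 0.85),
    Target("auto_remediation_rate", 0.60),
    Target("context_hit_rate", 0.90),
    Target("skill_invocation_accuracy", 0.90),
    Target("human_escalation_rate", 0.15, higher_is_better=False),
]


def rate(numerator: int, denominator: int) -> float:
    """Safe ratio: an empty week counts as 0.0 instead of raising."""
    return numerator / denominator if denominator else 0.0


def scorecard(values: dict[str, float]) -> dict[str, bool]:
    """Map each metric name to whether its target was met."""
    return {t.name: t.met(values[t.name]) for t in TARGETS}


# Hypothetical counts for one week.
week = {
    "agent_task_success_rate": rate(172, 200),
    "auto_remediation_rate": rate(45, 90),
    "context_hit_rate": rate(460, 500),
    "skill_invocation_accuracy": rate(88, 100),
    "human_escalation_rate": rate(24, 200),
}
for name, ok in scorecard(week).items():
    print(f"{name:28s} {'ok' if ok else 'MISS'}")
```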

How to measure across ecosystems:

OpenAI Codex:

Task success: PR merge rate without human intervention
Auto-remediation: drift scanner auto-fix rate
Context hits: AGENTS.md link click-throughs (instrument in docs)

Claude Code:

Task success: /stats command shows session metrics
Auto-remediation: hook auto-fix vs. block rate
Context hits: track .claude/agents/ invocation frequency

How OpenAI measures effectiveness: not by volume (1M lines is impressive but not the point). By velocity (3.5 PRs/engineer/day, increasing). By autonomy (end-to-end without human intervention). By amplification (humans one layer up from implementation).

When things go wrong: failure scenarios and fixes

Agent output quality regresses suddenly

Symptoms: PR quality drops, tests fail more often, manual cleanup time increases.

Check in order:

Did the instruction file (AGENTS.md or CLAUDE.md) become too large or contradictory?
Did a skill version change without tests?
Did compaction remove critical constraints?
Did evaluation thresholds change silently?

Fix (OpenAI): Audit AGENTS.md length and cross-links. Roll back skill version. Add compaction preservation rules in system message.

Fix (Claude): Check CLAUDE.md length (should be 50-120 lines). Review .claude/hooks/AfterCompaction.sh - is it injecting critical context back? Add preservation rules to CLAUDE.md.

Universal: Document findings in exec plans, have the agent implement the fix, update golden principles.
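The first check in that list is easy to automate. A minimal sketch, assuming the 50-120 line budget quoted above for CLAUDE.md and counting only non-blank lines (whether blank lines count is a judgment call). The function names are hypothetical, and the duplicate-line check is a cheap proxy for copy-paste drift, not a contradiction detector.

```python
from pathlib import Path


def audit_instruction_file(text: str, min_lines: int = 50, max_lines: int = 120) -> list[str]:
    """Return human-readable findings; an empty list means the file is within budget."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    findings = []
    if len(lines) > max_lines:
        findings.append(
            f"{len(lines)} non-blank lines exceeds budget of {max_lines}; move detail into docs/"
        )
    elif len(lines) < min_lines:
        findings.append(
            f"{len(lines)} non-blank lines is under {min_lines}; the index may be missing links"
        )
    # Exact duplicates (case-insensitive) usually mean a rule was pasted twice.
    seen, dupes = set(), set()
    for ln in lines:
        key = ln.strip().lower()
        if key in seen:
            dupes.add(key)
        seen.add(key)
    if dupes:
        findings.append(f"{len(dupes)} duplicated line(s)")
    return findings


def audit_path(path: str) -> list[str]:
    """Convenience wrapper for a file on disk, e.g. audit_path("CLAUDE.md")."""
    return audit_instruction_file(Path(path).read_text(encoding="utf-8"))
```

Run it in CI on every change to the instruction file so that growth shows up as a failed check rather than a slow quality regression.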

Agent keeps missing team conventions

Symptoms: Repeated style violations, inconsistent patterns, manual corrections needed.

Root cause: Knowledge lives in people’s heads or chat threads.

Fix (OpenAI): Move conventions to docs/conventions.md, link from AGENTS.md, add machine checks (custom linters).

Fix (Claude): Encode conventions in .claude/hooks/PostToolUse.sh as auto-fixes. Add high-level guidance to CLAUDE.md. Use PreToolUse hooks to block severe violations.

Universal: If an agent keeps getting it wrong, the convention isn’t in the repo. Make it mechanical.

Long workflows collapse after many turns

Symptoms: Agent loses context mid-task, forgets earlier decisions, loops on same issues.

Root cause: Compaction removed critical context, or no compaction enabled (hit hard limit).

Fix (OpenAI): Ensure server-side compaction is active. Pin immutable constraints in system message. Split long goals into checkpointed sub-runs with explicit handoff boundaries. Pass previous_response_id for continuation.

Fix (Claude): Enable automatic compaction in API settings. Add compaction preservation rules to CLAUDE.md: “When compacting, always preserve the full list of modified files and any test commands.” Use .claude/hooks/AfterCompaction.sh to inject critical state. Store long-lived context in Claude artifacts (up to 20MB).

Universal: Track compaction events as a metric. Measure pre/post-compaction coherence.
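Pre/post-compaction coherence can start as something blunt: did the pinned facts survive the summary? A minimal sketch, assuming you can capture the summary text after a compaction event and that you keep pinned constraints as plain strings. The sample paths and commands are hypothetical, and substring matching is a crude stand-in for a real coherence eval.

```python
def missing_after_compaction(summary: str, pinned: list[str]) -> list[str]:
    """Return pinned items that no longer appear in the compacted summary."""
    haystack = " ".join(summary.lower().split())  # normalize case and whitespace
    return [p for p in pinned if " ".join(p.lower().split()) not in haystack]


pinned = [
    "pytest tests/payments -q",  # test command (hypothetical)
    "src/payments/ledger.py",    # modified file (hypothetical)
    "never edit generated/",     # immutable constraint (hypothetical)
]
summary = """
Refactored the ledger. Modified src/payments/ledger.py.
Verify with: pytest tests/payments -q
"""
lost = missing_after_compaction(summary, pinned)
print(lost)  # ['never edit generated/']
```

Anything in `lost` is what the post-compaction step should inject back, and the count per compaction event is a usable metric.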

Humans spend 20% of time cleaning up AI slop

Symptoms: Manual Friday cleanup sessions, inconsistent code style, repeated fixes.

Root cause: Golden principles are under-specified.

Fix (OpenAI): Define explicit principles (5-10 rules). Build daily drift scanner. Auto-open fix-up PRs. Track cleanup time as a metric (goal: zero).

Fix (Claude): Encode principles in .claude/hooks/PostToolUse.sh as deterministic enforcement. Use exit code 2 to block violations before they land. Track hook trigger rate and human override rate.

Universal: The goal is trending to zero manual cleanup. If it’s not declining, your principles aren’t specific enough.
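Here is a minimal sketch of the blocking side, written in Python rather than shell for readability. It assumes the hook receives a JSON payload on stdin with a `tool_input` object carrying `file_path` and `content`, and that exit code 2 signals a block, as described above. The rules themselves are made-up examples of golden principles, and the payload field names should be checked against your Claude Code version.

```python
import json
import re
import sys

# Hypothetical golden principles expressed as (pattern, message) pairs.
RULES = [
    (re.compile(r"\bprint\("), "use the structured logger instead of print()"),
    (re.compile(r"\bexcept\s*:"), "no bare except; catch a specific exception"),
]


def violations(payload: dict) -> list[str]:
    """Check the proposed content of a write-style tool call against RULES."""
    tool_input = payload.get("tool_input", {})
    path = tool_input.get("file_path", "")
    content = tool_input.get("content", "")
    if not path.endswith(".py"):
        return []  # this sketch only polices Python files
    return [msg for pattern, msg in RULES if pattern.search(content)]


def run_hook(raw_stdin: str) -> int:
    """Return the process exit code: 0 allows the tool call, 2 blocks it."""
    found = violations(json.loads(raw_stdin))
    for msg in found:
        print(f"blocked: {msg}", file=sys.stderr)  # feedback goes to stderr
    return 2 if found else 0


# Entry point when installed as a hook script:
#   sys.exit(run_hook(sys.stdin.read()))
```

Counting how often `run_hook` returns 2 gives you the hook trigger rate. A rule that fires constantly is a sign the principle belongs in the instruction file or a skill as well.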

Skill routing accuracy drops after adding new skills

Symptoms: Agent invokes wrong skill for task, skips relevant skills, lower eval scores.

Root cause: Model can’t disambiguate between similar skills.

Fix (OpenAI + Claude): Add explicit negative examples (“Don’t call this skill when…”) to each skill’s description. Include edge case coverage. Add “Use when vs. don’t use when” blocks in frontmatter.

Evidence from production: Glean recovered a 20% accuracy drop with this pattern alone.

Universal: Skill routing is a leading indicator. Monitor invocation accuracy (target >= 90%). When you add a new skill, immediately add negative examples to existing skills.
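One way to make the negative-example rule stick is to lint for it. A minimal sketch, assuming each skill file carries a line starting with "Use when" and another starting with "Don't use when". Those marker strings and the sample skill are assumptions for illustration, not part of any skills specification.

```python
import re

REQUIRED_MARKERS = {
    "use_when": re.compile(r"^\s*use when\b", re.IGNORECASE | re.MULTILINE),
    "dont_use_when": re.compile(r"^\s*don['’]t use when\b", re.IGNORECASE | re.MULTILINE),
}


def lint_skill(text: str) -> list[str]:
    """Return the names of routing blocks missing from one skill file."""
    return [name for name, pattern in REQUIRED_MARKERS.items() if not pattern.search(text)]


# Hypothetical skill file that forgot its negative examples.
skill = """\
name: db-migration
description: Generate and apply schema migrations.
Use when: the task adds, drops, or alters a table or column.
"""
print(lint_skill(skill))  # ['dont_use_when']
```

Running this over every skill file whenever a new skill is added catches the case where existing skills never got their negative examples updated.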

Agent can't reproduce bug or validate fix

Symptoms: Agent reports “can’t verify” or “unable to test,” relies on human to run app.

Root cause: Application behavior isn’t legible to the agent.

Fix (OpenAI): Make app bootable per worktree. Wire Chrome DevTools Protocol into agent runtime. Expose logs and metrics via ephemeral observability stack (LogQL/PromQL access).

Fix (Claude): Build MCP servers for app interaction: browser automation (Playwright/Puppeteer), API testing (HTTP client), log access (tail -f via MCP). Make app bootable per worktree. Use Claude artifacts to store test results and evidence.

Universal: The app must be drivable by the agent. If the agent can’t boot it, query it, and observe its behavior, it can’t validate fixes.
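"Bootable per worktree" mostly comes down to two things: each worktree gets its own port, and the agent has a way to know when the app is up. A minimal sketch under those assumptions follows. The port range, the function names, and the idea of hashing the worktree path are illustrative choices, not something from the OpenAI write-up.

```python
import hashlib
import time
from typing import Callable


def port_for_worktree(worktree_path: str, base: int = 20000, span: int = 10000) -> int:
    """Derive a stable port from the worktree path so parallel worktrees rarely collide."""
    digest = hashlib.sha256(worktree_path.encode("utf-8")).digest()
    return base + int.from_bytes(digest[:4], "big") % span


def wait_until_ready(
    probe: Callable[[], bool], timeout_s: float = 30.0, interval_s: float = 0.5
) -> bool:
    """Poll probe() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

In practice `probe` would be something like an HTTP GET against a health endpoint on `port_for_worktree(path)`. Hash collisions are possible, so a real setup would still need to handle a port that is already in use.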

Tool-agnostic adaptation: OpenAI, Claude, Gemini, Grok

The patterns in this article work regardless of model.

flowchart TD
    subgraph harness ["Your harness (model-agnostic)"]
        I["Instruction contract<br/>AGENTS.md or CLAUDE.md"]
        SK["Skill contract<br/>versioned procedures"]
        EX["Execution contract<br/>shell or MCP"]
        V["Validation contract<br/>evals + policy checks"]
        CX["Context contract<br/>compaction + pinned state"]
    end
    I --> SK --> EX --> V --> CX
    harness --> CO["OpenAI Codex"]
    harness --> CL["Claude Code"]
    harness --> GE["Gemini Code Assist"]
    harness --> GR["Grok"]
    style harness fill:#e8f4fd,stroke:#0d6efd

Build the harness once, swap the model underneath. Keep this normalized interface:

Contract	OpenAI implementation	Claude implementation	Universal pattern
Instruction	AGENTS.md (~100 lines) + docs/	CLAUDE.md (50-120 lines) + .claude/	Short index file + deep-linked docs
Skill	SKILL.md with frontmatter	.claude/skills/*.md with frontmatter	Versioned workflow packs with routing logic
Execution	Shell tool (hosted or local)	MCP servers (local by default)	Sandboxed execution with audit trail
Validation	Custom linters + structural tests	Hooks (Pre/PostToolUse)	Mechanical enforcement at lifecycle events
Context	Server-side compaction + previous_response_id	Auto compaction + AfterCompaction hooks + artifacts	Automatic summarization with critical context pinning
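If you want the normalized interface to live in code rather than in a table, it can be as small as one record per platform. A minimal sketch follows; the `HarnessAdapter` type and its field names are hypothetical, and the values are copied from the table rather than read from either tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessAdapter:
    """Where each contract lives for one platform. Values mirror the table above."""
    instruction_file: str
    skills_location: str
    execution: str
    validation: str
    context: str


ADAPTERS = {
    "codex": HarnessAdapter(
        instruction_file="AGENTS.md",
        skills_location="skills/",
        execution="shell tool",
        validation="custom linters + structural tests",
        context="server-side compaction + previous_response_id",
    ),
    "claude": HarnessAdapter(
        instruction_file="CLAUDE.md",
        skills_location=".claude/skills/",
        execution="MCP servers",
        validation="hooks (Pre/PostToolUse)",
        context="auto compaction + AfterCompaction hooks + artifacts",
    ),
}


def adapter_for(platform: str) -> HarnessAdapter:
    """Look up a platform's adapter; unknown platforms fail loudly."""
    try:
        return ADAPTERS[platform]
    except KeyError:
        raise ValueError(
            f"no adapter for {platform!r}; add one instead of forking the harness"
        ) from None
```

Scripts that scaffold or audit the harness can then ask `adapter_for("claude").instruction_file` instead of hard-coding paths, which keeps a later platform switch to a one-record change.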

Migration paths

Codex → Claude Code:

Rename AGENTS.md → CLAUDE.md (trim to 50-120 lines)
Convert custom linters → .claude/hooks/PostToolUse.sh
Convert shell workflows → MCP servers
Keep docs/ structure unchanged (works for both)
Keep skills/ intact (same format)

Claude Code → Codex:

Merge CLAUDE.md + .claude/ context → AGENTS.md (~100 lines)
Convert .claude/hooks/ → CI-based enforcement
Convert MCP servers → shell tool scripts
Keep docs/ structure unchanged
Keep .claude/skills/ as skills/ (same format)

Universal patterns that port directly:

Architecture maps (docs/architecture/)
ADRs (docs/decisions/)
Evidence docs (docs/evidence/)
Golden principles (encoded as enforcement, not prose)
Skill routing patterns (negative examples, LOV, templates)

The goal is convergence. As standards like AGENTS.md, MCP, and Skills mature through ecosystem collaboration, these patterns become more portable. For now, adapt to your platform’s primitives but preserve the underlying principles.

Platform context: what changed in 2025

Background context

This section covers the platform shifts that make harness engineering possible. If you’re already familiar with the 2025 model landscape, you can skip ahead.

Note from May 2026

A short addendum since this article was first published in February. The platform picture moved in five directions worth noting, without changing any of the lessons above:

Data-engineering vendors made the harness their product. Google’s BigQuery Data Engineering Agent went GA on 22 April 2026. dbt’s Coalesce 2025 conference shipped dbt Agents (Developer, Discovery, Observability, Analyst) and the dbt Fusion MCP server. Snowflake released Managed MCP Servers. Microsoft Fabric framed itself as an “Agentic Fabric.” The substrate question this article opens with is now a vendor category.
Type-system safety for agent code became a published research direction. Odersky’s group released the OPAW framework (Tracking Capabilities for Safer Agents, March 2026), Scalar 2026 ran How Can We Trust Our Agents? in April, and ACL Anthology 2025’s OMMM workshop published TypePilot: Leveraging the Scala type system for secure LLM-generated code. The harness frame in this article is general; the type-system version of it is now an active academic line.
Production telemetry now exists to back the failure-mode claims. Datadog’s 2026 State of AI Engineering report (covering 1,000+ deployments) measured a 5 percent outright LLM-call failure rate and a 60 percent rate-limit share of agent errors. MIT-led research in April 2026 showed a hallucination-reasoning trade-off: stronger reasoning through RL increases tool-hallucination rates in lockstep with task gains. Numbers we previously inferred are now measured.
The community turned Karpathy’s pitfall observations into an installable CLAUDE.md. His LLM coding pitfalls tweet (January 2026) described agent failure modes; andrej-karpathy-skills distilled them into four principles - think before coding, simplicity first, surgical changes, goal-driven execution - packaged as a single CLAUDE.md. The repo has 143,000+ stars and is in the Claude Code marketplace.
Harness discipline extended to knowledge work. Karpathy’s LLM Knowledge Bases post (April 2026) described agents compiling and maintaining structured knowledge rather than code. The raw/wiki/operations pattern maps directly onto the same three lessons.

The full citations are at the bottom of this article under 2026 platform developments, Adjacent work, and Karpathy: practitioner patterns.

Reasoning: OpenAI and Anthropic approaches

OpenAI: Reasoning models evolved from o1/o3 research into production capabilities. By end of 2025, reasoning converged with general-purpose models in the GPT-5.2 family, with GPT-5.2 Pro for deeper reasoning workloads. “Think harder vs. respond faster” became a tunable developer decision - use Pro models for complex multi-step work, use standard models for fast iteration.

Anthropic: Claude 3 Opus and Claude 3.5 Sonnet prioritize extended reasoning through larger context windows (200K tokens) and extended thinking capabilities. Claude Desktop introduced Thinking Mode (early 2026) - explicitly showing reasoning steps before answering. Control via system prompts: “Think step-by-step” or “Show your reasoning.”

Convergence: Both ecosystems now treat reasoning depth as a tunable parameter rather than a model-specific feature.

Multimodality: OpenAI and Anthropic

OpenAI: PDFs and documents directly in the API (including PDF-by-URL without upload). Whisper for speech-to-text, TTS models for controllable text-to-speech. GPT Image 1.5 for higher-fidelity image generation and editing. Sora for video generation with temporal coherence. gpt-realtime for low-latency voice conversations. Image generation integrated into multi-turn conversations via tool calls.

Anthropic: Claude 3.5 Sonnet supports vision (images, PDFs, screenshots), with industry-leading OCR and chart comprehension. Claude Desktop supports drag-and-drop for images and PDFs. Artifacts support rich media up to 20MB including images, interactive charts, and structured documents. Audio and video capabilities announced for 2026.

Key difference: OpenAI’s multimodality is output-focused (generation via GPT Image 1.5, Sora). Claude’s is input-focused (comprehension via vision, OCR). For generation, use GPT Image/Sora via API with Claude handling orchestration and reasoning.

Agent-native APIs: OpenAI and Anthropic

OpenAI: The Responses API supports multiple inputs/outputs including different modalities, reasoning controls, and tool calling during reasoning. Open-source Agents SDK for Python and TypeScript - provider-agnostic, with documented paths for non-OpenAI models. AgentKit added Agent Builder, ChatKit, Connector Registry, and evaluation loops. Conversation state + Conversations API for durable threads. Connectors and MCP servers for external context. The Apps SDK extends MCP to let developers build UIs alongside MCP servers.

Anthropic: Claude Messages API supports multi-turn conversations with tool use, vision, and system prompts. Claude Agent SDK provides higher-level abstractions for agentic workflows. MCP (Model Context Protocol) is an open standard for AI-tool integrations - 50+ community servers available. Claude Desktop provides native GUI for agent workflows. Prompt caching reduces costs for repetitive prefixes by up to 90%.

Convergence: Both adopted MCP as the standard integration protocol. Both provide multi-turn conversation APIs with tool use. OpenAI’s Agents SDK is provider-agnostic and works with Claude.

Coding assistants: Codex and Claude Code

OpenAI Codex: The open-source CLI brought agent-style coding into local environments. AGENTS.md support, MCP integration, sandboxing, and approval modes made it production-ready. Codex can be orchestrated via the Agents SDK by running the CLI as an MCP server. Codex Autofix in CI tightened the loop.

Claude Code: Production-ready AI coding assistant that understands entire codebases. Key features: CLAUDE.md auto-loading, hooks at 15 lifecycle events, specialized subagents, MCP server integration, automatic compaction for long sessions. Both CLI (terminal-first) and Desktop (GUI) modes available.

Key difference: Codex focuses on hosted/cloud-first execution with local fallback. Claude Code focuses on local-first execution with persistent project awareness. Both support MCP, both integrate with CI/CD, both provide approval modes.

Anthropic: Claude Code ecosystem matured

Claude Code launched as a production-ready AI coding assistant that understands entire codebases. Key milestones in 2025 and early 2026:

CLAUDE.md specification: Automatic loading of repo-specific instructions at session start
MCP standardization: Model Context Protocol became the standard for AI-tool integrations, with a growing community ecosystem
Hooks launched: 15 lifecycle events for deterministic automation (early 2026)
Subagents: Specialized agents for security, testing, review
Artifacts expanded: Up to 20MB storage per artifact, enabling stateful apps
Desktop + CLI convergence: Both modes share skills, MCP servers, and configuration
Expanded use cases: Claude Code for general knowledge work beyond software development

Eight trends defining how software gets built in 2026: agent orchestration replaced human-authored logic as the primary mode.

Both ecosystems: production concerns became system design

Prompt caching for latency and input costs on shared prefixes. Background mode for long-running responses without holding connections open. Webhooks for event-driven systems. Rate limits matured. Building agents is now as much about system design as prompting.

OpenAI: Responses API, background mode, /responses/compact endpoint. Claude: Automatic compaction, artifacts for state, MCP for integration.

Both ecosystems: open standards converged

OpenAI: Pushed AGENTS.md spec and participated in the Agentic AI Foundation (AAIF) alongside MCP and Skills standards. Released open-weight models: gpt-oss 120b & 20b for self-hosting, plus gpt-oss-safeguard 120b & 20b safety models. Apps SDK extends MCP for UI development.

Claude: CLAUDE.md + AGENTS.md compatibility, MCP as primary integration protocol, Skills aligned with Agent Skills standard. Can use gpt-oss models via API or local deployment.

Shared infrastructure: Open standards (AGENTS.md, MCP, Skills via AAIF), evals frameworks, reinforcement fine-tuning (RFT), supervised fine-tuning, distillation patterns for pushing quality into smaller models.

Recommended models by task (end of 2025):

Model naming context

This section uses OpenAI’s model naming from their end-of-2025 platform update. For Claude equivalents, use current production model names as of early 2026.

Task	OpenAI	Anthropic	Notes
General-purpose (text + multimodal)	GPT-5.2	Claude 3.5 Sonnet	For chat, long-context work, and multimodal inputs
Deep reasoning / reliability-sensitive	GPT-5.2 Pro	Claude 3 Opus	Planning and tasks where quality is worth extra compute
Coding and software engineering	GPT-5.2-Codex	Claude Code (Sonnet 3.5)	Code generation, review, repo-scale reasoning
Image generation and editing	GPT Image 1.5	N/A	Higher-fidelity generation and iterative edits
Realtime voice	gpt-realtime	N/A	Low-latency speech-to-speech and live voice agents

Key insights:

OpenAI strengths: Native multimodal generation (GPT Image 1.5, video via Sora), realtime voice (gpt-realtime), reasoning capabilities (GPT-5.2 Pro)
Claude strengths: Longer context windows (200K), superior code understanding, better at following complex instructions, MCP-first integration
For harness engineering: Both ecosystems work. Choose based on your existing infrastructure, cost constraints, and specific task requirements. The patterns in this article apply to both.

Where this frame goes next

This article is deliberately tool-agnostic and domain-agnostic. The same frame applies with extra force in domains where the substrate is itself a system the agent has to reason about. The next piece in this series applies the harness model specifically to data engineering: the substrate becomes algebraic (typestate builders that refuse to compile half-configured pipelines, schema-policy lattices, retry-strategy monoids), and a Scalafix layer mechanically bans what the algebras cannot express. The architectural decision moves out of a document and into CI.

If your domain happens to be data, the in-progress companion article at /posts/2026/05/ai-orchestrated-de/ and the upcoming Scala Days 2026 talks pick up the thread.

References

Adjacent Scala-language and academic work

Scalar 2026: How Can We Trust Our Agents? (Odersky framing): https://www.slideshare.net/slideshow/how-can-we-trust-our-agents-talk-given-at-scala3-2026/286692416
OPAW: Tracking Capabilities for Safer Agents (March 2026): https://blog.georgovassilis.com/2026/03/13/opaw-tracking-capabilities-for-safer-agents/
TypePilot: Leveraging the Scala type system for secure LLM-generated code (ACL Anthology 2025, OMMM workshop): https://aclanthology.org/2025.ommm-1.11/
Hivemind Technologies: Prompting Safely (Lambda World 2025): https://github.com/HivemindTechnologies/scala-llms-dsl_final
llm4s: Agentic and LLM Programming in Scala (under the Scala Center, GSoC 2025/2026): https://github.com/llm4s/llm4s

Production telemetry and failure-mode evidence

Why AI Agents Break: A Field Analysis (Arize): https://arize.com/blog/common-ai-agent-failures/
AI Agents for Data Engineering: 2026 reliability guide (Atlan): https://atlan.com/know/ai-agents-for-data-engineering/
Detecting AI Agent Failure Modes in Production (Latitude): https://latitude.so/blog/ai-agent-failure-detection-guide
Data Engineering Open Forum 2025 (recordings playlist): https://www.youtube.com/playlist?list=PLfXiENmg6yyXKICQiUNutmDyJKk84BVSP

Karpathy: practitioner patterns

Andrej Karpathy - LLM coding pitfalls (January 2026): https://x.com/karpathy/status/2015883857489522876
andrej-karpathy-skills - four-principle CLAUDE.md, also installable as a Claude Code plugin (143K+ stars): https://github.com/multica-ai/andrej-karpathy-skills
Andrej Karpathy - LLM Knowledge Bases (April 2026): https://x.com/karpathy/status/2039805659525644595
Karpathy LLM wiki gist (raw/wiki/operations pattern): https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f


If you’re staff or principal today, your highest-impact work isn’t writing excellent code anymore.

It’s in building a harness where excellent code becomes the default output of the system.

OpenAI shipped 1 million lines in 5 months with agents generating every line. They did it by investing in environment design, not implementation speed.

You can build the same harness with Claude Code. The patterns are identical. The tools differ. The outcomes are the same.

Read the companion: Build a code review operating system

You can start building your harness today. You’ve got the map for both ecosystems. Start with your instruction file (AGENTS.md or CLAUDE.md). The rest follows.

Vitthal Mirji

Staff Data Engineer @ Walmart

Mumbai, India

Staff Data Engineer & Architect from Mumbai, India. Sharing insights on Data Engineering, Functional programming, Scala, Open source, and life.


