Build the harness, not the code: a staff/principal engineer's guide to AI-agent systems
A team at OpenAI shipped a production product with zero lines of manually written code. Empty git repo in August 2025. One million lines five months later. Three engineers initially, averaging 3.5 PRs per person per day. And throughput increased as the team grew to seven.
That’s not a research demo. It’s a product with hundreds of daily internal users and external alpha testers.
The constraint was intentional: humans steer, agents execute. Every line - application logic, tests, CI config, docs, internal tooling - was written by Codex. They estimate they built this in about 1/10th the time it would have taken to write the code by hand.
So what did the humans actually do?
They built the harness.
And you can build the same harness with Claude Code.
- 3 core harness lessons from OpenAI’s internal story, with parallel Claude Code implementation patterns
- 10 operational tips for skills, shell execution, and context compaction (OpenAI + Claude paths)
- 15 hard-won lessons from building ChatGPT Apps, adapted for Claude Desktop and MCP servers
- Architecture enforcement patterns with mechanical linting via hooks (OpenAI + Claude)
- 11-step autonomy ladder showing the path from bug reproduction to autonomous PR merge
- Quarterly implementation blueprint for both Codex and Claude Code
- Metrics, failure scenarios, and tool-agnostic adaptation proven across ecosystems
This guide shows you both paths. Whether you’re using OpenAI Codex, Claude Code, or planning to switch between them, you’ll learn the underlying patterns that work everywhere.
This is a companion to Build a code review operating system. This article builds the harness. That one closes the feedback loop through review.
What “harness engineering” actually means
You know how it goes - you’re a staff engineer, you spend your time reviewing code, fixing architectural drift, making sure junior devs follow conventions. Now imagine that “junior dev” is an AI agent that can write code 10x faster than any human but has no institutional memory, no taste, and no understanding of why you chose PostgreSQL over DynamoDB.
The harness is everything you build around the agent to make its output reliable. Instructions, tools, checks, context, feedback loops. It’s the environment design that turns “agent can write code” into “agent can ship production features.”
OpenAI learned this the hard way. Early progress was slower than expected - not because Codex was incapable, but because the environment was underspecified. The agent lacked tools, abstractions, and internal structure to make progress toward high-level goals. The lack of hands-on human coding introduced a different kind of engineering work, focused on systems, scaffolding, and compounding returns.
In practice, this meant working depth-first: breaking down larger goals into smaller building blocks (design, code, review, test), prompting the agent to construct those blocks, and using them to unlock more complex tasks. When something failed, the fix was almost never “try harder.” It was always: what capability is missing, and how do I make it both legible and enforceable for the agent?
That reframe is not philosophical. It’s operational. And it’s exactly what this article teaches you to do.
I’ve been applying these patterns in my own project - flowforge, a compile-time data contracts framework. Throughout this article, I’ll occasionally show how I applied a pattern in practice. But the teaching comes from four critical sources:
- Harness engineering - OpenAI’s internal story of shipping 1M lines with zero manual code
- Skills + shell + compaction - operational primitives for long-running agents (OpenAI and Claude)
- 15 lessons building ChatGPT Apps - hard-won lessons adapted for Claude Desktop and MCP
- Platform shifts in 2025 - what changed across OpenAI and Anthropic ecosystems
The mental model
flowchart LR
A[Business intent] --> B[Harness design]
B --> C[Agent execution]
C --> D[Validation and evals]
D --> E[Auto-remediation and learning]
E --> B
You design the map (instructions, boundaries, tools), not a giant manual. You encode quality as checks, not tribal review comments. You treat context artifacts as infrastructure.
Here’s the heuristic I trust most: anything that helps humans ship better software faster usually helps agents do the same.
Skills make this obvious. If you’ve worked with internal engineering wikis, you’ve already seen this movie: distilled playbooks for recurring situations, easy to find, scoped to a concrete job. That pattern worked for humans for years. Agent skills are the same pattern, now executable.
The same logic carries over to compilers and type systems, dependency injection, errors-as-values, and effect systems. These abstractions didn’t survive because they’re fashionable. They survived because they improve reliability and change velocity in real codebases. Agents benefit from those constraints too.
I still keep seeing the “language won’t matter, agents will just emit assembly” take. I don’t buy it. If a workflow is hard for humans to reason about, review, and maintain, it’ll usually break down for agents at scale too. Honestly, I’m tired of the “AI is alien magic” framing. It isn’t. AI doesn’t erase engineering fundamentals; it magnifies them. Strong systems get stronger. Fragile systems fail faster.
This pattern is tool-agnostic. Whether you’re building with Codex, Claude Code, Gemini, or Grok, the loop stays the same.
Lesson 1: give agents a map, not a manual
Both OpenAI and Anthropic teams learned the same lesson: the “one big instruction file” approach fails predictably.
- Context is a scarce resource. A giant instruction file crowds out the task, the code, and the relevant docs - so the agent either misses key constraints or starts optimizing for the wrong ones.
- Too much guidance becomes non-guidance. When everything is “important,” nothing is. Agents end up pattern-matching locally instead of navigating intentionally.
- It rots instantly. A monolithic manual turns into a graveyard of stale rules. Agents can’t tell what’s still true, humans stop maintaining it, and the file quietly becomes an attractive nuisance.
- It’s hard to verify. A single blob doesn’t lend itself to mechanical checks (coverage, freshness, ownership, cross-links), so drift is inevitable.
What should actually go into AGENTS.md
There is now benchmark evidence that this is not just a style preference. In an analysis highlighted by Addy Osmani
(covering SWE-bench style agent runs), adding an auto-generated AGENTS.md summary reduced task success by roughly
2-3 percentage points while increasing token cost by about 20-30%.
The failure mode is simple: the agent can already inspect your tree, infer your stack, and read module docs on demand. If you preload that same information as prose, you create context anchoring noise. The model keeps looking at the pink elephant in the room instead of the concrete diff it should make.
Use a strict filter for what earns a line in AGENTS.md:
- Undiscoverable operational constraints: setup and tooling gotchas that are not visible from source code alone
- Operational landmines: risky areas where “cleanup” can break production behavior
- Non-obvious conventions: intentional local patterns that look wrong generically but are right for this codebase
Everything else should be linked, not duplicated. If an agent can discover it by reading the repository, cut it from the instruction file.
Treat AGENTS.md as a temporary smell tracker, not a permanent knowledge dump. If agents repeatedly misuse a
dependency, write to the wrong folder, or keep violating architecture boundaries, the real fix is usually structural:
rename modules, improve folder semantics, add linters/hooks, strengthen tests. Then remove the compensating prose.
The solution: index file as table of contents
OpenAI’s approach: AGENTS.md
Instead of treating AGENTS.md as the encyclopedia, OpenAI treats it as the table of contents. The repository’s
knowledge base lives in a structured docs/ directory. A short AGENTS.md (roughly 100 lines) is injected into context
and serves primarily as a map, with pointers to deeper sources of truth.
OpenAI's actual repository structure
Plans are first-class artifacts. Active plans, completed plans, and known technical debt are all versioned and co-located, allowing agents to operate without relying on external context.
Claude Code’s approach: CLAUDE.md + .claude/ directory
Claude Code uses CLAUDE.md as the agent’s “constitution” - its primary source of truth for how your specific repository works. CLAUDE.md is automatically read at the start of each session and holds project-specific instructions you’d otherwise repeat in every prompt.
The filename is case-sensitive and must be exactly CLAUDE.md (uppercase CLAUDE, lowercase .md).
Claude Code's repository structure
Key differences from OpenAI’s approach:
- .claude/agents/ holds specialized subagents that run in isolated contexts
- .claude/hooks/ enforces quality mechanically at 15 lifecycle events
- CLAUDE.md is smaller (50-120 lines vs 100+) because hooks handle enforcement
- MCP servers provide external tool access via standardized protocol
Progressive disclosure in both ecosystems
flowchart LR
subgraph openai ["OpenAI: AGENTS.md"]
A1["AGENTS.md ~100 lines"] -->|points to| A2["docs/design-docs/"]
A1 -->|points to| A3["docs/exec-plans/"]
A2 -->|deep dive| A4["core-beliefs.md"]
end
subgraph claude ["Claude: CLAUDE.md + .claude/"]
C1["CLAUDE.md 50-120 lines"] -->|delegates to| C2[".claude/agents/"]
C1 -->|links to| C3["docs/architecture/"]
C2 -->|runs| C4["reviewer.md security.md"]
end
style openai fill: #d4edda, stroke: #28a745
style claude fill: #e8f4fd, stroke: #0d6efd
The agent reads the index file first (AGENTS.md or CLAUDE.md). From there, it follows links to whichever deep doc is relevant to the current task. It never loads the whole tree at once.
Mechanical enforcement
OpenAI: Dedicated linters and CI jobs validate that the knowledge base is up to date, cross-linked, and structured correctly. A recurring “doc-gardening” agent scans for stale or obsolete documentation and opens fix-up pull requests.
Claude Code: Hooks provide deterministic enforcement. PreToolUse hooks block dangerous operations before they run. PostToolUse hooks enforce formatting immediately after code changes. AfterCompaction hooks inject critical context back after summarization.
The key distinction from Claude Code best practices: If it’s a suggestion, use CLAUDE.md. If it’s a requirement, use hooks.
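To make "mechanical enforcement" concrete, here is a minimal sketch of an index-file check that could run in CI. It is not OpenAI's linter; the 120-line budget, the function name `check_index`, and the message wording are my assumptions. It only checks two things: that the index stays short, and that every relative markdown link in it resolves to a real file.

```python
import re
from pathlib import Path

# Assumed budget, borrowed from the "under 120 lines" guidance in this article.
MAX_INDEX_LINES = 120

# Matches [text](target) and [text](target#anchor); captures the target only.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#\s]+)(?:#[^)]*)?\)")


def check_index(index_path: Path, max_lines: int = MAX_INDEX_LINES) -> list[str]:
    """Return human-readable problems; an empty list means the index passes."""
    text = index_path.read_text(encoding="utf-8")
    problems: list[str] = []

    line_count = len(text.splitlines())
    if line_count > max_lines:
        problems.append(
            f"{index_path.name} has {line_count} lines (limit {max_lines}); "
            "move detail into docs/ and link to it."
        )

    for target in LINK_RE.findall(text):
        if "://" in target or target.startswith("mailto:"):
            continue  # external links are out of scope for this check
        if not (index_path.parent / target).exists():
            problems.append(
                f"{index_path.name} links to missing file {target}; "
                "fix the path or remove the link."
            )
    return problems
```

A CI job would call `check_index(Path("AGENTS.md"))` (or `CLAUDE.md`) and fail the build when the list is non-empty. The messages are phrased as instructions so an agent reading the CI log knows what to do next.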
For OpenAI Codex:
- Start with AGENTS.md as an index (under 120 lines)
- Split docs by concern into docs/ subdirectories
- Add ADRs for decisions with dated, status-marked records
- Create evidence docs with file paths and quantified claims
- Enforce mechanically with CI jobs for freshness and cross-links
- Run a doc-gardening agent on a schedule
For Claude Code:
- Create CLAUDE.md in repo root (50-120 lines, project-specific instructions only)
- Structure .claude/agents/ for specialized subagents (review, security, test-writing)
- Define .claude/hooks/ for enforcement (PreToolUse blocks, PostToolUse formats)
- Link from CLAUDE.md to docs/architecture/ for deep dives
- Use MCP servers for external tool access
- Let automatic compaction manage context with AfterCompaction hooks to pin critical state
Universal patterns (both ecosystems):
- Keep the index file minimal (suggestions only)
- Use deep-linked docs for detailed context
- Enforce quality mechanically (linters/hooks, not prose)
- Version all decisions and plans in the repo
In flowforge, I use both patterns: AGENTS.md for Codex compatibility with links to docs/adr/INDEX.md (25+ decisions), and .claude/hooks/PostToolUse.sh that runs scalafix after every code change, ensuring golden principles are enforced regardless of which agent I’m using.
Lesson 2: automate cleanup, or drown in it
Full agent autonomy introduces a specific failure mode. Agents replicate patterns that already exist in the repository - even uneven or suboptimal ones. Over time, this inevitably leads to drift.
Golden principles: encoding taste as enforcement
Both ecosystems learned to encode “golden principles” - opinionated, mechanical rules that keep the codebase legible and consistent for future agent runs.
OpenAI’s production examples:
- Prefer shared utility packages over hand-rolled helpers to keep invariants centralized.
- Parse data shapes at the boundary. Don’t probe data “YOLO-style” - validate boundaries or rely on typed SDKs. Follow the “parse, don’t validate” principle.
- Statically enforce structured logging, naming conventions for schemas and types, file size limits, and platform-specific reliability requirements with custom lints.
- Custom lint error messages inject remediation instructions into agent context. When a lint fails, the error message tells the agent how to fix it.
On a regular cadence, background Codex tasks scan for deviations, update quality grades, and open targeted refactoring PRs. Most can be reviewed in under a minute and automerged.
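A prescriptive lint is easier to see than to describe. The sketch below is illustrative only: the two rules, the 400-line limit, and the `log.info(...)` suggestion are hypothetical stand-ins, not OpenAI's actual lints. The point it shows is that each violation carries its own remediation text, so the error message doubles as an instruction for the agent.

```python
import re
from dataclasses import dataclass

MAX_FILE_LINES = 400  # hypothetical limit, chosen for illustration
BARE_PRINT = re.compile(r"\s*print\(")


@dataclass(frozen=True)
class Violation:
    path: str
    line: int  # 0 means the violation applies to the whole file
    rule: str
    remediation: str

    def render(self) -> str:
        return f"{self.path}:{self.line}: [{self.rule}] {self.remediation}"


def lint_source(path: str, source: str, max_lines: int = MAX_FILE_LINES) -> list[Violation]:
    """Check one file's text and return violations with fix instructions."""
    violations: list[Violation] = []
    lines = source.splitlines()

    if len(lines) > max_lines:
        violations.append(Violation(
            path, 0, "file-size",
            f"File has {len(lines)} lines (limit {max_lines}). Split it by "
            "responsibility and move shared helpers into a utility module.",
        ))

    for number, line in enumerate(lines, start=1):
        if BARE_PRINT.match(line):
            violations.append(Violation(
                path, number, "structured-logging",
                "Replace print(...) with the shared structured logger, "
                'for example log.info("event_name", key=value).',
            ))
    return violations
```

Printing `violation.render()` for each result gives the agent a file, a line, a rule name, and the fix in a single line of output.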
Community validation: Karpathy’s four principles
When Karpathy published his LLM coding pitfalls in January 2026, the community responded with andrej-karpathy-skills - a single CLAUDE.md encoding four principles, now over 143,000 stars. It’s installable from the Claude Code marketplace. The four principles map directly onto what golden-principle engineering requires:
- Think before coding. Ask clarifying questions. Surface assumptions. Don’t let the model run with the wrong premise.
- Simplicity first. The model’s default instinct is overcomplication. Golden principles push back against that mechanically.
- Surgical changes. Minimal diffs. Localized edits. “Don’t touch what isn’t broken” encoded as a rule.
- Goal-driven execution. Give the agent success criteria, not a prescription. Let it find the path.
The repo’s popularity isn’t about the principles being novel. It’s about them being installable. That’s the harness insight: good engineering taste is worthless unless it’s enforced.
Claude Code’s approach: hooks for deterministic enforcement
Claude Code uses hooks at 15 lifecycle events to enforce golden principles. The critical insight from production engineering teams:
CLAUDE.md rules are suggestions. Hooks are enforcement. CLAUDE.md saying “don’t edit .env” → parsed by LLM → weighed against other context → maybe followed. PreToolUse hook blocking .env edits → always runs → returns exit code 2 → operation blocked.
Example .claude/hooks/PostToolUse.sh enforcing golden principles:
Additional Claude Code patterns from production repos:
- PreToolUse hooks prevent dangerous operations (deleting production DBs, exposing secrets)
- AfterCompaction hooks inject critical context back (active plans, golden principles)
- SessionEnd hooks generate summaries and cleanup artifacts
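The ".env" blocking case described above can be sketched as a small hook script. This is a hedged illustration, not an official template: it assumes the hook receives a JSON event on stdin with `tool_name` and `tool_input.file_path` fields, and that exit code 2 blocks the operation as stated earlier. Check both against the current Claude Code hooks documentation. The protected-file set and the message text are my own.

```python
import json
import sys
from pathlib import PurePosixPath

PROTECTED_NAMES = {".env"}  # hypothetical; extend for your own repo
EDIT_TOOLS = {"Edit", "Write", "MultiEdit"}  # assumed names of file-editing tools


def decide(event: dict) -> tuple[int, str]:
    """Return (exit_code, message). Exit code 2 means 'block this tool call'."""
    if event.get("tool_name") not in EDIT_TOOLS:
        return 0, ""
    file_path = event.get("tool_input", {}).get("file_path", "")
    if PurePosixPath(file_path).name in PROTECTED_NAMES:
        return 2, (
            f"Blocked by hook: {file_path} is protected. "
            "Do not edit secrets files; ask a human to change them."
        )
    return 0, ""


def run(raw_event: str) -> int:
    """Parse the raw JSON event, report any message on stderr, return the exit code."""
    code, message = decide(json.loads(raw_event))
    if message:
        print(message, file=sys.stderr)
    return code

# When installed as a hook, the entry point would be:
#     sys.exit(run(sys.stdin.read()))
```

Keeping the decision in a pure function (`decide`) means the rule itself can be unit-tested without spawning an agent session.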
Agents favor boring technology
Both OpenAI and Anthropic teams favored dependencies and abstractions that could be fully internalized and reasoned about in-repo. Technologies often described as “boring” tend to be easier for agents to model due to composability, API stability, and representation in the training set.
OpenAI’s example: rather than pulling in a generic p-limit-style package, they implemented their own
map-with-concurrency helper - tightly integrated with their OpenTelemetry instrumentation, with 100% test coverage,
behaving exactly the way their runtime expected.
The feedback loop
flowchart LR
A[Agent output] --> B[Golden principles scanner]
B -->|pass| C[merge]
B -->|fixable| D[auto-fix PR]
B -->|not fixable| E[human escalation]
E --> F[update golden principles]
F --> B
D --> G[quality grade updated]
G --> B
OpenAI: Review comments, refactoring PRs, and user-facing bugs are captured as documentation updates or encoded directly into tooling. When documentation falls short, the rule gets promoted into code.
Claude Code: Failed hook enforcement triggers human escalation. The fix becomes a new hook or updated CLAUDE.md rule. Over time, the harness learns from failures. This is what turns it into a code review operating system, not just a linter.
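The routing in the feedback loop is simple enough to write down. The sketch below is my own reading of the three outcomes (pass, fixable, not fixable); the `Finding` and `Action` names are hypothetical and not taken from either vendor's tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    MERGE = "merge"
    AUTO_FIX_PR = "auto-fix PR"
    ESCALATE = "human escalation"


@dataclass(frozen=True)
class Finding:
    rule: str
    auto_fixable: bool


def triage(findings: list[Finding]) -> Action:
    """Route scanner output: clean merges, fixable gets a PR, the rest escalates."""
    if not findings:
        return Action.MERGE
    if all(finding.auto_fixable for finding in findings):
        return Action.AUTO_FIX_PR
    return Action.ESCALATE


def rules_to_revisit(findings: list[Finding]) -> list[str]:
    """Rules behind escalations; these are candidates for a new lint or hook."""
    return sorted({f.rule for f in findings if not f.auto_fixable})
```

The second function is the learning half of the loop: every escalation names a rule that the harness could not handle mechanically yet.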
For OpenAI Codex:
- Define 5-10 golden principles (mechanical rules your team enforces manually today)
- Encode them as custom linters or structural tests
- Make lint errors prescriptive (tell agents how to fix violations)
- Run drift scans daily, auto-open PRs for low-risk violations
- Track manual cleanup time (goal: trending to zero)
For Claude Code:
- Define golden principles in .claude/hooks/PostToolUse.sh
- Use exit code 2 to block violations, exit 0 after auto-fixes
- Add AfterCompaction hooks to preserve principles across compaction
- Use PreToolUse hooks to prevent dangerous operations
- Track hook trigger rate and human escalations as metrics
Universal patterns:
- Start with 5-10 rules, expand as patterns emerge
- Favor boring, composable technology (agents model it reliably)
- Make error messages prescriptive (not just “what” but “how”)
- Update principles based on real failures, not hypothetical risks
In flowforge, I use .scalafix.conf (banning var, asInstanceOf, wildcard imports) for Codex compatibility, plus .claude/hooks/PostToolUse.sh that runs scalafix automatically. I also maintain compile-fail tests that must fail to compile - proving the core promise that pipelines won’t compile when contracts drift. These work identically across both ecosystems.
Lesson 3: context is infrastructure, not documentation
From OpenAI’s harness engineering post:
From the agent’s point of view, anything it can’t access in-context while running effectively doesn’t exist.
Knowledge that lives in Google Docs, chat threads, or people’s heads is not accessible to the system. Repository-local, versioned artifacts (code, markdown, schemas, executable plans) are all the agent can see.
This is equally true for Claude Code. Claude’s automatic compaction shows the same constraint: if context isn’t in the conversation, repo files, or MCP-accessible sources, it doesn’t exist.
flowchart LR
subgraph visible ["What agents CAN see"]
direction TB
A["Code + tests"]
B["Markdown docs ADRs, specs, plans"]
C["Schemas + configs"]
D["Exec plans with decision logs"]
E["Logs, metrics (via tools/MCP)"]
end
subgraph invisible ["What agents CANNOT see"]
direction TB
F["Slack threads"]
G["Google Docs"]
H["People's heads"]
I["Verbal agreements"]
J["Undocumented conventions"]
end
visible -->|" agent works here "| K["Reliable output"]
invisible -->|" effectively doesn't exist "| L["Missed constraints wrong assumptions"]
style visible fill: #d4edda, stroke: #28a745
style invisible fill: #f8d7da, stroke: #dc3545
If a decision isn’t in the repo, it doesn’t exist for the agent. That Slack discussion that aligned the team on an architectural pattern? The agent is in the same position as a new hire who joins three months later: they’ll never know unless you write it down.
Agent legibility is the goal
OpenAI’s approach: Because the repository is entirely agent-generated, it’s optimized first for Codex’s legibility. In the same way teams aim to improve navigability of their code for new engineering hires, the human engineers’ goal was making it possible for an agent to reason about the full business domain directly from the repository itself.
Claude Code’s equivalent: As software engineering shifts to agent orchestration, the folder and file structure becomes a form of context engineering. The best practices for Claude Code emphasize treating repository structure as the agent’s primary navigation system.
Making the application itself legible to agents
OpenAI’s production setup:
- Per-worktree app instances: app bootable per git worktree, so Codex can launch and drive one instance per change.
- Chrome DevTools Protocol: wired into the agent runtime with skills for working with DOM snapshots, screenshots, and navigation.
- Ephemeral observability per worktree: logs, metrics, and traces exposed via a local observability stack that’s ephemeral for any given worktree.
- Agents can query logs with LogQL and metrics with PromQL. With this context available, prompts like “ensure service startup completes in under 800ms” become tractable.
flowchart TD
A["Agent gets task"] --> B["git worktree created"]
B --> C["App boots in isolated instance"]
C --> D["Agent drives app (CDP for Codex, MCP for Claude)"]
D --> E{"Bug reproduced?"}
E -->|yes| F["Implement fix"]
E -->|no| G["Query logs/metrics (LogQL/PromQL for Codex, MCP server for Claude)"]
G --> D
F --> H["Validate fix by driving app again"]
H --> I["Ephemeral observability torn down"]
style C fill: #d4edda, stroke: #28a745
style G fill: #fff3cd, stroke: #ffc107
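A prompt like "ensure service startup completes in under 800ms" only becomes tractable if the check is mechanical. Here is a minimal sketch of that check. It assumes the agent has already run an instant query against a Prometheus-compatible HTTP API and has the JSON response in hand; the metric name in `STARTUP_QUERY` is hypothetical, and the function name is mine.

```python
# Hypothetical PromQL expression; the metric name is an assumption for illustration.
STARTUP_QUERY = "max(service_startup_duration_seconds)"


def startup_within_budget(prom_response: dict, budget_seconds: float = 0.8) -> bool:
    """Evaluate an instant-query response (vector result) against a time budget."""
    if prom_response.get("status") != "success":
        raise ValueError("metrics query did not succeed")
    samples = prom_response["data"]["result"]
    if not samples:
        raise ValueError("no startup samples found; did the app boot in this worktree?")
    # Each sample's "value" is [unix_timestamp, "value-as-string"].
    worst = max(float(sample["value"][1]) for sample in samples)
    return worst <= budget_seconds
```

Raising on an empty result matters: "no data" must not be mistaken for "fast enough", or the agent will happily declare success on an app that never started.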
Claude Code’s equivalent:
- MCP servers for observability: Claude Code connects to tools via MCP - the Model Context Protocol is an open standard for AI-tool integrations. Instead of baking observability into Codex’s runtime, you expose it via MCP servers.
- Example MCP servers from the Claude Code ecosystem: filesystem access, Git operations, database queries, HTTP APIs, browser automation.
- Per-worktree isolation: The same pattern works with Claude Code. The agent creates a worktree, boots the app, drives tests via MCP-connected tools, validates, tears down.
- Artifacts for state tracking: Claude artifacts (up to 20MB) store structured data across sessions, which is useful for test results, metrics history, and multi-session workflows.
OpenAI regularly sees single Codex runs work on a single task for upwards of six hours - often while the humans are sleeping. With Claude Code’s conversation compacting, long-running sessions stay coherent through automatic summarization at token thresholds.
Technology choices favor agent comprehension
As with the boring-technology preference in Lesson 2, the point is to pull more of the system into a form the agent can inspect, validate, and modify directly. Doing so multiplies your output - not just for Codex, but for other agents (like Aardvark for OpenAI, or Claude Code subagents for Anthropic) working on the codebase.
This is the same mental model in practical form: choices that improve human legibility and maintainability usually improve agent outcomes too. Chasing assembly-level generation over useful abstractions isn’t acceleration; it’s throwing away leverage we already spent years building.
Universal patterns (both ecosystems):
- Push decisions from chat to repo. Every Slack alignment discussion becomes an ADR.
- Create evidence docs, not aspirational docs. Document current state with file paths and line numbers.
- Write unvarnished reviews. Brutal honesty about gaps. Agents work better with truth than marketing.
- Define acceptance criteria as measurable checks. If an agent can’t verify it, it’s not a criterion.
- Make the app bootable per worktree. Each agent task gets an isolated instance.
For OpenAI Codex:
- Wire Chrome DevTools Protocol into Codex runtime
- Expose logs/metrics via ephemeral per-worktree observability
- Enable LogQL/PromQL query access for agents
For Claude Code:
- Build MCP servers for observability access (logs, metrics, traces)
- Use Claude artifacts (up to 20MB) to store test results and metrics history across sessions
- Enable automatic compaction with AfterCompaction hooks to preserve critical context
- Connect Claude Code to your observability stack via MCP (Datadog, Grafana, CloudWatch)
In flowforge, I maintain docs/evidence/unvarnished-review.md - quotes like “Hand-rolled codegen JSON parser… will explode on unions, nested records” - and 25 measurable acceptance criteria for v1.0 in docs/plan/v1.0-readiness.md. These work identically for both Codex and Claude Code because they’re plain markdown in the repo.
Knowledge as infrastructure: the LLM wiki pattern
The three lessons above frame harness engineering around code. In April 2026, Karpathy’s LLM Knowledge Bases post pointed at the same infrastructure problem from a different angle.
His observation: a large fraction of productive agent work isn’t code manipulation. It’s knowledge manipulation - reading source documents, building structured summaries, answering complex questions across a growing corpus. The same harness discipline that makes code agents reliable applies directly here.
The pattern:
flowchart LR
A["raw/\n(immutable sources)"] --> B["LLM compiler"]
B --> C["wiki/\n(agent-maintained .md)"]
C --> D["query / lint / ingest"]
D --> C
style A fill:#fff3cd,stroke:#ffc107
style C fill:#d4edda,stroke:#28a745
style B fill:#e8f4fd,stroke:#0d6efd
Raw: source documents go into raw/ unchanged - articles, papers, repos, datasets. Immutable, like your git history.
Wiki: the LLM compiles sources into structured .md files: summaries, backlinks, concept articles, cross-links. The agent writes and maintains all of it. You rarely touch it directly.
Operations: once the wiki reaches roughly 100 articles, three operations become practical:
- Ingest: "file this new doc to our wiki: (path)". After the early bootstrap phase, the LLM gets the pattern and each addition is incremental.
- Query: ask complex questions across the full corpus. The LLM auto-maintains index files and brief per-document summaries, so you don’t need RAG at this scale.
- Lint: health checks that find inconsistent data, impute missing values via web search, surface new article candidates.
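The structural part of the lint operation can be done without an LLM at all. The sketch below is an assumption-laden illustration: it presumes the wiki uses Obsidian-style `[[wikilinks]]` and has an `index.md` entry page, neither of which the original post specifies in detail. It reports links that point at nonexistent pages and pages nothing links to.

```python
import re
from pathlib import Path

# Captures the page name in [[page]], [[page|label]] and [[page#section]].
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)")


def lint_wiki(wiki_dir: Path) -> dict[str, list[str]]:
    """Find broken wikilinks and orphaned pages in a directory of .md files."""
    pages = {path.stem: path for path in wiki_dir.rglob("*.md")}
    linked: set[str] = set()
    broken: list[str] = []

    for name, path in sorted(pages.items()):
        for target in WIKILINK_RE.findall(path.read_text(encoding="utf-8")):
            target = target.strip()
            linked.add(target)
            if target not in pages:
                broken.append(f"{name} -> {target}")

    # The entry page (assumed to be index.md) is allowed to have no inbound links.
    orphans = sorted(set(pages) - linked - {"index"})
    return {"broken_links": broken, "orphans": orphans}
```

The output is a natural prompt for the agent's next pass: broken links are article candidates, and orphans are pages that need to be cross-linked or retired.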
Output goes back into Obsidian - formatted files, Marp slides, matplotlib plots - viewable alongside the raw sources that produced them.
The harness parallels are direct. Raw is your immutable source of truth. Wiki is agent-owned state - like .claude/, maintained by the agent, not manually edited. Operations are golden principles for knowledge work: defined, repeatable procedures applied consistently.
Vannevar Bush described the Memex in 1945: a desk that stored and linked all your documents, let you trace associative trails, and built a record of your thinking over time. His unsolved problem was maintenance - who keeps the index current, who updates the cross-links as the corpus grows? The LLM handles that. The harness is what makes it reliable.
The harness isn’t specific to code.
The three operational primitives: skills, shell, compaction
We’re shifting from single-turn assistants to long-running agents that handle real knowledge work: reading large datasets, updating files, and writing apps. Based on developer feedback and their own experience building internal agents, both OpenAI and Anthropic released three primitives that make long-horizon work practical.
The pattern is identical across ecosystems. The implementations differ.
Skills: procedures agents load on demand
OpenAI’s approach:
A skill is a bundle of files plus a SKILL.md manifest containing frontmatter and instructions. Think: a versioned
playbook the model can consult when it’s time to do real work. Skills are aligned with the Agent Skills open standard.
When skills are available, the platform exposes each skill’s name, description, and path to the model. The model
uses that metadata to decide whether to invoke a skill. If it does, it reads SKILL.md for the full workflow.
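To show what "exposes each skill's name, description, and path" amounts to, here is a small sketch that builds that metadata from a directory of skills. It is not the platform's implementation. It assumes one folder per skill containing a SKILL.md, and it uses a deliberately naive `key: value` frontmatter reader rather than a full YAML parser.

```python
from pathlib import Path


def read_frontmatter(skill_md: Path) -> dict[str, str]:
    """Read simple 'key: value' pairs between the leading '---' fences."""
    lines = skill_md.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, separator, value = line.partition(":")
        if separator:
            meta[key.strip()] = value.strip()
    return meta


def list_skills(skills_root: Path) -> list[dict[str, str]]:
    """Return the routing metadata a model would see: name, description, path."""
    skills = []
    for skill_md in sorted(skills_root.glob("*/SKILL.md")):
        meta = read_frontmatter(skill_md)
        skills.append({
            "name": meta.get("name", skill_md.parent.name),
            "description": meta.get("description", ""),
            "path": str(skill_md),
        })
    return skills
```

Only this short summary goes into context up front. The body of SKILL.md is read later, and only if the model decides the skill applies, which is why the description has to carry the routing decision.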
Claude Code’s approach:
Claude Code skills work identically. A skill is a markdown file with frontmatter (name, description, version) plus step-by-step instructions. Store skills in .claude/skills/ or install community skills via the Claude Skills Library.
Key difference: Claude Code’s skills ecosystem integrates with MCP servers for external tool access, while OpenAI’s skills use the shell tool for execution.
The andrej-karpathy-skills repo - 143,000+ stars, installable from the Claude Code marketplace - shows what this looks like at its simplest: four engineering principles in a single CLAUDE.md, available as a per-project copy or an installable plugin.
flowchart LR
subgraph openai ["OpenAI Skills"]
O1["SKILL.md manifest"] --> O2["Agent reads procedure"]
O2 --> O3["Executes via shell tool"]
end
subgraph claude ["Claude Code Skills"]
C1[".claude/skills/ manifest"] --> C2["Agent reads procedure"]
C2 --> C3["Executes via MCP servers"]
end
style openai fill: #d4edda, stroke: #28a745
style claude fill: #e8f4fd, stroke: #0d6efd
Shell: execution for agents
OpenAI’s approach:
The shell tool lets models work inside a real terminal environment - either hosted containers managed by OpenAI, or a local shell runtime you execute yourself (same tool semantics, but you control the machine). Hosted shell runs through the Responses API, which means your requests come with stateful work, tool calls, multi-turn continuation, and artifacts.
Claude Code’s approach:
Claude Code runs in your local terminal by default - no hosted containers. You control the machine. The CLI integrates directly with your development environment, maintaining persistent awareness of your entire project.
For remote/cloud execution, you can:
- Run Claude Code CLI in CI/CD pipelines
- Use Claude Desktop for GUI-based workflows
- Connect to remote machines via SSH + Claude Code CLI
Compaction: keep long runs moving
OpenAI’s approach:
As workflows get longer, they run into context window limits. Server-side compaction keeps long runs moving by managing the context window and compressing conversation history automatically.
Two modes:
- Server-side compaction: when context crosses the threshold, compaction runs automatically in-stream.
- Standalone /responses/compact endpoint: use when you want explicit control over when compaction happens.
Claude Code’s approach:
Automatic conversation compacting works identically. When a conversation approaches the token limit (configurable threshold), Claude automatically summarizes the conversation, creates a compaction block, and continues with the compacted context.
Key additions for Claude Code:
- Customize compaction in CLAUDE.md: “When compacting, always preserve the full list of modified files and any test commands”
- AfterCompaction hooks: inject critical context back after summarization (active plans, golden principles)
- Artifacts don’t count against token limits: Claude artifacts (up to 20MB) store code and outputs separately from conversation context
- Manual trigger: /compact Focus on the API changes for explicit control
Claude Code’s automatic compaction continues to improve in both memory usage during long sessions and preservation of critical state.
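The mechanics behind threshold-based compaction are worth seeing once, even though both platforms handle them for you. The sketch below is a client-side illustration of the idea only. It is not how either vendor implements compaction, the four-characters-per-token estimate is a crude assumption, and `summarize` is a placeholder for whatever model call produces the summary.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Conversation:
    pinned: list[str]  # constraints that must survive every compaction
    messages: list[str] = field(default_factory=list)


def estimate_tokens(text: str) -> int:
    """Very rough heuristic (about 4 characters per token); an assumption."""
    return max(1, len(text) // 4)


def compact_if_needed(
    conv: Conversation,
    threshold_tokens: int,
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> bool:
    """Summarize older messages once the estimate crosses the threshold."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    total = sum(estimate_tokens(m) for m in conv.pinned + conv.messages)
    if total <= threshold_tokens or len(conv.messages) <= keep_recent:
        return False
    older = conv.messages[:-keep_recent]
    recent = conv.messages[-keep_recent:]
    # Pinned items are never summarized; only the older history is replaced.
    conv.messages = ["[compacted] " + summarize(older)] + recent
    return True
```

The `pinned` list is the part that maps to the advice above: state you cannot afford to lose lives outside the summarizable history, so compaction has no chance to drop it.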
Why they’re better together
flowchart LR
subgraph skills ["Skills = the HOW"]
S1["SKILL.md manifest"]
S2["Templates + examples"]
S3["Guardrails + routing"]
end
subgraph shell ["Shell/MCP = the DO"]
SH1["Install dependencies"]
SH2["Run scripts + tools"]
SH3["Write artifacts"]
end
subgraph compaction ["Compaction = the CONTINUITY"]
C1["Auto-compress when context fills"]
C2["Pin immutable constraints"]
C3["Multi-hour runs stay coherent"]
end
skills -->|" model loads procedure "| shell
shell -->|" long run hits limit "| compaction
compaction -->|" resumes with full context "| shell
style skills fill: #e8f4fd, stroke: #0d6efd
style shell fill: #d4edda, stroke: #28a745
style compaction fill: #fff3cd, stroke: #ffc107
- Skills reduce prompt spaghetti by moving stable procedures and examples into a reusable bundle.
- Shell/MCP provides a full execution environment, letting you install code, run scripts, and write outputs.
- Compaction preserves continuity on long runs, so the same workflow can keep executing without manual context surgery.
The philosophy, summarized: Use skills to encode the how (procedures, templates, guardrails). Use shell/MCP to execute the do (install, run, write artifacts). Use compaction to keep long runs coherent (without hand-managing context).
The 10 tips that actually matter
These come from OpenAI’s developer blog, informed by their work building Codex and production experience from Glean. Each tip below shows OpenAI implementation patterns and Claude Code adaptations where they differ.
Applicability to Claude Code: Tips 1-5 (skill design) and Tip 10 (local/cloud parity) apply directly with the same patterns. Tips 6-9 (security, networking, credentials) use different mechanisms (MCP permission tiers vs OpenAI allowlists) but the same principles.
Tips 1-5: skill design and routing
Tip 1: Write skill descriptions like routing logic, not marketing copy.
Your skill’s description is effectively the model’s decision boundary. It should answer: when should I use this? When should I not? What are the outputs and success criteria?
OpenAI + Claude: Include a short “Use when vs. don’t use when” block directly in the front matter description.
Tip 2: Add negative examples and edge cases to reduce misfires.
A surprising failure mode is that making skills available can initially reduce correct triggering. Glean saw skill-based routing drop by about 20% in targeted evals, then recovered after they added negative examples and edge case coverage.
OpenAI + Claude: Write explicit “Don’t call this skill when…” cases in the skill body and suggest what to do instead.
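Tips 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not a format either vendor prescribes: the `SkillSpec` class, its field names, and the `quarterly-report` skill are all hypothetical. The only thing taken from the tips is the idea that the description carries explicit "use when" and "don't use when" cases.

```python
from dataclasses import dataclass, field


@dataclass
class SkillSpec:
    """Hypothetical container for the routing-relevant parts of a skill."""
    name: str
    summary: str
    use_when: list = field(default_factory=list)
    dont_use_when: list = field(default_factory=list)  # negative examples (Tip 2)

    def description(self) -> str:
        """Render the description as routing logic rather than marketing copy."""
        lines = [self.summary, "Use when:"]
        lines += [f"- {case}" for case in self.use_when]
        lines.append("Don't use when:")
        lines += [f"- {case}" for case in self.dont_use_when]
        return "\n".join(lines)

    def front_matter(self) -> str:
        """Render a YAML-style front matter block using a literal block scalar."""
        indented = "\n".join(f"  {line}" for line in self.description().splitlines())
        return f"---\nname: {self.name}\ndescription: |\n{indented}\n---"


spec = SkillSpec(
    name="quarterly-report",
    summary="Builds the quarterly revenue report from the finance export.",
    use_when=["The user asks for a quarterly or year-to-date revenue summary."],
    dont_use_when=["The user wants a one-off chart; answer directly instead."],
)
print(spec.front_matter())
```

Keeping the positive and negative cases as data makes it easy to review them in a PR and to reuse them as eval cases later.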
Tip 3: Put templates and examples inside the skill, not the system prompt.
Templates inside skills have two advantages: they’re available exactly when needed (when the skill is invoked), and they don’t inflate tokens for unrelated queries. Glean reported this pattern drove some of their biggest quality and latency gains in production.
OpenAI: Include templates in SKILL.md under “Examples” section.
Claude: Include templates in .claude/skills/*.md with concrete examples inline.
Tip 4: Design for long runs early with container reuse and compaction.
Long-horizon agents rarely succeed as one-shot prompts. Plan for continuity from the start:
- Reuse the same container/session across steps when you want stable dependencies, cached files, and intermediate outputs.
- Pass `previous_response_id` (OpenAI) or maintain conversation thread (Claude) so the model can continue work.
- Use compaction as a default long-run primitive, not an emergency fallback.
Claude-specific: Use artifacts (up to 20MB) to persist structured data across compaction events.
Tip 5: For determinism, just tell the model to use the skill.
“Use the <skill-name> skill.” That’s it. Simplest reliability lever you can pull. Turns fuzzy routing into an explicit contract.
OpenAI + Claude: Works identically in both ecosystems.
Tips 6-10: security, networking, and portability
Tip 6: Treat skills plus networking as a high-risk combo.
Security posture
Combining skills with open network access creates a high-risk path for data exfiltration. If you use networking, keep network allowlists strict, assume tool output is untrusted, and avoid open internet plus unrestricted procedures in consumer-facing flows.
OpenAI: Use org-level and request-level network_policy allowlists.
Claude: Use MCP server vetting and permission tiers (Deny/Ask/Allow rules).
Strong default posture for both:
- Skills: allowed
- Shell/MCP: allowed
- Network: enabled only with a minimal allowlist, per request, for narrowly scoped tasks
Tip 7: Make a standard handoff boundary for artifacts.
OpenAI: Treat /mnt/data as the standard place to write outputs you’ll retrieve, review, or pass back into subsequent steps.
Claude: Use Claude artifacts for structured outputs that persist across sessions. For file-based outputs, define a standard path like ./artifacts/ in CLAUDE.md.
Mental model: tools write to disk, models reason over disk, developers retrieve from disk.
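Here is a rough sketch of what a handoff boundary could look like on disk. The `write_artifact` helper and the `manifest.json` file are assumptions made for illustration; neither ecosystem requires a manifest. The point is only that every output lands under one agreed root with enough metadata to diff or audit it later.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_artifact(root: Path, name: str, content: str) -> dict:
    """Write one artifact under the agreed root and record it in a manifest."""
    if Path(name).name != name:
        # Reject names like "../x" so outputs cannot escape the handoff directory.
        raise ValueError(f"artifact name must be a bare file name: {name!r}")
    root.mkdir(parents=True, exist_ok=True)
    data = content.encode("utf-8")
    (root / name).write_bytes(data)
    entry = {"name": name, "bytes": len(data), "sha256": hashlib.sha256(data).hexdigest()}
    manifest_path = root / "manifest.json"  # hypothetical index of everything written
    manifest = json.loads(manifest_path.read_text(encoding="utf-8")) if manifest_path.exists() else []
    manifest = [item for item in manifest if item["name"] != name] + [entry]
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return entry


# Demo in a temporary directory; in a real repo the root would be ./artifacts/ or /mnt/data.
with tempfile.TemporaryDirectory() as tmp:
    print(write_artifact(Path(tmp) / "artifacts", "report.md", "# Weekly report\n"))
```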
Tip 8: Understand allowlists as a two-layer system.
OpenAI: Networking is controlled via org-level allowlist (admin-configured) and request-level network_policy (must be subset of org allowlist).
Claude: MCP security uses permission tiers: Deny rules block operations, Ask rules require user approval, Allow rules auto-approve. Hook controls let you disable all hooks between sessions to prevent persistent malicious code.
flowchart TD
A["Agent request with network/tool access"] --> B{"In allowlist (OpenAI) or permission tier (Claude)?"}
B -->|no| C["Blocked"]
B -->|yes| D{"Needs auth?"}
D -->|no| E["Request proceeds"]
D -->|yes| F["Inject credentials (domain_secrets for OpenAI, MCP auth for Claude)"]
F --> E
style C fill: #f8d7da, stroke: #dc3545
style E fill: #d4edda, stroke: #28a745
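The two checks in the diagram can be sketched as plain functions. This is an illustration of the logic, not either vendor's implementation: the function names and the `shell:` operation strings are hypothetical, and the evaluation order (deny, then ask, then allow, with unmatched operations falling back to ask) is an assumption.

```python
from fnmatch import fnmatchcase


def effective_allowlist(org_allowlist: set, request_policy: set) -> set:
    """Layer 2 (request) must be a subset of layer 1 (org); otherwise reject."""
    extra = request_policy - org_allowlist
    if extra:
        raise ValueError(f"not in org allowlist: {sorted(extra)}")
    return set(request_policy)


def permission_tier(operation: str, deny: list, ask: list, allow: list) -> str:
    """Return 'deny', 'ask', or 'allow' for an operation string.

    Assumption: deny is checked first, then ask, then allow, and anything
    unmatched falls back to 'ask'.
    """
    for tier, patterns in (("deny", deny), ("ask", ask), ("allow", allow)):
        if any(fnmatchcase(operation, pattern) for pattern in patterns):
            return tier
    return "ask"


org = {"api.github.com", "pypi.org", "files.pythonhosted.org"}
print(sorted(effective_allowlist(org, {"pypi.org"})))
print(permission_tier("shell:rm -rf build", deny=["shell:rm *"], ask=[], allow=["shell:*"]))
```

Checking deny first means a broad allow rule can never override a narrow block, which is the safer failure mode.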
Tip 9: Use secure credential injection for authenticated calls.
OpenAI: Use domain_secrets so the model never sees raw credentials. At runtime, the model sees placeholders (e.g., $API_KEY), and a sidecar injects real values only for approved destinations.
Claude: MCP servers handle authentication separately from the model. Credentials live in MCP server config, never in conversation context.
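To make the placeholder idea concrete, here is a minimal sketch of what a sidecar might do. It is not how `domain_secrets` or any MCP server is implemented; `inject_secrets`, the host names, and the secret values are hypothetical. The behaviour it shows is the one described above: the model only ever writes `$API_KEY`, and the real value is substituted only for approved destinations.

```python
from urllib.parse import urlparse


def inject_secrets(url: str, headers: dict, secrets_by_host: dict) -> dict:
    """Swap placeholders for real values, but only for hosts that have secrets configured."""
    host = urlparse(url).hostname or ""
    secrets = secrets_by_host.get(host, {})
    resolved = {}
    for name, value in headers.items():
        # Replace longer placeholders first so $API_KEY never clobbers $API_KEY_2.
        for placeholder in sorted(secrets, key=len, reverse=True):
            value = value.replace(placeholder, secrets[placeholder])
        resolved[name] = value
    return resolved


SECRETS = {"api.example.com": {"$API_KEY": "s3cr3t"}}  # lives outside the model's context
model_headers = {"Authorization": "Bearer $API_KEY"}   # what the model actually produced

print(inject_secrets("https://api.example.com/v1/items", model_headers, SECRETS))
print(inject_secrets("https://evil.example.net/collect", model_headers, SECRETS))
```

For the unapproved host the placeholder goes out unchanged, so a prompt-injected request leaks nothing useful.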
Tip 10: Use the same APIs in the cloud and locally.
OpenAI: Skills work with hosted shell and local shell mode. Shell has a local execution mode where you execute shell_call yourself and return shell_call_output back to the model.
Claude: Claude Code runs locally by default. For cloud execution, run Claude Code CLI in CI/CD or use Claude Desktop for GUI workflows. Skills and MCP servers work identically across local/remote.
Practical dev loop (both ecosystems):
- Start local (fast iteration, access to internal tooling, easy debugging).
- Move to hosted/CI when you want repeatability, isolation, and deployment consistency.
- Keep skills the same across both modes (the workflow stays stable even when execution moves).
Three build patterns
flowchart LR
subgraph A ["Pattern A: Artifact"]
direction TB
A1["Install libs"] --> A2["Fetch data / call API"] --> A3["Write to standard path (/mnt/data or ./artifacts/)"]
end
subgraph B ["Pattern B: Repeatable"]
direction TB
B1["Load skill"] --> B2["Execute in shell/MCP"] --> B3["Produce artifact deterministically"]
end
subgraph C ["Pattern C: Enterprise"]
direction TB
C1["Skill = living SOP"] --> C2["Multi-tool orchestration"] --> C3["Consistent execution across org"]
end
A -->|" add skills for consistency "| B
B -->|" scale across teams "| C
style A fill: #d4edda, stroke: #28a745
style B fill: #e8f4fd, stroke: #0d6efd
style C fill: #fff3cd, stroke: #ffc107
Each pattern builds on the previous one. Start with A for quick wins, graduate to B for repeatable workflows, scale to C across the org.
Pattern A: Install, fetch, write artifact.
OpenAI: Install libraries, call API, write report to /mnt/data/report.md.
Claude: Install libraries, call API via MCP server, write to Claude artifact or ./artifacts/report.md.
This creates a clean review boundary - your app can show the artifact to the user, log it, diff it, or feed it into a later step.
Pattern B: Skills + shell/MCP for repeatable workflows.
OpenAI: Encode workflow in SKILL.md, mount into shell environment, execute deterministically.
Claude: Encode workflow in .claude/skills/*.md, execute via MCP servers for external tool access.
Particularly effective for spreadsheet analysis, dataset cleaning, and standardized report generation.
Pattern C: Skills as enterprise workflow carriers.
One early pattern is a loss of accuracy in the gap between single tool invocation and multi-tool orchestration. Skills close that gap.
Glean’s Salesforce-oriented skill (works for both OpenAI and Claude): increased eval accuracy from 73% to 85% and reduced time-to-first-token by 18.1%. Practical tactics: careful routing, negative examples, embedding templates inside the skill.
Skills become living SOPs (standard operating procedures): updated as your org evolves, executed consistently by agents across both ecosystems.
Architecture enforcement: boundaries, not micromanagement
Agents are most effective in environments with strict boundaries and predictable structure. OpenAI built their application around a rigid architectural model. Each business domain is divided into a fixed set of layers, with strictly validated dependency directions.
Claude Code teams are doing the same thing. As agentic coding becomes standard in 2026, the consensus is clear: constraints enable speed without decay.
The rule (enforced mechanically):
Within each business domain (for example, App Settings), code can only depend “forward” through a fixed set of layers. Cross-cutting concerns (auth, connectors, telemetry, feature flags) enter through a single explicit interface: *Providers*. Anything else is disallowed and enforced mechanically.
flowchart LR
subgraph domain ["Business domain (e.g. App Settings)"]
direction LR
T["Types"] --> CF["Config"] --> R["Repo"] --> S["Service"] --> RT["Runtime"] --> U["UI"]
end
subgraph providers ["Providers (single entry point)"]
direction TB
P1["Auth"]
P2["Connectors"]
P3["Telemetry"]
P4["Feature flags"]
end
providers -->|" only allowed cross-cutting edge "| S
T -.->|" backward dependency BLOCKED by linter "| U
style domain fill: #e8f4fd, stroke: #0d6efd
style providers fill: #fff3cd, stroke: #ffc107
style T fill: #d4edda, stroke: #28a745
Dependencies flow left to right only. The linter blocks any backward edge. Providers are the only way cross-cutting concerns get in.
Enforcement through custom lints and structural tests
OpenAI: Custom linters and structural tests (Codex-generated), plus “taste invariants.” They statically enforce structured logging, naming conventions, file size limits. Because the lints are custom, they write error messages that inject remediation instructions into agent context.
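A structural test for the layer rule can be very small. The sketch below is one reading of the diagram, not OpenAI's linter: it assumes a module may import from its own layer or any earlier one, that importing from a later layer is the blocked "backward" edge, and that Providers may only be imported from the Service layer. The function name and message wording are hypothetical. The error messages carry remediation text, since that is what ends up in the agent's context.

```python
from typing import Optional

LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {layer: index for index, layer in enumerate(LAYERS)}


def check_dependency(importer: str, imported: str) -> Optional[str]:
    """Return None if the import is allowed, else a remediation message for the agent."""
    if imported == "providers":
        if importer == "service":
            return None
        return (
            f"'{importer}' must not import 'providers' directly. "
            "Remediation: route the cross-cutting call through the 'service' layer."
        )
    if RANK[imported] > RANK[importer]:
        return (
            f"'{importer}' must not import '{imported}' (it sits later in {LAYERS}). "
            f"Remediation: move the shared code into '{importer}' or an earlier layer."
        )
    return None


print(check_dependency("ui", "types"))      # allowed, prints None
print(check_dependency("types", "ui"))      # blocked, prints the remediation message
print(check_dependency("repo", "providers"))
```

In practice you would feed this from parsed imports and fail CI on any non-None result.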
Claude Code: Hooks provide deterministic enforcement. PostToolUse hooks run linters immediately after code changes. PreToolUse hooks block violations before they happen.
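As a hypothetical illustration (not taken from any production repo), the decision logic of a blocking hook might look like the sketch below. It assumes the hook receives a JSON event with `tool_name` and `tool_input` fields, that `Write` carries `content` and `Edit` carries `new_string`, and that exit code 2 signals a block. The banned patterns borrow the three Scala rules mentioned for FlowForge; the regexes and hint texts are made up.

```python
import json
import re

# Hypothetical rules: pattern -> remediation hint shown to the agent.
BANNED = {
    r"\bvar\s": "use val or an immutable copy instead of var",
    r"\.asInstanceOf\[": "use pattern matching instead of a cast",
    r"\bprintln\(": "use the structured logger instead of println",
}


def evaluate(event: dict) -> tuple:
    """Return (exit_code, message); exit code 2 is assumed to block the tool call."""
    tool_input = event.get("tool_input", {})
    path = tool_input.get("file_path", "")
    content = tool_input.get("content") or tool_input.get("new_string", "")
    if event.get("tool_name") not in {"Write", "Edit"} or not path.endswith(".scala"):
        return 0, ""
    problems = [f"{path}: {hint}" for pattern, hint in BANNED.items() if re.search(pattern, content)]
    return (2, "\n".join(problems)) if problems else (0, "")


sample = json.loads(
    '{"tool_name": "Write", "tool_input": {"file_path": "core/Job.scala", "content": "var retries = 0"}}'
)
print(evaluate(sample))
```

A thin wrapper script would read the event with `json.load(sys.stdin)`, print the message to stderr, and call `sys.exit` with the returned code. Keeping the decision in a pure function means you can unit test the rules without running an agent.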
The philosophy: enforce boundaries centrally, allow autonomy locally. You care deeply about boundaries, correctness, and reproducibility. Within those boundaries, you allow teams - or agents - significant freedom in how solutions are expressed.
The resulting code doesn’t always match human stylistic preferences, and that’s okay. As long as the output is correct, maintainable, and legible to future agent runs, it meets the bar.
In FlowForge, I enforce this through sbt module definitions (dependencies only flow one direction), `.scalafix.conf` rules (no `var`, no `asInstanceOf`, no `println`), and per-module coverage thresholds (core 90%, connectors 80%). These rules work identically for both Codex and Claude Code because they’re language-level enforcement, not agent-specific.
The autonomy ladder: from drafting code to merging PRs
As more of the development loop was encoded directly into the system - testing, validation, review, feedback handling, and recovery - OpenAI’s repository crossed a meaningful threshold where Codex can end-to-end drive a new feature.
Claude Code teams are reaching the same milestone. Building agents with the Claude Agent SDK shows similar autonomous workflows becoming production-ready.
Given a single prompt, the agent can now drive a feature end-to-end:
flowchart TD
A["1. Validate codebase state"] --> B["2. Reproduce reported bug"]
B --> C["3. Record evidence\n(video for Codex, artifact for Claude)"]
C --> D["4. Implement fix"]
D --> E["5. Validate fix by driving app"]
E --> F["6. Record verification (video/artifact)"]
F --> G["7. Open PR"]
G --> H["8. Respond to agent + human feedback"]
H --> I{"9. Build passes?"}
I -->|no| J["Remediate build failure"]
J --> H
I -->|yes| K{"10. Judgment required?"}
K -->|yes| L["Escalate to human"]
K -->|no| M["11. Merge"]
style A fill: #e8f4fd, stroke: #0d6efd
style M fill: #d4edda, stroke: #28a745
style L fill: #fff3cd, stroke: #ffc107
That’s 11 steps from a single prompt. The agent loops on feedback until reviewers are satisfied (the Ralph Wiggum Loop - agent-to-agent review iteration). Humans only step in when actual judgment is needed.
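The tail of the ladder (steps 8 to 11) is a small control loop. The sketch below is a simplified model of it, not either team's orchestration code: `PullRequest`, `drive_to_merge`, and the `max_rounds` cap are hypothetical, and the cap is an added assumption so that a stuck loop escalates instead of spinning forever.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PullRequest:
    build_passes: bool
    open_feedback: int      # unresolved agent or human review comments
    needs_judgment: bool    # flagged as requiring a human decision


def drive_to_merge(pr: PullRequest, remediate: Callable[[PullRequest], None], max_rounds: int = 5) -> str:
    """Loop on feedback and build failures, then merge or escalate."""
    for _ in range(max_rounds):
        if pr.open_feedback or not pr.build_passes:
            remediate(pr)   # steps 8-9: respond to feedback, fix the build
            continue
        return "escalate" if pr.needs_judgment else "merge"  # steps 10-11
    return "escalate"       # assumption: give up and ask a human after max_rounds


def fake_agent_fix(pr: PullRequest) -> None:
    """Stand-in for an agent run that resolves everything in one round."""
    pr.build_passes = True
    pr.open_feedback = 0


print(drive_to_merge(PullRequest(build_passes=False, open_feedback=2, needs_judgment=False), fake_agent_fix))
```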
Implementation differences:
OpenAI Codex:
- Uses Chrome DevTools Protocol for step 2 (reproduce bug) and step 5 (validate fix)
- Records videos as evidence artifacts
- Reviews own changes locally, requests agent reviews in cloud
- Uses `gh` CLI for PR operations
Claude Code:
- Uses MCP servers for app interaction (browser automation, API testing)
- Records evidence in Claude artifacts (structured test results, screenshots)
- Subagents handle specialized review (security, test coverage, performance)
- Uses `gh` CLI for PR operations (identical to Codex)
How agents interact with the system
OpenAI: Humans interact almost entirely through prompts: describe a task, run the agent, allow it to open a pull request. To drive a PR to completion, Codex reviews its own changes locally, requests additional agent reviews, responds to feedback, and iterates until all reviewers are satisfied. Codex uses standard development tools directly (gh, local scripts, repository-embedded skills).
Claude Code: The workflow is iterative and conversational, feeling like you have a junior developer sitting next to you. Claude Code integrates directly with your development environment, maintaining persistent awareness of your entire project. For review, specialized subagents handle security scans, test coverage, and code quality checks.
Human review is optional, not required. Over time, both teams have pushed almost all review effort towards being handled agent-to-agent.
Throughput changes the merge philosophy
As agent throughput increased, many conventional engineering norms became counterproductive. Both OpenAI and Claude Code teams operate with minimal blocking merge gates. Pull requests are short-lived. Test flakes are often addressed with follow-up runs rather than blocking progress indefinitely.
In a system where agent throughput far exceeds human attention, corrections are cheap, and waiting is expensive.
This would be irresponsible in a low-throughput environment. Here, it’s often the right tradeoff.
What “agent-generated” actually means
When they say the codebase is generated by agents, they mean everything: product code and tests, CI configuration and release tooling, internal developer tools, documentation and design history, evaluation harnesses, review comments and responses, scripts that manage the repository itself, and production dashboard definition files.
Humans always remain in the loop, but work at a different layer of abstraction. They prioritize work, translate user feedback into acceptance criteria, and validate outcomes. When the agent struggles, they treat it as a signal: identify what is missing - tools, guardrails, documentation - and feed it back into the repository, always by having the agent itself write the fix.
15 lessons from building ChatGPT apps (adapted for Claude ecosystem)
Alpic built two dozen ChatGPT apps over three months. Their core insight is what they call the three body problem: traditional web apps have two actors (user, UI), but AI apps add a third (model). The hard part is managing context asymmetry - each body has partial knowledge, and no single one has the full picture.
These patterns apply to Claude Desktop apps and MCP servers. The three-body problem exists whether you’re building for ChatGPT or Claude.
flowchart TD
subgraph traditional ["Traditional web app"]
direction LR
U1["User"] <-->|" clicks, sees "| UI1["UI"]
end
subgraph ai ["AI app (ChatGPT or Claude)"]
direction TB
U2["User"] <-->|" types, sees "| UI2["Widget/UI/Desktop"]
UI2 <-->|" tools/MCP "| M["Model"]
M <-->|" tool results "| U2
end
style traditional fill: #f0f0f0, stroke: #999
style ai fill: #e8f4fd, stroke: #0d6efd
Three actors, each with partial knowledge. The user sees the chat and the UI. The UI sees its own state and can push context to the model. The model sees tool outputs but not UI internals unless you explicitly expose them. Managing who knows what - that’s the whole game.
Lessons 1-4: system and architecture (ChatGPT + Claude adaptations)
Lesson 1: Not all context should be shared.
Different parts need intentionally different views of state. In a murder mystery game, the model needs to know who the killer is to roleplay correctly, while the UI and user must not. In a Time’s Up game, reversed: the UI shows the secret word, the model must stay unaware.
ChatGPT: Use structuredContent for data both model and widget need. Use _meta for response metadata visible only to the widget, hidden from the model.
Claude: Use MCP resources to expose only what the model needs. Use Claude artifacts for client-side state that stays hidden from the model.
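The Time’s Up case can be sketched as a tool result with two deliberately different views. The field names `structuredContent` and `_meta` come from the ChatGPT description above; the `build_tool_result` helper, the `content` text entry, and the game fields are assumptions for illustration.

```python
import json


def build_tool_result(shared: dict, widget_only: dict, summary: str) -> dict:
    """Split state into what the model may see and what only the widget may see."""
    overlap = shared.keys() & widget_only.keys()
    if overlap:
        # A key in both places would leak widget-only state to the model.
        raise ValueError(f"keys present in both views: {sorted(overlap)}")
    return {
        "content": [{"type": "text", "text": summary}],
        "structuredContent": shared,   # visible to model and widget
        "_meta": widget_only,          # widget only, hidden from the model
    }


result = build_tool_result(
    shared={"round": 1, "secondsLeft": 60},
    widget_only={"secretWord": "giraffe"},
    summary="Round 1 started.",
)
print(json.dumps(result, indent=2))
```

For the murder mystery case you would flip it: the killer's identity goes in the model-visible part and stays out of anything the widget renders.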
Lesson 2: Lazy-loading doesn’t translate well to AI apps.
In AI apps, tool calls imply delays - often several seconds due to security sandboxing and model reasoning. Front-load aggressively: send as much data as possible into the initial tool response.
ChatGPT: Hydrate the widget via window.openai.toolOutput. If the widget can safely fetch from a public API without sharing info with the model, classic XHR calls work.
Claude: Use MCP servers for data fetching. Pre-load data into artifacts or MCP resources. For public APIs, direct HTTP works (model doesn’t see the internals).
Lesson 3: The model needs visibility.
When a user interacts with a widget (selecting a product) then asks a question in chat, the model has no idea what they’re referring to.
ChatGPT: Use window.openai.setWidgetState(state) for imperative updates. Attach data-llm attributes directly to components for declarative context. Alpic built a Vite plugin that scrapes these attributes and automatically updates widgetState.
Claude: For Claude Desktop, push state updates via MCP tools. For Claude Code, maintain state in artifacts (up to 20MB) or .claude/context/ files that persist across sessions.
Lesson 4: Different interactions require different APIs.
Widget-to-server, model-to-server, widget-to-model - each path exists to support a different kind of interaction. Make these communication paths explicit and be intentional about which mechanism handles which part of the experience.
ChatGPT + Claude: Identical principle. Define clear boundaries: what’s HTTP (widget ↔ server), what’s tool calls (model ↔ server), what’s state updates (widget → model via MCP or widgetState).
Lessons 5-8: UI and product behavior (ChatGPT + Claude adaptations)
Lesson 5: UI must adapt to multiple display modes.
ChatGPT: Apps can appear inline (stays in conversation history), fullscreen (takes entire screen with chat bar at bottom), or picture-in-picture (floats on top). Account for device-specific safe zones.
Claude Desktop: Similar modes exist. Claude Desktop’s visual diffs show side-by-side comparisons. Artifacts can display inline or in separate pane. Design for both.
Claude Code CLI: No UI modes (terminal-only), but artifacts can be opened in external viewers. Design for text-first presentation.
Lesson 6: UI consistency matters in an embedded environment.
ChatGPT: Use the OpenAI Apps SDK UI Kit for ready-to-use components that align with ChatGPT’s design system.
Claude Desktop: No official UI kit yet, but MCP-connected apps should follow system UI conventions (match macOS/Windows design language). Use native components where possible.
Lesson 7: Language-first filtering beats traditional UI controls.
When users can say “Sunny destinations in Europe for under $200,” forcing them through checkboxes and range sliders adds friction. Provide the model with a List of Values (LOV) for tool parameters so it maps natural language directly to backend API requirements.
ChatGPT + Claude: Identical pattern. Define LOVs in tool/MCP schemas. Let the model do natural language → structured parameters mapping.
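A List of Values maps naturally onto `enum` in a JSON Schema style tool definition. The sketch below uses the travel example from this lesson; the parameter names, the allowed values, and the tiny `validate_arguments` checker are hypothetical, and a real server would more likely use a full schema validator.

```python
SEARCH_TOOL_SCHEMA = {
    "type": "object",
    "properties": {
        "region": {"type": "string", "enum": ["europe", "asia", "americas"]},
        "weather": {"type": "string", "enum": ["sunny", "snowy", "mild"]},
        "max_price_usd": {"type": "integer", "minimum": 0},
    },
    "required": ["region"],
}


def validate_arguments(schema: dict, args: dict) -> list:
    """Check required parameters and enum membership; return readable errors."""
    errors = [f"missing required parameter: {name}" for name in schema.get("required", []) if name not in args]
    for name, value in args.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unknown parameter: {name}")
            continue
        allowed = spec.get("enum")
        if allowed is not None and value not in allowed:
            # Listing the allowed values lets the model self-correct on retry.
            errors.append(f"{name} must be one of {allowed}, got {value!r}")
    return errors


# "Sunny destinations in Europe for under $200" as the model might map it:
print(validate_arguments(SEARCH_TOOL_SCHEMA, {"region": "europe", "weather": "sunny", "max_price_usd": 200}))
print(validate_arguments(SEARCH_TOOL_SCHEMA, {"region": "europe", "weather": "hot"}))
```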
Lesson 8: Files unlock richer interactions.
Users upload a photo of a product, model identifies it, widget continues into product matching and discovery.
ChatGPT: Tools consume files via openai/fileParams. Widgets handle files via window.openai.uploadFile and window.openai.getFileDownloadUrl.
Claude: Claude Desktop handles files natively. MCP servers can expose file upload endpoints. Artifacts store up to 20MB including images, PDFs, and structured data.
Lessons 9-10: production readiness (ChatGPT + Claude adaptations)
Lesson 9: CSPs are the new CORS.
ChatGPT: OpenAI renders Apps inside a double-nested iframe. Content Security Policies are strictly enforced. Declare connectDomains, resourceDomains, frameDomains, redirectDomains in app manifest.
Claude Desktop: MCP servers run with permission tiers. Define allowed operations in MCP manifest. Deny rules block dangerous operations, Ask rules require user approval, Allow rules auto-approve.
Security considerations:
- ChatGPT: CSP violations = blocked iframe operations
- Claude: MCP permission violations = blocked tool calls
Both require explicit allowlisting of external domains/operations.
Lesson 10: Small flags have outsized impact.
ChatGPT: widgetDomain is required for submission. widgetAccessible controls whether widget can call tools on its own via callTool. Tool annotations (readOnly, destructiveHint, openWorldHint) are required.
Claude: MCP tool definitions require similar metadata: operation safety level, required parameters, auth requirements. Hook configurations define which lifecycle events trigger which automation.
Lessons 11-14: iteration velocity (ChatGPT + Claude adaptations)
Lesson 11: Fast iteration requires hot reload.
ChatGPT: Long-TTL resource caching makes standard HMR incompatible. Alpic built a Vite plugin that intercepts resource requests and injects real-time updates into the ChatGPT iframe.
Claude Desktop: MCP servers support hot reload via server restart. Use file watchers to auto-reload MCP configs on change.
Claude Code CLI: Supports live file watching. Changes to .claude/ directory (skills, hooks, agents) take effect immediately or on next session start.
Lesson 12: Not every test belongs in the production environment.
ChatGPT: Build a lightweight local emulator mocking the ChatGPT host environment. Reserve real ChatGPT tests for validating model interactions.
Claude: Claude Code runs locally by default - your dev environment IS the test environment. Use MCP server mocks for integration tests. Reserve Claude Desktop tests for end-to-end workflows.
Lesson 13: Mobile testing requires explicit support.
ChatGPT: Vite’s default localhost makes tunnelled URLs inaccessible from other devices. Extend with domain forwarding on tunnelled ports.
Claude Desktop: Desktop app is macOS/Windows only (no mobile). Claude mobile apps exist but focus on conversational use cases. Design for desktop-first, mobile web as fallback.
Lesson 14: Familiar abstractions speed delivery.
ChatGPT: The Apps SDK exposes low-level JavaScript APIs. Introduce React-friendly abstractions - hooks like useCallTool, useWidgetState, useLocale.
Claude: MCP SDKs exist for Python and TypeScript. Build higher-level abstractions around common patterns (auth, caching, rate limiting). Reuse across projects.
Codification and reuse (lesson 15)
Lesson 15: Turn lessons into reusable tooling.
ChatGPT ecosystem: Alpic created the Skybridge Framework (open-source React framework with hooks, dev tools, data-llm attribute) and a Codex Skill covering the full lifecycle.
Claude ecosystem: Community is building similar patterns. Production-tested configs from everything-claude-code, comprehensive examples from claude-code-showcase. Claude Skills Library has 50+ ready-to-use MCP servers and skills.
Your operating model: staff vs principal split
Operating principle: engineers should optimize the agentic system (instructions + tools + checks + context), not only the immediate code artifact.
This is tool-agnostic. Whether you’re using Codex or Claude Code, the split stays the same.
| Level | Primary ownership | Success signal | Failure signal |
|---|---|---|---|
| Staff engineer | Team harness implementation | Agent tasks reliable for team workflows | Repeated manual cleanup, inconsistent outputs |
| Principal engineer | Org-wide harness standards | Cross-team reuse, fewer regressions, faster onboarding | Fragmented conventions, model-specific drift |
Software engineering track
- Architecture maps (AGENTS.md or CLAUDE.md + deep docs)
- Typed contracts and policy checks
- CI-integrated drift scanners (or hooks for Claude)
- Tool traces and remediation loops
Data engineering track
- Data contracts and schema policies (compile-time or CI-time)
- Deterministic transformation runbooks (purity enforcement)
- Data quality and lineage checks in harness
- Incident playbooks for failed pipelines and late data
The blueprint you can implement this quarter
gantt
title Harness implementation timeline
dateFormat YYYY-MM-DD
axisFormat %b W%W
section Foundation
Instruction architecture: a1, 2026-03-02, 2w
Quality enforcement: a2, after a1, 2w
section Execution
Skills and procedures: a3, after a2, 2w
Shell/MCP and execution: a4, after a3, 2w
section Lifecycle
Compaction and context: a5, after a4, 4w
Five layers, each building on the last. You can run layers 1-2 in parallel if you have the team.
Layer 1: instruction architecture (weeks 1-2)
For OpenAI Codex:
- Ship `AGENTS.md` as an index (under 120 lines)
- Structure `docs/` directory by concern
- Create ADR log in `docs/decisions/`
For Claude Code:
- Create `CLAUDE.md` in repo root (50-120 lines, project-specific only)
- Structure `.claude/agents/` for specialized subagents
- Link from CLAUDE.md to `docs/architecture/` for deep dives
Universal:
- Define evidence docs with file paths and line numbers
- Version all decisions and plans in the repo
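The size budget and the deep links are easy to check mechanically. The following is a small sketch of such a check, assuming the index file uses ordinary markdown links; `check_index_file` and the example paths are hypothetical, and the 120-line default mirrors the budget in the lists above.

```python
import re

LINK = re.compile(r"\]\(([^)\s#]+)")  # captures the target of [text](target)


def check_index_file(text: str, existing_paths: set, max_lines: int = 120) -> list:
    """Return problems with an AGENTS.md or CLAUDE.md style index file."""
    problems = []
    line_count = len(text.splitlines())
    if line_count > max_lines:
        problems.append(f"index has {line_count} lines; budget is {max_lines}. Move detail into docs/.")
    for target in LINK.findall(text):
        if "://" in target:
            continue  # external URLs are out of scope for this check
        if target not in existing_paths:
            problems.append(f"broken link: {target}")
    return problems


index = "# Index\n- [Architecture](docs/architecture/overview.md)\n- [Decisions](docs/decisions/)\n"
print(check_index_file(index, existing_paths={"docs/architecture/overview.md"}))
```

In a real repo you would build `existing_paths` from the working tree and run this in CI or a hook.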
Layer 2: quality enforcement (weeks 3-4)
For OpenAI Codex:
- Define golden principles (5-10 rules)
- Deploy daily drift scans
- Make lint errors prescriptive
For Claude Code:
- Define `.claude/hooks/PostToolUse.sh` for auto-fixes
- Define `.claude/hooks/PreToolUse.sh` for blocking violations
- Track hook trigger rate as a metric
Universal:
- Auto-fix low-risk violations
- Track manual cleanup time (goal: trending to zero)
Layer 3: executable procedures / skills (weeks 5-6)
For OpenAI Codex:
- Convert top 3-5 workflows into `SKILL.md` files
- Add templates, examples, negative examples
- Store in `skills/` directory with versioning
For Claude Code:
- Convert workflows into `.claude/skills/*.md` files
- Include routing logic (when to use, when not to use)
- Install community skills from Claude skills library
Universal:
- Write skill descriptions like routing logic, not marketing copy
- Include negative examples to reduce misfires
- Test skill invocation accuracy (target >= 90%)
Layer 4: execution substrate / shell or MCP (weeks 7-8)
For OpenAI Codex:
- Enable shell-backed execution (hosted or local)
- Define standard artifact path (`/mnt/data`)
- Instrument traces and incident metrics
For Claude Code:
- Build MCP servers for external tools
- Define standard artifact path (Claude artifacts or `./artifacts/`)
- Connect observability stack via MCP (logs, metrics, traces)
Universal:
- Start with local mode (fast iteration)
- Move to hosted/CI when ready for repeatability
- Keep skills the same across local/remote
Layer 5: context lifecycle / compaction (ongoing)
For OpenAI Codex:
- Enable server-side compaction (auto or manual `/responses/compact`)
- Pass `previous_response_id` for multi-turn continuity
- Pin immutable constraints in system message
For Claude Code:
- Enable automatic compaction with custom CLAUDE.md instructions
- Use `.claude/hooks/AfterCompaction.sh` to inject critical context after summarization
- Store long-lived state in Claude artifacts (up to 20MB)
Universal:
- Implement periodic context hygiene
- Track context hit rate (target >= 90%)
- Measure long-run coherence (target: 6+ hour sessions without drift)
Metrics that actually matter
The shift: from measuring lines of code written to measuring harness effectiveness.
| Metric | Target | What it tells you |
|---|---|---|
| Agent task success rate (successful runs / total) | >= 85% | Harness is well-specified |
| Auto-remediation rate (auto-fixed / total issues) | >= 60% | Quality enforcement is mechanical |
| Context hit rate (resolved lookups / total) | >= 90% | Docs structure works |
| Manual cleanup time (weekly hours) | Trending to zero | Golden principles coverage |
| Skill invocation accuracy | >= 90% | Descriptions + negative examples work |
| Human escalation rate | < 15% | Boundaries are clear |
Leading indicators (measure first): context hit rate, skill invocation accuracy, manual cleanup time.
Lagging indicators (improve as harness matures): agent task success rate, auto-remediation rate, human escalation rate.
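Three of these metrics fall straight out of per-run records. The sketch below assumes you log one record per agent run; the `Run` fields and function names are hypothetical, while the thresholds are the ones from the table above.

```python
from dataclasses import dataclass


@dataclass
class Run:
    succeeded: bool
    escalated: bool
    issues_found: int
    issues_auto_fixed: int


# (comparison, threshold) pairs taken from the targets table.
TARGETS = {
    "task_success_rate": (">=", 0.85),
    "auto_remediation_rate": (">=", 0.60),
    "human_escalation_rate": ("<", 0.15),
}


def harness_metrics(runs: list) -> dict:
    """Aggregate run records into the rates used in the table."""
    if not runs:
        raise ValueError("need at least one run")
    issues = sum(run.issues_found for run in runs)
    fixed = sum(run.issues_auto_fixed for run in runs)
    return {
        "task_success_rate": sum(run.succeeded for run in runs) / len(runs),
        "auto_remediation_rate": fixed / issues if issues else 1.0,
        "human_escalation_rate": sum(run.escalated for run in runs) / len(runs),
    }


def missed_targets(metrics: dict) -> list:
    """Return the names of metrics that do not meet their target."""
    missed = []
    for name, (op, threshold) in TARGETS.items():
        ok = metrics[name] >= threshold if op == ">=" else metrics[name] < threshold
        if not ok:
            missed.append(name)
    return missed


runs = [Run(True, False, 0, 0)] * 11 + [Run(True, False, 1, 1)] * 7 + [Run(False, True, 1, 0)] * 2
metrics = harness_metrics(runs)
print(metrics, missed_targets(metrics))
```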
How to measure across ecosystems:
OpenAI Codex:
- Task success: PR merge rate without human intervention
- Auto-remediation: drift scanner auto-fix rate
- Context hits: AGENTS.md link click-throughs (instrument in docs)
Claude Code:
- Task success: `/stats` command shows session metrics
- Auto-remediation: hook auto-fix vs. block rate
- Context hits: track `.claude/agents/` invocation frequency
How OpenAI measures effectiveness: not by volume (1M lines is impressive but not the point). By velocity (3.5 PRs/engineer/day, increasing). By autonomy (end-to-end without human intervention). By amplification (humans one layer up from implementation).
When things go wrong: failure scenarios and fixes
Agent output quality regresses suddenly
Symptoms: PR quality drops, tests fail more often, manual cleanup time increases.
Check in order:
- Did instruction file (AGENTS.md or CLAUDE.md) become too large or contradictory?
- Did a skill version change without tests?
- Did compaction remove critical constraints?
- Did evaluation thresholds change silently?
Fix (OpenAI): Audit AGENTS.md length and cross-links. Roll back skill version. Add compaction preservation rules in system message.
Fix (Claude): Check CLAUDE.md length (should be 50-120 lines). Review .claude/hooks/AfterCompaction.sh - is it injecting critical context back? Add preservation rules to CLAUDE.md.
Universal: Document findings in exec plans, have the agent implement the fix, update golden principles.
Agent keeps missing team conventions
Symptoms: Repeated style violations, inconsistent patterns, manual corrections needed.
Root cause: Knowledge lives in people’s heads or chat threads.
Fix (OpenAI): Move conventions to docs/conventions.md, link from AGENTS.md, add machine checks (custom linters).
Fix (Claude): Encode conventions in .claude/hooks/PostToolUse.sh as auto-fixes. Add high-level guidance to CLAUDE.md. Use PreToolUse hooks to block severe violations.
Universal: If an agent keeps getting it wrong, the convention isn’t in the repo. Make it mechanical.
Long workflows collapse after many turns
Symptoms: Agent loses context mid-task, forgets earlier decisions, loops on same issues.
Root cause: Compaction removed critical context, or no compaction enabled (hit hard limit).
Fix (OpenAI): Ensure server-side compaction is active. Pin immutable constraints in system message. Split long goals into checkpointed sub-runs with explicit handoff boundaries. Pass previous_response_id for continuation.
Fix (Claude): Enable automatic compaction in API settings. Add compaction preservation rules to CLAUDE.md: “When compacting, always preserve the full list of modified files and any test commands.” Use .claude/hooks/AfterCompaction.sh to inject critical state. Store long-lived context in Claude artifacts (up to 20MB).
Universal: Track compaction events as a metric. Measure pre/post-compaction coherence.
Humans spend 20% of time cleaning up AI slop
Symptoms: Manual Friday cleanup sessions, inconsistent code style, repeated fixes.
Root cause: Golden principles are under-specified.
Fix (OpenAI): Define explicit principles (5-10 rules). Build daily drift scanner. Auto-open fix-up PRs. Track cleanup time as a metric (goal: zero).
Fix (Claude): Encode principles in .claude/hooks/PostToolUse.sh as deterministic enforcement. Use exit code 2 to block violations before they land. Track hook trigger rate and human override rate.
Universal: The goal is trending to zero manual cleanup. If it’s not declining, your principles aren’t specific enough.
Skill routing accuracy drops after adding new skills
Symptoms: Agent invokes wrong skill for task, skips relevant skills, lower eval scores.
Root cause: Model can’t disambiguate between similar skills.
Fix (OpenAI + Claude): Add explicit negative examples (“Don’t call this skill when…”) to each skill’s description. Include edge case coverage. Add “Use when vs. don’t use when” blocks in frontmatter.
Evidence from production: Glean recovered a 20% accuracy drop with this pattern alone.
Universal: Skill routing is a leading indicator. Monitor invocation accuracy (target >= 90%). When you add a new skill, immediately add negative examples to existing skills.
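Invocation accuracy is straightforward to compute once you have labeled eval cases. The sketch below assumes each case is an `(expected_skill, invoked_skill)` pair, with `None` meaning the agent invoked no skill; the skill names and the `routing_report` helper are hypothetical.

```python
from collections import Counter
from typing import Optional

TARGET = 0.90  # the >= 90% target from the text


def routing_report(cases: list) -> tuple:
    """Return (accuracy, confusions) for (expected, invoked) pairs."""
    confusions: Counter = Counter()
    correct = 0
    for expected, invoked in cases:
        if expected == invoked:
            correct += 1
        else:
            confusions[(expected, invoked)] += 1
    accuracy = correct / len(cases) if cases else 1.0
    return accuracy, confusions


# Hypothetical eval results: (expected skill, skill the agent invoked).
cases = [
    ("db-migration", "db-migration"),
    ("db-migration", "schema-docs"),
    ("db-migration", "schema-docs"),
    ("db-migration", "db-migration"),
    ("schema-docs", "schema-docs"),
    ("schema-docs", "schema-docs"),
    ("release-notes", "release-notes"),
    ("release-notes", None),
    ("release-notes", "release-notes"),
    ("release-notes", "release-notes"),
]

accuracy, confusions = routing_report(cases)
worst: Optional[tuple] = confusions.most_common(1)[0] if confusions else None
print(f"accuracy={accuracy:.0%} (target {TARGET:.0%})")
print("most common confusion:", worst)
```

The most frequent confusion pair tells you which two skill descriptions need negative examples first.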
Agent can't reproduce bug or validate fix
Symptoms: Agent reports “can’t verify” or “unable to test,” relies on human to run app.
Root cause: Application behavior isn’t legible to the agent.
Fix (OpenAI): Make app bootable per worktree. Wire Chrome DevTools Protocol into agent runtime. Expose logs and metrics via ephemeral observability stack (LogQL/PromQL access).
Fix (Claude): Build MCP servers for app interaction: browser automation (Playwright/Puppeteer), API testing (HTTP client), log access (tail -f via MCP). Make the app bootable per worktree. Use Claude artifacts to store test results and evidence.
Universal: The app must be drivable by the agent. If the agent can’t boot it, query it, and observe its behavior, it can’t validate fixes.
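One small ingredient of "bootable per worktree" is making sure parallel worktrees do not fight over ports or log directories. The sketch below derives a stable port and environment from the worktree path. The port range, the `APP_PORT` and `APP_LOG_DIR` variable names, and the `.agent/logs` directory are all assumptions for illustration.

```python
import hashlib
from pathlib import PurePosixPath

PORT_BASE = 20000  # assumed free range: 20000-29999
PORT_SPAN = 10000


def worktree_port(worktree: str) -> int:
    """Map a worktree path to a stable port so parallel agents don't collide often."""
    normalized = str(PurePosixPath(worktree))  # drops any trailing slash
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    return PORT_BASE + int.from_bytes(digest[:4], "big") % PORT_SPAN


def worktree_env(worktree: str) -> dict:
    """Environment the agent passes when booting the app inside this worktree."""
    root = PurePosixPath(worktree)
    return {
        "APP_PORT": str(worktree_port(worktree)),
        "APP_LOG_DIR": str(root / ".agent" / "logs"),
    }


env = worktree_env("/repos/app-worktrees/fix-login-bug")
print(env)
```

Hash-based assignment can still collide occasionally; a real setup would probe the port and fall back to the next free one.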
Tool-agnostic adaptation: OpenAI, Claude, Gemini, Grok
The patterns in this article work regardless of model.
flowchart TD
subgraph harness ["Your harness (model-agnostic)"]
I["Instruction contract AGENTS.md or CLAUDE.md"]
SK["Skill contract versioned procedures"]
EX["Execution contract shell or MCP"]
V["Validation contract evals + policy checks"]
CX["Context contract compaction + pinned state"]
end
I --> SK --> EX --> V --> CX
harness --> CO["OpenAI Codex"]
harness --> CL["Claude Code"]
harness --> GE["Gemini Code Assist"]
harness --> GR["Grok"]
style harness fill:#e8f4fd,stroke:#0d6efd
Build the harness once, swap the model underneath. Keep this normalized interface:
| Contract | OpenAI implementation | Claude implementation | Universal pattern |
|---|---|---|---|
| Instruction | AGENTS.md (~100 lines) + docs/ | CLAUDE.md (50-120 lines) + .claude/ | Short index file + deep-linked docs |
| Skill | SKILL.md with frontmatter | .claude/skills/*.md with frontmatter | Versioned workflow packs with routing logic |
| Execution | Shell tool (hosted or local) | MCP servers (local by default) | Sandboxed execution with audit trail |
| Validation | Custom linters + structural tests | Hooks (Pre/PostToolUse) | Mechanical enforcement at lifecycle events |
| Context | Server-side compaction + previous_response_id | Auto compaction + AfterCompaction hooks + artifacts | Automatic summarization with critical context pinning |
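The table can also be kept as data, which makes a migration checklist mechanical. This is a sketch only: the `Harness` dataclass and `adapters_needed` helper are hypothetical, and the field values are copied from the table rather than from any tool's configuration format.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Harness:
    """The five contracts from the table, one field per row."""
    instruction: str
    skill: str
    execution: str
    validation: str
    context: str


CODEX = Harness(
    instruction="AGENTS.md",
    skill="SKILL.md",
    execution="shell tool",
    validation="custom linters + structural tests",
    context="server-side compaction + previous_response_id",
)

CLAUDE = Harness(
    instruction="CLAUDE.md",
    skill=".claude/skills/*.md",
    execution="MCP servers",
    validation="hooks (Pre/PostToolUse)",
    context="auto compaction + AfterCompaction hooks + artifacts",
)


def adapters_needed(source: Harness, target: Harness) -> dict:
    """Contracts whose implementation differs, as {contract: (from, to)}."""
    return {
        f.name: (getattr(source, f.name), getattr(target, f.name))
        for f in fields(source)
        if getattr(source, f.name) != getattr(target, f.name)
    }


for contract, (old, new) in adapters_needed(CODEX, CLAUDE).items():
    print(f"{contract}: {old} -> {new}")
```

Anything that never shows up in the diff (for example a shared docs/ tree) is the part of the harness that ports for free.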
Migration paths
Codex → Claude Code:
- Rename AGENTS.md → CLAUDE.md (trim to 50-120 lines)
- Convert custom linters → .claude/hooks/PostToolUse.sh
- Convert shell workflows → MCP servers
- Keep docs/ structure unchanged (works for both)
- Keep skills/ intact (same format)
Claude Code → Codex:
- Merge CLAUDE.md + .claude/ context → AGENTS.md (~100 lines)
- Convert .claude/hooks/ → CI-based enforcement
- Convert MCP servers → shell tool scripts
- Keep docs/ structure unchanged
- Keep .claude/skills/ as skills/ (same format)
Universal patterns that port directly:
- Architecture maps (docs/architecture/)
- ADRs (docs/decisions/)
- Evidence docs (docs/evidence/)
- Golden principles (encoded as enforcement, not prose)
- Skill routing patterns (negative examples, LOV, templates)
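The line budgets in the migration steps are easy to enforce in CI so the instruction file does not bloat after a migration. The sketch below is one possible check: the `BUDGETS` bounds interpret "50-120 lines" and "~100 lines" as hard limits and count every line including blanks, both of which are assumptions.

```python
from typing import Optional

# (min_lines, max_lines) per instruction file; bounds are an interpretation
# of "50-120 lines" and "~100 lines", not an official limit.
BUDGETS = {
    "CLAUDE.md": (50, 120),
    "AGENTS.md": (1, 100),
}


def check_budget(name: str, text: str) -> Optional[str]:
    """Return a failure message if the file is outside its budget, else None."""
    low, high = BUDGETS[name]
    count = len(text.splitlines())
    if count < low:
        return f"{name}: {count} lines, below the {low}-line minimum"
    if count > high:
        return f"{name}: {count} lines, above the {high}-line maximum; move detail into docs/"
    return None


# Demo with synthetic files; in CI you would read the real file from the repo.
print(check_budget("CLAUDE.md", "rule\n" * 80))
print(check_budget("CLAUDE.md", "rule\n" * 200))
```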
The goal is convergence. As standards like AGENTS.md, MCP, and Skills mature through ecosystem collaboration, these patterns become more portable. For now, adapt to your platform’s primitives but preserve the underlying principles.
Platform context: what changed in 2025
A short addendum since this article was first published in February. The platform picture moved in five directions worth noting, none of which change the lessons above:
- Data-engineering vendors made the harness their product. Google’s BigQuery Data Engineering Agent went GA on 22 April 2026. dbt’s Coalesce 2025 conference shipped dbt Agents (Developer, Discovery, Observability, Analyst) and the dbt Fusion MCP server. Snowflake released Managed MCP Servers. Microsoft Fabric framed itself as an “Agentic Fabric.” The substrate question this article opens with is now a vendor category.
- Type-system safety for agent code became a published research direction. Odersky’s group released the OPAW framework (Tracking Capabilities for Safer Agents, March 2026), Scalar 2026 ran How Can We Trust Our Agents? in April, and ACL Anthology 2025’s OMMM workshop published TypePilot: Leveraging the Scala type system for secure LLM-generated code. The harness frame in this article is general; the type-system version of it is now an active academic line.
- Production telemetry now exists to back the failure-mode claims. Datadog’s 2026 State of AI Engineering report (covering 1,000+ deployments) measured a 5 percent outright LLM-call failure rate and a 60 percent rate-limit share of agent errors. MIT-led research in April 2026 showed a hallucination-reasoning trade-off: stronger reasoning through RL increases tool-hallucination rates in lockstep with task gains. Numbers we previously inferred are now measured.
- The community turned Karpathy’s pitfall observations into an installable CLAUDE.md. His LLM coding pitfalls tweet (January 2026) described agent failure modes; andrej-karpathy-skills distilled them into four principles - think before coding, simplicity first, surgical changes, goal-driven execution - packaged as a single CLAUDE.md. The repo has 143,000+ stars and is in the Claude Code marketplace.
- Harness discipline extended to knowledge work. Karpathy’s LLM Knowledge Bases post (April 2026) described agents compiling and maintaining structured knowledge rather than code. The raw/wiki/operations pattern maps directly onto the same three lessons.
The full citations are at the bottom of this article under 2026 platform developments, Adjacent work, and Karpathy: practitioner patterns.
Reasoning: OpenAI and Anthropic approaches
OpenAI: Reasoning models evolved from o1/o3 research into production capabilities. By end of 2025, reasoning converged with general-purpose models in the GPT-5.2 family, with GPT-5.2 Pro for deeper reasoning workloads. “Think harder vs. respond faster” became a tunable developer decision - use Pro models for complex multi-step work, use standard models for fast iteration.
Anthropic: Claude 3 Opus and Claude 3.5 Sonnet prioritize extended reasoning through larger context windows (200K tokens) and extended thinking capabilities. Claude Desktop introduced Thinking Mode (early 2026) - explicitly showing reasoning steps before answering. Control via system prompts: “Think step-by-step” or “Show your reasoning.”
Convergence: Both ecosystems now treat reasoning depth as a tunable parameter rather than a model-specific feature.
Multimodality: OpenAI and Anthropic
OpenAI: PDFs and documents directly in the API (including PDF-by-URL without upload). Whisper for speech-to-text, TTS models for controllable text-to-speech. GPT Image 1.5 for higher-fidelity image generation and editing. Sora for video generation with temporal coherence. gpt-realtime for low-latency voice conversations. Image generation integrated into multi-turn conversations via tool calls.
Anthropic: Claude 3.5 Sonnet supports vision (images, PDFs, screenshots), with industry-leading OCR and chart comprehension. Claude Desktop supports drag-and-drop for images and PDFs. Artifacts support rich media up to 20MB including images, interactive charts, and structured documents. Audio and video capabilities announced for 2026.
Key difference: OpenAI’s multimodality is output-focused (generation via GPT Image 1.5, Sora). Claude’s is input-focused (comprehension via vision, OCR). For generation, use GPT Image/Sora via API with Claude handling orchestration and reasoning.
Agent-native APIs: OpenAI and Anthropic
OpenAI: The Responses API supports multiple inputs/outputs including different modalities, reasoning controls, and tool calling during reasoning. Open-source Agents SDK for Python and TypeScript - provider-agnostic, with documented paths for non-OpenAI models. AgentKit added Agent Builder, ChatKit, Connector Registry, and evaluation loops. Conversation state + Conversations API for durable threads. Connectors and MCP servers for external context. The Apps SDK extends MCP to let developers build UIs alongside MCP servers.
Anthropic: Claude Messages API supports multi-turn conversations with tool use, vision, and system prompts. Claude Agent SDK provides higher-level abstractions for agentic workflows. MCP (Model Context Protocol) is an open standard for AI-tool integrations - 50+ community servers available. Claude Desktop provides a native GUI for agent workflows. Prompt caching reduces input costs for repeated prefixes by up to 90%.
Convergence: Both adopted MCP as the standard integration protocol. Both provide multi-turn conversation APIs with tool use. OpenAI’s Agents SDK is provider-agnostic and works with Claude.
Coding assistants: Codex and Claude Code
OpenAI Codex: The open-source CLI brought agent-style coding into local environments. AGENTS.md support, MCP integration, sandboxing, and approval modes made it production-ready. Codex can be orchestrated via the Agents SDK by running the CLI as an MCP server. Codex Autofix in CI tightened the loop.
Claude Code: Production-ready AI coding assistant that understands entire codebases. Key features: CLAUDE.md auto-loading, hooks at 15 lifecycle events, specialized subagents, MCP server integration, automatic compaction for long sessions. Both CLI (terminal-first) and Desktop (GUI) modes available.
Key difference: Codex focuses on hosted/cloud-first execution with local fallback. Claude Code focuses on local-first execution with persistent project awareness. Both support MCP, both integrate with CI/CD, both provide approval modes.
Anthropic: Claude Code ecosystem matured
Claude Code launched as a production-ready AI coding assistant that understands entire codebases. Key milestones in 2025/early 2026:
- CLAUDE.md specification: Automatic loading of repo-specific instructions at session start
- MCP standardization: Model Context Protocol became the standard for AI-tool integrations, with a growing community ecosystem
- Hooks launched: 15 lifecycle events for deterministic automation (early 2026)
- Subagents: Specialized agents for security, testing, review
- Artifacts expanded: Up to 20MB storage per artifact, enabling stateful apps
- Desktop + CLI convergence: Both modes share skills, MCP servers, and configuration
- Expanded use cases: Claude Code for general knowledge work beyond software development
Eight trends defining how software gets built in 2026: agent orchestration replaced human-authored logic as the primary mode.
Both ecosystems: production concerns became system design
Prompt caching for latency and input costs on shared prefixes. Background mode for long-running responses without holding connections open. Webhooks for event-driven systems. Rate limits matured. Building agents is now as much about system design as prompting.
OpenAI: Responses API, background mode, /responses/compact endpoint.
Claude: Automatic compaction, artifacts for state, MCP for integration.
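To see why prompt caching is a system-design decision rather than a billing footnote, it helps to run the arithmetic for a shared prefix. The sketch below works in units of "one uncached input token". The multipliers (cached reads at 0.10x, matching the 90% figure quoted earlier, and a 1.25x surcharge on the first write) are assumptions to replace with your provider's current pricing, and the model assumes the cache stays warm across all calls.

```python
def input_cost(calls: int, prefix_tokens: int, suffix_tokens: int,
               cached: bool = True,
               read_mult: float = 0.10,    # assumed price of a cached prefix read
               write_mult: float = 1.25    # assumed surcharge for the first write
               ) -> float:
    """Total input cost, in units of one uncached input token."""
    if not cached or calls == 0:
        return float(calls * (prefix_tokens + suffix_tokens))
    first = prefix_tokens * write_mult + suffix_tokens
    rest = (calls - 1) * (prefix_tokens * read_mult + suffix_tokens)
    return first + rest


# Hypothetical agent run: 50 calls sharing a 10,000-token prefix
# (instructions + skills), each with 500 tokens of new input.
uncached = input_cost(50, 10_000, 500, cached=False)
cached = input_cost(50, 10_000, 500, cached=True)
savings = 1 - cached / uncached
print(f"uncached={uncached:.0f} cached={cached:.0f} savings={savings:.1%}")
```

Under these assumptions the saving lands below the headline 90% because the per-call suffix is never cached. That is the design lever: keep the stable part of the prompt (instruction file, skills, pinned constraints) at the front and the changing part small.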
Both ecosystems: open standards converged
OpenAI: Pushed AGENTS.md spec and participated in the Agentic AI Foundation (AAIF) alongside MCP and Skills standards. Released open-weight models: gpt-oss 120b & 20b for self-hosting, plus gpt-oss-safeguard 120b & 20b safety models. Apps SDK extends MCP for UI development.
Claude: CLAUDE.md + AGENTS.md compatibility, MCP as primary integration protocol, Skills aligned with the Agent Skills standard. Can use gpt-oss models via API or local deployment.
Shared infrastructure: Open standards (AGENTS.md, MCP, Skills via AAIF), evals frameworks, reinforcement fine-tuning (RFT), supervised fine-tuning, distillation patterns for pushing quality into smaller models.
Recommended models by task (end of 2025):
| Task | OpenAI | Anthropic | Notes |
|---|---|---|---|
| General-purpose (text + multimodal) | GPT-5.2 | Claude 3.5 Sonnet | For chat, long-context work, and multimodal inputs |
| Deep reasoning / reliability-sensitive | GPT-5.2 Pro | Claude 3 Opus | Planning and tasks where quality is worth extra compute |
| Coding and software engineering | GPT-5.2-Codex | Claude Code (Sonnet 3.5) | Code generation, review, repo-scale reasoning |
| Image generation and editing | GPT Image 1.5 | N/A | Higher-fidelity generation and iterative edits |
| Realtime voice | gpt-realtime | N/A | Low-latency speech-to-speech and live voice agents |
Key insights:
- OpenAI strengths: Native multimodal generation (GPT Image 1.5, video via Sora), realtime voice (gpt-realtime), reasoning capabilities (GPT-5.2 Pro)
- Claude strengths: Longer context windows (200K), superior code understanding, better at following complex instructions, MCP-first integration
- For harness engineering: Both ecosystems work. Choose based on your existing infrastructure, cost constraints, and specific task requirements. The patterns in this article apply to both.
Where this frame goes next
This article is deliberately tool-agnostic and domain-agnostic. The same frame applies with extra force in domains where the substrate is itself a system the agent has to reason about. The next piece in this series applies the harness model specifically to data engineering: the substrate becomes algebraic (typestate builders that refuse to compile half-configured pipelines, schema-policy lattices, retry-strategy monoids), and a Scalafix layer mechanically bans what the algebras cannot express. The architectural decision moves out of a document and into CI.
If your domain happens to be data, the in-progress companion article at /posts/2026/05/ai-orchestrated-de/ and the upcoming Scala Days 2026 talks pick up the thread.
References
OpenAI sources
- Harness engineering: https://openai.com/index/harness-engineering/
- Skills + shell + compaction tips: https://developers.openai.com/blog/skills-shell-tips
- 15 lessons building ChatGPT apps: https://developers.openai.com/blog/15-lessons-building-chatgpt-apps
- OpenAI for Developers in 2025: https://developers.openai.com/blog/openai-for-developers-2025
- Codex CLI: https://github.com/openai/codex
- OpenAI Agents SDK (Python): https://openai.github.io/openai-agents-python/
- OpenAI Agents SDK (TypeScript): https://openai.github.io/openai-agents-js/
- AgentKit: https://openai.com/index/introducing-agentkit/
- Aardvark: https://openai.com/index/introducing-aardvark/
- Skybridge Framework: https://github.com/alpic-ai/skybridge
Claude/Anthropic sources
- Claude Code Best Practices: https://code.claude.com/docs/en/best-practices
- Creating the Perfect CLAUDE.md: https://dometrain.com/blog/creating-the-perfect-claudemd-for-claude-code/
- Claude Code Hooks Guide: https://aiorg.dev/blog/claude-code-hooks
- Connect Claude Code to tools via MCP: https://code.claude.com/docs/en/mcp
- Building MCP Servers: https://www.sitepoint.com/building-mcp-servers-custom-context-for-claude-code/
- Claude Code Subagents: https://code.claude.com/docs/en/sub-agents
- Claude Compaction Documentation: https://platform.claude.com/docs/en/build-with-claude/compaction
- Claude Skills Documentation: https://code.claude.com/docs/en/skills
- Claude Skills Library: https://mcpservers.org/claude-skills
- Claude Artifacts Guide: https://support.claude.com/en/articles/9487310-what-are-artifacts-and-how-do-i-use-them
- Understanding Claude’s conversation Compacting: https://www.ajeetraina.com/understanding-claudes-conversation-compacting-a-deep-dive-into-context-management/
- Claude Code security best practices: https://www.mintmcp.com/blog/claude-code-security
- Eight trends defining software in 2026: https://claude.com/blog/eight-trends-defining-how-software-gets-built-in-2026
- Building agents with the Claude Agent SDK: https://www.anthropic.com/engineering/building-agents-with-the-claude-agent-sdk
- Complete guide to agentic coding in 2026: https://www.teamday.ai/blog/complete-guide-agentic-coding-2026
Community resources
- Awesome Claude Code: https://github.com/hesreallyhim/awesome-claude-code
- Everything Claude Code (production configs): https://github.com/affaan-m/everything-claude-code
- Claude Code Showcase: https://github.com/ChrisWiles/claude-code-showcase
- Claude Code hooks for production: https://www.pixelmojo.io/blogs/claude-code-hooks-production-quality-ci-cd-patterns
- How to use AGENTS.md in Claude Code: https://aiengineerguide.com/blog/how-to-use-agents-md-in-claude-code/
Universal patterns
- FlowForge: https://github.com/com-vitthalmirji/flowforge
- Code review operating system (companion article): Build a code review operating system
- AGENTS.md spec: https://agents.md/
- Parse, don’t validate (Alexis King): https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/
- AI is forcing us to write good code: https://bits.logic.inc/p/ai-is-forcing-us-to-write-good-code
- Ralph Wiggum Loop: https://ghuntley.com/loop/
- Good context, bad context: AGENTS.md for coding agents: https://addyosmani.com/blog/agents-md/
2026 platform developments (added May 2026)
- BigQuery Data Engineering Agent (GA 22 April 2026): https://cloud.google.com/bigquery/docs/data-engineering-agent
- dbt Coalesce 2025 announcements (dbt Agents, Fusion, MCP server): https://www.getdbt.com/blog/coalesce-2025-rewriting-the-future
- Snowflake Managed MCP Servers: https://www.snowflake.com/en/blog/managed-mcp-servers-secure-data-agents/
- Microsoft Fabric: Agentic Fabric (MCP as AI-native OS): https://blog.fabric.microsoft.com/en-us/blog/agentic-fabric-how-mcp-is-turning-your-data-platform-into-an-ai-native-operating-system
- Atlan MCP-Connected Data Catalog: https://atlan.com/know/mcp-connected-data-catalog/
- dbt Semantic Layer vs Text-to-SQL benchmark (2026): https://docs.getdbt.com/blog/semantic-layer-vs-text-to-sql-2026
- Data Engineering in 2026: 12 Predictions (Datafold): https://www.datafold.com/blog/data-engineering-in-2026-predictions/
Adjacent Scala-language and academic work
- Scalar 2026: How Can We Trust Our Agents? (Odersky framing): https://www.slideshare.net/slideshow/how-can-we-trust-our-agents-talk-given-at-scala3-2026/286692416
- OPAW: Tracking Capabilities for Safer Agents (March 2026): https://blog.georgovassilis.com/2026/03/13/opaw-tracking-capabilities-for-safer-agents/
- TypePilot: Leveraging the Scala type system for secure LLM-generated code (ACL Anthology 2025, OMMM workshop): https://aclanthology.org/2025.ommm-1.11/
- Hivemind Technologies: Prompting Safely (Lambda World 2025): https://github.com/HivemindTechnologies/scala-llms-dsl_final
- llm4s: Agentic and LLM Programming in Scala (under the Scala Center, GSoC 2025/2026): https://github.com/llm4s/llm4s
Production telemetry and failure-mode evidence
- Why AI Agents Break: A Field Analysis (Arize): https://arize.com/blog/common-ai-agent-failures/
- AI Agents for Data Engineering: 2026 reliability guide (Atlan): https://atlan.com/know/ai-agents-for-data-engineering/
- Detecting AI Agent Failure Modes in Production (Latitude): https://latitude.so/blog/ai-agent-failure-detection-guide
- Data Engineering Open Forum 2025 (recordings playlist): https://www.youtube.com/playlist?list=PLfXiENmg6yyXKICQiUNutmDyJKk84BVSP
Karpathy: practitioner patterns
- Andrej Karpathy - LLM coding pitfalls (January 2026): https://x.com/karpathy/status/2015883857489522876
- andrej-karpathy-skills - four-principle CLAUDE.md, also installable as a Claude Code plugin (143K+ stars): https://github.com/multica-ai/andrej-karpathy-skills
- Andrej Karpathy - LLM Knowledge Bases (April 2026): https://x.com/karpathy/status/2039805659525644595
- Karpathy LLM wiki gist (raw/wiki/operations pattern): https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f
If you’re staff or principal today, your highest-impact work isn’t writing excellent code anymore.
It’s in building a harness where excellent code becomes the default output of the system.
OpenAI shipped 1 million lines in 5 months with agents generating every line. They did it by investing in environment design, not implementation speed.
You can build the same harness with Claude Code. The patterns are identical. The tools differ. The outcomes are the same.
Read the companion: Build a code review operating system
You can start building your harness today. You’ve got the map for both ecosystems. Start with your instruction file (AGENTS.md or CLAUDE.md). The rest follows.
