
I spent the last week of March 2026 in San Francisco talking to CTOs, CPOs, and engineering leaders from companies of every size about how they actually build with AI agents today. I met solo founders of pre-Series A startups, attended Y Combinator DevTool Day on March 27 and All Things Dev on March 31, sat down with our advisors, and had dozens of conversations with founders and tool builders working at the frontier.
This document is what I brought back. It is a field report: what I learned, what I think matters, and where the industry seems to be heading. It is also the reference document my team and I will use to structure how we adopt these practices ourselves.
The audience is startup founders, CTOs, CPOs, and senior engineers and product managers who are already past the "what is an LLM" stage and want to know what actually works in production. San Francisco is not the whole market, but it is often a leading indicator, and right now the signal is strong.
The terms below are overloaded, so I use them narrowly:
This section is intentionally opinionated. These are not consensus statements. They are recurring arguments, observed shifts, and directional predictions heard across both conferences and in every conversation I had that week.
This was a common framing, but it should not be presented as an audited universal benchmark.
The charitable and defensible version is:
Treat "10x" as a directional claim from fast adopters, not as settled measurement science.
This is rhetorical, but the underlying claim is serious.
What the statement is really pointing at:
The real decision is not "AI or no AI." The real decision is how much of the delivery loop remains human-led, and which work becomes agent-native now.
The distinction between UI designer, UX researcher, product owner, and developer is collapsing. The recurring claim is that a new profile is emerging: the Builder, someone who owns the problem end-to-end and uses agents to cover the skills they lack.
The threshold for producing a first-pass pull request dropped so sharply that role boundaries stopped being the constraint. What matters now is not your job title but whether you can judge the output: does this diff belong in the product, is it correct, and is it coherent with everything else?
When implementation gets cheaper, bad strategy gets more expensive.
The reason is simple:
This is why product quality now depends more on prioritization discipline, not less.
Agent-driven development compresses the time between:
You reach "the first vision is basically built, now what?" much faster.
That creates a new failure mode:
The result is feature volume without product direction.
Also rhetorical.
The stronger version is:
The terminal wins whenever the work looks more like operating a system than typing code line by line.
This follows directly from the previous point. If the compounding advantage is loop speed, then leaving agents idle overnight is a deliberate choice to slow that loop.
The argument is not about developer working hours. It is about asset utilization. Agents are infrastructure. Leaving them idle from 7pm to 9am is the equivalent of shutting down your CI pipeline every evening and restarting it in the morning.
The technical capability is no longer in question. Rakuten engineers ran Claude Code autonomously for seven hours on a 12.5-million-line codebase, achieving 99.9% accuracy. OpenAI published a Codex stress test that ran for 25 hours uninterrupted. These are logged runs, not demos.
What the strongest teams described:
This is still the wrong framing. Three product people for fifteen engineers is more than enough: possibly too many. The old ratio of 1 PM per 5-7 engineers assumed the PM was the translation layer between business intent and technical execution. When agents eliminate most of that translation cost, the PM's value shifts entirely upstream.
What changes is not mainly the headcount math. It is the job shape.
Work that shrinks:
Work that grows:
The PM role moves upstream. Less project management. More judgment.
| Usually better delegated to agents | Usually still human-led |
|---|---|
| Correctness sweeps | Where to start |
| Testing | Architecture |
| Error handling | Design direction and consistency |
| Debugging after reproduction | Abstraction boundaries |
| Boilerplate | Data model and API shape |
| Translation | Refactoring intent |
| Thoroughness | Product judgment |
| Repetitive implementation | Priority tradeoffs |
The practical question is not "can the model do this?" It is "what is the cost of a silent mistake here, and how cheaply can I detect it?"
| Claude Opus 4.6 | GPT-5.4 in Codex |
|---|---|
| Better first-pass writing tone | Better implementation reliability |
| Better exploratory docs and explanation | Better verification, testing and final passes |
| Strong for frontend and UI taste | Strong for correctness-sensitive backend work |
| Strong for interactive computer use | Strong for long, tool-heavy execution in Codex |
This is a heuristic, not a law. The real point is to stop treating model choice as a religion and start treating it as task routing.
The strongest proof point: on March 30, 2026, OpenAI open-sourced codex-plugin-cc: an official plugin that lets you invoke Codex directly from Claude Code. OpenAI shipping a plugin inside a competitor's tool confirms the moat is the harness, not the model. They'd rather have Codex running inside Claude Code (collecting API charges per review) than have users not use Codex at all. The ecosystem is converging on interoperability, not lock-in.
The category is still moving fast. Overbuilding orchestration too early is an easy way to create your own internal product to maintain.
Harness engineering is not "writing a better prompt." It is the design of the system around the model so output quality depends less on raw model brilliance and more on structure.
If you strip the category down to its minimum useful shape, an AI factory has seven layers:
If one of these layers is weak, the whole system regresses:
The important instruction artifacts are:
| Artifact | Primary use | Notes |
|---|---|---|
| `AGENTS.md` | Shared project instructions across agent tools, auto-imported by Codex. | Standard format used by all providers but Anthropic |
| `CLAUDE.md` | Same as `AGENTS.md`, auto-imported by Claude. | Can symlink `AGENTS.md` |
| `SKILL.md` | Narrow, on-demand workflow or capability | Use for reusable task methods, not global policy |
| `.cursor/rules/*.md` | Cursor-specific structured rules | Useful when you need metadata or path scoping |
Plugin vs. Skill:
A skill is a single `SKILL.md` file invoked via slash command (`/deploy`). A plugin is a directory with a `.claude-plugin/plugin.json` manifest that bundles multiple skills, hooks, agents, and MCP configs into a distributable package (`/plugin-name:command`). Use skills for personal workflows. Use plugins when sharing across teams.
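For orientation, the manifest itself is small. A minimal sketch (field values are illustrative; check the current Claude Code plugin documentation for the authoritative schema):

```json
{
  "name": "my-plugin",
  "description": "Team deploy and review workflows",
  "version": "0.1.0"
}
```

Everything else in the plugin directory (skills, hooks, agents, MCP configs) hangs off this manifest; the commands it exposes are namespaced as `/my-plugin:deploy`, `/my-plugin:review`, and so on.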
ℹ️ Avoiding duplication between Claude Code and Codex: if you use both tools on the same repo, pick one source of truth:

- Symlink: `ln -sf AGENTS.md CLAUDE.md`. Both filenames point to the same content. Zero drift.
- Reference: put `@AGENTS.md` inside your CLAUDE.md. Claude Code reads the referenced file inline. Add Claude-specific instructions below it.
- Instruction: start CLAUDE.md with `READ AGENTS.md FIRST`. Add overrides below.

Concrete architecture: multi-tool project
my-project/
├── AGENTS.md # Source of truth (shared instructions)
├── CLAUDE.md -> AGENTS.md # Symlink for Claude Code
├── .claude/
│ ├── CLAUDE.md # Claude-specific overrides (optional)
│ ├── rules/
│ │ ├── testing.md # "Always run pytest before committing"
│ │ └── frontend.md # "Use Tailwind, no inline styles"
│ └── skills/
│ ├── deploy/
│ │ └── SKILL.md # /deploy: push to prod workflow
│ └── review/
│ └── SKILL.md # /review: pre-landing PR checks
├── .cursor/
│ └── rules/
│ ├── base.md # Cursor-specific conventions
│ └── api.md # Path-gated to src/api/**
└── src/
└── api/
└── AGENTS.md # Directory-scoped: "All endpoints need auth"
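The symlink in the tree above is one command to set up. A quick sketch, run from the repo root, that also verifies both filenames stay in sync (the stub content is illustrative):

```shell
# Make AGENTS.md the single source of truth and point CLAUDE.md at it.
# (Creates a stub AGENTS.md only if none exists yet.)
[ -f AGENTS.md ] || printf '# Project instructions\n' > AGENTS.md
ln -sf AGENTS.md CLAUDE.md   # symlink: both names, one file, zero drift

# Sanity check: both names must resolve to identical content.
cmp -s AGENTS.md CLAUDE.md && echo "in sync"
```

Because `CLAUDE.md` is a symlink rather than a copy, there is no second file to forget to update.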
What happens at session start:

- Claude Code loads `CLAUDE.md` (-> `AGENTS.md` via symlink) + `.claude/CLAUDE.md` + `.claude/rules/*.md` + the skill names from `.claude/skills/`. When you type `/deploy`, the full `deploy/SKILL.md` loads into context.
- Codex loads `AGENTS.md` at root. When working in `src/api/`, it also loads `src/api/AGENTS.md`. The `.claude/` directory is ignored.
- Cursor loads `.cursor/rules/*.md` + `AGENTS.md` at root. The `.claude/` directory is ignored.

The best recent corrective on context-file enthusiasm came from ETH Zurich: detailed repository context often increases cost and can reduce task success when it adds unnecessary requirements.
| Use the root file for | Do not use the root file for |
|---|---|
| Build, test, and lint commands | Generic clean-code slogans |
| Dangerous areas and non-obvious constraints | Style rules your formatter already enforces |
| Generated-code boundaries | README duplication |
| Migration or deployment cautions | Long architecture tutorials the agent can read elsewhere |
| Review and verification expectations | |
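Applied to the table above, a root file in this spirit stays short and operational. A sketch (commands, paths, and constraints are illustrative, not from any specific repo):

```markdown
# AGENTS.md

## Commands
- Build: `make build` / Test: `make test` / Lint: `make lint`
- Run a single test: `make test T=path/to/test`

## Constraints
- `src/generated/` is machine-generated: never edit by hand
- Every DB migration needs a reversible down-migration
- Payment code in `src/billing/` requires a human reviewer on every PR

## Before finishing
- Run lint and tests; if anything still fails, say so in the PR description
```

Note what is absent: no clean-code slogans, no style rules the formatter enforces, no architecture essay. Everything here is either a command the agent must run or a constraint it cannot infer from the code.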
What matters in practice:
The rule of thumb is simple: if an error class recurs, stop describing it and start preventing it.
| Failure mode | Better fix |
|---|---|
| Agent stops too early | Explicit build-verify-fix loop |
| Agent forgets tests | Pre-completion verification hook plus CI |
| Agent edits the wrong area | Scoped instructions and path-specific rules |
| Agent repeats the same bug class | Linter, static rule, or regression test |
| Agent misses architectural context | Better issue framing and smaller task boundaries |
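One way to turn "stop describing, start preventing" into mechanism is a single deterministic gate that both CI and the agent's completion hook run. A minimal sketch in shell; the check commands passed in are placeholders for your repo's real lint, test, and build steps:

```shell
# Run each check in order; stop at the first failure and report which one.
# Usage: verify "make lint" "make test" "make build"
verify() {
  for check in "$@"; do
    if ! sh -c "$check" >/dev/null 2>&1; then
      echo "FAILED: $check"
      return 1
    fi
  done
  echo "verified"
}
```

The point of routing everything through one function is that the agent, CI, and humans all see the identical pass/fail signal, so an error class that recurs gets a new check here rather than a new paragraph in the instructions file.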
Example: LangChain published one of the clearest public examples of this pattern in February 2026: their coding agent moved from 52.8% to 66.5% on Terminal Bench 2.0 by changing the harness, not the model.
Over time, agent-generated code drifts:
Useful mitigations:
# Global Coding Standards
1. **YAGNI**: Don't build it until you need it
2. **DRY**: Extract patterns after second duplication, not before
3. **Fail Fast**: Explicit errors beat silent failures
4. **Simple First**: Write the obvious solution, optimize only if needed
5. **Delete Aggressively**: Less code = fewer bugs
6. **Semantic Naming**: Always name variables, parameters, and API endpoints with verbose, self-documenting names that optimize for comprehension by both humans and LLMs, not brevity (e.g., `wait_until_obs_is_saved=true` vs `wait=true`)
Source: All Things Web @ WorkOS, March 31, 2026
As mentioned in the hot takes, adopting harness engineering quickly is a matter of survival for companies of every size. As Y Combinator framed it, the push has to come from the top, from the founders, and specifically from those who own the technical and product roles, summarized as the CTO and CPO in the rest of this document. With that framing, the CTO controls how fast the org can ship. The CPO controls whether what ships is worth shipping. When agents make the CTO side 10x faster, every CPO mistake compounds 10x faster too.
Don't standardize on day one. Run agents on real work for two weeks and log every revert, rework, and rejection. Then build guardrails around the failure modes you actually saw: not hypothetical ones.
Not all PRs need the same scrutiny. Start everything at full review. Promote downward only with evidence.
| Tier | Examples | Required before merge |
|---|---|---|
| Full autonomy | Typo fixes, test additions, dependency bumps, boilerplate | CI + automated review |
| Light review | Feature work within established patterns, bug fixes with clear repro | CI + automated review + human skim (< 5 min) |
| Full review | New endpoints, data model changes, auth/payment flows | CI + automated review + thorough human review |
| Human-led | Schema migrations, infra changes, security-critical paths | Human writes or co-writes. Agent assists. |
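Teams that use tiers like these eventually encode the triage mechanically, for example as a small script CI runs over the changed file list. A sketch; the path patterns are illustrative and would need to match your own repo layout:

```shell
# Map a PR's changed file paths to the strictest review tier they touch.
# Usage: review_tier file1 file2 ...  -> prints the tier name
review_tier() {
  tier="full_autonomy"
  for f in "$@"; do
    case "$f" in
      migrations/*|infra/*|*auth*|*payment*) tier="human_led"; break ;;
      src/api/*) tier="full_review" ;;            # endpoints, data model
      *test*|*.md|*.lock) ;;                      # docs/tests: keep current tier
      *) [ "$tier" = "full_autonomy" ] && tier="light_review" ;;
    esac
  done
  echo "$tier"
}
```

The break on the human-led branch matters: one touched migration file should force the whole PR up to the strictest tier, regardless of what else changed.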
The point is not to install everything below. The point is to identify the bottleneck you actually have.
This is the stack pattern I would describe as convergent, not mandatory:
| Layer | Standard choice | Why it keeps showing up |
|---|---|---|
| Source of truth | GitHub | Claude Code authors ~4% of all public commits (~135K/day). Every agent tool produces PRs against GitHub repos. The entire agent factory pattern assumes Git and GitHub as the substrate. |
| Planning | Linear | Declared "issue tracking is dead" (March 2026). Coding agents installed in 75% of enterprise workspaces. Deeplinks send issue context directly into Claude Code, Cursor, or Copilot as prefilled prompts. Agent work volume up 5x in three months. |
| Trigger and coordination | Slack | Non-engineers describe a problem or request in Slack; an MCP integration routes it to an agent that opens a PR. The barrier drops from "file a ticket" to "describe it in a message." |
| Thinking and notes | Obsidian | Local markdown files that agents can read via MCP. Where intent gets structured before it becomes an issue or a prompt. |
| Runtime | Cloudflare Agents | Agents SDK, Durable Objects for state, Workflows for long-running tasks. Workers AI runs frontier models on-platform with 77% cost reduction on 7B token/day workloads vs. external API calls. |
| Observability | Sentry | Error tracking plus LLM-specific monitoring: agent runs, tool calls, token usage, conversation replay. Also maintains Claude Code agent skills (iterate-pr, code review): sits on both sides of the workflow. |
| Business signal | HubSpot | Customer feedback, support tickets, and sales conversations flow into the planning layer, giving agents business context for what to build next. |
| Tool | Bottleneck it solves | Why it matters |
|---|---|---|
| cmux / repo | 5+ agent sessions with no status visibility: constant tab-switching | macOS-native terminal with GPU-accelerated rendering (libghostty), per-agent green/yellow/red status indicators, git branch + PR status per workspace. Works with Claude Code, Codex, Gemini CLI. |
| Superset / repo | Parallel agents stepping on each other's files and git state | Git worktree isolation per agent. Each agent gets its own sandbox with no shared mutable state. Launched March 2026. |
| Conductor | Running agents sequentially: throughput capped at 1x | Orchestration layer from gstack. Runs multiple Claude Code sessions in parallel, each in its own isolated workspace. Garry Tan regularly runs 10-15 parallel sprints. |
| Claude Manager | Losing track of which Claude session is running, waiting, or finished | Rust TUI that organizes sessions by project/task hierarchy. Live status indicators, diff preview without attaching, worktree lifecycle management. First published March 2026. |
| Tool | Bottleneck it solves | Why it matters |
|---|---|---|
| OpenSpec | Agents coding before the problem is well-defined: expensive iterations on work that doesn't match intent | Three-phase state machine (proposal, apply, archive). Agent must produce a ~250-line spec before writing code. Supports Claude Code, Cursor, Copilot, and 20+ tools. 27K+ stars, YC-backed. |
| Tool | Bottleneck it solves | Why it matters |
|---|---|---|
| Codex plugin for Claude Code | Want a second opinion from a different model without leaving Claude Code | OpenAI's official plugin (open-sourced March 30, 2026). Adds /codex:review and /codex:adversarial-review. Uses the same harness as Codex itself. Runs in background using your ChatGPT subscription. |
| CodeRabbit | PR reviews are slow (waiting for humans) or shallow (humans skim large diffs) | Always-on AI review on every PR. 13M+ PRs reviewed, 2-3M connected repos, 75M defects found. GitHub/GitLab/Azure DevOps/Bitbucket. Free tier available, SOC 2 Type II. |
| Taskless | Agent keeps making the same class of mistake: you fix it once but nothing prevents it from reappearing | Converts code review corrections into deterministic syntax-tree rules (tree-sitter). Tag @taskless on a PR or file an issue; it creates a pass/fail rule that runs on every PR, in every IDE, on every run. Same result every time: not AI opinions, not prompt engineering. 25+ languages, zero instrumentation. |
| Sentry iterate-pr | Manual PR-fix-CI loops: developer re-runs checks, reads logs, applies fix, resubmits | Encodes the fix-CI-resubmit loop as a reusable skill. Agent detects failures, applies fixes, and re-runs checks without human intervention. Good reference for encoding any mechanical review iteration as a skill. |
| gstack | No structured review/QA patterns beyond basic linting | Pattern library, not a package: role-based review, directory freezes, visual QA, pre-landing checks. Steal the patterns that match your failure mode, ignore the rest. |
| Tool | Bottleneck it solves | Why it matters |
|---|---|---|
| Claude-Mem | Sessions are stateless: everything the agent learned is lost when the session ends | Auto-captures session activity, compresses it with AI (agent-sdk), injects relevant context into future sessions. Adds dynamic, session-derived memory on top of static CLAUDE.md files. 44K+ stars. |
| Tool | Bottleneck it solves | Why it matters |
|---|---|---|
| Coasts / repo | Two agents both running `localhost:3000`: port collisions block parallel testing | Each worktree gets its own containerized runtime with dynamic port assignment. Agnostic to AI providers. Single config file. |
| Docker-in-Docker / Docker Sandboxes | Need N isolated full-stack copies (app, database, workers) per agent | Docker Compose with per-agent port mappings. Docker Desktop 4.60+ supports Sandboxes in dedicated microVMs with network isolation. Heavier than Coasts but gives full stack isolation. |
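The per-agent Compose pattern is mostly environment-variable interpolation plus project-name isolation. A sketch; service names and ports are illustrative:

```yaml
# docker-compose.yml — one full stack per agent, isolated by project name.
services:
  app:
    build: .
    ports:
      - "${AGENT_PORT:-3000}:3000"   # host port varies per agent
    depends_on: [db]
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: dev
```

Each agent then launches its own copy with something like `AGENT_PORT=3101 docker compose -p agent-1 up -d`; the `-p` project name keeps containers, networks, and volumes separate, so N stacks run side by side without colliding.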
Not all of these belong in a default stack. They are still worth tracking because they attack real bottlenecks.
| Tool | What it does | Why it's interesting |
|---|---|---|
| Ghost | Instant, ephemeral Postgres databases: agents spin them up like git branches. MCP/CLI only, no UI. | Standard SQL, no proprietary SDK. 100 hrs/month free. Pairs with Memory Engine, TigerFS, and Ox (sandboxed execution), all Postgres-native. |
| fp | CLI-first, local-first issue tracking for Claude Code. `/fp-plan`, `/fp-execute`, `/fp-review`. | Local code review interface that sends inline comments back to the agent. No external service required. Mac desktop app. |
| GitButler | Parallel branches in a single working directory via virtual branching: no worktree directories. | Assign file changes to different branches visually. All branches start from the same state, guaranteed to merge cleanly. Lighter than worktree-based isolation. |
| FinalRun | Vision-based mobile testing on real iOS/Android devices. Test cases written in plain English. | 76.7% on Android World Benchmark (116 tasks): ahead of DeepSeek, Alibaba, ByteDance agents. ~99% flaky-free. 2-person startup. |
| SuperBuilder | Mac-native command center for Claude Code with per-message cost tracking, rate-limit queuing, and Branch Battle. | Free, BYOK. Tracks cost per thread/project, queues tasks through rate limits, compares two approaches side by side. |
| AgentsMesh | Remote AgentPods for running multiple coding agents (Claude Code, Codex, Gemini CLI, Aider, OpenCode). | Self-hosted runners, gRPC + mTLS control plane, Kanban with ticket-to-pod binding. One dev built 965K lines in 52 days using it. |
| Ghostgres | Experimental Postgres fork from Timescale: "there are no dumb queries, only dumb databases." | Early-stage (32 stars), but Timescale's broader push includes pgai (embeddings + NL-to-SQL in Postgres) and Ox (agent sandbox TUI). |
*This is a Security Bloggers Network syndicated blog from the Escape - Application Security & Offensive Security Blog, authored by Antoine Carossio. Read the original post at: https://escape.tech/blog/everything-i-learned-about-harness-engineering-and-ai-factories-in-san-francisco-april-2026/*