Everything I Learned About Harness Engineering and AI Factories in San Francisco (April 2026)



I spent the last week of March 2026 in San Francisco talking to CTOs, CPOs, and engineering leaders from companies of every size about how they actually build with AI agents today. I met solo founders of pre-Series A startups, attended Y Combinator DevTool Day on March 27 and All Things Dev on March 31, sat down with our advisors, and had dozens of conversations with founders and tool builders working at the frontier.

This document is what I brought back. It is a field report: what I learned, what I think matters, and where the industry seems to be heading. It is also the reference document my team and I will use to structure how we adopt these practices ourselves.

The audience is startup founders, CTOs, CPOs, and senior engineers and product managers who are already past the "what is an LLM" stage and want to know what actually works in production. San Francisco is not the whole market, but it is often a leading indicator, and right now the signal is strong.

The terms below are overloaded, so I use them narrowly:

  • Model / LLM: The base intelligence layer: tokens in, tokens out. On its own it does not remember sessions, read your repo, run commands, or verify its work. An LLM is one specific kind of model.
  • Harness: Everything around the model: instructions, context, tools, runtime, permissions, review loops, verification.
  • Agent: A harnessed loop that can decide, act, observe, and continue until done or blocked.
  • Vibe coding: A low-structure accept-and-iterate workflow. Useful for exploration and prototypes. Weak for correctness, repeatable delivery, and regulated workflows.
  • AI factory: The org-level system that repeatedly turns intent into shipped work: issue framing, execution, review, deployment, telemetry, feedback. Partly engineering, partly product operations. Done well, it is what makes vibe-coding speed safe at scale.

This section is intentionally opinionated. These are not consensus statements. They are recurring arguments, observed shifts, and directional predictions heard across both conferences and in every conversation I had that week.

Productivity x10 since December 2025

This was a common framing, but it should not be presented as an audited universal benchmark.

The charitable and defensible version is:

  • The comparison several aggressive teams make is against December 2025 workflows, not against the pre-AI era.
  • In one quarter, models improved, harnesses improved, and orchestration improved at the same time.
  • The operating ceiling for one engineer with good agents feels materially different than it did a few months earlier.

Treat "10x" as a directional claim from fast adopters, not as settled measurement science.

Startups that don't adopt will die

This is rhetorical, but the underlying claim is serious.

What the statement is really pointing at:

  • The compounding advantage is not only code generation speed.
  • It is shorter build-review-ship-learn loops.
  • Teams that delay adoption entirely are not just slower at implementation; they are slower at learning.

The real decision is not "AI or no AI." The real decision is how much of the delivery loop remains human-led, and which work becomes agent-native now.

The rise of the "Builder"

The distinction between UI designer, UX researcher, product owner, and developer is collapsing. The recurring claim is that a new profile is emerging: the Builder, someone who owns the problem end-to-end and uses agents to cover the skills they lack.

  • A PM with no frontend experience ships a working UI change.
  • A designer pushes code, not just mockups.
  • A founder prototypes a full feature before involving the team.

The threshold for producing a first-pass pull request dropped so sharply that role boundaries stopped being the constraint. What matters now is not your job title but whether you can judge the output: does this diff belong in the product, is it correct, and is it coherent with everything else?

The bottleneck is moving to product strategy

When implementation gets cheaper, bad strategy gets more expensive.

The reason is simple:

  • Slow implementation used to absorb weak decisions.
  • Fast implementation removes that buffer.
  • Teams can now ship low-quality strategy much faster than before.

This is why product quality now depends more on prioritization discipline, not less.

The startup lifecycle is compressing

Agent-driven development compresses the time between:

  • hypothesis
  • first product
  • early traction
  • version-two confusion

You reach "the first vision is basically built, now what?" much faster.

That creates a new failure mode:

  • the company has engineering leverage
  • but it does not yet have strategic clarity for what to do with it

The result is feature volume without product direction.

The IDE is dead

Also rhetorical.

The stronger version is:

  • The center of gravity is moving from the editor to the agent console.
  • Editors still matter.
  • But for multi-step work, the critical surface is now orchestration, visibility, review, status, and control over parallel sessions.

The terminal wins whenever the work looks more like operating a system than typing code line by line.

There is no excuse not to run 24 hours a day

This follows directly from the previous point. If the compounding advantage is loop speed, then leaving agents idle overnight is a deliberate choice to slow that loop.

The argument is not about developer working hours. It is about asset utilization. Agents are infrastructure. Leaving them idle from 7pm to 9am is the equivalent of shutting down your CI pipeline every evening and restarting it in the morning.

The technical capability is no longer in question. Rakuten engineers ran Claude Code autonomously for seven hours on a 12.5-million-line codebase, achieving 99.9% accuracy. OpenAI published a Codex stress test that ran for 25 hours uninterrupted. These are logged runs, not demos.

What the strongest teams described:

  • Engineers push work at end of day. Agents pick up test writing, code review, refactoring, and security scans overnight.
  • By morning, the codebase has been tested, reviewed, and flagged. The engineer's first task is triage, not implementation.
  • Nothing merges without human approval. The overnight cycle produces candidates, not commits.
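The overnight pattern above can be sketched as a simple queue runner. This is a hypothetical helper, not any vendor's tool: `tasks` is a list of plain-text instructions pushed at end of day, and `runner` is whatever agent CLI you use (the default shown assumes a non-interactive prompt mode like Claude Code's `claude -p`, which you should verify against your tool's docs). Every run produces a candidate for morning triage, never a merge.

```python
import subprocess

def run_overnight(tasks, workdir, runner=None):
    """Run each queued task through an agent CLI, one run per task.

    `runner` is the command prefix for your agent CLI; the default is an
    assumption (Claude Code's print mode) and should be adapted.
    """
    runner = runner or ["claude", "-p"]  # assumption: adjust to your tool
    results = []
    for task in tasks:
        # Each run yields a candidate for morning triage, never an auto-merge.
        proc = subprocess.run(
            runner + [task],
            cwd=workdir,
            capture_output=True,
            text=True,
        )
        results.append({
            "task": task,
            "ok": proc.returncode == 0,
            "log": proc.stdout[-2000:],  # keep the tail for triage
        })
    return results
```

In practice you would launch this from a scheduler (cron, CI nightly job) and write `results` somewhere the morning triage can read.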

Do we need fewer PMs or more?

This is still the wrong framing. Three product people for fifteen engineers is more than enough: possibly too many. The old ratio of 1 PM per 5-7 engineers assumed the PM was the translation layer between business intent and technical execution. When agents eliminate most of that translation cost, the PM's value shifts entirely upstream.

What changes is not mainly the headcount math. It is the job shape.

Work that shrinks:

  • detailed ticket translation
  • backlog grooming as a communication bridge
  • implementation-level handholding

Work that grows:

  • market understanding
  • synthesis of customer signal
  • prioritization under much faster engineering throughput
  • deciding what not to build

The PM role moves upstream. Less project management. More judgment.

Tasks for me or for the agent?

| Usually better delegated to agents | Usually still human-led |
| --- | --- |
| Correctness sweeps | Where to start |
| Testing | Architecture |
| Error handling | Design direction and consistency |
| Debugging after reproduction | Abstraction boundaries |
| Boilerplate | Data model and API shape |
| Translation | Refactoring intent |
| Thoroughness | Product judgment |
| Repetitive implementation | Priority tradeoffs |

The practical question is not "can the model do this?" It is "what is the cost of a silent mistake here, and how cheaply can I detect it?"
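That question can be made mechanical. The sketch below is my own toy encoding of it, with illustrative thresholds that are not from the article: score the cost of a silent mistake and the cost of detecting one, then route.

```python
def delegate_to_agent(silent_mistake_cost, detection_cost):
    """Crude routing heuristic: delegate when a mistake is either cheap
    or easy to catch; keep the work human-led otherwise.
    Both costs are relative scores on a 1-10 scale (thresholds illustrative).
    """
    if silent_mistake_cost <= 3:   # low blast radius: boilerplate, tests
        return "agent"
    if detection_cost <= 3:        # CI or review will catch it cheaply
        return "agent-with-verification"
    return "human-led"             # architecture, data models, priorities
```

The exact numbers matter less than forcing the two questions to be answered per task instead of per job title.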

Model choice: Claude 4.6 vs GPT-5.4? You should use both

| Claude Opus 4.6 | GPT-5.4 in Codex |
| --- | --- |
| Better first-pass writing tone | Better implementation reliability |
| Better exploratory docs and explanation | Better verification, testing and final passes |
| Strong for frontend and UI taste | Strong for correctness-sensitive backend work |
| Strong for interactive computer use | Strong for long, tool-heavy execution in Codex |

This is a heuristic, not a law. The real point is to stop treating model choice as a religion and start treating it as task routing.

The strongest proof point: on March 30, 2026, OpenAI open-sourced codex-plugin-cc: an official plugin that lets you invoke Codex directly from Claude Code. OpenAI shipping a plugin inside a competitor's tool confirms the moat is the harness, not the model. They'd rather have Codex running inside Claude Code (collecting API charges per review) than have users not use Codex at all. The ecosystem is converging on interoperability, not lock-in.

The category is still moving fast. Overbuilding orchestration too early is an easy way to create your own internal product to maintain.

Harness engineering is not "writing a better prompt." It is the design of the system around the model so output quality depends less on raw model brilliance and more on structure.

Minimal AI Factory Architecture

If you strip the category down to its minimum useful shape, an AI factory has seven layers:

  1. Intent capture: Product request, bug, support signal, roadmap item, or internal need.
  2. Spec or issue framing: A bounded instruction with constraints, acceptance criteria, and links to context.
  3. Context and instruction layer: Repo guidance, scoped rules, skills, docs, APIs, and environment facts.
  4. Execution layer: One or more agents editing code, calling tools, and running commands.
  5. Verification layer: Tests, static analysis, review agents, CI, and human sign-off.
  6. Isolation and permission layer: Worktrees, sandboxes, runtime isolation, secret boundaries, and approval flows.
  7. Feedback layer: Production telemetry, customer signal, review outcomes, and repeated failures fed back into rules, prompts, or process.
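The seven layers above can be wired as a minimal pipeline. This is a sketch of the shape, not anyone's implementation: every name below is mine, and each layer is an injectable callable so it can be hardened independently.

```python
from dataclasses import dataclass, field

@dataclass
class WorkItem:
    intent: str                    # layer 1: raw request or signal
    spec: str = ""                 # layer 2: bounded, testable framing
    diff: str = ""                 # layer 4: agent-produced change
    checks_passed: bool = False    # layer 5: tests, CI, review agents
    approved: bool = False         # layer 5: human sign-off
    feedback: list = field(default_factory=list)  # layer 7

def run_factory(item, frame, execute, verify, approve):
    """Wire the layers as callables. Layer numbers match the list above;
    isolation (layer 6) is assumed to live inside `execute`."""
    item.spec = frame(item.intent)           # 2: issue framing
    item.diff = execute(item.spec)           # 3+4: context + execution
    item.checks_passed = verify(item.diff)   # 5: automated verification
    # Human approval is gated on checks; nothing merges without both.
    item.approved = item.checks_passed and approve(item.diff)
    if not item.approved:
        item.feedback.append("rejected")     # 7: feed failures back
    return item
```

The useful property is that a weak layer is visible as a weak callable: swapping a no-op `verify` for a real one is exactly the move from vibe coding to a factory.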

If one of these layers is weak, the whole system regresses:

  • No issue framing: fast implementation of vague intent.
  • No context discipline: expensive wandering.
  • No verification: vibe coding at scale.
  • No isolation: parallelism without control.
  • No feedback loop: repeated mistakes with better marketing.

Instructions, rules, plugins and skills

The important instruction artifacts are:

| Artifact | Primary use | Notes |
| --- | --- | --- |
| AGENTS.md | Shared project instructions across agent tools, auto-imported by Codex | Standard format used by all providers but Anthropic |
| CLAUDE.md | Same as AGENTS.md, auto-imported by Claude | Can symlink AGENTS.md |
| SKILL.md | Narrow, on-demand workflow or capability | Use for reusable task methods, not global policy |
| .cursor/rules/*.md | Cursor-specific structured rules | Useful when you need metadata or path scoping |

Plugin vs. Skill:

A skill is a single SKILL.md file invoked via slash command (/deploy). A plugin is a directory with a .claude-plugin/plugin.json manifest that bundles multiple skills, hooks, agents, and MCP configs into a distributable package (/plugin-name:command). Use skills for personal workflows. Use plugins when sharing across teams.
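For reference, a minimal skill might look like the sketch below. The `name` and `description` frontmatter fields are the commonly documented ones; the body content is entirely illustrative and would be replaced by your own workflow steps.

```markdown
---
name: deploy
description: Push the current branch to production after pre-flight checks
---

# Deploy

1. Run the full test suite and abort on any failure.
2. Confirm the branch is rebased on main.
3. Trigger the deploy pipeline and report the release URL.
```

Because the file only loads into context when `/deploy` is invoked, detail here is cheap in a way detail in the root file is not.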

ℹ️ Avoiding duplication between Claude Code and Codex: If you use both tools on the same repo, pick one source of truth:

  • Symlink (simplest): ln -sf AGENTS.md CLAUDE.md. Both filenames point to the same content. Zero drift.
  • Reference: Put @AGENTS.md inside your CLAUDE.md. Claude Code reads the referenced file inline. Add Claude-specific instructions below.
  • Pointer: Keep all shared instructions in AGENTS.md. Make CLAUDE.md a one-liner: READ AGENTS.md FIRST. Add overrides below.

Concrete architecture: multi-tool project

my-project/
├── AGENTS.md                          # Source of truth (shared instructions)
├── CLAUDE.md -> AGENTS.md             # Symlink for Claude Code
├── .claude/
│   ├── CLAUDE.md                      # Claude-specific overrides (optional)
│   ├── rules/
│   │   ├── testing.md                 # "Always run pytest before committing"
│   │   └── frontend.md               # "Use Tailwind, no inline styles"
│   └── skills/
│       ├── deploy/
│       │   └── SKILL.md              # /deploy: push to prod workflow
│       └── review/
│           └── SKILL.md              # /review: pre-landing PR checks
├── .cursor/
│   └── rules/
│       ├── base.md                    # Cursor-specific conventions
│       └── api.md                     # Path-gated to src/api/**
└── src/
    └── api/
        └── AGENTS.md                  # Directory-scoped: "All endpoints need auth"

What happens at session start:

  • Claude Code loads: CLAUDE.md (-> AGENTS.md via symlink) + .claude/CLAUDE.md + .claude/rules/*.md + skill names from .claude/skills/. When you type /deploy, the full deploy/SKILL.md loads into context.
  • Codex loads: AGENTS.md at root. When working in src/api/, also loads src/api/AGENTS.md. The .claude/ directory is ignored.
  • Cursor loads: .cursor/rules/*.md + AGENTS.md at root. The .claude/ directory is ignored.

Keep root context lean

The best recent corrective on context-file enthusiasm came from ETH Zurich: detailed repository context often increases cost and can reduce task success when it adds unnecessary requirements.

| Use the root file for | Do not use the root file for |
| --- | --- |
| Build, test, and lint commands | Generic clean-code slogans |
| Dangerous areas and non-obvious constraints | Style rules your formatter already enforces |
| Generated-code boundaries | README duplication |
| Migration or deployment cautions | Long architecture tutorials the agent can read elsewhere |
| Review and verification expectations | |

What matters in practice:

  • Keep one shared source of truth for durable project instructions.
  • Put tool-specific behavior only where it belongs.
  • Put local or path-specific constraints in narrower scopes, not in the root file.
  • Prefer on-demand skills for workflows that are occasionally needed, not always needed.

Verification beats advice

The rule of thumb is simple: if an error class recurs, stop describing it and start preventing it.

| Failure mode | Better fix |
| --- | --- |
| Agent stops too early | Explicit build-verify-fix loop |
| Agent forgets tests | Pre-completion verification hook plus CI |
| Agent edits the wrong area | Scoped instructions and path-specific rules |
| Agent repeats the same bug class | Linter, static rule, or regression test |
| Agent misses architectural context | Better issue framing and smaller task boundaries |

Example: LangChain published one of the clearest public examples of this pattern in February 2026: their coding agent moved from 52.8% to 66.5% on Terminal Bench 2.0 by changing the harness, not the model.

Review loops and context drift

Over time, agent-generated code drifts:

  • Conventions soften
  • Dead code accumulates
  • Review comments repeat
  • Context files become stale

Useful mitigations:

  • Automated review on every meaningful PR
  • A second model for high-stakes review when possible
  • Periodic cleanup of root instruction files
  • Tracing and postmortems on agent failures
  • Converting recurring review comments into deterministic checks
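As a toy instance of the last bullet: suppose reviewers keep flagging stray `print()` calls left in library code. Instead of re-explaining it in every review, encode it once as a deterministic check that runs in CI. The function below is an illustrative sketch using Python's standard `ast` module, not any existing linter:

```python
import ast

def find_stray_prints(source, filename="<diff>"):
    """Return (line, message) for every bare print() call: the kind of
    recurring review comment worth turning into a deterministic rule."""
    findings = []
    for node in ast.walk(ast.parse(source, filename)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "print"):
            findings.append((node.lineno, "use the logger, not print()"))
    return findings
```

Once it runs on every PR, the agent fails fast instead of a human repeating the comment; this is the same mechanism Taskless productizes with tree-sitter rules.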

Example: coding standards in AGENTS.md

# Global Coding Standards

1. **YAGNI**: Don't build it until you need it
2. **DRY**: Extract patterns after second duplication, not before
3. **Fail Fast**: Explicit errors beat silent failures
4. **Simple First**: Write the obvious solution, optimize only if needed
5. **Delete Aggressively**: Less code = fewer bugs
6. **Semantic Naming**: Always name variables, parameters, and API endpoints with verbose, self-documenting names that optimize for comprehension by both humans and LLMs, not brevity (e.g., `wait_until_obs_is_saved=true` vs `wait=true`)

Source: All Things Web @ WorkOS, 31st of March 2026

As mentioned in the hot takes, adopting harness engineering quickly is a matter of life or death for companies of any size. As Y Combinator framed it, the push has to come from the top: the founders, specifically those who own the technical and product roles, summarized as the CTO and CPO in the rest of this document. With that framing, the CTO controls how fast the org can ship. The CPO controls whether what ships is worth shipping. When agents make the CTO side 10x faster, every CPO mistake compounds 10x faster too.

First 30 days

Don't standardize on day one. Run agents on real work for two weeks and log every revert, rework, and rejection. Then build guardrails around the failure modes you actually saw: not hypothetical ones.

  • CTO: pick one harness (Claude Code or Codex, not both), add a minimal instruction file, require CI + automated review on all agent PRs, set a per-session cost alert.
  • CPO: rewrite issue templates around intent and success criteria (agents execute literally), define an explicit "do not build" list for the quarter, pull customer signal into written artifacts.
  • Together: review merged agent-assisted PRs weekly. Update process from real failures, not theory.

Autonomy tiers

Not all PRs need the same scrutiny. Start everything at full review. Promote downward only with evidence.

| Tier | Examples | Required before merge |
| --- | --- | --- |
| Full autonomy | Typo fixes, test additions, dependency bumps, boilerplate | CI + automated review |
| Light review | Feature work within established patterns, bug fixes with clear repro | CI + automated review + human skim (< 5 min) |
| Full review | New endpoints, data model changes, auth/payment flows | CI + automated review + thorough human review |
| Human-led | Schema migrations, infra changes, security-critical paths | Human writes or co-writes; agent assists |
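Tier assignment is easy to automate from a PR's changed paths. The sketch below is illustrative: the glob patterns are examples, and in practice you would derive them from the regressions you actually log.

```python
from fnmatch import fnmatch

# Path patterns per tier, strictest first. Example globs only; derive
# yours from the failure modes you have actually observed.
TIER_RULES = [
    ("human-led",    ["migrations/*", "infra/*", "security/*"]),
    ("full-review",  ["src/api/*", "src/auth/*", "src/payments/*"]),
    ("light-review", ["src/*"]),
]

def required_tier(changed_paths):
    """A PR inherits the strictest tier any of its files touches;
    anything unmatched (docs, tests, config) gets full autonomy."""
    for tier, patterns in TIER_RULES:
        for path in changed_paths:
            if any(fnmatch(path, p) for p in patterns):
                return tier
    return "full-autonomy"
```

Wiring this into CI as a required label is one way to make "promote downward only with evidence" a rule rather than a norm.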

Cadence

  • Weekly: review agent-authored regressions. Convert the top recurring mistake into a deterministic rule. Check whether issues were specific enough for agents to act without churn.
  • Monthly: reclassify work across autonomy tiers. Remove dead rules and stale instructions. Audit feature velocity vs. feature impact: are we shipping noise?
  • Quarterly: revisit the stack, permission model, cost structure, and PM staffing ratio.

Metrics

  • Lead time from issue to merged PR
  • Agent autonomy rate (% of tasks without human intervention)
  • Reopen and rollback rate on agent-authored changes
  • Wasted work rate (features reverted or unused within 30 days)
  • Issue clarity (% of issues agents can act on without clarification)
  • Monthly agent API cost per engineer
  • Cycle time from customer signal to shipped outcome

The point is not to install everything below. The point is to identify the bottleneck you actually have.

The winning stack pattern

This is the stack pattern I would describe as convergent, not mandatory:

| Layer | Standard choice | Why it keeps showing up |
| --- | --- | --- |
| Source of truth | GitHub | Claude Code authors ~4% of all public commits (~135K/day). Every agent tool produces PRs against GitHub repos. The entire agent factory pattern assumes Git and GitHub as the substrate. |
| Planning | Linear | Declared "issue tracking is dead" (March 2026). Coding agents installed in 75% of enterprise workspaces. Deeplinks send issue context directly into Claude Code, Cursor, or Copilot as prefilled prompts. Agent work volume up 5x in three months. |
| Trigger and coordination | Slack | Non-engineers describe a problem or request in Slack; an MCP integration routes it to an agent that opens a PR. The barrier drops from "file a ticket" to "describe it in a message." |
| Thinking and notes | Obsidian | Local markdown files that agents can read via MCP. Where intent gets structured before it becomes an issue or a prompt. |
| Runtime | Cloudflare Agents | Agents SDK, Durable Objects for state, Workflows for long-running tasks. Workers AI runs frontier models on-platform with 77% cost reduction on 7B token/day workloads vs. external API calls. |
| Observability | Sentry | Error tracking plus LLM-specific monitoring: agent runs, tool calls, token usage, conversation replay. Also maintains Claude Code agent skills (iterate-pr, code review): sits on both sides of the workflow. |
| Business signal | HubSpot | Customer feedback, support tickets, and sales conversations flow into the planning layer, giving agents business context for what to build next. |

Terminal & orchestration

| Tool | Bottleneck it solves | Why it matters |
| --- | --- | --- |
| cmux / repo | 5+ agent sessions with no status visibility: constant tab-switching | macOS-native terminal with GPU-accelerated rendering (libghostty), per-agent green/yellow/red status indicators, git branch + PR status per workspace. Works with Claude Code, Codex, Gemini CLI. |
| Superset / repo | Parallel agents stepping on each other's files and git state | Git worktree isolation per agent. Each agent gets its own sandbox with no shared mutable state. Launched March 2026. |
| Conductor | Running agents sequentially: throughput capped at 1x | Orchestration layer from gstack. Runs multiple Claude Code sessions in parallel, each in its own isolated workspace. Garry Tan regularly runs 10-15 parallel sprints. |
| Claude Manager | Losing track of which Claude session is running, waiting, or finished | Rust TUI that organizes sessions by project/task hierarchy. Live status indicators, diff preview without attaching, worktree lifecycle management. First published March 2026. |

Spec & planning

| Tool | Bottleneck it solves | Why it matters |
| --- | --- | --- |
| OpenSpec | Agents coding before the problem is well-defined: expensive iterations on work that doesn't match intent | Three-phase state machine (proposal, apply, archive). Agent must produce a ~250-line spec before writing code. Supports Claude Code, Cursor, Copilot, and 20+ tools. 27K+ stars, YC-backed. |

Quality & review

| Tool | Bottleneck it solves | Why it matters |
| --- | --- | --- |
| Codex plugin for Claude Code | Want a second opinion from a different model without leaving Claude Code | OpenAI's official plugin (open-sourced March 30, 2026). Adds /codex:review and /codex:adversarial-review. Uses the same harness as Codex itself. Runs in background using your ChatGPT subscription. |
| CodeRabbit | PR reviews are slow (waiting for humans) or shallow (humans skim large diffs) | Always-on AI review on every PR. 13M+ PRs reviewed, 2-3M connected repos, 75M defects found. GitHub/GitLab/Azure DevOps/Bitbucket. Free tier available, SOC 2 Type II. |
| Taskless | Agent keeps making the same class of mistake: you fix it once but nothing prevents it from reappearing | Converts code review corrections into deterministic syntax-tree rules (tree-sitter). Tag @taskless on a PR or file an issue; it creates a pass/fail rule that runs on every PR, in every IDE, on every run. Same result every time: not AI opinions, not prompt engineering. 25+ languages, zero instrumentation. |
| Sentry iterate-pr | Manual PR-fix-CI loops: developer re-runs checks, reads logs, applies fix, resubmits | Encodes the fix-CI-resubmit loop as a reusable skill. Agent detects failures, applies fixes, and re-runs checks without human intervention. Good reference for encoding any mechanical review iteration as a skill. |
| gstack | No structured review/QA patterns beyond basic linting | Pattern library, not a package: role-based review, directory freezes, visual QA, pre-landing checks. Steal the patterns that match your failure mode, ignore the rest. |

Context & memory

| Tool | Bottleneck it solves | Why it matters |
| --- | --- | --- |
| Claude-Mem | Sessions are stateless: everything the agent learned is lost when the session ends | Auto-captures session activity, compresses it with AI (agent-sdk), injects relevant context into future sessions. Adds dynamic, session-derived memory on top of static CLAUDE.md files. 44K+ stars. |

Runtime isolation

| Tool | Bottleneck it solves | Why it matters |
| --- | --- | --- |
| Coasts / repo | Two agents both running localhost:3000: port collisions block parallel testing | Each worktree gets its own containerized runtime with dynamic port assignment. Agnostic to AI providers. Single config file. |
| Docker-in-Docker / Docker Sandboxes | Need N isolated full-stack copies (app, database, workers) per agent | Docker Compose with per-agent port mappings. Docker Desktop 4.60+ supports Sandboxes in dedicated microVMs with network isolation. Heavier than Coasts but gives full stack isolation. |

Not all of these belong in a default stack. They are still worth tracking because they attack real bottlenecks.

| Tool | What it does | Why it's interesting |
| --- | --- | --- |
| Ghost | Instant, ephemeral Postgres databases: agents spin them up like git branches. MCP/CLI only, no UI. | Standard SQL, no proprietary SDK. 100 hrs/month free. Pairs with Memory Engine, TigerFS, and Ox (sandboxed execution), all Postgres-native. |
| fp | CLI-first, local-first issue tracking for Claude Code. /fp-plan, /fp-execute, /fp-review. | Local code review interface that sends inline comments back to the agent. No external service required. Mac desktop app. |
| GitButler | Parallel branches in a single working directory via virtual branching: no worktree directories. | Assign file changes to different branches visually. All branches start from the same state, guaranteed to merge cleanly. Lighter than worktree-based isolation. |
| FinalRun | Vision-based mobile testing on real iOS/Android devices. Test cases written in plain English. | 76.7% on Android World Benchmark (116 tasks): ahead of DeepSeek, Alibaba, ByteDance agents. ~99% flaky-free. 2-person startup. |
| SuperBuilder | Mac-native command center for Claude Code with per-message cost tracking, rate-limit queuing, and Branch Battle. | Free, BYOK. Tracks cost per thread/project, queues tasks through rate limits, compares two approaches side by side. |
| AgentsMesh | Remote AgentPods for running multiple coding agents (Claude Code, Codex, Gemini CLI, Aider, OpenCode). | Self-hosted runners, gRPC + mTLS control plane, Kanban with ticket-to-pod binding. One dev built 965K lines in 52 days using it. |
| Ghostgres | Experimental Postgres fork from Timescale: "there are no dumb queries, only dumb databases." | Early-stage (32 stars), but Timescale's broader push includes pgai (embeddings + NL-to-SQL in Postgres) and Ox (agent sandbox TUI). |

*** This is a Security Bloggers Network syndicated blog from Escape - Application Security & Offensive Security Blog authored by Antoine Carossio. Read the original post at: https://escape.tech/blog/everything-i-learned-about-harness-engineering-and-ai-factories-in-san-francisco-april-2026/

