The Harness Is the Product: Inside Cursor Cloud Agents' Real Architecture

How Cursor turned VM isolation, video artifacts, and model racing into a $29.3B bet on everything around the AI model

Your AI coding assistant just shipped a PR while you were on the train. It attached a video of itself clicking through the UI to prove the feature works. It resolved its own merge conflicts and squashed to a single commit. This is not a demo. 35% of Cursor's own merged pull requests are now produced this way. But here's what most engineers miss when they look at this: the model inside that agent (GPT-5, Claude, Gemini) is the least interesting part of the stack. What makes it work is everything around the model: VM isolation, codebase onboarding, parallel orchestration, video artifact capture, and multi-model routing. Cursor calls it a cloud agent. The more precise term is a harness.


Before we start! 🦸🏻‍♀️

If this helps you ship better AI systems:

πŸ‘ Clap 50 times (yes, you can!) β€” Medium's algorithm favors this, increasing visibility to others who then discover the article.

🔔 Follow me on Medium and LinkedIn, and subscribe to get my latest articles.


TL;DR

  • Cursor Cloud Agents run autonomous coding tasks in isolated Linux VMs with full dev environments. 10 to 20 agents can run in parallel per user.
  • The standout capability: agents build, test, and use the software they create, then attach video proof to the PR. This is not autocomplete. This is delegated engineering.
  • 35% of Cursor's own merged PRs come from agents. Cursor 3 "Glass" (April 2, 2026) rebuilt the entire IDE around agent orchestration. The company crossed $2B ARR.
  • The model is a commodity input. Cursor routes GPT-5, Claude, Gemini, and its own Composer 2. The harness around it (VM isolation, codebase onboarding, artifact generation, parallel orchestration) is what you're paying $29.3B for.
  • Self-hosted option shipped March 2026. Claude Code and OpenAI Codex solve the same problem with radically different harnesses, proving that the model layer is interchangeable but the harness layer is not.

    Why It Matters

    The shift from "AI suggests code" to "AI ships tested features end to end" is happening right now, and it is happening faster than most engineering teams expected. In October 2025, Cursor launched cloud agents as background workers. By February 2026, those agents could control their own computers: opening browsers, clicking through UIs, and recording video proof of their work. By April 2026, Cursor 3 tore out the chat panel entirely and rebuilt the IDE around an Agents Window where developers dispatch and monitor autonomous tasks like a project manager reviewing work.

    This matters beyond Cursor because the harness pattern is emerging everywhere. Claude Code ships background agents with worktree isolation. OpenAI Codex runs sandboxed tasks from GitHub issues. GitHub Copilot added its own coding agent. Every tool converges on the same promise: the AI works while you do something else. But the implementations are radically different. Understanding Cursor's harness teaches you how to evaluate all of them, because the differentiator is never the model. It is always the infrastructure wrapped around it.

    Image by Author — Diagram 1: The Shift
    Image by Author — Diagram 2: What 35% Looks Like

    The Engine vs. the Car

    You do not buy a car for the engine alone. You buy it for the chassis, the suspension, the transmission, the safety systems, and the dashboard that ties them together. The engine is interchangeable: swap a V6 for an electric motor, and the car still drives. Same with cloud agents. The LLM is one component at the bottom of the stack. The harness is everything above it that makes raw model capability useful for shipping code.

    Cursor's harness has five layers. At the top sits the interface layer: where tasks come in (Slack, GitHub, mobile, IDE). Below that, the orchestration layer: how work gets planned, which model gets picked, and how many agents run in parallel. Then the execution layer: where code actually runs (isolated VMs, not your laptop). Then the verification layer: how the agent proves its output works (computer use, video, screenshots, logs). And finally the output layer: what the developer receives (a PR with artifacts, not a chat message).

    The model sits below all five layers. You can swap GPT-5 for Claude for Gemini for Composer 2, and the harness behaves identically. This is not a theoretical claim. Cursor literally does this: it routes different models to different subtasks within the same agent session, and it runs the same task against multiple models in parallel to pick the best result. The model is the engine. The harness is the car. And Cursor's $29.3B valuation is a valuation of the car.
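The five-layer stack and the swap-the-engine claim can be sketched as plain data. This is an illustrative model, not Cursor's internal API — the layer names come from the article, everything else is made up for the sketch:

```python
from dataclasses import dataclass

@dataclass
class HarnessLayer:
    name: str
    responsibility: str

# The five harness layers, top to bottom. The model sits below all of them.
HARNESS_STACK = [
    HarnessLayer("interface", "where tasks come in: Slack, GitHub, mobile, IDE"),
    HarnessLayer("orchestration", "planning, model routing, parallelism"),
    HarnessLayer("execution", "isolated VMs where code actually runs"),
    HarnessLayer("verification", "computer use, video, screenshots, logs"),
    HarnessLayer("output", "a PR with artifacts, not a chat message"),
]

def run_task(stack, model):
    """The model is a parameter; the stack never changes with it."""
    return {"layers": [layer.name for layer in stack], "model": model}

# Swapping GPT-5 for Claude leaves every layer identical.
a = run_task(HARNESS_STACK, "gpt-5")
b = run_task(HARNESS_STACK, "claude")
```

The point of the sketch: the model appears only as an argument, never as part of the stack definition.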

    Image by Author — Diagram 3: The Harness Stack
    Image by Author — Diagram 4: Model Interchangeability

    Inside Cursor's Cloud Agents

    What Actually Happens

    Codebase Onboarding

    The agent reads your repo before writing a single line of code. Cursor built a custom embedding model specifically for codebase recall across large repositories. When a cloud agent starts, subagents fan out in parallel to explore different parts of the codebase, each using the model best suited for that subtask. One subagent might index the frontend component tree while another maps the API routes and a third reads the database schema. The result is a context map that gives the agent working knowledge of your project before it touches any files.
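A minimal sketch of that fan-out, using a thread pool in place of real model-driven subagents. The explorer functions here are hypothetical stand-ins — the real subagents explore with a model, not hard-coded scans:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical explorers standing in for model-driven subagents.
def index_frontend(repo):
    return ("frontend", f"component tree of {repo}")

def map_api_routes(repo):
    return ("api", f"routes exposed by {repo}")

def read_db_schema(repo):
    return ("db", f"schema behind {repo}")

def onboard(repo):
    """Fan subagents out in parallel, then merge their findings into
    one context map the main agent works from before touching files."""
    explorers = [index_frontend, map_api_routes, read_db_schema]
    with ThreadPoolExecutor(max_workers=len(explorers)) as pool:
        results = pool.map(lambda explore: explore(repo), explorers)
    return dict(results)

context_map = onboard("acme-shop")
```

Each explorer runs concurrently; the merged dict is the "working knowledge" handed to the main agent.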

    For new repos, you can kick off onboarding at cursor.com/onboard. The agent configures its own environment, installs dependencies, and records a demo video of the working application. You watch the demo to verify the agent understood your project correctly. This is not a README parser. It is an active exploration of your codebase that produces a verified baseline.

    Image by Author — Diagram 5: Onboarding Sequence
    Image by Author — Diagram 6: Subagent Exploration

    The VM Sandbox

    Each cloud agent gets its own isolated Linux VM with a full development environment: file system, terminal, browser, running application instance. Your laptop is not involved. There is no resource competition between agents, and there is no resource competition between agents and you. This isolation is what makes parallelism possible. You can run 10 to 20 agents simultaneously, each in its own sandbox, each working a completely different task on a completely different branch.

    Before cloud agents, local agent mode meant your machine was doing double duty: running your editor, running the agent, and running the application being tested. Three agents on one laptop meant memory pressure, port conflicts, and a fan that sounded like a jet engine. Cloud VMs eliminate all of that. Each agent gets a clean environment that cannot interfere with anything else.

    In March 2026, Cursor shipped self-hosted cloud agents for enterprise teams. Same capabilities: isolated VMs, full dev environments, multi-model harnesses, plugins. The difference is that your codebase, build outputs, and secrets never leave your network. The agent handles tool calls locally on your infrastructure. Same harness, different trust boundary.

    Image by Author — Diagram 7: VM Isolation
    Image by Author — Diagram 8: Cloud vs. Self-Hosted

    Plan Mode and Model Routing

    Cursor's workflow for complex features splits into two phases. First, you iterate locally with a model to create a detailed plan: what the feature should do, which files it touches, what the acceptance criteria look like. Once the plan is solid, you send it to a cloud agent for implementation. You move on to your next task. The agent works in the background, following the plan you agreed on.

    Model routing happens at the harness level, not the developer level. Cursor picks the best model for each subtask based on the task type and complexity. The GPT-5 Codex agent harness was specifically revamped for long time horizons in the cloud. Claude handles reasoning-heavy subtasks. Gemini processes large context windows efficiently. Composer 2, Cursor's own model trained with reinforcement learning from user interactions, handles routine coding with strong results on challenging tasks.

    The most interesting orchestration pattern is the race. Cursor dispatches the same problem to multiple models in parallel and picks the best result. The team reports this significantly improves final output quality, especially for harder bugs that require a handful of precise changes. This is why model lock-in does not matter at the harness level. The harness runs the race. The models are contestants.
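The race reduces to map-and-argmax. A sketch with canned scores standing in for real model calls and a real scoring step (tests, or a judge model); model names and scores are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Canned quality scores stand in for actually running and evaluating models.
CANNED_SCORES = {"gpt-5": 0.80, "claude": 0.92, "gemini": 0.85}

def run_model(model, task):
    """Stand-in for dispatching one model and scoring its candidate patch."""
    return {"model": model, "patch": f"{model} patch for: {task}",
            "score": CANNED_SCORES[model]}

def race(task, models=("gpt-5", "claude", "gemini")):
    """Dispatch the same task to every model in parallel,
    then keep only the highest-scoring result."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        candidates = list(pool.map(lambda m: run_model(m, task), models))
    return max(candidates, key=lambda c: c["score"])

best = race("fix flaky auth test")  # with these canned scores, claude wins
```

Note that the harness owns the dispatch and the selection; the models are interchangeable contestants, which is exactly the lock-in argument above.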

    Image by Author — Diagram 9: Plan-to-Cloud Handoff
    Image by Author — Diagram 10: The Race Pattern

    Computer Use

    On February 24, 2026, Cursor shipped the capability that separates cloud agents from everything that came before. Agents can now use the software they create. Each VM includes a browser. The agent builds the application, launches it, navigates to localhost, and interacts with the UI the way a human would: clicking buttons, filling forms, navigating pages, and checking that elements render correctly.

    When the agent finds a problem during this verification, it does not stop and report a failure. It goes back to the code, fixes the issue, rebuilds, and tests again. This loop continues until the agent has verified that its changes actually work. When verification passes, the agent records a video of the entire session, takes screenshots of key states, and collects logs. All of this gets attached to the pull request as artifacts.

    Cursor has been dogfooding this capability internally. They used a cloud agent to build source code links for the Cursor Marketplace: the agent implemented the feature, navigated to the imported Prisma plugin, clicked each component to verify the GitHub links worked, then rebased onto main, resolved merge conflicts, and squashed to a single commit. For security work, they kicked off a cloud agent from Slack to triage a clipboard exfiltration vulnerability. The agent built an exploit page, started a backend server, loaded it in the browser, and recorded the complete attack flow. The summary appeared in the Slack thread.

    Image by Author — Diagram 11: The Verification Loop
    Image by Author — Diagram 12: Before vs. After Computer Use

    The PR as Proof

    The pull request from a cloud agent is not a diff. It is a diff plus a video demo plus screenshots plus logs plus a clean commit history (rebased, conflicts resolved, squashed). When you review this PR, you are not reading code and mentally simulating whether it works. You are watching a 30-second video of the agent demonstrating the feature. This changes the review bottleneck fundamentally: you verify intent, not execution.

    For teams that have adopted this workflow, the code review conversation shifted. Reviewers stopped asking "does this work?" and started asking "is this what we actually wanted?" The mechanical verification is handled by the agent. The human judgment is reserved for product decisions. That is a meaningful reallocation of engineering attention.


    Map It to the Harness

    Now that you have seen what actually happens, here is how it maps to the harness stack from the previous section. Each layer of the harness corresponds to a piece of the workflow you just walked through.

    Interface layer. Cloud agents can be triggered from Slack, GitHub, Linear, cursor.com/agents (web), mobile, or the desktop IDE. The Agents Window in Cursor 3 replaced the chat panel with a persistent orchestration panel: task cards showing Planning, Executing, Reviewing, or Done status, with file diffs and progress indicators. You dispatch five tasks, monitor them as cards, and review results when they surface. The mental model shifted from pair programming to project management.

    Orchestration layer. Plan mode, model routing, the race pattern, and subagent dispatch live here. This is the layer that decides how work gets done and which model does it. As shown in the race pattern above, Cursor does not commit to a single model. It hedges by running multiple models on the same problem and selecting the best output.

    Execution layer. VM isolation and the self-hosted option. This layer determines where code runs and who controls the environment. The trust boundary decision lives here: Cursor-hosted means code leaves your machine; self-hosted means it stays on your network.

    Verification layer. Computer use, video recording, screenshot capture, and log collection. This is the layer that turns a code diff into proof. Without it, cloud agents are just remote code generators. With it, they are autonomous engineers that demonstrate their work.

    Output layer. The PR with artifacts. The deliverable format that closes the loop between the agent's work and the developer's review.

    Five layers of harness. One layer of model at the bottom. The model is interchangeable. The harness is not.

    Image by Author — Diagram 13: Entry Points Map
    Image by Author — Diagram 14: The Agents Window
    Image by Author — Diagram 15: The Complete Harness (Filled In)

    The Platform Built on the Harness

    The cloud agent harness is the foundation. Cursor built four platform capabilities on top of it.

    Automations. Event-driven agents that trigger on a schedule or on events from external tools without you being present. When a trigger fires, Cursor spins up a cloud sandbox, follows your instructions, uses whatever MCPs you have configured, and can optionally remember the outcome of previous runs to improve over time. One limitation worth noting: automations do not yet support computer use, so automated agents cannot do visual verification.

    Bugbot Autofix. Bugbot started as a code reviewer. Now, when it finds a problem on your PR, it spins up a cloud agent on its own VM, tests a fix, and proposes the fix directly on your pull request. Over 35% of Bugbot Autofix suggestions are being merged into the base PR. The resolution rate (bugs flagged by Bugbot that get fixed before merge) climbed from 52% to 76% over six months, while the average number of issues identified per run nearly doubled. The tool is getting more accurate, not just louder.

    Plugin ecosystem. More than 30 plugins from partners including Atlassian, Datadog, GitLab, Glean, Hugging Face, monday.com, and PlanetScale. Most plugins contain MCPs that cloud agents can use when kicked off manually or through automations. This is where the harness becomes an integration platform: agents that can read from Jira, write to Datadog, and query PlanetScale within a single task.

    Cursor 3 "Glass." Launched April 2, 2026, this is the IDE rebuilt around agent orchestration. The Agents Window replaces the chat panel. Multi-repo layout lets agents read and write across multiple repositories in a single workspace. Design Mode provides a visual editor for UI components: click an element in a live preview, describe the change in natural language, and an agent modifies the source code with the preview updating in real time. Cursor crossed $2B ARR at launch and holds roughly 25% of the AI coding tool market.

    Image by Author — Diagram 16: Platform Layer Cake
    Image by Author — Diagram 17: Bugbot Autofix Flow

    Design Choices and Trade-offs

    What Cursor Got Right

    Video artifacts make review 10x faster. You verify intent, not execution. A 30-second video catches visual regressions that reading a diff cannot.

    Parallel agents multiply throughput for well-defined tasks. Ten bug fixes running simultaneously while you focus on architecture. The productivity math is simple: if each agent saves 30 minutes on a well-scoped task, ten agents save five hours in one session.

    Multi-surface access meets developers where they already work. Kick off an agent from Slack. Review a PR on your phone. Dispatch tasks from GitHub issues. The interface layer is the least flashy part of the harness, but it removes enough friction to change daily behavior.

    Self-hosted option addresses the hardest enterprise objection. "Our code cannot leave our network" used to disqualify Cursor Cloud entirely. Since March 2026, that objection has an answer: same harness, your infrastructure.

    What Cursor Got Wrong (Or Has Not Solved Yet)

    Lazy delete. Parallel workers occasionally use // ... existing code ... placeholder comments, silently deleting real code. You must review diffs carefully. This is not a rare edge case; multiple independent reviewers have flagged it.
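Until the harness guards against this itself, a cheap CI check can flag placeholder comments in added lines before a human reviews the diff. A sketch — the regex patterns are illustrative, not exhaustive:

```python
import re

# Phrases agents sometimes leave behind when they elide real code.
PLACEHOLDER = re.compile(
    r"existing code|rest of (the )?file|\.\.\. ?unchanged", re.IGNORECASE
)

def flag_lazy_deletes(diff_text):
    """Return added lines that look like elision placeholders,
    which can mask silent deletion of real code."""
    return [line for line in diff_text.splitlines()
            if line.startswith("+") and PLACEHOLDER.search(line)]

diff = "\n".join([
    "+function login(user) {",
    "+  // ... existing code ...",
    "+}",
])
suspect = flag_lazy_deletes(diff)  # flags the placeholder line
```

Wiring something like this into CI turns "review diffs carefully" from advice into an automated gate.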

    Credit burn. Heavy agentic workflows drain credits fast. The Pro plan compute ceiling gets hit mid-month for developers who lean hard on parallel agents. Agent mode runs several model calls per task at roughly $0.04 each, and Claude Sonnet costs about 2.4x as much per request as Gemini. The harness decides which model to route, and that routing decision has real cost implications.

    Task scoping is the real skill. Tasks that are too broad ("refactor the auth module") produce sweeping changes that require extensive review. Tasks that are too narrow ("rename this variable") are overkill. The sweet spot is medium-sized discrete tasks: "add rate limiting to the /api/auth/login route, using the existing middleware pattern in /api/users." Learning this scoping takes a few days of practice.

    Legacy codebases hit the wall. Cloud agents work best with well-structured, modern codebases that have consistent conventions and good test coverage. If you point one at a legacy monolith with inconsistent patterns, expect more manual steering and lower success rates.

    Agent recovery is clunky. When an agent goes in the wrong direction, the correction workflow (pause, describe the wrong turn, redirect) creates more friction than the old conversational chat model did. Cursor 3's Agents Window improved this with task cards you can fork and restart, but it is still rougher than the iterative back-and-forth of Cursor 2.

    Shadow code. For CTOs and engineering managers, the biggest operational fear is logic written autonomously by an AI that human developers fail to understand or properly review. Cursor's Enterprise plan includes an AI code tracking API and audit logs to track which model authored which lines. But the review discipline has to come from the team, not the tool.

    Image by Author — Diagram 18: When Cloud Agents Help vs. Hurt
    Image by Author — Diagram 19: Credit Burn by Model

    How Claude Code and Codex Solve It Differently

    This is not a product review of Claude Code or OpenAI Codex. It is a harness comparison. All three tools solve the same job: ship code autonomously while the developer does something else. They use the same class of frontier models. The results are completely different because the harnesses are completely different.

    Claude Code Background Agents

    Claude Code takes the opposite architectural bet from Cursor on almost every dimension. It is terminal-first. Code stays on your local machine. There are no cloud VMs. Instead of parallel orchestration across many agents, Claude Code goes deep on single complex tasks: subagent spawning with worktree isolation means Claude can work on multiple branches of a project simultaneously without interference, but all of it runs locally.

    The reasoning depth is where Claude Code pulls ahead. With Opus and up to 200K tokens of context, Claude Code handles architectural refactors and complex multi-file changes that require sustained reasoning across a large codebase. The MCP integration is native and deep: Claude Code connects to local tools, databases, and services through a full ecosystem of protocol servers.

    What Claude Code does not have: video artifacts, cloud VMs, or a GUI orchestration layer. The output is a diff and a conversation log, not a PR with video proof. The trust model is local-first: your code never leaves your machine, which makes it the default choice for teams with strict data residency requirements.

    The Dispatch feature (launched March 17, 2026) adds a remote control layer: scan a QR code and your phone becomes a walkie-talkie to the Claude Code session running on your desktop. This solves the "AI works while you're away" problem, but with a single agent on a single machine, not with parallel cloud VMs.

    OpenAI Codex

    Codex runs tasks in cloud sandboxes triggered from a web interface or GitHub issues. The harness is simpler than Cursor's: one task at a time, tight GitHub integration, no video artifacts, no model racing. The bet is that most useful agent work starts from well-scoped issues in an issue tracker and ends with a PR.

    Where Codex wins is simplicity. Point it at a GitHub issue. It reads the issue, creates a branch, implements the change, and opens a PR. The workflow is linear and predictable. Where it loses is the absence of verification (no computer use, no video proof) and the lack of parallelism.

    The Harness Comparison

    | Harness Layer | Cursor Cloud | Claude Code | OpenAI Codex |
    |---|---|---|---|
    | Interface | IDE + Slack + GitHub + Mobile + Web | Terminal + Phone (Dispatch) | Web + GitHub |
    | Orchestration | Parallel dispatch, model racing, plan mode | Subagent spawning, worktree isolation | Single task from issue |
    | Execution | Cloud VMs (10 to 20 parallel) | Local machine | Cloud sandbox |
    | Verification | Computer use + video + screenshots + logs | Test execution (no visual proof) | Test execution (no visual proof) |
    | Output | PR + video artifacts | Diff + conversation | PR + diff |
    | Model | GPT-5 / Claude / Gemini / Composer 2 | Claude only | GPT series only |
    | Trust boundary | Code leaves machine (unless self-hosted) | Code stays local | Code in OpenAI cloud |
    | Parallelism | 10 to 20 agents | Multiple subagents (same machine) | One task at a time |

    When Each Harness Wins

    Cursor wins when you have 10+ well-defined tasks to parallelize, you want visual proof for reviewers, your team communicates via Slack and GitHub, and you are comfortable with code in cloud VMs (or can self-host).

    Claude Code wins when you have one complex architectural task requiring deep reasoning, code must stay local, you need MCP integrations with local tools, or you work terminal-first.

    Codex wins when your workflow is GitHub-issue-driven, tasks are well scoped, and you want the simplest possible path from issue to PR.
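Collapsed into code, the decision reads roughly like this. A toy helper only — real evaluations weigh far more factors, and Cursor's self-hosted option also satisfies a local-only requirement:

```python
def pick_harness(code_must_stay_local, parallel_tasks,
                 need_visual_proof, github_issue_driven):
    """Toy decision helper mirroring the trade-offs above."""
    if code_must_stay_local:
        return "claude-code"   # or Cursor self-hosted
    if parallel_tasks >= 10 or need_visual_proof:
        return "cursor-cloud"  # VMs, model racing, video artifacts
    if github_issue_driven:
        return "codex"         # simplest issue-to-PR path
    return "cursor-cloud"
```

For example, a team with twelve well-scoped tickets and no data-residency constraint lands on Cursor Cloud; a terminal-first developer with a local-only policy lands on Claude Code.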

    Same models available to all three. Completely different harnesses. Completely different developer experiences. The harness is the product.

    Image by Author — Diagram 20: Three Harnesses, Same Model
    Image by Author — Diagram 21: Three Workflows, One Job
    Image by Author — Diagram 22: Pick Your Harness (Decision Tree)

    Pricing: What You're Actually Paying For

    Cursor uses credit-based billing. Every paid plan includes a credit pool denominated in dollars. When you make an AI request, credits are deducted based on two factors: which model you use and how complex the task is.

    | Plan | Monthly | Annual | Credit Pool | Cloud Agents | Self-Hosted |
    |---|---|---|---|---|---|
    | Hobby | Free | Free | None | Limited | No |
    | Pro | $20 | $16 | $20 | Yes | No |
    | Pro+ | $60 | $48 | $60 (3x) | Yes | No |
    | Ultra | $200 | $160 | $200 (10x) | Yes | No |
    | Teams | $40/seat | $32/seat | $20/user | Yes | No |
    | Enterprise | Custom | Custom | Custom | Yes | Yes |

    The credit math matters. The $20 Pro pool covers approximately 225 Claude Sonnet requests, 550 Gemini requests, or 500 GPT-5 requests. Agent mode runs multiple model calls per task, each at roughly $0.04. A complex parallel session with 5 agents on Claude Sonnet can consume several dollars of credits in minutes. The Pro compute ceiling gets hit after roughly 15 to 20 complex parallel sessions per month.
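Those request counts imply per-request costs you can sanity-check yourself. A back-of-envelope sketch using only the figures above (the derived per-request prices are inferences, not published rates):

```python
POOL = 20.00  # Pro plan monthly credit pool, in dollars

# Per-request cost implied by the request counts the pool covers.
COST = {
    "claude-sonnet": POOL / 225,  # ~$0.089 per request
    "gemini": POOL / 550,         # ~$0.036 per request
    "gpt-5": POOL / 500,          # $0.040, matching the ~$0.04 figure
}

def session_cost(model, agents, calls_per_agent):
    """Estimated cost of one parallel agent session."""
    return agents * calls_per_agent * COST[model]

# Five Sonnet agents at ~10 model calls each: several dollars in minutes.
cost = session_cost("claude-sonnet", agents=5, calls_per_agent=10)  # ~$4.44
ratio = COST["claude-sonnet"] / COST["gemini"]  # ~2.4, matching the article
```

Running the same five-agent session on Gemini instead would cost well under half as much, which is exactly the routing trade-off the harness is making on your behalf.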

    The insight: you are not paying for model access. You can get Claude, GPT-5, and Gemini through their respective APIs for less. You are paying for the harness: VM infrastructure, codebase indexing, artifact generation pipeline, plugin ecosystem, and the orchestration layer that ties it all together.

    Image by Author — Diagram 23: Pricing Tiers at a Glance

    Action Checklist

  • Decide your trust boundary. Can code leave your machine? If not, evaluate Claude Code or Cursor's self-hosted option before Cursor Cloud.
  • Audit your codebase. Modern and well-tested repositories are strong candidates. Legacy monoliths with inconsistent conventions will require more manual steering.
  • Start with one agent on one well-defined task. A specific bug fix or a feature with a clear spec. Do not start with "refactor the auth module."
  • Set up CI before you set up agents. Cloud agents produce PRs. Your pipeline validates them. If your tests are flaky, you will waste time reviewing false failures.
  • Review the video artifacts. They exist for a reason. A 30-second video catches visual regressions that reading a diff cannot.
  • Try the race pattern. Dispatch the same task to multiple models. Compare the results. The quality difference will surprise you.
  • Track your credit burn for one week. Log the cost per agent task before scaling from one agent to five. The math gets real fast.
  • Evaluate self-hosted if enterprise. Same capabilities, code stays on your network. The compliance conversation changes entirely.

    Recap and Next Steps

  • Cursor Cloud Agents run autonomous coding tasks in isolated VMs with video proof. 35% of Cursor's own PRs ship this way. The model is interchangeable. The harness is the product.
  • The harness has five layers (interface, orchestration, execution, verification, output) that determine the developer experience far more than the model underneath. Cursor, Claude Code, and Codex prove this: same models, radically different harnesses, radically different outcomes.
  • The decision is not "which model is best." The decision is "which harness fits my team's trust model, parallelism needs, and workflow."
  • Try this week: pick one well-scoped bug from your backlog. Kick off a single Cursor Cloud Agent (or Claude Code background agent, or Codex). Review the output. Pay attention to how the harness shaped the experience, not which model was inside it. That observation will tell you more about where this industry is going than any benchmark.


    Credits and Further Reading

  • Cloud Agents — Cursor Blog — The original announcement (Oct 2025) explaining how Cursor uses cloud agents internally for bug fixes, quick todos, and complex features.
  • Cursor Agents Can Now Control Their Own Computers — Cursor Blog — The February 24, 2026 launch of computer use: video artifacts, self-testing, and the "self-driving codebase" vision.
  • Cursor Changelog — Full timeline of cloud agent features: self-hosted agents (Mar 19), Composer 2 (Mar 11), plugin ecosystem (Mar 5), and Cursor 3 Glass (Apr 2).
  • Cursor Cloud Agents Get Their Own Computers — DevOps.com — Deep dive on the 35% PR stat and Alexi Robbins (co-head of async agents engineering) on parallel execution.
  • Cursor 3 "Glass" Review — OpenAIToolsHub — Independent testing of Cursor 3 vs Claude Code vs Codex across three codebases. Agents Window, Design Mode, and cloud compute limits on Pro.
  • Cursor Pricing Explained — Vantage — Detailed breakdown of the credit-based billing system, model cost differentials, and the infrastructure spend framing.

  • Created by Han HELOIR YAN