Cursor Composer 2: How Anysphere Built a Frontier Coding Model on Kimi K2.5 — 86% Cheaper, Beats Claude Opus
Top Highlights
- Composer 2 launched March 19, 2026 — built on the open-source Kimi K2.5 base, fine-tuned with RL for multi-file agentic coding
- 61.3 on CursorBench vs Claude Opus 4.6's 58.2 — frontier performance at $0.50/M input tokens (vs $5.00 for Claude Opus)
- 73.7% on SWE-bench Multilingual — a 12% relative gain over its predecessor's 65.8%, tested across Python, JavaScript, TypeScript, and Java
- 200K token context window with self-summarization for long-horizon tasks beyond the context limit
- Cursor serves 1M+ daily users and 50,000 businesses including Stripe and Figma — Composer 2 is available now in Cursor IDE
What Cursor Composer 2 Actually Is
Cursor Composer 2 is a purpose-built coding model released on March 19, 2026 by Anysphere, the company behind the Cursor AI code editor. Unlike most coding tools that sit atop general-purpose frontier models from OpenAI or Anthropic, Composer 2 is a custom fine-tuned model built on the open-source Kimi K2.5 base — Moonshot AI's mixture-of-experts architecture — and further trained using reinforcement learning specifically for multi-file code editing, refactoring, and long-horizon agentic coding tasks.
The core bet Anysphere is making is that coding is a narrow enough domain that a well-fine-tuned open-weight model, trained on real developer workflows rather than general internet text, can outperform frontier general models at a fraction of the cost. The March 2026 benchmarks make a compelling case that this bet is paying off.
Composer 2 replaces Composer 1.5 as the default agent model inside Cursor's Composer feature — the multi-file agent interface that lets developers describe a task and watch the model traverse the codebase, make edits, run commands, and iterate toward a solution autonomously.
Benchmark Breakdown
Anysphere published two headline benchmark numbers for Composer 2. The first is CursorBench — Anysphere's own internal evaluation suite covering tasks like multi-file refactoring, API integration, test generation, and codebase navigation — where Composer 2 scores 61.3 against Claude Opus 4.6's 58.2. The second is SWE-bench Multilingual, a public benchmark of real GitHub issues across Python, JavaScript, TypeScript, and Java, where Composer 2 scores 73.7% — a 12% relative improvement over Composer 1.5 and ahead of several frontier models.
| Model | CursorBench | SWE-bench Multi | Input $/M tokens | Output $/M tokens |
|---|---|---|---|---|
| Composer 2 Standard | 61.3 | 73.7% | $0.50 | $2.50 |
| Composer 2 Fast | 60.1 | 71.4% | $1.50 | $7.50 |
| Claude Opus 4.6 | 58.2 | — | $5.00 | $25.00 |
| Composer 1.5 (prev) | ~54 | 65.8% | $3.50 | $17.50 |
The 86% cost reduction from Composer 1.5 to Composer 2 Standard is the most significant number for teams running coding agents at scale. A team running Cursor's Composer agent aggressively — thousands of requests per day — sees the cost floor drop from $3.50 to $0.50 per million input tokens, putting frontier-quality output at roughly GPT-4o-mini pricing.
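To make the table concrete, here is a minimal cost sketch in TypeScript, using the per-token prices above and a hypothetical workload (2,000 agent sessions a day; the session token counts are illustrative assumptions, not Anysphere figures):

```typescript
// Per-million-token prices from the comparison table above.
interface Pricing {
  inputPerM: number;
  outputPerM: number;
}

const composer2Standard: Pricing = { inputPerM: 0.5, outputPerM: 2.5 };
const claudeOpus46: Pricing = { inputPerM: 5.0, outputPerM: 25.0 };

// Hypothetical workload: 2,000 agent sessions/day, each averaging
// 150K input tokens and 20K output tokens (illustrative numbers only).
function monthlyCost(
  p: Pricing,
  sessionsPerDay = 2000,
  inputTokens = 150_000,
  outputTokens = 20_000,
): number {
  const perSession =
    (inputTokens / 1_000_000) * p.inputPerM +
    (outputTokens / 1_000_000) * p.outputPerM;
  return perSession * sessionsPerDay * 30;
}

console.log(`Composer 2 Standard: $${monthlyCost(composer2Standard).toFixed(0)}/mo`);
console.log(`Claude Opus 4.6:     $${monthlyCost(claudeOpus46).toFixed(0)}/mo`);
```

Under these assumed volumes the gap is roughly an order of magnitude per month, which is why input pricing dominates the decision for agent-heavy teams.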
Technical Architecture: Kimi K2.5 + RL Fine-Tuning
Kimi K2.5 is a mixture-of-experts (MoE) model from Moonshot AI — a Chinese AI lab — released as open weights. MoE architectures activate only a subset of parameters per token, making inference significantly cheaper than dense models of equivalent capacity. Anysphere's decision to build on Kimi K2.5 rather than fine-tune an existing OpenAI or Anthropic model is notable: it signals that open-weight models have matured enough to serve as a competitive base for specialised coding applications at production scale.
The fine-tuning methodology uses reinforcement learning from code execution feedback — the model generates code, a sandboxed executor runs it, and the RL signal comes from whether tests pass, linters are satisfied, and the stated task objective is met. This is analogous to how DeepSeek-Coder-V2 and earlier SWE-agent approaches were trained, but applied to the specific ergonomics of Cursor's multi-file agent interface.
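Anysphere has not published its training code, so the following is only a rough sketch of what an execution-feedback reward signal could look like; every name and weight here is a hypothetical illustration:

```typescript
// Outcome of running the model's patch in a sandboxed executor.
interface ExecutionResult {
  testsPassed: number;
  testsTotal: number;
  lintErrors: number;
  taskObjectiveMet: boolean; // e.g. judged by a spec check or verifier model
}

// Hypothetical reward shaping: dense credit for partial test progress,
// a capped penalty for lint violations, and a bonus only when the
// stated task objective is met.
function reward(r: ExecutionResult): number {
  const testScore = r.testsTotal > 0 ? r.testsPassed / r.testsTotal : 0;
  const lintPenalty = Math.min(0.3, 0.05 * r.lintErrors);
  const objectiveBonus = r.taskObjectiveMet ? 0.5 : 0;
  return testScore - lintPenalty + objectiveBonus;
}
```

In an actual RL loop a scalar like this would score each sampled trajectory before a policy update; the point is that the signal comes from executing the code, not from imitating reference solutions.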
One key architectural innovation Anysphere describes is self-summarization for long-horizon tasks. When an agentic coding task exceeds the 200K token context window — for instance, when navigating a large monorepo — Composer 2 generates rolling summaries of previously visited files and decisions, compressing prior context into a structured scratchpad. This allows the agent to maintain coherent task state across a codebase that would otherwise overflow the context limit.
Task: "Migrate all fetch() calls in /src to use the new apiClient wrapper"
[Step 1] Scan /src — 47 files contain fetch(). Context: 12K tokens used.
[Step 2] Edit /src/api/users.ts — replace 3 fetch() calls. [DONE]
[Step 3] Edit /src/api/products.ts — replace 5 fetch() calls. [DONE]
...
[Step 22] Context budget 80% full — generate summary:
"Migrated 31/47 files. Remaining: /components/*, /hooks/*.
Pattern: always wrap with apiClient({ method, path, body }).
Edge case found: streaming responses use fetchStream() — leave as-is."
[Step 23] Resume from summary — continue with /components/*
This self-summarization loop is what distinguishes Composer 2 from simpler autocomplete or single-turn generation. The model is explicitly trained to maintain a task memory structure, not just respond to the current prompt in isolation.
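The loop in the trace above can be sketched as a rolling scratchpad that compresses per-file notes into a summary once a context budget threshold is crossed. The threshold, field names, and summarize callback below are illustrative assumptions, not Anysphere's published internals:

```typescript
interface ScratchpadEntry {
  file: string;
  note: string;
  tokens: number;
}

class TaskScratchpad {
  private entries: ScratchpadEntry[] = [];
  private summary = "";

  constructor(
    private budgetTokens: number,
    // Stand-in for a call back into the model to compress history.
    private summarize: (entries: ScratchpadEntry[], prior: string) => string,
  ) {}

  record(entry: ScratchpadEntry): void {
    this.entries.push(entry);
    if (this.usedTokens() > 0.8 * this.budgetTokens) {
      // Compress detailed per-file notes into one structured summary,
      // then drop the detail — this keeps long tasks within budget.
      this.summary = this.summarize(this.entries, this.summary);
      this.entries = [];
    }
  }

  usedTokens(): number {
    return this.entries.reduce((sum, e) => sum + e.tokens, 0);
  }

  // What the agent would carry forward as context: summary plus recent detail.
  context(): string {
    return [this.summary, ...this.entries.map((e) => `${e.file}: ${e.note}`)]
      .filter(Boolean)
      .join("\n");
  }
}
```

The design choice worth noting is that the summary is regenerated from both the prior summary and the fresh entries, so decisions and edge cases (like the fetchStream() exception in the trace) survive repeated compression rounds.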
Switching to Composer 2: Developer Guide
Composer 2 is available now for all Cursor users. Switching is a one-line settings change — it does not require a separate API key or plan upgrade for teams already on Cursor Pro or Business.
Enable Composer 2 in Cursor Settings
// cursor/settings.json — switch Composer agent model
{
"cursor.composer.model": "composer-2-standard",
// or "composer-2-fast" for lower latency, slightly lower bench scores
"cursor.composer.maxTokens": 200000,
"cursor.composer.enableSelfSummarization": true
}
When to Use Standard vs Fast
- Standard: Large refactors, cross-repo migrations, feature implementations that span 10+ files. Best quality at $0.50/M input.
- Fast: Quick fixes, single-file edits, test generation for known interfaces. Lower latency at a higher per-token price ($1.50/M input).
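For teams that route tasks programmatically, a heuristic along these lines matches the guidance above; the 10-file threshold is a hypothetical rule of thumb, not an official cutoff:

```typescript
type ComposerModel = "composer-2-standard" | "composer-2-fast";

// Hypothetical routing heuristic: broad multi-file work always goes to
// Standard; only small, latency-sensitive edits go to Fast.
function pickModel(filesTouched: number, latencySensitive: boolean): ComposerModel {
  if (filesTouched >= 10) return "composer-2-standard";
  return latencySensitive ? "composer-2-fast" : "composer-2-standard";
}
```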
Prompting Patterns That Work Well With Composer 2
- Be explicit about scope: "Migrate all fetch() in /src/api/ — leave /src/legacy/ unchanged" prevents the model from over-reaching.
- Specify constraints upfront: "Don't change function signatures — only the implementation" reduces hallucinated interface changes.
- Use task checkpoints: Ask Composer 2 to pause and summarise after every 10 file edits on large tasks — this surfaces misunderstandings early.
- Provide a test file: "Run npm test after each file — stop and ask if any test breaks" makes the RL-trained model's execution feedback loop work in your favour.
What This Means for the AI Coding Market
Composer 2's release continues a trend that has been accelerating since early 2026: the commoditisation of frontier coding capability. The cost curve for code generation has dropped faster than any other AI application category. Twelve months ago, running a frontier-quality coding agent required Claude Opus 3 at $15/M input tokens. Today, Composer 2 Standard achieves better benchmark scores at $0.50/M — a 30x cost reduction in one year.
For Anthropic and OpenAI, this represents a commoditisation threat to their highest-margin use case. Both companies have responded by emphasising reasoning, multimodality, and enterprise trust features that open-weight fine-tuned models cannot easily replicate. But for the dominant use case of writing, refactoring, and testing code in a monorepo, the gap between frontier and fine-tuned open-weight models has effectively closed on quality, while cost favours the latter by an order of magnitude.
Cursor now serves 1 million daily active developers and 50,000 businesses including Stripe, Figma, and Vercel. That distribution gives Composer 2 immediate production scale on day one — a feedback loop that generates training data for Composer 3 before any competitor can accumulate comparable usage data on a specialised coding agent workload.
5 Key Takeaways for Developers
1. Switch to Composer 2 Standard today. It outperforms Claude Opus 4.6 on coding tasks at 86% lower cost. For most multi-file refactoring and agent workflows, it is now the best price-performance option available in Cursor.
2. Open-weight MoE fine-tuning is a viable frontier strategy. Kimi K2.5 as a base proves that you don't need to train from scratch to reach frontier-level performance in a narrow domain. Expect more coding tools to follow this architecture.
3. Self-summarization is the key to large-codebase agents. The ability to maintain task memory beyond the context window distinguishes Composer 2 from simpler autocomplete tools. Structure your prompts to take advantage of this — provide explicit scope and checkpoint instructions.
4. RL from code execution feedback is the training frontier. The quality gap between general-purpose models and code-specific RL-trained models will widen. Watch for similar approaches from GitHub Copilot, JetBrains AI, and Amazon Q in 2026.
5. Budget for agentic coding at scale. At $0.50/M input tokens, teams can now run thousands of Composer agent sessions per day without significant cost concern. Build cost-tracking around output tokens, not input — output pricing at $2.50/M is where spend accumulates on long agentic tasks.
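To act on the budgeting point, a minimal spend tracker can attribute cost per session and flag output-heavy runs. Prices come from the comparison table; the 50% threshold and field names are illustrative:

```typescript
interface SessionUsage {
  inputTokens: number;
  outputTokens: number;
}

// Composer 2 Standard pricing from the comparison table above.
const INPUT_PER_M = 0.5;
const OUTPUT_PER_M = 2.5;

function sessionCost(u: SessionUsage): number {
  return (u.inputTokens / 1e6) * INPUT_PER_M + (u.outputTokens / 1e6) * OUTPUT_PER_M;
}

// Output tokens dominate spend on long agentic tasks; flag sessions where
// they account for most of the bill so they can be reviewed.
function outputDominated(u: SessionUsage): boolean {
  const outputCost = (u.outputTokens / 1e6) * OUTPUT_PER_M;
  return outputCost > 0.5 * sessionCost(u);
}
```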