Cursor Composer 2: How Anysphere Built a Frontier Coding Model on Kimi K2.5 — 86% Cheaper, Beats Claude Opus
Top Highlights
- Composer 2 launched March 19, 2026 — built on the open-source Kimi K2.5 base, fine-tuned with RL for multi-file agentic coding
- 61.3 on CursorBench vs Claude Opus 4.6's 58.2 — frontier performance at $0.50/M input tokens (vs $5.00 for Claude Opus)
- 73.7% on SWE-bench Multilingual — a 12% relative gain over its predecessor's 65.8%, tested across Python, JavaScript, TypeScript, and Java
- 200K token context window with self-summarization for long-horizon tasks beyond the context limit
- Cursor serves 1M+ daily users and 50,000 businesses including Stripe and Figma — Composer 2 is available now in Cursor IDE
What Cursor Composer 2 Actually Is
Cursor Composer 2 is a purpose-built coding model released on March 19, 2026 by Anysphere, the company behind the Cursor AI code editor. Unlike most coding tools that sit atop general-purpose frontier models from OpenAI or Anthropic, Composer 2 is a custom fine-tuned model built on the open-source Kimi K2.5 base — Moonshot AI's mixture-of-experts architecture — and further trained using reinforcement learning specifically for multi-file code editing, refactoring, and long-horizon agentic coding tasks.
The core bet Anysphere is making is that coding is a narrow enough domain that a well-fine-tuned open-weight model, trained on real developer workflows rather than general internet text, can outperform frontier general models at a fraction of the cost. The March 2026 benchmarks make a compelling case that this bet is paying off.
Composer 2 replaces Composer 1.5 as the default agent model inside Cursor's Composer feature — the multi-file agent interface that lets developers describe a task and watch the model traverse the codebase, make edits, run commands, and iterate toward a solution autonomously.
Benchmark Breakdown
Anysphere published two headline benchmark numbers for Composer 2. The first is CursorBench — Anysphere's own internal evaluation suite covering tasks like multi-file refactoring, API integration, test generation, and codebase navigation — where Composer 2 scores 61.3 against Claude Opus 4.6's 58.2. The second is SWE-bench Multilingual, a public benchmark of real GitHub issues across Python, JavaScript, TypeScript, and Java, where Composer 2 scores 73.7% — a 12% relative improvement over Composer 1.5 and ahead of several frontier models.
| Model | CursorBench | SWE-bench Multi | Input $/M tokens | Output $/M tokens |
|---|---|---|---|---|
| Composer 2 Standard | 61.3 | 73.7% | $0.50 | $2.50 |
| Composer 2 Fast | 60.1 | 71.4% | $1.50 | $7.50 |
| Claude Opus 4.6 | 58.2 | — | $5.00 | $25.00 |
| Composer 1.5 (prev) | ~54 | 65.8% | $3.50 | $17.50 |
The 86% cost reduction from Composer 1.5 to Composer 2 Standard is the most significant number for teams running coding agents at scale. A team running Cursor's Composer agent aggressively — thousands of requests per day — sees the cost floor drop from $3.50 to $0.50 per million input tokens, putting frontier-quality output at roughly GPT-4o-mini pricing.
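To make the table concrete, here is a minimal cost sketch in TypeScript, using the per-token prices above and a hypothetical workload (2,000 agent sessions a day; the session token counts are illustrative assumptions, not Anysphere figures):

```typescript
// Per-million-token prices from the comparison table above.
interface Pricing {
  inputPerM: number;
  outputPerM: number;
}

const composer2Standard: Pricing = { inputPerM: 0.5, outputPerM: 2.5 };
const claudeOpus46: Pricing = { inputPerM: 5.0, outputPerM: 25.0 };

// Hypothetical workload: 2,000 agent sessions/day, each averaging
// 150K input tokens and 20K output tokens (illustrative numbers only).
function monthlyCost(
  p: Pricing,
  sessionsPerDay = 2000,
  inputTokens = 150_000,
  outputTokens = 20_000,
): number {
  const perSession =
    (inputTokens / 1_000_000) * p.inputPerM +
    (outputTokens / 1_000_000) * p.outputPerM;
  return perSession * sessionsPerDay * 30;
}

console.log(`Composer 2 Standard: $${monthlyCost(composer2Standard).toFixed(0)}/mo`);
console.log(`Claude Opus 4.6:     $${monthlyCost(claudeOpus46).toFixed(0)}/mo`);
```

Under these assumed volumes the gap is roughly an order of magnitude per month, which is why input pricing dominates the decision for agent-heavy teams.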
Technical Architecture: Kimi K2.5 + RL Fine-Tuning
Kimi K2.5 is a mixture-of-experts (MoE) model from Moonshot AI — a Chinese AI lab — released as open weights. MoE architectures activate only a subset of parameters per token, making inference significantly cheaper than dense models of equivalent capacity. Anysphere's decision to build on Kimi K2.5 rather than fine-tune an existing OpenAI or Anthropic model is notable: it signals that open-weight models have matured enough to serve as a competitive base for specialised coding applications at production scale.
The fine-tuning methodology uses reinforcement learning from code execution feedback — the model generates code, a sandboxed executor runs it, and the RL signal comes from whether tests pass, linters are satisfied, and the stated task objective is met. This is analogous to how DeepSeek-Coder-V2 and earlier SWE-agent approaches were trained, but applied to the specific ergonomics of Cursor's multi-file agent interface.
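Anysphere has not published its training code, so the following is only a rough sketch of what an execution-feedback reward signal could look like; every name and weight here is a hypothetical illustration:

```typescript
// Outcome of running the model's patch in a sandboxed executor.
interface ExecutionResult {
  testsPassed: number;
  testsTotal: number;
  lintErrors: number;
  taskObjectiveMet: boolean; // e.g. judged by a spec check or verifier model
}

// Hypothetical reward shaping: dense credit for partial test progress,
// a capped penalty for lint violations, and a bonus only when the
// stated task objective is met.
function reward(r: ExecutionResult): number {
  const testScore = r.testsTotal > 0 ? r.testsPassed / r.testsTotal : 0;
  const lintPenalty = Math.min(0.3, 0.05 * r.lintErrors);
  const objectiveBonus = r.taskObjectiveMet ? 0.5 : 0;
  return testScore - lintPenalty + objectiveBonus;
}
```

In an actual RL loop a scalar like this would score each sampled trajectory before a policy update; the point is that the signal comes from executing the code, not from imitating reference solutions.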
One key architectural innovation Anysphere describes is self-summarization for long-horizon tasks. When an agentic coding task exceeds the 200K token context window — for instance, when navigating a large monorepo — Composer 2 generates rolling summaries of previously visited files and decisions, compressing prior context into a structured scratchpad. This allows the agent to maintain coherent task state across a codebase that would otherwise overflow the context limit.
Task: "Migrate all fetch() calls in /src to use the new apiClient wrapper"
[Step 1] Scan /src — 47 files contain fetch(). Context: 12K tokens used.
[Step 2] Edit /src/api/users.ts — replace 3 fetch() calls. [DONE]
[Step 3] Edit /src/api/products.ts — replace 5 fetch() calls. [DONE]
...
[Step 22] Context budget 80% full — generate summary:
"Migrated 31/47 files. Remaining: /components/*, /hooks/*.
Pattern: always wrap with apiClient({ method, path, body }).
Edge case found: streaming responses use fetchStream() — leave as-is."
[Step 23] Resume from summary — continue with /components/*
This self-summarization loop is what distinguishes Composer 2 from simpler autocomplete or single-turn generation. The model is explicitly trained to maintain a task memory structure, not just respond to the current prompt in isolation.
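The loop in the trace above can be sketched as a rolling scratchpad that compresses per-file notes into a summary once a context budget threshold is crossed. The threshold, field names, and summarize callback below are illustrative assumptions, not Anysphere's published internals:

```typescript
interface ScratchpadEntry {
  file: string;
  note: string;
  tokens: number;
}

class TaskScratchpad {
  private entries: ScratchpadEntry[] = [];
  private summary = "";

  constructor(
    private budgetTokens: number,
    // Stand-in for a call back into the model to compress history.
    private summarize: (entries: ScratchpadEntry[], prior: string) => string,
  ) {}

  record(entry: ScratchpadEntry): void {
    this.entries.push(entry);
    if (this.usedTokens() > 0.8 * this.budgetTokens) {
      // Compress detailed per-file notes into one structured summary,
      // then drop the detail — this keeps long tasks within budget.
      this.summary = this.summarize(this.entries, this.summary);
      this.entries = [];
    }
  }

  usedTokens(): number {
    return this.entries.reduce((sum, e) => sum + e.tokens, 0);
  }

  // What the agent would carry forward as context: summary plus recent detail.
  context(): string {
    return [this.summary, ...this.entries.map((e) => `${e.file}: ${e.note}`)]
      .filter(Boolean)
      .join("\n");
  }
}
```

The design choice worth noting is that the summary is regenerated from both the prior summary and the fresh entries, so decisions and edge cases (like the fetchStream() exception in the trace) survive repeated compression rounds.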
Switching to Composer 2: Developer Guide
Composer 2 is available now for all Cursor users. Switching is a one-line settings change — it does not require a separate API key or plan upgrade for teams already on Cursor Pro or Business.
Enable Composer 2 in Cursor Settings
// cursor/settings.json — switch Composer agent model
{
"cursor.composer.model": "composer-2-standard",
// or "composer-2-fast" for lower latency, slightly lower bench scores
"cursor.composer.maxTokens": 200000,
"cursor.composer.enableSelfSummarization": true
}
When to Use Standard vs Fast
- Standard: Large refactors, cross-repo migrations, feature implementations that span 10+ files. Best quality at $0.50/M input.
- Fast: Quick fixes, single-file edits, test generation for known interfaces. Lower latency at a higher per-token price ($1.50/M input).
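For teams that route tasks programmatically, a heuristic along these lines matches the guidance above; the 10-file threshold is a hypothetical rule of thumb, not an official cutoff:

```typescript
type ComposerModel = "composer-2-standard" | "composer-2-fast";

// Hypothetical routing heuristic: broad multi-file work always goes to
// Standard; only small, latency-sensitive edits go to Fast.
function pickModel(filesTouched: number, latencySensitive: boolean): ComposerModel {
  if (filesTouched >= 10) return "composer-2-standard";
  return latencySensitive ? "composer-2-fast" : "composer-2-standard";
}
```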
Prompting Patterns That Work Well With Composer 2
- Be explicit about scope: "Migrate all fetch() in /src/api/ — leave /src/legacy/ unchanged" prevents the model from over-reaching.
- Specify constraints upfront: "Don't change function signatures — only the implementation" reduces hallucinated interface changes.
- Use task checkpoints: Ask Composer 2 to pause and summarise after every 10 file edits on large tasks — this surfaces misunderstandings early.
- Provide a test file: "Run npm test after each file — stop and ask if any test breaks" makes the RL-trained model's execution feedback loop work in your favour.
What This Means for the AI Coding Market
Composer 2's release continues a trend that has been accelerating since early 2026: the commoditisation of frontier coding capability. The cost curve for code generation has dropped faster than any other AI application category. Twelve months ago, running a frontier-quality coding agent required Claude Opus 3 at $15/M input tokens. Today, Composer 2 Standard achieves better benchmark scores at $0.50/M — a 30x cost reduction in one year.
For Anthropic and OpenAI, this represents a commoditisation threat to their highest-margin use case. Both companies have responded by emphasising reasoning, multimodality, and enterprise trust features that open-weight fine-tuned models cannot easily replicate. But for the dominant use case of writing, refactoring, and testing code in a monorepo, the gap between frontier and fine-tuned open-weight models has effectively closed on quality, while cost favours the latter by an order of magnitude.
Cursor now serves 1 million daily active developers and 50,000 businesses including Stripe, Figma, and Vercel. That distribution gives Composer 2 immediate production scale on day one — a feedback loop that generates training data for Composer 3 before any competitor can accumulate comparable usage data on a specialised coding agent workload.
5 Key Takeaways for Developers
1. Switch to Composer 2 Standard today. It outperforms Claude Opus 4.6 on coding tasks at 86% lower cost. For most multi-file refactoring and agent workflows, it is now the best price-performance option available in Cursor.
2. Open-weight MoE fine-tuning is a viable frontier strategy. Kimi K2.5 as a base proves that you don't need to train from scratch to reach frontier-level performance in a narrow domain. Expect more coding tools to follow this architecture.
3. Self-summarization is the key to large-codebase agents. The ability to maintain task memory beyond the context window distinguishes Composer 2 from simpler autocomplete tools. Structure your prompts to take advantage of this — provide explicit scope and checkpoint instructions.
4. RL from code execution feedback is the training frontier. The quality gap between general-purpose models and code-specific RL-trained models will widen. Watch for similar approaches from GitHub Copilot, JetBrains AI, and Amazon Q in 2026.
5. Budget for agentic coding at scale. At $0.50/M input tokens, teams can now run thousands of Composer agent sessions per day without significant cost concern. Build cost-tracking around output tokens, not input — output pricing at $2.50/M is where spend accumulates on long agentic tasks.
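To act on the budgeting point, a minimal spend tracker can attribute cost per session and flag output-heavy runs. Prices come from the comparison table; the 50% threshold and field names are illustrative:

```typescript
interface SessionUsage {
  inputTokens: number;
  outputTokens: number;
}

// Composer 2 Standard pricing from the comparison table above.
const INPUT_PER_M = 0.5;
const OUTPUT_PER_M = 2.5;

function sessionCost(u: SessionUsage): number {
  return (u.inputTokens / 1e6) * INPUT_PER_M + (u.outputTokens / 1e6) * OUTPUT_PER_M;
}

// Output tokens dominate spend on long agentic tasks; flag sessions where
// they account for most of the bill so they can be reviewed.
function outputDominated(u: SessionUsage): boolean {
  const outputCost = (u.outputTokens / 1e6) * OUTPUT_PER_M;
  return outputCost > 0.5 * sessionCost(u);
}
```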