What Cursor Didn't Say About Composer 2 (And What a Developer Found in the API)

The benchmark was innovative. The engineering was real. The model ID told a different story.

March 20, 2026. A developer named Fynn is poking at Cursor's API endpoint. He's not hunting for secrets. He's just debugging.

But the response comes back with a model identifier that isn't "Composer 2." It's kimi-k2p5-rl-0317-s515-fast.

He tweets it. 444,000 views.

The most carefully orchestrated model launch of the year (custom benchmarks, Pareto efficiency charts, a pricing strategy designed to undercut everyone) gets upstaged by the one string nobody thought to rename.

Before We Start!

If you find this analysis useful, a clap goes a long way ✍️
I write about AI infrastructure, model architecture, and the business of building with AI. Follow for more.

TL;DR

Cursor shipped Composer 2 on March 19 as a breakthrough coding model. It was built on top of Kimi K2.5 from Moonshot AI, with substantial continued pretraining and RL on top.
A developer found the original model ID in the API response within 24 hours. The internet did the rest.
Moonshot AI accused Cursor of violating the Kimi K2.5 license. Then confirmed it was an authorized partnership. Same news cycle.
The real engineering (CursorBench, compaction-in-the-loop RL, 5x token efficiency gains) is genuinely impressive. The communication around provenance was not.
The bigger pattern: the entire AI coding industry is building on Chinese open-source models and struggling to talk about it openly.

Act 1: The Setup

Eight days before Composer 2 launched, Cursor published something that deserved more attention than it got.

On March 11, the team released a detailed blog post introducing CursorBench, their internal evaluation framework for coding agents. Whatever you think about what happened later, this benchmark is genuinely innovative.

Why public benchmarks are failing

Public coding benchmarks are hitting a ceiling. Cursor's blog makes this case directly, and the argument is hard to dismiss:

Contamination. SWE-bench Verified, Pro, and Multilingual all draw tasks from public repositories that end up in training data. OpenAI recently stopped reporting SWE-bench Verified results entirely after finding that frontier models could reproduce gold patches from memory. Nearly 60% of unsolved problems had flawed tests.
Misalignment. Most SWE benchmarks still focus on bug-fixing tasks with detailed, overly specific issue descriptions. That's not how developers actually talk to coding agents. You don't write a three-paragraph GitHub issue when you want your agent to refactor a module. You type "clean this up" and expect it to understand.
Saturation. At frontier levels, models like Haiku can match or exceed GPT-5 on some public benchmarks. The benchmarks no longer distinguish between models that developers experience as meaningfully different.

What CursorBench does differently

CursorBench was built around real developer behavior. The key design choices:

Task sourcing from production. Tasks come from real Cursor sessions using Cursor Blame, which traces committed code back to the agent request that produced it. Natural query-to-solution pairs. No synthetic problems.
Intentionally ambiguous prompts. Task descriptions are short and underspecified, matching how developers actually talk to agents. The opposite of SWE-bench's detailed GitHub issues.
Bigger scope. CursorBench-3 tasks involve substantially more lines of code and span more files than anything in the SWE-bench family. Problem scope has roughly doubled since the first version.
Token efficiency as a first-class metric. The chart plots correctness against median completion tokens, capturing the compute/latency tradeoff that affects usability. The optimal corner isn't "highest score." It's "highest score at the lowest token cost."
Four evaluation dimensions. Solution correctness, code quality, efficiency, and interaction behavior. Not just "did it pass the test."

For a benchmark designed to evaluate real-world coding agents, that's the right set of axes.

Performance on CursorBench scatter plot showing models plotted by score vs median tokens — Source: cursor.com/blog/cursorbench — "Performance on CursorBench." The x-axis is median tokens (fewer = more efficient). The y-axis is CursorBench score. Note the "token efficiency frontier" line. Composer 1.5 sits at ~44% with ~12K tokens. No Kimi K2.5 anywhere on the chart.

CursorBench vs Public Benchmarks comparison diagram — Image by Author

The training innovation

Then there's the engineering behind the model itself.

Cursor developed compaction-in-the-loop reinforcement learning. The idea: build context summarization directly into the RL training loop. When a generation hits a token-length trigger, the model pauses and compresses its own context to roughly 1,000 tokens (down from 5,000+ with traditional methods).

Because the RL reward covers the entire chain (including the summarization steps), the model learns which details matter and which to discard. The results, per Cursor's published research:

50% reduction in compaction errors
~5x more token-efficient summaries compared to standard prompt-based approaches
Enables project-scale refactors requiring hundreds of sequential actions without losing the original goal

Credit where it's due: this is serious engineering work.

The launch

On March 19, Cursor put it all together. Composer 2 launched with a polished blog post, clean benchmark tables, and a pricing strategy that made the competitive landscape uncomfortable.

Benchmark	Composer 1	Composer 1.5	Composer 2
CursorBench	38.0	44.2	61.3
Terminal-Bench 2.0	40.0	47.9	61.7
SWE-bench Multilingual	56.9	65.9	73.7

On Terminal-Bench 2.0, Composer 2 beat Claude Opus 4.6's 58.0. It still trailed GPT-5.4's 75.1, but the pricing flipped the value equation: $0.50 per million input tokens, $2.50 per million output. That's 86% cheaper than Composer 1.5 and a fraction of what Anthropic and OpenAI charge.

Performance vs Cost on CursorBench showing Composer 2 in optimal corner — Source: cursor.com/blog/composer-2 — "Performance vs. Cost on CursorBench." Composer 2 sits in the optimal corner: ~61% score at the lowest cost. Opus 4.6 (high) scores ~58% at roughly 5x the price. GPT-5.4 (high) scores ~63% but costs 3x more. The chart that launched a thousand tweets.

The blog post attributed the gains to "our first continued pretraining run, which provides a far stronger base to scale our reinforcement learning." Technically accurate. No base model was named.

The Pareto chart placed Composer 2 exactly where you'd want it: high performance, low cost, optimal quadrant. Compared against GPT-5.4 (various configurations), Claude Opus 4.6, and Cursor's own previous models.

One model was conspicuously absent from every comparison chart across both the CursorBench blog and the Composer 2 launch: Kimi K2.5. The model Composer 2 was built on top of.

You don't benchmark against yourself. That's understandable. But in hindsight, the omission completes a picture of a launch carefully designed to avoid any mention of where the foundation came from.

The choreography was meticulous:

Custom benchmark published eight days early
Internal evaluation framework with a novel efficiency axis
Pareto efficiency argument positioning Composer 2 in the optimal corner
Aggressive pricing narrative undercutting every competitor
Zero references to any base model, anywhere

Everything was scripted. Everything except the API response.

Act 2: The String, the Confrontation, and the Pivot

Less than 24 hours. That's how long the carefully constructed narrative held.

The discovery

On March 20, a developer named Fynn (@fynnso) was testing Cursor's OpenAI-compatible base URL when something unexpected appeared in the API response:

accounts/anysphere/models/kimi-k2p5-rl-0317-s515-fast

That's not a Cursor model name. Let's decode it:

kimi-k2p5 : Kimi K2.5, Moonshot AI's open-weight model
rl : reinforcement learning (the post-training method)
0317 : March 17, 2026 (likely the training date)
s515 : internal experiment identifier
fast : the optimized serving variant

The entire history of the model, its origin, its training method, its version, its serving configuration, sitting right there in a string that nobody thought to rename before shipping.

Fynn's comment was dry and perfect: "at least rename the model ID."

The tweet hit 444,000 views. Then things escalated.

The Model ID Decoded diagram showing each segment of the API string — Image by Author

Moonshot reacts (Phase 1: surprise)

Yulun Du, Head of Pretraining at Moonshot AI, moved fast. He ran Composer 2's tokenizer through analysis and posted his findings publicly: the tokenizer was identical to Kimi's. His conclusion: "almost certainly the result of further fine-tuning of our model."

Then he tagged Cursor co-founder Michael Truell directly, asking why Cursor wasn't respecting the license or paying fees.

Two other Moonshot employees confirmed the connection on social media. Then both deleted their posts. Internal communications, it seems, hadn't caught up with external communications.

This detail matters. It foreshadows what came next.

The license problem

Kimi K2.5 is released under a Modified MIT License. Standard MIT is one of the most permissive open-source licenses available. Moonshot added one critical clause:

If the Software (or any derivative works thereof) is used for any of your commercial products or services that have more than 100 million monthly active users, or more than 20 million US dollars in monthly revenue, you shall prominently display "Kimi K2.5" on the user interface of such product or service.

The math on Cursor:

Annual recurring revenue: $2 billion+
Monthly revenue: ~$167 million
Revenue threshold in license: $20 million/month
How far over the threshold: roughly 8x
"Kimi K2.5" displayed on Cursor's UI: nowhere

The license was written precisely for this scenario. Moonshot anticipated that companies would fine-tune Kimi K2.5 and ship it under their own brand. The attribution clause was designed to ensure the origin model still gets credit, even after modification.

Cursor's interface said "Composer 2." No Kimi anywhere.

Cursor responds (gradually)

Hours of silence.

Then Lee Robinson, a developer at Cursor, posted the first acknowledgment on X. The progression tells its own story:

The soft open. "Yep, Composer 2 started from an open-source base!" Followed by the claim that only about a quarter of the compute came from the base model. The rest came from Cursor's own training.
The explicit naming (under visible pressure). "Since people really want me to say this: KIMI K2.5." With double exclamation marks. He then claimed compliance through "inference partner terms" with Fireworks AI, the platform hosting the RL and inference infrastructure.
The technical defense. The RL work was real. The continued pretraining was real. The evals are very different from the base model. All of which is likely true, and none of which explains why the base model wasn't named in the blog post.

The community noticed the sequence: first minimize, then acknowledge only when forced. Not a great look for a company that had plenty of good news to share.

The engineering was strong. The benchmark results were legitimate. Naming the base model would have cost Cursor nothing technically. It might have actually strengthened the narrative: "We took the best open-source coding model in the world and made it dramatically better." That's a compelling story.

Instead, the omission created a trust gap that the internet filled with the worst possible interpretation.

Moonshot reacts (Phase 2: the pivot)

And then came the twist nobody expected.

Kimi's official account posted a congratulatory message to Cursor. The tone was warm, supportive, and completely at odds with what Moonshot's pretraining lead had posted hours earlier:

"Congrats to the @cursor_ai team on the launch of Composer 2! We are proud to see Kimi-k2.5 provide the foundation."

The statement confirmed that Cursor accesses Kimi K2.5 through Fireworks AI's hosted RL and inference platform as part of an authorized commercial partnership.

Read that again. Authorized commercial partnership.

The deal existed the whole time. Moonshot's own pretraining lead didn't know about it when he posted his accusation. Internal coordination failure on both sides: Cursor didn't disclose the base model publicly, and Moonshot's commercial team apparently didn't brief their technical leadership about the partnership before the launch.

The 24-Hour Timeline from celebration to accusation to congratulations — Image by Author

From accusation to congratulations in under 24 hours. The drama was real. The underlying business relationship was fine. The communication on all sides was a mess.

Act 3: The Bigger Picture

A forgotten model ID string is a good story. But it's not the real story.

The real story is that this pattern is everywhere, and Cursor just happened to be the one that got caught on a Wednesday night.

Everyone's building on Chinese open-source models

Cursor isn't an outlier. It's the norm.

Windsurf admitted it's fine-tuning a customized version of GLM-4.6 from Zhipu AI.
Vercel integrated GLM-4.6 into its official API services.
Cerebras, one of the largest US inference chip companies, announced it would promote GLM-4.6 as a primary model.
Together AI deployed Alibaba's Qwen-3-Coder.

The numbers tell the same story from a different angle. On OpenRouter, four of the five most-used models globally are Chinese. Chinese model usage exceeded US model usage for two consecutive weeks in early March 2026. On the agent leaderboard, Kimi, GLM, and Qwen all rank in the top tier.

Chamath Palihapitiya, founder of Social Capital, said it plainly: "We have started using Kimi-K2 on Groq. Although the models of OpenAI and Anthropic perform well, they are simply too expensive."

Performance plus cost. That's the entire explanation. Nobody's making this choice for ideological reasons. They're making it because Chinese open-source models are competitive at the frontier and priced for volume.

What's Under the Hood: AI coding tools and their base models — Image by Author

Why nobody wants to say it out loud

If the engineering rationale is so straightforward, why is every company so allergic to naming the base model? Three reasons:

Investor optics. Cursor is valued at $29.3 billion. That valuation is partly built on the narrative of proprietary AI capability. Saying "we took an open-source model and made it better" is a different pitch than "we built our own model." Both are valid. One sounds more like a $29 billion company.

Geopolitical awkwardness. "Powered by Chinese AI" is not a talking point any American company wants on its landing page. Not in the current political climate. The fact that six of ten top Japanese AI models are also based on DeepSeek or Qwen shows this isn't a US-only discomfort. Nobody wants to explain this to their enterprise customers.

The value proposition shift. If the base model is open-source and anyone can fine-tune it, the competitive moat isn't the model. It's the UX, the agent workflow, the tool integration, the RL training pipeline. That's a harder story to tell than "we built our own AI." Cursor's compaction-in-the-loop RL, their Cursor Blame evaluation system, their co-designed training infrastructure: that's real differentiation. But it takes paragraphs to explain, not a tagline.

The irony is that transparency would actually strengthen Cursor's position. "We took the best open-weight coding model available, invested 75% of our compute budget in continued pretraining and RL, and built a product that beats Claude Opus 4.6 at one-tenth the price." That's a story worth telling. It positions the engineering as the value, not the base weights.

The license question that won't go away

Cursor claims compliance through "inference partner terms" with Fireworks AI. The argument seems to be that the license obligation is satisfied at the infrastructure level rather than in Cursor's own user interface.

Whether that interpretation holds is an open question. The license says "prominently display 'Kimi K2.5' on the user interface of such product or service." Cursor's user interface says "Composer 2." A reasonable person could read those two facts and see a gap.

This matters beyond one company. Open-weight model licensing is still being tested in real-world, high-stakes situations. If Moonshot doesn't enforce the attribution clause against a company generating $2 billion in annual revenue from their model, the clause becomes decorative. Every future open-weight license with a similar provision loses credibility.

The precedent cuts both ways:

If Cursor complies and adds attribution, it validates the Modified MIT approach and gives other open-source model creators confidence that their terms will be respected.
If Cursor argues that sufficient transformation creates a new work, it could undermine every open-weight license in the AI industry.
If Moonshot quietly lets it go (which the congratulatory tweet suggests is likely), the message to future model creators is clear: write whatever you want in your license, but don't expect a $29 billion company to follow it.

The Moonshot pivot from accusation to congratulations suggests this will resolve quietly. That's probably the right business outcome. But it's the wrong signal for the open-source ecosystem.

Recap

Three takeaways:

CursorBench and the RL innovations are real. Token-efficient evaluation, compaction-in-the-loop training, and a 50% reduction in context loss represent genuine engineering contributions. Cursor had plenty to be proud of without obscuring the provenance.
The attribution gap was self-inflicted. The partnership was authorized. The engineering added real value. The drama came entirely from the distance between what the marketing said and what the API revealed.
This is an industry pattern, not a single incident. Chinese open-source models now power a significant share of Silicon Valley's most valuable coding tools. The companies that figure out how to talk about this honestly will build more trust than those that bury it in an API string.

Maybe the real benchmark isn't CursorBench or Terminal-Bench. It's whether your model ID survives a curious developer with an API debugger on a Wednesday night.

Credits & Further Reading

Cursor's CursorBench blog (March 11, 2026): The benchmark design, Cursor Blame methodology, and the case against public benchmarks.
Cursor's Composer 2 announcement (March 19, 2026): Benchmark results, pricing, and the continued pretraining + RL approach.
@fynnso's discovery tweet (March 20, 2026): The model ID that started it all. 444K views and counting.
Kimi's official response (March 20, 2026): Confirming the authorized partnership through Fireworks AI. Collected by Simon Willison.
VentureBeat's coverage (March 19, 2026): The most balanced reporting on Composer 2's benchmarks, pricing, and competitive positioning.
Kimi K2.5 Modified MIT License: The full license text with the attribution clause that triggered the controversy. Available on the Kimi K2.5 repository.

Han HELOIR YAN, Ph.D. · 2026