What Cursor Didn't Say About Composer 2 (And What a Developer Found in the API)

The benchmark was innovative. The engineering was real. The model ID told a different story.


March 20, 2026. A developer named Fynn is poking at Cursor's API endpoint. He's not hunting for secrets. He's just debugging.

But the response comes back with a model identifier that isn't "Composer 2." It's `kimi-k2p5-rl-0317-s515-fast`.

He tweets it. 444,000 views.

The most carefully orchestrated model launch of the year (custom benchmarks, Pareto efficiency charts, a pricing strategy designed to undercut everyone) gets upstaged by the one string nobody thought to rename.



Act 1: The Setup

Eight days before Composer 2 launched, Cursor published something that deserved more attention than it got.

On March 11, the team released a detailed blog post introducing CursorBench, their internal evaluation framework for coding agents. Whatever you think about what happened later, this benchmark is genuinely innovative.

Why public benchmarks are failing

Public coding benchmarks are hitting a ceiling. Cursor's blog makes this case directly, and the argument is hard to dismiss:

What CursorBench does differently

CursorBench was built around real developer behavior. The key design choices:

For a benchmark designed to evaluate real-world coding agents, that's the right set of axes.

Performance on CursorBench scatter plot showing models plotted by score vs median tokens
Source: cursor.com/blog/cursorbench — "Performance on CursorBench." The x-axis is median tokens (fewer = more efficient). The y-axis is CursorBench score. Note the "token efficiency frontier" line. Composer 1.5 sits at ~44% with ~12K tokens. No Kimi K2.5 anywhere on the chart.
CursorBench vs Public Benchmarks comparison diagram
Image by Author

The training innovation

Then there's the engineering behind the model itself.

Cursor developed compaction-in-the-loop reinforcement learning. The idea: build context summarization directly into the RL training loop. When a generation hits a token-length trigger, the model pauses and compresses its own context to roughly 1,000 tokens (down from 5,000+ with traditional methods).

Because the RL reward covers the entire chain (including the summarization steps), the model learns which details matter and which to discard. The results, per Cursor's published research:

Credit where it's due: this is serious engineering work.
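The loop Cursor describes can be sketched roughly like this. It's a toy illustration of the idea only; every name below is a hypothetical stand-in, not Cursor's implementation, and the trigger and target sizes are the figures cited above.

```python
# Toy illustration of compaction-in-the-loop RL as described above.
# All names are hypothetical stand-ins, not Cursor's code.

TOKEN_TRIGGER = 5_000    # pause-and-compress threshold (illustrative)
COMPACT_TARGET = 1_000   # post-compaction context size cited in the post

def count_tokens(text: str) -> int:
    return len(text.split())  # crude whitespace proxy for a real tokenizer

def rollout(policy_step, summarize, prompt, is_done, max_steps=50):
    """One episode. Compaction happens inside the trajectory, so a reward
    computed over the whole chain also credits (or blames) the summaries."""
    context, trajectory = prompt, []
    for _ in range(max_steps):
        if count_tokens(context) > TOKEN_TRIGGER:
            context = summarize(context, COMPACT_TARGET)
            trajectory.append(("compact", context))
        action = policy_step(context)
        trajectory.append(("act", action))
        context += " " + action
        if is_done(action):
            break
    return trajectory

# Dummy policy: one huge edit, then finish.
steps = iter(["edit " * 6_000, "run tests", "done"])
traj = rollout(
    policy_step=lambda ctx: next(steps),
    summarize=lambda ctx, n: " ".join(ctx.split()[:n]),
    prompt="fix the failing test",
    is_done=lambda a: a == "done",
)
print([kind for kind, _ in traj])  # → ['act', 'compact', 'act', 'act']
```

The key design point is the second-to-last line of `rollout`: the summary step is appended to the same trajectory the reward scores, which is what lets the model learn what to keep.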

The launch

On March 19, Cursor put it all together. Composer 2 launched with a polished blog post, clean benchmark tables, and a pricing strategy that made the competitive landscape uncomfortable.

| Benchmark | Composer 1 | Composer 1.5 | Composer 2 |
|---|---|---|---|
| CursorBench | 38.0 | 44.2 | 61.3 |
| Terminal-Bench 2.0 | 40.0 | 47.9 | 61.7 |
| SWE-bench Multilingual | 56.9 | 65.9 | 73.7 |

On Terminal-Bench 2.0, Composer 2 beat Claude Opus 4.6's 58.0. It still trailed GPT-5.4's 75.1, but the pricing flipped the value equation: $0.50 per million input tokens, $2.50 per million output. That's 86% cheaper than Composer 1.5 and a fraction of what Anthropic and OpenAI charge.
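At those rates, the per-request economics are easy to sanity-check. The token split below is an illustrative assumption, not a measured value:

```python
# Composer 2's published per-token rates.
INPUT_RATE = 0.50 / 1_000_000   # $ per input token
OUTPUT_RATE = 2.50 / 1_000_000  # $ per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A hypothetical agent turn: 8K tokens in, 4K out.
print(f"${request_cost(8_000, 4_000):.4f}")  # → $0.0140
```

A cent and a half per turn is the kind of number that makes per-seat pricing work at scale.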

Performance vs Cost on CursorBench showing Composer 2 in optimal corner
Source: cursor.com/blog/composer-2 — "Performance vs. Cost on CursorBench." Composer 2 sits in the optimal corner: ~61% score at the lowest cost. Opus 4.6 (high) scores ~58% at roughly 5x the price. GPT-5.4 (high) scores ~63% but costs 3x more. The chart that launched a thousand tweets.

The blog post attributed the gains to "our first continued pretraining run, which provides a far stronger base to scale our reinforcement learning." Technically accurate. No base model was named.

The Pareto chart placed Composer 2 exactly where you'd want it: high performance, low cost, optimal quadrant. Compared against GPT-5.4 (various configurations), Claude Opus 4.6, and Cursor's own previous models.

One model was conspicuously absent from every comparison chart across both the CursorBench blog and the Composer 2 launch: Kimi K2.5. The model Composer 2 was built on top of.

You don't benchmark against yourself. That's understandable. But in hindsight, the omission completes a picture of a launch carefully designed to avoid any mention of where the foundation came from.

The choreography was meticulous:

Everything was scripted. Everything except the API response.


Act 2: The String, the Confrontation, and the Pivot

Less than 24 hours. That's how long the carefully constructed narrative held.

The discovery

On March 20, a developer named Fynn (@fynnso) was testing Cursor's OpenAI-compatible base URL when something unexpected appeared in the API response:

`accounts/anysphere/models/kimi-k2p5-rl-0317-s515-fast`

That's not a Cursor model name. Let's decode it:

- `accounts/anysphere`: Anysphere, the company behind Cursor
- `kimi-k2p5`: the base model, Kimi K2.5
- `rl`: a reinforcement-learning fine-tune
- `0317`: the checkpoint date, March 17
- `s515`: what looks like a serving or training-step configuration
- `fast`: the speed-optimized serving variant

The model's entire history (its origin, its training method, its version, its serving configuration) was sitting right there in a string that nobody thought to rename before shipping.
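Surfacing the string takes nothing exotic: any OpenAI-compatible endpoint exposes a `/v1/models` route whose entries carry an `id` field. A sketch with an illustrative payload (not Cursor's actual response):

```python
def extract_model_ids(payload: dict) -> list[str]:
    """Pull the `id` field from each entry of a /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

# Illustrative payload, shaped like a standard OpenAI-compatible model list.
sample = {
    "object": "list",
    "data": [
        {"object": "model",
         "id": "accounts/anysphere/models/kimi-k2p5-rl-0317-s515-fast"},
    ],
}
print(extract_model_ids(sample))  # the Kimi string, in plain sight
```

No authentication tricks, no reverse engineering. Just the standard model-listing route doing exactly what it's documented to do.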

Fynn's comment was dry and perfect: "at least rename the model ID."

The tweet hit 444,000 views. Then things escalated.

The Model ID Decoded diagram showing each segment of the API string
Image by Author

Moonshot reacts (Phase 1: surprise)

Yulun Du, Head of Pretraining at Moonshot AI, moved fast. He analyzed Composer 2's tokenizer and posted his findings publicly: it was identical to Kimi's. His conclusion: "almost certainly the result of further fine-tuning of our model."

Then he tagged Cursor co-founder Michael Truell directly, asking why Cursor wasn't respecting the license or paying fees.

Two other Moonshot employees confirmed the connection on social media. Then both deleted their posts. Internal communications, it seems, hadn't caught up with external communications.

This detail matters. It foreshadows what came next.

The license problem

Kimi K2.5 is released under a Modified MIT License. Standard MIT is one of the most permissive open-source licenses available. Moonshot added one critical clause:

If the Software (or any derivative works thereof) is used for any of your commercial products or services that have more than 100 million monthly active users, or more than 20 million US dollars in monthly revenue, you shall prominently display "Kimi K2.5" on the user interface of such product or service.

The math on Cursor: roughly $2 billion in annual revenue works out to well over $160 million per month, about eight times the license's $20 million monthly threshold.

The license was written precisely for this scenario. Moonshot anticipated that companies would fine-tune Kimi K2.5 and ship it under their own brand. The attribution clause was designed to ensure the origin model still gets credit, even after modification.
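The clause reduces to a simple predicate. Thresholds come from the license text quoted above; Cursor's monthly revenue here is derived from the roughly $2 billion annual figure cited later in this piece:

```python
def attribution_required(monthly_active_users: int,
                         monthly_revenue_usd: float) -> bool:
    """Modified-MIT trigger: >100M MAU or >$20M monthly revenue."""
    return (monthly_active_users > 100_000_000
            or monthly_revenue_usd > 20_000_000)

# Cursor: ~$2B annual revenue, i.e. roughly $167M per month.
print(attribution_required(0, 2_000_000_000 / 12))  # → True
```

By either reading of the revenue figure, the attribution obligation is triggered many times over.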

Cursor's interface said "Composer 2." No Kimi anywhere.

Cursor responds (gradually)

Hours of silence.

Then Lee Robinson, a developer at Cursor, posted the first acknowledgment on X. The progression tells its own story:

  1. The soft open. "Yep, Composer 2 started from an open-source base!" Followed by the claim that only about a quarter of the compute came from the base model. The rest came from Cursor's own training.
  2. The explicit naming (under visible pressure). "Since people really want me to say this: KIMI K2.5!!" With double exclamation marks. He then claimed compliance through "inference partner terms" with Fireworks AI, the platform hosting the RL and inference infrastructure.
  3. The technical defense. The RL work was real. The continued pretraining was real. The evals are very different from the base model. All of which is likely true, and none of which explains why the base model wasn't named in the blog post.

The community noticed the sequence: first minimize, then acknowledge only when forced. Not a great look for a company that had plenty of good news to share.

The engineering was strong. The benchmark results were legitimate. Naming the base model would have cost Cursor nothing technically. It might have actually strengthened the narrative: "We took the best open-source coding model in the world and made it dramatically better." That's a compelling story.

Instead, the omission created a trust gap that the internet filled with the worst possible interpretation.

Moonshot reacts (Phase 2: the pivot)

And then came the twist nobody expected.

Kimi's official account posted a congratulatory message to Cursor. The tone was warm, supportive, and completely at odds with what Moonshot's pretraining lead had posted hours earlier:

"Congrats to the @cursor_ai team on the launch of Composer 2! We are proud to see Kimi-k2.5 provide the foundation."

The statement confirmed that Cursor accesses Kimi K2.5 through Fireworks AI's hosted RL and inference platform as part of an authorized commercial partnership.

Read that again. Authorized commercial partnership.

The deal existed the whole time. Moonshot's own pretraining lead didn't know about it when he posted his accusation. Internal coordination failure on both sides: Cursor didn't disclose the base model publicly, and Moonshot's commercial team apparently didn't brief their technical leadership about the partnership before the launch.

The 24-Hour Timeline from celebration to accusation to congratulations
Image by Author

From accusation to congratulations in under 24 hours. The drama was real. The underlying business relationship was fine. The communication on all sides was a mess.


Act 3: The Bigger Picture

A forgotten model ID string is a good story. But it's not the real story.

The real story is that this pattern is everywhere, and Cursor just happened to be the one that got caught on a Wednesday night.

Everyone's building on Chinese open-source models

Cursor isn't an outlier. It's the norm.

The numbers tell the same story from a different angle. On OpenRouter, four of the five most-used models globally are Chinese. Chinese model usage exceeded US model usage for two consecutive weeks in early March 2026. On the agent leaderboard, Kimi, GLM, and Qwen all rank in the top tier.

Chamath Palihapitiya, founder of Social Capital, said it plainly: "We have started using Kimi-K2 on Groq. Although the models of OpenAI and Anthropic perform well, they are simply too expensive."

Performance plus cost. That's the entire explanation. Nobody's making this choice for ideological reasons. They're making it because Chinese open-source models are competitive at the frontier and priced for volume.

What's Under the Hood: AI coding tools and their base models
Image by Author

Why nobody wants to say it out loud

If the engineering rationale is so straightforward, why is every company so allergic to naming the base model? Three reasons:

Investor optics. Cursor is valued at $29.3 billion. That valuation is partly built on the narrative of proprietary AI capability. Saying "we took an open-source model and made it better" is a different pitch than "we built our own model." Both are valid. One sounds more like a $29 billion company.

Geopolitical awkwardness. "Powered by Chinese AI" is not a talking point any American company wants on its landing page. Not in the current political climate. The fact that six of ten top Japanese AI models are also based on DeepSeek or Qwen shows this isn't a US-only discomfort. Nobody wants to explain this to their enterprise customers.

The value proposition shift. If the base model is open-source and anyone can fine-tune it, the competitive moat isn't the model. It's the UX, the agent workflow, the tool integration, the RL training pipeline. That's a harder story to tell than "we built our own AI." Cursor's compaction-in-the-loop RL, their Cursor Blame evaluation system, their co-designed training infrastructure: that's real differentiation. But it takes paragraphs to explain, not a tagline.

The irony is that transparency would actually strengthen Cursor's position. "We took the best open-weight coding model available, invested 75% of our compute budget in continued pretraining and RL, and built a product that beats Claude Opus 4.6 at one-tenth the price." That's a story worth telling. It positions the engineering as the value, not the base weights.

The license question that won't go away

Cursor claims compliance through "inference partner terms" with Fireworks AI. The argument seems to be that the license obligation is satisfied at the infrastructure level rather than in Cursor's own user interface.

Whether that interpretation holds is an open question. The license says "prominently display 'Kimi K2.5' on the user interface of such product or service." Cursor's user interface says "Composer 2." A reasonable person could read those two facts and see a gap.

This matters beyond one company. Open-weight model licensing is still being tested in real-world, high-stakes situations. If Moonshot doesn't enforce the attribution clause against a company generating $2 billion in annual revenue from their model, the clause becomes decorative. Every future open-weight license with a similar provision loses credibility.

The precedent cuts both ways:

The Moonshot pivot from accusation to congratulations suggests this will resolve quietly. That's probably the right business outcome. But it's the wrong signal for the open-source ecosystem.


Recap

Three takeaways:

Maybe the real benchmark isn't CursorBench or Terminal-Bench. It's whether your model ID survives a curious developer with an API debugger on a Wednesday night.


Credits & Further Reading

Han HELOIR YAN, Ph.D. · 2026