@srush_nlp: Talk: Training Composer https://youtube.com/watch?v=uTgqYeVxy2c… Overview of the methods that we use at Cursor to build…

X AI KOLs Timeline 05/21/26, 07:05 PM Models

cursor composer-2 code-generation agentic-programming reinforcement-learning pretraining benchmark

Summary

Cursor shared the training methods behind its in-house coding model, Composer 2: large-scale continued pretraining, long-range reinforcement learning, and an internal benchmark, CursorBench. Together these bring the model’s coding performance to a top-tier level.

Original Article

Cached at: 05/22/26, 01:56 PM

Talk: Training Composer https://youtube.com/watch?v=uTgqYeVxy2c… Overview of the methods we use at Cursor to build our model.

TL;DR: Cursor’s in-house coding model, Composer 2, achieves top-tier programming agent performance through large-scale continued pretraining, long-range reinforcement learning, and the internal CursorBench benchmark, while making breakthroughs in cost, speed, and real-world task performance.

Composer 2 Overview

Composer 2 is the model Cursor built for agentic coding. It is highly capable: at launch its performance matches Opus 4.6 and sits just below GPT-5.4. It also offers top-tier coding speed at a competitive per-token price, which makes it efficient for a wide range of coding tasks.

Over the past year, Cursor has been developing coding models. The first-generation model, Composer 1, aimed to provide a strong interactive assistance experience. Users are shifting from manual, tab-by-tab coding toward more agentic workflows, and this shift has accelerated significantly in recent months.

When we initially designed the model, the goal was to handle this kind of coding task: a user sends a request, and the model calls a series of tools (read files, edit files, search the codebase, collect lint errors, run terminal commands). The agent loops through these tools until the change is complete.
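The tool loop described above can be sketched roughly as follows. This is a minimal illustration, not Cursor's implementation: the `Action` type, the `run_agent` function, the tool names, and the scripted policy are all hypothetical stand-ins for a real model and real tools.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Action:
    """One step chosen by the policy: a tool call, or a final answer."""
    tool: Optional[str] = None  # None means the agent is finished
    argument: str = ""
    final_answer: str = ""


def run_agent(
    request: str,
    policy: Callable[[List[str]], Action],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 50,
) -> str:
    """Ask the policy for an action, run the tool, append the observation, repeat."""
    transcript = [f"user: {request}"]
    for _ in range(max_steps):
        action = policy(transcript)
        if action.tool is None:
            return action.final_answer
        if action.tool not in tools:
            observation = f"error: unknown tool {action.tool!r}"
        else:
            observation = tools[action.tool](action.argument)
        transcript.append(f"{action.tool}({action.argument!r}) -> {observation}")
    return "stopped: step budget exhausted"


# Toy usage: an in-memory "file system" and a scripted (hypothetical) policy.
files = {"app.py": "print('helo')"}


def read_file(path: str) -> str:
    return files.get(path, "error: no such file")


def edit_file(spec: str) -> str:
    path, _, new_text = spec.partition("::")
    files[path] = new_text
    return "ok"


def scripted_policy(transcript: List[str]) -> Action:
    tool_calls_so_far = len(transcript) - 1
    if tool_calls_so_far == 0:
        return Action(tool="read_file", argument="app.py")
    if tool_calls_so_far == 1:
        return Action(tool="edit_file", argument="app.py::print('hello')")
    return Action(final_answer="fixed typo in app.py")


result = run_agent(
    "fix the typo",
    scripted_policy,
    {"read_file": read_file, "edit_file": edit_file},
)
```

In a real agent the policy would be the model itself, and the transcript would be the model's context window rather than a list of strings.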

Over the past few months, we’ve moved from single code changes to full software engineering. Composer 2 is built exactly for that world — a programming agent. For a typical Cursor developer, the agent can write nearly 100% of the code. They spend their time on non-coding activities: breaking down problems, reviewing outputs, running tests, and giving feedback. At the same time, people want to launch multiple agents at once without watching them constantly — just letting them do the job.

With that goal, we set three concrete sub-goals:

The agent has broad and deep understanding of code.
It can complete difficult tasks.
It performs well on the real tasks that people actually care about.

Knowledge: Large-Scale Continued Pretraining

We wanted to build an agent with broad, deep code knowledge, so we introduced a large-scale continued pretraining phase. The purpose of this extra phase is to boost the system’s core knowledge — not the knowledge of a general chatbot, but knowledge specific to the type of coding we need.

Base Model Choice

We used Kimi K2.5 as the base model. It is a mixture-of-experts model with 1 trillion parameters, of which 32 billion are activated per token. It has 61 layers and a context length of 256K, which is very impressive for an open-source model. It also uses Multi-Head Latent Attention, which makes it easier to serve.

When deciding on the base model, we tested several open-source models on different benchmarks, on internal code, and on other properties. Many open-source models are quite strong. We ultimately chose Kimi K2.5 over the others, mainly because of infrastructure factors and how well it fits some of the systems we had already developed.

Training Stages

We ran a relatively standard but large-scale pipeline:

Short-context pretraining stage: Train on a huge number of tokens to make the model more knowledgeable on the topics we want.
Long-context extension stage: Provide more data at the 256K token level.
Final supervised fine-tuning stage: Use agent data that is closer to practical usage.

The continued pretraining chart shows loss decreasing over time. We also validated that this stage is necessary: three variants with different amounts of pretraining (small, medium, large) reached different negative log-likelihood values, and after a standard RL stage the three systems achieved different rewards. Being able to run experiments like this lets us see the benefit of pretraining directly.

Long-Range Reinforcement Learning

The next major stage of the system is long-range RL, aimed at building an agent model that can truly run and complete very difficult tasks. RL simulates as closely as possible the queries users actually run in Cursor. Besides improving intelligence, it also allows us to tune behavior to maximize user experience.

Collecting Problems

The problems we handle (left chart) target a wide range of real-world programming issues: iterating on features, debugging code, adding new features. But increasingly, they include parts of the software development process — tasks that involve documentation, migrations, managing different project structures. The difficulty varies widely, and finding problems hard enough to truly challenge the model is becoming increasingly difficult.

Auto-Install

This is one of my favorite parts of the system, showing how developing a model can improve future versions of the model itself. We use the previous model, Composer 1.5, in two stages:

Stage 1: Explore the repo and read the documentation, propose 10 install commands, then write tests that verify whether those commands worked.
Stage 2: Composer installs the environment: it runs uv setup and any other commands that might be needed, simulates dependencies, retries packages that failed to install the first time, and ensures the tests pass.

If the environment passes the verification tests from Stage 1, it is used for RL training.
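The two stages above could be wired together along these lines. This is a sketch under stated assumptions: `InstallPlan`, `build_environment`, and the injected `run` callable are hypothetical names, the retry count is arbitrary, and the example commands (`uv sync`, `pytest -q`) are only plausible choices, not the ones the talk describes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class InstallPlan:
    """Output of stage 1 (hypothetical structure)."""
    commands: List[str]  # candidate install commands proposed by the model
    checks: List[str]    # verification commands, i.e. the stage 1 tests


def build_environment(
    plan: InstallPlan,
    run: Callable[[str], bool],  # runs one command, returns True on success
    max_attempts: int = 2,
) -> bool:
    """Stage 2: run each command, retrying failures, then run every check.

    The environment is accepted only if all verification checks pass.
    """
    for command in plan.commands:
        for _ in range(max_attempts):
            if run(command):
                break
    return all(run(check) for check in plan.checks)


# Toy usage with a fake runner: "uv sync" fails once, then succeeds.
attempts = {}


def fake_run(command: str) -> bool:
    attempts[command] = attempts.get(command, 0) + 1
    if command == "uv sync":
        return attempts[command] >= 2
    return True


plan = InstallPlan(commands=["uv sync"], checks=["pytest -q"])
usable = build_environment(plan, fake_run)
```

Injecting `run` keeps the acceptance logic separate from however commands are actually executed (for example, inside a sandboxed container).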

RL Process

How it works: Starting from a previously set problem, we run several different rollouts. Each rollout contains a simulated environment where the agent is given the same task as in Cursor and tries to solve it. The infrastructure challenge is huge — each rollout can involve 200K tokens and hundreds of tool calls. For each problem, we run many such rollouts.

Among the results, some rollouts are better (they solve the problem) and some fail. We identify good rollouts for the model to learn from, and bad ones for it to avoid. Besides scoring on benchmarks, we also consider behavioral characteristics and invest significant effort in designing rewards to shape user experience.
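The talk does not name the RL algorithm, so the following is only one common way to turn "better" and "worse" rollouts into a learning signal: score each rollout relative to the other rollouts for the same problem, as in group-relative (GRPO-style) baselines. The function name and the choice of standard-deviation normalization are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Advantage of each rollout relative to its group (same problem).

    Positive values mark rollouts to reinforce, negative values mark
    rollouts to push away from. If every rollout gets the same reward,
    all advantages are zero and the problem carries no signal.
    """
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps
    return [(r - baseline) / scale for r in rewards]


# Four rollouts on one problem: two solved it (reward 1), two failed (reward 0).
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Behavioral reward terms, such as the ones the paragraph above mentions, would be added to each rollout's reward before this normalization.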

One interesting aspect is deciding how many tokens the model uses. We explored various penalties and eventually settled on a nonlinear length penalty (bottom-left chart). The idea is that the penalty bites hardest at short lengths, which pushes the model to be efficient: if the problem is easy, we want Composer 2 to solve it quickly and move on. If the problem is very challenging, we want it to spend more time finding the correct answer, so the marginal penalty decreases as thinking time grows.
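One reading of the nonlinear penalty described above is a concave function of length, where each extra token costs less the longer the rollout already is. The logarithmic form, the coefficient, and the scale below are assumptions chosen for illustration; the talk does not give the actual formula.

```python
import math


def length_penalty(num_tokens: int, coef: float = 0.1, scale: float = 1000.0) -> float:
    """Concave penalty: grows with length, but ever more slowly."""
    return coef * math.log1p(num_tokens / scale)


def shaped_reward(task_reward: float, num_tokens: int) -> float:
    """Task reward minus the length penalty (hypothetical combination)."""
    return task_reward - length_penalty(num_tokens)


# Cost of 1,000 extra tokens early in a rollout vs. late in a long one.
early_cost = length_penalty(2_000) - length_penalty(1_000)
late_cost = length_penalty(101_000) - length_penalty(100_000)
```

With these settings the early extra tokens cost far more than the late ones, so short rollouts on easy problems are pushed to be terse while long rollouts on hard problems are barely discouraged from continuing.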

Self-Summarization

To encourage the model to solve very long problems, we developed a self-summarization system that lets the model work beyond its length limit. For instance, the first example ran for over 200K tokens. When the model hits a trigger point, it is asked to summarize what it has done so far. That summary is passed to the agent for the next segment; the agent runs again, possibly summarizes again, and continues. During RL, all of these segments are treated as leading to the same final reward. This trains the model to handle effectively unbounded task lengths while limiting the length actually used per segment. It is very effective for tackling very hard problems.
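The summarize-and-continue loop and the shared final reward can be sketched as below. The `Segment` type, the `work` and `summarize` callables, and the prompt format are hypothetical; in practice both callables would be the model itself, and the trigger would be a token budget rather than a flag.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    context: str        # what the agent saw at the start of this segment
    output: str         # what the agent produced in this segment
    reward: float = 0.0


def run_with_summaries(
    task: str,
    work: Callable[[str], Tuple[str, bool]],  # context -> (output, done)
    summarize: Callable[[str, str], str],     # (context, output) -> summary
    max_segments: int = 8,
) -> List[Segment]:
    """Run segment by segment, carrying only a summary across boundaries."""
    segments: List[Segment] = []
    context = task
    for _ in range(max_segments):
        output, done = work(context)
        segments.append(Segment(context, output))
        if done:
            break
        summary = summarize(context, output)
        context = f"{task}\n\nSummary of progress so far:\n{summary}"
    return segments


def assign_final_reward(segments: List[Segment], final_reward: float) -> None:
    """Every segment in the chain is credited with the same final reward."""
    for segment in segments:
        segment.reward = final_reward


# Toy usage: the task finishes in the third segment.
calls = {"n": 0}


def toy_work(context: str) -> Tuple[str, bool]:
    calls["n"] += 1
    return f"step {calls['n']}", calls["n"] == 3


def toy_summarize(context: str, output: str) -> str:
    return f"finished {output}"


segments = run_with_summaries("migrate the build", toy_work, toy_summarize)
assign_final_reward(segments, 1.0)
```

Because the summary is produced by the same policy that is being trained, crediting it with the final reward also trains the model to write summaries that are useful to its later self.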

Training Effect

When running RL, we typically see metrics improve continuously over training time. The figure uses a log scale for training: over many steps of RL on many problems, accuracy rises consistently. Academic literature sometimes criticizes RL for potentially improving single-attempt performance while harming diversity. We don’t see that in our approach. Performance measured over the best of 16 solutions also improves over time, which shows the model does not over-concentrate on any single solution.
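A common way to measure "best of k" performance is the pass@k metric: the probability that at least one of k sampled solutions is correct. The sketch below uses the standard unbiased estimator computed from n samples of which c are correct; whether the talk's best-of-16 number was computed exactly this way is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones.

    Equals 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n samples contains no correct solution.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# One correct solution out of 16 samples on a single problem.
single_attempt = pass_at_k(n=16, c=1, k=1)
best_of_16 = pass_at_k(n=16, c=1, k=16)
```

If RL collapsed the model onto a single solution style, pass@1 could rise while pass@16 stalled or fell; tracking both over training is what distinguishes the two cases.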

Internal Benchmark: CursorBench

Our goal is not just to improve arbitrary benchmarks, but to perform well on the real tasks people use Composer for every day. To drive that, we built an internal evaluation system. The goal is to have queries and goal-oriented solutions that more closely resemble what software engineers actually do in practice.

This is the third iteration of the CursorBench project. It contains a fairly large number of solutions, which have grown longer and more difficult over time. The left chart shows the average lines of code changed per example and the number of files involved. The goal is to be uncontaminated, cover different use cases, and evaluate more than whether tests pass. We use code from Cursor’s software engineers, derived from real problems developers actually ask.

Characteristics of CursorBench

Compared to some well-known public benchmarks:

Larger code diffs: Hundreds of lines across multiple files, not just a few lines.
Shorter problem descriptions: The right chart shows that the average problem description length (what the agent is asked to do) is much shorter than other benchmarks. At first glance this might seem like a disadvantage because there’s less specification. But we consider it an advantage — it’s closer to what agents must solve in practice. Resolving ambiguity or guessing user intent is a core part of the model’s job.

Example

A real example from CursorBench: In the problem statement at the bottom, some words are not capitalized, there are odd punctuation marks, multiple files are referenced, and multiple logs are mentioned. It asks to look at DataDog logs. In practice, it’s common to ask the model to do very hard things and expect it to figure things out, but many public benchmarks don’t include such poorly structured problems.

CursorBench differentiates models better than some other benchmarks. For example, the gap between a model we consider very good (like Opus 4.6) and a model we consider weaker or from a previous generation (like Sonnet 4.5) is much larger on CursorBench. On SWE-bench, many models score similarly highly, possibly just because they have been well-targeted at that type of problem over the past few years.

Using CursorBench to Understand Model Performance

We see improvements from Composer 1.5 to Composer 2. We can also plot other large labs’ models on these performance charts. One thing we found very helpful internally is plotting not only by score but also by average completion token length (how many tokens were actually used to find the answer). For models like GPT-5.4, the ability to answer questions varies considerably with the number of tokens used. The chart shows the medium, high, and very high versions of the model, whose scores on these benchmarks improve as they use more tokens.
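When models are compared on both score and average completion tokens, a useful derived view is the set of models that no other model beats on both axes at once. The sketch below computes that frontier; the `Result` type is hypothetical, and the model names and numbers are placeholders, not results from the talk.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str
    avg_tokens: float  # average completion tokens per task (lower is cheaper)
    score: float       # benchmark score (higher is better)


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Keep results that no other result beats on both tokens and score."""
    frontier: List[Result] = []
    best_score = float("-inf")
    # Sort by token cost; at equal cost, consider the higher score first.
    for result in sorted(results, key=lambda r: (r.avg_tokens, -r.score)):
        if result.score > best_score:
            frontier.append(result)
            best_score = result.score
    return frontier


# Placeholder data for illustration only.
results = [
    Result("model-a", 10_000, 0.40),
    Result("model-b", 20_000, 0.35),
    Result("model-c", 30_000, 0.55),
    Result("model-d", 30_000, 0.50),
]
frontier_names = [r.name for r in pareto_frontier(results)]
```

Here `model-b` is excluded because `model-a` scores higher with fewer tokens, and `model-d` is excluded because `model-c` scores higher at the same token cost.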

Future

Composer 2.5 is in the process of being released and will launch in the near future. We are also building models that scale to larger training clusters, training Composer 3 and future versions using SpaceX’s cluster. Coding models are already very capable, but there’s still a lot of work to do. Larger pretraining and better RL will continue to improve performance.

A quick preview: Composer 2.5 will be even stronger on terminal command scores and other similar benchmarks, thanks to optimizations in the RL process and the continued pretraining process.

Source

Video: Talk: Training Composer by @srush_nlp (https://www.youtube.com/watch?v=uTgqYeVxy2c)