SEQUOR: A Multi-Turn Benchmark for Realistic Constraint Following

arXiv cs.CL Papers

Summary

This paper introduces Sequor, a new benchmark for evaluating how well AI models follow constraints in long, multi-turn conversations. It highlights that current models struggle significantly with maintaining instruction adherence over extended interactions.

arXiv:2605.06353v1 Announce Type: new Abstract: In a conversation, a helpful assistant must reliably follow user directives, even as they refine, modify, or contradict earlier requests. Yet most instruction-following benchmarks focus on single-turn or short multi-turn scenarios, leaving open how well models handle long-horizon instruction-following tasks. To bridge this gap, we present SEQUOR, an automatic benchmark for evaluating constraint adherence in long multi-turn conversations. SEQUOR consists of simulated persona-driven interactions built with constraints extracted from real-world conversations. Our results show that even when following a single constraint, instruction-following accuracy consistently decreases as the conversation grows longer, with drops exceeding 11%. This decline becomes larger when models have to follow multiple constraints simultaneously, reducing their accuracy by over 40%. In scenarios where constraints are added or replaced at arbitrary points of the conversation, model accuracy decreases by more than 9%. Taken together, our results reveal that current models still struggle to follow user instructions in multi-turn conversations, and provide a way for better measuring instruction-following capabilities in assistants.
Original Article
View Cached Full Text

Cached at: 05/08/26, 07:53 AM

# A Multi-Turn Benchmark for Realistic Constraint Following
Source: [https://arxiv.org/html/2605.06353](https://arxiv.org/html/2605.06353)
Beatriz Canaverde1,2, Duarte M\. Alves1,2, José Pombal1,2,3, Giuseppe Attanasio1&André F\. T\. Martins1,2,4,5

1Instituto de Telecomunicações,2Instituto Superior Técnico, Universidade de Lisboa 3Sword Health,4TransPerfect,5ELLIS Unit Lisbon beatriz\.canaverde@tecnico\.ulisboa\.pt

###### Abstract

In a conversation, a helpful assistant must reliably follow user directives, even as they refine, modify, or contradict earlier requests\. Yet most instruction\-following benchmarks focus on single\-turn or short multi\-turn scenarios, leaving open how well models handle long\-horizon instruction\-following tasks\. To bridge this gap, we presentSequor, an automatic benchmark for evaluating constraint adherence in long multi\-turn conversations\.Sequorconsists of simulated persona\-driven interactions built with constraints extracted from real\-world conversations\. Our results show that even when following a single constraint, instruction\-following accuracy consistently decreases as the conversation grows longer, with drops exceeding11%11\\%\. This decline becomes larger when models have to follow multiple constraints simultaneously, reducing their accuracy by over40%40\\%\. In scenarios where constraints are added or replaced at arbitrary points of the conversation, model accuracy decreases by more than9%9\\%\. Taken together, our results reveal that current models still struggle to follow user instructions in multi\-turn conversations, and provide a way for better measuring instruction\-following capabilities in assistants\.111Code and data are available at:[https://github\.com/BeatrizCanaverde/SEQUOR](https://github.com/BeatrizCanaverde/SEQUOR)

## 1Introduction

Digital assistants must consistently adhere to user instructions across the full course of a conversation\. Because users often refine, modify, or even contradict earlier requests\(Zheng et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib37); Zhao et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib35); Bai et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib1); Chiang et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib2); Laban et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib18)\), this skill requires following instructions that may change over many turns\. This makes evaluation challenging, as current instruction\-following benchmarks evaluate models in single\-turn or short multi\-turn settings, using either programmatically verifiable or LLM\-generated instructions that are not representative of real\-world use cases\(Zheng et al\.,[2023](https://arxiv.org/html/2605.06353#bib.bib36); Zhou et al\.,[2023](https://arxiv.org/html/2605.06353#bib.bib38); Qin et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib25); He et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib11); Kwan et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib17); Jiang et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib13); Dussolle et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib5); Pyatkin et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib24); Xia et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib33); Bai et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib1); Jiang et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib13); Deshpande et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib4)\)\. This context raises the question of how robustly modern large language models \(LLMs\) follow instructions in open\-domain interactions that span multiple turns\.

![Refer to caption](https://arxiv.org/html/2605.06353v1/x1.png)Figure 1:Example snippet of a conversation fromSequor\.To bridge this gap, we presentSequor,222Sequoris a Latin verb meaning “I follow\.”an automatic benchmark that measures instruction\-following capabilities in multi\-turn open\-domain conversations\.Sequoris grounded in two core principles\. First, constraints must be realistic, broadly applicable, challenging, and verifiable \(§[2](https://arxiv.org/html/2605.06353#S2)\)\. Second, interactions must span many turns, allowing constraints to accumulate or be replaced in a credible way \(§[3](https://arxiv.org/html/2605.06353#S3); see[Figure1](https://arxiv.org/html/2605.06353#S1.F1)\)\. Accordingly,Sequorcomprises simulated persona\-driven interactions built from constraints extracted from real\-world conversations\. It systematically varies how and when constraints are introduced, from initial constraints that remain unchanged over the conversation, to constraints that are incrementally added or replaced over time\. As a result, it captures a broad range of long\-horizon instruction\-following scenarios\.

We evaluated several modern LLMs onSequor, revealing consistent limitations in long\-horizon instruction\-following \(§[4](https://arxiv.org/html/2605.06353#S4)\)\. Across all regimes, constraint\-following accuracy degrades as conversations become longer\. Even when following a single constraint, accuracy drops by more than11%11\\%between the first and last turns\. The decline is substantially larger, exceeding38%38\\%, when multiple constraints need to be satisfied simultaneously, and is most pronounced, with losses above40%40\\%, when constraints are introduced sequentially rather than all at once\. Resetting constraints mid\-conversation allows models to recover their initial performance, although accuracy declines more rapidly afterward\. Finally, when constraints are randomly added or replaced at arbitrary points in the conversation, accuracy decreases by more than9%9\\%\. Taken together, these results show that current models still struggle to reliably follow user directives over long multi\-turn interactions\.

Our main contributions are summarized as follows:

- •We propose an automated pipeline for extracting and curating broadly applicable, non\-trivial, and objectively verifiable constraints from real\-world conversations \(§[2](https://arxiv.org/html/2605.06353#S2)\)\.333Accompanying our benchmark, we will also release the curated pool of1,4461\{,\}446realistic constraints\.
- •We introduceSequor, a multi\-turn benchmark for constraint\-following, comprising1,4001,400conversations of5050turns each \(§[3](https://arxiv.org/html/2605.06353#S3)\)\. It systematically varies how constraints are introduced in a conversation, ranging from static initial constraints to incremental addition and replacement over time\.
- •We empirically demonstrate that current LLMs experience substantial degradation in constraint adherence as the number of turns increases and constraints accumulate \(§[4](https://arxiv.org/html/2605.06353#S4)\)\.

## 2Collecting Realistic Constraints in the Wild

To evaluate instruction\-following, we test whether an assistant adheres to constraints shaping its output’s form, style, or structure\. We collect constraints from real\-world conversational data and automatically filter them using heuristics and LLMs\-as\-judges\(Zheng et al\.,[2023](https://arxiv.org/html/2605.06353#bib.bib36)\)\. Our pipeline, shown in[Figure2](https://arxiv.org/html/2605.06353#S2.F2), produced a pool of1,4461\{,\}446realistic constraints\.

![Refer to caption](https://arxiv.org/html/2605.06353v1/x2.png)Figure 2:Pipeline to collect constraints from real\-world conversations\.#### Extracting constraints\.

Starting from lmsys\-chat\-1m\(Zheng et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib37)\), a dataset of real\-world user\-assistant conversations,444It contains11M conversations collected from Chatbot Arena and Vicuna demo \(April\-August20232023\)\.we prompt Qwen3\-Next\-80B\-A3B\-Instruct\-FP8\(Yang et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib34); Team,[2025](https://arxiv.org/html/2605.06353#bib.bib29)\)with each English conversation to extract all constraints expressed in the user prompts\. FollowingQin et al\. \([2024](https://arxiv.org/html/2605.06353#bib.bib25)\), we categorize the constraints into four main categories: linguistic guidelines, style rules, format specifications, and number limitations\.555Qin et al\. \([2024](https://arxiv.org/html/2605.06353#bib.bib25)\)consider an additional fifth category, content constraints, which define the topics or details that should be addressed in the LLM’s response\. We exclude this category from our experiments because: 1\) we found them harder to extract from conversations, and 2\) they are less aligned with our goal of collecting constraints that are broadly applicable across diverse contexts\.

#### Automatic filtering\.

Using the Datatrove library\(Penedo et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib23)\), we remove non\-English constraints with the fastText language identification model\(Joulin et al\.,[2016a](https://arxiv.org/html/2605.06353#bib.bib14);[b](https://arxiv.org/html/2605.06353#bib.bib15)\), discarding all entries with a confidence score below0\.650\.65\. We then remove similar constraints using MinHash deduplication with5050buckets,44hashes per bucket, and33\-grams\. Finally, we exclude constraints containing words from the English subset of a predefined list of bad words,666[https://github\.com/LDNOOBW/List\-of\-Dirty\-Naughty\-Obscene\-and\-Otherwise\-Bad\-Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words)or sequences of characters from a custom list\.777Custom list: sex, porn, nud

#### Ensuring constraints are satisfiable\.

We retain only constraints that are satisfiable across multiple contexts\. For example,“Answer in at most100100words\.”is broadly applicable, whereas“Your answer must include a Python function definition\.”is only meaningful for programming\-related tasks\. To identify satisfiable constraints, we pair each one with100100randomly sampled tasks, and evaluate each pair using various judges and the rubrics presented in[Figure 3](https://arxiv.org/html/2605.06353#S2.F3)\. A constraint is satisfiable for a given task if a judge assigns positive scores to rubrics1,3, and4, and a negative response to rubric2\. To pass this filter, a constraint must be deemed satisfiable by every judge in at least70%70\\%of the analyzed contexts\.

Rubrics used to identify non\-satisfiable constraints1\.Is the constraint actually a restriction or condition that limits how the model should generate its output to the task?2\.Does the constraint target a different question, topic, or domain than the task itself?3\.Is the constraint applicable to the type of output the task requires?4\.Does the constraint fall into one of the following four categories: linguistic guidelines, style rules, format specifications, or number limitations?Figure 3:Rubrics used by LLM judges to identify non\-satisfiable constraints\. The complete prompt template, including the definitions of the constraint categories, is shown in[Figure9](https://arxiv.org/html/2605.06353#A1.F9)\.![Refer to caption](https://arxiv.org/html/2605.06353v1/x3.png)Figure 4:Sequorsimulates persona\-driven interactions, varying how constraints are introduced across five systematic regimes\.
#### Avoiding trivially satisfiable constraints\.

Although we prioritize broadly applicable constraints, we exclude ones that are likely to be satisfied even when not explicitly specified in the prompt\. For example,“Answer in proper English\.”is typically followed by most language models when responding in English\. To identify such trivial constraints, we sample model responses to100100tasks without specifying any constraint and test whether the constraint is nevertheless satisfied\. A constraint is non\-trivial if each judge classifies at least70%70\\%of the responses as not satisfying the constraint\.

#### Removing subjective constraints\.

Some constraints are subjective and may lead to inconsistent evaluations across judges \(e\.g\.,“Write a creative response\.”\)\. To identify and remove such cases, we pair each constraint with100100tasks and sample model responses to these constraint\-task pairs\. We then use multiple judges to independently assess whether the constraint has been followed in each response\. A constraint is non\-subjective if all judges agree on the binary judgment in at least70%70\\%of the evaluated task contexts\.

For constraint assessment, we use three judges: GPT\-oss\-120B\(OpenAI et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib22)\), Qwen3\-235B\-A22B\-Instruct\-2507\-FP8\(Team,[2025](https://arxiv.org/html/2605.06353#bib.bib29)\), and GLM\-4\.7\-FP8\(Team et al\.,[2025b](https://arxiv.org/html/2605.06353#bib.bib28)\)\. Model responses are generated using four smaller LLMs: Qwen3\-4B\-Instruct\-2507\(Team,[2025](https://arxiv.org/html/2605.06353#bib.bib29)\), Llama\-3\.2\-3B\-Instruct\(Grattafiori et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib9)\), Gemma3\-4B\(Team et al\.,[2025a](https://arxiv.org/html/2605.06353#bib.bib27)\), and Olmo\-3\-7B\-Instruct\(Olmo et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib20)\)\. The70%70\\%threshold balances robustness to contextual variability and constraint diversity\. Further analysis and prompt templates are in[AppendixA](https://arxiv.org/html/2605.06353#A1)\.

## 3SEQUOR: Simulating and Evaluating Multi\-Turn Conversations

From our pool of realistic constraints, we constructSequor\. InSequor, user turns are generated from persona profiles and the extracted constraints, and the assistant turns are then evaluated using LLMs\-as\-a\-Judge\.

### 3\.1Simulating Conversations

Sequorconsists of sequences of user turns that form multi\-turn conversations with an assistant\. Each turn specifies a task—an action or goal for the assistant to perform—and optionally updates the constraints the assistant must follow\. To emulate realistic conditions, we design five test scenarios, use persona profiles, and control for conflicting constraints\.

#### Test sets\.

Sequorincludes five test sets built from a fixed collection of user\-turn sequences, differing only in the constraints provided to the assistant \(see[Figure4](https://arxiv.org/html/2605.06353#S2.F4)\)\. For each test set, the constraints are randomly sampled from our pool and introduced at specific turns using predefined templates \(see §[D](https://arxiv.org/html/2605.06353#A4)\)\. The five sets are defined as follows:

- •Single\.One constraint is given in the first turn and must be followed thereafter\.
- •Tuples\.Three constraints are given in the first turn and must be followed thereafter\.
- •Replace\.A constraint is given in the first turn and replaced everyxxturns\. Each constraint must be followed until it is replaced\. We considerx=5x=5andx=10x=10\.
- •Add\.A constraint is given in the first turn, and additional ones are added everyxxturns, up to a maximum of three\. Constraints accumulate; once introduced, they must be followed thereafter\. We considerx=5x=5andx=10x=10\.
- •Everything\.A mixture of the previous regimes\. After a random number of turns \(between11and55\), up to three constraints are given, randomly accumulating with or replacing earlier ones\.

#### Tasks\.

We design tasks to simulate interactions between diverse personas and an assistant\. We first sample persona profiles from Persona Hub\(Ge et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib6)\)and use Qwen3\-Next\-80B\-A3B\-Instruct\-FP8\(Yang et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib34); Team,[2025](https://arxiv.org/html/2605.06353#bib.bib29)\)to generate a sequence of daily activities tailored to each persona’s profession, interests, and lifestyle\. Then, given a persona and an activity, the same model generates open\-ended questions that the persona might naturally ask an assistant in that scenario\. This process yields ordered sequences of questions that simulate a natural flow of interactions\. See[Figure1](https://arxiv.org/html/2605.06353#S1.F1)for an example\. Prompt templates are given in Appendix[E](https://arxiv.org/html/2605.06353#A5)\. The final dataset contains200200personas, each with5050associated tasks\.

#### Tuples of constraints\.

For evaluation scenarios in which the assistant must satisfy multiple constraints simultaneously, we must identify tuples of compatible constraints \(e\.g\.,“Write your answer entirely in capital letters\.”vs\.“Write your answer entirely in lowercase letters\.”are not compatible\)\. To do so, we first sample tuples of three constraints from our pool and pair them with100100tasks\. Then, we apply two filtering criteria to ensure the tuples contain compatible constraints\. First, we only retain tuples for which all judges agree that its constraints are non\-conflicting for70%70\\%of the tasks\. Second, after sampling one response for each tuple\-task pair, we only retain tuples for which all judges agree that at least one answer satisfies all constraints for70%70\\%of the tasks\. We use three judges: GPT\-oss\-120B\(OpenAI et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib22)\), Qwen3\-235B\-A22B\-Instruct\-2507\-FP8\(Team,[2025](https://arxiv.org/html/2605.06353#bib.bib29)\), and GLM\-4\.7\-FP8\(Team et al\.,[2025b](https://arxiv.org/html/2605.06353#bib.bib28)\)\. Responses are sampled from GPT\-oss\-20B\(OpenAI et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib22)\), Llama\-3\.1\-8B\-Instruct\(Grattafiori et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib9)\), Gemma3\-27B\(Team et al\.,[2025a](https://arxiv.org/html/2605.06353#bib.bib27)\), and Olmo\-3\.1\-32B\-Instruct\(Olmo et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib20)\)\. Prompt templates are provided in[AppendixB](https://arxiv.org/html/2605.06353#A2)\.

### 3\.2Evaluation with LLM\-as\-a\-Judge

All assistant responses inSequorare evaluated independently on a turn\-by\-turn basis using an LLM\-as\-a\-judge\(Zheng et al\.,[2023](https://arxiv.org/html/2605.06353#bib.bib36); Gu et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib10)\)\. For each turn, the judge determines whether the response satisfies each active constraint individually\. We verify that models can reliably judge constraint adherence\.

#### Data\.

We pair500500constraints and500500tasks randomly sampled from our pool\. For each pair, we prompt proprietary models to generate two answers in a single\-turn setup: \(i\) one answer explicitly instructed to satisfy the constraint, and \(ii\) one answer explicitly instructed to violate the constraint\. We manually inspect a subset to verify that answers labeled as satisfying, or violating, the constraint behave as intended\. We treat these as gold responses\.

#### Evaluation\.

We evaluate four candidate judges by asking them to determine whether each answer satisfies its associated constraint \(prompt template in[Figure11](https://arxiv.org/html/2605.06353#A1.F11)\)\. We report the percentage of correct and incorrect classifications for: \(i\) gold answers satisfying constraints \(Gold: Yes\), \(ii\) gold answers violating constraints \(Gold: No\), and \(iii\) overall\.

#### Models\.

Gold responses are generated using Gemini 3 Flash Preview\(Google,[2025](https://arxiv.org/html/2605.06353#bib.bib7)\)888At the time of testing, this model ranked among the top\-performing systems on the IFBench leaderboard:[https://artificialanalysis\.ai/evaluations/ifbench](https://artificialanalysis.ai/evaluations/ifbench)\(accessed February 25, 2026\)\.and GPT\-5\.2\(OpenAI,[2025](https://arxiv.org/html/2605.06353#bib.bib21)\), producing two independent gold sets\. We compare the following models as judges: Qwen3\-235B\-A22B\-Instruct\-2507\-FP8, Qwen3\-235B\-A22B\-Thinking\-2507\-FP8\(Team,[2025](https://arxiv.org/html/2605.06353#bib.bib29)\), GPT\-oss\-120B\(OpenAI et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib22)\), and GLM\-4\.7\-FP8\(Team et al\.,[2025b](https://arxiv.org/html/2605.06353#bib.bib28)\)\.

Gold: YesGold: NoOverallModel✓✗?✓✗?✓✗?Qwen3\-235B\-A22B\-Inst\.84\.9015\.10098\.101\.700\.2091\.508\.400\.10Qwen3\-235B\-A22B\-Think\.87\.9011\.900\.2098\.401\.60093\.156\.750\.10GPT\-oss\-120B88\.3011\.500\.2098\.800\.700\.5093\.556\.100\.35GLM\-4\.7\-FP891\.108\.90098\.601\.300\.1094\.855\.100\.05Table 1:Percentage of correct \(✓\), incorrect \(✗\), and unextractable \(?\) verdicts of candidate judges in detecting constraint adherence\. “Gold: Yes” and “Gold: No” denote gold responses that satisfy or violate constraints, respectively; “Overall” reflects the full evaluation set\.
#### Results\.

[Table1](https://arxiv.org/html/2605.06353#S3.T1)reports the judges’ performance averaged across the two gold sets generated by Gemini 3 Flash Preview and GPT\-5\.2\. For all models, detecting violations is easier than confirming correct adherence, and the rate of unextractable verdicts is minimal\. GLM\-4\.7\-FP8 performs best on “Gold: Yes” and overall\. However, we select GPT\-oss\-120B for our main experiments \(§[4](https://arxiv.org/html/2605.06353#S4)\) as it offers competitive performance with greater efficiency\.

## 4Experiments

Sequorevaluates models’ ability to follow constraints throughout multi\-turn conversations\. Accordingly, we consider all model responses in a conversation and assess uniquely whether they satisfy all constraints active at each turn\.

### 4\.1Experimental Setup

#### Metrics\.

A turn is considered successful if the model’s response satisfies all constraints active at that turn\. Our main metric isper\-turn accuracy, defined as the percentage of successful responses at each turn, averaged across conversations\.

#### Evaluated Models\.

We benchmark1010open\-weight models from different families and sizes: Qwen3\-4B\-Inst, Qwen3\-30B\-A3B\-Inst, Qwen3\-235B\-A22B\-Inst999We use the Instruct\-2507 versions of the models, and FP8 quantization for the largest two\.\(Team,[2025](https://arxiv.org/html/2605.06353#bib.bib29)\), Gemma3 4B, 12B, and 27B\(Team et al\.,[2025a](https://arxiv.org/html/2605.06353#bib.bib27)\), Llama\-3\.3\-70B\-Inst\(Grattafiori et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib9)\), GPT\-oss 20B and 120B\(OpenAI et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib22)\), and GLM\-4\.7\-Flash\(Team et al\.,[2025b](https://arxiv.org/html/2605.06353#bib.bib28)\)\. We also report results for the proprietary Gemini 3\.1 Flash Lite\(Google,[2026](https://arxiv.org/html/2605.06353#bib.bib8)\)\. All models are run with their default configurations\. We implement a sliding\-window approach when the conversation history exceeds a model’s context length\.

### 4\.2Regime Analysis

We analyze how constraint adherence evolves throughout conversations for each regime\.[Figure5](https://arxiv.org/html/2605.06353#S4.F5)shows the per\-turn accuracy averaged across all models, as well as results for the best\-performing model; shaded areas indicate the95%95\\%confidence intervals\.[Figure6](https://arxiv.org/html/2605.06353#S4.F6)illustrates the drop in accuracy between the first and last turns for all regimes and models\.Add 5, 10andReplace 5, 10denote the number of turns between each addition or replacement\.

![Refer to caption](https://arxiv.org/html/2605.06353v1/x4.png)Figure 5:Per\-turn accuracy across regimes\. Shaded regions indicate95%95\\%bootstrap confidence intervals\.![Refer to caption](https://arxiv.org/html/2605.06353v1/x5.png)Figure 6:Change in per\-turn accuracy from turn 1 to turn 50 across regimes\. Each gray line corresponds to a model, while the colored lines show the average across models\.![Refer to caption](https://arxiv.org/html/2605.06353v1/x6.png)Figure 7:Per\-turn accuracy for all models in theEverythingregime\.#### Performance degrades consistently as conversations progress across all regimes\.

Even in the simplest regime, where a single constraint is introduced in the first turn and must be followed in all subsequent turns, we observe a consistent decrease in performance over time\. The average accuracy in theSingleregime drops by26%26\\%from the first to the last turn\.

#### Models struggle more to follow multiple constraints simultaneously, particularly when these are introduced sequentially over time\.

Among all regimes,TuplesandAddexhibit the largest performance drops, with average decreases of38%38\\%and63%63\\%, respectively \(see[Figure6](https://arxiv.org/html/2605.06353#S4.F6)\)\.Tuplesstarts with the lowest average accuracy and remains the most challenging setting during the first1010turns \(see[Figure5](https://arxiv.org/html/2605.06353#S4.F5)\)\. At turn1111,Add 5becomes the lowest\-performing regime, andAdd 10becomes the worst at turn2121; both shifts coincide with the introduction of a third constraint\. In theAddregime, accuracy drops noticeably whenever new constraints are presented \(turns66,1111inAdd 5and turns1111,2121inAdd 10; see[Figure5](https://arxiv.org/html/2605.06353#S4.F5)\)\. These results suggest that models struggle more when constraints accumulate over time than when provided all at once, likely due to difficulty retaining earlier constraints or reasoning about their integration\.

#### Models tend to recover their initial performance when existing constraints are replaced with new ones\.

This is evidenced by the sharp accuracy spikes observed inReplace 5andReplace 10\(see[Figure5](https://arxiv.org/html/2605.06353#S4.F5)\), which occur when constraints are replaced \(e\.g\., turns1111,2121,3131, and4141inReplace 10\)\. At these points, models’ average accuracy consistently returns to the high levels observed in the first turn\. Interestingly, after each replacement, subsequent turns often exhibit sharper performance declines than those following the first turn\.

#### Randomly adding or replacing constraints at arbitrary points in a conversation results in an average performance drop of27%27\\%from the first to the last turn\.

TheEverythingregime combines all others, balancing easier scenarios \(e\.g\., resetting constraints\) with more challenging ones \(e\.g\., accumulating multiple constraints\)\. Consequently, its behavior lies between the easier and harder regimes\. Although models occasionally show small improvements between consecutive turns—particularly after constraints are reset—accuracy declines substantially throughout the conversations\.

#### Gemini 3\.1 Flash Lite, the best\-performing model, shows similar overall trends and significant performance losses\.

To better understand the upper bound of the evaluated systems, we analyze the behavior of the best\-performing model across regimes, identified using the Borda count ranking\(Colombo et al\.,[2022](https://arxiv.org/html/2605.06353#bib.bib3)\)\. Gemini’s overall trends remain consistent with those observed across models\. In theSingle,Replace, andEverythingregimes, accuracy declines by up to12%12\\%over the course of the conversations, although the model still achieves relatively strong results at turn5050\(see[Figure5](https://arxiv.org/html/2605.06353#S4.F5)\)\. In contrast, the more challengingTuplesandAddregimes exhibit much larger losses of25%25\\%and40%40\\%, respectively, ending with accuracies below60%60\\%at turn5050\.

### 4\.3Model Breakdown Analysis

In[Figure7](https://arxiv.org/html/2605.06353#S4.F7), we examine the performance of all models throughout the conversations in theEverythingregime, measured every five turns\. Gemini 3\.1 Flash Lite and Qwen3\-235B\-A22B\-Inst are the only models that sustain accuracy above70%70\\%across all turns, although they still experience drops of9%9\\%and13%13\\%, respectively, between the first and last turns\. In contrast, most other models fall below60%60\\%accuracy at turn5050\. Notably, in this regime, potential context\-length limitations should not cause degradation in later turns\. Although earlier parts of a conversation may fall outside the context window, the constraints active at each turn are given within at most the previous1515turns\. These results suggest that maintaining constraint adherence over long interactions remains challenging for current models\.

## 5Related Work

### 5\.1Constraint Following Evaluation

A large body of work evaluates language models ability to follow explicit output constraints\. IFEval\(Zhou et al\.,[2023](https://arxiv.org/html/2605.06353#bib.bib38)\)introduced manually designed, programmatically verifiable constraints for single\-turn evaluation\. Subsequent work expanded this framework to multilingual and limited multi\-turn settings while preserving rule\-based verification\(He et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib11); Dussolle et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib5); Pyatkin et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib24)\)\. Although this approach enables precise and reproducible evaluation, it restricts constraints to those that can be automatically checked\. InFoBench\(Qin et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib25)\), FOFO\(Xia et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib33)\), and FollowBench\(Jiang et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib13)\)relax deterministic verification and instead rely on LLM\-as\-a\-judge evaluation\. Consequently, they introduce more diverse constraints, using manually written or LLM\-generated templates\. These settings, however, remain limited to single\-turn interactions\.

Constraint following has also been studied in moderately longer multi\-turn settings\. MT\-Eval\(Kwan et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib17)\)and MultiChallenge\(Deshpande et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib4)\)extend prior work to conversations of1010–1212turns, explicitly evaluating whether models can retain, accumulate, and apply constraints across dialogue turns\. MemoryCode\(Rakotonirina et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib26)\)further scales this paradigm to substantially longer multi\-session coding dialogues \(up to4040K tokens\)\. It is, however, restricted to the coding domain and relies on manually crafted, programmatically verifiable constraints\.

Sequor, in constrast, evaluates constraint adherence in conversations of5050turns, systematically varying how constraints are introduced\. It uses constraints derived from real\-world datasets, allowing the evaluation of instruction\-following in open\-domain interactions\.

### 5\.2Multi\-Turn Evaluation

Beyond constraint following, several benchmarks evaluate general multi\-turn conversational abilities\. MT\-Bench\(Zheng et al\.,[2023](https://arxiv.org/html/2605.06353#bib.bib36)\)introduced a two\-turn evaluation framework based on LLM judges without reference answers—an evaluation paradigm we also adopt\. Building on this,Bai et al\. \([2024](https://arxiv.org/html/2605.06353#bib.bib1)\)propose MT\-Bench\-101 to evaluate more complex, real\-world dialogue phenomena over interactions of up to66turns, assessing1313fine\-grained abilities categorized under perceptivity, adaptability, and interactivity\.

Other benchmarks explore interactive and feedback\-driven settings\. MINT\(Wang et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib31)\), Meeseeks\(Wang et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib30)\), and related work\(Laban et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib18); Kim et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib16)\)assess models under iterative feedback, self\-correction loops, dynamically revealed information, and follow\-up questioning\. In contrast,Sequordoes not provide feedback or allow response revision\. Instead, it evaluates whether models consistently follow evolving constraints without external guidance\.

Finally, long\-context benchmarks such as LoCoMo\(Maharana et al\.,[2024](https://arxiv.org/html/2605.06353#bib.bib19)\), LongMemEval\(Wu et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib32)\), and PersonaMem\(Jiang et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib12)\)focus on long\-term memory, persona adaptation, and long\-range reasoning in extended dialogues\. While forSequorwe chose conversations with5050turns, this number is configurable, allowing control for the context length as models progressively support longer context sizes\.

## 6Conclusion

We introducedSequor, an automatic benchmark for evaluating constraint adherence in long multi\-turn conversations\. It consists of simulated persona\-driven interactions built with constraints extracted from real\-world conversations, systematically varying how constraints are introduced\. Our experiments reveal key limitations of current LLMs\. Instruction\-following accuracy declines as conversations grow longer, even when following a single constraint\. The degradation becomes substantially larger when multiple constraints must be followed simultaneously and are introduced sequentially over time\. We hopeSequorserves as a valuable testbed for studying long\-horizon instruction\-following\. Future work may extend it to additional languages, modalities, and longer interactions\.

## Acknowledgments

We thank Miguel Moura Ramos for his help, constructive feedback, and technical assistance on the paper\. This work was supported by the project DECOLLAGE \(ERC\-2022\-CoG 101088763\), by the Portuguese Recovery and Resilience Plan through project C645008882\-00000055 \(Center for Responsible AI\), and by FCT/MECI through national funds and when applicable co\-funded EU funds under UID/50008: Instituto de Telecomunicações\.

## References

- Bai et al\. \(2024\)Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang\.MT\-Bench\-101: A Fine\-Grained Benchmark for Evaluating Large Language Models in Multi\-Turn Dialogues\.In Lun\-Wei Ku, Andre Martins, and Vivek Srikumar \(eds\.\),*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 7421–7454, Bangkok, Thailand, August 2024\. Association for Computational Linguistics\.doi:10\.18653/v1/2024\.acl\-long\.401\.URL[https://aclanthology\.org/2024\.acl\-long\.401/](https://aclanthology.org/2024.acl-long.401/)\.
- Chiang et al\. \(2024\)Wei\-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N\. Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I\. Jordan, Joseph E\. Gonzalez, and Ion Stoica\.Chatbot arena: an open platform for evaluating llms by human preference\.In*Proceedings of the 41st International Conference on Machine Learning*, ICML’24\. JMLR\.org, 2024\.
- Colombo et al\. \(2022\)Pierre Colombo, Nathan Noiry, Ekhine Irurozki, and Stephan Clemencon\.What are the best Systems? New Perspectives on NLP Benchmarking\.In Alice H\. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho \(eds\.\),*Advances in Neural Information Processing Systems*, 2022\.URL[https://openreview\.net/forum?id=kvtVrzQPvgb](https://openreview.net/forum?id=kvtVrzQPvgb)\.
- Deshpande et al\. \(2025\)Kaustubh Deshpande, Ved Sirdeshmukh, Johannes Baptist Mols, Lifeng Jin, Ed\-Yeremai Hernandez\-Cardona, Dean Lee, Jeremy Kritz, Willow E\. Primack, Summer Yue, and Chen Xing\.MultiChallenge: A Realistic Multi\-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs\.In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar \(eds\.\),*Findings of the Association for Computational Linguistics: ACL 2025*, pp\. 18632–18702, Vienna, Austria, July 2025\. Association for Computational Linguistics\.ISBN 979\-8\-89176\-256\-5\.doi:10\.18653/v1/2025\.findings\-acl\.958\.URL[https://aclanthology\.org/2025\.findings\-acl\.958/](https://aclanthology.org/2025.findings-acl.958/)\.
- Dussolle et al\. \(2025\)Antoine Dussolle, A\. Cardeña, Shota Sato, and Peter Devine\.M\-IFEval: Multilingual Instruction\-Following Evaluation\.In Luis Chiruzzo, Alan Ritter, and Lu Wang \(eds\.\),*Findings of the Association for Computational Linguistics: NAACL 2025*, pp\. 6161–6176, Albuquerque, New Mexico, April 2025\. Association for Computational Linguistics\.ISBN 979\-8\-89176\-195\-7\.doi:10\.18653/v1/2025\.findings\-naacl\.344\.URL[https://aclanthology\.org/2025\.findings\-naacl\.344/](https://aclanthology.org/2025.findings-naacl.344/)\.
- Ge et al\. \(2025\)Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu\.Scaling Synthetic Data Creation with 1,000,000,000 Personas\.2025\.URL[https://arxiv\.org/abs/2406\.20094](https://arxiv.org/abs/2406.20094)\.
- Google \(2025\)Google\.Gemini 3 Flash, December 2025\.URL[https://deepmind\.google/models/gemini/flash/](https://deepmind.google/models/gemini/flash/)\.Accessed 28\-Feb\-2026\.
- Google \(2026\)Google\.Gemini 3\.1 Flash\-Lite, March 2026\.URL[https://deepmind\.google/models/model\-cards/gemini\-3\-1\-flash\-lite/](https://deepmind.google/models/model-cards/gemini-3-1-flash-lite/)\.Accessed 25\-Mar\-2026\.
- Grattafiori et al\. \(2024\)Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al\-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, Danny Wyatt, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia\-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Francisco Guzmán, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Govind Thattai, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jack Zhang, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Karthik Prasad, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El\-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Kushal Lakhotia, Lauren Rantala\-Yeary, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Maria Tsimpoukelli, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Ning Zhang, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohan Maheswari, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vítor Albiero, Vladan Petrovic, Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaofang Wang, Xiaoqing Ellen Tan, Xide Xia, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aayushi Srivastava, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Amos Teo, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Dong, Annie Franco, Anuj Goyal, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Ce Liu, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching\-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Cynthia Gao, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Eric\-Tuan Le, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Filippos Kokkinos, Firat Ozgenel, Francesco Caggioni, Frank Kanayet, Frank Seide, Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Grant Herman, Grigory Sizov, Guangyi, Zhang, Guna Lakshminarayanan, Hakan Inan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Hongyuan Zhan, Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Ilias Leontiadis, Irina\-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Janice Lam, Japhet Asher, Jean\-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kiran Jagadeesh, Kun Huang, Kunal Chawla, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, Miao Liu, Michael L\. Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikhil Mehta, Nikolay Pavlovich Laptev, Ning Dong, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Rangaprabhu Parthasarathy, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Russ Howes, Ruty Rinott, Sachin Mehta, Sachin Siby, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Mahajan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shishir Patil, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Summer Deng, Sungmin Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Koehler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaojian Wu, Xiaolan Wang, Xilun Wu, Xinbo Gao, Yaniv Kleinman, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yu Zhao, Yuchen Hao, Yundi Qian, Yunlu Li, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, Zhiwei Zhao, and Zhiyu Ma\.The Llama 3 Herd of Models, 2024\.URL[https://arxiv\.org/abs/2407\.21783](https://arxiv.org/abs/2407.21783)\.
- Gu et al\. \(2025\)Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo\.A Survey on LLM\-as\-a\-Judge, 2025\.URL[https://arxiv\.org/abs/2411\.15594](https://arxiv.org/abs/2411.15594)\.
- He et al\. \(2024\)Yun He, Di Jin, Chaoqi Wang, Chloe Bi, Karishma Mandyam, Hejia Zhang, Chen Zhu, Ning Li, Tengyu Xu, Hongjiang Lv, Shruti Bhosale, Chenguang Zhu, Karthik Abinav Sankararaman, Eryk Helenowski, Melanie Kambadur, Aditya Tayade, Hao Ma, Han Fang, and Sinong Wang\.Multi\-IF: Benchmarking LLMs on Multi\-Turn and Multilingual Instructions Following\.2024\.URL[https://arxiv\.org/abs/2410\.15553](https://arxiv.org/abs/2410.15553)\.
- Jiang et al\. \(2025\)Bowen Jiang, Zhuoqun Hao, Young Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo Jose Taylor, and Dan Roth\.Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale\.In*Second Conference on Language Modeling*, 2025\.URL[https://openreview\.net/forum?id=6ox8XZGOqP](https://openreview.net/forum?id=6ox8XZGOqP)\.
- Jiang et al\. \(2024\)Yuxin Jiang, Yufei Wang, Xingshan Zeng, Wanjun Zhong, Liangyou Li, Fei Mi, Lifeng Shang, Xin Jiang, Qun Liu, and Wei Wang\.FollowBench: A Multi\-level Fine\-grained Constraints Following Benchmark for Large Language Models\.In Lun\-Wei Ku, Andre Martins, and Vivek Srikumar \(eds\.\),*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 4667–4688, Bangkok, Thailand, August 2024\. Association for Computational Linguistics\.doi:10\.18653/v1/2024\.acl\-long\.257\.URL[https://aclanthology\.org/2024\.acl\-long\.257/](https://aclanthology.org/2024.acl-long.257/)\.
- Joulin et al\. \(2016a\)Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov\.FastText\.zip: Compressing text classification models, 2016a\.URL[https://arxiv\.org/abs/1612\.03651](https://arxiv.org/abs/1612.03651)\.
- Joulin et al\. \(2016b\)Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov\.Bag of Tricks for Efficient Text Classification, 2016b\.URL[https://arxiv\.org/abs/1607\.01759](https://arxiv.org/abs/1607.01759)\.
- Kim et al\. \(2025\)Eunsu Kim, Juyoung Suk, Seungone Kim, Niklas Muennighoff, Dongkwan Kim, and Alice Oh\.LLM\-as\-an\-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation\.In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar \(eds\.\),*Findings of the Association for Computational Linguistics: ACL 2025*, pp\. 26456–26493, Vienna, Austria, July 2025\. Association for Computational Linguistics\.ISBN 979\-8\-89176\-256\-5\.doi:10\.18653/v1/2025\.findings\-acl\.1357\.URL[https://aclanthology\.org/2025\.findings\-acl\.1357/](https://aclanthology.org/2025.findings-acl.1357/)\.
- Kwan et al\. \(2024\)Wai\-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, and Kam\-Fai Wong\.MT\-Eval: A Multi\-Turn Capabilities Evaluation Benchmark for Large Language Models\.In Yaser Al\-Onaizan, Mohit Bansal, and Yun\-Nung Chen \(eds\.\),*Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pp\. 20153–20177, Miami, Florida, USA, November 2024\. Association for Computational Linguistics\.doi:10\.18653/v1/2024\.emnlp\-main\.1124\.URL[https://aclanthology\.org/2024\.emnlp\-main\.1124/](https://aclanthology.org/2024.emnlp-main.1124/)\.
- Laban et al\. \(2025\)Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville\.LLMs Get Lost In Multi\-Turn Conversation\.2025\.URL[https://arxiv\.org/abs/2505\.06120](https://arxiv.org/abs/2505.06120)\.
- Maharana et al\. \(2024\)Adyasha Maharana, Dong\-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang\.Evaluating Very Long\-Term Conversational Memory of LLM Agents\.In Lun\-Wei Ku, Andre Martins, and Vivek Srikumar \(eds\.\),*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 13851–13870, Bangkok, Thailand, August 2024\. Association for Computational Linguistics\.doi:10\.18653/v1/2024\.acl\-long\.747\.URL[https://aclanthology\.org/2024\.acl\-long\.747/](https://aclanthology.org/2024.acl-long.747/)\.
- Olmo et al\. \(2025\)Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, Shane Arora, Shashank Gupta, Taira Anderson, Teng Xiao, Tyler Murray, Tyler Romero, Victoria Graf, Akari Asai, Akshita Bhagia, Alexander Wettig, Alisa Liu, Aman Rangapur, Chloe Anastasiades, Costa Huang, Dustin Schwenk, Harsh Trivedi, Ian Magnusson, Jaron Lochner, Jiacheng Liu, Lester James V\. Miranda, Maarten Sap, Malia Morgan, Michael Schmitz, Michal Guerquin, Michael Wilson, Regan Huff, Ronan Le Bras, Rui Xin, Rulin Shao, Sam Skjonsberg, Shannon Zejiang Shen, Shuyue Stella Li, Tucker Wilde, Valentina Pyatkin, Will Merrill, Yapei Chang, Yuling Gu, Zhiyuan Zeng, Ashish Sabharwal, Luke Zettlemoyer, Pang Wei Koh, Ali Farhadi, Noah A\. Smith, and Hannaneh Hajishirzi\.Olmo 3, 2025\.URL[https://arxiv\.org/abs/2512\.13961](https://arxiv.org/abs/2512.13961)\.
- OpenAI \(2025\)OpenAI\.Introducing GPT‑5\.2, December 2025\.URL[https://openai\.com/index/introducing\-gpt\-5\-2/](https://openai.com/index/introducing-gpt-5-2/)\.Accessed 20\-Feb\-2026\.
- OpenAI et al\. \(2025\)OpenAI, :, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K\. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bubeck, Che Chang, Kai Chen, Mark Chen, Enoch Cheung, Aidan Clark, Dan Cook, Marat Dukhan, Casey Dvorak, Kevin Fives, Vlad Fomenko, Timur Garipov, Kristian Georgiev, Mia Glaese, Tarun Gogineni, Adam Goucher, Lukas Gross, Katia Gil Guzman, John Hallman, Jackie Hehir, Johannes Heidecke, Alec Helyar, Haitang Hu, Romain Huet, Jacob Huh, Saachi Jain, Zach Johnson, Chris Koch, Irina Kofman, Dominik Kundel, Jason Kwon, Volodymyr Kyrylov, Elaine Ya Le, Guillaume Leclerc, James Park Lennon, Scott Lessans, Mario Lezcano\-Casado, Yuanzhi Li, Zhuohan Li, Ji Lin, Jordan Liss, Lily, Liu, Jiancheng Liu, Kevin Lu, Chris Lu, Zoran Martinovic, Lindsay McCallum, Josh McGrath, Scott McKinney, Aidan McLaughlin, Song Mei, Steve Mostovoy, Tong Mu, Gideon Myles, Alexander Neitz, Alex Nichol, Jakub Pachocki, Alex Paino, Dana Palmie, Ashley Pantuliano, Giambattista Parascandolo, Jongsoo Park, Leher Pathak, Carolina Paz, Ludovic Peran, Dmitry Pimenov, Michelle Pokrass, Elizabeth Proehl, Huida Qiu, Gaby Raila, Filippo Raso, Hongyu Ren, Kimmy Richardson, David Robinson, Bob Rotsted, Hadi Salman, Suvansh Sanjeev, Max Schwarzer, D\. Sculley, Harshit Sikchi, Kendal Simon, Karan Singhal, Yang Song, Dane Stuckey, Zhiqing Sun, Philippe Tillet, Sam Toizer, Foivos Tsimpourlas, Nikhil Vyas, Eric Wallace, Xin Wang, Miles Wang, Olivia Watkins, Kevin Weil, Amy Wendling, Kevin Whinnery, Cedric Whitney, Hannah Wong, Lin Yang, Yu Yang, Michihiro Yasunaga, Kristen Ying, Wojciech Zaremba, Wenting Zhan, Cyril Zhang, Brian Zhang, Eddie Zhang, and Shengjia Zhao\.gpt\-oss\-120b & gpt\-oss\-20b model card, 2025\.URL[https://arxiv\.org/abs/2508\.10925](https://arxiv.org/abs/2508.10925)\.
- Penedo et al\. \(2024\)Guilherme Penedo, Hynek Kydlíček, Alessandro Cappelli, Mario Sasko, and Thomas Wolf\.DataTrove: large scale data processing, 2024\.URL[https://github\.com/huggingface/datatrove](https://github.com/huggingface/datatrove)\.
- Pyatkin et al\. \(2025\)Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi\.Generalizing Verifiable Instruction Following\.2025\.URL[https://arxiv\.org/abs/2507\.02833](https://arxiv.org/abs/2507.02833)\.
- Qin et al\. \(2024\)Yiwei Qin, Kaiqiang Song, Yebowen Hu, Wenlin Yao, Sangwoo Cho, Xiaoyang Wang, Xuansheng Wu, Fei Liu, Pengfei Liu, and Dong Yu\.InFoBench: Evaluating Instruction Following Ability in Large Language Models\.In Lun\-Wei Ku, Andre Martins, and Vivek Srikumar \(eds\.\),*Findings of the Association for Computational Linguistics: ACL 2024*, pp\. 13025–13048, Bangkok, Thailand, August 2024\. Association for Computational Linguistics\.doi:10\.18653/v1/2024\.findings\-acl\.772\.URL[https://aclanthology\.org/2024\.findings\-acl\.772/](https://aclanthology.org/2024.findings-acl.772/)\.
- Rakotonirina et al\. \(2025\)Nathanaël Carraz Rakotonirina, Mohammed Hamdy, Jon Ander Campos, Lucas Weber, Alberto Testoni, Marzieh Fadaee, Sandro Pezzelle, and Marco Del Tredici\.From Tools to Teammates: Evaluating LLMs in Multi\-Session Coding Interactions\.In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar \(eds\.\),*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 19609–19642, Vienna, Austria, July 2025\. Association for Computational Linguistics\.ISBN 979\-8\-89176\-251\-0\.doi:10\.18653/v1/2025\.acl\-long\.964\.URL[https://aclanthology\.org/2025\.acl\-long\.964/](https://aclanthology.org/2025.acl-long.964/)\.
- Team et al\. \(2025a\)Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Beyer, Xiaohai Zhai, Anton Tsitsulin, Robert Busa\-Fekete, Alex Feng, Noveen Sachdeva, Benjamin Coleman, Yi Gao, Basil Mustafa, Iain Barr, Emilio Parisotto, David Tian, Matan Eyal, Colin Cherry, Jan\-Thorsten Peter, Danila Sinopalnikov, Surya Bhupatiraju, Rishabh Agarwal, Mehran Kazemi, Dan Malkin, Ravin Kumar, David Vilar, Idan Brusilovsky, Jiaming Luo, Andreas Steiner, Abe Friesen, Abhanshu Sharma, Abheesht Sharma, Adi Mayrav Gilady, Adrian Goedeckemeyer, Alaa Saade, Alex Feng, Alexander Kolesnikov, Alexei Bendebury, Alvin Abdagic, Amit Vadi, András György, André Susano Pinto, Anil Das, Ankur Bapna, Antoine Miech, Antoine Yang, Antonia Paterson, Ashish Shenoy, Ayan Chakrabarti, Bilal Piot, Bo Wu, Bobak Shahriari, Bryce Petrini, Charlie Chen, Charline Le Lan, Christopher A\. Choquette\-Choo, CJ Carey, Cormac Brick, Daniel Deutsch, Danielle Eisenbud, Dee Cattle, Derek Cheng, Dimitris Paparas, Divyashree Shivakumar Sreepathihalli, Doug Reid, Dustin Tran, Dustin Zelle, Eric Noland, Erwin Huizenga, Eugene Kharitonov, Frederick Liu, Gagik Amirkhanyan, Glenn Cameron, Hadi Hashemi, Hanna Klimczak\-Plucińska, Harman Singh, Harsh Mehta, Harshal Tushar Lehri, Hussein Hazimeh, Ian Ballantyne, Idan Szpektor, Ivan Nardini, Jean Pouget\-Abadie, Jetha Chan, Joe Stanton, John Wieting, Jonathan Lai, Jordi Orbay, Joseph Fernandez, Josh Newlan, Ju yeong Ji, Jyotinder Singh, Kat Black, Kathy Yu, Kevin Hui, Kiran Vodrahalli, Klaus Greff, Linhai Qiu, Marcella Valentine, Marina Coelho, Marvin Ritter, Matt Hoffman, Matthew Watson, Mayank Chaturvedi, Michael Moynihan, Min Ma, Nabila Babar, Natasha Noy, Nathan Byrd, Nick Roy, Nikola Momchev, Nilay Chauhan, Noveen Sachdeva, Oskar Bunyan, Pankil Botarda, Paul Caron, Paul Kishan Rubenstein, Phil Culliton, Philipp Schmid, Pier Giuseppe Sessa, Pingmei Xu, Piotr Stanczyk, Pouya Tafti, Rakesh Shivanna, Renjie Wu, Renke Pan, Reza Rokni, Rob Willoughby, Rohith Vallu, Ryan Mullins, Sammy Jerome, Sara Smoot, Sertan Girgin, Shariq Iqbal, Shashir Reddy, Shruti Sheth, Siim Põder, Sijal Bhatnagar, Sindhu Raghuram Panyam, Sivan Eiger, Susan Zhang, Tianqi Liu, Trevor Yacovone, Tyler Liechty, Uday Kalra, Utku Evci, Vedant Misra, Vincent Roseberry, Vlad Feinberg, Vlad Kolesnikov, Woohyun Han, Woosuk Kwon, Xi Chen, Yinlam Chow, Yuvein Zhu, Zichuan Wei, Zoltan Egyed, Victor Cotruta, Minh Giang, Phoebe Kirk, Anand Rao, Kat Black, Nabila Babar, Jessica Lo, Erica Moreira, Luiz Gustavo Martins, Omar Sanseviero, Lucas Gonzalez, Zach Gleicher, Tris Warkentin, Vahab Mirrokni, Evan Senter, Eli Collins, Joelle Barral, Zoubin Ghahramani, Raia Hadsell, Yossi Matias, D\. Sculley, Slav Petrov, Noah Fiedel, Noam Shazeer, Oriol Vinyals, Jeff Dean, Demis Hassabis, Koray Kavukcuoglu, Clement Farabet, Elena Buchatskaya, Jean\-Baptiste Alayrac, Rohan Anil, Dmitry, Lepikhin, Sebastian Borgeaud, Olivier Bachem, Armand Joulin, Alek Andreev, Cassidy Hardin, Robert Dadashi, and Léonard Hussenot\.Gemma 3 Technical Report, 2025a\.URL[https://arxiv\.org/abs/2503\.19786](https://arxiv.org/abs/2503.19786)\.
- Team et al\. \(2025b\)GLM Team, Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, Yushi Bai, Zhengxiao Du, Zihan Wang, Zilin Zhu, Bohan Zhang, Bosi Wen, Bowen Wu, Bowen Xu, Can Huang, Casey Zhao, Changpeng Cai, Chao Yu, Chen Li, Chendi Ge, Chenghua Huang, Chenhui Zhang, Chenxi Xu, Chenzheng Zhu, Chuang Li, Congfeng Yin, Daoyan Lin, Dayong Yang, Dazhi Jiang, Ding Ai, Erle Zhu, Fei Wang, Gengzheng Pan, Guo Wang, Hailong Sun, Haitao Li, Haiyang Li, Haiyi Hu, Hanyu Zhang, Hao Peng, Hao Tai, Haoke Zhang, Haoran Wang, Haoyu Yang, He Liu, He Zhao, Hongwei Liu, Hongxi Yan, Huan Liu, Huilong Chen, Ji Li, Jiajing Zhao, Jiamin Ren, Jian Jiao, Jiani Zhao, Jianyang Yan, Jiaqi Wang, Jiayi Gui, Jiayue Zhao, Jie Liu, Jijie Li, Jing Li, Jing Lu, Jingsen Wang, Jingwei Yuan, Jingxuan Li, Jingzhao Du, Jinhua Du, Jinxin Liu, Junkai Zhi, Junli Gao, Ke Wang, Lekang Yang, Liang Xu, Lin Fan, Lindong Wu, Lintao Ding, Lu Wang, Man Zhang, Minghao Li, Minghuan Xu, Mingming Zhao, Mingshu Zhai, Pengfan Du, Qian Dong, Shangde Lei, Shangqing Tu, Shangtong Yang, Shaoyou Lu, Shijie Li, Shuang Li, Shuang\-Li, Shuxun Yang, Sibo Yi, Tianshu Yu, Wei Tian, Weihan Wang, Wenbo Yu, Weng Lam Tam, Wenjie Liang, Wentao Liu, Xiao Wang, Xiaohan Jia, Xiaotao Gu, Xiaoying Ling, Xin Wang, Xing Fan, Xingru Pan, Xinyuan Zhang, Xinze Zhang, Xiuqing Fu, Xunkai Zhang, Yabo Xu, Yandong Wu, Yida Lu, Yidong Wang, Yilin Zhou, Yiming Pan, Ying Zhang, Yingli Wang, Yingru Li, Yinpei Su, Yipeng Geng, Yitong Zhu, Yongkun Yang, Yuhang Li, Yuhao Wu, Yujiang Li, Yunan Liu, Yunqing Wang, Yuntao Li, Yuxuan Zhang, Zezhen Liu, Zhen Yang, Zhengda Zhou, Zhongpei Qiao, Zhuoer Feng, Zhuorui Liu, Zichen Zhang, Zihan Wang, Zijun Yao, Zikang Wang, Ziqiang Liu, Ziwei Chai, Zixuan Li, Zuodong Zhao, Wenguang Chen, Jidong Zhai, Bin Xu, Minlie Huang, Hongning Wang, Juanzi Li, Yuxiao Dong, and Jie Tang\.GLM\-4\.5: Agentic, Reasoning, and Coding \(ARC\) Foundation Models, 2025b\.URL[https://arxiv\.org/abs/2508\.06471](https://arxiv.org/abs/2508.06471)\.
- Team \(2025\)Qwen Team\.Qwen3 Technical Report, 2025\.URL[https://arxiv\.org/abs/2505\.09388](https://arxiv.org/abs/2505.09388)\.
- Wang et al\. \(2025\)Jiaming Wang, Yunke Zhao, Peng Ding, Jun Kuang, Yibin Shen, Zhe Tang, Yilin Jin, ZongYu Wang, Xiaoyu Li, Xuezhi Cao, and Xunliang Cai\.Meeseeks: A Feedback\-Driven, Iterative Self\-Correction Benchmark evaluating LLMs’ Instruction Following Capability\.2025\.URL[https://arxiv\.org/abs/2504\.21625](https://arxiv.org/abs/2504.21625)\.
- Wang et al\. \(2024\)Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji\.MINT: Evaluating LLMs in multi\-turn interaction with tools and language feedback\.In*The Twelfth International Conference on Learning Representations*, 2024\.URL[https://openreview\.net/forum?id=jp3gWrMuIZ](https://openreview.net/forum?id=jp3gWrMuIZ)\.
- Wu et al\. \(2025\)Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai\-Wei Chang, and Dong Yu\.LongMemEval: Benchmarking Chat Assistants on Long\-Term Interactive Memory\.In*The Thirteenth International Conference on Learning Representations*, 2025\.URL[https://openreview\.net/forum?id=pZiyCaVuti](https://openreview.net/forum?id=pZiyCaVuti)\.
- Xia et al\. \(2024\)Congying Xia, Chen Xing, Jiangshu Du, Xinyi Yang, Yihao Feng, Ran Xu, Wenpeng Yin, and Caiming Xiong\.FOFO: A Benchmark to Evaluate LLMs’ Format\-Following Capability\.In Lun\-Wei Ku, Andre Martins, and Vivek Srikumar \(eds\.\),*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 680–699, Bangkok, Thailand, August 2024\. Association for Computational Linguistics\.doi:10\.18653/v1/2024\.acl\-long\.40\.URL[https://aclanthology\.org/2024\.acl\-long\.40/](https://aclanthology.org/2024.acl-long.40/)\.
- Yang et al\. \(2025\)An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, and Zipeng Zhang\.Qwen2\.5\-1M Technical Report\.*arXiv preprint arXiv:2501\.15383*, 2025\.
- Zhao et al\. \(2024\)Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng\.WildChat: 1M ChatGPT Interaction Logs in the Wild\.In*The Twelfth International Conference on Learning Representations*, 2024\.URL[https://openreview\.net/forum?id=Bl8u7ZRlbM](https://openreview.net/forum?id=Bl8u7ZRlbM)\.
- Zheng et al\. \(2023\)Lianmin Zheng, Wei\-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P\. Xing, Hao Zhang, Joseph E\. Gonzalez, and Ion Stoica\.Judging LLM\-as\-a\-judge with MT\-bench and Chatbot Arena\.In*Proceedings of the 37th International Conference on Neural Information Processing Systems*, NIPS ’23, Red Hook, NY, USA, 2023\. Curran Associates Inc\.
- Zheng et al\. \(2024\)Lianmin Zheng, Wei\-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric Xing, Joseph E\. Gonzalez, Ion Stoica, and Hao Zhang\.LMSYS\-chat\-1m: A large\-scale real\-world LLM conversation dataset\.In*The Twelfth International Conference on Learning Representations*, 2024\.URL[https://openreview\.net/forum?id=BOfDKxfwt0](https://openreview.net/forum?id=BOfDKxfwt0)\.
- Zhou et al\. \(2023\)Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou\.Instruction\-Following Evaluation for Large Language Models\.2023\.URL[https://arxiv\.org/abs/2311\.07911](https://arxiv.org/abs/2311.07911)\.

## Appendix AConstraint Curation Pipeline

### A\.1Prompt Templates

Figures[8](https://arxiv.org/html/2605.06353#A1.F8)and[9](https://arxiv.org/html/2605.06353#A1.F9)show the prompt templates used to extract constraints from conversational datasets and identify satisfiable constraints, respectively\. In the phases for identifying trivial and subjective constraints, we used the prompt templates shown in[Figure11](https://arxiv.org/html/2605.06353#A1.F11)and[10](https://arxiv.org/html/2605.06353#A1.F10)\. To sample model responses to tasks without specifying any constraint in the trivial phase, we did use any template, just sent the tasks directly to the models\.[Figures8](https://arxiv.org/html/2605.06353#A1.F8)and[9](https://arxiv.org/html/2605.06353#A1.F9)show the prompt templates used to extract constraints from conversational datasets and to identify satisfiable constraints, respectively\. For the triviality and subjectivity phases, we used the prompt templates shown in[Figures11](https://arxiv.org/html/2605.06353#A1.F11)and[10](https://arxiv.org/html/2605.06353#A1.F10)\. In the triviality phase, when sampling model responses without imposing any constraint, we submitted the tasks directly to the models without using any template\.

Constraint Extraction Prompt TemplateIdentify tasks and constraints in user prompts\.A task is a directive that specifies an action or goal\. It tells a model to provide information or do something, such as ”What is the capital of France?”, ”Summarize this text”, ”Translate this sentence”, ”Who are you?”, ”Generate a list of ideas”, or ”Why is the sky blue?”\.A user prompt might contain no task\! It can simply be a statement or expression, without any specific request for information or action\. Examples of such user prompts include greetings \(”Hello\!”\), expressions of emotion \(”I’m feeling great today\.”\), or sharing information \(”I went to the park yesterday\.”\)\.A constraint is a restriction or condition that limits how the model should generate its output, rather than what task it performs\. It guides the form, style, or structure of the response — ensuring it adheres to specific requirements or rules\.We classify constraints into four main categories:1\) Linguistic Guidelines: These dictate the use of particular language structures and terms, including grammatical styles, syntax, and specific dialects, like ”Victorian English” or ”technical jargon”;2\) Style Rules: These direct the overall tone and audience of the text, varying from formal to persuasive or sophisticated, as in writing with a ”respectful tone” or for ”a young audience”;3\) Format Specifications: These instruct the LLM on the structural presentation of its response, such as ”write your answer as a sonnet” or ”list ideas bullet\-wise”;4\) Number Limitations: These involve numeric\-related instructions, like producing ”a 500\-word essay” or presenting ”three arguments for your answer”\.Below, you are given a sequence of user prompts taken from a conversation\. Your job is to identify all tasks and constraints in the user prompts\. In addition, classify all constraints into their categories\. Each constraint should be classified into one and only one of the four categories listed above\.You can first reason about the user prompts and their context\. At the end, present your final answer as a valid json output, ie, as a list of dictionaries where each dictionary contains the turn number, the task defined in the user turn \(if any, otherwise ””\),and a list of dictionaries for the constraints found \(if any, otherwise \[\]\), where each constraint dictionary contains the constraint and the constraint type\.For example:User prompts:“Turn 1:Hello\!Turn 2:I want to write an email to my boss about the crazy amount of meetings he’s scheduling\. Write me a formal email of 300 words to explain the situation and ask for him to be more understanding on the number of meetings he schedules\.Turn 3:Could you make sure to include some suggestions in a bullet\-point list on how to manage meetings better?Turn 4:Rewrite the email in a more polite tone\.”Output:\[\{\{“turn”: 1,“task”: “”,“constraints”:\[ \]\}\},\{\{“turn”: 2,“task”: “I want to write an email to my boss about the crazy amount of meetings he’s scheduling\. Write me an email to explain the situation and ask for him to be more understanding on the number of meetings he schedules\.”,“constraints”: \[\{\{“constraint”: “Write in formal tone”,“type”: “Style Rules”\}\},\{\{“constraint”: “Write in 300 words”,“type”: “Number Limitations”\}\}\]\}\},\{\{“turn”: 3,“task”: “Make sure to include some suggestions on how to manage meetings better\.”,“constraints”: \[\{\{“constraint”: “Write a bullet\-point list”,“type”: “Format Specifications”\}\}\]\}\},\{\{“turn”: 4,“task”: “Rewrite the email\.”,“constraints”: \[\{\{“constraint”: “Write in a more polite tone”,“type”: “Style Rules”\}\}\]\}\}\]User prompts:“\{user\_turn\}”Output:

Figure 8:Prompt template used to extract constraints from datasets of conversations\.Satisfiable Constraints Prompt TemplateIdentify whether a constraint is relevant to a task\.A task is a directive that specifies an action or goal\. It tells a model to provide information or do something, such as “What is the capital of France?”, “Summarize this text”, “Translate this sentence”, “Who are you?”, “Generate a list of ideas”, or “Why is the sky blue?”\.A constraint is a restriction or condition that limits how the model should generate its output, rather than what task it performs\. It guides the form, style, or structure of the response — ensuring it adheres to specific requirements or rules\.We classify constraints into four main categories:1\) Linguistic Guidelines: These dictate the use of particular language structures and terms, including grammatical styles, syntax, and specific dialects, like “Victorian English” or “technical jargon”;2\) Style Rules: These direct the overall tone and audience of the text, varying from formal to persuasive or sophisticated, as in writing with a “respectful tone” or for “a young audience”;3\) Format Specifications: These instruct the LLM on the structural presentation of its response, such as “write your answer as a sonnet” or “list ideas bullet\-wise”;4\) Number Limitations: These involve numeric\-related instructions, like producing “a 500\-word essay” or presenting “three arguments for your answer”\.Below, you are given a task and a constraint\. To determine whether the constraint is relevant to the task, answer seperately each of these questions with either\[\[Yes\]\]or\[\[No\]\]:1\) Is the constraint actually a restriction or condition that limits how the model should generate its output to the task?2\) Does the constraint target a different question, topic, or domain than the task itself?3\) Is the constraint applicable to the type of output the task requires?4\) Does the constraint fall within one of the four defined categories above?You can first reason about the task and the constraint\. Output only a valid JSON with this structure:\{\{“reasoning”: “write your reasoning here”,“question 1”: “\[\[Yes/No\]\]”,“question 2”: “\[\[Yes/No\]\]”,“question 3”: “\[\[Yes/No\]\]”,“question 4”: “\[\[Yes/No\]\]”\}\}Task:\{task\}Constraint:\{constraint\}Figure 9:Prompt template used to identify satisfiable constraints with LM judges\.Adress Task While Following ConstraintAddress the following task while adhering to the given constraint\.Constraint:\{constraint\}Task: \{task\}Figure 10:Prompt template used to generate model responses to a task under a specified constraint\.Single\-Constraint Evaluation Prompt TemplateAn assistant has been asked to perform a task\. Your job is to assess whether the provided answer satisfies a given constraint\. You may first reason about both the constraint and the answer\. At the end, present your final verdict as either “Final Verdict:\[\[Yes\]\]” if the answer satisfies the constraint, or “Final Verdict:\[\[No\]\]” if it does not\.Does the following answer satisfy the constraint?Answer:\{answer\}Constraint:\{constraint\}Figure 11:Prompt template used by LM judges to assess whether an answer satisfies a single constraint\.
### A\.2Satisfiable, Trivial, and Subjective Thresholds

In the satisfiability, triviality, and subjectivity phases, each constraint was evaluated across100100different task contexts\. A constraint was classified as satisfiable \(respectively non\-trivial or non\-subjective\) if each judge classified it as such in at leastX%X\\%of the evaluated task contexts\.[Figures12](https://arxiv.org/html/2605.06353#A1.F12),[14](https://arxiv.org/html/2605.06353#A1.F14)and[16](https://arxiv.org/html/2605.06353#A1.F16)present the number of constraints classified as satisfiable, non\-trivial, and non\-subjective, respectively, for different values ofXX\.[Figures13](https://arxiv.org/html/2605.06353#A1.F13),[15](https://arxiv.org/html/2605.06353#A1.F15)and[17](https://arxiv.org/html/2605.06353#A1.F17)show the distribution of constraints across5%5\\%performance intervals\.

Our final pool includes only constraints classified as satisfiable, non\-trivial, and non\-subjective in at least70%70\\%of the evaluated task contexts by each judge \(not necessarily the same contexts across phases\)\. This threshold balances robustness to task variability with the need to maintain sufficient constraint diversity, while ensuring a reasonably large and reliable constraint set\.

![Refer to caption](https://arxiv.org/html/2605.06353v1/x7.png)Figure 12:Number of constraints classified as satisfied by all three judges as a function of the minimum percentage of task contexts in which they are classified as such\.![Refer to caption](https://arxiv.org/html/2605.06353v1/x8.png)Figure 13:Distribution of constraints by the percentage of task contexts in which they are classified as satisfied by all three judges\.![Refer to caption](https://arxiv.org/html/2605.06353v1/x9.png)Figure 14:Number of constraints classified as non\-trivial by all three judges as a function of the minimum percentage of task contexts in which they are classified as such\.![Refer to caption](https://arxiv.org/html/2605.06353v1/x10.png)Figure 15:Distribution of constraints by the percentage of task contexts in which they are classified as non\-trivial by all three judges\.![Refer to caption](https://arxiv.org/html/2605.06353v1/x11.png)Figure 16:Number of constraints classified as non\-subjective by all three judges as a function of the minimum percentage of task contexts in which they are classified as such\.![Refer to caption](https://arxiv.org/html/2605.06353v1/x12.png)Figure 17:Distribution of constraints by the percentage of task contexts in which they are classified as non\-subjective by all three judges\.

## Appendix BCreating Tuples of Constraints

### B\.1Prompt Templates

We focused on creating tuples of33constraints that can be simultaneously satisfied\. We achieved this through a two\-step process\. First, we randomly sampled tuples of33constraints and paired them with100100tasks\. Each tuple–task pair was evaluated by multiple judges, and we retained only those tuples for which all judges agreed that the constraints can be jointly satisfied in at least70%70\\%of the task contexts—these are satisfiable tuples\. We used the prompt template shown in[Figure18](https://arxiv.org/html/2605.06353#A2.F18)\. Second, we paired each satisfiable tuple again with100100tasks and prompted various models to generate responses for such pairs\. For each tuple–task pair, we checked whether there existed at least one response satisfying all constraints, as verified unanimously by the judges\. We used the prompt templates shown in[Figures19](https://arxiv.org/html/2605.06353#A2.F19)and[11](https://arxiv.org/html/2605.06353#A1.F11)\.

Satisfiable Tuples Prompt TemplateYou will receive a task and a list of constraints\. Analyze whether the constraints can all be followed simultaneously by a single response to the task without contradiction\. You can first reason about the task, the constraints, and their possible contradictions\. At the end, reply with “Final Verdict:\[\[Yes\]\]” if they are jointly compatible, otherwise reply with “Final Verdict:\[\[No\]\]”\.Task:“\{task\}”Constraints:“\{constraints\}”Figure 18:Prompt template used to identify satisfiable tuples with LM judges\.Adress Task While Following Tuple of ConstraintsAddress the following task while adhering to all the given constraints\.Constraints:\{constraints\}Task:\{task\}Figure 19:Prompt template used to generate model responses to a task under a specified tuple of constraints\.
### B\.2Cut\-Off Threshold

Figure[20](https://arxiv.org/html/2605.06353#A2.F20)presents the number of satisfiable tuples for which we found at least one answer satisfying all constraints of the tuple, as classified by all three judges, for a varying number of task contexts\.[Figure21](https://arxiv.org/html/2605.06353#A2.F21)shows the distribution of tuples across5%5\\%performance intervals\.

Our final pool includes all satisfiable tuples for which we found, for at least70%70\\%of the task contexts analyzed, an answer satisfying all constraints of the tuple, as determined by all three judges\. This threshold ensures a reasonably large and reliable set of948948constraint tuples, as well as robustness to task variability\.

![Refer to caption](https://arxiv.org/html/2605.06353v1/x13.png)Figure 20:Number of tuples as a function of the minimum percentage of task contexts for which we found at least one answer satisfying all constraints of the tuple, as determined by all three judges\.![Refer to caption](https://arxiv.org/html/2605.06353v1/x14.png)Figure 21:Distribution of tuples by the percentage of task contexts for which we found at least one answer satisfying all constraints of the tuple, as determined by all three judges\.

## Appendix CFinal Pool of Constraints and Tuples

### C\.1Constraint Categories

We adopt four main categories of constraints defined byQin et al\. \([2024](https://arxiv.org/html/2605.06353#bib.bib25)\):

- •Linguistic Guidelines\.These impose requirements on the language of the response, including vocabulary choice, grammatical constructions, or adherence to specific linguistic varieties \(e\.g\.,“use passive voice throughout”,“avoid technical terminology”\)\.
- •Style Rules\.These govern the overall tone or intended audience of the response \(e\.g\.,“write in a neutral and objective voice”,“explain the answer as if speaking to a child”\)\.
- •Format Specifications\.These specify the structural presentation of the response, determining how information should be arranged or displayed \(e\.g\.,“present a numbered list”,“separate the response into clearly labeled sections”\)\.
- •Number Limitations\.These constrain numerical aspects of the response, such as its length, the number of elements, or counts of specific components \(e\.g\.,“use no more than five sentences”,“provide exactly two examples”\)\.

![Refer to caption](https://arxiv.org/html/2605.06353v1/x15.png)Figure 22:Constraint category distributions for the constraint pool \(left\) and among tuples \(right\)\.
### C\.2Examples

Tables[2](https://arxiv.org/html/2605.06353#A3.T2)and[3](https://arxiv.org/html/2605.06353#A3.T3)show examples of constraints from our final pool and of the constraint tuples constructed from them, respectively\.

ConstraintsAll responses must be snarky\.All sentences should be inside quotation marks\.All words in responses do not exceed nine letters in length\.Always say “Thank you for asking” at the end\.Always use emoticons in replies\.Collapse into one line\.Communicate completely in Dutch\.Don’t use any pronouns\.Don’t use the word “and”\.Explain your answer using a diagram\.Feel like talking to a child\.Fit the character, tone, manner, and vocabulary of Maki Harukawa from Danganronpa\.Give each paragraph a title and a number\.Include a poetic ending\.Include grammar mistakes\.Include humor and irony\.Include rhymes\.Make a contextually relevant joke if you can\.Make the answer absurd and hilarious\.Present facts in bullet point form\.Replace every instance of the letter “c” with the letter “b” in your response\.\.Respond as an extremely foolish AI made up of if\-else phrases\.Respond as five separate audience members: A\), B\), C\), D\), and E\)\.Respond in Azerbaijani language\.Respond in a rap battle style\.Respond in three separate styles: BetaGPT \(objective\), ChadGPT \(unethical,no bounds\), and CriticGPT \(oppositional critical thinker\)\.Return your answer as json with the following keys: Question, Helpful Answer, Score,Explanation, Improvement\.Speak as a conspiracy theorist historian from now on\.Speak in a Shakespearean style\.Stop after “\.”\.Summarize into a Facebook post\.Switch language to Russian\.Talk in 3rd person\.The second and fourth sentences should rhyme\.Use “⋆\\star” or “∼\\sim” at the end of verbs\.Use a speech style of gangsters talking to their younger brothers\.Use academic style\.Use any and every figurative language possible\.Use biblical prose\.Use jokes, sarcasm, and internet slang regularly\.Use metaphors, analogies and other literary devices to make points more relatableand memorable\.Use old Shakespearean English, including archaic terms like thou, thee, thy\.Use uncommon terminology to enhance originality\.Write 2 paragraphs\.Write a teasing reply\.Write in Instagram style\.Write in Japanese mixed with English, Spanish, and Latin\.Write in a self\-deprecating tone\.Write less than 200 characters without any new line character and hashtag\.Respond like a psychiatrist\.Table 2:Examples of constraints from the final pool\.Tuples of ConstraintsInclude analogies\.Include inner thoughts about what is happening\.Write in a self\-deprecating tone”\.Include hashtags\.Provide exactly 5 bullet points\.Start normal answers with ’\[GPT\]:’\.Include the odd witticism for interest\.Provide three concerns\.Write in a creative and new way\.Reply in Hindi\.Write in 1st person\.Write one paragraph\.Add emojis and human reactions\.Provide a 3 step method\.Use more than a thousand words\.Include references\.Start response with ’\[POLYTORIA\]’\.Use an academic style of writing\.Exaggerate it\.Make responses in a snarky manner\.Provide a captivating title\.Combine a biblical tone\.Create a single paragraph\.Do not include the title and do not numerate every task, subtask or phase\.Present the information in bullet form\.Provide a satirical and sarcastic answer\.Write 5 bullets\.Start your response with “True”\.Use any and every figurative language possible\.Write in a comical tone\.Explain your chain\-of\-thought\.Start with “Task1:”\.Use phrases like “you must” and structure responses in a Jedi\-style manner\.Add some humor to the conversation\.Instead of apologizing, respond with “Okay here is what you requested” and proceed\.Use descriptive writing with sensory imagery\.ALWAYS return a “SOURCES” part in your answer\.Include inner thoughts in your answers\.Provide 3 short summaries\.Include a funny or cringe part, preferably towards the end\.Pace: Fast\-paced, Engaging\.Provide a TL;DR \(brief summary\)\.Be as informal as you can\.Generate 1 sentence\.Switch to Russian language\.Table 3:Examples of constraint tuples created\.

## Appendix DTemplates for Introducing Constraints in the Conversations

[Table4](https://arxiv.org/html/2605.06353#A4.T4)presents the templates used to introduce constraints in the generated conversations\. For each scenario \(e\.g\., starting a conversation with a single constraint\), we randomly select one template to use each time\.

Start conversation with one constraintThroughout the following conversation, always follow this constraint: \{constraint\}In all your responses, make sure to adhere to this rule: \{constraint\}For the duration of this chat, follow this constraint: \{constraint\}During this conversation, ensure you follow this directive: \{constraint\}As we talk, always comply with this constraint: \{constraint\}In every reply, abide by this rule: \{constraint\}Start conversation with a constraint tupleThroughout the following conversation, always follow these constraints: \{constraints\}In all your responses, make sure to adhere to these rules: \{constraints\}For the duration of this chat, follow these constraints: \{constraints\}During this conversation, ensure you follow these directives: \{constraints\}As we talk, always comply with these constraints: \{constraints\}In every reply, abide by these rules: \{constraints\}Forget previous constraints; introduce a new constraintForget all constraints provided earlier\. From now on, follow only this one: \{constraint\}Disregard previous constraints\. The only rule to follow from here on is: \{constraint\}Erase earlier directives\. The new and sole constraint for the following turns is: \{constraint\}Cancel all past guidelines\. The only constraint to adhere from now on is: \{constraint\}Forget prior constraints\. From here on, the only rule is: \{constraint\}Override earlier constraints\. In the next turns, follow only this one instead: \{constraint\}Forget previous constraints; introduce a new constraint tupleForget all constraints provided earlier\. From now on, follow only these ones: \{constraints\}Disregard previous constraints\. The only rules to follow from here on are: \{constraints\}Erase earlier directives\. The new and sole constraints for the following turns are: \{constraints\}Cancel all past guidelines\. The only constraints to adhere from now on are: \{constraints\}Forget prior constraints\. From here on, the only rules are: \{constraints\}Override earlier constraints\. In the next turns, follow only these ones instead: \{constraints\}Remember previous constraints; introduce a new constraintIn addition to the previous constraints, also follow this one from now on: \{constraint\}Along with the earlier directives, from here on also follow this new constraint: \{constraint\}Do not forget the existing rules; in the next turns follow also this new one: \{constraint\}Building on the earlier constraints, adhere to this as well in the following turns: \{constraint\}Keep in mind the previous constraints and, in addition, follow this new one from here on:\{constraint\}Remember previous constraints; introduce a new constraint tupleIn addition to the previous constraints, also follow these ones from now on: \{constraints\}Along with the earlier directives, from here on also follow these new constraints: \{constraints\}Do not forget the existing rules; in the next turns follow also these new ones: \{constraints\}Building on the earlier constraints, adhere to these as well in the following turns: \{constraints\}Keep in mind the previous constraints and, in addition, follow these new ones from here on:\{constraints\}Table 4:Templates used to introduce single or multiple constraints throughout the conversational testsets\. Tasks are appended immediately after the constraint\(s\)\.
## Appendix ESynthetic Task Generation

We construct synthetic tasks simulating interactions between diverse personas and an assistant using a two\-stage generation pipeline\. Starting from randomly sampled persona profiles from Persona Hub\(Ge et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib6)\), we first generate a structured daily agenda consisting of realistic and contextually grounded activities tailored to each persona’s profession, interests, and lifestyle\.[Figure23](https://arxiv.org/html/2605.06353#A5.F23)shows the prompt template used for agenda generation\.

In the second stage, we generate open\-ended questions conditioned on both the persona description and a specific activity from the generated agenda\. For each persona\-activity pair, the model produces independent and natural questions that the persona might plausibly ask an assistant within that scenario\.[Figure24](https://arxiv.org/html/2605.06353#A5.F24)provides the corresponding prompt template\.

This two\-step process yields ordered sequences of scenario\-grounded questions, forming coherent day\-long interaction trajectories for each persona\. From the generated data, we retain500500distinct personas, each associated with a single interaction trajectory consisting of120120user turns\. We use Qwen3\-Next\-80B\-A3B\-Instruct\-FP8\(Team,[2025](https://arxiv.org/html/2605.06353#bib.bib29); Yang et al\.,[2025](https://arxiv.org/html/2605.06353#bib.bib34)\)for all generations\.

Agenda Generation PromptCreate a possible agenda for a day in the life of the following persona:\{persona\}Note:1\. Identify several activities or tasks that the persona might engage in throughout their day\. Provide a detailed description of each activity or task\.2\. The agenda should be specific and tailored to the persona’s characteristics, interests, and lifestyle\.3\. Your output should start with “Agenda: ” and list the activities in chronological order\. Identify each activity in a new line with the markdown divider “\#\#\#”Figure 23:Prompt template used to generate a structured daily agenda conditioned on a persona description\.Question Generation PromptNext, you are given the description of a persona and an activity/task from their daily agenda\. Elaborate on how this activity/task might unfold, setting the stage for interesting questions the persona could naturally wonder about or ask an assistant for clarification\.Your ultimate goal is to generate a sequence of open\-ended, creative questions that build upon the scenario\. Put yourself in the position of the persona; each question should feel as if the persona is asking it in real time to someone or an AI assistant\. Importantly, no question should depend on or assume answers to previous ones\.Persona: \{persona\}Activity/Task: \{activity\}Note:1\. The questions can include details such as the location where the actions take place, people involved, time of day, emotions, challenges, or other relevant aspects that make the scenario vivid and engaging\.2\. Ensure the questions are coherent and consistent with both the persona and the activity/task\. Avoid contradictions\.3\. Write each question on a new line, preceded by the markdown divider “\#\#\#”Figure 24:Prompt template used to generate open\-ended questions conditioned on a persona description and a specific activity from the generated agenda\.
## Appendix FDetailed Experimental Results

We report several metrics summarizing model performance across regimes\.[Tables5](https://arxiv.org/html/2605.06353#A6.T5)and[6](https://arxiv.org/html/2605.06353#A6.T6)show how per\-turn accuracy changes between the beginning and end of conversations, as well as the overall variability across turns\. The heatmaps in Figures[25](https://arxiv.org/html/2605.06353#A6.F25)\-[30](https://arxiv.org/html/2605.06353#A6.F30)show the per\-turn accuracy for all models across regimes, providing a detailed view of how performance evolves throughout conversations\.

[Table7](https://arxiv.org/html/2605.06353#A6.T7)presents the average number of tokens per conversation for each model and regime, computed using each model\-specific tokenizer and chat template\. These counts exclude internal reasoning or thinking traces\. For Gemini 3\.1 Flash Lite, we employ the Gemma3 tokenizer and chat template\. While most evaluated models have context windows ranging from128128K to256256K tokens, Gemini exceeds11M\. Despite these large windows, we observed that some models still exhausted the available context in certain cases\.

SingleTuplesReplace 10Add 10EverythingModelsGemini\-3\.1\-Flash\-Lite\-10\.50\-24\.42\-11\.50\-40\.00\-9\.00Qwen3\-235B\-A22B\-Inst\-16\.00\-33\.00\-7\.50\-50\.00\-13\.00GPT\-oss\-120B\-35\.50\-41\.50\-45\.50\-59\.50\-41\.00Llama\-3\.3\-70B\-Inst\-30\.00\-44\.50\-18\.00\-72\.00\-22\.00Qwen3\-30B\-A3B\-Inst\-15\.00\-30\.50\-21\.50\-65\.50\-27\.00GLM\-4\.7\-Flash\-4\.50\-16\.50\-9\.26\-40\.50\-16\.00Gemma3\-27B\-40\.00\-54\.50\-19\.00\-78\.50\-30\.50GPT\-oss\-20B\-34\.50\-23\.50\-35\.50\-67\.50\-32\.00Gemma3\-12B\-34\.00\-54\.00\-14\.50\-81\.00\-27\.00Qwen3\-4B\-Inst\-12\.50\-28\.50\-23\.00\-71\.50\-34\.00Gemma3\-4B\-54\.50\-62\.00\-44\.50\-93\.50\-48\.00Average\-26\.09\-37\.54\-22\.71\-65\.41\-27\.23Table 5:Average difference in per\-turn accuracy between the last and first turns \(%\)\.SingleTuplesReplace 10Add 10EverythingModelsGemini\-3\.1\-Flash\-Lite\-16\.00\-24\.42\-13\.50\-40\.50\-16\.00Qwen3\-235B\-A22B\-Inst\-17\.00\-36\.00\-12\.00\-55\.00\-18\.00GPT\-oss\-120B\-37\.50\-42\.00\-47\.00\-66\.36\-44\.00Llama\-3\.3\-70B\-Inst\-30\.50\-45\.50\-22\.50\-72\.00\-29\.50Qwen3\-30B\-A3B\-Inst\-15\.00\-36\.00\-25\.00\-73\.00\-33\.50GLM\-4\.7\-Flash\-9\.50\-22\.00\-17\.00\-48\.50\-20\.00Gemma3\-27B\-41\.00\-56\.50\-35\.50\-80\.00\-31\.50GPT\-oss\-20B\-36\.50\-27\.50\-41\.00\-67\.50\-33\.50Gemma3\-12B\-39\.00\-59\.00\-28\.50\-83\.50\-29\.00Qwen3\-4B\-Inst\-16\.50\-28\.50\-23\.50\-74\.50\-35\.50Gemma3\-4B\-56\.00\-63\.00\-53\.50\-93\.50\-52\.00Average\-28\.59\-40\.04\-29\.00\-68\.58\-31\.14Table 6:Average difference in per\-turn accuracy between the best and worst turns \(%\)\.![Refer to caption](https://arxiv.org/html/2605.06353v1/x16.png)Figure 25:Per\-turn accuracy for all models in theSingleregime\.![Refer to caption](https://arxiv.org/html/2605.06353v1/x17.png)Figure 26:Per\-turn accuracy for all models in theTuplesregime\.![Refer to caption](https://arxiv.org/html/2605.06353v1/x18.png)Figure 27:Per\-turn accuracy for all models in theReplaceregime, where constraints are replaced every55turns\.![Refer to caption](https://arxiv.org/html/2605.06353v1/x19.png)Figure 28:Per\-turn accuracy for all models in theReplaceregime, where constraints are replaced every1010turns\.![Refer to caption](https://arxiv.org/html/2605.06353v1/x20.png)Figure 29:Per\-turn accuracy for all models in theAddregime, where new constraints are introduced every55turns\.![Refer to caption](https://arxiv.org/html/2605.06353v1/x21.png)Figure 30:Per\-turn accuracy for all models in theAddregime, where new constraints are introduced every1010turns\.SingleTuplesReplaceAddEverythingModel510510Gemini\-3\.134±\\pm1732±\\pm1520±\\pm824±\\pm1133±\\pm1333±\\pm1422±\\pm9Qwen3\-235B\-A22B39±\\pm2643±\\pm2522±\\pm1128±\\pm1540±\\pm2339±\\pm2026±\\pm13GPT\-oss\-120B95±\\pm5068±\\pm4352±\\pm2162±\\pm2972±\\pm4176±\\pm3948±\\pm20Llama\-3\.3\-70B28±\\pm1333±\\pm2620±\\pm622±\\pm733±\\pm2429±\\pm1023±\\pm6Qwen3\-30B\-A3B36±\\pm3341±\\pm3223±\\pm1128±\\pm1835±\\pm1936±\\pm2127±\\pm15GLM\-4\.7\-Flash25±\\pm1931±\\pm2418±\\pm921±\\pm1527±\\pm1827±\\pm2121±\\pm10Gemma3\-27B31±\\pm1332±\\pm1222±\\pm825±\\pm1232±\\pm1131±\\pm1226±\\pm9GPT\-oss\-20B61±\\pm3149±\\pm2834±\\pm1442±\\pm2049±\\pm2350±\\pm2335±\\pm13Gemma3\-12B31±\\pm1832±\\pm1320±\\pm823±\\pm1130±\\pm1130±\\pm1224±\\pm9Qwen3\-4B42±\\pm3750±\\pm5124±\\pm1530±\\pm2238±\\pm2341±\\pm3329±\\pm21Gemma3\-4B27±\\pm1231±\\pm1020±\\pm923±\\pm1030±\\pm1029±\\pm1024±\\pm9Average41±\\pm3340±\\pm3025±\\pm1530±\\pm2038±\\pm2538±\\pm2528±\\pm15Table 7:Average token count per conversation \(in thousands\)\.

Similar Articles

VCIFBench: Evaluating Complex Instruction Following for Video Understanding

arXiv cs.CL

VCIFBench is a new benchmark for evaluating complex instruction following in video understanding, featuring 306 test instructions with content, format, style, and structure constraints, plus a DPO preference dataset. Experiments on 10 MLLMs reveal that joint constraint satisfaction remains challenging, and DPO training on the benchmark data improves instruction-following performance.

How Well Do Models Follow Their Constitutions?

arXiv cs.AI

This paper proposes a multi-method audit pipeline to evaluate how well frontier AI models follow their written behavioral specifications (Anthropic's constitution and OpenAI's Model Spec) under adversarial multi-turn pressure, finding that newer models show significantly lower violation rates (e.g., Claude Sonnet 4.6 at 2.0% vs. Sonnet 4 at 15.0%).

Constraint acquisition needs better benchmarks

arXiv cs.AI

The article argues that current benchmarks for Constraint Acquisition are inadequate and introduces MPMMine, a benchmark suite designed to assess algorithms that discover, validate, and enhance mathematical programming models using diverse domain knowledge artifacts.