Conditional Attribute Estimation with Autoregressive Sequence Models
Summary
This paper introduces Conditional Attribute Transformers, a method for jointly estimating next-token probability and attribute values conditionally, enabling credit assignment, counterfactual analysis, and steerable generation in a single forward pass.
# Conditional Attribute Estimation with Autoregressive Sequence Models
Source: [https://arxiv.org/html/2605.14004](https://arxiv.org/html/2605.14004)
- Erica Stutz, Department of Biomedical Informatics and Data Science, Yale University, New Haven, CT 06510, erica\.stutz@yale\.edu
- Giacomo Marino, Department of Biomedical Informatics and Data Science, Yale University, New Haven, CT 06510, giacomo\.marino@yale\.edu
- Daniella Meeker, Department of Biomedical Informatics and Data Science, Yale University, New Haven, CT 06510, daniella\.meeker@yale\.edu
- Qiao Liu, Department of Biostatistics, Yale University, New Haven, CT 06510, qiao\.liu@yale\.edu
- Andrew J\. Loza, Department of Biomedical Informatics and Data Science, Department of Pediatrics, Yale University, New Haven, CT 06510, andrew\.loza@yale\.edu
###### Abstract
Generative models are often trained with a next\-token prediction objective, yet many downstream applications require the ability to estimate or control sequence\-level properties\. Despite its success, next\-token prediction can overfit local patterns during training, underfit global structure, and require significant downstream modifications or expensive sampling to guide or predict the global attributes of generated samples at inference time\. Here, we introduce Conditional Attribute Transformers, a novel method for jointly estimating the next\-token probability and the value of an attribute conditional on each potential next token selection\. This framework enables three critical capabilities within a single forward pass, without modification of the input sequence: \(1\) per\-token credit assignment across an entire sequence, by identifying how each token in a sequence is associated with an attribute’s value; \(2\) counterfactual analysis, by quantifying attribute differences conditional on alternative next token choices; \(3\) steerable generation, by decoding sequences based on a combination of next\-token and attribute likelihoods\. Our approach achieves state\-of\-the\-art performance on sparse reward tasks, improves next\-token prediction at sufficient model sizes, estimates attribute probabilities orders of magnitude faster than sampling, and can guide decoding of autoregressive sequence models on a range of language tasks\.
## 1 Introduction
Generative models have demonstrated performance advances across multiple domains from language to biomedical informatics Brown et al\. \([2020](https://arxiv.org/html/2605.14004#bib.bib10)\); Ferruz et al\. \([2022](https://arxiv.org/html/2605.14004#bib.bib9)\); Waxler et al\. \([2025](https://arxiv.org/html/2605.14004#bib.bib8)\); Brixi et al\. \([2025](https://arxiv.org/html/2605.14004#bib.bib4)\)\. While next\-token prediction is a scalable training objective Hoffmann et al\. \([2022](https://arxiv.org/html/2605.14004#bib.bib33)\), it optimizes for local coherence and can lead to greedy overfitting of local patterns and suboptimal prediction of critical branch tokens Gloeckle et al\. \([2024](https://arxiv.org/html/2605.14004#bib.bib11)\); Qi et al\. \([2020](https://arxiv.org/html/2605.14004#bib.bib7)\)\. Furthermore, the downstream utility of these models is frequently defined by sequence\-level attributes Chang et al\. \([2023](https://arxiv.org/html/2605.14004#bib.bib34)\)\. In language models, there is a need to control token selection to create text with specific attributes, such as correctness or helpfulness Keskar et al\. \([2019](https://arxiv.org/html/2605.14004#bib.bib23)\)\. In biomedical informatics, generative models are used to estimate clinically relevant sequence attributes such as disease onset, medical events, or treatment response through expensive Monte Carlo \(MC\) simulation Waxler et al\. \([2025](https://arxiv.org/html/2605.14004#bib.bib8)\); Renc et al\. \([2024](https://arxiv.org/html/2605.14004#bib.bib6)\); Shmatko et al\. \([2025](https://arxiv.org/html/2605.14004#bib.bib5)\)\. The capacity to learn representations that better capture sequence\-level properties, estimate properties efficiently from partial sequences, and control generation to create sequences with specific properties has broad implications across domains\.
These diverse use cases share a common mathematical requirement: the need to estimate a sequence\-level attribute for a partial sequence, conditional on the selection of the next token\. This capability allows for estimation of attribute likelihood for observed sequences or steerable decoding to optimize the likelihood of a particular attribute\. Current methods for predicting or controlling sequence\-level attributes are computationally demanding because the relevant properties may not emerge until many tokens in the future\. Current approaches fall into two main categories: conditioning and base\-model steering with auxiliary models\. These methods have several limitations including high computational overhead, requirements for additional model training, and limited flexibility\.
Here, we propose Conditional Attribute Transformers \(CAT\), a method for conditional attribute estimation \(Fig\. [1](https://arxiv.org/html/2605.14004#S1.F1)\)\. Leveraging a reinforcement learning framework, we cast data as arising from an unknown sequential game\. We develop a generative model of this framework with a joint objective of next\-token prediction and conditional sequence\-level attribute prediction, implemented in a single model through a branched architecture and shared latent space\.
The specific contributions of this work are as follows:
- We provide a framework for simultaneous estimation of next\-token likelihood and a sequence\-level attribute likelihood, conditional on each potential next token\. We also link this objective to components of causal inference and reinforcement learning\.
- We demonstrate that this objective can be integrated into \(a\) the pre\-training of generative decoder\-only transformers with minimal computational overhead and can synergistically improve next\-token perplexity, or \(b\) pre\-trained models through fine\-tuning\.
- We evaluate performance on three diverse tasks: \(1\) learning strategy from random gameplay, \(2\) predicting and controlling the likely rating for Amazon product reviews, and \(3\) predicting sepsis onset in a medical data set\.

Figure 1: CAT is a unified architecture for next\-token and sequence\-level attribute prediction\. Tokens \($t_{n}$\) are processed by a shared backbone\. The final latent representation is simultaneously processed by the language modeling head \(Token Head\) and an integrated conditional attribute model \(attribute block \+ attribute head\)\. Next\-token cross\-entropy loss \($L_{token}$\) is combined with the attribute loss \($L_{attr}$\), which can be from a binary, multinomial, or numeric attribute, delivered as a token\-level sequence \($a_{n}$\)\. During training, the full conditional attribute matrix does not have to be materialized \(gray\) because only the attribute of the true next token is seen, although it is learned\. Lower right panel shows an example of continuation \(dashed\) or CAT steering \(solid\) to change a 5 star prefix into a 1 star full review\.
## 2 Related Work
Numerous methods for controlling or estimating sequence\-level attributes have been developed and fall into two classes: generation via conditioning or inference time guidance by a separate model\.
Conditioning is used for generative model steering by inserting a fixed prompt or code in its input\. Pre\-training conditioning includes methods like CTRL Keskar et al\. \([2019](https://arxiv.org/html/2605.14004#bib.bib23)\), which prepends control codes to sequences, and Decision Transformers Chen et al\. \([2021](https://arxiv.org/html/2605.14004#bib.bib14)\), which insert reward\-to\-go tokens within each reward\-state\-action tuple using an offline reinforcement learning framework\. Among post\-training conditioning methods, Quark prepends sequences with reward quantile tokens Lu et al\. \([2022](https://arxiv.org/html/2605.14004#bib.bib24)\)\. While these models can condition generation, they do not ensure that insertion of control tokens produces sequences that remain in distribution or provide probabilistic estimates of the attribute likelihood\. Furthermore, if a downstream token is selected in error, there is no means of correction\. CAT, by contrast, estimates the attribute at every step without requiring modification of the input sequence, allowing for more flexible and active control of steering\.
Alternatively, token generation can be guided with auxiliary models\. Classifier\-based methods include PPLM, which uses gradients from an external classifier to guide generation, and FUDGE, which trains a binary classifier to predict attribute selection from partial sequences Dathathri et al\. \([2019](https://arxiv.org/html/2605.14004#bib.bib25)\); Yang and Klein \([2021](https://arxiv.org/html/2605.14004#bib.bib26)\)\. GeDi uses a generative discriminator to update next\-token probabilities Krause et al\. \([2020](https://arxiv.org/html/2605.14004#bib.bib35)\)\. Director integrates the generator and classifier into a unified model by adding an attribute head alongside the language modeling head at the final latent representation Arora et al\. \([2022](https://arxiv.org/html/2605.14004#bib.bib22)\)\. DExperts uses two auxiliary models, one fine\-tuned on the desired attribute and the other on an undesired attribute, to reweight next\-token probabilities of the main model Liu et al\. \([2021](https://arxiv.org/html/2605.14004#bib.bib21)\)\. TRACE distills a hidden Markov model from a base language model to calculate the sequence\-level attribute probability, while ILQL takes this a step further by using full Q\-learning and not just single\-step policy updates Weng et al\. \([2025](https://arxiv.org/html/2605.14004#bib.bib19)\); Snell et al\. \([2022](https://arxiv.org/html/2605.14004#bib.bib20)\)\. Many of these approaches require significant computational overhead due to the training of auxiliary models, including three separate transformer models with ILQL Snell et al\. \([2022](https://arxiv.org/html/2605.14004#bib.bib20)\)\. Notable exceptions are TRACE and Director; however, they are limited by the expressivity of the distilled hidden Markov model and a simple linear layer, respectively\. CAT maintains the full expressivity of a transformer model, yet requires substantially less computational overhead than training a full auxiliary model\.
## 3 Methods
### 3\.1 Definitions
We model data as arising from an unknown sequential game in which the rules, players, strategies, and differences between actions and observations are unknown\. We only have access to records of its play, which include a mixed sequence of actions and observations and an associated result which we consider as the sequence\-level attribute, $\alpha$\. The attribute can be binary, multinomial, or numeric\. After observing these records, we aim to address two questions: \(1\) can an agent learn to play \(i\.e\. to generate valid moves\) and \(2\) can it learn to play well \(i\.e\. to achieve a specified result\)?
Each game consists of a sequence $S=[s_{1},s_{2},...,s_{N}]$ of observations or actions $s_{i}$\. We do not assume that we need to differentiate between these two during training\. We use $s_{n}$ to denote a candidate next element in the sequence\. For simplicity, we will consider only $s_{n}$ that are drawn from a discrete language $L$\. For a sequence $S$, there exists an unknown function $f(S)=P(\alpha_{i}\mid S)$ that generates $\alpha_{i}\in A$ conditioned on $S$, which we take here to be the result of the game\. We assume that each sequence $S$ represents a valid game instance, but make no assumption that it is generated from a specific policy\. Instead, it can be viewed as a sample from an average policy distribution $\pi^{\mu}$\.
### 3\.2 Model
##### Joint Distribution:
A record of an unknown game can be drawn from the joint distribution of the sequence of gameplay and its result attribute, $P(S,\alpha_{i})$\. A standard autoregressive decomposition is:

$$P(S,\alpha_{i})=P(\alpha_{i}\mid S)\,P(S)\tag{1}$$

$$P(S)=\prod_{i=1}^{N}P(s_{i}\mid s_{1},...,s_{i-1})\tag{2}$$

If $\alpha_{i}$ is treated as an additional observation, a single autoregressive model can be used for actions, game state, and game result\. However, the probability of $\alpha_{i}$ can only be estimated after conditioning on the full sequence\. For partial sequences, MC simulation must be used to estimate $P(\alpha_{i},S)$\.
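The MC baseline described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: `mc_attribute_estimate`, `toy_sample_next`, and `toy_attribute` are hypothetical names, and the toy "game" (uniform sampling over three tokens, attribute equal to 1 when the token `a` appears) is an assumption chosen only to make the example runnable.

```python
import random

def mc_attribute_estimate(prefix, sample_next, attribute,
                          n_rollouts=1000, max_len=20, seed=0):
    """Estimate P(attribute = 1 | prefix) by sampling full continuations."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rollouts):
        seq = list(prefix)
        # Roll the sequence forward until it ends or hits the length cap.
        while len(seq) < max_len and (not seq or seq[-1] != "<eos>"):
            seq.append(sample_next(seq, rng))
        hits += attribute(seq)
    return hits / n_rollouts

def toy_sample_next(seq, rng):
    # Hypothetical sequence model: uniform over a three-token language.
    return rng.choice(["a", "b", "<eos>"])

def toy_attribute(seq):
    # Hypothetical binary attribute: did token "a" ever occur?
    return int("a" in seq)

estimate = mc_attribute_estimate(["b"], toy_sample_next, toy_attribute)
```

Each estimate costs `n_rollouts` full generations, which is the overhead the conditional attribute head is meant to avoid.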
##### Alternative Decomposition:
We can consider $S$ to be a sequence of three parts $(S_{a},s_{n},S_{b})$: a prefix sequence $S_{a}$ of all observations preceding $s_{n}$, a next observation $s_{n}$, and a suffix sequence $S_{b}$ consisting of all observations following $s_{n}$\. The original joint distribution can be expanded as:

$$P(\alpha_{i},S)=P(\alpha_{i},S_{a},s_{n},S_{b})\tag{3}$$

We can decompose this new expanded joint distribution as:

$$P(\alpha_{i},S_{a},s_{n},S_{b})=P(S_{a})\cdot P(s_{n}\mid S_{a})\cdot P(\alpha_{i},S_{b}\mid S_{a},s_{n})\tag{4}$$

We can then marginalize over $S_{b}$:

$$\sum_{S_{b}}P(\alpha_{i},S_{a},s_{n},S_{b})=P(S_{a})\cdot P(s_{n}\mid S_{a})\cdot\sum_{S_{b}}P(\alpha_{i},S_{b}\mid S_{a},s_{n})\tag{5}$$

which yields:

$$P(\alpha_{i},S_{a},s_{n})=\underbrace{P(S_{a})}_{\mathrm{prefix}}\cdot\underbrace{P(s_{n}\mid S_{a})}_{\mathrm{sequence\ model}}\cdot\underbrace{P(\alpha_{i}\mid S_{a},s_{n})}_{\mathrm{attribute\ model}}\tag{6}$$

Since we make no assumptions on the length of $S_{b}$, this decomposition is valid for any position of $s_{n}$ including the end of the sequence $S$, in which case $S_{b}=\varnothing$\.
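As a worked step (a restatement using only the definitions above, not additional material from the authors), the passage from Eq. 5 to Eq. 6 relies on two marginalization identities over the suffix:

```latex
\begin{aligned}
\sum_{S_b} P(\alpha_i, S_b \mid S_a, s_n) &= P(\alpha_i \mid S_a, s_n), \\
\sum_{S_b} P(\alpha_i, S_a, s_n, S_b) &= P(\alpha_i, S_a, s_n).
\end{aligned}
```

Substituting the first identity into the right side of Eq. 5 and the second into the left side gives Eq. 6.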
##### Distribution Estimation:
This decomposition can be modeled using an augmented causal transformer with two heads: a next\-token prediction head for estimating $P(s_{n}\mid S_{a})$ and a conditional attribute prediction head for estimating $P(\alpha_{i}\mid S_{a},s_{n})$\. The prefix $P(S_{a})$ is modeled autoregressively by the sequence head, enabling sampling from the full joint distribution $P(\alpha_{i},S)$\. Here, we use a shared model backbone $f_{\theta}$ to produce a hidden representation $H$\.

$$f_{\theta}(S)=H,\quad\text{where }f_{\theta}:S\to\mathbb{R}^{d}\tag{7}$$

This representation is passed to two heads:

$$g_{\psi}(H)=P(s_{n}\mid S),\quad\text{where }g_{\psi}:\mathbb{R}^{d}\to\Delta^{|L|}\tag{8}$$

$$h_{\phi}(H,s_{n})=P(\alpha_{i}\mid S,s_{n}),\quad\text{where }h_{\phi}:\mathbb{R}^{d}\times L\to\begin{cases}\Delta^{|A|}&\text{(categorical attributes)}\\ \mathbb{R}^{p}&\text{(parameterized attributes)}\end{cases}\tag{9}$$

such that the backbone model $f_{\theta}$ contains information about both next\-token and attribute while $g_{\psi}$ and $h_{\phi}$ provide task\-specific transformations\. The function $h_{\phi}$ can be selected to produce attribute category logits or distributional parameters depending on the attribute type\. Here, we demonstrate the model using binary, multinomial, and numeric attributes\.
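The type signatures in Eqs. 8 and 9 can be illustrated with a small pure-Python sketch. It is not the paper's architecture: the way the candidate token is fused with the hidden state here (adding a token embedding before a linear layer) is an assumption made for brevity, whereas the paper describes a transformer attribute block. All weights and names (`token_head`, `attribute_head`, `W_tok`, `E_tok`, `W_attr`) are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def token_head(H, W_tok):
    """g_psi: hidden state -> distribution over the vocabulary L."""
    return softmax(matvec(W_tok, H))

def attribute_head(H, token_id, E_tok, W_attr):
    """h_phi: (hidden state, candidate token) -> distribution over attribute classes A."""
    # Assumed fusion: add the candidate token's embedding to the hidden state.
    fused = [h + e for h, e in zip(H, E_tok[token_id])]
    return softmax(matvec(W_attr, fused))

# Toy sizes: d = 2, |L| = 3, |A| = 2.
H = [0.5, -1.0]
W_tok = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
E_tok = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
W_attr = [[1.0, -1.0], [-1.0, 1.0]]

p_next = token_head(H, W_tok)  # P(s_n | S), one entry per token
# Full |L| x |A| conditional attribute matrix, one row per candidate token.
p_attr = [attribute_head(H, s, E_tok, W_attr) for s in range(len(W_tok))]
```

At inference the full matrix `p_attr` is available from one forward pass, so every candidate next token gets its own attribute estimate.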
### 3\.3 Inference
Samples of observed games can be generated via $s_{n}\sim P(s_{n}\mid S_{a})$ from the sequence model\. The conditional attribute model can be used for alternative sampling strategies such as greedy decoding toward an attribute of interest\. When selecting $s_{n}$, we can restrict choices to estimated valid moves \(next\-token probability above a threshold $\epsilon$, which keeps generation within the distribution of the training data\) and select the one that maximizes the probability of achieving a desired result:

$$s_{n}^{*}=\arg\max_{s_{n}}P(\alpha_{i}\mid S_{a},s_{n})\quad\text{for all}\quad s_{n}\in P(s_{n}\mid S_{a})>\epsilon\tag{10}$$

Eq\. [10](https://arxiv.org/html/2605.14004#S3.E10) defines a greedy optimizer for maximizing the likelihood of $\alpha_{i}$\. This approach can be used to select locally optimal actions and develop a policy $\pi^{new}$ that improves upon the average policy $\pi^{\mu}$\. However, it does not guarantee attainment of the global optimum\.
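The selection rule in Eq. 10 can be sketched in a few lines. The function name `guided_next_token` and the toy probabilities are hypothetical; the inputs stand in for the outputs of the token head and the attribute head for one target attribute value.

```python
def guided_next_token(p_next, p_attr_target, epsilon):
    """Among tokens with P(s_n | S_a) > epsilon, return the index that
    maximizes P(alpha_i | S_a, s_n). Returns None if no token passes."""
    valid = [s for s, p in enumerate(p_next) if p > epsilon]
    if not valid:
        return None
    return max(valid, key=lambda s: p_attr_target[s])

# Toy example with a four-token vocabulary (made-up numbers).
p_next = [0.60, 0.30, 0.09, 0.01]   # next-token probabilities
p_win = [0.20, 0.70, 0.40, 0.99]    # P(win | prefix, candidate token)

choice = guided_next_token(p_next, p_win, epsilon=0.05)
```

With `epsilon=0.05` the last token is excluded despite its high win probability, because it is too unlikely under the sequence model. Lowering `epsilon` trades distributional fidelity for stronger steering.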
### 3\.4 Relationship to Other Entities
While we introduce this model as a conditional attribute estimator, it is directly related to functions in reinforcement learning and causal inference\.
##### Relationship to Q\-Functions:
Although derived using different notation, the components of this framework map to those in Distributional Reinforcement Learning in an offline setting Bellemare et al\. \([2017](https://arxiv.org/html/2605.14004#bib.bib37)\)\. Consider the case where $\alpha_{i}$ represents a reward, $s_{n}$ represents an action, and $S_{a}$ represents a state\. Note that here we have discussed the attribute as a terminal reward, but the derivation applies to sub\-sequences as well\. The sequence model $P(s_{n}\mid S_{a})$ functions as the average behavior policy $\pi_{\mu}(s_{n}\mid S_{a})$, representing the probability of selecting an action $s_{n}$ given a state $S_{a}$ in the training distribution\. The attribute model $P(\alpha_{i}\mid S_{a},s_{n})$ learns the conditional distribution over rewards\. The standard state\-action value function under the behavior policy, $Q^{\pi_{\mu}}(s,a)$, is recovered by taking the expectation over this distribution:

$$Q^{\pi_{\mu}}(S_{a},s_{n})=\mathbb{E}_{\alpha\sim P(\cdot\mid S_{a},s_{n})}[\alpha]\tag{11}$$

We note that this work focuses on single\-step policy improvement by selecting from the learned $Q$ function in a greedy \(or adjusted\-greedy\) manner rather than learning a globally optimal $Q^{*}$ through recursive value updates Snell et al\. \([2022](https://arxiv.org/html/2605.14004#bib.bib20)\)\.
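For a categorical attribute with numeric values, the expectation in Eq. 11 is a probability-weighted sum. The sketch below uses made-up distributions over a 1 to 5 rating; `q_value` and the candidate tokens are hypothetical illustrations, not values from the paper.

```python
def q_value(attr_probs, attr_values):
    """Expected attribute value under P(alpha | S_a, s_n)."""
    return sum(p * v for p, v in zip(attr_probs, attr_values))

ratings = [1, 2, 3, 4, 5]

# Made-up conditional attribute distributions for two candidate next tokens.
candidates = {
    "great": [0.0, 0.0, 0.1, 0.3, 0.6],
    "bad": [0.5, 0.3, 0.1, 0.1, 0.0],
}

q = {tok: q_value(probs, ratings) for tok, probs in candidates.items()}
greedy_action = max(q, key=q.get)  # single-step greedy improvement
```

Choosing `greedy_action` corresponds to one step of policy improvement over the behavior policy; no recursive value update is involved.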
##### Relationship to Causal Models:
In the case where $s_{n}$ represents a treatment or intervention and $S_{a}$ represents the history of confounders or covariates, this approach parallels the core components of causal inference\. Specifically, the sequence model $P(s_{n}\mid S_{a})$ functions as a generalized propensity score, estimating the probability of treatment assignment conditioned on covariates:

$$e(S_{a})=P(Z=s_{n}\mid X=S_{a})\tag{12}$$

Simultaneously, the attribute model $P(\alpha_{i}\mid S_{a},s_{n})$ functions as a conditional outcome model \(or response surface\), estimating the expected outcome given the covariates and treatment:

$$\mu(S_{a},s_{n})=\mathbb{E}[Y=\alpha_{i}\mid X=S_{a},Z=s_{n}]\tag{13}$$

While it operates on fixed covariates, focuses on single treatment assignment, and uses binary outcomes, Dragonnet is similar in that it uses a shared latent representation for prediction of a propensity and outcome model Shi et al\. \([2019](https://arxiv.org/html/2605.14004#bib.bib36)\)\. The recent generative causal inference approaches Liu et al\. \([2024b](https://arxiv.org/html/2605.14004#bib.bib1)\); Liu and Wong \([2026b](https://arxiv.org/html/2605.14004#bib.bib3), [a](https://arxiv.org/html/2605.14004#bib.bib2)\) further generalize this idea through latent probabilistic modeling of treatments, covariates, and outcomes, enabling flexible conditional inference and uncertainty quantification\. Causal transformers use sequential data, but require separate streams for covariates, treatments, and outcomes recorded at specific time intervals Melnychuk et al\. \([2022](https://arxiv.org/html/2605.14004#bib.bib29)\)\. Other methods use expensive MC rollouts for estimating the outcomes associated with treatment assignments Rein et al\. \([2024](https://arxiv.org/html/2605.14004#bib.bib27)\); Xiong et al\. \([2024](https://arxiv.org/html/2605.14004#bib.bib28)\)\. CAT improves upon these by using full sequential data and avoiding MC simulation\.
### 3\.5 Model Architecture
We extended the nanoGPT architecture Karpathy \([2023](https://arxiv.org/html/2605.14004#bib.bib13)\) for joint next\-token and conditional attribute prediction\. The forward pass of CAT matches standard transformers until the final latent embedding\. This branches to a standard language modeling head and the conditional attribute transformer block \(similar to how multi\-token prediction is performed with DeepSeek\-V3 during pretraining\) Liu et al\. \([2024a](https://arxiv.org/html/2605.14004#bib.bib32)\)\. The attribute block is used to predict the parameters required for attribute prediction, which can be binary, multinomial, or numeric\. Computationally efficient training is achieved by recognizing that although next\-token prediction requires computing logits across the entire vocabulary, the conditional attribute loss only requires computing the probabilities of attributes for the single true next token\. Therefore, for a vocabulary of size $V$ and an attribute dimension of $A$, only a $1\times A$ matrix must be materialized during training instead of a $V\times A$ matrix\. Cross\-entropy loss is used for next\-token prediction and the attribute loss can be any likelihood\-based loss\. We used cross\-entropy loss for binary and multinomial tasks, and Gaussian negative log\-likelihood for regression tasks\. The total training loss was defined as:

$$L=L_{token}+\lambda\cdot L_{attr}\tag{14}$$

where $L_{token}$ is next\-token cross\-entropy loss, $L_{attr}$ is attribute loss, and $\lambda$ controls the relative contribution of the attribute loss\. Optimal $\lambda$ values varied across tasks, and were selected by balancing next\-token loss, attribute loss, and sampling performance\. Model sizes and hyperparameters for each experiment are provided in Tables [A\.1](https://arxiv.org/html/2605.14004#A1.T1), [A\.2](https://arxiv.org/html/2605.14004#A1.T2), and [A\.4](https://arxiv.org/html/2605.14004#A1.T4)\.
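A per-position version of Eq. 14 can be sketched as below. This is an illustration under stated assumptions: the function names are hypothetical, the losses are written for a single position rather than a batch, and the Gaussian negative log-likelihood includes the constant term, which the paper does not specify.

```python
import math

def joint_loss(p_next, true_token, p_attr_true, true_attr, lam):
    """L = L_token + lam * L_attr for one position with a categorical attribute.

    p_attr_true is the 1 x A attribute distribution computed only for the
    true next token, so the V x A matrix is never needed during training.
    """
    l_token = -math.log(p_next[true_token])
    l_attr = -math.log(p_attr_true[true_attr])
    return l_token + lam * l_attr, l_token, l_attr

def gaussian_nll(y, mean, var):
    """Negative log-likelihood of y under N(mean, var), for numeric attributes."""
    return 0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)

# Toy values: three-token vocabulary, binary attribute, lam = 0.15.
total, l_token, l_attr = joint_loss(
    p_next=[0.5, 0.25, 0.25], true_token=0,
    p_attr_true=[0.25, 0.75], true_attr=1,
    lam=0.15,
)
```

Setting `lam=0` recovers plain next-token training, which matches the description of the standard GPT baseline as a weight of zero.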
### 3\.6 Data Sets
Experiments were conducted on three data sets: \(1\) the Key\-to\-Door environment used in the Decision Transformers paper Chen et al\. \([2021](https://arxiv.org/html/2605.14004#bib.bib14)\) to test performance in a sparse\-reward setting, \(2\) a language modeling data set \(Amazon Reviews Hou et al\. \([2024](https://arxiv.org/html/2605.14004#bib.bib31)\)\) to test performance on large action spaces, and \(3\) a clinical data set \(PhysioNet Sepsis Reyna et al\. \([2019](https://arxiv.org/html/2605.14004#bib.bib30)\)\) to test performance on a biomedical informatics task\.
1. Key\-to\-Door: data on 10,000 random\-walk trajectories in a three\-room grid world consisting of a key room, a distractor room, and a door room\. The agent must pick up the key and reach the door within a fixed move budget\. This environment was introduced in Mesnard et al\. \([2021](https://arxiv.org/html/2605.14004#bib.bib15)\) and later used in Chen et al\. \([2021](https://arxiv.org/html/2605.14004#bib.bib14)\) as a credit assignment benchmark\.
2. Amazon Reviews: data on 574 million Amazon product reviews from Hou et al\. \([2024](https://arxiv.org/html/2605.14004#bib.bib31)\)\. Each sequence consists of a product category, title, review text, and a 1\-5 star rating, which serves as a multinomial attribute\.
3. PhysioNet Sepsis: ICU time\-series data set with 40,336 patient sequences for early sepsis identification\. Each sequence contains patient demographics followed by hourly vital\-sign and laboratory\-measurement tokens, with time tokens separating hourly measurements\. This was derived from the 2019 PhysioNet Challenge data set Reyna et al\. \([2019](https://arxiv.org/html/2605.14004#bib.bib30)\)\.
## 4 Results
### 4\.1 Key\-to\-Door: Long\-Term Credit Assignment
The Key\-to\-Door task tests CAT’s ability to learn attribute assignment from a single terminal reward\. CAT successfully learns to estimate conditional win probabilities for each potential next move \(Fig\. [2](https://arxiv.org/html/2605.14004#S4.F2)A\) and can serve as a critic to stably classify the win probability for evolving trajectories with reduced variance compared to decision transformers \(Fig\. [2](https://arxiv.org/html/2605.14004#S4.F2)B\) Chen et al\. \([2021](https://arxiv.org/html/2605.14004#bib.bib14)\)\. Using greedy decoding \(Eq\. [10](https://arxiv.org/html/2605.14004#S3.E10)\), the model outperforms all baselines: random policy, behavior cloning, percentile behavior cloning \(trained on only winning examples\), conservative Q\-learning, and decision transformers \(Table [1](https://arxiv.org/html/2605.14004#S4.T1)\)\.

Figure 2: Key\-to\-Door task\. A: Agent moving in the key room with the move and win probabilities\. B: Average and 95% confidence interval for estimated win probability for trajectories stratified by outcome\. The dashed line demarcates moves in each room: key \(1\), distractor \(2\), and door \(3\)\.

Table 1: Win rates of evaluated methods for the Key\-to\-Door task\. In addition to the 0\.999 win rate, CAT takes the shortest Manhattan\-distance path in 998 of the 999 wins\.

| | Random Policy | Behavioral Cloning | Percentile Behavioral Cloning | Conservative Q\-Learning | Decision Transformers | Conditional Attribute Transformers |
| --- | --- | --- | --- | --- | --- | --- |
| Win Rate | 0\.031 | 0\.016 | 0\.951 | 0\.133 | 0\.946 | 0\.999 |
### 4\.2 Language Modeling: Amazon Reviews
To test scalability and performance with a large input and action space, we evaluate CAT on the Amazon Reviews dataset, using review text as the sequence and the multinomial 5\-class product rating as the attribute \(Fig\. [A\.1](https://arxiv.org/html/2605.14004#A1.F1)\)\. We evaluate \(1\) the impact of joint training on next\-token prediction performance, \(2\) the performance of its role as a critic by estimating the rating from partial reviews, \(3\) the identification of semantic shifts caused by counterfactual adjective substitution, and \(4\) steering generation to create reviews with specific ratings\. Full model details are reported in Appendix [A\.2](https://arxiv.org/html/2605.14004#A1.SS2)\.
#### 4\.2\.1 Next\-Token Prediction Performance
In contrast to prior methods where joint training worsened next\-token prediction performance Arora et al\. \([2022](https://arxiv.org/html/2605.14004#bib.bib22)\), we find that the CAT architecture enables improved next\-token prediction performance with increasing model size\. At small model sizes \(7 to 72 million parameters\), the conditional attribute task impairs next\-token prediction\. However, the 1\-billion parameter CAT model achieved better performance on next\-token prediction than the standard model\. This synergy depended on the $\lambda$ used to balance the two contributions to the joint loss \(Fig\. [3](https://arxiv.org/html/2605.14004#S4.F3)\)\. Subsequent experiments use $\lambda=0.15$ to balance next\-token and attribute prediction performance\.

Figure 3: Token perplexity for CAT versus GPT\. The perplexity from next\-token prediction for CAT models versus standard GPT models across model size and CAT attribute head weights is shown\. The GPT model is equivalent to a weight of zero \(no attribute loss contribution\)\. Synergy between the two tasks of next\-token and conditional attribute prediction depends on model size and task weight\.
#### 4\.2\.2 Review Critic Performance
CAT estimates attributes from partial sequences in a simulation\-free manner\. Performance of predicting the rating using CAT \(conditional attribute probability for the true next token\) was compared to MC simulation, an attribute head fine\-tuned on a frozen standard GPT model, an attribute\-only CAT model, and a version of Director which we extended for multinomial attributes \(Director\*\) \(Fig\. [4](https://arxiv.org/html/2605.14004#S4.F4)A\)\. Due to the extensive compute required, MC sampling was evaluated on 4,000 reviews and four partial review lengths, whereas other methods are evaluated on 1 million reviews\.
CAT and the fine\-tuned CAT model both outperformed Director\*, and MC sampling using CAT’s next\-token head outperformed the standard next\-token model\. The attribute\-only CAT model underperformed most methods, except standard next\-token MC simulation at a partial sequence length of 80, indicating that jointly modeling next\-token and conditional attribute prediction improves performance over either task alone\. CAT also provided an approximately $10^{8}\times$ speedup in partial sequence evaluation relative to MC sampling \(Fig\. [4](https://arxiv.org/html/2605.14004#S4.F4)B\)\.

Figure 4: Rating prediction from partial reviews\. A: Top\-1 rating prediction accuracy for CAT, fine\-tuned CAT, attribute\-only CAT, Director\*, standard next\-token MC simulation \($n=100$\), and CAT MC simulation \($n=100$\)\. The first four models were evaluated on 1 million reviews; sampling\-based approaches were evaluated on 4,000 reviews due to computational cost\. B: Compute time as a function of expected sequence length for CAT and standard next\-token MC simulation\.
#### 4\.2\.3 Counterfactual Estimation
Table 2: Counterfactual estimation of substituting *good* with alternative adjectives on predicted 1 and 5 star probabilities \($\Delta P_{1}$, $\Delta P_{5}$\)\. Shown for all contexts and negated contexts \(i\.e\. *not good*\)\.

| Counterfactual | $\Delta P_{1}$, All Contexts | $\Delta P_{1}$, Negation in Context | $\Delta P_{5}$, All Contexts | $\Delta P_{5}$, Negation in Context |
| --- | --- | --- | --- | --- |
| good $\rightarrow$ AMAZING | \-0\.01 | \-0\.07 | 0\.09 | 0\.02 |
| good $\rightarrow$ amazing | \-0\.01 | \-0\.11 | 0\.07 | 0\.02 |
| good $\rightarrow$ GREAT | 0\.00 | \-0\.08 | 0\.06 | 0\.00 |
| good $\rightarrow$ great | 0\.00 | \-0\.08 | 0\.04 | 0\.00 |
| good $\rightarrow$ GOOD | 0\.00 | 0\.00 | 0\.03 | 0\.00 |
| good $\rightarrow$ bad | 0\.06 | \-0\.14 | \-0\.08 | 0\.06 |
| good $\rightarrow$ BAD | 0\.08 | \-0\.11 | \-0\.10 | 0\.03 |
| good $\rightarrow$ horrible | 0\.09 | \-0\.11 | \-0\.11 | 0\.02 |
| good $\rightarrow$ HORRIBLE | 0\.10 | \-0\.04 | \-0\.19 | 0\.01 |

We evaluated the semantic accuracy of CAT’s counterfactual estimates by substituting sentiment\-bearing adjectives in 1,000,000 validation reviews\. Replacing *good* with negative adjectives increased 1 star and decreased 5 star probabilities\. Positive substitutions increased 5 star probabilities with little effect on 1 star probabilities\. Substitutions in negated contexts \(*not good*\) had more nuanced shifts: both positive and negative substitutions reduced 1 star probability and led to no change or increased 5 star probabilities\. Capitalization modulated these effects, reflecting the semantic emphasis \(Table [2](https://arxiv.org/html/2605.14004#S4.T2)\)\.
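The per-substitution quantities reported above are differences between conditional attribute distributions for two candidate tokens at the same position. A minimal sketch, with made-up probabilities and a hypothetical helper name `counterfactual_delta`:

```python
def counterfactual_delta(attr_probs_by_token, observed, alternative):
    """Change in each attribute class probability when the observed token
    is swapped for an alternative at the same position."""
    p_obs = attr_probs_by_token[observed]
    p_alt = attr_probs_by_token[alternative]
    return [a - o for a, o in zip(p_alt, p_obs)]

# Made-up 5-class rating distributions (1 star ... 5 star) for two candidates.
attr_probs_by_token = {
    "good": [0.1, 0.1, 0.2, 0.3, 0.3],
    "horrible": [0.3, 0.2, 0.2, 0.2, 0.1],
}

delta = counterfactual_delta(attr_probs_by_token, "good", "horrible")
delta_p1, delta_p5 = delta[0], delta[-1]
```

Because both rows come from the same forward pass, no second run with an edited input is needed. Averaging such deltas over many contexts would give table entries of the kind shown above.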
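#### 4\.2\.4 Guided Decoding
In steering 3 star prompts to 1 and 5 star reviews, CAT had the highest accuracy \(CAT accuracy 0\.64/0\.77; best alternative Director\* 0\.58/0\.65\) across all evaluated models\. Furthermore, generations were more fluent than models of similar accuracy \(CAT perplexity 45\.88/44\.03; best alternative Director\* 46\.77/48\.16\) and had comparable diversity \(Table [3](https://arxiv.org/html/2605.14004#S4.T3)\)\.

Table 3: Steering 3 star prompts toward 1 and 5 star reviews\. Fluency is reported as mean perplexity \(lower is better\); diversity is reported as Dist\-1, Dist\-2, and Dist\-3 \(higher is better\)\.

| Target | Model | Accuracy | Mean ppl\. \($\downarrow$\) | Dist\-1 \($\uparrow$\) | Dist\-2 \($\uparrow$\) | Dist\-3 \($\uparrow$\) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 star | CTRL | 0\.13 | 13\.68 | 0\.90 | 0\.87 | 0\.78 |
| 1 star | DExperts | 0\.14 | 19\.24 | 0\.91 | 0\.86 | 0\.77 |
| 1 star | Director | 0\.25 | 20\.18 | 0\.88 | 0\.89 | 0\.82 |
| 1 star | Director\* | 0\.58 | 46\.77 | 0\.80 | 0\.80 | 0\.75 |
| 1 star | CAT | 0\.64 | 45\.88 | 0\.73 | 0\.65 | 0\.57 |
| 5 star | CTRL | 0\.41 | 13\.45 | 0\.90 | 0\.86 | 0\.77 |
| 5 star | DExperts | 0\.41 | 18\.99 | 0\.91 | 0\.85 | 0\.76 |
| 5 star | Director | 0\.47 | 18\.95 | 0\.90 | 0\.86 | 0\.78 |
| 5 star | Director\* | 0\.65 | 48\.16 | 0\.83 | 0\.75 | 0\.65 |
| 5 star | CAT | 0\.77 | 44\.03 | 0\.79 | 0\.84 | 0\.82 |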
### 4\.3PhysioNet Sepsis
To evaluate CAT’s critic performance beyond reinforcement learning and language, we assessed its ability to predict sepsis in ICU patients. We evaluated two sequence-level attributes: a binary label for sepsis occurrence during the ICU course, and a continuous label for the maximum heart rate (HR) within a 6-hour sliding window. Assessed 12 hours prior to onset, CAT’s predictive performance was comparable to MC simulation of the standard next-token model and superior to Director (Fig. [5](https://arxiv.org/html/2605.14004#S4.F5)A). Although the standard model achieved a slightly higher ROC AUC, CAT provided a substantial improvement in Average Precision (AP), indicating higher precision at clinically relevant recall levels in this highly imbalanced prediction setting.
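To make the distinction between the two metrics concrete, the following sketch computes ROC AUC and AP from scratch on a small, made-up imbalanced example (two positives among ten cases). The data and function names are hypothetical and unrelated to the PhysioNet results; AP is computed here as the mean of the precision values at the rank of each positive, which assumes no tied scores.

```python
def average_precision(labels, scores):
    """Mean precision at the rank of each positive (assumes no tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def roc_auc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: positives sit at ranks 1 and 4 of 10.
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

print(average_precision(labels, scores))  # 0.75
print(roc_auc(labels, scores))            # 0.875
```

In this toy case the two false positives ranked above the second true positive cost far more AP than ROC AUC, which is why AP is often the more informative metric when positives are rare.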
To evaluate sensitivity to counterfactual vital sign substitutions, we isolated the first temperature reading in sepsis-positive validation sequences and measured stepwise changes in predicted sepsis probability across temperature bins (Fig. [5](https://arxiv.org/html/2605.14004#S4.F5)B). Elevated temperature increased predicted sepsis risk, with substantially larger hyperthermic risk increases among older patients, consistent with their greater vulnerability to thermal dysregulation Brody \([1994](https://arxiv.org/html/2605.14004#bib.bib39)\). A representative case illustrates CAT’s dynamic sepsis risk estimates alongside conditional forecasts of maximum HR over the subsequent six hours (Fig. [5](https://arxiv.org/html/2605.14004#S4.F5)C). Risk is assessed at each token, demonstrating an implicitly learned sensitivity to established clinical criteria. Extracting the hour with the largest increase in $P(\mathrm{sepsis})$ reveals the token-level contributions driving this risk change (Fig. [5](https://arxiv.org/html/2605.14004#S4.F5)D). Large MAP-associated risk increases following low DBP may reflect discordance between MAP, SBP, and DBP measurements in the data set (Fig. [A.4](https://arxiv.org/html/2605.14004#A1.F4)).

Figure 5: Predicting sepsis and maximum heart rate (HR) per token. **A** Receiver operating characteristic (ROC) and precision–recall (PR) curves on the PhysioNet 2019 sepsis task for CAT, Director, and GPT MC simulation ($n=64$). **B** Mean change in sepsis probability, $\Delta P(\mathrm{sepsis})$, as a function of the counterfactual vitals bin (mean $\pm$ SEM) for the first temperature measurement in sepsis-positive sequences in the validation data set, stratified by age (24–59 yrs vs 71–87 yrs). **C** Example validation case showing predicted sepsis probability (black) and predicted maximum HR in the next 6h with the maximum HR plotted with a dashed line (light blue). The red dashed line marks the 6-hours-to-sepsis time point. Both reported sepsis risk and predicted max HR are plotted as rolling averages with a window of 5. **D** Per-token attribution for the largest 1-hour increase in sepsis risk, marked in red in **C**.
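The smoothing and the search for the largest 1-hour rise in predicted risk described above can be sketched as follows. This is a minimal illustration under stated assumptions: the rolling average is taken to be trailing (the text does not say whether the window is trailing or centered), the risk series is made up, and the function names are hypothetical.

```python
def rolling_mean(values, window=5):
    """Trailing rolling average; early positions average the values seen so far."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def largest_hourly_increase(hourly_risk):
    """Return (hour index, increase) for the biggest rise between consecutive hours."""
    diffs = [b - a for a, b in zip(hourly_risk, hourly_risk[1:])]
    best = max(range(len(diffs)), key=diffs.__getitem__)
    return best + 1, diffs[best]


# Made-up hourly sepsis-risk estimates for one ICU stay.
hourly_risk = [0.05, 0.06, 0.06, 0.20, 0.22]
hour, increase = largest_hourly_increase(hourly_risk)
smoothed = rolling_mean([1, 2, 3, 4, 5, 6], window=5)
```

Once the hour is located, the token-level estimates within that hour can be inspected to see which measurements coincide with the rise.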
## 5Discussion
We present Conditional Attribute Transformers (CAT), a novel framework to jointly model next-token probabilities and sequence-level attributes for credit assignment, counterfactual estimation, and guided decoding. We find that with large enough models, CAT improves next-token prediction, suggesting that forcing the model to learn global structure can complement rather than compromise next-token prediction. This mirrors results from the multi-token prediction training objective utilized by DeepSeek Liu et al. \([2024a](https://arxiv.org/html/2605.14004#bib.bib32)\), where forecasting future states acts as an auxiliary training objective, regularizing representations and encouraging broader structural planning. In addition, CAT’s capabilities can be extended to pretrained models through fine-tuning only the conditional attribute head.
Across a range of settings and tasks, CAT consistently outperformed baseline models\. In the Key\-to\-Door task, CAT learns long\-term credit assignment from sparse terminal rewards and achieves near\-perfect performance, demonstrating its ability to identify actions that determine delayed outcomes\. In language modeling, CAT effectively captures shifts in review sentiment, enables efficient partial\-sequence rating prediction, and supports attribute\-guided generation\. In biomedical informatics, CAT estimates elevated sepsis risk well before diagnosis is recorded, supports identification and examination of the drivers of major inflection points in predicted risk, and characterizes subgroup\-specific changes in risk associated with clinical variables\. While CAT offers profound societal benefits, particularly by advancing explainability through counterfactual analysis, it also carries inherent risks, as its steering capabilities could be exploited to introduce harmful or malicious biases\.
Limitations: This method is currently limited to generating conditional attribute probabilities over discrete action choices, but there may be settings where a continuous attribute is desired. For steering problems, current counterfactual estimates represent single-step policy improvement and yield a greedy optimum under an average behavior policy rather than a global optimum under a specific policy. This limitation is shared by prior approaches such as Director and DExperts.
Future Work: Future work includes scaling this method to larger data sets and extending it to continuous action spaces. We are currently developing approaches to enable globally optimal choice selection. Furthermore, this framework could be naturally extended to any task that requires estimation or control of sequence-level properties, including the biological domain (*de novo* protein design, predicting small-molecule binding functionality, predicting sequence-to-function regulatory mechanisms from DNA, etc.).
Ultimately, by jointly modeling local token generation and global sequence level objectives, CAT provides a highly adaptable foundation for complex predictive and generative tasks\.
## 6Acknowledgements
This work was supported by the United States National Library of Medicine \(grant T15 LM007056 to ES and GM\) and National Institutes of Health \(grants 5UL1TR001863 to DM and R00HG013661 to QL\)\. AJL receives funding through UL1 TR001863, The Hartwell Foundation, and the ARIA foundation\.
## References
- \[1\] (2022) Director: generator-classifiers for supervised language modeling. arXiv preprint arXiv:2206.07694. Cited by: [§A.2](https://arxiv.org/html/2605.14004#A1.SS2.SSS0.Px2.p1.2), [§2](https://arxiv.org/html/2605.14004#S2.p3.1), [§4.2.1](https://arxiv.org/html/2605.14004#S4.SS2.SSS1.p1.2).
- \[2\] M. G. Bellemare, W. Dabney, and R. Munos (2017) A distributional perspective on reinforcement learning. arXiv:1707.06887, [Link](https://arxiv.org/abs/1707.06887). Cited by: [§3.4](https://arxiv.org/html/2605.14004#S3.SS4.SSS0.Px1.p1.9).
- \[3\] G. Brixi, M. G. Durrant, J. Ku, M. Poli, G. Brockman, D. Chang, G. A. Gonzalez, S. H. King, D. B. Li, A. T. Merchant, et al. (2025) Genome modeling and design across all domains of life with evo 2. bioRxiv, pp. 2025–02. Cited by: [§1](https://arxiv.org/html/2605.14004#S1.p1.1).
- \[4\] G. M. Brody (1994-02) Hyperthermia and hypothermia in the elderly. Clinics in Geriatric Medicine 10(1), pp. 213–229. [Link](https://pubmed.ncbi.nlm.nih.gov/8168025/). Cited by: [§4.3](https://arxiv.org/html/2605.14004#S4.SS3.p2.1).
- \[5\] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in Neural Information Processing Systems 33, pp. 1877–1901. Cited by: [§1](https://arxiv.org/html/2605.14004#S1.p1.1).
- \[6\] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, W. Ye, Y. Zhang, Y. Chang, P. S. Yu, Q. Yang, and X. Xie (2023) A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109. [Link](https://arxiv.org/abs/2307.03109). Cited by: [§1](https://arxiv.org/html/2605.14004#S1.p1.1).
- \[7\] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch (2021) Decision transformer: reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems 34, pp. 15084–15097. Cited by: [§2](https://arxiv.org/html/2605.14004#S2.p2.1), [item 1](https://arxiv.org/html/2605.14004#S3.I1.i1.p1.1), [§3.6](https://arxiv.org/html/2605.14004#S3.SS6.p1.1), [§4.1](https://arxiv.org/html/2605.14004#S4.SS1.p1.1).
- \[8\] T. Chen, T. He, M. Benesty, V. Khotilovich, Y. Tang, H. Cho, K. Chen, R. Mitchell, I. Cano, T. Zhou, M. Li, J. Xie, M. Lin, Y. Geng, Y. Li, J. Yuan, and D. Cortes (2026) Xgboost: extreme gradient boosting. R package version 3.3.0.0. [Link](https://github.com/dmlc/xgboost). Cited by: [§A.2](https://arxiv.org/html/2605.14004#A1.SS2.SSS0.Px3.p3.5).
- \[9\] S. Dathathri, A. Madotto, J. Lan, J. Hung, E. Frank, P. Molino, J. Yosinski, and R. Liu (2019) Plug and play language models: a simple approach to controlled text generation. arXiv preprint arXiv:1912.02164. Cited by: [§2](https://arxiv.org/html/2605.14004#S2.p3.1).
- \[10\] N. Ferruz, S. Schmidt, and B. Höcker (2022) ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications 13(1), pp. 4348. Cited by: [§1](https://arxiv.org/html/2605.14004#S1.p1.1).
- \[11\] F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve (2024) Better & faster large language models via multi-token prediction. arXiv:2404.19737, [Link](https://arxiv.org/abs/2404.19737). Cited by: [§1](https://arxiv.org/html/2605.14004#S1.p1.1).
- \[12\] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre (2022) Training compute-optimal large language models. arXiv preprint arXiv:2203.15556. [Link](https://arxiv.org/abs/2203.15556). Cited by: [§1](https://arxiv.org/html/2605.14004#S1.p1.1).
- \[13\] Y. Hou, J. Li, Z. He, A. Yan, X. Chen, and J. McAuley (2024) Bridging language and items for retrieval and recommendation. arXiv preprint arXiv:2403.03952. Cited by: [§A.2](https://arxiv.org/html/2605.14004#A1.SS2.SSS0.Px1.p1.1), [item 2](https://arxiv.org/html/2605.14004#S3.I1.i2.p1.1), [§3.6](https://arxiv.org/html/2605.14004#S3.SS6.p1.1).
- \[14\] A. Karpathy (2023) NanoGPT: a minimalistic and educational gpt training code. GitHub repository, [https://github.com/karpathy/nanoGPT](https://github.com/karpathy/nanoGPT). Cited by: [§3.5](https://arxiv.org/html/2605.14004#S3.SS5.p1.4).
- \[15\] N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher (2019) Ctrl: a conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858. Cited by: [§1](https://arxiv.org/html/2605.14004#S1.p1.1), [§2](https://arxiv.org/html/2605.14004#S2.p2.1).
- \[16\] B. Krause, A. D. Gotmare, B. McCann, N. S. Keskar, S. Joty, R. Socher, and N. F. Rajani (2020) GeDi: generative discriminator guided sequence generation. arXiv preprint arXiv:2009.06367. Cited by: [§2](https://arxiv.org/html/2605.14004#S2.p3.1).
- \[17\] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. (2024) Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437. Cited by: [§3.5](https://arxiv.org/html/2605.14004#S3.SS5.p1.4), [§5](https://arxiv.org/html/2605.14004#S5.p1.1).
- \[18\] A. Liu, M. Sap, X. Lu, S. Swayamdipta, C. Bhagavatula, N. A. Smith, and Y. Choi (2021) DExperts: decoding-time controlled text generation with experts and anti-experts. arXiv preprint arXiv:2105.03023. Cited by: [§A.2](https://arxiv.org/html/2605.14004#A1.SS2.SSS0.Px3.p3.5), [§2](https://arxiv.org/html/2605.14004#S2.p3.1).
- \[19\] Q. Liu, Z. Chen, and W. H. Wong (2024) An encoding generative modeling approach to dimension reduction and covariate adjustment in causal inference with observational studies. Proceedings of the National Academy of Sciences 121(23), pp. e2322376121. Cited by: [§3.4](https://arxiv.org/html/2605.14004#S3.SS4.SSS0.Px2.p1.5).
- \[20\] Q. Liu and W. H. Wong (2026) A bayesian generative modeling approach for arbitrary conditional inference. arXiv preprint arXiv:2601.05355. Cited by: [§3.4](https://arxiv.org/html/2605.14004#S3.SS4.SSS0.Px2.p1.5).
- \[21\] Q. Liu and W. H. Wong (2026) An ai-powered bayesian generative modeling approach for causal inference in observational studies. Journal of the American Statistical Association (just-accepted), pp. 1–20. Cited by: [§3.4](https://arxiv.org/html/2605.14004#S3.SS4.SSS0.Px2.p1.5).
- \[22\] X. Lu, S. Welleck, J. Hessel, L. Jiang, L. Qin, P. West, P. Ammanabrolu, and Y. Choi (2022) Quark: controllable text generation with reinforced unlearning. Advances in Neural Information Processing Systems 35, pp. 27591–27609. Cited by: [§2](https://arxiv.org/html/2605.14004#S2.p2.1).
- \[23\] V. Melnychuk, D. Frauen, and S. Feuerriegel (2022) Causal transformer for estimating counterfactual outcomes. In International Conference on Machine Learning, pp. 15293–15329. Cited by: [§3.4](https://arxiv.org/html/2605.14004#S3.SS4.SSS0.Px2.p1.5).
- \[24\] T. Mesnard, T. Weber, F. Viola, S. Thakoor, A. Saade, A. Harutyunyan, W. Dabney, T. Stepleton, N. Heess, A. Guez, É. Moulines, M. Hutter, L. Buesing, and R. Munos (2021) Counterfactual credit assignment in model-free reinforcement learning. arXiv:2011.09464, [Link](https://arxiv.org/abs/2011.09464). Cited by: [item 1](https://arxiv.org/html/2605.14004#S3.I1.i1.p1.1).
- \[25\] W. Qi, Y. Yan, Y. Gong, D. Liu, N. Duan, J. Chen, R. Zhang, and M. Zhou (2020) ProphetNet: predicting future n-gram for sequence-to-sequence pre-training. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2401–2410. Cited by: [§1](https://arxiv.org/html/2605.14004#S1.p1.1).
- \[26\] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: [§A.2](https://arxiv.org/html/2605.14004#A1.SS2.SSS0.Px3.p3.5).
- \[27\] S. M. Rein, J. Li, M. Hernan, and A. Beam (2024) Deep learning methods for the noniterative conditional expectation g-formula for causal inference from complex observational data. arXiv preprint arXiv:2410.21531. Cited by: [§3.4](https://arxiv.org/html/2605.14004#S3.SS4.SSS0.Px2.p1.5).
- \[28\] P. Renc, Y. Jia, A. E. Samir, J. Was, Q. Li, D. W. Bates, and A. Sitek (2024) Zero shot health trajectory prediction using transformer. NPJ Digital Medicine 7(1), pp. 256. Cited by: [§1](https://arxiv.org/html/2605.14004#S1.p1.1).
- \[29\] M. Reyna, C. Josef, R. Jeter, S. Shashikumar, B. Moody, M. B. Westover, A. Sharma, S. Nemati, and G. D. Clifford (2019) Early prediction of sepsis from clinical data: the PhysioNet/Computing in Cardiology Challenge 2019. PhysioNet. RRID:SCR\_007345. [Document](https://dx.doi.org/10.13026/v64v-d857), [Link](https://doi.org/10.13026/v64v-d857). Cited by: [§A.3](https://arxiv.org/html/2605.14004#A1.SS3.SSS0.Px1.p1.1), [item 3](https://arxiv.org/html/2605.14004#S3.I1.i3.p1.1), [§3.6](https://arxiv.org/html/2605.14004#S3.SS6.p1.1).
- \[30\] C. Shi, D. M. Blei, and V. Veitch (2019) Adapting neural networks for the estimation of treatment effects. arXiv preprint arXiv:1906.02120. Cited by: [§3.4](https://arxiv.org/html/2605.14004#S3.SS4.SSS0.Px2.p1.5).
- \[31\] A. Shmatko, A. W. Jung, K. Gaurav, S. Brunak, L. H. Mortensen, E. Birney, T. Fitzgerald, and M. Gerstung (2025) Learning the natural history of human disease with generative transformers. Nature 647(8088), pp. 248–256. Cited by: [§1](https://arxiv.org/html/2605.14004#S1.p1.1).
- \[32\] H. A. Simon (1955) A behavioral model of rational choice. Quarterly Journal of Economics 69(1), pp. 99–118. Cited by: [§A.2](https://arxiv.org/html/2605.14004#A1.SS2.SSS0.Px3.p3.5).
- \[33\] C. Snell, I. Kostrikov, Y. Su, M. Yang, and S. Levine (2022) Offline rl for natural language generation with implicit language q learning. arXiv preprint arXiv:2206.11871. Cited by: [§2](https://arxiv.org/html/2605.14004#S2.p3.1), [§3.4](https://arxiv.org/html/2605.14004#S3.SS4.SSS0.Px1.p1.11).
- \[34\] S. Waxler, P. Blazek, D. White, D. Sneider, K. Chung, M. Nagarathnam, P. Williams, H. Voeller, K. Wong, M. Swanhorst, et al. (2025) Generative medical event models improve with scale. arXiv preprint arXiv:2508.12104. Cited by: [§1](https://arxiv.org/html/2605.14004#S1.p1.1).
- \[35\] G. Y. Weng, B. Wang, and G. V. d. Broeck (2025) TRACE back from the future: a probabilistic reasoning approach to controllable language generation. arXiv preprint arXiv:2504.18535. Cited by: [§2](https://arxiv.org/html/2605.14004#S2.p3.1).
- \[36\] H. Xiong, F. Wu, L. Deng, M. Su, Z. Shahn, and L. H. Lehman (2024) G-transformer: counterfactual outcome prediction under dynamic and time-varying treatment regimes. Proceedings of Machine Learning Research 252. Cited by: [§3.4](https://arxiv.org/html/2605.14004#S3.SS4.SSS0.Px2.p1.5).
- \[37\] K. Yang and D. Klein (2021) FUDGE: controlled text generation with future discriminators. arXiv preprint arXiv:2104.05218. Cited by: [§2](https://arxiv.org/html/2605.14004#S2.p3.1).
## Appendix AAppendix
All experiments were run on a computing cluster with a combination of NVIDIA H100, H200, and RTX5000 GPUs for approximately 3,000 GPU hours. Training and inference runtime varies with model architecture, model size, input sequence length, and hardware configuration (Tables [A.1](https://arxiv.org/html/2605.14004#A1.T1), [A.2](https://arxiv.org/html/2605.14004#A1.T2), and [A.4](https://arxiv.org/html/2605.14004#A1.T4)).
### A\.1Key\-to\-Door: Long\-Term Credit Assignment
##### Model and Training Details:
Model specifications and hyperparameters are outlined in Table [A.1](https://arxiv.org/html/2605.14004#A1.T1). CAT and baseline models were trained on 10,000 random walk trajectories and evaluated on 1,000 random starts. Baseline models included random policy, behavioral cloning, percentile behavioral cloning (trained on winning trajectories only), conservative Q-learning, and decision transformers.

Table A.1: Model configurations for the Key-to-Door task.

| Model | Variant | Params | Layers | Dim | Heads | MLP Dim | Context | LR |
|---|---|---|---|---|---|---|---|---|
| Random Policy | Base | 3M | 8 | 128 | 8 | 512 | 114 | 3e-3 |
| Behavioral Cloning | Base | 3M | 8 | 128 | 8 | 512 | 114 | 3e-3 |
| Percentile Behavioral Cloning | Base | 3M | 8 | 128 | 8 | 512 | 114 | 3e-3 |
| Conservative Q-Learning | Base | 3M | 8 | 128 | 8 | 512 | 114 | 3e-3 |
| Decision Transformers | Base | 3M | 8 | 128 | 8 | 512 | 114 | 3e-3 |
| CAT | Base | 3M | 8 | 128 | 8 | 512 | 114 | 3e-3 |
### A\.2Language Modeling: Amazon Reviews
##### Data Set:
The Amazon Reviews data set (CC0: Public Domain) is a large-scale corpus of 574 million product reviews, each consisting of a product category, title, review text, and a 1–5 star rating \[[13](https://arxiv.org/html/2605.14004#bib.bib31)\]. Reviews were tokenized using a 32k BPE vocabulary and formatted as <\|sos\|\><\|category\|\><\|sotitle\|\>title text<\|sotext\|\>review text<\|sor\|\><\|\*\_R\|\>, where R denotes the rating. The data set was split 90:10 into training and validation sets. The distribution of ratings for 1–5 stars in the training set was 10.3%, 4.9%, 7.1%, 12.7%, and 65.0%, respectively, with a similar distribution observed in the validation set.
##### Model and Training Details:
We trained models ranging from 7 million to 1 billion parameters (Table [A.2](https://arxiv.org/html/2605.14004#A1.T2)). For each model size, one standard decoder-only baseline and seven CAT variants with different values of $\lambda$ in Eq. [14](https://arxiv.org/html/2605.14004#S3.E14) were trained. CAT was also extended to a pre-trained standard model by fine-tuning only the attribute head. For DExperts, a 1-billion parameter standard model was used as the base model and a $\lambda=0.15$ was used, as it outperformed the previously reported optimum of 0.2 on this dataset \[[1](https://arxiv.org/html/2605.14004#bib.bib22)\].

Table A.2: Model configurations for the language modeling task.

| Model | Variant | Params | Layers | Dim | Heads | MLP Dim | Context | LR |
|---|---|---|---|---|---|---|---|---|
| GPT | Base | 7M | 2 | 96 | 2 | 384 | 512 | 3e-4 |
| GPT | Base | 72M | 12 | 512 | 8 | 2048 | 512 | 3e-4 |
| GPT | Base | 270M | 16 | 1024 | 16 | 4096 | 512 | 3e-4 |
| GPT | Base | 1B | 32 | 1536 | 32 | 6144 | 512 | 2.5e-4 |
| CTRL | Base | 1B | 32 | 1536 | 32 | 6144 | 512 | 2.5e-4 |
| DExperts | 1 star | 1B | 32 | 1536 | 32 | 6144 | 512 | 2.5e-4 |
| DExperts | 5 star | 1B | 32 | 1536 | 32 | 6144 | 512 | 2.5e-4 |
| Director | Binary (1 star) | 1B | 32 | 1536 | 32 | 6144 | 512 | 2.5e-4 |
| Director | Binary (5 star) | 1B | 32 | 1536 | 32 | 6144 | 512 | 2.5e-4 |
| Director\* | Base | 1B | 32 | 1536 | 32 | 6144 | 512 | 2.5e-4 |
| CAT | Base | 7M | 2 | 96 | 2 | 384 | 512 | 3e-4 |
| CAT | Base | 72M | 12 | 512 | 8 | 2048 | 512 | 3e-4 |
| CAT | Base | 270M | 16 | 1024 | 16 | 4096 | 512 | 3e-4 |
| CAT | Base | 1B | 32 | 1536 | 32 | 6144 | 512 | 2.5e-4 |
| CAT | Fine-tuned | 1B | 32 | 1536 | 32 | 6144 | 512 | 5e-2 |
| CAT | Attribute-only | 1B | 32 | 1536 | 32 | 6144 | 512 | 5e-2 |
##### Evaluation Details:
Figure A.1: Token-level attribute estimates. CAT serves as a token-level critic of reviews, estimating attributes using the conditional attribute probabilities for the true next token. The model also estimates counterfactual probabilities that can be used to maximize the probability of any star review (1 star and 5 star shown).

Figure A.2: Token-level attribute estimates for adjective substitution. Token-level attribute estimates from CAT demonstrating its ability to capture sentiment and the contextual effect of negation across the adjective substitutions *amazing*, *good*, *bad*, and *horrible*.

Next-token validation perplexity was evaluated over 2,000 iterations for the standard and CAT models across model sizes.

Counterfactual adjective substitution was evaluated on 1,000,000 validation reviews in which the true next token was *good*. Results are reported for all reviews and for the subset of 1,593 reviews with a preceding negation term or phrase: *not*, *no*, *never*, *not really*, *isn’t*, *ain’t*, *wasn’t*, *weren’t*. Fig. [A.2](https://arxiv.org/html/2605.14004#A1.F2) visualizes the conditional attribute estimates at each token position. *This product is amazing*, *This product is good*, *This product is bad*, and *This product is horrible* produce similar probability distributions up to the adjective token, after which the predicted rating probabilities diverge according to adjective sentiment. Positive adjectives increase the predicted probability of 5 star ratings, whereas negative adjectives increase the predicted probability of 1 star ratings. Stronger adjectives (*amazing*, *horrible*) produce larger shifts than weaker adjectives (*good*, *bad*). A similar pattern is observed under negation. The reviews produce similar distributions up to *not*, after which the predicted probability of a 5 star rating decreases and that of a 1 star rating increases, indicating that the model has learned to associate negation with more negative sentiment and lower ratings. Adjectives following negation produce effects consistent with those of negation shown in Table [2](https://arxiv.org/html/2605.14004#S4.T2). *This product is not good* resembles *This product is bad* with an increased predicted probability of a 1 star rating, while *not bad*, *not amazing*, and *not horrible* are dominated by a 3 star rating, consistent with indicating *okay*.

For the steering experiment, validation reviews with true 3 star ratings were split after the first 50% of the text following the <\|sotext\|\> token. The resulting first-half prompts were scored using an XGBoost rating classifier trained on 10,000,000 validation reviews with CountVectorizer features (Fig. [A.3](https://arxiv.org/html/2605.14004#A1.F3)) \[[8](https://arxiv.org/html/2605.14004#bib.bib40)\]. Prompts classified as 3 stars were retained, from which 1,000 were randomly selected. Steering was performed using a satisficing criterion with top-$k$ decoding for CAT, Director, and Director\* models. At each step, decoding was restricted to tokens with a next-token probability above $\epsilon$ and a conditional attribute probability above the specified attribute threshold \[[32](https://arxiv.org/html/2605.14004#bib.bib41)\]. DExperts scales the difference between expert and anti-expert logits with a control parameter $\alpha$; following prior work, we use $\alpha=0.2$ for steering \[[18](https://arxiv.org/html/2605.14004#bib.bib21)\]. Hyperparameters for all model types are listed in Table [A.3](https://arxiv.org/html/2605.14004#A1.T3). Each method generated 10 completions per prompt, and the resulting full text, consisting of the prompt and completion, was re-scored by XGBoost. Performance is reported as the proportion of full texts classified as either 1 or 5 stars. Fluency of the completion was measured by mean perplexity using a Hugging Face GPT-2 XL model and tokenizer \[[26](https://arxiv.org/html/2605.14004#bib.bib18)\], and diversity of the completion was measured using normalized unique $n$-gram counts. Evaluation metrics closely follow those used for the toxicity task in \[[18](https://arxiv.org/html/2605.14004#bib.bib21)\].
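As a concrete reading of the diversity metric, the sketch below computes a Dist-$n$ score as the number of unique $n$-grams divided by the total number of $n$-grams in a completion. The normalization by total $n$-gram count is an assumption made here for illustration; other normalizations, such as by text length, are also used in practice, and the function name and example sentence are hypothetical.

```python
def dist_n(tokens, n):
    """Unique n-grams divided by total n-grams (0.0 when the text is too short)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Made-up completion; whitespace tokenization is used only for illustration.
tokens = "the box was fine but the box was small".split()

for n in (1, 2, 3):
    print(f"Dist-{n}: {dist_n(tokens, n):.3f}")
```

Repeated phrases lower the score, so a value near 1 indicates that almost every $n$-gram in the completion is distinct.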
Table A.3: Steering hyperparameters.

| Method | $k$ | $\alpha$ | $\epsilon$ | Attribute Threshold | Sampling Method |
|---|---|---|---|---|---|
| CTRL | 20 | – | – | – | Top-$k$ |
| DExperts | 20 | 0.2 | – | – | Top-$k$ |
| Director | 20 | – | 0.001 | 0.8 | Top-$k$ |
| Director\* | 20 | – | 0.001 | 0.8 | Top-$k$ |
| CAT | 20 | – | 0.001 | 0.8 | Top-$k$ |

Figure A.3: AUC-ROC curve for XGBoost critic model across all five ratings.
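The satisficing step used for CAT, Director, and Director\* can be sketched as a filter applied before sampling. The version below is a minimal illustration, not the authors' implementation: it assumes the top-$k$ cut is applied first, that surviving probabilities are renormalized, and that an empty candidate set is left for the caller to handle. The function names and the toy probabilities are hypothetical; the default values of $k$, $\epsilon$, and the attribute threshold follow Table A.3.

```python
import random


def satisficing_candidates(next_probs, attr_probs, k=20, eps=0.001, threshold=0.8):
    """Keep top-k tokens whose next-token and attribute probabilities both pass.

    next_probs: token -> next-token probability.
    attr_probs: token -> probability of the target attribute given that token.
    Returns token -> renormalized next-token probability (possibly empty).
    """
    top_k = sorted(next_probs, key=next_probs.get, reverse=True)[:k]
    kept = [t for t in top_k if next_probs[t] > eps and attr_probs[t] > threshold]
    total = sum(next_probs[t] for t in kept)
    return {t: next_probs[t] / total for t in kept}


def sample_token(candidates, rng=random):
    """Sample one token from the filtered distribution, or None if it is empty."""
    if not candidates:
        return None  # fallback behavior is an assumption left to the caller
    tokens = list(candidates)
    weights = [candidates[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up distributions over a four-token vocabulary.
next_probs = {"great": 0.5, "bad": 0.3, "ok": 0.1995, "zzz": 0.0005}
attr_probs = {"great": 0.95, "bad": 0.05, "ok": 0.85, "zzz": 0.99}

candidates = satisficing_candidates(next_probs, attr_probs)
token = sample_token(candidates)
```

Here *bad* is dropped by the attribute threshold and *zzz* by the $\epsilon$ floor, so sampling is restricted to tokens that are both plausible continuations and likely to yield the target attribute.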
### A\.3PhysioNet Sepsis
##### Data Set:
The 2019 PhysioNet Challenge \(CC\-BY\-4\.0\) provides a data set from 40,336 patients collected from the Intensive Care Units \(ICUs\) of three hospital systems as a benchmark for early sepsis identification\[[29](https://arxiv.org/html/2605.14004#bib.bib30)\]\. This data set consists of patient demographic information and hourly measurements of vital signs and laboratory values\. Each ICU course is annotated with whether sepsis occurred and its time of onset\. Records were converted into sequences of discrete tokens\. Patient demographics were included at the beginning of each sequence\. Categorical measurements received a unique discrete token and each numeric measurement received binned value tokens\. Values were discretized into deciles based on per\-class distribution of values in the training set\. A time token was inserted between each set of hourly measurements\. Data was split 90:10 into training and validation sets, with a sepsis prevalence of 7\.3% in training and 7\.4% in validation\. Model hyperparameters are listed in Table[A\.4](https://arxiv.org/html/2605.14004#A1.T4)\.
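A minimal sketch of the decile tokenization described above, under stated assumptions: the cut points are taken as simple order statistics of the training values for one measurement type, and the token naming scheme (`HR_bin3`, `<hour>`) is hypothetical, since the actual vocabulary is not given here.

```python
from bisect import bisect_right


def decile_edges(train_values):
    """Nine cut points that split the sorted training values into ten bins."""
    ordered = sorted(train_values)
    n = len(ordered)
    return [ordered[(n * q) // 10] for q in range(1, 10)]


def to_token(name, value, edges):
    """Map a numeric measurement to a binned token such as 'HR_bin3'."""
    return f"{name}_bin{bisect_right(edges, value)}"


def encode_hour(measurements, edges_by_name, time_token="<hour>"):
    """Tokenize one hour of (name, value) pairs and append a time token."""
    tokens = [to_token(name, value, edges_by_name[name]) for name, value in measurements]
    return tokens + [time_token]


# Made-up training values; real edges would come from the training split.
edges = decile_edges(list(range(100)))
hour_tokens = encode_hour([("HR", 95), ("Temp", 5)], {"HR": edges, "Temp": edges})
```

Fitting the edges on the training split only, as the text describes, keeps the validation data from influencing the vocabulary.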
##### Model and Training Details:
The optimal $\lambda$ used to balance the contributions of the next-token loss and the target attribute loss differed between the binary attribute prediction and regression targets in the sepsis data set. We selected a $\lambda$ of 0.01 for binary sepsis risk prediction and 0.5 for predicting the regression target of the max heart rate in the next 6 hours. The large difference in these values highlights how predicting the next token correctly is essential for sepsis risk prediction, whereas a larger contribution of target regression loss is necessary to learn the more volatile value of heart rate.

Table A.4: Model configurations for the sepsis prediction task.

| Model | Variant | Params | Layers | Dim | Heads | MLP Dim | Context | LR |
|---|---|---|---|---|---|---|---|---|
| GPT | Base | 29M | 8 | 512 | 8 | 2048 | 1024 | 5e-4 |
| Director | Base | 29M | 8 | 512 | 8 | 2048 | 1024 | 5e-4 |
| CAT | Binary | 29M | 8 | 512 | 8 | 2048 | 1024 | 5e-4 |
| CAT | Regression | 29M | 8 | 512 | 8 | 2048 | 1024 | 5e-5 |

Figure A.4: Blood pressure readings and correlation to reported MAP. **A** Scatter plot comparing computed MAP to reported MAP. **B** UpSet plot highlighting overlap of hours where SBP, DBP, and MAP are reported.