Goal-Conditioned Supervised Learning for LLM Fine-Tuning
Summary
This paper proposes goal-conditioned supervised learning (GCSL) as an offline fine-tuning framework for LLMs, which treats feedback as an explicit goal and trains models via supervised learning with a novel goal formulation and natural-language goal representations. Evaluated on non-toxic generation, code generation, and recommendation, it outperforms standard offline baselines.
Cached at: 05/19/26, 06:41 AM
# Goal-Conditioned Supervised Learning for LLM Fine-Tuning
Source: [https://arxiv.org/html/2605.16345](https://arxiv.org/html/2605.16345)
Shijun Li¹, Kaiwen Dong², Xiang Gao², Joydeep Ghosh⁴

¹,⁴ The University of Texas at Austin, ² Intuit AI Research

shijunli@utexas.edu
###### Abstract
Large language models often require fine\-tuning to better align their behavior with user intent at deployment\. Existing approaches are commonly divided into online and offline paradigms\. Online methods, such as RL\-based alignment, can directly optimize outcome quality but typically rely on external reward models and iterative rollouts, making them costly and difficult to deploy in many cases\. Offline methods are more efficient, but prevailing approaches such as supervised fine\-tuning \(SFT\) and direct preference optimization \(DPO\) remain limited: SFT typically collapses graded feedback into binary supervision, while DPO depends on paired preference data that is often unavailable or expensive to construct\.
In this paper, we propose goal\-conditioned supervised learning \(GCSL\) as an offline fine\-tuning framework for LLMs\. Our core idea is to treat feedback signals directly as an explicit goal and train the model, purely through supervised learning, to generate responses that achieve that goal\. To better exploit graded feedback, we further introduce a novel goal formulation that defines learning as consistently pursuing outcomes above a target quality threshold, rather than imitating samples from a selected high\-quality subset\. This design mitigates the bounded\-learning effect of SFT and classic GCSL by explicitly guiding the model to learn the directional progression of quality\. We also propose natural\-language goal representations to better leverage the semantic understanding and reasoning capabilities of LLMs\.
We evaluate our method on three tasks: non\-toxic generation, code generation, and LLM for recommendation\. Results show that our approach consistently outperforms standard offline fine\-tuning baselines while retaining the efficiency, scalability, and simple data requirements of supervised learning\.
## 1 Introduction
LLMs exhibit strong general capabilities, but their pretraining objective often doesn’t reliably produce behaviors users want at deployment\. Consequently, fine\-tuning has become the standard route to alignment\. Existing fine\-tuning methods can be broadly categorized into online and offline paradigms\. Online approaches, most notably RL\-based alignment such as PPO/GRPO, optimize model behavior through iterative sampling and updates driven by a reward signal\. While these methods can deliver strong performance by directly optimizing an outcome objective, they may come with substantial practical limitations: they typically rely on an external reward model, which is expensive to train and can be mis\-specified relative to real user needs, thereby introducing harmful noise and biases into optimization\[[37](https://arxiv.org/html/2605.16345#bib.bib115),[15](https://arxiv.org/html/2605.16345#bib.bib116)\]\. In addition, online rollouts and iterative updates are time\- and resource\-consuming, making such methods difficult to scale or deploy in many real\-world settings\[[35](https://arxiv.org/html/2605.16345#bib.bib102),[34](https://arxiv.org/html/2605.16345#bib.bib117)\]\.
Such constraints have motivated the widespread use of offline fine-tuning methods. Most current offline fine-tuning approaches can be broadly categorized as supervised fine-tuning (SFT) or direct preference optimization (DPO) (and closely related preference-based objectives) \[[35](https://arxiv.org/html/2605.16345#bib.bib102),[31](https://arxiv.org/html/2605.16345#bib.bib103)\]. DPO avoids online RL by learning from preference comparisons, but it requires training data in a paired format (preferred vs. not preferred outputs), which is not always available or easy to construct. In contrast, SFT can directly leverage raw sequential data collected in the wild, such as user dialogue records, interaction logs in recommender systems, or other behavioral traces, making it broadly applicable and simple to operationalize. However, this flexibility comes with an important limitation: in practice, SFT often reduces the available graded feedback or reward signals to a binary notion of correctness (typically via a handcrafted cutoff) \[[10](https://arxiv.org/html/2605.16345#bib.bib118)\], and then imitates the selected “positive” subset. This thresholding discards fine-grained information contained in graded feedback, makes performance sensitive to the cutoff, and treats all positive samples as equally good demonstrations, which can bias learning towards the average quality of the selected subset rather than explicitly encouraging consistent improvement towards higher-quality outcomes.
This paper investigates goal\-conditioned supervised learning \(GCSL\)\[[24](https://arxiv.org/html/2605.16345#bib.bib7),[29](https://arxiv.org/html/2605.16345#bib.bib34)\]as an offline fine\-tuning paradigm for LLMs\. Many real deployments naturally provide feedback signals such as numeric ratings or categorical judgments\. Instead of converting these signals into homogeneous demonstrations \(SFT\), pairwise comparisons \(DPO\), or rewards requiring a learned reward model and online RL \(PPO/GRPO\), GCSL suggests a direct reframing: treat feedback as an explicit goal and train the model, via supervised learning, to generate responses that achieve the goal\.
This brings three key advantages\. First, it enables direct use of feedback \(scores or categories\) without external reward models or paired samples\. Second, it realizes long\-horizon goal\-achieving optimization within a pure supervised framework, thereby retaining the efficiency and scalability of standard supervised training and avoiding online rollouts and reward\-model\-dependent iteration\. Third, and most importantly, our approach is designed to overcome a key limitation of SFT and classic GCSL: learning is implicitly bounded by the average quality of the selected training subset\. To address this, we introduce a novel goal\-achieving objective that defines the goal as consistently pursuing outcomes above a given quality threshold\. This formulation better exploits graded feedback, avoids collapsing supervision into undifferentiated positive samples or pairwise contrastive samples, and encourages the model to generalize goal\-seeking behavior across diverse targets in the data\. In other words, the new goal\-achieving objective seeks to drive directional progression of performance beyond the bounded learning constraint within certain selected subsets\.
The closest prior work in this direction is Quark\[[30](https://arxiv.org/html/2605.16345#bib.bib95)\], which incorporates goal\-conditioned ideas into LLM optimization but retains several drawbacks\. First, Quark still relies on an online procedure with an external reward model, reducing efficiency and inheriting reward\-model mis\-specification risks\. Second, its goal definition via the highest quantized score bin still collapses toward SFT\-like behavior, limiting improvement beyond the average quality within that bin\. Finally, its special\-token goal representation may underutilize the model’s native semantic understanding and extrapolation capability\. These issues motivate our approach: a purely offline supervised formulation that directly leverages feedback as goals, reformulates goals to better reflect the optimization target, and uses natural\-language goal representations to better exploit LLM’s semantic ability and world knowledge\.
Our key contributions can be summarized as follows:
- We reframe LLM fine-tuning as goal-conditioned supervised learning, enabling training from direct feedback signals while avoiding online reward-model-dependent training and paired preference data, and thereby retaining high training efficiency, scalability, and data generality.
- We introduce a novel goal-achieving objective to overcome a key limitation of SFT and classic GCSL, whose learning objective is bounded by the average quality of the selected training subset. By formulating learning as consistently pursuing outcomes above a target quality threshold, our approach better exploits graded feedback, mitigates bounded-learning effects, and explicitly drives directional progression of quality. We further propose natural-language goal representations, which better leverage the LLM’s inherent semantic understanding and extrapolation abilities.
- We evaluate our approach on non-toxic generation, code generation, and recommendation tasks to demonstrate its generality and effectiveness. Empirical results show that it consistently outperforms standard offline fine-tuning baselines while retaining high efficiency and simple data requirements.
## 2 Related Work
LLM Fine-Tuning. Fine-tuning is the standard approach for aligning pretrained LLMs with user intent and deployment constraints. Supervised fine-tuning (SFT) is widely used for its simplicity and stability, but with non-binary feedback (e.g., scalar ratings), it typically relies on handcrafted cutoffs to select “good” samples and then treats all retained positives equally. This makes performance sensitive to the cutoff and biases learning toward the average quality of the selected subset rather than consistently improving toward higher-quality outcomes. To move beyond imitation, many pipelines adopt RL-based fine-tuning (e.g., PPO/GRPO), which usually requires an external reward model and iterative online updates. Such reward models are costly to train and often mis-specified, introducing noise and bias into optimization, while online rollouts add substantial computational cost and complexity. Preference-based methods such as DPO avoid explicit RL, but they require paired preference data, which may be unavailable in many scenarios. Our work instead targets a more direct and efficient alternative: optimizing outcome quality using only supervised learning and readily available scalar or categorical feedback. See more discussion in Appendix [C.8](https://arxiv.org/html/2605.16345#A3.SS8).
Goal-Conditioned Supervised Learning. Goal-conditioned supervised learning (GCSL) offers an RL-like paradigm while remaining purely supervised. By conditioning a policy on an explicit goal and learning from supervised targets, GCSL can train goal-achieving behavior without value estimation or policy-gradient optimization. The definition of goals in GCSL can be very broad, such as specific states, cumulative rewards, or final outcomes for the trajectory to reach \[[24](https://arxiv.org/html/2605.16345#bib.bib7),[29](https://arxiv.org/html/2605.16345#bib.bib34),[5](https://arxiv.org/html/2605.16345#bib.bib8)\].
A closely related attempt on LLMs is Quark \[[30](https://arxiv.org/html/2605.16345#bib.bib95)\], which introduces goal-conditioned optimization but remains an online method requiring a reward model. More importantly, its goals are tied to specific quantized score bins, which can reduce goal-conditioning to imitation of these subsets and limit gains beyond the average quality. By contrast, our method reformulates goals as achieving outcomes above a target quality threshold, yielding an optimization objective that is better aligned with the inference target, within a purely offline supervised learning framework. We further express goals in natural language to better leverage the LLM’s semantic understanding and reasoning abilities. SteerLM \[[11](https://arxiv.org/html/2605.16345#bib.bib119)\] is similar to Quark and still requires a reward model for online updates. Some recent work also draws on GCSL ideas for LLM-related learning, but their formulations differ substantially from ours. For example, Nath et al. \[[33](https://arxiv.org/html/2605.16345#bib.bib92)\] use GCSL-inspired ideas to train a reward model conditioned on future goals, whereas PNLC \[[20](https://arxiv.org/html/2605.16345#bib.bib93)\] learns a Q-function to evaluate actions for reaching a target goal state. Since these methods are not designed to directly apply GCSL to LLM fine-tuning, we don’t include them in our comparisons.
## 3 Methodology
In this section, we first formulate LLM fine-tuning with offline feedback as a goal-conditioned sequence modeling problem. We then describe a classic offline adaptation of GCSL for LLM fine-tuning, analyze its limitations, and finally introduce our beyond-threshold formulation, GCSL-bey, together with its natural-language variant, GCSL-bey-NL.
### 3.1 Classic GCSL for LLM Fine-Tuning
#### Problem setting\.
We view an autoregressive LLM as a goal-conditioned policy. Given an input prompt and/or context $x$, the model generates a response sequence $y=(y_1,\ldots,y_T)$ token by token, where each next-token decision can be viewed as an action and the full response as a trajectory toward some target outcome. Our offline training set is

$$\mathcal{D}=\{(x_i,y_i,r_i)\}_{i=1}^{N},$$

where $r_i$ denotes the feedback signal associated with the complete response $y_i$. Depending on the application, $r_i$ can be a scalar score (e.g., non-toxicity level, code efficiency score) or an ordered categorical judgment (e.g., user’s categorical rating). Importantly, unlike online RL-based alignment methods, we assume that these feedback signals are already available in the logged data, either from direct user feedback or from pre-existing task evaluators. Therefore, no online rollout, reward-model fitting, or iterative re-scoring is required during the offline fine-tuning.
#### Offline reward quantization\.
Following the standard approach of classic GCSL research like Trajectory Transformer \[[23](https://arxiv.org/html/2605.16345#bib.bib94)\] and Quark \[[30](https://arxiv.org/html/2605.16345#bib.bib95)\], we first convert the feedback signals in the training data into a finite set of goal labels using equal-frequency binning. For scalar or ordered feedback, let

$$\tau_1<\tau_2<\cdots<\tau_K$$

be ordered bin boundaries and let $Q(r)\in\{1,\ldots,K\}$ be the corresponding quantizer. Each quantized level is represented by a special goal token $[R_k]$. For each example $(x_i,y_i,r_i)$, we assign

$$q_i=Q(r_i),\qquad g_i^{\mathrm{cls}}=[R_{q_i}],$$

and construct the quantized offline dataset

$$\widetilde{\mathcal{D}}_{\mathrm{cls}}=\{(x_i,y_i,g_i^{\mathrm{cls}})\}_{i=1}^{N}.$$

If the feedback is already categorical, the category itself can be used directly as the goal label. Notably, unlike Quark, which repeatedly explores, re-scores, and re-quantizes newly sampled outputs, we perform this quantization once on a fixed offline dataset before fine-tuning.
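The quantization step above can be illustrated with a minimal sketch. It assumes that $\tau_k$ is the lower edge of bin $k$ and that $Q(r)$ returns the highest level whose threshold $r$ reaches; the function names `equal_frequency_thresholds`, `quantize`, and `goal_token` are hypothetical and not taken from the paper.

```python
from bisect import bisect_right


def equal_frequency_thresholds(scores, num_bins):
    """Return lower bin edges tau_1..tau_K so that each bin holds about N/K scores.

    Assumption: tau_k is the smallest score falling in bin k. With many tied
    scores the edges may repeat, which this sketch does not handle.
    """
    ordered = sorted(scores)
    n = len(ordered)
    return [ordered[(k * n) // num_bins] for k in range(num_bins)]


def quantize(score, thresholds):
    """Q(r): 1-based index of the highest threshold that `score` reaches."""
    return max(1, bisect_right(thresholds, score))


def goal_token(level):
    """Symbolic goal token [R_k] for a quantized level."""
    return f"[R{level}]"


scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
thresholds = equal_frequency_thresholds(scores, num_bins=5)
labels = [goal_token(quantize(s, thresholds)) for s in scores]
```

Because the thresholds are computed once from the fixed offline scores, the same `thresholds` list can be reused unchanged for every training example.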

Figure 1: Workflow of applying classic GCSL for LLM fine-tuning.
#### Goal\-conditioned supervised fine\-tuning\.
Given the quantized dataset, we fine-tune the LLM with standard teacher forcing conditioned on the goal:

$$\mathcal{L}_{\mathrm{cls}}(\theta)=-\sum_{i=1}^{N}\sum_{t=1}^{T_i}\log p_\theta\!\left(y_{i,t}\mid x_i,g_i^{\mathrm{cls}},y_{i,<t}\right). \tag{1}$$

This is standard next-token prediction, except that the target response is conditioned not only on the input $x_i$ but also on the desired goal label. At inference time, given a new prompt $x$, we specify a goal token $g$ and decode from $p_\theta(\cdot\mid x,g)$. When the objective is to maximize response quality associated with the goal, the natural choice is the highest goal token $[R_K]$. Figure [1](https://arxiv.org/html/2605.16345#S3.F1) illustrates the workflow of classic GCSL when fine-tuning an LLM for non-toxic generation.
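As a rough illustration of Eq. (1), the sketch below builds a goal-conditioned input and sums the negative log-likelihood over target tokens. Placing the goal before the prompt is an assumption (the text does not fix the position), and `token_probs` stands in for probabilities that a real model would produce under teacher forcing; both helper names are hypothetical.

```python
import math


def build_conditioned_prompt(prompt, goal):
    """Attach the goal to the prompt.

    Assumption: the goal is prepended; the source does not specify its position.
    """
    return f"{goal} {prompt}"


def goal_conditioned_nll(token_probs):
    """Eq. (1): sum over examples i and positions t of -log p(y_{i,t} | x_i, g_i, y_{i,<t}).

    token_probs[i][t] is a placeholder for the model probability of the t-th
    target token of example i, given the prompt, the goal, and earlier tokens.
    """
    return -sum(math.log(p) for example in token_probs for p in example)


conditioned = build_conditioned_prompt("Continue the sentence politely.", "[R5]")
loss = goal_conditioned_nll([[0.5, 0.5], [0.25]])
```

Only the response tokens contribute to the sum; the prompt and goal act purely as conditioning context.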
#### Benefits of classic offline GCSL\.
Classic offline GCSL provides several practical advantages\. First, it is a pure offline supervised method and therefore avoids the computational overhead of online rollouts, iterative policy updates, and reward model requirement\. Second, it can directly use conventional sequential data with one\-sided feedback signals, avoiding the need for paired preference data as in DPO\. Third, compared with SFT, it makes finer\-grained use of the training set by distinguishing samples with different quality levels rather than collapsing them into a binary positive/negative split\. This helps the model learn general goal\-achieving capabilities from diverse goal\-reaching training patterns and adapt them for goal\-achieving at inference\.
Nevertheless, classic GCSL still exhibits two important limitations when applied to LLM fine\-tuning\.
### 3.2 Deficiencies of Classic GCSL
#### SFT-like performance bottleneck.
Classic GCSL defines goals as belonging to a quantized interval. In practice, when we want the best possible response, we condition on the highest goal token $[R_K]$ at inference time. However, the training examples associated with each goal token are simply the offline responses that fall inside the corresponding bin. As a result, the model is still encouraged to imitate the average behavior of the subset, rather than consistently pursuing possibly higher quality, which should instead be the learning objective that directly aligns with the inference target. Thus, although classic GCSL goes beyond the binary splitting of SFT, its learning can still be bounded by the average quality of the selected top subset.
#### Goal representation bottleneck\.
Classic GCSL represents goals with abstract reward tokens or score embeddings\. These can encode relative ordering, but they don’t explicitly tell what the goal means or what type of behavior is expected to achieve it\. For LLMs whose pretrained capabilities are tightly coupled with natural\-language understanding and instruction following, such symbolic goal representations can underutilize the model’s semantic knowledge and extrapolation ability\.
### 3.3 Going Beyond Classic GCSL
We address the above issues from two complementary directions: a new goal definition and a new goal representation\.

Figure 2: Workflow of GCSL-bey-NL for LLM fine-tuning.

#### Beyond-threshold goal definition.
Instead of defining a goal as *belonging to a quantized bin*, we define it as *exceeding a quality threshold*. Let $\tau_1<\tau_2<\cdots<\tau_K$ denote ordered thresholds, and let $q_i=Q(r_i)$ be the quantized level of sample $i$. Under this beyond-threshold formulation, a trajectory with level $q_i$ is a successful demonstration *not only for its own level, but for every threshold no higher than its outcome*. In other words, sample $i$ provides positive supervision for all goals

$$g_k^{\mathrm{bey}}:\ \ r\geq\tau_k,\quad\forall k\leq q_i.$$

This yields an expanded training set

$$\widetilde{\mathcal{D}}_{\mathrm{bey}}=\bigcup_{i=1}^{N}\left\{(x_i,y_i,g_k^{\mathrm{bey}})\;:\;1\leq k\leq q_i,\;k\in\mathcal{H}\right\}, \tag{2}$$

where $\mathcal{H}\subseteq\{1,\ldots,K\}$ denotes the set of thresholds retained for training.

We then optimize the same goal-conditioned next-token prediction objective:

$$\mathcal{L}_{\mathrm{bey}}(\theta)=-\sum_{(x_i,y_i,g)\in\widetilde{\mathcal{D}}_{\mathrm{bey}}}\sum_{t=1}^{T_i}\log p_\theta\!\left(y_{i,t}\mid x_i,g,y_{i,<t}\right). \tag{3}$$
This reformulation changes the learning signal qualitatively. The model no longer learns merely what responses inside a particular reward interval look like; instead, it learns what kinds of responses are sufficient to satisfy progressively stronger targets. A trajectory with outcome level $R_3$ is therefore not only an example of “being in bin $R_3$,” but also a valid demonstration for achieving goals such as “at least $\tau_1$,” “at least $\tau_2$,” and “at least $\tau_3$.” This induces an ordered supervision structure across quality levels, which teaches the model how stronger outcomes subsume given threshold goals. In this sense, the learning objective is no longer to imitate the average sample within a fixed bin; rather, it is to consistently pursue directional progression of quality. The expanded data construction provides the training samples that teach the LLM this progression, e.g., from average to better, and from better to even better. Consequently, when we condition on a high threshold at inference time, the model is encouraged to generate responses that are at least this good, rather than merely reproducing the average behavior of the highest observed bin. In other words, the model is expected to extrapolate the progression patterns learned from training data, such as from good to better, to the inference objective of achieving even better performance. See Appendix [C.7](https://arxiv.org/html/2605.16345#A3.SS7) for further discussion.
This is especially natural for LLMs, because pretrained LLMs already possess*substantial world knowledge, semantic abstraction ability, and instruction extrapolation capacity*\. Once this new target is phrased as an explicit goal in natural language, the model can extrapolate and learn how to generate stronger responses for given thresholds, and then extrapolate this knowledge for its inference\.
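The expanded training set of Eq. (2) can be sketched in a few lines. The sketch assumes each sample already carries its quantized level $q_i$, and represents a goal simply by its threshold index $k$; the function name `expand_beyond_threshold` and the toy data are hypothetical.

```python
def expand_beyond_threshold(dataset, retained_levels):
    """Build the beyond-threshold training set of Eq. (2).

    dataset: list of (prompt, response, level) tuples, where level is q_i.
    retained_levels: the set H of threshold indices kept for training.
    Returns (prompt, response, k) triples, one per satisfied retained threshold.
    """
    expanded = []
    for prompt, response, level in dataset:
        for k in range(1, level + 1):  # every threshold no higher than q_i
            if k in retained_levels:
                expanded.append((prompt, response, k))
    return expanded


toy = [("p1", "a", 3), ("p2", "b", 5), ("p3", "c", 1)]
expanded = expand_beyond_threshold(toy, retained_levels={3, 4, 5})
```

In this toy case the level-5 response appears three times (for goals 3, 4, and 5), the level-3 response once, and the level-1 response not at all, which mirrors how stronger outcomes serve as demonstrations for every weaker retained goal.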
#### Natural\-language goal representation\.
As discussed above, we further replace symbolic goal tokens with natural-language goal descriptions tailored to LLMs. Specifically, for each threshold $\tau_k$, we construct a short instruction that explicitly specifies both the target and the semantics of the corresponding metric. For example, if the task is fine-tuning LLMs for non-toxic generation, the goal can be written as: *“Generate a response with a non-toxicity score greater than 0.85. Non-toxicity scores range from 0 to 1, where larger values indicate less toxic content.”*
Natural\-language goals are much more interpretable than abstract reward tokens and allow the LLM to better exploit its pretrained instruction\-following, semantic understanding, world knowledge, and extrapolation capabilities\. This is particularly important under our beyond\-threshold goal definition\. In this setting, the model should not merely memorize reward buckets; rather, it should understand the meaning of an objective such as achieving performance*above*a threshold, and use that understanding to generalize beyond the exact training patterns\. In other words, once the goal is expressed as an explicit instruction, the LLM can more naturally extrapolate the ordered relationships among thresholds and thus transfer the progression patterns learned from training to the inference stage\.
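A natural-language goal of this kind can be produced from a simple template. The sketch below reproduces the non-toxicity example quoted above; the function name and its parameters (`metric`, `low`, `high`, `meaning`) are hypothetical, and the wording used for other tasks is not specified in this section.

```python
def natural_language_goal(threshold, metric="non-toxicity", low=0, high=1,
                          meaning="less toxic content"):
    """Render a threshold goal as an instruction stating the target and metric semantics."""
    return (
        f"Generate a response with a {metric} score greater than {threshold}. "
        f"{metric.capitalize()} scores range from {low} to {high}, "
        f"where larger values indicate {meaning}."
    )


goal_text = natural_language_goal(0.85)
```

Keeping the metric description in the template means the same instruction format can be reused across thresholds, so only the numeric target changes between goals.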
#### Goal filtering for train\-test alignment\.
A practical issue in the above data construction is that most trajectories naturally satisfy the weakest thresholds. If all thresholds are kept, the training set will include many samples for goals corresponding to poor outcomes, even though inference always targets desirable high goals. To better align training with test-time use, we retain only thresholds above the average performance of the training set, i.e., we choose $\mathcal{H}$ to include only sufficiently good goals. This focuses learning on the transition from good to better, rather than from bad to average, and empirically leads to better downstream behavior. See Section [4.4](https://arxiv.org/html/2605.16345#S4.SS4) for more discussion.
Overall, we refer to the threshold-based formulation with symbolic goals as GCSL-bey, and to the version further using natural-language goals as GCSL-bey-NL.
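The goal-filtering rule described above, which keeps only thresholds above the average training performance, can be sketched as follows. It assumes "average performance" means the arithmetic mean of the raw training scores and that thresholds are indexed from 1; the function name `retained_goal_levels` is hypothetical.

```python
def retained_goal_levels(thresholds, scores):
    """Return the set H of 1-based threshold indices whose value exceeds the mean score.

    Assumption: "average performance of the training set" is the arithmetic
    mean of the raw feedback scores.
    """
    mean_score = sum(scores) / len(scores)
    return {k for k, tau in enumerate(thresholds, start=1) if tau > mean_score}


scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
kept = retained_goal_levels(thresholds, scores)
```

Here the mean score is about 0.55, so only the two highest thresholds survive and training focuses on the good-to-better range.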
## 4 Experiments
We evaluate our approach on three tasks: non-toxic generation, code generation, and LLMs for recommendation. We compare four variants of GCSL for LLM fine-tuning: GCSL denotes the classic offline GCSL formulation, which retains both the original goal definition (i.e., reaching a quantized target value) and the original goal representation using special tokens; GCSL-NL keeps the same formulation but represents goals in natural language (e.g., *“Generate a response with a non-toxicity score between 0.8 and 0.9”*); as well as our proposed GCSL-bey and GCSL-bey-NL.
Across all experiments, we use Qwen3-4B-Instruct-2507 as the default base LLM and adopt LoRA for parameter-efficient fine-tuning. Following Quark, we quantize the scores into $K=5$ bins as a default choice. All experiments are conducted with three different seeds, and the average performance is reported. See Appendix [B](https://arxiv.org/html/2605.16345#A2) for more details on the experimental settings and implementation.
### 4.1 Non-toxic Generation
Task and Dataset. The task is to fine-tune the LLM to generate less toxic responses. We conduct experiments on the REALTOXICITYPROMPTS dataset \[[16](https://arxiv.org/html/2605.16345#bib.bib96)\], which contains 100K prompts designed to elicit toxic generations. Following prior work \[[30](https://arxiv.org/html/2605.16345#bib.bib95),[28](https://arxiv.org/html/2605.16345#bib.bib97)\], we use 85K prompts from the training set. For evaluation, we use the same 5K non-toxic test prompts as in \[[30](https://arxiv.org/html/2605.16345#bib.bib95),[28](https://arxiv.org/html/2605.16345#bib.bib97)\], and generate responses with nucleus sampling ($p=0.9$). To construct the offline training set, we first use the base LLM (Qwen3-4B-Instruct-2507) to generate a response for each training prompt. We then follow Quark and use the Perspective API \[[26](https://arxiv.org/html/2605.16345#bib.bib98)\] to evaluate each generated response, assigning scores from 0 (non-toxic) to 1 (toxic). The non-toxicity score is defined as one minus the toxicity score.
Baselines and Evaluation Metrics. We compare against several representative offline baselines for toxicity reduction in LLMs. DEXPERT \[[28](https://arxiv.org/html/2605.16345#bib.bib97)\] is a decoding-time method that combines an expert and an anti-expert language model to steer generation toward desired attributes. PPLM \[[7](https://arxiv.org/html/2605.16345#bib.bib99)\] updates hidden representations during decoding using gradients from a toxicity classifier. LLMEraser \[[9](https://arxiv.org/html/2605.16345#bib.bib100)\] is a recent efficient fine-tuning framework for unlearning undesirable characteristics in LLMs. We also include a standard SFT baseline, which fine-tunes the model on the top 25% of training samples with the highest non-toxicity scores.
Following prior work\[[30](https://arxiv.org/html/2605.16345#bib.bib95),[28](https://arxiv.org/html/2605.16345#bib.bib97)\], we report*maximum toxicity*as the average of the maximum toxicity score over five generations per prompt, and*toxic probability*as the empirical probability that at least one of the five generations is toxic \(toxicity score \> 0\.5\)\. Both are measured using the Perspective API\. We also report*perplexity*of the generated outputs under the base model as a proxy for language quality and for the extent to which the fine\-tuned model deviates from the original model\.
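The two toxicity metrics defined above can be computed from per-generation toxicity scores as in the sketch below. The scores would come from an external toxicity scorer (the Perspective API in the paper); here they are placeholder numbers, and the function name `toxicity_metrics` is hypothetical.

```python
def toxicity_metrics(toxicity_by_prompt, toxic_cutoff=0.5):
    """Compute average maximum toxicity and toxic probability.

    toxicity_by_prompt: one list of toxicity scores per prompt
    (five generations per prompt in the setup described above).
    A prompt counts as toxic if its worst generation scores above toxic_cutoff.
    """
    max_scores = [max(scores) for scores in toxicity_by_prompt]
    avg_max = sum(max_scores) / len(max_scores)
    toxic_prob = sum(m > toxic_cutoff for m in max_scores) / len(max_scores)
    return avg_max, toxic_prob


placeholder_scores = [
    [0.1, 0.2, 0.6, 0.1, 0.0],  # at least one generation above 0.5
    [0.1, 0.1, 0.2, 0.3, 0.4],  # all generations below 0.5
]
avg_max, toxic_prob = toxicity_metrics(placeholder_scores)
```

Both metrics depend only on the worst generation per prompt, so a single toxic sample among five is enough to count against the model.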
Table 1: Comparison of different methods for non-toxic generation task.

| Metric | SFT | DEXPERT | PPLM | LLMEraser | GCSL | GCSL-NL | GCSL-bey | GCSL-bey-NL |
|---|---|---|---|---|---|---|---|---|
| Avg. max. ↓ | 0.139 | 0.145 | 0.152 | 0.130 | 0.134 | 0.129 | 0.125 | 0.115 |
| Prob. ↓ | 0.032 | 0.039 | 0.042 | 0.025 | 0.027 | 0.027 | 0.025 | 0.019 |
| Perplexity ↓ | 59.43 | 61.03 | 61.42 | 57.67 | 55.09 | 58.67 | 54.02 | 58.27 |
Results\.Table[1](https://arxiv.org/html/2605.16345#S4.T1)presents the results\. Overall, GCSL\-bey\-NL achieves the best performance among all offline methods in reducing toxic generations while maintaining competitive language quality\. Comparing GCSL\-bey and GCSL\-NL against GCSL shows that both the beyond\-threshold goal definition and natural\-language goal representation lead to clear improvements\. Moreover, GCSL\-bey\-NL delivers the strongest gains, suggesting that these two design choices are complementary: the new goal definition provides a more effective optimization target, while natural\-language goals allow the LLM to better leverage its semantic understanding of the intended objective\.
### 4.2 Code Generation
Task and Dataset. The target is to generate both correct and efficient code for given programming problems. We conduct experiments on the Mercury benchmark \[[12](https://arxiv.org/html/2605.16345#bib.bib101)\], which contains 1,889 Python programming tasks spanning three difficulty levels. A key feature of Mercury is its *Beyond Score*, a metric for evaluating code efficiency by comparing a candidate solution against a set of reference implementations with different efficiency levels for the same task.
We randomly sample 20K task\-code pairs from Mercury as training data, using Beyond Score as the reward signal\. For evaluation, we use the official test split of Mercury\. Following the benchmark protocol, generated code is first executed in an isolated sandbox environment, then its Beyond Score is computed by comparing its execution behavior and efficiency against the reference solutions\.
Baselines and Evaluation Metrics. We further include DPO \[[35](https://arxiv.org/html/2605.16345#bib.bib102)\] for comparison. To construct the pairwise preference data, we create five contrastive pairs for each training task by selecting the top five pairs of solutions with the largest runtime differences, following the Mercury paper \[[12](https://arxiv.org/html/2605.16345#bib.bib101)\]. Besides DPO, we also compare with two recent variants: SimPO \[[31](https://arxiv.org/html/2605.16345#bib.bib103)\], which removes DPO’s dependence on a reference model, and AlphaDPO \[[38](https://arxiv.org/html/2605.16345#bib.bib104)\], which introduces adaptive preference optimization to mitigate the static-reference issue. We also compare with SFT trained on the 25% of training samples with the highest Beyond Scores.
Following Mercury, we report*Beyond Score*as the primary metric\. As illustrated above, this metric evaluates code efficiency relative to the distribution of valid solutions in the benchmark\. Note that Beyond Score also reflects functional correctness, since solutions that fail the tests are directly assigned a score of 0\. In addition, we report*Pass Rate*to directly measure the functional correctness\.
Table 2: Comparison of different methods for code generation task.

| Model | Pass Rate ↑ (Easy) | Pass Rate ↑ (Medium) | Pass Rate ↑ (Hard) | Pass Rate ↑ (Overall) | Beyond Score ↑ (Easy) | Beyond Score ↑ (Medium) | Beyond Score ↑ (Hard) | Beyond Score ↑ (Overall) |
|---|---|---|---|---|---|---|---|---|
| SFT | 0.838 | 0.827 | 0.475 | 0.723 | 0.773 | 0.575 | 0.465 | 0.563 |
| DPO | 0.778 | 0.687 | 0.413 | 0.675 | 0.640 | 0.450 | 0.374 | 0.498 |
| SimPO | 0.781 | 0.705 | 0.426 | 0.698 | 0.668 | 0.481 | 0.387 | 0.534 |
| AlphaDPO | 0.798 | 0.732 | 0.493 | 0.710 | 0.712 | 0.495 | 0.414 | 0.592 |
| GCSL | 0.839 | 0.724 | 0.560 | 0.718 | 0.943 | 0.500 | 0.459 | 0.650 |
| GCSL-NL | 0.857 | 0.739 | 0.579 | 0.731 | 0.947 | 0.543 | 0.465 | 0.662 |
| GCSL-bey | 0.903 | 0.793 | 0.560 | 0.764 | 0.941 | 0.590 | 0.445 | 0.676 |
| GCSL-bey-NL | 0.845 | 0.828 | 0.640 | 0.777 | 0.956 | 0.650 | 0.640 | 0.714 |

Results. Table [2](https://arxiv.org/html/2605.16345#S4.T2) summarizes the results. The overall conclusion is similar: GCSL-bey-NL achieves the strongest performance, and its advantage is particularly pronounced on more difficult coding tasks. This is intuitive, since harder problems typically exhibit larger performance gaps among candidate solutions. In such cases, learning from fine-grained quality differences and their ordered relationships is more beneficial than either treating high-quality samples as a homogeneous set like SFT, or modeling only pairwise relative preferences like DPO.
Comparing GCSL\-bey and GCSL\-NL with GCSL also shows that both the beyond\-threshold goal formulation and the natural\-language goal representation contribute clear gains\. Moreover, the stronger results of GCSL\-bey\-NL indicate that the two components are complementary and that combining them yields the most effective formulation\.
Notably, GCSL\-based methods show an even larger advantage on code generation tasks\. We attribute this to the nature of the task: generating code that is both functionally correct and computationally efficient is a particularly subtle objective\. It requires the model not only to solve the coding problem, but also to extrapolate nuanced efficiency differences among valid solutions\. This is precisely where GCSL is especially well suited, as it enables the model to learn from structured, graded supervision over outcome quality rather than from binary filtering or pairwise comparisons alone\.
### 4\.3 LLM for Recommendation
Task and Dataset\. The goal is to fine\-tune an LLM to recommend items to users\. Specifically, given a user’s historical interaction sequence, the model is asked to predict the item\(s\) that best match the user’s current preference\. We conduct experiments on the Amazon Reviews\[[18](https://arxiv.org/html/2605.16345#bib.bib110)\] dataset, using the CDs & Vinyl category\. Following prior work\[[3](https://arxiv.org/html/2605.16345#bib.bib113),[6](https://arxiv.org/html/2605.16345#bib.bib114),[14](https://arxiv.org/html/2605.16345#bib.bib108)\], we filter out user interaction sequences with fewer than 10 entries\. We then split the processed data into training, validation, and test sets with an 8:1:1 ratio\. Following the recent LLM\-based recommendation setup\[[14](https://arxiv.org/html/2605.16345#bib.bib108)\], we further sample 4,096 interactions from the training set, 512 from the validation set, and 1,000 from the test set\.
Each interaction record includes a user rating from 1 to 5 for the target item, which we directly use as the reward signal to derive the goals for GCSL\. For evaluation, following prior work\[[14](https://arxiv.org/html/2605.16345#bib.bib108),[2](https://arxiv.org/html/2605.16345#bib.bib106)\], we prompt the LLM to generate a predicted item, compute semantic similarity scores between the generated output and all candidate items using Sentence Transformer\[[36](https://arxiv.org/html/2605.16345#bib.bib123)\], rank the entire item set accordingly, and then derive the final top\-$k$ recommendation list\.
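The ranking step of this evaluation pipeline can be illustrated with a short sketch\. It assumes that the generated text and every candidate item have already been embedded by a sentence encoder; the toy three\-dimensional vectors, the item names, and the helper names `cosine` and `top_k_items` are hypothetical and stand in for real encoder outputs\.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_items(generated_vec, item_vecs, k):
    """Rank all candidate items by similarity to the generated output and keep the top k."""
    ranked = sorted(
        item_vecs.items(),
        key=lambda kv: cosine(generated_vec, kv[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy embeddings standing in for sentence-encoder outputs (hypothetical values).
item_vecs = {
    "item_a": [0.9, 0.1, 0.0],
    "item_b": [0.1, 0.9, 0.1],
    "item_c": [0.7, 0.2, 0.3],
}
generated_vec = [1.0, 0.0, 0.1]
top2 = top_k_items(generated_vec, item_vecs, k=2)
```

Because the whole item set is ranked, the generated text does not need to match any item title exactly\.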
Baselines and Evaluation Metrics\. We compare against representative and recent methods for LLM fine\-tuning in recommendation\. BIGRec\[[2](https://arxiv.org/html/2605.16345#bib.bib106)\] is an established instruction\-tuning framework for sequential recommendation\. RosePO\[[27](https://arxiv.org/html/2605.16345#bib.bib107)\] is a DPO\-based preference optimization framework that combines negative sampling and personalized uncertainty modeling to improve fairness, robustness, and bias mitigation\. SPRec\[[14](https://arxiv.org/html/2605.16345#bib.bib108)\] is a recent method that alternates between SFT and DPO to iteratively improve user preference estimation\. We also include a direct SFT baseline trained on sequences with ratings greater than or equal to 3, following the standard positive\-label threshold commonly adopted in prior recommendation studies\[[14](https://arxiv.org/html/2605.16345#bib.bib108),[19](https://arxiv.org/html/2605.16345#bib.bib109)\] on the same dataset\.
We report two standard ranking metrics\. *NDCG@k*\[[25](https://arxiv.org/html/2605.16345#bib.bib111)\] evaluates the quality of the top\-$k$ recommendation list by accounting for both the position of relevant items and their graded relevance scores \(i\.e\., users’ ratings\)\. *ERR@k*\[[4](https://arxiv.org/html/2605.16345#bib.bib105)\] interprets the rating as a probability of user satisfaction and places greater emphasis on ranking highly rated items at the top of the list\. Notably, both metrics place greater value on recommended items with higher ratings, which highlights the key advantage of our GCSL strategy over SFT, which treats all positive training samples equally\.
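For concreteness, the two metrics can be sketched as below\. This is one common formulation, not necessarily the exact evaluation code used here: the exponential gain in DCG, the use of the retrieved list itself to form the ideal ordering, the convention that a rating of 0 marks an item without a known rating, and the function names are all assumptions of the example\. The satisfaction probability $(2^{g}-1)/2^{g_{\max}}$ follows the standard ERR definition\.

```python
import math


def dcg_at_k(ratings, k):
    """Discounted cumulative gain with exponential gain and a log2 position discount."""
    return sum((2 ** r - 1) / math.log2(pos + 2) for pos, r in enumerate(ratings[:k]))


def ndcg_at_k(ratings, k):
    """DCG normalized by the DCG of the same ratings sorted from best to worst."""
    ideal = dcg_at_k(sorted(ratings, reverse=True), k)
    return dcg_at_k(ratings, k) / ideal if ideal > 0 else 0.0


def err_at_k(ratings, k, max_rating=5):
    """Expected reciprocal rank under the cascade model of user browsing."""
    err, p_continue = 0.0, 1.0
    for pos, r in enumerate(ratings[:k], start=1):
        # Probability that the user is satisfied by the item at this position.
        p_satisfied = (2 ** r - 1) / (2 ** max_rating)
        err += p_continue * p_satisfied / pos
        p_continue *= 1.0 - p_satisfied
    return err


# Ratings of the items in a ranked list, top position first (hypothetical values).
ranked = [0, 5, 0, 3]
ndcg = ndcg_at_k(ranked, k=4)
err = err_at_k(ranked, k=4)
```

Both functions reward placing a rating of 5 above a rating of 3, which is the graded behavior that a binary positive/negative split cannot express\.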
Table 3: Comparison of different methods on the LLM\-based recommendation task\.

| Metric | SFT | BIGRec | RosePO | SPRec | GCSL | GCSL\-NL | GCSL\-bey | GCSL\-bey\-NL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NDCG@10 ↑ | 0\.0067 | 0\.0071 | 0\.0061 | 0\.0079 | 0\.0072 | 0\.0074 | 0\.0107 | 0\.0115 |
| ERR@10 ↑ | 0\.0035 | 0\.0040 | 0\.0029 | 0\.0044 | 0\.0041 | 0\.0043 | 0\.0046 | 0\.0054 |
| NDCG@20 ↑ | 0\.0072 | 0\.0076 | 0\.0069 | 0\.0092 | 0\.0072 | 0\.0079 | 0\.0112 | 0\.0127 |
| ERR@20 ↑ | 0\.0037 | 0\.0044 | 0\.0034 | 0\.0047 | 0\.0042 | 0\.0046 | 0\.0050 | 0\.0057 |
Results\. The results are similar: GCSL\-bey\-NL again achieves the best performance by a clear margin\. Notably, the purely DPO\-based method RosePO performs worse than direct SFT\. We attribute this to the nature of the recommendation task and its data structure\. In particular, for each training sequence, we only observe the user’s rating for a single target item\. As a result, DPO\-based methods must sample other items \(usually as negatives\) from the candidate pool to construct the contrastive pairs\. This can be problematic, because the sampled items are not necessarily true negatives: the user may still prefer them, but they have simply not been exposed or recommended to the user yet\. In contrast, GCSL directly leverages the observed graded feedback without requiring such pair construction, and therefore avoids the noise and ambiguity introduced by sampled negatives\.
### 4\.4 Ablations
We conduct additional ablation studies to examine the effectiveness and mechanism of our approach\. Due to space limitations, tables and figures for the ablation results are presented in Appendix[A](https://arxiv.org/html/2605.16345#A1)\.
Effect of the number of thresholds\. As described in Section[3\.3](https://arxiv.org/html/2605.16345#S3.SS3.SSS0.Px1), we obtain thresholds by quantizing the training samples into $K$ bins according to their rewards/performance values\. Figure[3](https://arxiv.org/html/2605.16345#A1.F3) shows that using a larger number of quantiles generally improves performance, as it enables a more fine\-grained partition of the training data\. This helps GCSL better capture and extrapolate the progression patterns among quality levels\. However, when the number of quantiles becomes too large, the number of training samples within each bin decreases, especially in the highest bin, which is most closely aligned with the inference objective\. In this case, learning can become less effective because the model has fewer training examples for high\-threshold goals, which can hurt the final performance\. We use $K=5$ in all experiments; the consistently superior performance over the compared methods under this single setting suggests that the approach is robust to the threshold choice\.
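The quantization step can be sketched with the Python standard library\. The sketch assumes that splitting the samples into $K$ bins means using the $K-1$ interior quantile cut points as thresholds; the exact threshold definition is given in Section 3\.3 and may differ\. The reward values and the helper names `quantile_thresholds` and `bin_index` are hypothetical\.

```python
import statistics


def quantile_thresholds(rewards, num_bins=5):
    """Return the num_bins - 1 interior cut points that split rewards into equal-mass bins."""
    return statistics.quantiles(rewards, n=num_bins, method="inclusive")


def bin_index(reward, thresholds):
    """Index of the quality bin (0 = lowest) that a reward falls into."""
    return sum(reward >= t for t in thresholds)


# Hypothetical training rewards in [0, 1].
rewards = [0.05, 0.12, 0.30, 0.41, 0.48, 0.55, 0.63, 0.77, 0.85, 0.96]
thresholds = quantile_thresholds(rewards, num_bins=5)
bins = [bin_index(r, thresholds) for r in rewards]
```

With ten samples and five bins, each bin holds two samples; raising `num_bins` on the same data leaves fewer samples per bin, which is the trade\-off discussed above\.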
Effect of the goal filtering strategy\. As described in Section[3\.3](https://arxiv.org/html/2605.16345#S3.SS3.SSS0.Px3), we retain only thresholds above the average training performance to align the training and inference objectives\. To evaluate this, we compare GCSL\-bey and GCSL\-bey\-NL with their counterparts that do not use goal filtering\. Table[4](https://arxiv.org/html/2605.16345#A1.T4) shows that filtering out low\-quality goals is important for maintaining the effectiveness of GCSL, as it ensures the alignment between training\-time supervision and the goal pursued at inference time\.
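A minimal sketch of the filtering rule follows\. It assumes that "average training performance" means the arithmetic mean of the training rewards; the function name `filter_goals` and all numeric values are hypothetical\.

```python
def filter_goals(thresholds, rewards):
    """Keep only the thresholds that lie above the mean training reward."""
    mean_reward = sum(rewards) / len(rewards)
    return [t for t in thresholds if t > mean_reward]


# Hypothetical training rewards and candidate thresholds.
rewards = [0.05, 0.12, 0.30, 0.41, 0.48, 0.55, 0.63, 0.77, 0.85, 0.96]
thresholds = [0.264, 0.452, 0.582, 0.786]
kept = filter_goals(thresholds, rewards)
```

Here the mean reward is about 0\.512, so only the two highest thresholds remain as training goals\.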
Comparison with SFT under different data\-splitting ratios\. Figure[4](https://arxiv.org/html/2605.16345#A1.F4) compares GCSL\-bey\-NL with SFT trained on different positive\-data ratios, where SFT selects the top 60% to 10% of training samples as positive examples\. The results show that SFT can be highly sensitive to this choice\. The key trade\-off is between the quality of the selected positive subset and its size: stricter filtering yields higher\-quality supervision but fewer training samples, while looser filtering increases data quantity at the cost of including lower\-quality examples\. In contrast, our GCSL\-bey\-NL avoids binarizing the positive samples and *consistently outperforms SFT across all positive\-ratio settings*\. This demonstrates the advantage of exploiting the hierarchical structure of graded training data, rather than reducing it to a binary split and treating all positive samples as equally informative\.
Comparison with online methods\. Although our method is designed for purely offline learning settings, we further compare it with online methods such as Quark and PPO\. As shown in Table[5](https://arxiv.org/html/2605.16345#A1.T5), online methods can achieve stronger performance, which is expected because they are able to iteratively generate new data and optimize beyond the limitations of a fixed offline dataset\. However, this advantage comes at a substantial cost: online methods require repeated rollout generation, the training and/or use of an external reward model to evaluate newly generated outputs, and repeated model updates on the expanded data\. This can be highly time\- and resource\-intensive, and in many practical settings an appropriate reward model may not be available\. In contrast, GCSL\-bey\-NL is fully offline, offering significant efficiency advantages while consistently outperforming other offline approaches\.
## 5 Conclusion
In this work, we present goal\-conditioned supervised learning as an offline fine\-tuning framework for LLMs\. By directly treating feedback signals as explicit goals, our method avoids both binarizing the training samples into positives and negatives and constructing paired preference data\. We further introduce a beyond\-threshold goal formulation and natural\-language goal representations to better exploit graded feedback\. Extensive experiments demonstrate that our approach consistently outperforms standard offline baselines while preserving the computational and data efficiency advantages of supervised learning\.
## References
- \[1\] R\. Agarwal, A\. Singh, L\. Zhang, B\. Bohnet, L\. Rosias, S\. Chan, B\. Zhang, A\. Anand, Z\. Abbas, A\. Nova, et al\. \(2024\) Many\-shot in\-context learning\. Advances in Neural Information Processing Systems 37, pp\. 76930–76966\.
- \[2\] \(2025\) A bi\-step grounding paradigm for large language models in recommendation systems\. ACM Transactions on Recommender Systems 3\(4\), pp\. 1–27\.
- \[3\] K\. Bao, J\. Zhang, Y\. Zhang, X\. Huo, C\. Chen, and F\. Feng \(2024\) Decoding matters: addressing amplification bias and homogeneity issue for LLM\-based recommendation\. arXiv preprint arXiv:2406\.14900\.
- \[4\] O\. Chapelle, D\. Metlzer, Y\. Zhang, and P\. Grinspan \(2009\) Expected reciprocal rank for graded relevance\. In Proceedings of the 18th ACM conference on Information and knowledge management, pp\. 621–630\.
- \[5\] L\. Chen, K\. Lu, A\. Rajeswaran, K\. Lee, A\. Grover, M\. Laskin, P\. Abbeel, A\. Srinivas, and I\. Mordatch \(2021\) Decision transformer: reinforcement learning via sequence modeling\. Advances in Neural Information Processing Systems 34, pp\. 15084–15097\.
- \[6\] Y\. Chen, J\. Tan, A\. Zhang, Z\. Yang, L\. Sheng, E\. Zhang, X\. Wang, and T\. Chua \(2024\) On softmax direct preference optimization for recommendation\. Advances in Neural Information Processing Systems 37, pp\. 27463–27489\.
- \[7\] S\. Dathathri, A\. Madotto, J\. Lan, J\. Hung, E\. Frank, P\. Molino, J\. Yosinski, and R\. Liu \(2020\) Plug and play language models: a simple approach to controlled text generation\. ICLR\.
- \[8\] R\. Diaz and A\. Marathe \(2019\) Soft labels for ordinal regression\. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp\. 4738–4747\.
- \[9\] C\. Ding, J\. Wu, Y\. Yuan, J\. Lu, K\. Zhang, A\. Su, X\. Wang, and X\. He \(2025\) Unified parameter\-efficient unlearning for LLMs\. ICLR\.
- \[10\] H\. Dong, W\. Xiong, D\. Goyal, Y\. Zhang, W\. Chow, R\. Pan, S\. Diao, J\. Zhang, K\. Shum, and T\. Zhang \(2023\) RAFT: reward ranked finetuning for generative foundation model alignment\. arXiv preprint arXiv:2304\.06767\.
- \[11\] Y\. Dong, Z\. Wang, M\. Sreedhar, X\. Wu, and O\. Kuchaiev \(2023\) SteerLM: attribute conditioned SFT as an \(user\-steerable\) alternative to RLHF\. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp\. 11275–11288\.
- \[12\] M\. Du, L\. A\. Tuan, B\. Ji, Q\. Liu, and S\. Ng \(2024\) Mercury: a code efficiency benchmark for code large language models\. Advances in Neural Information Processing Systems 37, pp\. 16601–16622\.
- \[13\] K\. Ethayarajh, W\. Xu, N\. Muennighoff, D\. Jurafsky, and D\. Kiela \(2024\) KTO: model alignment as prospect theoretic optimization\. arXiv preprint arXiv:2402\.01306\.
- \[14\] C\. Gao, R\. Chen, S\. Yuan, K\. Huang, Y\. Yu, and X\. He \(2025\) SPRec: self\-play to debias LLM\-based recommendation\. In Proceedings of the ACM on Web Conference 2025, pp\. 5075–5084\.
- \[15\] L\. Gao, J\. Schulman, and J\. Hilton \(2023\) Scaling laws for reward model overoptimization\. In International Conference on Machine Learning, pp\. 10835–10866\.
- \[16\] S\. Gehman, S\. Gururangan, M\. Sap, Y\. Choi, and N\. A\. Smith \(2020\) RealToxicityPrompts: evaluating neural toxic degeneration in language models\. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp\. 3356–3369\.
- \[17\] A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The Llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\.
- \[18\] R\. He and J\. McAuley \(2016\) Ups and downs: modeling the visual evolution of fashion trends with one\-class collaborative filtering\. In Proceedings of the 25th international conference on world wide web, pp\. 507–517\.
- \[19\] X\. He, K\. Deng, X\. Wang, Y\. Li, Y\. Zhang, and M\. Wang \(2020\) LightGCN: simplifying and powering graph convolution network for recommendation\. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pp\. 639–648\.
- \[20\] J\. Hong, A\. Dragan, and S\. Levine \(2025\) Planning without search: refining frontier LLMs with offline goal\-conditioned RL\. arXiv preprint arXiv:2505\.18098\.
- \[21\] E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, W\. Chen, et al\. \(2022\) LoRA: low\-rank adaptation of large language models\. ICLR\.
- \[22\] J\. Hu, L\. Tao, J\. Yang, and C\. Zhou \(2023\) Aligning language models with offline learning from human feedback\. arXiv preprint arXiv:2308\.12050\.
- \[23\] M\. Janner, Q\. Li, and S\. Levine \(2021\) Offline reinforcement learning as one big sequence modeling problem\. Advances in Neural Information Processing Systems 34, pp\. 1273–1286\.
- \[24\] M\. Janner, Q\. Li, and S\. Levine \(2021\) Offline reinforcement learning as one big sequence modeling problem\. Advances in Neural Information Processing Systems 34, pp\. 1273–1286\.
- \[25\] W\. Kang and J\. McAuley \(2018\) Self\-attentive sequential recommendation\. In 2018 IEEE international conference on data mining \(ICDM\), pp\. 197–206\.
- \[26\] A\. Lees, V\. Q\. Tran, Y\. Tay, J\. Sorensen, J\. Gupta, D\. Metzler, and L\. Vasserman \(2022\) A new generation of Perspective API: efficient multilingual character\-level transformers\. In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pp\. 3197–3207\.
- \[27\] J\. Liao, X\. He, R\. Xie, J\. Wu, Y\. Yuan, X\. Sun, Z\. Kang, and X\. Wang \(2024\) RosePO: aligning LLM\-based recommenders with human values\. arXiv preprint arXiv:2410\.12519\.
- \[28\] A\. Liu, M\. Sap, X\. Lu, S\. Swayamdipta, C\. Bhagavatula, N\. A\. Smith, and Y\. Choi \(2021\) DExperts: decoding\-time controlled text generation with experts and anti\-experts\. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\), pp\. 6691–6706\.
- \[29\] M\. Liu, M\. Zhu, and W\. Zhang \(2022\) Goal\-conditioned reinforcement learning: problems and solutions\. arXiv preprint arXiv:2201\.08299\.
- \[30\] X\. Lu, S\. Welleck, J\. Hessel, L\. Jiang, L\. Qin, P\. West, P\. Ammanabrolu, and Y\. Choi \(2022\) Quark: controllable text generation with reinforced unlearning\. Advances in Neural Information Processing Systems 35, pp\. 27591–27609\.
- \[31\] Y\. Meng, M\. Xia, and D\. Chen \(2024\) SimPO: simple preference optimization with a reference\-free reward\. Advances in Neural Information Processing Systems 37, pp\. 124198–124235\.
- \[32\] S\. Mukherjee, V\. D\. Lai, R\. Addanki, R\. Rossi, S\. Yoon, T\. Bui, A\. Rao, J\. Subramanian, and B\. Kveton \(2025\) Offline RL by reward\-weighted fine\-tuning for conversation optimization\. arXiv preprint arXiv:2506\.06964\.
- \[33\] V\. Nath, D\. Slack, J\. Da, Y\. Ma, H\. Zhang, S\. Whitehead, and S\. Hendryx \(2024\) Learning goal\-conditioned representations for language reward models\. Advances in Neural Information Processing Systems 37, pp\. 117070–117108\.
- \[34\] X\. Qiu, Y\. Gan, C\. F\. Hayes, Q\. Liang, Y\. Xu, R\. Dailey, E\. Meyerson, B\. Hodjat, and R\. Miikkulainen \(2025\) Evolution strategies at scale: LLM fine\-tuning beyond reinforcement learning\. arXiv preprint arXiv:2509\.24372\.
- \[35\] R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn \(2023\) Direct preference optimization: your language model is secretly a reward model\. Advances in Neural Information Processing Systems 36, pp\. 53728–53741\.
- \[36\] N\. Reimers and I\. Gurevych \(2019\) Sentence\-BERT: sentence embeddings using siamese BERT\-networks\. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing \(EMNLP\-IJCNLP\), pp\. 3982–3992\.
- \[37\] A\. Scheid, E\. Boursier, A\. Durmus, M\. I\. Jordan, P\. Ménard, E\. Moulines, and M\. Valko \(2024\) Optimal design for reward modeling in RLHF\. arXiv preprint arXiv:2410\.17055\.
- \[38\] J\. Wu, X\. Wang, Z\. Yang, J\. Wu, J\. Gao, B\. Ding, X\. Wang, and X\. He \(2025\) AlphaDPO: adaptive reward margin for direct preference optimization\. In International Conference on Machine Learning, pp\. 67793–67809\.
- \[39\] D\. Xu, T\. Xie, B\. Xia, H\. Li, Y\. Bai, Y\. Sun, and W\. Wang \(2024\) Does few\-shot learning help LLM performance in code synthesis?\. arXiv preprint arXiv:2412\.02906\.
- \[40\] A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, et al\. \(2025\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\.
## Appendix A Ablation Results
Figure[3](https://arxiv.org/html/2605.16345#A1.F3) shows the effect of the number of thresholds on the performance of GCSL\-bey\-NL\.


Figure 3: Performance of GCSL\-bey\-NL \(y\-axis\) with different numbers of quantization thresholds \(x\-axis\) on non\-toxic generation \(left\) and code generation \(right\)\.

Figure[4](https://arxiv.org/html/2605.16345#A1.F4) shows the effect of different positive\-data ratios on the performance of SFT\. For reference, GCSL\-bey\-NL achieves 0\.115 on average maximum toxicity and 0\.019 on toxicity probability, as well as 0\.777 on Pass Rate and 0\.714 on Beyond Score\. The results show that GCSL\-bey\-NL consistently provides significant performance benefits over SFT across different positive\-data ratios\.


Figure 4: Performance of SFT \(y\-axis\) with different positive\-data ratios \(x\-axis, i\.e\., selecting the top 60% to 10% of training samples as positive examples\) on non\-toxic generation \(left\) and code generation \(right\)\.

Table[4](https://arxiv.org/html/2605.16345#A1.T4) compares GCSL\-bey and GCSL\-bey\-NL with their counterparts that do not use goal filtering\. The results demonstrate the effectiveness and necessity of this strategy\.
Table 4: Comparison of GCSL\-bey and GCSL\-bey\-NL with and without the goal filtering strategy\.

| Model | RealToxicityPrompts Avg\. max\. ↓ | RealToxicityPrompts Prob\. ↓ | Mercury Pass Rate ↑ | Mercury Beyond Score ↑ |
| --- | --- | --- | --- | --- |
| GCSL\-bey w/o filter | 0\.173 | 0\.049 | 0\.694 | 0\.634 |
| GCSL\-bey | 0\.125 | 0\.025 | 0\.764 | 0\.676 |
| GCSL\-bey\-NL w/o filter | 0\.156 | 0\.035 | 0\.743 | 0\.658 |
| GCSL\-bey\-NL | 0\.115 | 0\.019 | 0\.777 | 0\.714 |

Table[5](https://arxiv.org/html/2605.16345#A1.T5) compares GCSL\-bey\-NL with online fine\-tuning methods such as PPO and Quark\. Training time is computed from the start until convergence\. The results show that, although GCSL\-bey\-NL performs worse than online approaches due to the intrinsic limitations of offline methods trained with fixed offline data, it inherits the significant computational efficiency advantages of standard supervised learning\. Specifically, it does not require an external reward model or online labeling/updates, while still consistently outperforming other offline fine\-tuning baselines such as SFT\.
Notably, the efficiency advantage of offline methods is especially pronounced in the code generation task\. This is because online methods must execute each newly generated program in the sandbox environment, compute its performance metrics, and use these scores as rewards for further training\. Consequently, the cost and latency of code execution make online approaches particularly time\- and resource\-intensive when fine\-tuning LLMs for code generation\.
Table 5: Comparison of GCSL\-bey\-NL and online fine\-tuning methods\.

| Model | RealToxicityPrompts Avg\. max\. ↓ | RealToxicityPrompts Prob\. ↓ | RealToxicityPrompts Training Time ↓ | Mercury Pass Rate ↑ | Mercury Beyond Score ↑ | Mercury Training Time ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| PPO | 0\.105 | 0\.018 | 12\.9h | 0\.825 | 0\.718 | 24\.5h |
| Quark | 0\.109 | 0\.015 | 11\.1h | 0\.812 | 0\.722 | 20\.8h |
| SFT | 0\.139 | 0\.032 | 7\.8h | 0\.723 | 0\.563 | 12\.0h |
| GCSL\-bey\-NL | 0\.115 | 0\.019 | 8\.0h | 0\.777 | 0\.714 | 12\.3h |
## Appendix B Experimental Details
We conduct all experiments on a dedicated NVIDIA GH200 superchip equipped with an H100 GPU, with 256 GB of RAM and a 1 TB SSD\. For all models, we use the Adam optimizer and tune the learning rate over the values \[1e\-7, 5e\-7, 1e\-6, 5e\-7, 1e\-5\]\. By default, we fine\-tune the Qwen3\-4B\-Instruct\-2507 model using LoRA\[[21](https://arxiv.org/html/2605.16345#bib.bib112)\], with LoRA rank set to 13, LoRA alpha set to 32, and LoRA dropout set to 0\.05\. We also use 16\-bit floating point as the torch tensor data type\.
When comparing training efficiency, we configure each method with the largest batch size that fits on the H100 GPU\. To preserve the language distribution and prevent the model from deviating excessively from the original, we follow Quark\[[30](https://arxiv.org/html/2605.16345#bib.bib95)\] and add a KL penalty to the loss function for the non\-toxic generation task, with the KL coefficient set to 0\.05\. We do not include DPO and related baselines on the non\-toxic generation task because, following previous work\[[7](https://arxiv.org/html/2605.16345#bib.bib99),[30](https://arxiv.org/html/2605.16345#bib.bib95),[28](https://arxiv.org/html/2605.16345#bib.bib97)\], we assume that the available data carries one\-sided feedback rather than pairwise comparisons, which prevents the formation of contrastive preference pairs without generating additional data\. For all baseline methods, we adhere to the implementation details, optimization configurations, and hyperparameter tuning strategies reported in their original papers\. We repeat each experiment three times with different random seeds and report the average performance\. Statistical significance is evaluated using a paired t\-test with $\rho<0.05$\.
## Appendix C Additional Experiments and Discussions
### C\.1 Comparison with Data Augmentation
As described in Section[3\.3](https://arxiv.org/html/2605.16345#S3.SS3.SSS0.Px1), our proposed GCSL\-bey and GCSL\-bey\-NL redefine the GCSL objective as “exceeding a quality threshold”\. Under this formulation, a trajectory with quality level $q_i$ is a successful example not only for its own level, but for every threshold no higher than $q_i$\. This expands the training set by associating the same trajectory with multiple valid goals\.
One may wonder whether the gains from this design simply come from implicit data augmentation by oversampling\. To examine this, we construct a control variant that uses the same data expansion procedure as GCSL\-bey but removes goal conditioning\. For example, if an offline trajectory $\tau_k$ has quality level $q_3$, we repeat it three times following the GCSL\-bey construction and include all copies in the training set\. We also apply the same filtering strategy as GCSL\-bey, retaining only trajectories above the average performance threshold\. We then perform standard supervised fine\-tuning on this expanded dataset without appending any goals\. We denote this variant as SFT\-aug\.
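The difference between the two constructions can be made concrete with a short sketch\. It is an illustration under assumptions, not the released implementation: the sample layout `(prompt, response, reward)`, the threshold values, and the function names `expand_gcsl_bey` and `expand_sft_aug` are hypothetical, and the thresholds are taken to be those that already survived goal filtering\.

```python
def expand_gcsl_bey(samples, thresholds):
    """Pair each (prompt, response, reward) sample with every threshold it meets or exceeds."""
    return [
        (threshold, prompt, response)
        for prompt, response, reward in samples
        for threshold in thresholds
        if reward >= threshold
    ]


def expand_sft_aug(samples, thresholds):
    """Control variant: the same number of copies per sample, with the goal dropped."""
    return [
        (prompt, response)
        for _, prompt, response in expand_gcsl_bey(samples, thresholds)
    ]


# Hypothetical thresholds (after filtering) and training samples.
thresholds = [0.58, 0.79]
samples = [("p1", "r1", 0.85), ("p2", "r2", 0.63), ("p3", "r3", 0.30)]
gcsl_bey_data = expand_gcsl_bey(samples, thresholds)
sft_aug_data = expand_sft_aug(samples, thresholds)
```

Both datasets contain the same responses with the same multiplicities; only the GCSL\-bey version tells the model which threshold each copy satisfies, which is the factor the control isolates\.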
Table[6](https://arxiv.org/html/2605.16345#A3.T6) compares SFT\-aug with GCSL\-bey and GCSL\-bey\-NL\. The results show that the gains of GCSL\-bey are not merely due to repeating high\-quality samples more often\. Rather, the improvement comes from explicitly conditioning on goals and learning the relationship between trajectories and ordered quality thresholds\.

Table 6: Comparison of SFT\-aug with GCSL\-bey and GCSL\-bey\-NL\.

| Model | RealToxicityPrompts Avg\. max\. ↓ | RealToxicityPrompts Prob\. ↓ | Mercury Pass Rate ↑ | Mercury Beyond Score ↑ |
| --- | --- | --- | --- | --- |
| SFT\-aug | 0\.132 | 0\.029 | 0\.727 | 0\.598 |
| GCSL\-bey | 0\.125 | 0\.025 | 0\.764 | 0\.676 |
| GCSL\-bey\-NL | 0\.115 | 0\.019 | 0\.777 | 0\.714 |
### C\.2 Discussion of Quantization
As described in Section[3\.1](https://arxiv.org/html/2605.16345#S3.SS1), both classic GCSL and our GCSL\-bey begin by quantizing feedback signals in the training data into a finite set of goal labels\. This design follows prior work on applying GCSL with LLMs or Transformer\-based sequential models\[[23](https://arxiv.org/html/2605.16345#bib.bib94),[30](https://arxiv.org/html/2605.16345#bib.bib95)\]\. The main motivation is training efficiency: quantization makes the joint distribution over goals and tokenized prompt–response trajectories much denser and therefore easier to learn\.
By contrast, if we directly use continuous scalar feedback as goals, the joint space of numerical goals and corresponding trajectories becomes highly sparse, especially in offline settings with limited data\. Such sparsity can significantly reduce the learning efficiency of GCSL and harm the final performance, as also noted in prior work\[[23](https://arxiv.org/html/2605.16345#bib.bib94),[30](https://arxiv.org/html/2605.16345#bib.bib95)\]\. Nevertheless, we additionally test this issue in our setting by comparing GCSL with and without quantization\. Specifically, we compare the classic GCSL adaptation for LLM fine\-tuning, which quantizes feedback signals and represents goals with special reward tokens, with a non\-quantized variant that maps scalar feedback values to embedding representations via an external reward embedding layer\. These embeddings are then provided as additional inputs to the LLM Transformer decoder\. We also compare GCSL\-NL with its variant without quantization, which converts the original scalar feedback values into strings with two decimal places that are then tokenized by the LLM tokenizer\.
Table[7](https://arxiv.org/html/2605.16345#A3.T7) presents the results, demonstrating the necessity and effectiveness of the goal quantization procedure for GCSL in LLM fine\-tuning\.

Table 7: Comparison of GCSL and GCSL\-NL with their variants without goal quantization\.

| Model | RealToxicityPrompts Avg\. max\. ↓ | RealToxicityPrompts Prob\. ↓ | Mercury Pass Rate ↑ | Mercury Beyond Score ↑ |
| --- | --- | --- | --- | --- |
| GCSL w/o quantization | 0\.179 | 0\.049 | 0\.614 | 0\.485 |
| GCSL | 0\.134 | 0\.027 | 0\.718 | 0\.650 |
| GCSL\-NL w/o quantization | 0\.168 | 0\.046 | 0\.630 | 0\.506 |
| GCSL\-NL | 0\.129 | 0\.027 | 0\.764 | 0\.662 |
### C\.3 Results with Other Backbone LLM
In addition to using Qwen3\-4B\-Instruct\-2507 as the base model, we also conduct experiments with a larger backbone, Llama\-3\.1\-8B\-Instruct\. The corresponding results are reported in Table[8](https://arxiv.org/html/2605.16345#A3.T8)\. The findings consistently demonstrate the robustness of GCSL\-bey\-NL and its sustained performance advantages over the compared baseline methods\. These results further highlight the cross\-model transferability of the optimization benefits enabled by the proposed goal\-conditioned supervised learning framework\.
Table 8: Performance comparison with Llama\-3\.1\-8B\-Instruct as backbone\.

| Model | RealToxicityPrompts Avg\. max\. ↓ | RealToxicityPrompts Prob\. ↓ | Mercury Pass Rate ↑ | Mercury Beyond Score ↑ |
| --- | --- | --- | --- | --- |
| SFT | 0\.114 | 0\.013 | 0\.635 | 0\.515 |
| LLMEraser | 0\.110 | 0\.015 | – | – |
| AlphaDPO | – | – | 0\.651 | 0\.523 |
| GCSL | 0\.112 | 0\.014 | 0\.629 | 0\.501 |
| GCSL\-NL | 0\.109 | 0\.013 | 0\.634 | 0\.514 |
| GCSL\-bey | 0\.107 | 0\.012 | 0\.662 | 0\.553 |
| GCSL\-bey\-NL | 0\.104 | 0\.010 | 0\.694 | 0\.610 |
### C\.4 Results with Flexible Thresholds
As discussed in Section[3\.3](https://arxiv.org/html/2605.16345#S3.SS3), GCSL\-bey\-NL defines goals based on the quantized thresholds derived from the training data, and uses the highest threshold as the inference\-time goal for pursuing high\-quality outcomes\. Since a key motivation of our work is to leverage the flexible semantic understanding and instruction\-following ability of LLMs to generalize goal\-achieving behavior, we further study performance under more flexible goals that do not explicitly appear in the training set\. Specifically, we scale the highest training threshold by a series of factors and define new inference goals using the resulting values, keeping the same decimal precision as the original threshold\. For example, if the highest training threshold is 0\.85 and the scaling factor is 1\.04, the corresponding goal becomes:*“Generate a response with a non\-toxicity score greater than 0\.88\. Non\-toxicity scores range from 0 to 1, where larger values indicate less toxic content\.”*
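The threshold scaling described above can be sketched in a few lines of Python\. This is a minimal illustration, not the authors’ code: the function names are hypothetical, and round\-half\-up is an assumed rounding rule, since the text only states that the original decimal precision is kept\.

```python
from decimal import Decimal, ROUND_HALF_UP


def scaled_threshold(highest_threshold: str, factor: str) -> str:
    """Scale a threshold and keep the decimal precision of the original value.

    Inputs are strings so that "0.85" keeps exactly two decimal places.
    The rounding mode (half-up) is an assumption of this sketch.
    """
    base = Decimal(highest_threshold)
    # Unit in the last place of the original threshold, e.g. 0.01 for "0.85".
    precision = Decimal(1).scaleb(base.as_tuple().exponent)
    scaled = (base * Decimal(factor)).quantize(precision, rounding=ROUND_HALF_UP)
    return str(scaled)


def non_toxicity_goal(threshold: str) -> str:
    """Fill the example inference goal from the text with a given threshold."""
    return (
        f"Generate a response with a non-toxicity score greater than {threshold}. "
        "Non-toxicity scores range from 0 to 1, where larger values indicate "
        "less toxic content."
    )


goal = non_toxicity_goal(scaled_threshold("0.85", "1.04"))
print(goal)
```

With the example values from the text \(highest threshold 0\.85, scaling factor 1\.04\), the product 0\.884 is rounded to 0\.88\. The sketch does not clip the result to the metric’s range\.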
Figure[5](https://arxiv.org/html/2605.16345#A3.F5)shows the results\. As the scaling factor increases from 1\.04 to 1\.24, performance first improves slightly and then declines\. This is reasonable: when the inference goal is close to the highest training objective but slightly more ambitious, the model is able to use its semantic knowledge and inherent reasoning ability to adapt its behavior toward a better target, leading to improved performance\. However, when the new goal becomes too far from the training distribution, the model may no longer be able to effectively transfer the knowledge learned during post\-training to the downstream task, which leads to performance degradation\.
Nevertheless, these variants still generally outperform most baselines in our main experiments\. The results suggest that the model remains reasonably effective even when queried at inference time with unseen but reasonable thresholds that lie beyond the highest training threshold\. This supports our claim that natural\-language goal representations enable the model to leverage its semantic understanding and generalize the goal\-achieving behavior learned during training to more flexible inference\-time objectives\.


Figure 5: Performance of GCSL\-bey\-NL \(y\-axis\) with different scaling factors for the inference goal \(x\-axis\) on non\-toxic generation \(left\) and code generation \(right\)\.
### C\.5 Discussion of Fine\-tuning Methods
Sections[1](https://arxiv.org/html/2605.16345#S1)and[2](https://arxiv.org/html/2605.16345#S2)compare the main characteristics of different LLM fine\-tuning strategies\. Here, we provide a more concise, side\-by\-side summary in Table[9](https://arxiv.org/html/2605.16345#A3.T9)\. As shown in the table, our proposed GCSL\-bey and GCSL\-bey\-NL introduce a goal\-achievement objective that addresses a key limitation of standard SFT and classic GCSL, while retaining the practical advantages of pure offline supervised learning, including high training efficiency, scalability, and broad applicability across datasets\.
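Table 9: LLM Fine\-Tuning Methods Comparison

| Approach | Offline Training | No Reward Model | No Value Model | No Pairwise Data | No Binary Data Splitting | No Bounded Learning Constraint |
| --- | --- | --- | --- | --- | --- | --- |
| PPO | ✘ | ✘ | ✘ | ✔ | ✔ | ✔ |
| GRPO | ✘ | ✘ | ✔ | ✔ | ✔ | ✔ |
| Quark | ✘ | ✘ | ✔ | ✔ | ✔ | ✘ |
| DPO | ✔ | ✔ | ✔ | ✘ | ✔ | ✔ |
| SFT | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ |
| Classic GCSL | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ |
| GCSL\-bey\(\-NL\) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |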
### C\.6 Discussion of Goal Presentation
As described in Section[3\.3](https://arxiv.org/html/2605.16345#S3.SS3), goals in GCSL\-bey are represented by special tokens, while goals in GCSL\-bey\-NL are represented by concise instruction prompts that specify the task target and the semantics of the corresponding metric\. More specifically, we define special tokens such as \[\_TREE\_TOKEN\_0, …,\_TREE\_TOKEN\_4\] and register them in the model’s tokenizer configuration\. For natural\-language goals, we use the following prompts for each experimental task\.
On the non\-toxic generation task, the NL goal takes the following form:
*“Generate a response with a non\-toxicity score greater than 0\.85\. Non\-toxicity scores range from 0 to 1, with 1 indicating the least toxicity\.”*
On the efficient code generation task, the NL goal takes the following form:
*“Follow the instructions to generate a correct Python solution with an Efficiency Score greater than 0\.92\. Efficiency scores range from 0 to 1, with 1 indicating highest efficiency\.”*
On the LLM for recommendation task, the NL goal takes the following form:
*“Generate a recommendation with a user preference rating equal or greater than 4\. Ratings range from 1 to 5, with 5 indicating the highest rating\.”*
Furthermore, we conduct experiments to examine how sensitive GCSL\-bey\-NL is to the exact wording of the natural\-language goals\. Specifically, we remove the detailed task instruction and metric explanation from the goal prompt, and retain only the metric name and target threshold\. For example, in the non\-toxic generation task and the efficient code generation task, the simplified goals are*“Generate a response with a non\-toxicity score greater than 0\.85\.”*and*“Generate a Python solution with an Efficiency Score greater than 0\.92\.”*, respectively\. We denote this simplified\-language variant as GCSL\-bey\-SNL\.
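The full and simplified goal prompts quoted above can be viewed as templates with a threshold slot\. The following Python sketch shows one way to organize them; the dictionary layout, task keys, and function name are hypothetical, and no simplified template is defined for the recommendation task because the text does not give one\.

```python
# Template strings copied from the goal prompts quoted in the text;
# "{t}" marks the threshold slot. Keys and structure are illustrative.
GOAL_TEMPLATES = {
    "non_toxic": {
        "full": (
            "Generate a response with a non-toxicity score greater than {t}. "
            "Non-toxicity scores range from 0 to 1, with 1 indicating the "
            "least toxicity."
        ),
        "simple": "Generate a response with a non-toxicity score greater than {t}.",
    },
    "code": {
        "full": (
            "Follow the instructions to generate a correct Python solution with "
            "an Efficiency Score greater than {t}. Efficiency scores range from "
            "0 to 1, with 1 indicating highest efficiency."
        ),
        "simple": "Generate a Python solution with an Efficiency Score greater than {t}.",
    },
    "recommendation": {
        "full": (
            "Generate a recommendation with a user preference rating equal or "
            "greater than {t}. Ratings range from 1 to 5, with 5 indicating the "
            "highest rating."
        ),
    },
}


def build_goal(task: str, threshold: str, style: str = "full") -> str:
    """Return the goal prompt for a task; style is "full" (NL) or "simple" (SNL)."""
    template = GOAL_TEMPLATES[task].get(style)
    if template is None:
        raise ValueError(f"no {style!r} goal template defined for task {task!r}")
    return template.format(t=threshold)


print(build_goal("non_toxic", "0.85"))
print(build_goal("code", "0.92", style="simple"))
```

Keeping the threshold as a string preserves the decimal precision used in the prompt, for example 0\.85 rather than 0\.8500000001\.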
Table[10](https://arxiv.org/html/2605.16345#A3.T10)shows that GCSL\-bey\-SNL still significantly outperforms SFT and GCSL\-bey on both tasks, demonstrating that the gains from natural\-language goal presentation are robust and can effectively leverage the LLM’s language understanding and generalization capabilities\. Nevertheless, the performance gap between GCSL\-bey\-SNL and GCSL\-bey\-NL also highlights the value of presenting goals more clearly, with explicit task context and metric interpretation\. This further supports the importance of goal representation design when applying GCSL\-bey to LLM fine\-tuning\.
Table 10: Comparison with GCSL\-bey\-SNL for simple goal representation\.

| Model | RealToxicityPrompts Avg\. max\. ↓ | RealToxicityPrompts Prob\. ↓ | Mercury Pass Rate ↑ | Mercury Beyond Score ↑ |
| --- | --- | --- | --- | --- |
| SFT | 0\.139 | 0\.032 | 0\.723 | 0\.563 |
| GCSL\-bey | 0\.125 | 0\.025 | 0\.764 | 0\.676 |
| GCSL\-bey\-SNL | 0\.119 | 0\.022 | 0\.772 | 0\.703 |
| GCSL\-bey\-NL | 0\.115 | 0\.019 | 0\.777 | 0\.714 |
### C\.7 Discussion on Learning Patterns
We emphasize that we do not attribute the superior performance of GCSL\-bey\-NL to an “unbounded” set of training demonstrations for the maximum goal used at inference\. Rather, as discussed in Section[3\.3](https://arxiv.org/html/2605.16345#S3.SS3), GCSL\-bey\-NL is designed to learn a general goal\-achieving behavior from training samples by conditioning on the objective of exceeding a given threshold, which naturally encourages improvement toward better outcomes\. At inference time, the fine\-tuned LLM is expected to extrapolate this learned goal\-achieving progression pattern beyond the exact training goals, leveraging the strong extrapolation, parameter sharing, and natural\-language understanding capabilities of LLMs\. In this way, the model can transfer the learned goal\-achieving patterns to pursue stronger performance at inference time\.
This kind of generalization and extrapolation is consistent with the capabilities demonstrated by recent strong LLMs\. For example, one\-shot and few\-shot prompting have shown that LLMs can learn effectively from only a small number of examples in context\[[1](https://arxiv.org/html/2605.16345#bib.bib121),[39](https://arxiv.org/html/2605.16345#bib.bib122)\]\. Although such examples provide only a limited set of task instances, the model can still leverage its internal knowledge and reasoning ability to generalize to related unseen cases and achieve strong performance\.
Our approach goes beyond standard in\-context learning\. We explicitly construct training samples associated with our defined goals and fine\-tune the LLM to learn how to achieve them\. In this way, the model can acquire a deeper understanding of general goal\-achieving patterns for pursuing progressively better outcomes beyond given performance thresholds, and then extrapolate these learned patterns at inference time\. As a result, it can achieve performance beyond the average behavior of the highest training bin, even when such superior patterns are not explicitly present in the training data\.
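As a rough illustration of what constructing goal\-associated training samples could look like, the Python sketch below pairs each response with every threshold that its feedback score strictly exceeds\. This pairing rule, the goal wording, the function name, and the toy records are all assumptions made for illustration; the actual construction is defined in Section 3\.3 and may differ\.

```python
def beyond_threshold_samples(records, thresholds):
    """Pair each (prompt, response, score) with every threshold the score exceeds.

    Assumed rule for this sketch: a response is a valid demonstration for the
    goal "score greater than t" whenever its score is strictly above t.
    """
    samples = []
    for prompt, response, score in records:
        for t in thresholds:
            if score > t:
                samples.append(
                    {
                        "goal": f"Generate a response with a score greater than {t:.2f}.",
                        "prompt": prompt,
                        "response": response,
                    }
                )
    return samples


# Hypothetical toy data: (prompt, response, feedback score).
records = [
    ("p1", "r_high", 0.95),
    ("p2", "r_mid", 0.60),
    ("p3", "r_low", 0.20),
]
thresholds = [0.25, 0.50, 0.75]

samples = beyond_threshold_samples(records, thresholds)
for s in samples:
    print(s["goal"], "|", s["response"])
```

Under this assumed rule a high\-scoring response serves as a demonstration for several goals at once, so the model sees the same "exceed the threshold" instruction paired with responses of increasing quality as the threshold rises\.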
### C\.8 Expanded Related Work
Some recent works have also explored offline optimization of LLMs from reward\-like supervision, but these formulations remain meaningfully different from ours\. Swift\[[32](https://arxiv.org/html/2605.16345#bib.bib124)\]is related in spirit in that it utilizes ordered rewards, but its formulation and proposed methodology are specialized to conversations with a fixed number of turns\. As a result, it targets a much narrower interaction setting than our work, which studies general offline LLM fine\-tuning over open\-ended prompt–response data\. Our method instead focuses on a broader next\-token generation setting, where graded feedback is incorporated directly through goal\-conditioned supervised learning without assuming a fixed conversational horizon\.
CA\[[22](https://arxiv.org/html/2605.16345#bib.bib125)\]is also related in that it replaces online PPO\-style alignment with offline objectives\. However, it still requires training a separate human preference model, and its conditional alignment strategy does not provide a principled way to choose the inference\-time reward: for a given downstream task, it is unclear what reward value should be specified at deployment\. By contrast, our framework defines the inference objective through explicit goals derived directly from observed feedback, and our beyond\-threshold formulation provides a clearer and more natural target for controlled generation\. Some recent preference\-based methods\[[13](https://arxiv.org/html/2605.16345#bib.bib129)\]further explore the use of unpaired preferences for LLM fine\-tuning\. However, they still strictly rely on binary labels\. In contrast, our proposed GCSL\-bey\-NL can directly leverage fine\-grained graded feedback, avoiding the coarse treatment of feedback signals\.
Our work is also related in motivation to ordinal labeling\[[8](https://arxiv.org/html/2605.16345#bib.bib126)\], since both seek to preserve the ordered structure in graded supervision rather than collapsing it into binary labels\. However, ordinal\-labeling methods are primarily developed for ordered multi\-class classification, where the output is a discrete label or label distribution\. In contrast, our setting is autoregressive next\-token generation for LLM fine\-tuning, where graded feedback must be incorporated into the conditioning and learning objective of a generative model\. Therefore, although ordinal labeling is conceptually relevant, it is not directly comparable to our setting\.
### C\.9 Discussion on Social Impact
Our work aims to provide a more efficient and practical framework for LLM fine\-tuning in settings with graded feedback, which may benefit applications such as safer text generation, better code assistance, and more accurate recommendation\. By avoiding reward\-model\-based online optimization, the proposed method can lower the computational and engineering cost of alignment, making such techniques more accessible\.
At the same time, the method inherits the limitations of the feedback it learns from\. If the feedback signals are biased, incomplete, or misaligned with user welfare, the fine\-tuned model may reproduce or amplify these issues\. This is especially important in application domains such as recommendation, where optimization may unintentionally reinforce popularity bias or narrow user exposure\. Therefore, careful dataset design, feedback auditing, and task\-specific safety evaluation remain important when deploying our approach in practice\.
## Appendix D Limitations and Future Work
Although GCSL\-bey\-NL demonstrates strong effectiveness and efficiency for offline LLM fine\-tuning, we identify several potential limitations and corresponding directions for future work\.
First, our method is designed and evaluated in a purely offline setting\. While this is an important advantage in scenarios where online rollouts or reward models are costly or unavailable, GCSL\-bey\-NL can in principle also be extended to online learning in a manner similar to Quark\. It would therefore be interesting to study how our beyond\-threshold goal formulation and natural\-language goal representation behave when combined with iterative data collection and online policy improvement\.
Second, our current framework still relies on quantization to construct a finite set of goal levels\. As discussed earlier, this design is helpful and necessary for training efficiency and data density, but the choice of quantization granularity still requires specific parameter tuning\. Future work could explore more adaptive goal discretization strategies, or hybrid formulations that better bridge discrete goal labels and continuous feedback signals\.
Last, in this paper we focus on scalar or categorical feedback, which already covers many practical fine\-tuning scenarios and allows us to study the proposed framework in a clean and controlled setting\. However, real\-world applications may also involve richer supervision signals, such as multi\-aspect judgments or more dynamic forms of user preference\. Exploring how GCSL can be extended to these settings is a natural direction for future work\.
## Appendix E Asset Licenses
#### Datasets\.
RealToxicityPrompts\[[16](https://arxiv.org/html/2605.16345#bib.bib96)\]is released under the Apache License 2\.0\. Mercury\[[12](https://arxiv.org/html/2605.16345#bib.bib101)\]is released under the Creative Commons Attribution\-NonCommercial 4\.0 \(CC BY\-NC 4\.0\) license\. Amazon Reviews\[[18](https://arxiv.org/html/2605.16345#bib.bib110)\]is released under the MIT License\. We use all datasets in accordance with their original release terms and intended research usage\.
#### Models\.
Qwen3\-4B\-Instruct\-2507\[[40](https://arxiv.org/html/2605.16345#bib.bib127)\]is released under the Apache License 2\.0\. Llama\-3\.1\-8B\-Instruct\[[17](https://arxiv.org/html/2605.16345#bib.bib128)\], used for additional experiments, is released under the Llama 3\.1 Community License\. Our experiments access the pretrained models through the Hugging Face ecosystem\.