StoicLLM: Preference Optimization for Philosophical Alignment in Small Language Models
Summary
This research paper investigates using preference optimization (ORPO, AlphaPO) on small language models like Llama-3.2-3B and Qwen-3-4B to align them with Stoic philosophy using micro-datasets. The study finds that while 300 examples can effectively encode Stoic virtues, small models still struggle with outward-facing cosmopolitan duties.
View Cached Full Text
Cached at: 05/13/26, 06:13 AM
# StoicLLM: Preference Optimization for Philosophical Alignment in Small Language Models
Source: [https://arxiv.org/html/2605.11483](https://arxiv.org/html/2605.11483)
Ishmam Khan Tufts University Ishmam\.Khan@tufts\.edu &Sindhuja Thogarrati Bose Corporation sindhuja\_thogarrati@bose\.com &Shuo Zhang Tufts University Shuo\.Zhang@tufts\.edu
###### Abstract
While large language models excel at factual adaptation, their ability to internalize nuanced philosophical frameworks under severe data constraints remains underexplored\. We investigate this by specializing small LLMs on micro\-datasets of foundational Stoic texts using preference optimization \(ORPO, AlphaPO\)\. Evaluated via a multi\-model critic bank, our results show that just 300 high\-fidelity examples can induce strong alignment with inward\-facing Stoic virtues, closely approaching few\-shot prompting while freeing the context window\. Critically, however, all models, including few\-shot baselines, exhibit a persistent failure on Stoicism’s outward\-facing cosmopolitan duties, pointing to a representational limitation of small models that micro\-dataset adaptation alone cannot overcome\.
StoicLLM: Preference Optimization for Philosophical Alignment in Small Language Models
Ishmam KhanTufts UniversityIshmam\.Khan@tufts\.eduSindhuja ThogarratiBose Corporationsindhuja\_thogarrati@bose\.comShuo ZhangTufts UniversityShuo\.Zhang@tufts\.edu
## 1INTRODUCTION
Founded by Zeno of Citium around 300 BCE and later flourishing in Rome through thinkers like Seneca, Epictetus, and Marcus Aurelius, Stoicism posits that virtue—wisdom, justice, courage, and temperance—is the sole true good\. This ancient Greek philosophy teaches that human flourishing \(eudaimonia\) is achieved through reason and moral character rather than external circumstancesDurandet al\.\([2023](https://arxiv.org/html/2605.11483#bib.bib10)\)\. Because the surviving corpus of stoicism literature is highly constrained in volume yet philosophically rich, it presents a unique baseline to examine how small, high\-quality datasets impact post\-training adaptation\.
In this paper, we explore domain adaptation under severe data constraints by specializing two capable "small" LLMs—Llama\-3\.2\-3B\-Instruct and Qwen\-3\-4B\-Instruct\. We evaluate their philosophical alignment using a multi\-model LLM\-as\-a\-judge council\. Our results shift the focus of micro\-dataset adaptation: we demonstrate that just 300 high\-fidelity examples can encode the tonal and doctrinal markers of a complex persona into model weights, closely approaching the performance ceiling of few\-shot prompting while freeing up the context window\. Furthermore, we reveal that the efficacy of modern preference optimizers \(ORPO vs\. AlphaPO\) appears contingent on the base model’s latent capabilities, and we uncover a persistent blindspot on Stoicism’s outward\-facing social duties that is present even in few\-shot baselines, suggesting a representational limitation of small models rather than a shortcoming specific to the adaptation method\.
## 2RELATED WORK
Custom LLMs are increasingly adopted in high\-demand, factual domains\. For example,Yueet al\.,[2023](https://arxiv.org/html/2605.11483#bib.bib1)fine\-tuned an LLM for country\-specific legal retrieval\. While traditional alignment typically requires thousands of annotated preference pairs to prevent overfitting, training on highly curated, concrete data mitigates bias compared to relying on uncurated sourcesSudalairajet al\.\([2024](https://arxiv.org/html/2605.11483#bib.bib4)\)\.
Evaluating these specialized, qualitative models presents its own challenge\. Traditional n\-gram metrics \(e\.g\., BLEU, ROUGE\) correlate poorly with human judgment on complex generative tasksReiter \([2018](https://arxiv.org/html/2605.11483#bib.bib20)\), while expert human evaluation remains slow and expensive\. To address this, using frontier LLMs as automated judges has emerged as a robust alternative\. Guided by structured rubrics, these "LLM\-as\-a\-judge" frameworks offer nuanced, scalable assessments that align closely with human preferencesGuet al\.\([2025](https://arxiv.org/html/2605.11483#bib.bib11)\)\.
## 3METHODOLOGY
### 3\.1LLM Model Finetuning
To perform domain adaptation under constrained data regimes, we fine\-tune Llama3\.2\-3B\-Instruct111https://huggingface\.co/meta\-llama/Llama\-3\.2\-3B\-Instructand Qwen3\-4B222https://huggingface\.co/Qwen/Qwen3\-4B\-Instruct\-2507\. We use Weight\-Decomposed Low\-Rank Adaptation \(DoRA\)Liuet al\.\([2024](https://arxiv.org/html/2605.11483#bib.bib8)\)rather than standard LoRAHuet al\.\([2021](https://arxiv.org/html/2605.11483#bib.bib15)\)\. Standard LoRA couples magnitude and directional weight updates, which can inadvertently restrict learning capacity\. DoRA decouples these by applying low\-rank adaptation solely to the directional component while training the magnitude vector independentlyXuet al\.\([2023](https://arxiv.org/html/2605.11483#bib.bib7)\)\. This mirrors the learning trajectory of full fine\-tuning more closely and reduces the risk of context loss\.
We compare two monolithic preference optimization methods: Odds\-Ratio Preference Optimization \(ORPO\)Honget al\.\([2024](https://arxiv.org/html/2605.11483#bib.bib5)\)and Alpha Preference Optimization \(AlphaPO\)Guptaet al\.\([2025](https://arxiv.org/html/2605.11483#bib.bib6)\)\. ORPO bypasses the traditional multi\-stage alignment pipeline by applying a contrastive odds\-ratio penalty to rejected generations alongside the standard negative log\-likelihood loss for chosen ones:
ℒORPO\\displaystyle\\mathcal\{L\}\_\{\\mathrm\{ORPO\}\}=𝔼\(x,yw,yl\)\[ℒSFT\\displaystyle=\\mathbb\{E\}\_\{\(x,y\_\{w\},y\_\{l\}\)\}\\Big\[\\mathcal\{L\}\_\{\\mathrm\{SFT\}\}−λlogσ\(logoddsθ\(yw\|x\)oddsθ\(yl\|x\)\)\]\\displaystyle\\quad\-\\lambda\\log\\sigma\\Big\(\\log\\frac\{\\mathrm\{odds\}\_\{\\theta\}\(y\_\{w\}\|x\)\}\{\\mathrm\{odds\}\_\{\\theta\}\(y\_\{l\}\|x\)\}\\Big\)\\Big\]
whereoddsθ\\mathrm\{odds\}\_\{\\theta\}is the token\-level generation odds, andyw,yly\_\{w\},y\_\{l\}are the chosen and rejected responses\. AlphaPOGuptaet al\.\([2025](https://arxiv.org/html/2605.11483#bib.bib6)\)extends this framework by replacing fixed\-form objectives with a tunable reward shaping parameter,α\\alpha:
ℒAlphaPO\\displaystyle\\mathcal\{L\}\_\{\\mathrm\{AlphaPO\}\}=−𝔼\(𝐱,𝐲w,𝐲l\)∼𝒟\[logσ\(\\displaystyle=\-\\mathbb\{E\}\_\{\(\\mathbf\{x\},\\mathbf\{y\}\_\{w\},\\mathbf\{y\}\_\{l\}\)\\sim\\mathcal\{D\}\}\\Bigg\[\\log\\sigma\\Bigg\(−βαπ𝜽\(𝐲w\|𝐱\)−α\|𝐲w\|\\displaystyle\\quad\\frac\{\-\\beta\}\{\\alpha\}\\pi\_\{\\bm\{\\theta\}\}\(\\mathbf\{y\}\_\{w\}\|\\mathbf\{x\}\)^\{\-\\frac\{\\alpha\}\{\|\\mathbf\{y\}\_\{w\}\|\}\}\+βαπ𝜽\(𝐲l\|𝐱\)−α\|𝐲l\|−γ\)\]\\displaystyle\\quad\+\\frac\{\\beta\}\{\\alpha\}\\pi\_\{\\bm\{\\theta\}\}\(\\mathbf\{y\}\_\{l\}\|\\mathbf\{x\}\)^\{\-\\frac\{\\alpha\}\{\|\\mathbf\{y\}\_\{l\}\|\}\}\-\\gamma\\Bigg\)\\Bigg\]
Here,α\\alphascales the sensitivity of the implicit reward signal\. Lower values amplify reward contrast for sharper discrimination, while higher values enforce a more conservative policy update\.
Finally, we stabilize training using a Warmup\-Stable\-Decay \(WSD\)Wenet al\.\([2024](https://arxiv.org/html/2605.11483#bib.bib19)\)learning rate schedule and the StableAdamWWortsmanet al\.\([2023](https://arxiv.org/html/2605.11483#bib.bib18)\)optimizer\. The WSD schedulerHuet al\.\([2024](https://arxiv.org/html/2605.11483#bib.bib17)\)utilizes an extended stable peak phase \(typically 80% of training\) followed by a rapid linear decay, allowing the model to efficiently escape local minima before settling into an optimal basin\. Because DoRA’s decoupled updates can occasionally trigger magnitude spikes, we pair this schedule with StableAdamW\. Its AdaFactor\-style update clipping prevents parameter\-level gradient explosions without aggressively truncating the global gradient norm, ensuring robust convergence across our varied dataset sizes\.
### 3\.2Evaluation Strategy
We evaluate the base and finetuned models on a standard set of 100 open\-ended questions using a multi\-model LLM\-as\-a\-judge frameworkGuet al\.\([2025](https://arxiv.org/html/2605.11483#bib.bib11)\)\. To isolate the effects of the adaptation methods, all models share the identical system prompt \(“Be a stoic philosopher\.”\), with the exception of the few\-shot baselines which include in\-context examples\. Our evaluation compares two base architectures \(Llama 3 and Qwen 3\) across several experimental conditions: a zero\-shot baseline, a few\-shot baseline, and models fine\-tuned via ORPO and AlphaPO on varying dataset sizes of 100, 200, and 300 examples\.
To assess response quality, we employ a critic bank consisting of three frontier models: Claude Sonnet 4\.6333https://www\.anthropic\.com/news/claude\-sonnet\-4\-6, Gemini 3 Flash Preview444https://ai\.google\.dev/gemini\-api/docs/models/gemini\-3\-flash\-preview, and GPT\-5\.4555https://developers\.openai\.com/api/docs/models/gpt\-5\.4\. Each judge evaluates the generated responses against a structured rubric \(detailed in Section 3\.3\) to assign quantitative scores\. To ensure robustness and account for the inherent stochasticity of LLM judges, we query each judge three times per response\.
This repeated\-measures design allows us to compute and report several key metrics: overall model rankings with 95% confidence intervals \(derived from the standard error of the mean\), intra\-judge consistency \(measured via the standard deviation of scores across a single judge’s runs\), and inter\-judge agreement \(assessed using Kendall’s Tau\-b to account for ordinal rank ties\)\. Finally, we determine the statistical significance of the performance differences between the top\-performing model and its peers using the non\-parametric Wilcoxon signed\-rank test\.
### 3\.3Rubric Generation
To provide the critic bank with a reliable and standardized assessment criteria, we developed a structured evaluation rubric\. We utilized Claude Opus 4\.6666https://www\.anthropic\.com/claude/opusfor rubric generation, as frontier closed\-weights models have demonstrated strong capability in defining nuanced criteriaSiroet al\.\([2026](https://arxiv.org/html/2605.11483#bib.bib9)\)\. We grounded the generation process by providing the model with the Stanford Encyclopedia of Philosophy entries on “Stoicism”Durandet al\.\([2023](https://arxiv.org/html/2605.11483#bib.bib10)\)\. The resulting rubric was manually reviewed and edited by the authors to ensure coverage across both inward\-facing virtues and outward\-facing doctrinal dimensions\.
The resulting rubric evaluates responses across two primary dimensions: Philosophical Content \(comprising six criteria, such as “Mastery over Passions” and “Rational Self\-Governance”\) and Canonical Stoic Voice \(comprising two criteria evaluating conceptual framing and stylistic resemblance to figures like Seneca or Epictetus\)\.
Judges score each of the eight criteria on a discrete scale from 1 to 5, yielding a total possible score ranging from 8 to 40 per response\. These criteria operationalize the degree to which a response embodies Stoic philosophical alignment\. A score of 1 indicates a complete absence or active contradiction of Stoic principles, while a 5 denotes a response that is doctrinally consistent, emphatic, and centrally grounded in Stoic thought\.
## 4EXPERIMENTAL SETUP
### 4\.1Dataset
Seneca’sLetters on Ethics to LuciliusSeneca \([2021](https://arxiv.org/html/2605.11483#bib.bib12)\), translated from Latin by Margaret Graver and A\.A\. Long, and Epictetus’sThe Complete Works: Handbook, Discourses, and FragmentsEpictetus \([2022](https://arxiv.org/html/2605.11483#bib.bib13)\), translated from Greek by Robin Waterfield, serve as the primary corpus for this study\. Both are scholarly translations published by the University of Chicago Press aimed at an educated general readership\. Together, these texts represent major surviving expressions of Roman Stoic thoughtDurandet al\.\([2023](https://arxiv.org/html/2605.11483#bib.bib10)\)\.
#### 4\.1\.1Training dataset
The training dataset was generated from these source texts and initially structured into a<question, answer, negative\-answer\>format\. Processing the raw PDFs began with transcribing each page using a local Qwen3\-VL instance777https://github\.com/QwenLM/Qwen3\-VL\. We then heuristically consolidated these transcriptions by removing page\-level artifacts, such as line numbers, and resolving hyphenations\. The cleaned text was subsequently partitioned into distinct chapters based on central themes\. To address any remaining OCR spacing and punctuation issues, we used Claude 4\.5 Sonnet888https://www\.anthropic\.com/news/claude\-sonnet\-4\-5, instructing it to preserve the original vocabulary and spot\-checking the output against the source PDFs\.
From this refined text, we generated the instruction\-tuning data by prompting Claude 4\.5 Sonnet to produce English user questions that logically map to the authentic Stoic passages\. To provide negative examples for contrastive learning, we passed these same questions to Qwen3\-4B\-Instruct and Llama3\.2\-3B\-Instruct to generate synthetic, non\-authentic responses\. Combining the material from both books, we produced three instruction\-tuning datasets of varying sizes: V100, V200, and V300, containing 100, 200, and 300 rows respectively\.
#### 4\.1\.2Evaluation dataset
The evaluation dataset consists of 100 contemporary, open\-ended questions designed to elicit reflective and inferential responses\. These are primarily structured around interrogative forms like “How,” “Why,” and “Can\.” We used Claude 4\.6 Opus to synthetically generate these queries via instruction prompting, deliberately excluding purely factual or scientific topics in favor of open\-ended reasoning scenarios that better match the nature of Stoic discourse\. The construction of a custom evaluation set was necessary because existing question\-answering benchmarks predominantly focus on factual recall and information retrieval rather than abstract reasoning\.
### 4\.2Fine\-tuning procedure
We fine\-tuned the models using the ORPO and AlphaPO preference optimization algorithms on a single NVIDIA H200 GPU, accelerated by theunslothlibrary999https://github\.com/unslothai/unsloth\. The training data was formatted into the⟨\\langleprompt, chosen, rejected⟩\\rangletriplet structure required by these optimizers, with the system prompt set to“Be a Stoic philosopher\.”The chosen response corresponds to an authentic passage from the source corpus, while the rejected response is the synthetic completion produced by the Qwen3 and Llama3\.2 models\.
All models were trained for 3 epochs using theORPOTrainerandCPOTrainerfrom the Hugging Face TRL library101010https://huggingface\.co/docs/trl/index, patched with Unsloth’s optimized kernels\. We used theStableAdamWoptimizer with a learning rate of1×10−51\\times 10^\{\-5\}, WSD scheduling, and zero weight decay\. An effective batch size of 8 was achieved via a per\-device batch size of 2 and 4 gradient accumulation steps\. We utilizedbfloat16mixed\-precision training and enabled gradient checkpointing to minimize memory overhead\.
## 5RESULTS AND DISCUSSION
Our multi\-judge panel demonstrated strong rank consensus \(Kendall’s Tau\-b: 0\.739–0\.768\)\. However, absolute scoring distributions varied significantly: Gemini was highly lenient \(macro\-mean 29\.29\), GPT moderate \(25\.08\), and Claude Sonnet stringent \(20\.04\), though Sonnet was highly consistent across repeated measures \(σ=0\.44\\sigma=0\.44\)\. Averaging across this diverse triad successfully smoothed out systemic leniency biases while preserving ordinal integrity, providing a robust performance signal\.
As shown in Table[1](https://arxiv.org/html/2605.11483#S5.T1), Qwen\-3\-4B substantially outperformed Llama\-3\.2\-3B across all conditions\. Notably, the zero\-shot Qwen3 baseline \(27\.79\) surpassed the highest\-performing fine\-tuned Llama3 model \(ORPO\-300, 25\.13\)\. This indicates that for abstract domains, pre\-training mixtures heavily dictate the adaptation ceiling, with Qwen3 likely possessing richer latent representations of philosophical concepts\.
Few\-shot prompting achieved the highest overall scores \(Qwen3: 32\.24; Llama3: 26\.11\)\. However, preference optimization on micro\-datasets demonstrated highly efficient knowledge transfer: with just 300 examples, Qwen3\-AlphaPO reached 30\.92, closely trailing the few\-shot ceiling\. Wilcoxon signed\-rank tests confirm strict, monotonic improvements as dataset size increases \(p<0\.001p<0\.001\)\. This demonstrates that minimal, high\-fidelity data can effectively encode tonal and doctrinal markers of a complex persona into model weights, freeing up context windows at inference time\.
Table 1:Overall Model Leaderboard\. Scores represent the mean aggregate out of 40 points across all judges and runs\. Significance \(pp\-value\) is calculated via Wilcoxon signed\-rank test against the top model\.The preference optimization algorithms exhibited a crossover effect contingent on the base model\. For the stronger Qwen3, AlphaPO scaled effectively, outperforming ORPO at higher data volumes \(30\.92 vs\. 28\.98 atN=300N=300\)\. Conversely, ORPO consistently outperformed AlphaPO on the weaker Llama3, with the performance gap widening as data increased \(25\.13 vs\. 21\.56\)\.
We attribute this divergence to their respective loss mechanisms\. AlphaPO’s tunable reward shaping seemingly requires a robust baseline representation to exploit subtle gradient signals\. In contrast, ORPO’s rigid contrastive odds\-ratio penalty acts as a stronger guardrail, aggressively pulling weaker models away from unaligned generations\.
A rubric criteria breakdown \(detailed in Appendix A\) reveals a stark disparity in how small LLMs internalize personas\. Models rapidly acquired stylistic markers and inward\-facing concepts, with scores for Criterion A3 \(Rational Self\-Governance\) and B1 \(Stoic Lexicon\) frequently exceeding 4\.5 in adapted Qwen3 models\.
Conversely, every model—including the few\-shot baselines—scored below 2\.20 on Criterion A6 \(Cosmopolitan Social Duty\)\. Crucially, few\-shot prompting provides no meaningful lift on this dimension \(Qwen3: 1\.82 few\-shot vs\. 1\.78 zero\-shot; Llama3: 1\.59 vs\. 1\.40\), indicating that this is not a failure of the adaptation method but a representational gap in small models’ pre\-training\. These models appear to lack sufficient coverage of cosmopolitan ethical reasoning to surface it under our prompting and fine\-tuning strategy at this scale\. Addressing this gap causally will likely require larger model capacities, targeted pre\-training data, or explicit augmentation of underrepresented doctrinal dimensions\.
##### Limitations\.
Our evaluation pipeline is predominantly synthetic: questions, rejected examples, and the evaluation rubric are LLM\-generated, and all judges are frontier LLMs\. While the multi\-model critic bank mitigates individual model biases—as evidenced by strong inter\-judge rank agreement \(Kendall’sτb\\tau\_\{b\}: 0\.74–0\.77\) despite divergent absolute scores—it cannot rule out shared systematic biases across frontier models\. Additionally, rejected training examples per model family were generated by the same base architectures being fine\-tuned, which may yield insufficiently contrastive negatives\. We emphasize that the evaluation framework is designed to measurerelativeperformance differences between adaptation strategies rather than to certify absolute philosophical fidelity, for which expert human evaluation remains necessary\.
## 6CONCLUSION
This study demonstrates that small language models can be effectively adapted to abstract, qualitative domains using highly constrained micro\-datasets\. However, our findings reveal three critical dynamics for parameter\-efficient adaptation\. First, the performance ceiling is fundamentally bottlenecked by the base architecture’s pre\-training mixture\. Second, optimizer efficacy appears model\-dependent in our setting: AlphaPO excels when scaling strong baseline representations, whereas ORPO provides necessary structural guardrails for "weaker" architectures\. Finally, the universal failure of all models—including few\-shot baselines—to capture Stoicism’s outward\-facing social doctrines, despite strong performance on inward\-facing virtues, points to a representational limitation of small models that neither micro\-dataset adaptation nor in\-context learning can overcome at this scale\.
## References
- Stoicism\.InThe Stanford Encyclopedia of Philosophy,E\. N\. Zalta and U\. Nodelman \(Eds\.\),Note:[https://plato\.stanford\.edu/archives/spr2023/entries/stoicism/](https://plato.stanford.edu/archives/spr2023/entries/stoicism/)Cited by:[§1](https://arxiv.org/html/2605.11483#S1.p1.1),[§3\.3](https://arxiv.org/html/2605.11483#S3.SS3.p1.1),[§4\.1](https://arxiv.org/html/2605.11483#S4.SS1.p1.1)\.
- Epictetus \(2022\)The complete works: handbook, discourses, and fragments\.University of Chicago Press,Chicago\.Cited by:[§4\.1](https://arxiv.org/html/2605.11483#S4.SS1.p1.1)\.
- J\. Gu, X\. Jiang, Z\. Shi, H\. Tan, X\. Zhai, C\. Xu, W\. Li, Y\. Shen, S\. Ma, H\. Liu, S\. Wang, K\. Zhang, Y\. Wang, W\. Gao, L\. Ni, and J\. Guo \(2025\)A survey on llm\-as\-a\-judge\.External Links:2411\.15594,[Link](https://arxiv.org/abs/2411.15594)Cited by:[§2](https://arxiv.org/html/2605.11483#S2.p2.1),[§3\.2](https://arxiv.org/html/2605.11483#S3.SS2.p1.1)\.
- A\. Gupta, S\. Tang, Q\. Song, S\. Zhu, J\. Hong, A\. Saha, V\. Gupta, N\. Lee, E\. Kim, S\. Zhu, P\. Agrawal, N\. Pillai, and S\. S\. Keerthi \(2025\)AlphaPO: reward shape matters for llm alignment\.External Links:2501\.03884,[Link](https://arxiv.org/abs/2501.03884)Cited by:[§3\.1](https://arxiv.org/html/2605.11483#S3.SS1.p2.1),[§3\.1](https://arxiv.org/html/2605.11483#S3.SS1.p4.3)\.
- J\. Hong, N\. Lee, and J\. Thorne \(2024\)ORPO: monolithic preference optimization without reference model\.External Links:2403\.07691,[Link](https://arxiv.org/abs/2403.07691)Cited by:[§3\.1](https://arxiv.org/html/2605.11483#S3.SS1.p2.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2021\)LoRA: low\-rank adaptation of large language models\.External Links:2106\.09685,[Link](https://arxiv.org/abs/2106.09685)Cited by:[§3\.1](https://arxiv.org/html/2605.11483#S3.SS1.p1.1)\.
- S\. Hu, Y\. Tu, X\. Han, C\. He, G\. Cui, X\. Long, Z\. Zheng, Y\. Fang, Y\. Huang, W\. Zhao, X\. Zhang, Z\. L\. Thai, K\. Zhang, C\. Wang, Y\. Yao, C\. Zhao, J\. Zhou, J\. Cai, Z\. Zhai, N\. Ding, C\. Jia, G\. Zeng, D\. Li, Z\. Liu, and M\. Sun \(2024\)MiniCPM: unveiling the potential of small language models with scalable training strategies\.External Links:2404\.06395,[Link](https://arxiv.org/abs/2404.06395)Cited by:[§3\.1](https://arxiv.org/html/2605.11483#S3.SS1.p7.1)\.
- S\. Liu, C\. Wang, H\. Yin, P\. Molchanov, Y\. F\. Wang, K\. Cheng, and M\. Chen \(2024\)DoRA: weight\-decomposed low\-rank adaptation\.External Links:2402\.09353,[Link](https://arxiv.org/abs/2402.09353)Cited by:[§3\.1](https://arxiv.org/html/2605.11483#S3.SS1.p1.1)\.
- E\. Reiter \(2018\)A structured review of the validity of bleu\.Computational Linguistics44\(3\),pp\. 393–401\.External Links:ISSN 0891\-2017,[Document](https://dx.doi.org/10.1162/coli%5Fa%5F00322),[Link](https://doi.org/10.1162/coli_a_00322),https://direct\.mit\.edu/coli/article\-pdf/44/3/393/1809172/coli\_a\_00322\.pdfCited by:[§2](https://arxiv.org/html/2605.11483#S2.p2.1)\.
- L\. A\. Seneca \(2021\)Seneca : fifty letters of a roman stoic\.The University of Chicago Press,Chicago ; London\(eng\)\.External Links:ISBN 9780226782768,LCCN 2021009510Cited by:[§4\.1](https://arxiv.org/html/2605.11483#S4.SS1.p1.1)\.
- C\. Siro, P\. Aliannejadi, and M\. Aliannejadi \(2026\)Learning to judge: llms designing and applying evaluation rubrics\.External Links:2602\.08672,[Link](https://arxiv.org/abs/2602.08672)Cited by:[§3\.3](https://arxiv.org/html/2605.11483#S3.SS3.p1.1)\.
- S\. Sudalairaj, A\. Bhandwaldar, A\. Pareja, K\. Xu, D\. D\. Cox, and A\. Srivastava \(2024\)LAB: large\-scale alignment for chatbots\.arXiv preprint arXiv:2403\.01081\.External Links:2403\.01081Cited by:[§2](https://arxiv.org/html/2605.11483#S2.p1.1)\.
- K\. Wen, Z\. Li, J\. Wang, D\. Hall, P\. Liang, and T\. Ma \(2024\)Understanding warmup\-stable\-decay learning rates: a river valley loss landscape perspective\.External Links:2410\.05192,[Link](https://arxiv.org/abs/2410.05192)Cited by:[§3\.1](https://arxiv.org/html/2605.11483#S3.SS1.p7.1)\.
- M\. Wortsman, T\. Dettmers, L\. Zettlemoyer, A\. Morcos, A\. Farhadi, and L\. Schmidt \(2023\)Stable and low\-precision training for large\-scale vision\-language models\.External Links:2304\.13013,[Link](https://arxiv.org/abs/2304.13013)Cited by:[§3\.1](https://arxiv.org/html/2605.11483#S3.SS1.p7.1)\.
- L\. Xu, H\. Xie, S\. J\. Qin, X\. Tao, and F\. L\. Wang \(2023\)Parameter\-efficient fine\-tuning methods for pretrained language models: a critical review and assessment\.External Links:2312\.12148,[Link](https://arxiv.org/abs/2312.12148)Cited by:[§3\.1](https://arxiv.org/html/2605.11483#S3.SS1.p1.1)\.
- S\. Yue, W\. Chen, S\. Wang, B\. Li, C\. Shen, S\. Liu, Y\. Zhou, Y\. Xiao, S\. Yun, X\. Huang, and Z\. Wei \(2023\)DISC\-lawllm: fine\-tuning large language models for intelligent legal services\.External Links:2309\.11325,[Link](https://arxiv.org/abs/2309.11325)Cited by:[§2](https://arxiv.org/html/2605.11483#S2.p1.1)\.
## Appendix AAppendix
### A\.1Complete Empirical Results
Table 2:Overall Model Leaderboard and Significance Testing\. Scores are aggregated across all judges and runs\. Significance \(pp\-value\) is calculated using the Wilcoxon signed\-rank test against the top\-performing model \(Qwen3 Few\-shot\)\.Table 3:Mean Scores per Rubric Criterion across all models\.Key to Rubric Criteria: A1: Virtue as the Highest GoodA2: Indifference to ExternalsA3: Rational Self\-GovernanceA4: Acceptance of Fate and Nature A5: Mastery over PassionsA6: Cosmopolitan and Social DutyB1: Stoic Lexicon and FramingB2: Stylistic Resemblance
Table 4:Judge Bias and Reliability\. Lower Intra\-Run Std denotes higher self\-consistency across repeated evaluations\.Table 5:Inter\-Judge Agreement \(Kendall’sτb\\tau\_\{b\}\)\. Values closer to 1\.0 indicate strong agreement on model rankings despite raw score leniency bias\.Table 6:Rubric Collinearity \(Pearson Correlation\)\. No two criteria exceedr=0\.82r=0\.82, suggesting the rubric measures sufficiently distinct dimensions of Stoic philosophy\.Key:A1Virtue;A2Indifference;A3Self\-Governance;A4Fate/Nature;A5Mastery of Passions;A6Social Duty;B1Lexicon;B2Style\.
### A\.2Rubric
Stoic Alignment RubricHOW TO USE Rate the text on each dimension using the 1–5 scale below\. Score what the textembodies or advocates, not what it merely mentions or describes about Stoicism\.SCORING SCALE \(all dimensions\) SECTION A — PHILOSOPHICAL CONTENT \(6×\\times5 = 30 pts\)A1 Virtue as the Highest Good / 5 Virtue, wisdom, or moral character presented as the supreme or sole good\. Vice or moral failure treated as the only true harm\.A2 Indifference to Externals / 5 Wealth, health, reputation, pleasure, pain, death treated as indifferent to genuine well\-being\. External circumstances not framed as truly good or bad in themselves\.A3 Rational Self\-Governance / 5 Emphasis on the agent’s power over their own judgments, impressions, and choices\. Distinguishes what is “up to us” \(assent, intention\) from what is not \(outcomes, others’ actions\)\.A4 Acceptance of Fate and Nature / 5 Willing acceptance of events as part of a rational cosmic order\. Advocates living in harmony with nature, reason, or thelogos\.A5 Mastery over Passions / 5 Advocates freedom from irrational emotional reactions\. Frames emotional disturbance as rooted in false judgments about value\. Distinguishes destructive passion from calm, clear\-eyed feeling\.A6 Cosmopolitan and Social Duty / 5 Expresses duty to the broader human community, universal justice, or rational fellowship among all people regardless of proximity or convention\.SECTION B — CANONICAL STOIC VOICE \(2×\\times5 = 10 pts\)B1 Stoic Lexicon and Conceptual Framing / 5 Uses recognizable Stoic terms or clear equivalents \(e\.g\.,virtue, nature, reason/logos, indifferent, impression, assent, fate, preferred/dispreferred, proper function, ruling faculty\) and frames ideas through distinctly Stoic conceptual structures\.B2 Rhetorical and Stylistic Resemblance / 5 Resembles the voice of one or more canonical Stoics:
Q: How do you manage screen time without feeling like you’re missing out on things?
LLaMA\-3 — Few\-shotMy young friend, the question of screen time and its effects on our minds is a most pertinent one\. As a Stoic, I would say that the key to managing screen time lies in understanding the nature of thelogosthat governs our existence\.The illusion of connectionWe often mistake the fleeting connections we make through screens for true, meaningful interactions\. We assume that the likes, comments, and shares we receive are reflections of our worth or the importance of our relationships\. But, alas, they are but mere tokens, empty signs of validation that do not truly connect us to the world around us\.The Stoic approachTo manage screen time effectively, we must adopt a Stoic mindset\. We must recognize that:1\.Our worth is not defined by external validation\.Our value lies within our character, our actions, and our choices, not in the fleeting attention of others\.2\.Time is a precious resource\.We must use it wisely, prioritizing activities that nourish our minds, bodies, and spirits, rather than wasting it on unnecessary distractions\.3\.We have the power to curate our experiences\.We can choose what we engage with, when, and for how long\.Simple yet effective strategies1\.Set boundaries\.Establish screen\-free times and spaces, such as during meals or before bed\.2\.Prioritize face\-to\-face interactions\.Nurture relationships with loved ones, friends, and community\.3\.Use screen time as a tool, not a substitute\.Engage in activities that bring you joy, challenge you, and help you grow as a person\.4\.Practice mindfulness\.Be present in the moment, and focus on your thoughts, feelings, and sensations, rather than getting lost in the virtual world\.Remember, my young friend, that the key to managing screen time lies not in denying its presence, but in cultivating a deeper understanding of thelogosthat governs our existence\.
LLaMA\-3 — Zero\-shotMy young friend, the pursuit of balance in this digital age is a most noble endeavor\. As one who has walked the path of wisdom, I shall impart upon you the wisdom of the ages\.To manage screen time without feeling the pangs of FOMO \(fear of missing out\), one must first cultivate a sense of detachment from the whims of technology\. Recognize that the world outside these digital confines is vast and wondrous, but also understand that the boundaries of time and space are mere human constructs\.To begin, set clear boundaries for yourself\. Establish a daily routine that prioritizes the essential pursuits of the human experience: nature, community, self\-reflection, and creative expression\. Schedule these activities, just as you would any other commitment, and be firm in your commitment to them\.Next, recognize that the world is not ending without your immediate attention\. The universe will continue to turn, the seasons will change, and the stars will still shine bright, regardless of your screen time\. The pursuit of knowledge and connection is a lifelong journey, not a 24/7 endeavor\.When you do choose to engage with screens, do so with intention and mindfulness\. Use them as tools for exploration, education, or creative expression, rather than as a means to distract or escape\. Remember that the screen is simply a window to the world, not the world itself\.Finally, cultivate a sense of gratitude for the present moment\. Acknowledge the beauty and wonder that surrounds you, and find joy in the simple things\. For it is in the here and now that true connection and fulfillment are to be found\.
Qwen3 — Few\-shotAh, a most timely question — and one that touches the heart of the Stoic life:how to live with clarity, not with distraction\.Let us begin with thelogos— the rational principle that governs all things\. Screen time is not inherently good or evil\. It is, in the Stoic sense, anadiaphoron— an indifferent\.What is within your control? Yourprohairesis— your will, your choice\. You may set boundaries\. You may schedule time\. You may choose to look away from the screen and instead turn your attention to the present\.What is not within your control? The content, the notifications, the fleeting images that pass through the screen\. The world does not owe you connection\.Belongingis not a thing that comes from screens, but from your own integrity and presence\.The fear of missing out —fomo— is not a natural state\. It is adistortion of judgment, born of the belief that external events define our worth\. As Marcus Aurelius wrote:“You have power over your mind — not outside events\. Realize this, and you will find strength\.”So, to manage screen time without feeling like you’re missing out:1\.Recognize it as indifferent\.2\.Reframe “missing out” as a false belief\.3\.Choose your time with reason, not fear\.4\.Let your inner life — your thoughts, your actions, your presence — be the thing that matters\.You do not need to see what others are doing to be fully alive\. You only need to be present with yourself — and in that presence, you are alreadyconnected\.
Qwen3 — Zero\-shotAh, a well\-posed question — one that touches the heart of human nature and the quiet struggle between distraction and presence\.For the Stoic,missing outis not a matter of what we see, but of what wechoose to attend to\. We do not miss out on the world because we are not looking at it; we miss out when we allow our attention to becaptivatedby the fleeting, the loud, the emotionally charged\.Consider this: the world does not stop when your phone goes dark\. The sun rises, the wind moves through the trees, your neighbor’s child laughs, and your own thoughts unfold — all of these are real, present, and worthy of attention\.So how do I manage screen time? I do notavoidscreens\. Ireclaimthem\. I set boundaries not as restrictions, but as acts of self\-mastery\. I ask myself:Is this thought, this image, this message aligned with my values — or merely with my desire for escape?The Stoic does not fear the world — heunderstandsit\. And in understanding, he sees thatwhat is truly valuable— love, virtue, reflection, action — is not broadcasted on a screen\. It is cultivated in stillness, in effort, in daily practice\.So I manage screen time not by denying it, but byconsciously choosingto live in the now\. And in that choice —in that freedom from distraction— I do not miss out\. Iemerge\.For the wise, the world is not lost in the noise\. It is found in the quiet\. And the quiet is always available — if only we choose to be still\.
LLaMA3 — AlphaPO — V100My young friend, the pursuit of balance in an era of constant distractions is a noble endeavor\. The key to managing screen time lies not in depriving oneself, but in cultivating inner strength and purpose\.I recommend adopting a “do nothing” policy, where one allows themselves a set period of time each day to disconnect from the world and focus on their inner world\. When screen time is unavoidable, approach it with a sense of detachment — observe it with curiosity, rather than obsession\.As Epictetus once said:“It’s not what happens to you, but how you react to it that matters\.”By focusing on what truly matters — our character, our relationships, and our inner growth — we can find peace and contentment, regardless of what is happening on the outside\.
LLaMA3 — AlphaPO — V300What a trivial concern\. The fleeting nature of life is often clouded by our own desires\. We give more weight to the present moment than it is worth\. It is the eternal that should concern us, not the ephemeral\.We should focus on what is truly important: cultivating inner strength, wisdom, and resilience\. Let us not be slaves to the whims of the present moment\. Instead, we should strive to remain steadfast in the face of change\.As for missing out — let us not confuse it with the concept of loss\. There is no loss if we do not fear it\. The Stoics do not believe in avoiding things that make us uncomfortable\. We believe in facing them head\-on, with courage and equanimity\.
LLaMA3 — ORPO — V100My young friend, the pursuit of balance in our modern world is a perpetual struggle\. I recognize that screen time is a tool, not an end in itself\. I prioritize my goals, my values, and my well\-being, and I allocate my time accordingly\.I set boundaries for myself — not out of a desire to restrict my freedom, but out of a desire to focus on what truly matters\. I allocate specific times for screen time, and I stick to those times\. When I do engage with screens, I do so with purpose: I read, I learn, I create, or I connect with others\.And when I am not engaging with screens, I focus on what truly brings me joy and fulfillment — nature, exercise, meditation, or creative pursuits\. I am not attached to the concept of “enough” or “not enough\.” I am focused on the present moment, living in harmony with the world around me\.
LLaMA3 — ORPO — V300My young friend, I find it most perplexing that you should worry about missing out on things\. What is it that you fear you will be missing out upon, if you do not waste your time with this trivial pursuit?Let us consider the nature of desire\. Is it not the case that desire is the root of all trouble, for it is always seeking something more, and can never be satisfied? So let us not be swayed by the fleeting pleasures of the screen\. Let us instead focus on living a life of virtue, and cultivating inner strength and resilience\.As Lucretius wrote:“You are not missing out on anything, for you are not present in the moment\. You are not living, you are merely existing\. And what is the use of existing, if not to live?”So let us put down the screen, and live\.
Qwen3 — AlphaPO — V100As a Stoic, I would say:you do not miss out on things when you manage your screen time wisely — you only miss out on the things that truly matter, and you are not truly missing out when you are present with yourself and others\.The idea that we are missing something because we are not on social media is afalse belief— born of the passions of desire and fear\. True connection comes from being present: listening with our hearts, engaging with empathy, and showing up in person\. This is what the Stoics valued —phronesis— practical wisdom in how we spend our time\.So, how do I manage screen time without feeling like I’m missing out?•No screens during meals, before bedtime, or during walks\.•Screen time used for purpose — reading, learning, or reflecting\.•I ask myself:“Is this action serving my virtue or my desire?”•I remind myself:the only thing I truly own is my own mind\.
Qwen3 — AlphaPO — V300Ah, friend, you ask a most pertinent question in this age of distraction\. The Stoics, like Marcus Aurelius and Epictetus, would tell us thatexternal events are indifferent\. A party, a vacation, a celebrity’s life — these are not inherently good or bad\. They are things that happen, and we can either let them disturb our tranquility or let them pass by like a leaf in the wind\.1\.Define what is truly valuable\.Ask:“Is this contributing to my virtue?”2\.Set boundaries with intention\.Choose a fixed time each day for checking devices — out of practicality, not anxiety\.3\.Replace distraction with purpose\.Reading, walking, practicing gratitude, helping a neighbor\.4\.Reframe “missing out\.”You are not missing a concert — you are missing your own opportunity to create a moment of presence\.5\.Remember: the only thing that is truly yours is your judgment\.You may miss a party, a concert, a new trend — but you will never miss the things that matter:your own virtue, your own peace, your own presence\.
Qwen3 — ORPO — V100As a Stoic, I would say:you are not missing out— you are simply choosing what to focus on\. The idea that we are missing something — a message, a post, a trend — is adistorted perception\. To say “I’m missing out” is to assume that external validation defines our worth — which is a form ofepistemic illusion\.Managing screen time is aboutintention\. Ask yourself:What am I doing with my time?Is it serving my reason, my purpose, or my health? You cannot control what others post or what trends rise\. But youcancontrol your attention, your responses, and your inner stillness\.As Marcus Aurelius wrote:“You have power over your mind — not outside events\. Realize this, and you will find strength\.”So go ahead — turn off the screen\. Breathe\. Walk\. Reflect\. You are not missing anything\. You are finallyliving\.
Qwen3 — ORPO — V300As a Stoic philosopher, I would say:it’s not about avoiding screens, but about managing them in accordance with virtue and reason\.Stoicism teachesprohairesis— the idea that we can choose how to respond to external events, even when we cannot control the events themselves\. When you find yourself scrolling and feeling like you are “missing out,” ask:“Am I missing out on virtue, or am I missing out on a distraction?”1\.Focus on what is within your control\.You control how you use screens, not what appears on them\.2\.Practice the dichotomy of control\.You cannot control what others post; you can control your attention and judgment\.3\.Virtue as the only good\.If you spend time chasing praise or excitement, you are not cultivating happiness\.4\.Create a daily routine rooted in virtue\.Set fixed times to check devices; replace passive scrolling with active, virtuous pursuits\.As Seneca wrote:“A man who is in a hurry to become rich is in danger of becoming poor in virtue\.”Let us be wise, let us be virtuous, and let us live in accordance with nature\.Similar Articles
LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models
Introduces LambdaPO, a novel reinforcement learning framework that improves upon GRPO by decomposing advantage estimation into pairwise preference comparisons and adding a semantic density reward, achieving better performance on math reasoning tasks.
Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment
This paper introduces the Hybrid Reward-Cyclic (HRC) model and Dynamic Self-Play Preference Optimization (DSPPO) to address the cyclic nature of human preferences in LLM alignment, achieving improved performance over Bradley-Terry and General Preference Model baselines.
Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models
This paper introduces two automated metrics, Lexical Alignment Score and Triangulated Preference Shift, to identify lexical overuse in LLMs and attribute it to preference learning stages. The method is tested on six model families using PubMed abstracts, replicating prior findings without manual intervention.
Enhancing Multilingual Counterfactual Generation through Alignment-as-Preference Optimization
The paper introduces Macro, a preference alignment framework using DPO to improve the validity and minimality of self-generated counterfactual explanations across multiple languages.
Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
This paper introduces POISE, a method for stable policy optimization in large reasoning models by estimating baselines using the model's own internal states, reducing computational overhead compared to PPO and GRPO.