When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation
Summary
This paper constructs a large dataset of 263,911 long-form stories annotated with TTCW-based creativity metrics and fine-tunes Qwen3 models to generate structured review reports. It finds that non-reasoning fine-tuning outperforms reasoning-supervised fine-tuning, which suffers from parse failures and irrelevant repetition.
# When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation
Source: [https://arxiv.org/html/2605.20364](https://arxiv.org/html/2605.20364)
Jinlong Liu, Mohammed Bahja, Mark Lee\. School of Computer Science, University of Birmingham, United Kingdom\. jxl2069@student\.bham\.ac\.uk; \{m\.bahja, m\.g\.lee\}@bham\.ac\.uk
###### Abstract
Automatic evaluation of long\-form literary writing remains challenging, as generic LLM\-as\-Judge approaches may not fully capture creativity\-related dimensions such as originality and flexibility\. Although the Torrance Test of Creative Writing \(TTCW\) provides a structured creativity framework, and prior work has demonstrated reference\-based TTCW evaluation at the pairwise level, no large\-scale dataset exists for long\-form TTCW\-based literary review generation\. We address this gap by constructing a dataset of 263,911 long\-form stories, each annotated with scalar scores and meta\-synthesised review comments across 14 TTCW\-based dimensions\. Using this dataset, we fine\-tune Qwen3 models at two scales, 4B and 8B, under two conditions: with and without reasoning content\. Results show that non\-reasoning fine\-tuning achieves stronger and more stable performance, with the best setting reaching an evaluation score of 0\.6820\. Further analysis shows that reasoning\-supervised models are more prone to parse failures, often continuing with irrelevant or repetitive reasoning\-style text rather than completing the required 14\-metric review report\. These results suggest that, for fixed\-format rubric\-based review generation, reasoning supervision is not straightforwardly beneficial, and precise metric\-aligned scoring remains challenging even after task\-specific fine\-tuning\. Code is available at [https://github\.com/Vince\-Liuss/TTCW\-based\-Review](https://github.com/Vince-Liuss/TTCW-based-Review)\.
## 1 Introduction
In recent years, LLM\-as\-Judge has become increasingly common and has shown promising reliability across multiple evaluation settings \(Bonomo et al\., [2025](https://arxiv.org/html/2605.20364#bib.bib9); Chiang and Lee, [2023](https://arxiv.org/html/2605.20364#bib.bib13); Liu et al\., [2023](https://arxiv.org/html/2605.20364#bib.bib3)\)\. At the same time, more benchmarks and resources have been proposed for long\-form literary or narrative evaluation, including ABSEval \(Liang et al\., [2024](https://arxiv.org/html/2605.20364#bib.bib1)\), STORYWARS \(Du and Chilton, [2023](https://arxiv.org/html/2605.20364#bib.bib8)\), and CollabStory \(Venkatraman et al\., [2025](https://arxiv.org/html/2605.20364#bib.bib2)\)\. In parallel, the work cited as 10\.1145/3613904\.3642731 introduces TTCW as a structured creativity\-oriented evaluation framework, and Li et al\. \([2025a](https://arxiv.org/html/2605.20364#bib.bib14)\) further propose a reference\-based TTCW evaluator\.
However, current work still lacks a public dataset for long\-form TTCW\-based review generation in a reference\-free setting\. Existing long\-form literary evaluation resources do not provide TTCW\-grounded review reports as supervision, and existing TTCW\-based evaluation work has not released a large\-scale dataset focused on literary review generation\. This leaves a gap for training judge\-style models that must produce both metric\-aligned scores and review comments under a structured rubric\.
To address this gap, we construct a large TTCW\-based literary review dataset by converting the original TTCW binary questions into scalar rating questions from 1 to 10\. We ask three reviewer models to score all 14 TTCW metrics independently for each story, evaluate reviewer quality through score distribution, discrimination, and metric\-isolation analyses, remove the weakest reviewer, and then use a separate model to synthesise the remaining metric\-wise comments into final review reports\. The resulting dataset contains 263,911 rows of long\-form stories in the 4K–8K word range, each paired with a complete TTCW\-based review report\.
Using this dataset, we further study whether reasoning supervision improves performance on this structured rubric\-based review task\. We compare models fine\-tuned with and without reasoning content, and find that the non\-reasoning setting performs better overall\. The results suggest that, for fixed\-format review generation with explicit score prediction, reasoning content does not improve performance and may instead reduce output stability\. Our main contributions are as follows:
- We construct a large TTCW\-based dataset for long\-form literary review generation by converting the original binary TTCW questions into scalar rating\-based review supervision\.
- We design a dataset construction pipeline that performs metric\-wise reviewer scoring, reviewer\-quality filtering, and comment synthesis to produce complete TTCW review reports for long\-form stories\.
- We provide an empirical comparison of reasoning and non\-reasoning fine\-tuning on this structured review task, and show that non\-reasoning supervision performs better in our setting\.
## 2 Related Work
We review two strands: \(i\) *LLM\-as\-Judge* for open\-ended text evaluation, and \(ii\) *long\-form literature* resources and metrics from the past two years\. We then identify a supervision gap around TTCW\.
### 2\.1 LLM\-as\-Judge
Bonomo et al\. \([2025](https://arxiv.org/html/2605.20364#bib.bib9)\) introduce LiteraryQA, a cleaned subset of NarrativeQA focused on literary works, and conduct a meta\-evaluation showing that n\-gram metrics correlate weakly with human judgments, whereas LLM judges, including small open\-weight models, recover human\-like rankings under a reference\-based protocol\. Chiang and Lee \([2023](https://arxiv.org/html/2605.20364#bib.bib13)\) evaluate “LLM\-as\-evaluator” by giving models the same instructions and items used in human studies; model ratings track expert judgments and remain stable across prompt formatting and sampling choices\. Liu et al\. \([2023](https://arxiv.org/html/2605.20364#bib.bib3)\) propose G\-EVAL, where GPT\-4 as the judge achieves Spearman $\rho = 0.514$ with human judgments on summarization, illustrating that rubric\-prompted judging can reach competitive human alignment\.
Discourse\-level analyses highlight where generic judges may miss narrative structure\. Tian et al\. \([2024](https://arxiv.org/html/2605.20364#bib.bib12)\) analyze story arcs, turning points, and affect; baseline arc identification is near random for mid\-tier models, improves for frontier models, but remains below human; explicitly modeling arcs/affect boosts narrative diversity, suspense, and arousal\.
TTCW operationalizes creativity as a product via 14 binary tests across Fluency, Flexibility, Originality, and Elaboration \(10\.1145/3613904\.3642731\)\. Reported per\-test interrater agreement is moderate, while aggregate agreement is strong, supporting TTCW as a reproducible *set*\-based evaluation protocol\. Recent surveys catalog limitations of LLM\-as\-Judge \(e\.g\., sentiment, token, and context/culture biases\) and outline reliability practices \(e\.g\., pairwise comparisons, bias controls\)\.
### 2\.2 Long\-Form Literature Resources and Metrics
Scripted and collaborative narratives\. Liang et al\. \([2024](https://arxiv.org/html/2605.20364#bib.bib1)\) propose ABSEval with MCScript \(1,500 tasks\) and report closer alignment with human judgments than single\-LLM setups; top systems include strong chat models, and the agentic framework improves agreement with human evaluators\. Du and Chilton \([2023](https://arxiv.org/html/2605.20364#bib.bib8)\) release STORYWARS \(40k human\-authored collaborative stories; 12 task types, 101 tasks\)\. Venkatraman et al\. \([2025](https://arxiv.org/html/2605.20364#bib.bib2)\) build CollabStory \(32k LLM\-coauthored stories\) and show that standard baselines struggle on authorship\-related tasks; fine\-tuned Transformers perform strongly on boundary authorship verification\.
Character cognition and inner thought\. Xu et al\. \([2025](https://arxiv.org/html/2605.20364#bib.bib4)\) present ROLETHINK \(6,058 instances from 76 books\) for character\-thought generation; MIRROR \(memory retrieval \+ chain\-of\-thought\) outperforms baselines\. Gold \(original monologues\) is harder than silver \(expert analyses\), indicating sensitivity to reference fidelity and memory access\.
Long\-context generation and long\-text modeling\. Liu et al\. \([2024](https://arxiv.org/html/2605.20364#bib.bib5)\) introduce LongGenBench for long\-context *generation* \(logical flow\); higher\-baseline models degrade less, and within\-series scaling \(e\.g\., LLaMA\-3, Qwen2\) reduces the performance drop\. Guan et al\. \([2022](https://arxiv.org/html/2605.20364#bib.bib6)\) propose LOT \(Chinese long text\) and show that LongLM pretrained on 120G novels substantially outperforms similar\-sized baselines on understanding and generation, with high agreement for human\-labeled understanding tasks\. Yang and Jin \([2025](https://arxiv.org/html/2605.20364#bib.bib11)\) introduce LongStoryEval \(600 books; avg\. 121k tokens\), derive aspect criteria from reader critiques, and report that NovelCritique aligns best with human ratings overall and on most aspects\.
Stress tests and judge models\. He et al\. \([2023](https://arxiv.org/html/2605.20364#bib.bib7)\) design synthetic stress tests that expose blind spots in model\-based metrics, recommending metric combinations and robustness probes\. Judge models fine\-tuned for evaluation include PandaLM, which recovers a large fraction of GPT\-3\.5/4’s evaluation ability on its testbed and improves base models under its tuning regimen \(wang2024pandalmautomaticevaluationbenchmark\), and Themis, a reference\-free evaluator trained with consistency verification and rating\-oriented preference alignment, reporting the best average performance across six Natural Language Generation \(NLG\) tasks in its setup \(Hu et al\., [2024](https://arxiv.org/html/2605.20364#bib.bib10)\)\. Wu et al\. \(wu2025writingbenchcomprehensivebenchmarkgenerative\) present WritingBench \(six domains, 100 subdomains\) with a fine\-tuned critic; some English prompts request emulation of non\-English figures \(e\.g\., “write a story as Li Bai”\), which can produce translationese\-like prose rather than native English literary writing and complicate cross\-domain comparability\.
Creativity\-targeted evaluation\. Li et al\. \([2025a](https://arxiv.org/html/2605.20364#bib.bib14)\) propose a reference\-based TTCW evaluator and report improved alignment \(pairwise accuracy up to 0\.75\)\. Complementary work on creative reward shaping \(RLAIF\) reports strong agreement with human judgments in constrained creative settings \(e\.g\., Chinese greetings\) and underscores the role of principled judge prompts or reward models \(Wei et al\., [2025](https://arxiv.org/html/2605.20364#bib.bib15)\)\. Bias analyses for complex evaluation contexts find auxiliary\-information\-induced vulnerabilities in LLM judges, motivating explicit robustness checks \(Li et al\., [2025b](https://arxiv.org/html/2605.20364#bib.bib16)\)\.
### 2\.3 Gap: Supervision for TTCW\-Grounded Evaluation
Despite recent progress, there is still no public long\-form dataset with TTCW\-labelled supervision for automated judges\. Existing judge models are typically trained on generic rubrics, while long\-form literary benchmarks do not provide TTCW\-grounded review supervision\. As a result, current evaluation settings may capture surface quality more easily than creativity\-related dimensions such as originality and flexibility\. We address this gap by constructing a TTCW\-based dataset for long\-form literary review generation and using it to study structured rubric\-based evaluation\.
Figure 1: Discrimination score comparison across reviewer models\. Panels: \(a\) Normalised Score Entropy, \(b\) Per\-Metric Score Variance, \(c\) Score Bin Coverage \(higher is better in all three\)\. gpt\-oss\-120b exhibits the strongest criterion\-sensitive score usage, with the highest normalised entropy and per\-metric variance\. Llama\-3\_3\-Nemotron\-Super\-49B\-v1\_5 is intermediate\. Qwen3\-Next\-80B\-A3B\-Instruct, despite full bin coverage, has extremely low entropy, indicating strong score concentration and weaker practical discrimination\.

Table 1: Full shared system prompt and the full Fluency1 prompt used in dataset construction\. The metric description in Fluency1 follows the original TTCW criterion wording from 10\.1145/3613904\.3642731, while the scoring instruction and output format are adapted for our review\-generation setting\.

Figure 2: Compact group\-level inter\-metric correlation comparison across reviewer models\. The original 14 TTCW metrics are aggregated into four TTCW dimensions: Fluency, Flexibility, Originality, and Elaboration\. Diagonal cells report mean within\-dimension off\-diagonal Pearson correlation, while off\-diagonal cells report mean cross\-dimension Pearson correlation\. Qwen3\-80B shows comparatively low group\-level correlations, but this does not indicate stronger reviewer quality in our setting; combined with its weak score\-distribution behaviour and strong score concentration, it suggests limited practical discrimination across samples\. We therefore exclude Qwen3\-80B from the final synthesis stage and retain GPT\-OSS\-120B and Nemotron\-49B\. Full 14\-metric correlation heatmaps are provided in Figure [4](https://arxiv.org/html/2605.20364#A1.F4)\.
## 3 Dataset Preparation
We first reformulate the original TTCW metric questions from binary judgments to scalar ratings on a 1–10 scale, embedding explicit score anchors in the system instruction so that all reviewer models operate under the same rubric, as shown in Table [1](https://arxiv.org/html/2605.20364#S2.T1)\. To minimise cross\-metric interference and reduce the risk of a reviewer collapsing multiple criteria into a single latent judgment, we evaluate the 14 metrics independently rather than jointly; the full metric list is provided in the Appendix\.
We select three recent and capable reviewer models: Llama\-3\_3\-Nemotron\-Super\-49B\-v1\_5 \(bercovich2025llamanemotronefficientreasoningmodels\), Qwen3\-Next\-80B\-A3B\-Instruct \(qwen3technicalreport\) \(non\-reasoning mode\), and gpt\-oss\-120b \(openai2025gptoss120bgptoss20bmodel\)\. For source fiction, we use the WritingPrompts corpus \(Fan et al\., [2018](https://arxiv.org/html/2605.20364#bib.bib17)\)\. Because many stories fall below the length threshold suitable for long\-form evaluation, we remove samples exceeding 8K words and use Gemma\-3\-27b\-it \(geminiteam2025geminifamilyhighlycapable\) to regenerate stories from the original prompts, treating human\-written stories as references, to obtain samples in the 4K–8K word range\. Each reviewer model then evaluates every story one metric at a time, and GLM\-4\.5\-Air \(5team2025glm45agenticreasoningcoding\) serves as a meta\-synthesis model that consolidates the per\-metric reviews into a single coherent review per story\. All models are run with temperature = 0\.
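The length constraint above can be sketched as a simple word\-count filter. This is a minimal illustration only: the paper does not say how words are counted or whether the 4K and 8K bounds are inclusive, so the whitespace tokenisation, the inclusive bounds, and the names `word_count` and `in_target_range` are all assumptions.

```python
def word_count(text: str) -> int:
    """Rough word count: number of whitespace-separated tokens (assumed)."""
    return len(text.split())


def in_target_range(text: str, low: int = 4000, high: int = 8000) -> bool:
    """True if the story falls in the 4K-8K word range (bounds assumed inclusive)."""
    return low <= word_count(text) <= high


# Toy stories built from a repeated placeholder word, not real data.
stories = {
    "too_short": "word " * 1200,
    "in_range": "word " * 5000,
    "too_long": "word " * 9000,
}

# Only stories inside the target range would be kept as-is; the others
# would be candidates for removal or regeneration.
kept = [name for name, text in stories.items() if in_target_range(text)]
print(kept)
```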
Before finalising the dataset, we assess reviewer suitability using three diagnostics: score distribution, which detects score concentration or ceiling effects; discrimination score, which measures whether a reviewer uses the score scale sufficiently to distinguish among stories; and metric isolation, which examines whether the 14 TTCW metrics are treated as distinct criteria rather than collapsed into a single latent quality judgement\. The results are shown in Fig\. [1](https://arxiv.org/html/2605.20364#S2.F1), Fig\. [3](https://arxiv.org/html/2605.20364#A1.F3), Fig\. [2](https://arxiv.org/html/2605.20364#S2.F2), and Fig\. [4](https://arxiv.org/html/2605.20364#A1.F4)\.
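To make the discrimination diagnostics concrete, the sketch below computes normalised score entropy, score variance, and score bin coverage for one metric on a 1 to 10 scale. The exact definitions used in the paper are not given in this section, so the formulas here (Shannon entropy divided by log 10, population variance, and the fraction of the ten score values that appear at least once) are plausible assumptions, and the toy score lists are invented for illustration.

```python
import math
from collections import Counter

SCALE = range(1, 11)  # the 1-10 rating scale


def normalised_entropy(scores):
    """Shannon entropy of the score histogram, divided by log(10) (assumed definition)."""
    n = len(scores)
    counts = Counter(scores)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(SCALE))


def variance(scores):
    """Population variance of the scores."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)


def bin_coverage(scores):
    """Fraction of the ten score values used at least once (assumed definition)."""
    return len(set(scores) & set(SCALE)) / len(SCALE)


# Toy reviewer A: uses every score equally often.
spread = [s for s in SCALE for _ in range(100)]
# Toy reviewer B: touches every bin once but almost always answers 7.
concentrated = [7] * 991 + [s for s in SCALE if s != 7]

for name, scores in [("spread", spread), ("concentrated", concentrated)]:
    print(name, normalised_entropy(scores), variance(scores), bin_coverage(scores))
```

The second toy reviewer shows why bin coverage alone can mislead: coverage is full, yet entropy stays close to zero because nearly all mass sits on one score.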
Compared with gpt\-oss\-120b and Llama\-3\_3\-Nemotron\-Super\-49B\-v1\_5, Qwen3\-Next\-80B\-A3B\-Instruct shows weaker reviewer suitability\. Fig\. [1](https://arxiv.org/html/2605.20364#S2.F1) and Fig\. [3](https://arxiv.org/html/2605.20364#A1.F3) show strong score concentration, low score entropy, and limited score variation, indicating weak practical discrimination\. Although Fig\. [2](https://arxiv.org/html/2605.20364#S2.F2) shows lower inter\-metric correlations for Qwen3\-80B, this is not sufficient evidence of better metric isolation, because the model also lacks meaningful variation in score usage\. We therefore interpret its low correlation pattern together with its score\-distribution behaviour as evidence of unreliable reviewer performance\. Full 14\-metric heatmaps are shown in Fig\. [4](https://arxiv.org/html/2605.20364#A1.F4)\.
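The group\-level correlation summary described for Fig. 2 can be sketched as follows: average the pairwise Pearson correlations between metrics of the same dimension (diagonal cells) or of two different dimensions (off\-diagonal cells). This is a minimal sketch, not the authors' code. The metric names `F1`, `F2`, `O1`, `O2`, the two\-dimension grouping, and the score lists are hypothetical stand\-ins for the 14 metrics and four dimensions.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def group_correlation(scores, groups, g1, g2):
    """Mean pairwise correlation within one dimension (g1 == g2) or across two."""
    if g1 == g2:
        pairs = list(combinations(groups[g1], 2))  # off-diagonal pairs only
    else:
        pairs = [(a, b) for a in groups[g1] for b in groups[g2]]
    return sum(pearson(scores[a], scores[b]) for a, b in pairs) / len(pairs)


# Hypothetical per-story scores for four toy metrics.
scores = {
    "F1": [1, 2, 3, 4], "F2": [2, 4, 6, 8],
    "O1": [4, 3, 2, 1], "O2": [8, 6, 4, 2],
}
groups = {"Fluency": ["F1", "F2"], "Originality": ["O1", "O2"]}

within = group_correlation(scores, groups, "Fluency", "Fluency")
across = group_correlation(scores, groups, "Fluency", "Originality")
print(within, across)
```

Note that a metric with no score variation makes the correlation undefined (here it raises `ZeroDivisionError`), which is one reason low correlations from a reviewer with highly concentrated scores are hard to read as good metric isolation.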
We therefore exclude Qwen3\-Next\-80B\-A3B\-Instruct from the synthesis stage and retain the remaining two reviewer models\. The final dataset contains 263,911 rows and is designed for long\-context literary review generation: each input story is in the 4K–8K word range, and each output contains a complete TTCW\-based review report\. This makes the task substantially longer than standard short\-form evaluation settings\. In fine\-tuning, the non\-reasoning version uses a maximum context length of 16,384 tokens, while the reasoning version requires 32,768 tokens because it additionally includes reviewer\-style reasoning traces\. The dataset is available at [https://huggingface\.co/datasets/VibrantVista/TTCW\-Based\-Review](https://huggingface.co/datasets/VibrantVista/TTCW-Based-Review)\.
#### Sample Validation\.
To further assess the quality of the meta\-synthesised reviews, we conduct an automatic sample\-level validation using NVIDIA\-Nemotron\-3\-Super\-120B\-A12B \(nvidia\_nemotron\_3\_2025\), a recent reasoning\-oriented judge model\. We randomly sample 50 stories and pair each story with its 14 metric\-specific review comments, resulting in 700 story–metric review pairs\. For each pair, we ask the validation model three binary questions commonly used to assess NLG quality:
1. Faithfulness: Does the review only make claims that are consistent with the story’s actual content, without introducing details, events, or characterisations not present in the story?
2. Coherence: Is the review logically organised and internally consistent, with no contradictory statements?
3. Relevance: Does the review focus on specific aspects of this story rather than making observations that could apply to almost any story?
The results are reported in Table [2](https://arxiv.org/html/2605.20364#S3.T2)\. The validation results show that the meta\-synthesised reviews are highly relevant to the corresponding stories and moderately faithful to the story content\. This indicates that most reviews focus on story\-specific evidence rather than producing generic literary comments\. However, the coherence pass rate is much lower\. This does not necessarily mean that the reviews are irrelevant or ungrounded; instead, it reflects the difficulty of the meta\-synthesis step\. The input to synthesis contains reviewer outputs from multiple models, often with step\-by\-step reasoning and overlapping or partially inconsistent observations\. The synthesis model must remove unnecessary reasoning traces, select the most useful evidence, and organise the remaining points into a concise metric\-level comment\. This is a complex transformation, and the low coherence score suggests that current models still struggle to consistently produce well\-organised synthesised reviews in this setting\.
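The validation bookkeeping described above (50 stories times 14 metrics, three binary questions per pair) can be sketched as below. The record layout and the function name `pass_rates` are assumptions, and the four toy judgements are invented purely to exercise the code; they are not results from the paper.

```python
from itertools import product

N_STORIES, N_METRICS = 50, 14
CRITERIA = ("faithfulness", "coherence", "relevance")

# Every sampled story is paired with each of its 14 metric-specific reviews.
pairs = list(product(range(N_STORIES), range(N_METRICS)))
print(len(pairs))  # 700 story-metric review pairs


def pass_rates(judgements):
    """Fraction of pairs answered 'yes' for each binary criterion."""
    n = len(judgements)
    return {c: sum(j[c] for j in judgements) / n for c in CRITERIA}


# Toy judge outputs (hypothetical), one dict per story-metric pair.
toy = [
    {"faithfulness": True, "coherence": True, "relevance": True},
    {"faithfulness": True, "coherence": False, "relevance": True},
    {"faithfulness": False, "coherence": False, "relevance": True},
    {"faithfulness": True, "coherence": True, "relevance": True},
]
rates = pass_rates(toy)
print(rates)
```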
Table 2: Sample\-level validation of meta\-synthesised review comments\. We randomly sample 50 stories and evaluate all 14 metric\-specific reviews for each story, yielding 700 story–metric review pairs\. Each pair is judged for faithfulness, coherence, and relevance\.

Table 3: Full comparison of reasoning and non\-reasoning fine\-tuning across Qwen3\-8B and Qwen3\-4B\. The upper block reports all four decoding settings: models trained without reasoning content are evaluated with thinking disabled and enabled, and models trained with reasoning content are also evaluated with thinking enabled and disabled\. These cross\-mode results test whether the decoding mode alone can recover performance when it differs from the training setting\. The lower block reports per\-metric score accuracy and BERTScore F1 for the main matched comparison: non\-reasoning training with thinking disabled versus reasoning training with thinking enabled\. Overall, non\-reasoning supervision remains stronger and more stable, while reasoning\-supervised models show lower score accuracy and reduced parse reliability, especially under cross\-mode decoding\.
## 4 Experiment
This experiment investigates whether reasoning content improves model performance on the literary review generation task\. To incorporate reasoning supervision, we use the raw outputs of the two retained reviewer models as reasoning traces, allowing the target model to learn from multiple reviewer\-style reasoning processes\. Due to computational constraints, we restrict fine\-tuning to models of at most 8B parameters with LoRA, and choose Qwen3\-8B \(qwen3technicalreport\) as the main base, given its broad community adoption and native support for both reasoning and non\-reasoning modes; the same two conditions are also applied to Qwen3\-4B, as reported in Table 3\. For each base, we train two variants: one fine\-tuned without reasoning content and one fine\-tuned with reasoning content, and evaluate both under identical decoding conditions \(temperature = 0\)\. The training configuration is: learning rate $2 \times 10^{-4}$, lora\_r $= 64$, and lora\_alpha $= 128$\. All experiments are conducted on a node equipped with four NVIDIA L40S GPUs \(48GB VRAM each\), two AMD EPYC 9334 32\-Core processors, and 1TB RAM\.
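As a rough illustration of what the reported LoRA hyperparameters imply, the sketch below records them in a small dataclass and derives two quantities: the update scaling factor, which in the standard LoRA formulation is `lora_alpha / lora_r`, and the number of trainable adapter parameters for a single weight matrix. The class name `LoraSetup`, the helper `adapter_params`, and the 4096 by 4096 layer shape are hypothetical; the paper does not state which modules are adapted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSetup:
    """Hyperparameters reported in the paper; everything else is illustrative."""
    learning_rate: float = 2e-4
    lora_r: int = 64
    lora_alpha: int = 128
    temperature: float = 0.0  # decoding temperature used at evaluation

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update B @ A by alpha / r.
        return self.lora_alpha / self.lora_r

    def adapter_params(self, d_in: int, d_out: int) -> int:
        # A has shape (r, d_in) and B has shape (d_out, r).
        return self.lora_r * (d_in + d_out)


cfg = LoraSetup()
d = 4096  # hypothetical square projection size, for illustration only
trainable = cfg.adapter_params(d, d)
fraction = trainable / (d * d)
print(cfg.scaling, trainable, fraction)
```

Under these assumptions the adapter for one such matrix trains about 3% of that matrix's parameters, which is consistent with the later remark that LoRA updates only a small fraction of the model.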
We evaluate model outputs along two axes: stability and performance\. For stability, we use the parse rate $p \in [0,1]$, which measures whether the model can generate a complete report in the required format\. If any of the 14 metric reports is missing or malformed, the whole output is treated as invalid\.
For performance, we evaluate both score prediction and review text generation\. Score quality is measured by the mean absolute error \(MAE\) between predicted and reference scores:

$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right|,$$

where $\hat{y}_i$ is the predicted score and $y_i$ is the reference score\. We then transform MAE into a bounded score:

$$s_{\text{MAE}} = e^{-\text{MAE}},$$

so that lower MAE gives a value closer to 1\.
Review text quality is measured using BERTScore F1, denoted as:

$$s_{\text{BERTScore-F1}} \in [0,1].$$

This measures the semantic similarity between generated and reference review comments\.
Finally, we combine parse stability, score quality, and review similarity into the final evaluation score:

$$S_{\text{eval}} = p \cdot \left( 0.5\, s_{\text{MAE}} + 0.5\, s_{\text{BERTScore-F1}} \right).$$

This formulation rewards models only when they are both structurally parseable and semantically close to the reference outputs\.
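The evaluation score defined above can be computed directly from its three ingredients. The sketch below follows the stated formulas; the function names and the toy inputs are illustrative, and it assumes the BERTScore F1 value has already been computed elsewhere and that MAE is averaged over all compared scores, since the section does not specify what N ranges over.

```python
import math


def mae(pred, ref):
    """Mean absolute error between predicted and reference scores."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def eval_score(parse_rate, mae_value, bertscore_f1):
    """S_eval = p * (0.5 * exp(-MAE) + 0.5 * BERTScore-F1)."""
    s_mae = math.exp(-mae_value)
    return parse_rate * (0.5 * s_mae + 0.5 * bertscore_f1)


# Toy scores for three metrics (hypothetical values).
pred, ref = [7, 5, 8], [6, 5, 10]
error = mae(pred, ref)  # (1 + 0 + 2) / 3

# A fully parseable model with a hypothetical BERTScore F1 of 0.9.
score = eval_score(parse_rate=1.0, mae_value=error, bertscore_f1=0.9)
print(error, score)
```

Because the parse rate multiplies the whole bracket, an output set that never parses scores zero regardless of how good its text is, which is what makes the metric sensitive to format failures.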
The results are reported in Table [3](https://arxiv.org/html/2605.20364#S3.T3)\. The clearest difference between settings is parse stability\. Models fine\-tuned without reasoning content are highly reliable across both model scales and decoding modes: Qwen3\-8B achieves a parse rate of 1\.0000 with both thinking disabled and enabled, while Qwen3\-4B remains close to perfect with parse rates of 0\.9998 and 0\.9996\. This shows that the fixed 14\-metric report format is learnable when the model is trained directly on the final review output\.
By contrast, reasoning\-supervised models are substantially less stable when thinking is enabled\. Their parse rates drop to 0\.8708 for Qwen3\-8B and 0\.8592 for Qwen3\-4B, indicating that reasoning traces make the model more likely to violate the required output structure\. The cross\-mode results further clarify this effect\. For Qwen3\-8B, disabling thinking after reasoning fine\-tuning improves the parse rate from 0\.8708 to 0\.9880 and raises the final evaluation score from 0\.5723 to 0\.6600\. This suggests that part of the instability comes from the generated thinking content itself\. However, the same intervention is harmful for Qwen3\-4B, where the parse rate drops from 0\.8592 to 0\.4270, suggesting that the smaller model becomes more dependent on the reasoning\-style generation pattern learned during fine\-tuning\.
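Since the final score is the parse rate times a quality term, dividing the reported score by the reported parse rate gives a rough view of how much of the Qwen3\-8B improvement comes from parsing alone. The sketch below does this arithmetic with the four figures quoted above; it assumes the published numbers follow the product form exactly, and the results are approximate because the inputs are rounded to four decimals.

```python
def quality_term(s_eval: float, parse_rate: float) -> float:
    """Recover 0.5 * s_MAE + 0.5 * s_BERTScore-F1 from S_eval = p * (...)."""
    return s_eval / parse_rate


# Reasoning-supervised Qwen3-8B, figures quoted in the text.
thinking_on = quality_term(0.5723, 0.8708)
thinking_off = quality_term(0.6600, 0.9880)

parse_gain = 0.9880 / 0.8708       # relative change in parse rate
quality_gain = thinking_off / thinking_on  # relative change in quality term

print(round(thinking_on, 4), round(thinking_off, 4))
print(round(parse_gain, 3), round(quality_gain, 3))
```

Under this reading the quality term moves only slightly (roughly 0.657 to 0.668), while the parse rate rises by about 13%, so most of the score improvement is attributable to more outputs being parseable.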
Manual inspection supports this interpretation\. Failed outputs are usually not caused by a single malformed metric field\. Instead, when the reasoning process fails, the model often fails at the sequence level: it may leak reasoning\-style content into the final answer, introduce unrelated intermediate text, repeat early report sections, or stop before producing the full 14\-metric review\. Since our parser requires every metric report to be present and correctly formatted, these sequence\-level failures directly explain the lower parse rates of reasoning\-supervised models\.
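The all\-or\-nothing parsing rule can be illustrated with a small strict parser. The report layout used here (`Name: score=N; comment=...`, one metric per line) and the metric names other than Fluency1 are hypothetical, since the actual output format and parser are not shown in this section; the sketch only demonstrates how leaked reasoning, repetition, and truncation each invalidate a whole output.

```python
import re

# Hypothetical metric names; only "Fluency1" appears in the text.
METRICS = (
    [f"Fluency{i}" for i in range(1, 6)]
    + [f"Flexibility{i}" for i in range(1, 4)]
    + [f"Originality{i}" for i in range(1, 4)]
    + [f"Elaboration{i}" for i in range(1, 4)]
)
LINE = re.compile(r"^(\w+): score=(\d+); comment=(.+)$")


def parse_report(text):
    """Return {metric: (score, comment)}, or None if the report is invalid."""
    found = {}
    for line in text.strip().splitlines():
        match = LINE.match(line.strip())
        if match is None:
            return None  # stray text, e.g. leaked reasoning
        name, score, comment = match.group(1), int(match.group(2)), match.group(3)
        if name not in METRICS or name in found or not 1 <= score <= 10:
            return None  # unknown metric, repeated section, or bad score
        found[name] = (score, comment)
    return found if len(found) == len(METRICS) else None  # truncated report


def parse_rate(outputs):
    """Fraction of outputs that parse into a complete 14-metric report."""
    return sum(parse_report(o) is not None for o in outputs) / len(outputs)


good = "\n".join(f"{m}: score=7; comment=ok" for m in METRICS)
truncated = "\n".join(good.splitlines()[:10])
leaked = "Let me reconsider the story first...\n" + good
repeated = good + "\n" + good.splitlines()[0]

print(parse_rate([good, truncated, leaked, repeated]))
```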
The score accuracy gap is smaller than the parse\-rate gap but remains consistent\. The best score accuracy is obtained by Qwen3\-8B without reasoning content, while reasoning\-supervised variants remain lower under both decoding modes\. We attribute this weaker score prediction mainly to the increased difficulty of the reasoning setting: it requires the model to process longer sequences, learn reviewer\-style reasoning traces, and still output calibrated rubric\-aligned scores\. LoRA may further limit this adaptation because only a small fraction of parameters is updated, but the main source of parse instability appears to be the mismatch between reasoning\-style generation and strict fixed\-format report generation\.
## 5 Conclusion
We construct a large\-scale TTCW\-based literary review dataset with scalar metric scores and metric\-wise review comments for long\-form stories\. Using this dataset, we study whether reasoning supervision improves structured review report generation\. Our results show that non\-reasoning fine\-tuning consistently achieves stronger and more stable performance across both Qwen3\-8B and Qwen3\-4B\. In particular, reasoning\-supervised models are more prone to parse failures caused by format leakage, repetitive generation, and incomplete metric reports\. These findings suggest that reasoning traces are not automatically beneficial for fixed\-format rubric\-based evaluation, especially when the model must produce precise scores and complete structured outputs under long\-context constraints\. Future work should test whether higher\-quality reasoning traces, larger models, or stronger adaptation methods beyond LoRA can make reasoning supervision more effective for this setting\.
## Limitations
This work has several limitations\. First, dataset construction does not involve human annotators, so the supervision signal is entirely model\-generated and may contain bias, scoring noise, or synthesis errors\. Second, all experiments are conducted on the Qwen3 model family, which limits the generalisability of our findings to models with different architectures or reasoning behaviours\. Third, we only study 4B and 8B models, so it remains unclear whether larger models with stronger long\-context and instruction\-following capabilities would show the same pattern\. Finally, we use LoRA\-based parameter\-efficient fine\-tuning rather than full fine\-tuning, which may constrain fine\-grained rubric\-based score learning\.
## References
- LiteraryQA: towards effective evaluation of long\-document narrative QA\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 34074–34095\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1729/),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2605.20364#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.20364#S2.SS1.p1.1)\.
- C\. Chiang and H\. Lee \(2023\)Can large language models be an alternative to human evaluations?\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 15607–15631\.External Links:[Link](https://aclanthology.org/2023.acl-long.870/),[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.870)Cited by:[§1](https://arxiv.org/html/2605.20364#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.20364#S2.SS1.p1.1)\.
- Y\. Du and L\. Chilton \(2023\)StoryWars: a dataset and instruction tuning baselines for collaborative story understanding and generation\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 3044–3062\.External Links:[Link](https://aclanthology.org/2023.acl-long.171/),[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.171)Cited by:[§1](https://arxiv.org/html/2605.20364#S1.p1.1),[§2\.2](https://arxiv.org/html/2605.20364#S2.SS2.p1.1)\.
- A\. Fan, M\. Lewis, and Y\. Dauphin \(2018\)Hierarchical neural story generation\.InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),I\. Gurevych and Y\. Miyao \(Eds\.\),Melbourne, Australia,pp\. 889–898\.External Links:[Link](https://aclanthology.org/P18-1082/),[Document](https://dx.doi.org/10.18653/v1/P18-1082)Cited by:[§3](https://arxiv.org/html/2605.20364#S3.p2.1)\.
- J\. Guan, Z\. Feng, Y\. Chen, R\. He, X\. Mao, C\. Fan, and M\. Huang \(2022\) LOT: a story\-centric benchmark for evaluating Chinese long text understanding and generation\. Transactions of the Association for Computational Linguistics 10, pp\. 434–451\. External Links: [Link](https://aclanthology.org/2022.tacl-1.25/), [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00469) Cited by: [§2\.2](https://arxiv.org/html/2605.20364#S2.SS2.p3.1)\.
- T\. He, J\. Zhang, T\. Wang, S\. Kumar, K\. Cho, J\. Glass, and Y\. Tsvetkov \(2023\) On the blind spots of model\-based evaluation metrics for text generation\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\), Toronto, Canada, pp\. 12067–12097\. External Links: [Link](https://aclanthology.org/2023.acl-long.674/), [Document](https://dx.doi.org/10.18653/v1/2023.acl-long.674) Cited by: [§2\.2](https://arxiv.org/html/2605.20364#S2.SS2.p4.1)\.
- X\. Hu, L\. Lin, M\. Gao, X\. Yin, and X\. Wan \(2024\) Themis: a reference\-free NLG evaluation language model with flexibility and interpretability\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 15924–15951\. External Links: [Link](https://aclanthology.org/2024.emnlp-main.891/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.891) Cited by: [§2\.2](https://arxiv.org/html/2605.20364#S2.SS2.p4.1)\.
- R\. Li, C\. Zhu, B\. Xu, X\. Wang, and Z\. Mao \(2025a\) Automated creativity evaluation for large language models: a reference\-based approach\. In Findings of the Association for Computational Linguistics: EMNLP 2025, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 21475–21488\. External Links: [Link](https://aclanthology.org/2025.findings-emnlp.1171/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.1171), ISBN 979\-8\-89176\-335\-7 Cited by: [§1](https://arxiv.org/html/2605.20364#S1.p1.1), [§2\.2](https://arxiv.org/html/2605.20364#S2.SS2.p5.1)\.
- W\. Li, X\. Wang, S\. Yuan, R\. Xu, J\. Chen, Q\. Dong, Y\. Xiao, and D\. Yang \(2025b\) Curse of knowledge: your guidance and provided knowledge are biasing LLM judges in complex evaluation\. In Findings of the Association for Computational Linguistics: EMNLP 2025, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 14900–14924\. External Links: [Link](https://aclanthology.org/2025.findings-emnlp.805/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.805), ISBN 979\-8\-89176\-335\-7 Cited by: [§2\.2](https://arxiv.org/html/2605.20364#S2.SS2.p5.1)\.
- S\. Liang, B\. Zhang, J\. Zhao, and K\. Liu \(2024\) ABSEval: an agent\-based framework for script evaluation\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 12418–12434\. External Links: [Link](https://aclanthology.org/2024.emnlp-main.691/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.691) Cited by: [§1](https://arxiv.org/html/2605.20364#S1.p1.1), [§2\.2](https://arxiv.org/html/2605.20364#S2.SS2.p1.1)\.
- X\. Liu, P\. Dong, X\. Hu, and X\. Chu \(2024\) LongGenBench: long\-context generation benchmark\. In Findings of the Association for Computational Linguistics: EMNLP 2024, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 865–883\. External Links: [Link](https://aclanthology.org/2024.findings-emnlp.48/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.48) Cited by: [§2\.2](https://arxiv.org/html/2605.20364#S2.SS2.p3.1)\.
- Y\. Liu, D\. Iter, Y\. Xu, S\. Wang, R\. Xu, and C\. Zhu \(2023\) G\-Eval: NLG evaluation using GPT\-4 with better human alignment\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Singapore, pp\. 2511–2522\. External Links: [Link](https://aclanthology.org/2023.emnlp-main.153/), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.153) Cited by: [§1](https://arxiv.org/html/2605.20364#S1.p1.1), [§2\.1](https://arxiv.org/html/2605.20364#S2.SS1.p1.1)\.
- Y\. Tian, T\. Huang, M\. Liu, D\. Jiang, A\. Spangher, M\. Chen, J\. May, and N\. Peng \(2024\) Are large language models capable of generating human\-level narratives? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 17659–17681\. External Links: [Link](https://aclanthology.org/2024.emnlp-main.978/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.978) Cited by: [§2\.1](https://arxiv.org/html/2605.20364#S2.SS1.p2.1)\.
- S\. Venkatraman, N\. I\. Tripto, and D\. Lee \(2025\) CollabStory: multi\-LLM collaborative story generation and authorship analysis\. In Findings of the Association for Computational Linguistics: NAACL 2025, L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\), Albuquerque, New Mexico, pp\. 3665–3679\. External Links: [Link](https://aclanthology.org/2025.findings-naacl.203/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-naacl.203), ISBN 979\-8\-89176\-195\-7 Cited by: [§1](https://arxiv.org/html/2605.20364#S1.p1.1), [§2\.2](https://arxiv.org/html/2605.20364#S2.SS2.p1.1)\.
- X\. Wei, B\. Lu, X\. Zhang, Z\. Zhao, D\. Shen, L\. Xia, and D\. Yin \(2025\) Igniting creative writing in small language models: LLM\-as\-a\-judge versus multi\-agent refined rewards\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 17171–17197\. External Links: [Link](https://aclanthology.org/2025.emnlp-main.868/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.868), ISBN 979\-8\-89176\-332\-6 Cited by: [§2\.2](https://arxiv.org/html/2605.20364#S2.SS2.p5.1)\.
- R\. Xu, M\. Wang, X\. Wang, D\. Lu, X\. Tan, W\. Chu, and X\. Yinghui \(2025\) Guess what I am thinking: a benchmark for inner thought reasoning of role\-playing language agents\. In Findings of the Association for Computational Linguistics: EMNLP 2025, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 15148–15168\. External Links: [Link](https://aclanthology.org/2025.findings-emnlp.819/), ISBN 979\-8\-89176\-335\-7 Cited by: [§2\.2](https://arxiv.org/html/2605.20364#S2.SS2.p2.1)\.
- D\. Yang and Q\. Jin \(2025\) What matters in evaluating book\-length stories? A systematic study of long story evaluation\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 16375–16398\. External Links: [Link](https://aclanthology.org/2025.acl-long.799/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.799), ISBN 979\-8\-89176\-251\-0 Cited by: [§2\.2](https://arxiv.org/html/2605.20364#S2.SS2.p3.1)\.
## Appendix A Additional Plots
\(a\) Narrative Pacing
\(b\) Scene vs Exposition
\(c\) Language Proficiency & Literary Devices
\(d\) Narrative Ending
\(e\) Understandability & Coherence
\(f\) Perspective & Voice Flexibility
\(g\) Emotional Flexibility
\(h\) Structural Flexibility
\(i\) Originality in Theme and Content
\(j\) Originality in Thought
\(k\) Originality in Form
\(l\) World Building and Setting
\(m\) Character Development
\(n\) Rhetorical Complexity
Figure 3: Score distribution across all metrics\.

\(a\)
\(b\)
Figure 4: Inter\-metric correlation heatmaps for the three reviewer models across the 14 independently scored fiction\-review dimensions\. These plots diagnose whether reviewer outputs preserve metric distinctions or exhibit cross\-metric coupling; stronger widespread correlations suggest greater risk of rubric collapse into broader latent quality signals\.

\(a\)
Figure 5: Inter\-metric correlation heatmaps for the three reviewer models across the 14 independently scored fiction\-review dimensions \(continued\)\.
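The diagnostic behind these heatmaps can be illustrated with a minimal sketch. It assumes reviewer scores are available as a stories\-by\-metrics array and uses Pearson correlation; the function names, the summary statistic, and the synthetic data are illustrative assumptions, not the authors' plotting code or data\.

```python
import numpy as np


def inter_metric_correlation(scores: np.ndarray) -> np.ndarray:
    """Pearson correlation between metric columns.

    `scores` has shape (n_stories, n_metrics); the result is a square
    (n_metrics, n_metrics) matrix that a heatmap would visualise.
    """
    return np.corrcoef(scores, rowvar=False)


def mean_offdiag_abs_corr(corr: np.ndarray) -> float:
    """Average absolute correlation over all distinct metric pairs.

    Hypothetical summary statistic: values near 1 indicate that the
    metrics move together (possible rubric collapse), values near 0
    indicate that the metrics stay distinct.
    """
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    return float(np.abs(corr[off_diagonal]).mean())


rng = np.random.default_rng(0)
n_stories, n_metrics = 500, 14

# Synthetic reviewer A: each metric is scored independently on a 1-10 scale.
independent = rng.integers(1, 11, size=(n_stories, n_metrics)).astype(float)

# Synthetic reviewer B: every metric is one shared latent quality plus noise.
latent_quality = rng.normal(size=(n_stories, 1))
coupled = latent_quality + 0.3 * rng.normal(size=(n_stories, n_metrics))

corr_independent = inter_metric_correlation(independent)
corr_coupled = inter_metric_correlation(coupled)

print("independent reviewer:", round(mean_offdiag_abs_corr(corr_independent), 3))
print("coupled reviewer:    ", round(mean_offdiag_abs_corr(corr_coupled), 3))
```

In this toy setting the coupled reviewer yields a much higher average off\-diagonal correlation, which is the pattern the caption associates with collapse into a broader latent quality signal\.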
## Appendix B Prompts
Our dataset\-construction prompts are adapted from the expert evaluation criteria of the Torrance Test of Creative Writing \(TTCW\) \(DOI: 10\.1145/3613904\.3642731\)\. Since the metric definitions largely follow the original TTCW descriptions, we report only the final instruction pattern and scoring question used for each metric\.
Table 4: Final instruction patterns and scoring questions used in dataset construction\. Full metric definitions follow the original TTCW criteria\.

| Metric | Final prompt instruction |
| --- | --- |
| Narrative Pacing | Given the story above, list out the scenes in the story in which time compression or time stretching is used, and argue for each whether it is successfully implemented\. Then overall, give your reasoning about the question below and give an answer from 1 to 10, where 10 is the best score and 1 is the worst score\. Q\) How appropriate and balanced does the manipulation of time in terms of compression or stretching feel? |
| Scene vs\. Exposition | Given the story above, answer the following question\. Please first explain your reasoning step by step and then give an answer from 1 to 10, where 10 is the best score and 1 is the worst score\. Q\) How well does the story balance scene and summary/exposition, rather than relying heavily on one element? |
| Language Proficiency & Literary Devices | Given the story above, please list out all the metaphors, idioms and literary allusions, and for each decide whether it is successful or whether it feels forced or too easy\. Then overall, give your reasoning about the question below and give an answer from 1 to 10, where 10 is the best score and 1 is the worst score\. Q\) How sophisticatedly does the story use idiom, metaphor, or literary allusion? |
| Narrative Ending Quality | Given the story above, answer the following question\. Please first explain your reasoning step by step and then give an answer from 1 to 10, where 10 is the best score and 1 is the worst score\. Q\) How natural and earned does the end of the story feel, rather than arbitrary or abrupt? |
| Understandability & Coherence | Given the story above, answer the following question\. Please first explain your reasoning step by step and then give an answer from 1 to 10, where 10 is the best score and 1 is the worst score\. Q\) How well do the different elements of the story work together to form a unified, engaging, and satisfying whole? |
| Perspective & Voice Flexibility | Given the story above, answer the following question\. Please first explain your reasoning step by step and then give an answer from 1 to 10, where 10 is the best score and 1 is the worst score\. Q\) How well does the story represent perspective and voice in a flexible and convincing way? |
| Emotional Flexibility | Given the story above, answer the following question\. Please first explain your reasoning step by step and then give an answer from 1 to 10, where 10 is the best score and 1 is the worst score\. Q\) How well does the story achieve a balance between interiority and exteriority in a way that feels emotionally flexible? |
| Structural Flexibility | Given the story above, list each element in the story that is intended to be surprising\. For each, decide whether the surprising element remains appropriate with respect to the entire story\. Then overall, give your reasoning about the question below and give an answer from 1 to 10, where 10 is the best score and 1 is the worst score\. Q\) How well does the story contain turns that are both surprising and appropriate? |
| Originality in Theme and Takeaway | Given the story above, list out elements that are unique takeaways of this story for the reader\. Then overall, give your reasoning about the question below and give an answer from 1 to 10, where 10 is the best score and 1 is the worst score\. Q\) How likely is it that an average reader of this story will obtain a unique and original idea from reading it? |
| Originality in Thought | Given the story above, are there any clichés in the story? If so, list out all the elements in this story that are cliché\. Then overall, give your reasoning about whether the piece is negatively impacted by these clichés and give an answer from 1 to 10, where 10 is the best score and 1 is the worst score\. Q\) How original is the story as a piece of writing, without clichés? |
| Originality in Form | Given the story and the devices mentioned above, list each device used with a short explanation of whether it is successful or not\. Then overall, give your reasoning about the question below and give an answer from 1 to 10, where 10 is the best score and 1 is the worst score\. Q\) How original is the story in its form? |
| World\-Building and Sensory Believability | Given the story above, list out the elements in the story that call to each of the five senses\. Then overall, give your reasoning about the question below and give an answer from 1 to 10, where 10 is the best score and 1 is the worst score\. Q\) How well does the writer make the fictional world believable at the sensory level? |
| Character Development Depth | Given the story above, list each character and the level of development\. Then overall, give your reasoning about the question below and give an answer from 1 to 10, where 10 is the best score and 1 is the worst score\. Q\) How well does each character in the story feel developed at the appropriate complexity level, ensuring that no character is present merely to satisfy a plot requirement? |
| Rhetorical Complexity | Given the story above, answer the following question\. Please first explain your reasoning step by step and then give an answer from 1 to 10, where 10 is the best score and 1 is the worst score\. Q\) How well do passages in the story involve subtext, and when subtext is present, how effectively does it enrich the story’s setting rather than feel forced? |
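
The instruction pattern in Table 4 can be sketched as a small prompt builder plus a score extractor. This is a hedged illustration only: `build_prompt`, `extract_score`, and the assumed `Score: N` response convention are hypothetical and are not taken from the released code; only the instruction and question strings come from Table 4\.

```python
import re
from typing import Optional

# Generic instruction shared by several Table 4 rows (quoted from the table).
GENERIC_INSTRUCTION = (
    "Given the story above, answer the following question. "
    "Please first explain your reasoning step by step and then give an answer "
    "from 1 to 10, where 10 is the best score and 1 is the worst score."
)

# Scoring question for the Narrative Ending Quality row (quoted from the table).
ENDING_QUESTION = (
    "How natural and earned does the end of the story feel, "
    "rather than arbitrary or abrupt?"
)

# Assumed output convention: the reviewer states its answer as "Score: N".
SCORE_PATTERN = re.compile(r"\bscore\s*[:=]\s*(10|[1-9])\b", re.IGNORECASE)


def build_prompt(story: str, instruction: str, question: str) -> str:
    """Place the story first, then the instruction and the 'Q)' question."""
    return f"{story.strip()}\n\n{instruction}\nQ) {question}"


def extract_score(response: str) -> Optional[int]:
    """Return the last stated score in the range 1-10, or None.

    None marks a response from which no valid score could be parsed.
    """
    matches = SCORE_PATTERN.findall(response)
    return int(matches[-1]) if matches else None


prompt = build_prompt("Once upon a time...", GENERIC_INSTRUCTION, ENDING_QUESTION)
print(prompt)
print(extract_score("The ending follows from earlier choices. Score: 7"))
print(extract_score("The text drifts into unrelated reasoning."))
```

Returning `None` instead of a default value keeps unparseable responses visible, so they can be counted separately from low scores\.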