Towards Human-Level Book-Writing Capability

arXiv cs.AI Papers

Summary

This paper introduces a dataset and training framework that transforms human-authored novels into multi-resolution planning scaffolds, enabling long-context language models to generate book-scale fiction with more human-like prose and narrative dynamics.

arXiv:2605.17064v1 Announce Type: new Abstract: Large language models optimized for instruction following and agentic tasks remain poorly aligned with the requirements of high-quality creative writing. Fiction frequently depends on behaviors that assistant-tuned models are explicitly trained to avoid, particularly deception, moral ambiguity, and unreliable narration. As a result, generated stories often appear structurally correct while remaining stylistically generic, overly explanatory, or weakly grounded in human literary behavior. We present a dataset construction and training framework for book-scale creative writing that reframes supervised fine-tuning as a prompt-to-book generation task grounded in human-authored fiction. Starting from public-domain novels, we derive a multi-resolution planning scaffold by summarizing each book at progressively finer levels, from a high-level premise to chapter- and scene-level structure. We then invert this hierarchy during training: the model learns to expand a prompt into increasingly detailed plans and finally into the original human-authored book text. This formulation preserves human prose as the final supervised target while using intermediate summaries to make book-scale generation learnable. We train a long-context language model on these prompt-to-book trajectories and study whether this objective shifts generation away from assistant-style prose and toward human literary writing.
Original Article
View Cached Full Text

Cached at: 05/19/26, 06:38 AM

# Towards Human-Level Book-Writing Capability
Source: [https://arxiv.org/html/2605.17064](https://arxiv.org/html/2605.17064)
###### Abstract

Large language models optimized for instruction following and agentic tasks remain poorly aligned with the requirements of high\-quality creative writing\. Fiction frequently depends on behaviors that assistant\-tuned models are explicitly trained to avoid, particularly deception, moral ambiguity, and unreliable narration\. As a result, generated stories often appear structurally correct while remaining stylistically generic, overly explanatory, or weakly grounded in human literary behavior\. We present a dataset construction and training framework for book\-scale creative writing that reframes supervised fine\-tuning as a prompt\-to\-book generation task grounded in human\-authored fiction\. Starting from public\-domain novels, we derive a multi\-resolution planning scaffold by summarizing each book at progressively finer levels, from a high\-level premise to chapter\- and scene\-level structure\. We then invert this hierarchy during training: the model learns to expand a prompt into increasingly detailed plans and finally into the original human\-authored book text\. This formulation preserves human prose as the final supervised target while using intermediate summaries to make book\-scale generation learnable\. We train a long\-context language model on these prompt\-to\-book trajectories and study whether this objective shifts generation away from assistant\-style prose and toward human literary writing\.111The dataset is available at[https://huggingface\.co/datasets/Pageshift\-Entertainment/LongPage](https://huggingface.co/datasets/Pageshift-Entertainment/LongPage)\.

## 1Introduction

Recent language models can generate long and locally coherent text\[[1](https://arxiv.org/html/2605.17064#bib.bib1),[2](https://arxiv.org/html/2605.17064#bib.bib2),[3](https://arxiv.org/html/2605.17064#bib.bib3)\], but their outputs often remain recognizably assistant\-like even in creative writing settings\. Stories generated by instruction\-tuned models frequently over\-explain character motivations, resolve conflict too directly, or default to safe and predictable interactions\. While these models are highly effective at reasoning and task completion, the behavioral patterns encouraged during assistant alignment do not fully match the distributions present in human fiction\.

This mismatch becomes particularly visible at book scale\. Fiction routinely relies on deception, ambiguity, unreliable narration, and characters acting against the reader’s expectations\. These behaviors are often undesirable in assistant systems, where models are optimized to be helpful, honest, and direct\[[4](https://arxiv.org/html/2605.17064#bib.bib4)\]\. As a result, models trained primarily for assistant\-style interaction may struggle to reproduce the narrative dynamics and prose characteristics commonly found in human\-authored books\.

Existing work on long\-context generation largely approaches this problem through improved memory, retrieval, or planning\[[1](https://arxiv.org/html/2605.17064#bib.bib1),[2](https://arxiv.org/html/2605.17064#bib.bib2),[3](https://arxiv.org/html/2605.17064#bib.bib3),[5](https://arxiv.org/html/2605.17064#bib.bib5)\]\. These techniques are primarily aimed at maintaining long\-range consistency\. However, consistency alone does not guarantee human\-like creative writing quality\. A model can maintain coherent long\-horizon structure while still producing prose that feels synthetic, overly assistant\-like, or stylistically unlike human\-authored fiction\.

In this work, we introduce a dataset construction and training framework for prompt\-to\-book generation grounded in human\-authored fiction\. The central idea is to transform books into a planning scaffold and then reverse that process during training\. A planning scaffold is a multi\-stage representation of a book constructed through progressively finer summaries, ranging from a high\-level book description to chapter\- and scene\-level structure\[[6](https://arxiv.org/html/2605.17064#bib.bib6),[7](https://arxiv.org/html/2605.17064#bib.bib7),[8](https://arxiv.org/html/2605.17064#bib.bib8),[2](https://arxiv.org/html/2605.17064#bib.bib2)\]\. Starting from public\-domain novels, the pipeline first compresses each book into this scaffold representation\. The model is then trained to invert the process: given a prompt, it expands coarse summaries into increasingly detailed representations before generating the original book text\.

This formulation treats creative writing as a staged expansion problem rather than a single\-step continuation task\[[9](https://arxiv.org/html/2605.17064#bib.bib9),[10](https://arxiv.org/html/2605.17064#bib.bib10),[5](https://arxiv.org/html/2605.17064#bib.bib5)\]\. The planning scaffold provides supervision for long\-horizon generation, while the final target remains the human\-authored book itself\. The objective is therefore not merely to generate coherent long text, but to align model behavior with the structure and prose distributions present in published fiction\.

### 1\.1Overview

Section[2](https://arxiv.org/html/2605.17064#S2)describes the dataset construction pipeline, including book preprocessing, prompt generation, and creation of the planning scaffold from progressively finer summaries of each book\.

Section[3](https://arxiv.org/html/2605.17064#S3)describes the supervised fine\-tuning setup\. The model is trained to generate the planning scaffold and the original book text from a synthetic prompt, following a coarse\-to\-fine expansion process\.

## 2Dataset

This section describes the construction of the dataset used for supervised book\-scale generation\. The source books provide the final prose targets, while the prompts, intermediate plans, and metadata used for training are generated from the books themselves\. The construction problem is therefore inverse to the generation problem: starting from a complete book, the pipeline recovers the structured information that a model should later learn to generate before writing the book\.

The resulting training examples contain three main components: a synthetic user prompt, an intermediate planning scaffold, and the original book text\. The prompt specifies the requested book, the scaffold exposes the latent narrative structure, and the book text remains the final human\-authored target\. The following subsections describe the corpus, the annotation strategy, and the hierarchical processing pipeline used to produce these representations\.

### 2\.1Corpus Construction

The source corpus consists of public\-domain books from Project Gutenberg\[[11](https://arxiv.org/html/2605.17064#bib.bib11)\]\. The final release contains approximately 6,000 books\.

The corpus is constructed in two stages\. The first stage contains a 300\-book seed set drawn from the global top\-300 Project Gutenberg downloads at the time of collection\. The second stage adds 5,700 further books\. Both stages produce the same final representation and training format\. The difference lies in how the annotations are generated\.

### 2\.2Annotation Strategy

Stage one processes all 300 seed books with a prompted Qwen3\-32B model\[[12](https://arxiv.org/html/2605.17064#bib.bib12)\]\. The model is used as a reasoning system throughout the pipeline, producing the intermediate representations later used for training\. These outputs additionally serve as supervision for distilling a faster model specialized for the repeated low\-level stages of the pipeline\.

Stage two replaces the scene\-level and chapter\-level components with a distilled Qwen3\-14B tool model\[[12](https://arxiv.org/html/2605.17064#bib.bib12)\]trained from the stage\-one outputs\. Unlike the Qwen3\-32B setup, the distilled model operates without reasoning, making it substantially faster to run\.

This separation is motivated by computational cost\. Most processing effort occurs at the scene and chapter levels because these operations must be repeated many times within each book\. In contrast, the global processing stages are invoked only a few times per example\. Stage two therefore continues to use the prompted Qwen3\-32B reasoning model for the higher\-level abstractions and metadata generation steps, while the distilled Qwen3\-14B model handles the repeated local processing\.

### 2\.3Pipeline

The pipeline progressively converts raw book text into scene\-level, chapter\-level, and book\-level representations\. Each level captures a different scale of narrative information\. Scene\-level processing preserves local events and narrative function\. Chapter\-level processing links scenes into larger developments and records how information is distributed across a chapter\. Book\-level processing compresses the chapter representations into global narrative structure\.

A central design choice is that summaries throughout the pipeline are represented as bullet\-point lists rather than prose paragraphs\. The objective is not to produce polished standalone summaries, but to preserve narrative facts in a form that can later be aggregated and recombined\. The pipeline therefore biases the model toward producing multiple short bullet points instead of dense free\-form text\. A typical bullet point contains roughly 10–20 words, with an absolute upper limit of 45 words\.

At the scene level, the representation captures both content and narrative role\. In addition to local events, the pipeline records information such as which characters are central to the scene, how the narration is structured, and whether the focus is placed on action, exposition, dialogue, or changes in pacing\. The objective is to preserve details that are often lost in conventional summarization\.

At the chapter level, the scene representations are aggregated into a larger structural unit\. The chapter representation links together the local developments of individual scenes while preserving broader stylistic and narrative balance across the chapter\.

Finally, the book\-level stage compresses the chapter representations into a global representation of the narrative\. This representation is then used to generate metadata and synthetic prompts\. The resulting structure acts as an intermediate planning scaffold that makes the latent organization of long\-form narratives explicit\.

Dataset construction therefore runs from completed book text to prompts and planning structure\. During training, the direction is reversed: the model is conditioned on a synthetic user prompt and learns to generate the scaffold before producing the final book text\. This preserves human\-authored prose as the final target while still providing explicit supervision for long\-range planning\.

![Refer to caption](https://arxiv.org/html/2605.17064v1/figures/data_pipeline.png)Figure 1:The figure illustrates the transformation of raw book text into a hierarchical planning scaffold through scene\-level, chapter\-level, and book\-level processing stages\. Colors denote the processing level of each component, while arrows indicate the flow of information between extracted representations\. The diagram also highlights that later book\-level and synthetic prompt components are constructed from multiple intermediate outputs rather than from a single linear summarization path\.#### 2\.3\.1Scene\-Level Processing

##### Scene Segmentation

Scene\-level processing provides the first structural layer of the pipeline\. Rather than treating a chapter as one continuous block of text, the pipeline divides it into smaller narrative units that can be analyzed and summarized more precisely\. This is important because chapters often contain shifts in setting, time, focal character, character grouping, dialogue focus, point of view, or narrative purpose\.

Scene segmentation is guided by these narrative criteria, but they are not used as rigid deterministic rules\. Instead, they define what the model should consider when identifying scene boundaries\. This allows the pipeline to handle cases where a transition is implied by narrative movement rather than explicit formatting\.

The result is a structured scene breakdown for each chapter\. Each scene is assigned a short name, a text span, a narrative focus, and a narrative perspective\. This makes the chapter easier to process in later stages, because each scene can be scored, summarized, and aggregated as a distinct narrative unit\.

##### Scene Schema

The scene schema preserves both the position of the scene inside the chapter and the basic narrative information needed for later processing\. Each scene includes a short descriptive name, its textual span, the dominant narrative focus, and the narrative perspective\. The narrative focus identifies the character or narrator through whom the scene is mainly presented, while the narrative perspective describes how access to that viewpoint is framed\.

Once the scene boundaries are validated, the corresponding scene text and word count are attached\. This turns each scene into a self\-contained unit for later scoring, summarization, and chapter\-level aggregation\.

##### Scene Embedding Space

A plain text summary is useful for describing what happens in a scene, but it does not always preserve the scene’s functional properties\. Some narrative features are easy to underrepresent in prose summaries, especially when they are diffuse rather than event\-like\. For example, pacing, exposition density, world\-building intensity, or the amount of dialog may shape the reading experience even if they are not naturally stated as explicit plot points\.

To preserve these properties, each scene is assigned a seven\-dimensional narrative score vector\. The dimensions are action, dialog, world building, exposition, romantic content, erotic content, and pacing\. Each dimension is scored from 0 to 100, indicating how strongly that feature is present in the scene\. These scores act as interpretable control signals rather than neural embeddings: they make visible what kind of narrative work the scene performs\.

This representation complements the textual scene summary\. The summary captures the concrete events, while the score vector captures broader structural qualities\. A scene can therefore be represented not only as “what happened,” but also as whether it is dialog\-driven, exposition\-heavy, fast\-paced, action\-oriented, or primarily used for world building\. Later stages use this information when generating chapter summaries and when aggregating scene\-level information into chapter\-level representations\.

To make the scores more stable, the seed pipeline uses an ensemble\-style procedure\. The same scene is scored multiple times by a reasoning model, and the final value for each dimension is computed as the mean across these generations\. This is useful because reasoning models may arrive at slightly different scores across runs, since each generation can produce a different internal reasoning trace before assigning values\.

Averaging these outputs reduces the influence of individual noisy judgments\. Very low values are then thresholded to zero, which further suppresses weak or inconsistent signals and produces a cleaner representation of the dominant narrative features in each scene\.

##### Scene Summarization

Scene summarization is not treated as a generic text\-compression step\. Important structural signals, such as shifts in tension, exposition, pacing, dialogue focus, or world\-building, may be compressed away or treated as secondary details if the model is asked only to summarize the scene in general terms\.

Instead, each annotated scene is converted into a compact natural\-language summary guided by its narrative score vector\. The score vector provides an additional signal about the function of the scene within the chapter, helping the summary emphasize the aspects that are most relevant for later aggregation\.

The resulting summaries are concise bullet\-style descriptions\. They preserve enough local information to support chapter\-level processing, while avoiding the need to reconsider the full scene text at every later stage\.

#### 2\.3\.2Chapter\-Level Processing

After the scene\-level representation has been created, the pipeline moves from local narrative units to the chapter level\. A chapter is not treated as a simple concatenation of scenes, but as a larger narrative unit with its own structure, emphasis, character dynamics, and stylistic properties\. The goal of chapter\-level processing is therefore to compress the detailed scene information into a representation that is compact enough to be used at the book level, while still preserving the main developments of the chapter\.

The chapter\-level stage builds on the scene summaries, scene score vectors, and scene metadata produced in the previous stage\. It aggregates these local signals into chapter summaries, chapter\-level narrative scores, short scene mappings, character information, and writing\-style descriptions\. This creates an intermediate layer between detailed scene analysis and global book planning\.

##### Chapter\-Level Writing Style

The motivation for extracting chapter\-level writing style is to represent narrative text not only by what happens, but also by how it is written\. Event\-based summaries capture plot progression, characters, and world information, but they discard stylistic signals that strongly influence perceived authorial voice, readability, atmosphere, and genre convention\.

Therefore, a separate writing\-style representation is introduced to preserve prose\-level characteristics independently from narrative content\. This enables the dataset to support tasks where stylistic consistency is important, such as style\-controlled generation, authorial voice modeling, chapter retrieval, and comparative analysis of narrative form\.

By abstracting away names, locations, and plot\-specific details, the extracted style descriptors function as reusable stylistic fingerprints rather than summaries of chapter events\.

##### Chapter Embedding

The chapter embedding summarizes the narrative profile of the chapter as a whole\. It is derived from the scene\-level score vectors and uses the same dimensions\. This keeps the representation consistent across levels of the hierarchy\.

The purpose of the chapter embedding is to describe the dominant narrative character of the chapter\. For example, a chapter may be largely expository, dialogue\-driven, action\-heavy, or focused on world building\. While the scene\-level scores describe local variation, the chapter\-level score captures the overall balance of these features across the chapter\.

This allows later stages to compare chapters not only by what happens in them, but also by how they function structurally within the book\. A chapter with high exposition may serve a different planning role than a chapter with high action or high dialogue, even if both are similar in length\.

##### Chapter Summary Generation

Instead of summarizing the raw chapter text directly, the pipeline builds on the scene summaries, which already identify the main events and narrative functions\. The summary length is scaled to the amount of scene\-level detail, so more complex chapters receive proportionally richer summaries\.

These chapter summaries then serve as compact inputs for book\-level processing, including story arc detection, world\-rule extraction, and character analysis\. In this way, chapter\-level summaries preserve the main developments of the chapter without requiring later stages to process the full chapter text again\.

##### Short Scene Summaries

The pipeline also produces shorter scene summaries that connect the chapter\-level summary back to the individual scenes\. Their purpose is not to replace the detailed scene summaries, but to preserve the alignment between the chapter’s overall development and its local scene structure\.

This intermediate layer is useful because the chapter summary compresses the narrative into a higher\-level form, while the full scene summaries may contain more detail than later stages need\. Short scene summaries keep the scene sequence visible in a compact way, making it easier to relate chapter plans to the scenes that support them\.

##### Chapter Characters

The chapter\-level pipeline also extracts character information to preserve who is active in each part of the book\. This is important because character relevance is not constant across a narrative\. A character may drive one chapter, support another, or only be mentioned indirectly, while still being important for the larger story\.

By distinguishing between main characters, side characters, and mentioned characters at the chapter level, the pipeline captures how characters enter, leave, and influence the narrative over time\. This gives later book\-level processing a more reliable basis for identifying structurally important characters and constructing coherent character descriptions across the full book\.

#### 2\.3\.3Book\-Level Processing

After chapter\-level processing, the pipeline moves to a global representation of the book\. The goal is no longer to summarize individual events, but to recover the higher\-level structure that organizes the narrative as a whole\. This includes story arcs, central character functions, world rules, writing style, and the book’s narrative archetype\.

This layer is important because long\-form generation requires global coherence\. Characters need stable roles, conflicts need a larger trajectory, and style should remain consistent across chapters\. The book\-level representation therefore acts as a planning layer that connects chapter\-level details into a unified narrative structure\.

##### Story Arc Detection

The pipeline groups chapter\-level developments into a small set of broader story arcs\. For each arc, the model generates a synthetic arc name and a compact bullet\-point progression\. The arc name provides a high\-level label, while the bullet points describe how conflicts, goals, relationships, or turning points develop across chapters\.

This representation makes long\-range narrative movement explicit without turning the arcs into another detailed chapter summary\. Each arc is therefore kept short, so it can function as a book\-level planning signal\.

##### Character Archetypes

Character archetypes describe the functional role of characters within the narrative\. This is different from measuring how much space a character occupies in the text\. A character may appear often without shaping the central structure, while another may appear less frequently but still define the main conflict, motivation, or turning point of the story\.

The pipeline therefore abstracts characters by their narrative function and by their relationships to other characters\. These cross\-character archetype relationships capture roles such as opposition, support, mentorship, rivalry, dependence, or emotional contrast\. This helps represent the character system as a structured network of functions rather than as a simple list of frequently appearing names\.

##### Book Character List

The book\-level character list is constructed from the character information extracted at the chapter level\. Local character appearances are consolidated into a stable representation\. The book\-level representation keeps only main and side characters, so that the character list remains focused on characters with a sustained role in the book\.

For each retained character, the pipeline generates a compact bullet\-point profile that summarizes recurring traits, relationships, motivations, and developments\. These profiles provide an accessible lookup representation during later generation, so that character information does not need to be reconstructed from longer narrative summaries\.

##### Book Writing Style

To preserve prose\-level information, the pipeline consolidates the writing\-style analyses extracted for individual chapters\. Each chapter contributes local evidence about the prose form of the text, and these chapter\-wise descriptions are aggregated into a single style profile for the full book\.

The resulting representation is a flat list of approximately 35 bullet points, each describing a recurring stylistic property of the book\. It captures prose\-level patterns such as tense, diction, syntax, dialogue handling, punctuation, figurative language, narrative voice, register, spelling conventions, and tone\.

This step is decoupled from plot\-oriented processing\. Its input consists of chapter\-level style analyses rather than story arcs, character descriptions, or world rules\. The objective is therefore not to summarize narrative content, but to preserve the stylistic regularities that govern how the text is written across chapters\. This provides later generation with a compact style representation grounded in observed prose patterns rather than event\-level summaries\.

##### Book Archetype

At the book level, the pipeline produces a compact abstraction of the narrative pattern underlying the text\. It is derived from book\-level signals such as story arcs, chapter summaries, and writing style, and maps them into a higher\-level description of the book’s structural form\.

The motivation for this step is that long\-form generation depends not only on event order, but also on the broader narrative logic that organizes those events\. The model needs an early signal of what kind of story is being generated, which conventions shape it, and what type of resolution the structure implies\.

The archetype is therefore written as a short abstract paragraph rather than as another content summary\. It captures dominant narrative modes, structural expectations, transformation patterns, ending shape, and trope orientation, while avoiding concrete plot details\. This gives later generation a compact frame for interpreting the more detailed arcs and chapter plans\.

#### 2\.3\.4Synthetic Prompt and Metadata Generation

##### Book Preview

The data pipeline processed book\-level information into a user\-facing short book preview\. The preview is user\-facing: it gives a compact, spoiler\-controlled summary of the book and states, in direct form, what the user can and cannot expect from it\.

The preview is represented by three components: a synthetic title, a highlight, and exactly seven tags\. The synthetic title assigns a concise name to the book\. The highlight describes the premise, conflict, stakes, and narrative hook in approximately 90–130 words, or 3–5 sentences\.

The seven tags are short descriptors of the book’s general theme or genre\. Tags are ordered from most to least important and impactful for the book\.

##### Synthetic Prompt Generation

A central component of the pipeline is the generation of synthetic user prompts\. A primary challenge in this stage is that prompts produced by large language models without explicit control mechanisms often exhibit limited diversity\. In practice, unconstrained generation tends to converge toward similar phrasing patterns, comparable levels of detail, and recurring structural formats\. This results in a collapse of the synthetic prompt distribution and reduces its ability to approximate the variability observed in real user instructions\.

To address this limitation, we introduceStructured Guide Sampling\. Rather than allowing the language model to determine the prompt form autonomously, the pipeline first samples a prompt profile from a predefined set of stylistic dimensions\. These dimensions do not specify the narrative content of the target book; instead, they govern the expected form of the generated prompt, including its length, phrasing style, structural organization, specificity, and surface quality\. The sampled profile is then combined with the book\-level representation and provided to the language model, ensuring that the resulting prompt remains semantically aligned with the target book while adhering to explicitly controlled stylistic constraints\.

This procedure is closely related to recent work on attribute\-conditioned synthetic data generation, where control variables are specified or sampled before generation rather than relying on the model’s implicit diversity\. For example, AttrPrompt generates synthetic data from attributed prompts containing dimensions such as length and writing style, while TinyStories and SimpleStories use randomly sampled or parameterized prompt features to control properties of generated stories\[[13](https://arxiv.org/html/2605.17064#bib.bib13),[14](https://arxiv.org/html/2605.17064#bib.bib14),[15](https://arxiv.org/html/2605.17064#bib.bib15)\]\. Similar conditioning appears in synthetic prompt generation as well, such as SynPO, which generates prompts from sampled keyword constraints; in contrast, our method samples dimensions of the user prompt form itself, including length, phrasing style, structure, specificity, and surface noise\[[16](https://arxiv.org/html/2605.17064#bib.bib16)\]\.

The sampled dimensions capture multiple aspects of prompt variation, including prompt length, request phrasing, structural layout, and realistic surface\-level imperfections\. Consequently, the pipeline can generate prompts ranging from short and informal requests to detailed and highly structured writing specifications\. The framework additionally controls whether prompts are expressed as free\-form prose, lists, or field\-based specifications\.

To further improve realism, the pipeline incorporates controlled noise patterns, including minor spelling mistakes, grammatical inconsistencies, and punctuation irregularities\. These perturbations are applied exclusively to the surface form of the prompt and do not modify the underlying book representation\. Their purpose is to prevent the synthetic distribution from consisting solely of polished and idealized instructions, thereby better approximating the irregularities present in real\-world user inputs\.

All prompt\-style dimensions are sampled according to fixed probability distributions\. As a result, the generated dataset contains a controlled mixture of prompt regimes, consisting of approximately 30% short realistic prompts, 60% medium\-length diverse prompts, and 10% long structured prompts\. This explicitly shaped distribution reduces reliance on the intrinsic variability of the language model and improves coverage across a broader range of possible request forms\.

The final synthetic prompt is therefore derived from two complementary information sources\. The book representation provides semantic grounding, including plot, characters, setting, writing style, and chapter structure, while the sampled prompt profile determines how this information is expressed\. The language model must synthesize both components into a single prompt that is semantically consistent with the target book while conforming to the sampled stylistic constraints\.

Overall, this procedure produces synthetic training prompts that are semantically grounded, structurally diverse, and more representative of the range of instructions encountered in long\-form text generation settings\.

#### 2\.3\.5Corpus Scale and Token Composition

The corpus is characterized in terms of token\-level properties of complete sequence representations\. Each sequence is decomposed into three components: a synthetic user promptpp, an intermediate planning scaffoldss, and the target book textbb\. Token counts are computed using the Llama 3 tokenizer\. For a sequenceii, the corresponding lengths are denoted bypip\_\{i\},sis\_\{i\}, andbib\_\{i\}, with total length

ℓi=pi\+si\+bi\.\\ell\_\{i\}=p\_\{i\}\+s\_\{i\}\+b\_\{i\}\.This decomposition enables separate analysis of conditioning, intermediate structure, and generated content\.

![Refer to caption](https://arxiv.org/html/2605.17064v1/figures/token_composition.png)

![Refer to caption](https://arxiv.org/html/2605.17064v1/figures/book_length_distribution.png)

Figure 2:Token\-level characterization of the corpus sequence\. Top: upper\-envelope token composition across length buckets, separating prompt, planning scaffold, and book text\. Bottom: histogram of total sequence lengths\.Figure[2](https://arxiv.org/html/2605.17064#S2.F2)summarizes the token\-level scale and composition of the corpus\. The lower panel shows the distribution of total sequence lengths across 6,000 books\. In aggregate, the corpus contains 649\.1M tokens under the Llama 3 tokenizer\. Total sequence lengths range from 1,085 to 1,255,497 tokens, with a mean of 106,940 tokens and a median of 88,983 tokens\. The distribution is right\-skewed: the 90th, 95th, and 99th percentiles are 187,346, 245,090, and 390,604 tokens, respectively\. While most books fall below 100k tokens, 2,512 books exceed this threshold, 257 exceed 262k tokens, 20 exceed 500k tokens, and 2 exceed one million tokens\.

The upper panel reports the corresponding token composition across the length spectrum\. Each bar corresponds to the maximum\-length sequence within a fixed\-width total\-length bin and is decomposed into synthetic user prompt, thinking scaffold, and book\-text tokens\. Across the full corpus, book text accounts for 60\.43% of all tokens, while the thinking scaffold accounts for 39\.35%; the synthetic user prompt contributes only 0\.23%\. At the per\-book level, the median shares are 56\.60% book text, 43\.08% thinking scaffold, and 0\.16% synthetic prompt\. Thus, although book text forms the largest component, the thinking scaffold constitutes a substantial fraction of the total sequence budget, whereas the synthetic prompt is negligible by comparison\.

At larger sequence lengths, the absolute contribution of both book text and planning tokens increases proportionally, while their relative proportions remain approximately stable\. This indicates that increases in total sequence length are primarily driven by longer book content rather than changes in prompt or scaffold size\.

Across the selected sequences, the book text constitutes the majority of tokens, particularly at larger scales\. The planning scaffold contributes a consistent fraction of the total token budget, while the synthetic prompt remains comparatively small\. These results indicate that, in addition to long\-form content, a non\-negligible portion of each sequence is allocated to structured intermediate representation\.

Overall, the corpus exhibits \(i\) a heavy\-tailed distribution of sequence lengths and \(ii\) a structured token composition in which planning information occupies a measurable share of the total context\.

## 3Training the Model

### 3\.1Model Choice

We initialize from Ministral 3 14B Base\[[17](https://arxiv.org/html/2605.17064#bib.bib17),[18](https://arxiv.org/html/2605.17064#bib.bib18)\]\. Since the target task is text\-only, we remove the vision encoder and train only the language model\.

We use a base model rather than an instruction\-tuned model because general\-purpose assistant models are not naturally aligned with purely creative writing\. Assistant\-style post\-training encourages models to behave as truthful, honest, non\-deceptive, and helpful respondents\[[4](https://arxiv.org/html/2605.17064#bib.bib4)\]\. These objectives are appropriate for question answering and task assistance, but they can conflict with core requirements of fiction generation\. A fiction model must be able to invent events that are not true, maintain fictional world states, write unreliable narrators, and generate characters who lie, manipulate, misunderstand, or conceal information\. In these settings, fabrication and deception are not failures; they are narrative devices\.

This mismatch also affects narrative form\. Assistant\-tuned models tend to answer requests directly, explain ambiguity, and avoid misleading the user\. Long\-form fiction often requires withholding information, delaying resolution, preserving ambiguity, sustaining voice, and allowing characters to act under false beliefs\. If assistant priors are too strong, the model can collapse fictional narration toward explanatory prose, moral commentary, or over\-literal compliance with the prompt\. Starting from a base model reduces the amount of assistant\-specific behavior that the creative\-writing training must overwrite\.

Nevertheless, we observe evidence that the base checkpoint still contains substantial instruction\-following behavior\. After our book\-writing SFT stage, the model still achieves roughly 50% accuracy on GSM8K when prompted with an instruction\-style template\[[19](https://arxiv.org/html/2605.17064#bib.bib19)\]\. This was unexpected for a model trained only on the serialized book\-writing scaffold and suggests that the base checkpoint may have seen instruction\-like data during mid\-training\. We also observe a qualitative failure mode in historical fiction prompts: the model sometimes inserts real historical figures even when they are not useful for the story\. We hypothesize that both behaviors are downstream effects of residual instruction\-style training signals in the base model\. In future iterations, we plan to either start from an older base model with less instruction\-following behavior or train a new text\-only base model with tighter control over the training data\.

### 3\.2Dataset Representation

Each training example is represented as a single serialized long\-form generation trajectory\. The example begins with a user prompt and then expands into a hierarchy of supervised planning and prose fields\. Rather than representing the dataset as a direct mapping from a prompt to a complete book, we represent each example as a structured scaffold that decomposes the book\-generation process into book\-level planning, early first\-chapter realization, remaining chapter planning, and chapter\-level prose generation\.

![Refer to caption](https://arxiv.org/html/2605.17064v1/figures/training_representation.png)

Figure 3:Hierarchical structure of a composed training example\. The representation begins with the user prompt and organizes the target output into a book plan, an early first\-chapter plan, the first chapter text, full\-book chapter plans, and the remaining chapter texts\.The representation is intentionally hierarchical\. Generation begins with high\-level book planning, then moves into early chapter planning, followed by chapter prose generation\. Rather than training the model on a direct prompt\-to\-book mapping, the scaffold exposes intermediate planning stages that structure the generation process and provide explicit supervision over long\-range organization\.

A central design choice is the inclusion of an*Early First Chapter*stage\. The first chapter text is generated as early as possible, before the remaining long\-running generation stages\. This provides an early quality\-control point for the sample\. If the generated first chapter is poor, later stages such as full\-book chapter planning and later chapter generation can be skipped, avoiding unnecessary compute expenditure on low\-quality generations\.

The early first\-chapter stage also includes limited forward planning information for the next chapter\. This helps the first chapter establish continuity into the rest of the book while still allowing the first prose realization to occur early in the sequence\.

After the first chapter is generated, the sequence continues into full\-book chapter planning and chapter\-level scene decomposition\. These later stages provide more detailed structure for the remaining chapters, including chapter focus, scene\-level breakdowns, narrative perspective, and target lengths\.

### 3\.3Training Objective and Hyperconvergence

Training is supervised fine\-tuning with a standard autoregressive cross\-entropy objective\. Each example is formatted as a single token sequence containing the initial prompt, scaffold headers, planning components, and prose components\. The loss is masked on the initial prompt tokens only; every following token contributes to the loss\.

Lety1:Ty\_\{1:T\}denote the serialized token sequence and letmim\_\{i\}be the loss mask for tokenyiy\_\{i\}, wheremi=0m\_\{i\}=0for prompt tokens andmi=1m\_\{i\}=1otherwise\. We minimize

ℒ​\(θ\)=−1∑i=1Tmi​∑i=1Tmi​log⁡pθ​\(yi∣y<i\)\.\\mathcal\{L\}\(\\theta\)=\-\\frac\{1\}\{\\sum\_\{i=1\}^\{T\}m\_\{i\}\}\\sum\_\{i=1\}^\{T\}m\_\{i\}\\log p\_\{\\theta\}\(y\_\{i\}\\mid y\_\{<i\}\)\.This objective trains the model to reproduce both the prose and the scaffold structure used to generate that prose\.

We train to hyperconvergence, following the training outline of the Hyperfitting work\[[20](https://arxiv.org/html/2605.17064#bib.bib20)\]\. In this setting, continued SFT is used not only to teach the model the dataset format, but also to sharpen open\-ended long\-form generation\. We use the term*hyperconvergence*for the point at which the model has strongly adapted to the scaffold and produces low\-entropy, template\-consistent continuations for the target generation process\.

The maximum training sequence length is 262,144 tokens\. Samples longer than this limit are clipped to the maximum length\.

### 3\.4Training System

We train with JAX on a TPU v6e\-256 pod\[[21](https://arxiv.org/html/2605.17064#bib.bib21)\]\. Training uses bfloat16 for the model copy used in the forward and backward passes\[[22](https://arxiv.org/html/2605.17064#bib.bib22)\]\. We use ZeRO\[[23](https://arxiv.org/html/2605.17064#bib.bib23)\]to shard the FP32 master parameters and FP32 optimizer states\. Optimizer computations are performed in FP32, while gradient accumulation is performed in bfloat16\.

The long sequence length requires sequence\-parallel attention\. We use Ring Attention for sequence parallelism over the TPU topology, making 262,144\-token training examples tractable across the pod\[[24](https://arxiv.org/html/2605.17064#bib.bib24)\]\.

We use the Muon optimizer for training\[[25](https://arxiv.org/html/2605.17064#bib.bib25)\]\. The FP32 master parameters are updated by the optimizer, and a bfloat16 training copy is materialized from the master parameters before the model computation\.

### 3\.5Learning\-Rate\-Scaled Stochastic Downcasting

A novel part of our training system is the stochastic downcast from the FP32 master parameters to the bfloat16 training copy\. Prior work studies stochastic rounding as a way to reduce numerical error in low\-precision LLM training\[[26](https://arxiv.org/html/2605.17064#bib.bib26)\]\. Our setup differs from that setting: we keep the master parameters, optimizer states, and optimizer computations in FP32, and apply stochasticity only when materializing the bfloat16 copy used for training computation\.

We first recall the idealized scalar stochastic\-rounding operator\. Letℱbf16\\mathcal\{F\}\_\{\\mathrm\{bf16\}\}be the set of bfloat16\-representable values\. For a scalarxx, let

q−​\(x\)=max⁡\{q∈ℱbf16:q≤x\},q\+​\(x\)=min⁡\{q∈ℱbf16:q≥x\}\.q^\{\-\}\(x\)=\\max\\\{q\\in\\mathcal\{F\}\_\{\\mathrm\{bf16\}\}:q\\leq x\\\},\\qquad q^\{\+\}\(x\)=\\min\\\{q\\in\\mathcal\{F\}\_\{\\mathrm\{bf16\}\}:q\\geq x\\\}\.Nearest rounding mapsxxto the closest representable value,

Qnr​\(x\)=arg⁡minq∈ℱbf16⁡\|q−x\|\.Q\_\{\\mathrm\{nr\}\}\(x\)=\\arg\\min\_\{q\\in\\mathcal\{F\}\_\{\\mathrm\{bf16\}\}\}\|q\-x\|\.Stochastic rounding instead samples between the adjacent representable values:

Qsr​\(x\)=\{q\+​\(x\),with probability​x−q−​\(x\)q\+​\(x\)−q−​\(x\),q−​\(x\),with probability​q\+​\(x\)−xq\+​\(x\)−q−​\(x\)\.Q\_\{\\mathrm\{sr\}\}\(x\)=\\begin\{cases\}q^\{\+\}\(x\),&\\text\{with probability \}\\dfrac\{x\-q^\{\-\}\(x\)\}\{q^\{\+\}\(x\)\-q^\{\-\}\(x\)\},\\\\\[11\.99998pt\] q^\{\-\}\(x\),&\\text\{with probability \}\\dfrac\{q^\{\+\}\(x\)\-x\}\{q^\{\+\}\(x\)\-q^\{\-\}\(x\)\}\.\\end\{cases\}This gives

𝔼​\[Qsr​\(x\)∣x\]=x,\\mathbb\{E\}\\\!\\left\[Q\_\{\\mathrm\{sr\}\}\(x\)\\mid x\\right\]=x,so the rounding error has zero conditional mean in the ideal scalar case\. In contrast, nearest rounding introduces a deterministic quantization error

enr​\(x\)=Qnr​\(x\)−x,e\_\{\\mathrm\{nr\}\}\(x\)=Q\_\{\\mathrm\{nr\}\}\(x\)\-x,which can systematically remove updates whose magnitude is below the local bfloat16 resolution\.

Our implementation uses a dithered stochastic downcast rather than explicitly sampling the adjacent bfloat16 values\. LetΘt\\Theta\_\{t\}be the FP32 master parameters at optimization steptt\. For each parameter elementΘt,i\\Theta\_\{t,i\}, we define the local bfloat16 resolution scale

r​\(Θt,i\)=max⁡\(\|Θt,i\|​εbf16,tinyfp32\),r\(\\Theta\_\{t,i\}\)=\\max\\left\(\|\\Theta\_\{t,i\}\|\\,\\varepsilon\_\{\\mathrm\{bf16\}\},\\operatorname\{tiny\}\_\{\\mathrm\{fp32\}\}\\right\),whereεbf16\\varepsilon\_\{\\mathrm\{bf16\}\}is the bfloat16 machine epsilon andtinyfp32\\operatorname\{tiny\}\_\{\\mathrm\{fp32\}\}is the smallest positive normal FP32 value\. We then sample

ut,i∼𝒰​\(0,1\)u\_\{t,i\}\\sim\\mathcal\{U\}\(0,1\)and construct a zero\-mean perturbation

δt,i=αt​\(ut,i−12\)​r​\(Θt,i\)\.\\delta\_\{t,i\}=\\alpha\_\{t\}\\left\(u\_\{t,i\}\-\\frac\{1\}\{2\}\\right\)r\(\\Theta\_\{t,i\}\)\.The perturbation satisfies

𝔼​\[δt,i∣Θt,i\]=0,Var⁡\[δt,i∣Θt,i\]=αt2​\(r​\(Θt,i\)\)212\.\\mathbb\{E\}\[\\delta\_\{t,i\}\\mid\\Theta\_\{t,i\}\]=0,\\qquad\\operatorname\{Var\}\[\\delta\_\{t,i\}\\mid\\Theta\_\{t,i\}\]=\\frac\{\\alpha\_\{t\}^\{2\}\{\\left\(r\(\\Theta\_\{t,i\}\)\\right\)\}^\{2\}\}\{12\}\.The bfloat16 training copy is then materialized as

Wt,i=Qnr​\(Θt,i\+δt,i\)\.W\_\{t,i\}=Q\_\{\\mathrm\{nr\}\}\\\!\\left\(\\Theta\_\{t,i\}\+\\delta\_\{t,i\}\\right\)\.Equivalently, in vector form,

Wt=Qnr​\(Θt\+αt​r​\(Θt\)⊙\(ut−12\)\),ut∼\(𝒰​\(0,1\)\)\|Θt\|\.W\_\{t\}=Q\_\{\\mathrm\{nr\}\}\\left\(\\Theta\_\{t\}\+\\alpha\_\{t\}\\,r\(\\Theta\_\{t\}\)\\odot\\left\(u\_\{t\}\-\\frac\{1\}\{2\}\\right\)\\right\),\\qquad u\_\{t\}\\sim\{\\left\(\\mathcal\{U\}\(0,1\)\\right\)\}^\{\|\\Theta\_\{t\}\|\}\.
The stochastic strengthαt\\alpha\_\{t\}is tied to the learning rate\. Letηt\\eta\_\{t\}be the learning rate at steptt, letηmax\\eta\_\{\\max\}be the maximum learning rate, and let

ηfloor=7⋅10−7\.\\eta\_\{\\mathrm\{floor\}\}=7\\cdot 10^\{\-7\}\.We set

αt=clip⁡\(ηt−ηfloorηmax−ηfloor,0,1\)\.\\alpha\_\{t\}=\\operatorname\{clip\}\\left\(\\frac\{\\eta\_\{t\}\-\\eta\_\{\\mathrm\{floor\}\}\}\{\\eta\_\{\\max\}\-\\eta\_\{\\mathrm\{floor\}\}\},0,1\\right\)\.Thus, stochasticity is strongest at the maximum learning rate and is annealed as the learning rate decays\. Whenηt≤ηfloor\\eta\_\{t\}\\leq\\eta\_\{\\mathrm\{floor\}\}, we haveαt=0\\alpha\_\{t\}=0, so the downcast becomes the deterministic nearest bfloat16 cast:

Wt=Qnr​\(Θt\)\.W\_\{t\}=Q\_\{\\mathrm\{nr\}\}\(\\Theta\_\{t\}\)\.
The stochastic downcast is applied once per optimization step, when refreshing the bfloat16 training copy from the FP32 master parameters\. The random stream is keyed by both the current training step and the parameter identity\. For sharded parameters, the noise is generated using global data\-parallel and tensor\-parallel shard offsets, so the stochastic downcast is stable with respect to the distributed layout\.

The motivation is that the FP32 master parameters may receive updates that are meaningful in FP32 but too small to survive repeated deterministic materialization into bfloat16\. The stochastic downcast converts these sub\-resolution changes into zero\-mean rounding variability rather than always discarding them in the same direction\. Prior convergence analysis of stochastic rounding shows that, under the analyzed optimizer setting, stochastic rounding can provide more favorable quantization\-error behavior than nearest rounding\[[26](https://arxiv.org/html/2605.17064#bib.bib26)\]\. Empirically, our learning\-rate\-scaled stochastic downcast improves convergence compared to deterministic FP32\-to\-bfloat16 materialization\. We found the unscaled variant less stable, while annealing the stochasticity with the learning rate retained the convergence benefit and improved training stability\.

### 3\.6Generation Template and Constrained Decoding

At inference time, generation follows the same scaffold structure used during training\. Examples are serialized using a Llama\-3\-style instruction format\[[27](https://arxiv.org/html/2605.17064#bib.bib27)\], but the standarduserandassistantrole identifiers are replaced with component\-specific headers corresponding to the hierarchical generation scaffold\.

The model is trained and evaluated as a single\-turn generation system\. Each sample contains a single initial prompt with no system prompt and no interactive dialogue structure\. Although the generation process expands into multiple intermediate planning and prose components, these are represented as structured continuations of the original request rather than as conversational turns\.

During inference, decoding is constrained to the same structural template used during dataset construction\. We implement this using regex\-guided constrained decoding, which enforces the ordering of sections, component headers, and boundary markers\. The constraint mechanism governs only the structural form of the output and does not constrain the semantic or stylistic content of the generated prose\.

Generation proceeds hierarchically in the same order as the training representation described in Section[3\.2](https://arxiv.org/html/2605.17064#S3.SS2)\. The model first generates the book\-level plan, followed by the early first\-chapter planning stage and the first chapter itself\.

## 4Discussion and Limitations

The framework presented here focuses specifically on structured long\-form fiction generation\. While the hierarchical supervision pipeline improves tractability for book\-scale training, the intermediate representations remain approximations of narrative structure rather than complete literary analyses\. Some stylistic, thematic, and interpretive features of fiction are inevitably compressed during summarization and decomposition\.

The dataset is derived from public\-domain literature, providing a reproducible and legally accessible source of book\-length supervision\. However, this distribution does not fully represent the diversity of contemporary fiction, genres, languages, or cultural traditions\. Expanding both the corpus and the annotation pipeline remains an important direction for future work\.

In addition, several components of the training recipe are evaluated jointly rather than in isolation\. Future ablation studies would help better quantify the contribution of hierarchical planning, chapter decomposition, long\-context supervision, and low\-precision optimization techniques to overall generation quality and training stability\.

## 5Conclusion

We presented a hierarchical framework for book\-scale fiction generation that transforms novels into structured prompt\-to\-book training trajectories\. By supervising planning, chapter structure, scene decomposition, character information, style, and final prose jointly, the approach makes long\-range narrative structure explicit during training\.

Combined with long\-context training and efficient low\-precision optimization, the resulting pipeline provides a practical foundation for studying large\-scale creative\-writing models\. Future work should focus on broader datasets, stronger evaluation, and iterative revision\-based generation\.

## Acknowledgments

Research supported with Cloud TPUs from Google’s TPU Research Cloud \(TRC\)\.

## References

- \[1\]Kevin Yang, Yuandong Tian, Nanyun Peng, and Dan Klein\.*Re3*: Generating longer stories with recursive reprompting and revision\.In*Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing*, pages 4393–4479, Abu Dhabi, United Arab Emirates, 2022\.Association for Computational Linguistics\.[https://aclanthology\.org/2022\.emnlp\-main\.296/](https://aclanthology.org/2022.emnlp-main.296/)\.
- \[2\]Kevin Yang, Dan Klein, Nanyun Peng, and Yuandong Tian\.*DOC*: Improving long story coherence with detailed outline control\.In*Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics*, pages 3378–3465, Toronto, Canada, 2023\.Association for Computational Linguistics\.[https://aclanthology\.org/2023\.acl\-long\.190/](https://aclanthology.org/2023.acl-long.190/)\.
- \[3\]Qianyue Wang, Jinwu Hu, Zhengping Li, Yufeng Wang, Daiyuan Li, Yu Hu, and Mingkui Tan\.Generating long\-form story using dynamic hierarchical outlining with memory\-enhancement\.In*Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies*, pages 1352–1391, Albuquerque, New Mexico, 2025\.Association for Computational Linguistics\.[https://aclanthology\.org/2025\.naacl\-long\.63/](https://aclanthology.org/2025.naacl-long.63/)\.
- \[4\]Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L\. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe\.Training language models to follow instructions with human feedback\.In*Advances in Neural Information Processing Systems*, 2022\.[https://arxiv\.org/abs/2203\.02155](https://arxiv.org/abs/2203.02155)\.
- \[5\]Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan\.Plan\-and\-Write: Towards better automatic storytelling\.In*Proceedings of the Thirty\-Third AAAI Conference on Artificial Intelligence*, volume 33, number 1, pages 7378–7385, 2019\.[https://doi\.org/10\.1609/aaai\.v33i01\.33017378](https://doi.org/10.1609/aaai.v33i01.33017378)\.
- \[6\]Angela Fan, Mike Lewis, and Yann Dauphin\.Hierarchical neural story generation\.In*Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics*, pages 889–898, Melbourne, Australia, 2018\.Association for Computational Linguistics\.[https://aclanthology\.org/P18\-1082/](https://aclanthology.org/P18-1082/)\.
- \[7\]Angela Fan, Mike Lewis, and Yann Dauphin\.Strategies for structuring story generation\.In*Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics*, pages 2650–2660, Florence, Italy, 2019\.Association for Computational Linguistics\.[https://aclanthology\.org/P19\-1254/](https://aclanthology.org/P19-1254/)\.
- \[8\]Hannah Rashkin, Asli Celikyilmaz, Yejin Choi, and Jianfeng Gao\.*PlotMachines*: Outline\-conditioned generation with dynamic plot state tracking\.In*Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing*, pages 4274–4295, Online, 2020\.Association for Computational Linguistics\.[https://aclanthology\.org/2020\.emnlp\-main\.349/](https://aclanthology.org/2020.emnlp-main.349/)\.
- \[9\]Mark O\. Riedl and R\. Michael Young\.Narrative planning: Balancing plot and character\.*Journal of Artificial Intelligence Research*, 39:217–268, 2010\.[https://doi\.org/10\.1613/jair\.2989](https://doi.org/10.1613/jair.2989)\.
- \[10\]Lara J\. Martin, Prithviraj Ammanabrolu, Xinyu Wang, William Hancock, Shruti Singh, Brent Harrison, and Mark O\. Riedl\.Event representations for automated story generation with deep neural nets\.In*Proceedings of the Thirty\-Second AAAI Conference on Artificial Intelligence*, pages 868–875, 2018\.
- \[11\]Project Gutenberg\.*Project Gutenberg*\.[https://www\.gutenberg\.org/](https://www.gutenberg.org/)\.
- \[12\]Qwen Team\.Qwen3 Technical Report\.*arXiv preprint arXiv:2505\.09388*, 2025\.[https://arxiv\.org/abs/2505\.09388](https://arxiv.org/abs/2505.09388)\.
- \[13\]Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander Ratner, Ranjay Krishna, Jiaming Shen, and Chao Zhang\.Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias\.In*Advances in Neural Information Processing Systems*, 2023\.[https://arxiv\.org/abs/2306\.15895](https://arxiv.org/abs/2306.15895)\.
- \[14\]Ronen Eldan and Yuanzhi Li\.TinyStories: How Small Can Language Models Be and Still Speak Coherent English?*arXiv preprint arXiv:2305\.07759*, 2023\.[https://arxiv\.org/abs/2305\.07759](https://arxiv.org/abs/2305.07759)\.
- \[15\]Lennart Finke, Thomas Dooms, Mat Allen, Juan Diego Rodriguez, Noa Nabeshima, and Dan Braun\.Parameterized Synthetic Text Generation with SimpleStories\.*arXiv preprint arXiv:2504\.09184*, 2025\.[https://arxiv\.org/abs/2504\.09184](https://arxiv.org/abs/2504.09184)\.
- \[16\]Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, and Furu Wei\.Self\-Boosting Large Language Models with Synthetic Preference Data\.*arXiv preprint arXiv:2410\.06961*, 2024\.[https://arxiv\.org/abs/2410\.06961](https://arxiv.org/abs/2410.06961)\.
- \[17\]Alexander H\. Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, Alexandre Sablayrolles, Amélie Héliou, Amos You, Andy Ehrenberg, Andy Lo, Anton Eliseev, Antonia Calvi, Avinash Sooriyarachchi, Baptiste Bout, Baptiste Rozière, and others\.Ministral 3\.*arXiv preprint arXiv:2601\.08584*, 2026\.[https://arxiv\.org/abs/2601\.08584](https://arxiv.org/abs/2601.08584)\.
- \[18\]Mistral AI\.Ministral 3 14B Base\.Hugging Face model card, 2025\.[https://huggingface\.co/mistralai/Ministral\-3\-14B\-Base\-2512](https://huggingface.co/mistralai/Ministral-3-14B-Base-2512)\.
- \[19\]Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman\.Training verifiers to solve math word problems\.*arXiv preprint arXiv:2110\.14168*, 2021\.[https://arxiv\.org/abs/2110\.14168](https://arxiv.org/abs/2110.14168)\.
- \[20\]Fredrik Carlsson, Fangyu Liu, Daniel Ward, Murathan Kurfali, and Joakim Nivre\.The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open\-Ended Text Generation\.*arXiv preprint arXiv:2412\.04318*, 2024\.[https://arxiv\.org/abs/2412\.04318](https://arxiv.org/abs/2412.04318)\.
- \[21\]James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman\-Milne, and Qiao Zhang\.JAX: composable transformations of Python\+NumPy programs\.Software, 2018\.[https://github\.com/google/jax](https://github.com/google/jax)\.
- \[22\]Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, Alexander Heinecke, Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy, Bharat Kaul, and Pradeep Dubey\.A study of bfloat16 for deep learning training\.*arXiv preprint arXiv:1905\.12322*, 2019\.[https://arxiv\.org/abs/1905\.12322](https://arxiv.org/abs/1905.12322)\.
- \[23\]Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He\.ZeRO: Memory optimizations toward training trillion parameter models\.*arXiv preprint arXiv:1910\.02054*, 2020\.[https://arxiv\.org/abs/1910\.02054](https://arxiv.org/abs/1910.02054)\.
- \[24\]Hao Liu, Matei Zaharia, and Pieter Abbeel\.Ring Attention with Blockwise Transformers for Near\-Infinite Context\.*arXiv preprint arXiv:2310\.01889*, 2023\.[https://arxiv\.org/abs/2310\.01889](https://arxiv.org/abs/2310.01889)\.
- \[25\]Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang\.Muon is scalable for LLM training\.*arXiv preprint arXiv:2502\.16982*, 2025\.[https://arxiv\.org/abs/2502\.16982](https://arxiv.org/abs/2502.16982)\.
- \[26\]Kaan Ozkara, Tao Yu, and Youngsuk Park\.Stochastic Rounding for LLM Training: Theory and Practice\.In*Proceedings of the International Conference on Artificial Intelligence and Statistics*, 2025\.[https://arxiv\.org/abs/2502\.20566](https://arxiv.org/abs/2502.20566)\.
- \[27\]Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al\-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and others\.The Llama 3 Herd of Models\.*arXiv preprint arXiv:2407\.21783*, 2024\.[https://arxiv\.org/abs/2407\.21783](https://arxiv.org/abs/2407.21783)\.

Similar Articles

Summarizing books with human feedback

OpenAI Blog

OpenAI presents a scalable alignment technique using hierarchical summarization of entire books with human feedback, demonstrating how models can be trained to act in accordance with human intentions on complex, difficult-to-evaluate tasks.

PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

arXiv cs.AI

PlanningBench is a framework for generating scalable, diverse, and verifiable planning data to evaluate and train large language models, featuring a constraint-driven synthesis pipeline with adaptive difficulty control and quality filtering. Experiments show that frontier LLMs struggle with coupled constraints, and reinforcement learning on PlanningBench data improves performance on unseen planning tasks.