Losses that Cook: Topological Optimal Transport for Structured Recipe Generation

arXiv cs.CL Papers

Summary

This paper proposes topological optimal transport-based loss functions for improving structured recipe generation in language models, addressing the limitations of standard cross-entropy training by better handling ingredient composition, quantities, and procedural accuracy. The approach shows significant improvements on recipe-specific metrics with 62% human preference over baseline methods.

arXiv:2601.02531v2

# Topological Optimal Transport for Structured Recipe Generation
Source: https://arxiv.org/html/2601.02531

Mattia Ottoborgo¹, Daniele Rege Cambrin², Paolo Garza²
¹Trustpilot
²Politecnico di Torino
Correspondence: [email protected], {daniele.regecambrin,paolo.garza}@polito.it

## Abstract

Cooking recipes are complex procedures that require not only fluent and factual text, but also accurate timing, temperature, and procedural coherence, as well as the correct composition of ingredients. Standard training procedures are primarily based on cross-entropy and focus solely on fluency. Building on RECIPE-NLG, we investigate the use of several composite objectives and present a new topological loss that represents ingredient lists as point clouds in embedding space, minimizing the divergence between predicted and gold ingredients. Using both standard language generation metrics and recipe-specific metrics, we find that our loss significantly improves ingredient- and action-level metrics. Meanwhile, the Dice loss excels in time/temperature precision, and the mixed loss yields competitive trade-offs with synergistic gains in quantity and time. A human preference analysis supports our finding, showing our model is preferred in 62% of the cases.

## 1 Introduction

Generating usable cooking recipes with language models requires more than fluent text: models must produce ingredients, quantities, and step-by-step instructions that are factually correct, numerically plausible, and procedurally executable. In this setting, errors on a few key tokens (e.g., omitting "eggs" in carbonara pasta or doubling the cooking temperature) can render the entire recipe unusable, even if the text is fluent and the semantics are similar to the correct output (Bień et al. 2020; Liu et al. 2025).

Standard fine-tuning with cross-entropy (CE) is ill-suited to this challenge because it treats all tokens as equally important, despite a strong asymmetry between high-impact tokens (ingredients, quantities, times, temperatures, core actions) and low-impact connective words (Chen et al. 2024). This misalignment manifests in common failure modes: poor ingredient recall, inaccurate quantities, and instruction sequences that are syntactically plausible but procedurally incorrect. Existing work on recipe generation and structured text generation has largely relied on CE-based objectives, beam search, or schema-constrained decoding, but has not directly targeted the holistic composition of ingredient sets and numerical aspects of recipes through the training loss (Lam et al. 2024).

In parallel, some works have explored alternative or auxiliary objectives in other natural language processing (NLP) tasks, such as focal and dice losses (Rege Cambrin et al. 2024). These approaches suggest that rethinking the loss can steer models toward rare but important events or holistic set properties; however, they have not been systematically applied to structured recipe generation and do not exploit the inherent topology of recipes. Moreover, standard natural language generation (NLG) metrics, such as ROUGE (Lin 2004) and BERTScore (Zhang et al. 2020), capture fluency and semantic similarity but fail to directly measure whether a recipe accurately specifies the ingredients, quantities, and cooking parameters.

This work addresses these gaps by focusing on Small Language Models (SLMs) fine-tuned for structured recipe generation on a focused subset of the RECIPE-NLG corpus (Bień et al. 2020)—pasta, rice, and sandwiches. We introduce a topological loss that represents ingredient lists as point clouds in embedding space and minimizes a Sinkhorn divergence (Cuturi 2013) between predicted and gold ingredients, explicitly encoding ingredient-level structure beyond token-wise CE. We further investigate how combining our proposal with existing losses can balance ingredient structure with numerical accuracy. To evaluate these objectives, we design a recipe-specific metric suite, including ingredient recall, quantity precision, action and step edit distances, and time/temperature precision, in addition to standard text metrics.

Our experiments demonstrate that augmenting CE with the proposed topological loss substantially improves ingredient recall, quantity precision, and procedural accuracy over CE alone, while Dice-based losses excel in terms of time and temperature precision. The mixed loss yields well-rounded trade-offs and, in some cases, synergistic gains (e.g., in quantity and time precision), with many improvements over CE and single custom losses. Together, these results demonstrate that carefully designed loss functions can meaningfully improve structured recipe generation in SLMs without increasing model size or inference-time complexity. We release the code for reproducibility at https://github.com/DarthReca/losses-cook.

## 2 Methodology

This section presents the task, the dataset, the proposed loss, and the evaluation metrics.

### 2.1 Task

We formalize structured recipe generation as a mapping f: P_in → R_out, where P_in is a natural language prompt (e.g., "Generate a recipe for Pasta Carbonara") and R_out = {I, S} is a structured JavaScript Object Notation (JSON) output containing: (1) a list of ingredients I = {I_1, I_2, ..., I_n}, and (2) a list of instruction steps S = {S_1, S_2, ..., S_m} as shown in Appendix A. The objective is to learn f that satisfies multiple constraints simultaneously: fluency, factual correctness (appropriate ingredients), numerical accuracy (plausible quantities, times, temperatures), and procedural coherence (logical instruction sequences).
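As a rough illustration of the output structure R_out = {I, S}, the sketch below parses a JSON string into a small container and rejects malformed output. The key names `ingredients` and `steps`, the `Recipe` class, and the example recipe are hypothetical; the actual schema is the one given in Appendix A.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Recipe:
    ingredients: List[str]  # I = {I_1, ..., I_n}
    steps: List[str]        # S = {S_1, ..., S_m}


def parse_recipe(raw: str) -> Recipe:
    """Parse a JSON string into a Recipe, raising ValueError on bad structure.

    The key names "ingredients" and "steps" are assumptions for illustration.
    """
    data = json.loads(raw)
    ingredients = data.get("ingredients")
    steps = data.get("steps")
    if not isinstance(ingredients, list) or not isinstance(steps, list):
        raise ValueError("expected list-valued 'ingredients' and 'steps'")
    if not all(isinstance(x, str) for x in ingredients + steps):
        raise ValueError("every ingredient and step must be a string")
    return Recipe(ingredients=ingredients, steps=steps)


# Hypothetical model output for the prompt "Generate a recipe for Pasta Carbonara".
example = json.dumps({
    "ingredients": ["200 g spaghetti", "2 eggs"],
    "steps": ["Boil the pasta for 10 minutes.", "Mix in the eggs off the heat."],
})
recipe = parse_recipe(example)
```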

### 2.2 Dataset

We use a subset of 5,000 recipes from RECIPE-NLG (Bień et al. 2020), focusing on pasta, rice, and sandwich dishes to ensure distributional consistency between the training and test sets (knowing how to cook rice does not necessarily enable you to prepare a good steak). To improve domain-specific learning, we augment the dataset with 235 manually curated cooking questions covering: missing ingredient identification, ingredient substitution, recipe scaling, quantity reasoning, time estimation, and temperature specification. These questions teach the model critical relationships between ingredients and numerical reasoning required for recipe generation. More details and examples are reported in Appendix A.

### 2.3 Loss Functions

Standard Cross-Entropy (CE) minimizes L_CE = -log p_c, where p_c is the model's predicted probability for the correct token c. We establish CE-only fine-tuning as our primary baseline. We also evaluate Focal Loss (Lin et al. 2017), which down-weights easy examples via a modulating factor (1 - p_t)^γ to address token frequency imbalance, and Dice Loss (Sudre et al. 2017), which optimizes set-level overlap using a differentiable Dice coefficient to encourage correct token sets.
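To make the three baseline objectives concrete, here is a minimal per-token sketch in plain Python. It assumes the focal default γ = 2 and a common soft Dice form, 1 − 2Σ(p·y) / (Σp + Σy); the exact Dice variant, smoothing constant, and γ used in the experiments are not stated in this section, so treat these choices as assumptions.

```python
import math
from typing import Sequence


def cross_entropy(p_correct: float) -> float:
    """L_CE = -log p_c for a single token."""
    return -math.log(p_correct)


def focal_loss(p_correct: float, gamma: float = 2.0) -> float:
    """CE scaled by the modulating factor (1 - p_t)^gamma (gamma is assumed)."""
    return (1.0 - p_correct) ** gamma * cross_entropy(p_correct)


def soft_dice_loss(probs: Sequence[float], targets: Sequence[float],
                   eps: float = 1e-8) -> float:
    """One common differentiable Dice form (an assumption, not the paper's exact one).

    probs:   predicted probabilities for each candidate token
    targets: 1.0 where the token belongs to the gold set, else 0.0
    """
    intersection = sum(p * y for p, y in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


easy_ce = cross_entropy(0.9)        # confident, correct token
easy_focal = focal_loss(0.9)        # same token, down-weighted by (0.1)^2
perfect_dice = soft_dice_loss([1.0, 0.0], [1.0, 0.0])
wrong_dice = soft_dice_loss([0.0, 1.0], [1.0, 0.0])
```

On the easy token, the focal term is a hundredth of the CE term, which is the down-weighting effect described above; the Dice term depends only on set-level overlap rather than on per-token likelihood.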

#### 2.3.1 Topological Loss

Our main contribution is a topological loss that operates in embedding space to capture the structural coherence of the ingredient section. The key insight is that token sequences representing semantically similar ingredient lists should form geometrically similar shapes in embedding space. Cross-entropy, by contrast, treats all substitutions equally regardless of semantic proximity.

We construct two point clouds in embedding space from all tokens within the ingredient section: (1) For the predicted recipe, we generate soft probabilistic embeddings by applying softmax to output logits P = softmax(logits) and computing weighted embedding averages emb_soft = P · E, where E is the model's token embedding matrix as shown in Figure 1. (2) For the ground truth, we perform standard embedding lookups.

**Figure 1:** Soft embedding computation: output logits z are converted into token probabilities p using softmax, which are used to compute a differentiable weighted average over the model's embedding matrix E to create emb_soft = p · E.
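The soft embedding step can be sketched in a few lines of plain Python. The toy vocabulary of three tokens with two-dimensional embeddings is invented for illustration; in practice E would be the model's embedding matrix and the computation would run on tensors.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(logits: Sequence[float],
                   embedding_matrix: Sequence[Sequence[float]]) -> List[float]:
    """emb_soft = p · E: probability-weighted average of the embedding rows."""
    p = softmax(logits)
    dim = len(embedding_matrix[0])
    return [
        sum(p[v] * embedding_matrix[v][d] for v in range(len(p)))
        for d in range(dim)
    ]


# Toy embedding matrix: 3 tokens, 2 dimensions (hypothetical values).
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

peaked = soft_embedding([10.0, 0.0, 0.0], E)   # close to row 0 of E
uniform = soft_embedding([0.0, 0.0, 0.0], E)   # plain mean of all rows
```

A confident prediction lands near the embedding of a single token, while an uncertain one lands between several embeddings, and both cases remain differentiable with respect to the logits.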

The loss then measures the geometric dissimilarity between these clouds using Sinkhorn divergence S_ε (Cuturi 2013; Cuturi and Peyré 2016), a differentiable approximation of optimal transport (Wasserstein) distance:

L_Topo = S_ε(PC_pred, PC_target)

where ε is the entropic regularization parameter. This encourages the model to generate ingredient lists that are both semantically and structurally coherent in the embedding space, not just token-wise accurate, as shown in Figure 2.

**Figure 2:** The loss aligns the ground truth (black) and predicted (blue) tokens in embedding space. Shared tokens like "flour" (black dots with blue halos) have zero transport cost. The loss minimizes the transport distance for divergent tokens, penalizing semantic shifts (e.g., "salt" → "pepper") and structural deviations (e.g., "egg" → "eggs").
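The sketch below illustrates a Sinkhorn divergence between two small point clouds in plain Python. It makes several assumptions that this section does not confirm: uniform point weights, a squared Euclidean cost, ε = 0.5, a fixed number of iterations, and the debiased form S_ε(X, Y) = W_ε(X, Y) − ½W_ε(X, X) − ½W_ε(Y, Y), where W_ε is the transport cost of the entropy-regularized plan. The 2-D points are hypothetical stand-ins for token embeddings.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(x: Point, y: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, y))


def entropic_ot_cost(X: Sequence[Point], Y: Sequence[Point],
                     eps: float = 0.5, n_iters: int = 200) -> float:
    """Transport cost <P, C> of the entropy-regularized plan (uniform weights)."""
    n, m = len(X), len(Y)
    a, b = [1.0 / n] * n, [1.0 / m] * m
    C = [[sq_dist(x, y) for y in Y] for x in X]
    K = [[math.exp(-c / eps) for c in row] for row in C]  # Gibbs kernel
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):  # Sinkhorn scaling iterations
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return sum(u[i] * K[i][j] * v[j] * C[i][j]
               for i in range(n) for j in range(m))


def sinkhorn_divergence(X: Sequence[Point], Y: Sequence[Point],
                        eps: float = 0.5) -> float:
    """Debiased divergence: zero when both clouds coincide."""
    return (entropic_ot_cost(X, Y, eps)
            - 0.5 * entropic_ot_cost(X, X, eps)
            - 0.5 * entropic_ot_cost(Y, Y, eps))


gold: List[List[float]] = [[0.0, 0.0], [1.0, 0.0]]
near: List[List[float]] = [[0.0, 0.1], [1.0, 0.1]]   # small semantic shift
far: List[List[float]] = [[0.0, 2.0], [1.0, 2.0]]    # large semantic shift

d_same = sinkhorn_divergence(gold, gold)
d_near = sinkhorn_divergence(gold, near)
d_far = sinkhorn_divergence(gold, far)
```

The plain scaling iterations used here can underflow for very small ε or very large costs; log-domain updates are the usual remedy and are omitted to keep the sketch short.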

### 2.4 Evaluation Metrics

To comprehensively assess recipe quality, we combine standard NLG metrics with recipe-specific measures. We report ROUGE-1 (R1) and BERTScore F1 (BS) to measure linguistic fluency and semantic coherence. Since factual and procedural correctness are paramount, we also introduce ad hoc metrics. Ingredient Recall (IR) is the fraction of ground-truth ingredients correctly generated; Quantity Precision (QP) is the accuracy of numerical quantities for correctly recalled ingredients; Action Precision (AP) is the precision of key cooking verbs (e.g., boil, fry, sauté) in generated instructions; Action (AD) and Step (SD) Edit Distances are the Levenshtein distances between sequences of cooking actions and between full instruction steps with times/temperatures, respectively, measuring procedural correctness; Time (TiP) and Temperature (TeP) Precision are the accuracies of time durations and temperatures mentioned in instructions. Additional details on the metric computation are provided in Appendix B.
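Two of these metrics can be sketched as follows. The matching rule (exact match after lowercasing and trimming) and the unnormalized edit distance are simplifying assumptions; the actual extraction and normalization procedures are those described in Appendix B.

```python
from typing import List, Sequence


def ingredient_recall(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of gold ingredients that also appear in the prediction.

    Assumes exact string match after lowercasing and trimming whitespace.
    """
    gold_set = {g.lower().strip() for g in gold}
    if not gold_set:
        return 0.0
    pred_set = {p.lower().strip() for p in predicted}
    return len(gold_set & pred_set) / len(gold_set)


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of items (e.g., cooking verbs)."""
    prev: List[int] = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


ir = ingredient_recall(["Eggs", "pasta"], ["eggs", "pasta", "guanciale"])
ad = edit_distance(["boil", "fry", "mix"], ["boil", "mix"])
```

Here the prediction recovers two of three gold ingredients, and the predicted action sequence differs from the gold one by a single extra action.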

## 3 Experiments

In this section, we present the experimental settings and results.

### 3.1 Experimental Settings

We fine-tune a pre-trained Qwen3-4B (Yang et al. 2025) model with Low-Rank Adaptation (LoRA) (Hu et al. 2022) and the AdamW optimizer. More training details are provided in Appendix C. All custom losses (i.e., Dice, focal, and topological) are combined with cross-entropy (CE) as a composite objective to maintain linguistic fluency while enhancing domain-specific correctness. In all composite setups with a single custom loss, the objective is L = 0.6L_CE + 0.4L_custom, with weights chosen empirically to preserve fluency while amplifying the signal on critical tokens. We also train a mixed-loss configuration with L = 0.6L_CE + 0.2L_Dice + 0.2L_Topo. The same augmented dataset is used for all fine-tuning conditions. We compare our fine-tuned models against a commercial model (Gemini 2.0 Flash) and a larger version of Qwen3 with 14B parameters. To assess whether the observed trends generalize across architectures and parameter scales, we additionally evaluate SmolLM3-3B and Qwen2.5-1.5B; those results are reported in Appendix D.
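The weighting of the composite objectives amounts to a simple weighted sum, sketched below with scalar stand-ins for the per-batch loss values. The function names and the example loss values are hypothetical; only the weights come from the text.

```python
def composite_loss(ce: float, custom: float,
                   w_ce: float = 0.6, w_custom: float = 0.4) -> float:
    """L = 0.6 * L_CE + 0.4 * L_custom for a single custom loss."""
    return w_ce * ce + w_custom * custom


def mixed_loss(ce: float, dice: float, topo: float) -> float:
    """L = 0.6 * L_CE + 0.2 * L_Dice + 0.2 * L_Topo for the mixed configuration."""
    return 0.6 * ce + 0.2 * dice + 0.2 * topo


# Hypothetical loss values for one batch.
single = composite_loss(ce=1.0, custom=0.5)
mixed = mixed_loss(ce=1.0, dice=0.5, topo=2.0)
```

Because the component losses can live on different numeric scales, the effective influence of each term depends on its magnitude as well as on its weight.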

### 3.2 Quantitative Evaluation

| Model | R1↑ | BS↑ | AP↑ | QP↑ | IR↑ | TeP↑ | TiP↑ | AD↓ | SD↓ |
|---|---|---|---|---|---|---|---|---|---|
| No-FT Gemini 2.0 | 15.08 | 47 | 15.08 | 47.88 | 88.50 | 88.50 | 43.80 | 43.80 | 44.51 |
| Qwen3-14B | 25.23 | 93 | 25.23 | 93 | 85.69 | 85.69 | 42.12 | 42.12 | 44.51 |
| Qwen3-4B | 22.49 | 46 | 22.49 | 46 | 87.93 | 87.93 | 32.40 | 32.40 | 25.09 |
| **Qwen3-4B FT** | | | | | | | | | |
| CE | 27.30 | 48 | 27.30 | 48 | 88.78 | 88.78 | 45.09 | 45.09 | 50.94 |
| Focal | 26.09 | 49 | 26.09 | 49 | 89.94 | 89.94 | 41.09 | 41.09 | 54.94 |
| Dice | 29.87 | 44 | 29.87 | 44 | 90.49 | 90.49 | 50.59 | 50.59 | 57.44 |
| Topological | 30.39 | 85 | 30.39 | 85 | 90.97 | 90.97 | 59.68 | 59.68 | 63.93 |
| Topo+Dice | 31.90 | 45 | 31.90 | 45 | 90.99 | 90.99 | 57.59 | 57.59 | 65.09 |

**Table 1:** Results for fine-tuned Qwen3-4B and pre-trained models using ROUGE-1 (R1), BERTScore (BS), Action Precision (AP), Quantity Precision (QP), Ingredient Recall (IR), Temperature Precision (TeP), Time Precision (TiP), Action Distance (AD), Step Distance (SD). In bold the top performance, and in italics the second best. FT = Fine-Tuned; No-FT = Non Fine-Tuned. Results on additional architectures (SmolLM3-3B, Qwen2.5-1.5B) are reported in Appendix D.

As shown in Table 1, strong pre-trained instruction-tuned LLMs (Gemini 2.0, Qwen3-14B, and Qwen3-4B) underperform our fine-tuned models on both general NLG metrics (R1, BS) and on recipe-specific measures. While Qwen3-14B improves over Gemini in R1 and yields the best temperature precision (TeP), both models exhibit substantially weaker action/ingredient grounding (e.g., lower AP and IR) and larger procedural divergences (AD, SD) than the fine-tuned objectives. This suggests that general conversational competence does not directly translate into executable, constraint-satisfying recipe generation, highlighting the importance of domain adaptation for maintaining ingredient coverage and step-level alignment.

Across fine-tuned configurations, even pure cross-entropy (CE) yields large improvements over the pre-trained baselines in AP, QP, IR, TiP, and TeP, while also reducing AD and SD, indicating better procedural faithfulness to the ground-truth recipes. Among composite objectives, focal loss slightly improves BS and IR over CE, but lags behind Dice and Topological losses on most task-specific metrics, indicating that reweighting difficult tokens alone is insufficient to enforce fine-grained culinary constraints (quantities, times, and action sequences). Consistent with this
