
# OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control
Source: [https://arxiv.org/html/2605.08516](https://arxiv.org/html/2605.08516)

Darryl C. Jacob (dzj0055@auburn.edu), Department of Computer Science and Software Engineering, Auburn University; Xinyu Liu (xil0004@auburn.edu), Department of Computer Science and Software Engineering, Auburn University; Muchao Ye (muchao-ye@uiowa.edu), Department of Computer Science, University of Iowa; Xiaoyong Yuan (xiaoyon@clemson.edu), Department of Electrical and Computer Engineering, Clemson University; Pan He (pan.he@auburn.edu), Department of Computer Science and Software Engineering, Auburn University

###### Abstract

Transparent decision-making is essential for traffic signal control (TSC) systems to earn public trust. However, traditional reinforcement learning–based TSC methods function as black boxes, providing little to no insight into their decisions. Although large language models (LLMs) could provide the needed interpretability through natural language reasoning, they face challenges such as limited memory and difficulty in deriving optimal policies from sparse environmental feedback. Existing TSC methods that apply reinforcement fine-tuning to LLMs face notable training instability and deliver only limited improvements over pretrained models. We attribute this instability to the long-horizon nature of TSC: feedback is sparse and delayed, most control actions yield only marginal changes in congestion metrics, and the resulting weak reward signals interact poorly with policy-gradient optimization. We introduce OracleTSC, which addresses these issues through: (1) a reward hurdle mechanism that filters weak learning signals by subtracting a calibrated threshold from environmental feedback, and (2) an uncertainty regularizer that prevents policy degeneracy by maximizing the probability of the chosen answer, promoting consistent decision-making across multiple responses. Experiments on the standard LibSignal benchmark demonstrate that our approach enables a compact model (LLaMA3-8B) to achieve substantial improvements in traffic flow, with a 75% reduction in travel time and a 67% decrease in queue length over the pretrained baseline while preserving interpretability through natural language explanations. Furthermore, the method exhibits strong cross-intersection generalization: a policy trained on one intersection transfers to a structurally distinct intersection with 17% lower travel time and 39% lower queue length, all without any additional finetuning for the target topology. These findings show that uncertainty-aware reward shaping could stabilize reinforcement fine-tuning and provide a new perspective for improving its effectiveness in TSC tasks.

## 1 Introduction

Most reinforcement fine-tuning on large language models (LLMs) is limited to short-horizon tasks such as question answering and summarization Cobbe et al. ([2021](https://arxiv.org/html/2605.08516#bib.bib4)); Hendrycks et al. ([2021](https://arxiv.org/html/2605.08516#bib.bib5)); Rein et al. ([2024](https://arxiv.org/html/2605.08516#bib.bib6)), where the outcomes of model actions are immediately observable or verifiable, thereby simplifying the credit assignment problem. In contrast to short-horizon tasks, long-horizon RL problems—such as optimizing traffic signals over extended periods Zhang et al. ([2019](https://arxiv.org/html/2605.08516#bib.bib9)); Mei et al. ([2024](https://arxiv.org/html/2605.08516#bib.bib7)); He et al. ([2024](https://arxiv.org/html/2605.08516#bib.bib8))—pose unique challenges. Although recent pioneering studies have explored applying LLMs to traffic signal control (TSC) Wang et al. ([2024a](https://arxiv.org/html/2605.08516#bib.bib11)); Lai et al. ([2025b](https://arxiv.org/html/2605.08516#bib.bib10)), these methods often rely on external components—such as pretrained critic models for policy learning and trajectory filtering, or ensembles with conventional TSC algorithms for assisted decision-making. Such dependencies limit the TSC system's autonomy and scalability and arise from practical constraints: LLMs applied to TSC must make decisions whose long-term utility becomes evident only after many timesteps. However, eliciting such long-horizon behavior using gold-standard Chain-of-Thought (CoT) Wei et al. ([2022b](https://arxiv.org/html/2605.08516#bib.bib14)) annotations is impractical, as the state–action space expands exponentially with simulation length.

![Refer to caption](https://arxiv.org/html/2605.08516v1/figures/architecture.png)

Figure 1: Overview of the OracleTSC framework. Traffic states are converted into prompts for the LLM policy, which generates candidate signal phases from $G$ sampled responses. The proposed Reward Hurdle Mechanism (RHM) emphasizes high-impact actions, while Uncertainty Regularization penalizes high-entropy responses. Both objectives are optimized jointly under the PPO framework, producing refined policies for adaptive traffic signal control.

Key Observations and Insights. While Proximal Policy Optimization (PPO) is a natural framework for long-horizon TSC, we find that it struggles to produce consistent gains when optimizing LLM-based policies under realistic traffic dynamics, consistent with recent observations in LLM-driven TSC Lai et al. ([2025b](https://arxiv.org/html/2605.08516#bib.bib10)). Our empirical analysis further identifies two recurring failure modes that are especially pronounced.

(1) Weak reward signals reinforce suboptimal actions. In TSC, most feasible phase changes lead to only small improvements—or slight deteriorations—in congestion measures such as average queue length. As a result, reward signals are weak and often masked by randomness, producing very small policy-gradient updates. Instead of strengthening the few actions that truly help, PPO tends to spread learning across many actions with little impact. This problem is made worse by delayed traffic responses and strong temporal dependence: the effect of a signal change may appear only many steps later, making it difficult to assign credit correctly. Consequently, learning progresses slowly and may even stall, despite extended training.

(2) Reasoning drift under output uncertainty. We further observe that when the LLM policy is uncertain, its generated explanations can differ widely across responses for the same traffic state, resulting in inconsistent phase choices. We refer to this behavior as reasoning drift. It injects randomness directly into the action selection process and allows errors to accumulate over time. Notably, producing longer or more detailed reasoning does not resolve this issue. Stable decisions emerge only when the model's output becomes focused on a single action. In practice, this appears as high variability across responses during early training and is reduced only when uncertainty is explicitly penalized.

Together, these observations highlight challenges that are not typically encountered in short-horizon or rule-based reinforcement learning tasks, but are central in TSC systems, where rewards are sparse, delayed, and weakly informative, and where action consistency across time is critical for stable control.

To address these limitations, we introduce OracleTSC, a traffic signal control framework that uses a reward hurdle to filter out low-impact updates and a Softmax Discrete Semantic Entropy regularizer to stabilize reasoning. Together, these components produce more robust and consistent control policies with clearer, more reliable explanations across varied intersection settings. We evaluate OracleTSC on LibSignal Mei et al. ([2024](https://arxiv.org/html/2605.08516#bib.bib7)), the standard benchmark for TSC. Results show that adjusting rewards for time-varying inflows under a hurdle rate, combined with minimizing semantic entropy (measured as the presence of multiple distinct answer modes under stochastic token generation), improves both control performance and the consistency of the model's explanations.

Contributions. To summarize, our contributions are as follows:

- While prior studies have highlighted the challenges of finetuning PPO for long-horizon TSC tasks with time-varying traffic patterns, we identify insufficient suppression of suboptimal actions as a key factor underlying this difficulty. To address this issue, we introduce a simple penalty that imposes an explicit performance threshold, penalizing actions that fall below a baseline. By suppressing low-quality policy updates, the introduced Reward Hurdle Mechanism (RHM) amplifies the learning signal from high-impact actions, leading to substantial performance gains.
- We propose a principled approach to quantifying and mitigating uncertainty across responses by introducing a semantic entropy reward via a temperature-scaled softmax as a regularization term in policy optimization. This uncertainty-aware mechanism curbs drift in long-horizon reasoning and enhances TSC performance across multiple metrics, including shorter queue lengths and reduced travel times.
- We show that OracleTSC not only enhances policy stability but also delivers consistent performance gains across diverse model scales, effectively shortening queue lengths and improving the consistency of generated explanations. The model also demonstrates strong cross-intersection generalization, performing well when trained on one intersection and tested on a structurally distinct new intersection.

## 2 Related Work

Traffic Signal Control. Early TSC systems relied on rule-based heuristics and fixed-time schedules that were unable to adapt to time-varying traffic patterns Board et al. ([2015](https://arxiv.org/html/2605.08516#bib.bib48)); Martinez et al. ([2011](https://arxiv.org/html/2605.08516#bib.bib51)). Classical algorithms such as Max Pressure (MP) control Varaiya ([2013](https://arxiv.org/html/2605.08516#bib.bib49)) improved throughput by activating signal phases proportional to weighted queue differences, yet remained limited to local optimization.

Deep reinforcement learning (DRL) introduced adaptive decision-making based on learning from experience. IntelliLight Wei et al. ([2018](https://arxiv.org/html/2605.08516#bib.bib64)) pioneered this direction by using real-world video-derived data to train deep Q-networks, while PressLight Wei et al. ([2019a](https://arxiv.org/html/2605.08516#bib.bib55)) extended MP theory into its reward function to guide efficient signal control. Subsequent work improved generalization and coordination: AdLight Wang et al. ([2023](https://arxiv.org/html/2605.08516#bib.bib60)) leveraged movement-level augmentation to encode fine-grained vehicle interactions, and Wang et al. ([2024b](https://arxiv.org/html/2605.08516#bib.bib61)); Jiang et al. ([2024](https://arxiv.org/html/2605.08516#bib.bib63)) proposed junction-matrix and topology-mapping methods that enabled zero-shot transfer across heterogeneous intersections. Other advances emphasized robustness and sample efficiency—FuzzyLight Li et al. ([2025a](https://arxiv.org/html/2605.08516#bib.bib72)) used fuzzy logic to handle noisy sensors, while DreamerV3 Li et al. ([2025b](https://arxiv.org/html/2605.08516#bib.bib71)) incorporated world models to perform latent rollouts, reducing environment simulation costs. Importantly, our objective is not to claim superiority over all existing RL algorithms or fully optimized black-box RL systems. Highly optimized, non-explainable RL controllers remain strong on several benchmarks. Rather, our goal is to improve the performance, stability, and reliability of explainable LLM-based controllers while preserving natural-language reasoning and semantic-level uncertainty modeling.

Recent work has begun exploring the use of LLMs to overcome these limitations through explicit reasoning and explainable decision-making. LLMLight Lai et al. ([2025a](https://arxiv.org/html/2605.08516#bib.bib47)) demonstrated that prompting pretrained LLMs with structured traffic states enables natural-language explanations of phase selection, while VLMLight Wang et al. ([2025](https://arxiv.org/html/2605.08516#bib.bib67)) extended this to multimodal, vision-language reasoning for safety-critical intersections. Yet these approaches rely primarily on zero-shot prompting or supervised imitation, lacking the closed-loop optimization needed for long-horizon policy improvement. Weak or noisy reward signals slow convergence, allowing suboptimal reasoning to overwhelm beneficial actions in long-horizon tasks. Standard PPO stabilizers, such as clipped ratios, reward normalization, and advantage scaling, alleviate but do not eliminate training drift, leaving LLM-based TSC vulnerable to unstable gradients under weak and delayed rewards. This exposes a shared challenge between control variance and epistemic uncertainty. Motivated by this connection, we treat uncertainty not merely as a diagnostic signal but as a direct optimization objective. While PPO and reward shaping are well established in reinforcement learning, our contribution lies in showing that targeted reward thresholding and uncertainty regularization can substantially improve the stability of LLM-based traffic signal control.

Beyond decision-making and control, recent work has explored using LLMs for automatic algorithm design and heuristic discovery. A recent systematic survey Liu et al. ([2026](https://arxiv.org/html/2605.08516#bib.bib107)) highlights how LLMs can generate executable algorithms, code-based solutions, and structured reasoning processes that improve both interpretability and responsiveness. Similarly, Evolution of Heuristics Liu et al. ([2024](https://arxiv.org/html/2605.08516#bib.bib108)) demonstrates that LLMs can iteratively refine algorithmic strategies, enabling efficient automatic design of problem-specific solvers.

These approaches focus primarily on synthesizing or evolving discrete algorithmic procedures. In contrast, traffic signal control requires stable policy optimization under weak, delayed reward feedback in a stochastic control environment. Our work is complementary: rather than designing new symbolic algorithms, we study how to directly stabilize reinforcement fine-tuning of LLM-based controllers, enabling autonomous, interpretable decision-making without reliance on external critics or hybrid ensembles.

Uncertainty Quantification and Reduction. A parallel line of research focuses on improving reasoning reliability and trustworthiness in LLMs through uncertainty quantification and entropy minimization. Unsupervised fine-tuning frameworks such as EMPO Zhang et al. ([2025](https://arxiv.org/html/2605.08516#bib.bib74)) and RENT Prabhudesai et al. ([2025](https://arxiv.org/html/2605.08516#bib.bib86)) remove the need for labeled reward models by minimizing entropy at the answer and token levels, respectively, enabling confidence-driven optimization. Building on CoT prompting Wei et al. ([2022a](https://arxiv.org/html/2605.08516#bib.bib75)), UnCert-CoT Zhu et al. ([2025](https://arxiv.org/html/2605.08516#bib.bib76)) leverages token-level entropy signals to trigger additional reasoning steps, while Uncertainty of Thoughts Hu et al. ([2024](https://arxiv.org/html/2605.08516#bib.bib77)) models the reasoning process as a decision tree in which the LLM generates self-queries and selects the branch that maximizes entropy reduction in the final answer. Semantic entropy Kuhn et al. ([2023](https://arxiv.org/html/2605.08516#bib.bib90)); Farquhar et al. ([2024](https://arxiv.org/html/2605.08516#bib.bib91)) further refines these principles by exploiting bidirectional entailment between reasoning paths and answer candidates as a proxy for the true entropy over semantic classes. This approach links semantic entropy values to the probability of model error, providing a theoretically grounded measure of confidence. Meanwhile, Kernel Language Entropy (KLE) Nikitin et al. ([2024](https://arxiv.org/html/2605.08516#bib.bib78)) generalizes this concept by computing von Neumann entropy over a kernelized Laplacian of a semantic graph, where edge weights represent the degree of entailment between responses. Collectively, these approaches reveal a strong coupling between accuracy and confidence Ye et al. ([2025](https://arxiv.org/html/2605.08516#bib.bib79)), motivating entropy reduction as a core regularization objective in reasoning-focused training.

Across both domains, OracleTSC attempts to unify advances in TSC and uncertainty-aware reasoning. Its RHM subtracts a fixed threshold from rewards to filter weak updates, reducing gradient variance and stabilizing long-horizon PPO optimization. Complementing this, the Softmax-with-Temperature Discrete Semantic Entropy regularizer extends entropy-minimization methods such as EMPO Zhang et al. ([2025](https://arxiv.org/html/2605.08516#bib.bib74)) and RENT Prabhudesai et al. ([2025](https://arxiv.org/html/2605.08516#bib.bib86)) to sequential control by adding an uncertainty-based bonus to an environmental reward. This coupling of reward filtering and entropy-based feedback grounds semantic consistency in RL, yielding interpretable, confidence-calibrated policies that improve stability and long-horizon performance beyond prior LLM-based TSC systems.

## 3 Method

To address TSC with LLMs, we begin with a key observation: although general-purpose LLMs demonstrate strong natural language reasoning capabilities, they lack the precision and domain awareness required for TSC decision-making. To bridge this gap, we develop a domain-specific reinforcement fine-tuning framework that enables LLMs to acquire optimal phase-selection policies through direct interaction with simulated traffic environments.

We follow a variant of PPO Schulman et al. ([2017](https://arxiv.org/html/2605.08516#bib.bib12)), modified to preserve the LLM's instruction-following ability. Central to our approach is the idea that minimizing uncertainty and suppressing subpar actions enhances both reasoning quality and policy performance.

### 3.1 Task Formulation

Given a TSC environment, we represent traffic states, reasoning, and control actions as natural language text and formulate LLM-based TSC at timestep $t$ as a Markov Decision Process (MDP) $(\mathcal{S},\mathcal{A},P,r,\gamma)$.

The state space $\mathcal{S}\subseteq[1,V]^{L}$ consists of token sequences representing traffic intersection states, where $V$ is the vocabulary size and $L$ is the maximum sequence length. The action space $\mathcal{A}\subseteq[1,V]$ contains the next candidate token generated by the policy.

For a maximum response length $O$, the token-level state evolves as

$$
s^{t}_{l+1}=\begin{cases} s^{t}_{0}, & l=0,\\ \operatorname{concat}\left(s^{t}_{l},a^{t}_{l}\right), & 1\leq l<O,\end{cases}
$$

where $l$ indexes a generated output token. Here, $s^{t}_{0}$ denotes the initial token sequence formed by concatenating the system prompt with the tokenized traffic state, and $\operatorname{concat}(\cdot)$ denotes token concatenation subject to the length constraint $O$. The system prompt specifies the task instructions, the output format for phase selection, and the current as well as the two most recent traffic states and corresponding actions (see Listing [3](https://arxiv.org/html/2605.08516#LST3)). The traffic state reports the number of vehicles in each lane. Following Lai et al. ([2025b](https://arxiv.org/html/2605.08516#bib.bib10)), each lane is subdivided into four portions based on vehicle position and motion. The first portion consists of early queued vehicles, defined as vehicles traveling below 0.1 m/s (i.e., waiting at the intersection). The second portion includes vehicles within 10% of the road length from the signal. The third portion contains vehicles located between 10% and 33% of the road length from the signal. The fourth portion comprises vehicles that are more than 33% of the road length away from the signal.
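To make the textual state concrete, the following is a minimal sketch of how per-lane vehicle observations could be bucketed into the four portions described above. The function names, data layout, and the mutually exclusive bucketing are illustrative assumptions rather than the authors' released preprocessing code.

```python
# Illustrative sketch: bucket vehicles in one lane into the four portions
# described above (queued, <=10%, 10-33%, >33% of road length from the signal).
# Data layout and function names are assumptions for exposition only.

def lane_portions(vehicles, road_length, queue_speed=0.1):
    """vehicles: list of (distance_to_signal_m, speed_mps) tuples for one lane."""
    queued = near = mid = far = 0
    for dist, speed in vehicles:
        if speed < queue_speed:            # early queued vehicles (waiting)
            queued += 1
        elif dist <= 0.10 * road_length:   # within 10% of road length
            near += 1
        elif dist <= 0.33 * road_length:   # between 10% and 33%
            mid += 1
        else:                              # more than 33% away
            far += 1
    return queued, near, mid, far

def state_to_text(lanes, road_length):
    """Render all lanes as a lane-wise description fed into the LLM prompt."""
    lines = []
    for name, vehicles in lanes.items():
        q, n, m, f = lane_portions(vehicles, road_length)
        lines.append(f"{name}: queued={q}, within 10%={n}, 10-33%={m}, beyond 33%={f}")
    return "\n".join(lines)
```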

The procedure for extracting the selected phase from the output tokens is outlined in Algorithm [2](https://arxiv.org/html/2605.08516#alg2). This formulation naturally induces the state transition function $P:\mathcal{S}\times\mathcal{A}\to\mathcal{S}$. The reward function $r:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ evaluates the agent's performance at each token. At timestep $t$, the agent observes $s^{t}_{l}\in\mathcal{S}$, selects $a^{t}_{l}\in\mathcal{A}$ via policy $\pi:\mathcal{S}\to\mathcal{P}(\mathcal{A})$, and receives

$$
r(s^{t}_{l},a^{t}_{l},l)=\begin{cases} -\beta\,\mathrm{KL}\left(\pi_{\theta}(a^{t}_{l}\mid s^{t}_{l})\,\big\|\,\pi_{\mathrm{REF}}(a^{t}_{l}\mid s^{t}_{l})\right), & 0\leq l<O,\\ R(s^{t}_{O},a^{t}_{O}), & l=O,\end{cases}
\tag{1}
$$

where $\pi_{\mathrm{REF}}$ is a frozen reference model and $\beta$ controls the KL penalty. Thus, rewards combine (i) intermediate KL regularization to stabilize generation and (ii) a final task-specific signal $R(s^{t}_{O},a^{t}_{O})$ measuring phase effectiveness (e.g., congestion reduction). The objective is to learn $\pi^{*}$ that maximizes the expected discounted return

$$
G_{l}=\mathbb{E}\left[\sum_{l=1}^{O}\gamma^{l}\,r(s^{t}_{l},a^{t}_{l},l)\right],
$$

with discount $\gamma\in[0,1]$.

We aim to train an LLM-based agent $\pi_{\theta}:\mathcal{S}\rightarrow\mathcal{A}$ that generates the selected traffic phase and a corresponding CoT explanation. A traffic phase is defined as "a controller timing unit associated with the control of one or more movements" [4](https://arxiv.org/html/2605.08516#bib.bib93), where a movement refers to a specific permitted vehicle direction (e.g., eastbound through, northbound left-turn). The policy is parameterized by Low-Rank Adapter (LoRA) Hu et al. ([2022](https://arxiv.org/html/2605.08516#bib.bib15)) weights $\theta$, which are injected into all linear layers of the backbone LLM (e.g., the pretrained Qwen3-0.6B Yang et al. ([2025](https://arxiv.org/html/2605.08516#bib.bib16))). Detailed descriptions of the prompt templates and sample inputs are provided in the Appendix.
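As a rough sketch of this parameterization, the snippet below attaches rank-16 LoRA adapters to the linear projections of a causal LM using the `peft` library. The exact module list, `lora_alpha`, and model identifier are assumptions for illustration, not the paper's released configuration.

```python
# Sketch: wrap a pretrained backbone with LoRA adapters on its linear layers.
# Module names follow common LLaMA/Qwen conventions; hyperparameters are assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen3-0.6B"  # assumed identifier; the paper also uses LLaMA3-8B
tokenizer = AutoTokenizer.from_pretrained(model_id)
backbone = AutoModelForCausalLM.from_pretrained(model_id)

lora_cfg = LoraConfig(
    r=16,                      # LoRA rank stated in the implementation details
    lora_alpha=32,             # assumed scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
policy = get_peft_model(backbone, lora_cfg)  # only LoRA weights theta are trainable
policy.print_trainable_parameters()
```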

![Refer to caption](https://arxiv.org/html/2605.08516v1/figures/detailed_instance_illustration.png)

Figure 2: The environment state is translated into a structured textual traffic representation and provided to the LLM, including lane-wise early queued vehicles and segmented approaching vehicles. The model then generates a chain-of-thought reasoning trace, analyzing queue imbalances, anticipating near-term inflow from upstream segments, and evaluating candidate signal phases before producing a final phase selection.
### 3.2 Training Objective

We train our LLM-based agent to allocate its token budget toward effective reasoning that minimizes queue lengths at intersections. Following PPO, we optimize the policy network $\pi_{\theta}$ while jointly training a value network $V_{\phi}$ to reduce variance and improve stability.

Training proceeds at the level of individual state–response exchanges. Given a traffic state $s_{t}$, the LLM generates a single response trajectory $\kappa=\{(s_{0},a_{0}),\ldots,(s_{O},a_{O})\}$ consisting of reasoning tokens followed by a phase selection. Each exchange is treated as an independent training instance, similar in spirit to the contextual bandit-style reinforcement fine-tuning used in RLHF. We do not propagate gradients across multiple environment steps or perform trajectory-level temporal credit assignment over extended horizons.

For notational simplicity, we omit the timestep superscript $t$ from $s^{t}_{l+1}$ in the following discussion. Specifically, given a response trajectory $\kappa=\{(s_{0},a_{0}),\ldots,(s_{O},a_{O})\}$ rolled out under $\pi_{\text{old}}$ (the current LoRA-adapted LLM), the clipped surrogate objective for the policy is

$$
\mathcal{J}_{\text{CLIP}}(\theta)=\hat{\mathbb{E}}_{\kappa\sim\pi_{\text{old}}}\Big[\min\big(r_{l}(\theta)\,\hat{A}_{l},\ \operatorname{clip}(r_{l}(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_{l}\big)\Big],
\tag{2}
$$

where

$$
r_{l}(\theta)=\frac{\pi_{\theta}(a_{l}\mid s_{l})}{\pi_{\text{old}}(a_{l}\mid s_{l})},\qquad \epsilon=0.2,
$$

and $\hat{\mathbb{E}}$ denotes the empirical expectation over minibatches of tokens sampled from trajectories. The clipping term constrains the policy update to remain within the trust region defined by $(1-\epsilon,1+\epsilon)$. When the likelihood ratio $r_{l}(\theta)$ moves outside this interval, the objective is flattened so that excessively large policy shifts do not receive disproportionately large gradients. This prevents unstable oscillations in the log-probabilities of selected actions and helps maintain monotonic policy improvement, consistent with the motivations of Trust Region Policy Optimization Schulman et al. ([2015a](https://arxiv.org/html/2605.08516#bib.bib17)) and PPO Schulman et al. ([2017](https://arxiv.org/html/2605.08516#bib.bib12)). We estimate advantages using Generalized Advantage Estimation (GAE) Schulman et al. ([2015b](https://arxiv.org/html/2605.08516#bib.bib18)), where temporal differences are defined as

$$
\delta_{l}=r_{l}+\gamma V_{\phi}(s_{l+1})-V_{\phi}(s_{l})
\tag{3}
$$

with $V_{\phi}(s_{O+1})=0$, and advantages are computed as

$$
\hat{A}_{l}=\sum_{k=0}^{O-l}(\gamma\lambda)^{k}\,\delta_{l+k}
\tag{4}
$$

with discount $\gamma\in[0,1]$ and GAE weight $\lambda\in[0,1]$.

We employ GAE to stabilize token-level policy updates within a single response trajectory. Although traffic control is a long-horizon problem at deployment time, our optimization operates at the exchange level: advantages are computed over the generated reasoning sequence for the current decision only. We do not accumulate discounted returns across multiple environment transitions.

Within a response trajectory, Monte Carlo token-level returns often suffer from high variance. Conversely, relying solely on bootstrapped value predictions introduces bias, as the value network $V_{\phi}$ is trained to approximate returns under a previous policy rather than the updated one. GAE provides a principled mechanism for balancing this bias–variance trade-off through the parameter $\lambda$, interpolating between low-variance, high-bias temporal-difference estimates ($\lambda=0$) and high-variance, low-bias Monte Carlo estimates ($\lambda=1$), and thereby improves the stability of token-level policy updates without introducing cross-exchange temporal dependencies.

This bias–variance trade-off is important in LLM-based control because reasoning sequences are long in token space, even though optimization is performed at the level of independent exchanges. The bootstrapped return target is then $\hat{G}_{l}=\hat{A}_{l}+V_{\phi}(s_{l})$.
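For concreteness, a minimal NumPy sketch of the token-level GAE recursion (Equations 3–4) over one response trajectory is shown below. The discount follows the value reported later ($\gamma=0.999$); the GAE weight $\lambda=0.95$ is an assumed default, not a value stated in the paper.

```python
# Sketch: token-level GAE over one response trajectory (Eqs. 3-4).
# rewards[l] is r_l (KL penalty for l < O, task reward at l = O);
# values[l] is V_phi(s_l). V_phi(s_{O+1}) is taken to be 0.
import numpy as np

def gae_advantages(rewards, values, gamma=0.999, lam=0.95):
    O = len(rewards)
    values = np.append(values, 0.0)        # bootstrap value of the terminal state
    advantages = np.zeros(O)
    gae = 0.0
    for l in reversed(range(O)):
        delta = rewards[l] + gamma * values[l + 1] - values[l]   # Eq. (3)
        gae = delta + gamma * lam * gae                          # Eq. (4), recursive form
        advantages[l] = gae
    returns = advantages + values[:-1]     # bootstrapped targets G_hat_l
    return advantages, returns
```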

We train the value network by minimizing a clipped regression loss:

$$
\mathcal{J}_{\text{value}}(\phi)=\tfrac{1}{2}\,\hat{\mathbb{E}}\Big[\max\Big(\big(V_{\phi}(s_{l})-\hat{G}_{l}\big)^{2},\ \big(\operatorname{clip}(V_{\phi}(s_{l})-\hat{G}_{l},\,-\epsilon_{v},\,\epsilon_{v})\big)^{2}\Big)\Big],
\tag{5}
$$

with a separate clipping hyperparameter $\epsilon_{v}$ (e.g., $\epsilon_{v}=0.2$).

Overall objective. Our overall objective is to directly minimize

$$
\mathcal{L}(\theta,\phi)=-\,\mathcal{J}_{\text{CLIP}}(\theta)+\alpha\,\mathcal{J}_{\text{value}}(\phi),
\tag{6}
$$

with $\alpha\geq 0$.

Importantly, optimization is performed independently for each exchange; future traffic states influence learning only through the scalar reward assigned to the current decision, rather than through multi-step temporal backpropagation across environment transitions.
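The per-token losses in Equations 2, 5, and 6 can be sketched in PyTorch as follows; tensor shapes and the way log-probabilities and targets are obtained are assumptions for illustration only.

```python
# Sketch: clipped PPO policy loss, clipped value loss, and combined objective
# (Eqs. 2, 5, 6). All tensors are per-token over a minibatch; shapes assumed.
import torch

def oracle_tsc_loss(logp_new, logp_old, advantages, values, returns,
                    eps=0.2, eps_v=0.2, alpha=1.0):
    ratio = torch.exp(logp_new - logp_old)                        # r_l(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    j_clip = torch.min(unclipped, clipped).mean()                 # Eq. (2)

    v_err = values - returns
    v_err_clipped = torch.clamp(v_err, -eps_v, eps_v)
    j_value = 0.5 * torch.max(v_err ** 2, v_err_clipped ** 2).mean()  # Eq. (5)

    return -j_clip + alpha * j_value                              # Eq. (6)
```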

Although entropy bonuses are standard in PPO for discrete action spaces, extending them to LLM-based policies is nontrivial because actions correspond to multi-token response trajectories rather than single categorical choices. Prior work, such as Discrete Semantic Entropy (DSE) Zhang et al. ([2025](https://arxiv.org/html/2605.08516#bib.bib74)), adapts entropy regularization to LLM outputs by aggregating uncertainty over semantically distinct answers. However, these formulations are often too coarse to reliably penalize low-quality reasoning trajectories and do not explicitly model the smoothness or concentration of the induced answer distribution. As a result, they provide limited control over response-level instability in long-horizon decision-making. Motivated by these limitations, we introduce two complementary mechanisms in the following sections that directly target weak learning signals and reasoning instability in LLM-based TSC.

### 3.3 Reward Hurdle

PPO for LLMs Luong et al. ([2024](https://arxiv.org/html/2605.08516#bib.bib45)) is most commonly applied to short-horizon settings (e.g., context-conditioned question answering). In long-horizon control, however, naive application of PPO is insufficient: early suboptimal actions can push the system into regimes where later improvements are severely limited.

![Refer to caption](https://arxiv.org/html/2605.08516v1/figures/hurdle_intuition.png)

Figure 3: Each runner represents a trajectory; a hurdle represents the improvement threshold $H_{R}$. Runners that consistently clear hurdles ($R_{\text{env}}(s^{t}_{O},a^{t}_{O})\geq H_{R}$) advance, illustrating why the hurdle mechanism shifts the policy toward higher-impact actions. Contacting a hurdle (failing to exceed $H_{R}$) incurs a penalty and reduces progress, analogous to a sub-hurdle sequence receiving negative reinforcement that propagates backward through tokens.

In TSC, once queues approach or exceed traffic capacity, even principled controllers such as MaxPressure Varaiya ([2013](https://arxiv.org/html/2605.08516#bib.bib49)) struggle to recover; the robust strategy is to keep queues low proactively. This aligns with our preliminary experiments, where PPO often rewarded trajectories showing only minor *reductions* in queue length (small positive $R(s^{t}_{O},a^{t}_{O})$). The optimization thus failed to produce policies that preemptively cleared queues, instead yielding idle strategies that led to traffic saturation, where subsequent actions could no longer alleviate congestion.

To address this, we introduce the Hurdle Rate $H_{R}$ as a constant that shifts the sequence-level reward, ensuring that only trajectories surpassing a minimum improvement threshold receive positive reinforcement, while suboptimal ones are penalized relative to baseline. This encourages the LLM to explore more aggressive, high-impact actions. Specifically, the original environmental reward used in PPO at timestep $t$ is defined as

$$
R_{\text{env}}(s^{t}_{O},a^{t}_{O})=\text{queue}_{t-1}-\text{queue}_{t}.
\tag{7}
$$

We define $\text{queue}_{t}$ as the average number of vehicles whose speed falls below 0.1 m/s across all lanes at timestep $t$. We use the *queue difference* between timesteps $t$ and $t-1$ to mitigate instabilities in the reward signal induced by time-varying traffic demand drift. Without this adjustment, increasing inflows may introduce spurious penalties that incorrectly discourage otherwise reasonable actions. We incorporate a hurdle rate $H_{R}$ and obtain the reward defined as:

$$
R(s^{t}_{O},a^{t}_{O})=R_{\text{env}}(s^{t}_{O},a^{t}_{O})-H_{R}.
\tag{8}
$$
Accounting for time-varying inflow-induced regime shifts alone does not substantially reduce congestion. As shown in Figure [4](https://arxiv.org/html/2605.08516#S3.F4)(a) for the Qwen3-0.6B model on the CityFlow 1×1 benchmark, nearly 70% of actions reduce queues by fewer than 2.5 vehicles—far too weak to produce meaningful policy updates. The remaining 30% of actions exceed this 2.5-vehicle threshold, but they occur too infrequently to drive reliable long-horizon improvement. To amplify the learning signal associated with these higher-impact actions, we use 2.5 vehicles as an initial reference point and introduce a constant $H_{R}$: actions that fail to surpass this threshold within each control interval are down-weighted.

![Refer to caption](https://arxiv.org/html/2605.08516v1/figures/hurdle_hist_subplots.png)

Figure 4: Empirical distribution of queue length differences with hurdle $H_{R}=2.5$, from a baseline episode on CityFlow 1×1 using pretrained Qwen3-0.6B and a testing episode on CityFlow 1×1 after finetuning Qwen3-0.6B for one episode using the $R_{\text{env}}(s^{t}_{O},a^{t}_{O})-H_{R}$ configuration. Since decisions are taken every 10 timesteps, 360 decisions are taken over $T=3600$ timesteps in an episode.

As shown in Figure [4](https://arxiv.org/html/2605.08516#S3.F4)(b), subtracting this hurdle substantially sharpens the reward distribution. The proportion of actions exceeding the 2.5-vehicle threshold increases from 30% to over 40% after a single training episode—indicating a clear shift toward higher-impact interventions. In practice, we determine $H_{R}$ through a combination of empirical tuning and reward-distribution analysis, using the 70% statistic only as an initial guide rather than a fixed rule. This procedure provides a principled cutoff for distinguishing impactful from low-impact actions while adapting across environment–model pairs.
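A minimal sketch of the hurdle-adjusted environmental reward (Equations 7–8) under the $H_{R}=2.5$ setting discussed above is given below; the queue values are the lane-averaged counts of vehicles slower than 0.1 m/s, and the helper name is an assumption.

```python
# Sketch: hurdle-adjusted environmental reward (Eqs. 7-8).
# queue_prev, queue_curr: average number of vehicles slower than 0.1 m/s
# over all lanes at timesteps t-1 and t.

def hurdle_reward(queue_prev, queue_curr, hurdle_rate=2.5):
    r_env = queue_prev - queue_curr   # Eq. (7): queue difference
    return r_env - hurdle_rate        # Eq. (8): only improvements above H_R stay positive

# Example: clearing 3.5 vehicles yields +1.0; clearing only 1.0 yields -1.5.
print(hurdle_reward(12.0, 8.5), hurdle_reward(12.0, 11.0))
```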

Connection to Variance Reduction Techniques. Following Greensmith et al. ([2004](https://arxiv.org/html/2605.08516#bib.bib99)), an effective baseline for policy gradient methods should (i) depend only on the state rather than the sampled actions, and (ii) be selected to minimize the variance of the policy gradient estimator for $\mathcal{J}_{\text{CLIP}}$:

$$
\mathrm{Var}\big[\nabla_{\theta}\mathcal{J}_{\text{CLIP}}(\theta)\big]\approx\mathrm{E}_{\tau\sim\pi_{\theta}}\Big[\sum_{l=1}^{O}\nabla_{\theta}\log\pi_{\theta}(a^{\tau}_{l}\mid s^{\tau}_{l})\Big(\sum_{k=l}^{O}r(s^{\tau}_{k},a^{\tau}_{k})-b(s^{\tau}_{k})\Big)\Big]
\tag{9}
$$

A common choice in PPO is to use a state-dependent learned value function $V_{\phi}(s)$ as the baseline, i.e., an estimator of the expected future return. However, as noted in Greensmith et al. ([2004](https://arxiv.org/html/2605.08516#bib.bib99)), other forms of baselines can also effectively reduce gradient variance. In particular, Greensmith et al. ([2004](https://arxiv.org/html/2605.08516#bib.bib99)) derive an *optimal constant baseline*—a single scalar value (per policy) that minimizes the variance of the policy-gradient estimator. Their result, presented in Theorem 13 for the GPOMDP gradient $\nabla\pi_{\theta}/\pi_{\theta}$, can be rewritten in the equivalent $\nabla\log\pi_{\theta}$ form via a standard transformation:

$$
b(s^{\tau}_{k})=\frac{\mathrm{E}_{\tau\sim\pi_{\theta},\,\tau_{0}=s^{\tau}_{k}}\Big[\sum_{l=1}^{O}\big\|\nabla_{\theta}\log\pi_{\theta}(a^{\tau}_{l}\mid s^{\tau}_{l})\big\|_{2}^{2}\Big(\sum_{k=l}^{O}r(s^{\tau}_{k},a^{\tau}_{k})\Big)\Big]}{\mathrm{E}_{\tau\sim\pi_{\theta},\,\tau_{0}=s^{\tau}_{k}}\Big[\sum_{l=1}^{O}\big\|\nabla_{\theta}\log\pi_{\theta}(a^{\tau}_{l}\mid s^{\tau}_{l})\big\|_{2}^{2}\Big]}
\tag{10}
$$
Although the expression includes expectations over trajectories, the baseline itself is a *single constant value* for a fixed policy, not a function of $s$. Greensmith et al. ([2004](https://arxiv.org/html/2605.08516#bib.bib99)) refer to it as a "constant baseline" because, despite being defined through trajectory-level expectations, it does not vary across states. While theoretically appealing, computing this optimal constant baseline is computationally infeasible in practice. To overcome this limitation, we blend the benefits of both state-dependent and constant forms. Early in training—when $V_{\phi}$ is still inaccurate—we stabilize updates by subtracting a fixed Hurdle Rate $H_{R}$, which functions as a constant baseline. As training progresses and $V_{\phi}$ improves, the value function naturally begins to dominate the baseline term in the advantage estimate, providing a state-dependent adjustment that better captures long-horizon structure. Thus the transition between the constant and state-dependent baselines arises automatically: the Hurdle Rate provides stability when the value function is unreliable, and $V_{\phi}$ increasingly guides the updates once it becomes accurate. Conceptually, the Hurdle Rate plays a role similar to that of the optimal constant baseline, with one key distinction: instead of being subtracted from every reward in the trajectory, it is applied only to the final environmental reward of each trajectory $\kappa$.

### 3.4 Semantic Entropy Reward via Temperature-Scaled Softmax

A major challenge in long-horizon traffic control is that standard entropy-based regularizers measure uncertainty at the token level rather than at the level of the actual decision. Token uncertainty often reflects linguistic variation rather than uncertainty about the traffic-phase choice. As a result, token entropy poorly estimates semantic confidence and can destabilize training, especially when the same phase is expressed with different wording across responses. To address this limitation, we measure uncertainty at the level of the semantic action (Figure [5](https://arxiv.org/html/2605.08516#S3.F5))—the traffic phase predicted by the model—rather than at the level of the text token. Our approach relies on a simple assumption: if the model predicts the same traffic phase across multiple responses, it is semantically confident; if it alternates between phases, it is not.

![Refer to caption](https://arxiv.org/html/2605.08516v1/figures/Uncertainty-tight.png)

Figure 5: Uncertainty Regularization. Responses are first separated by extracted phase to estimate the probability of the LLM responding with a particular action. We quantify the spread of their distributions using the Softmax with Temperature Discrete Semantic Entropy. We apply a reward term $R^{\text{answer}}_{E}$ to encourage confident policies during optimization.

![Refer to caption](https://arxiv.org/html/2605.08516v1/figures/cityflow-visualization.png)

Figure 6: Visualization of eight supported signal phases at the intersection in the CityFlow 1×1 benchmark.

Formally, given two output responses $e^{t}_{1}=s_{0}\cup\{a_{0},\dots,a_{O}\}$ and $e^{t}_{2}=s_{0}\cup\{a^{\prime}_{0},\dots,a^{\prime}_{O}\}$, featuring the same tokenized traffic intersection state $s_{0}$, we can cluster both responses based on the phase we extract from the answer $e$. This step is essential because the same phase can be expressed in many linguistic forms; without grouping responses by their underlying phase, we cannot tell whether the model is semantically consistent or simply paraphrasing. Clustering converts free-form responses into a discrete distribution over phase choices, allowing us to compute a meaningful measure of semantic entropy. A list of supported phases for the CityFlow 1×1 intersection is provided in Figure [6](https://arxiv.org/html/2605.08516#S3.F6). Finally, not all uncertainty should be treated equally. A distribution such as $(0.55,0.45)$ may be acceptable in some contexts but signal high uncertainty in others. To control sensitivity to disagreement across responses, we apply temperature-scaled normalization to the phase counts, producing a tunable, decision-level entropy measure that can be directly incorporated into the reward.

In our study, we sample eight output token sequences $\{e^{t}_{1},e^{t}_{2},\dots,e^{t}_{8}\}$ and extract a single phase from each. These extracted phases are treated as integers from 1 to $N_{P}$ (inclusive) to serve as discrete cluster labels. We estimate $[p_{1},\dots,p_{j},\dots,p_{N_{P}}]$ using the empirical probability of each phase when extracted from each of the eight output responses given the same state:

$$
\text{count}^{\text{answer}}_{j}=\sum_{g=1}^{G=8}\mathbb{I}(\text{phase}_{g}=j),\qquad
p^{\text{answer}}_{j}=\frac{\exp\big(\text{count}^{\text{answer}}_{j}/\tau\big)}{\sum_{j^{\prime}=1}^{N_{P}}\exp\big(\text{count}^{\text{answer}}_{j^{\prime}}/\tau\big)}
\tag{11}
$$

where $\mathbb{I}$ denotes the indicator function that is 1 when we extract the $j$-th phase from the $g$-th output response and 0 otherwise, and $\tau$ is a temperature hyperparameter that controls how much we prioritize marginal changes in the empirical count of an extracted phase.

Given these phase probabilities, we compute the DSE over extracted phases as a proxy for the LLM's uncertainty over which phase to report. Since we use the phase extracted from the first response as the chosen TSC action, we mark the chosen phase as $j_{c}$. We turn these probabilities into a reward as follows:

$$
R^{\text{answer}}_{E}(e^{t}_{1},\dots,e^{t}_{G},R_{\text{env}},H_{R})=\begin{cases} p^{\text{answer}}_{j_{c}} & R_{\text{env}}>H_{R}\\ 0 & \text{otherwise}\end{cases}
\tag{12}
$$

The answer entropy reward uses the probability of the selected phase $p^{\text{answer}}_{j_{c}}$ only when the environmental reward $R_{\text{env}}$ (measured as the reduction in queue length) exceeds the hurdle threshold $H_{R}$. Otherwise, the reward is set to zero. This mechanism prevents reinforcing poorly performing actions that are generated with high confidence, effectively decoupling prediction certainty from control quality.
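The sketch below implements the phase-count softmax and gated answer reward of Equations 11–12. The phase-extraction helper is a placeholder assumption, since the actual procedure is defined by the paper's Algorithm 2, and the default temperature is an illustrative value.

```python
# Sketch: temperature-scaled softmax over extracted phases (Eq. 11) and the
# gated answer reward (Eq. 12). Phase extraction below is a placeholder; the
# paper's Algorithm 2 defines the real extraction procedure.
import re
import numpy as np

def extract_phase(response_text, num_phases):
    """Placeholder: take the last integer in [1, num_phases] mentioned in the answer."""
    hits = [int(m) for m in re.findall(r"\b(\d+)\b", response_text)
            if 1 <= int(m) <= num_phases]
    return hits[-1] if hits else None

def answer_reward(responses, num_phases, r_env, hurdle_rate, tau=0.5):
    phases = [extract_phase(r, num_phases) for r in responses]   # G sampled responses
    counts = np.zeros(num_phases)
    for p in phases:
        if p is not None:
            counts[p - 1] += 1
    scaled = np.exp(counts / tau)
    probs = scaled / scaled.sum()                                # Eq. (11)
    chosen = phases[0]                                           # first response acts
    if chosen is None or r_env <= hurdle_rate:
        return 0.0                                               # Eq. (12): gate on R_env > H_R
    return float(probs[chosen - 1])
```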

Discussion and Relation to DSE and Self-Questioning. The classic DSE Zhang et al. ([2025](https://arxiv.org/html/2605.08516#bib.bib74)) serves as a natural baseline, computing entropy directly from empirical action counts, which is defined as

$$
R^{\text{answer},\mathrm{DSE}}_{E}=\frac{\text{count}^{\text{answer}}_{j}}{\sum_{j^{\prime}=1}^{N_{P}}\text{count}^{\text{answer}}_{j^{\prime}}}
\tag{13}
$$

While effective in unsupervised question-answering settings, DSE relies on manually tuned entropy clipping and is sensitive to dataset-specific entropy scales. We generalize this idea by applying a temperature-scaled Softmax to the phase frequencies, yielding a more flexible confidence measure. The temperature $\tau$ creates a continuum between two behaviors. At moderate temperatures, the reward mirrors DSE, weighting phases by frequency. As $\tau\rightarrow 0$, the Softmax approaches an $\operatorname{arg\,max}$, producing a self-questioning–style Chen et al. ([2025](https://arxiv.org/html/2605.08516#bib.bib101)) reward that activates only when the chosen phase matches the majority vote. As $\tau\rightarrow\infty$, the distribution becomes uniform and insensitive to agreement. This unified formulation connects semantic-entropy minimization Zhang et al. ([2025](https://arxiv.org/html/2605.08516#bib.bib74)) with self-consistency maximization Chen et al. ([2025](https://arxiv.org/html/2605.08516#bib.bib101)), providing a single, tunable mechanism that encourages both confidence and internal consensus—two properties essential for stabilizing long-horizon control in traffic signal optimization.

After incorporating the two mechanisms, our final total reward is then formulated as

$$
R(s^{t}_{O},a^{t}_{O})=R_{\text{env}}(s^{t}_{O},a^{t}_{O})-H_{R}+w_{E}\,R^{\text{answer}}_{E},
\tag{14}
$$

where $w_{E}$ is a hyperparameter that balances the impact of uncertainty.
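Putting the pieces together, a brief sketch of the per-decision total reward in Equation 14 follows; it reuses the hypothetical `answer_reward` helper from the earlier sketch, and the default weights are assumed values rather than the paper's tuned settings.

```python
# Sketch: total reward for one decision step (Eq. 14), combining the
# hurdle-adjusted environmental reward with the weighted confidence bonus.
def total_reward(queue_prev, queue_curr, responses, num_phases,
                 hurdle_rate=2.5, w_e=1.0, tau=0.5):
    r_env = queue_prev - queue_curr                              # Eq. (7)
    r_answer = answer_reward(responses, num_phases, r_env, hurdle_rate, tau)
    return r_env - hurdle_rate + w_e * r_answer                  # Eq. (14)
```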

## 4 Experiments

We empirically investigate three key questions: (Q1) Does minimizing *Temperature-Scaled Softmax Discrete Semantic Entropy* improve traffic-signal control performance compared to training without an uncertainty regularizer? (Q2) Does introducing a reward hurdle on the environmental reward improve performance by suppressing low-impact trajectories? (Q3) Does our approach improve the consistency and informativeness of the LLM's CoT rationales (e.g., cross-response agreement and alignment between reasoning and actions)?

### 4.1 Experimental Setup

#### 4.1.1 Testbeds

We conduct experiments using LibSignal Mei et al. ([2024](https://arxiv.org/html/2605.08516#bib.bib7)), a unified benchmark for traffic-signal reinforcement learning that includes widely used algorithms such as IDQN Vincent et al. ([2023](https://arxiv.org/html/2605.08516#bib.bib103)), CoLight Wei et al. ([2019b](https://arxiv.org/html/2605.08516#bib.bib102)), PressLight Wei et al. ([2019a](https://arxiv.org/html/2605.08516#bib.bib55)), and MaxPressure Varaiya ([2013](https://arxiv.org/html/2605.08516#bib.bib49)). LibSignal provides an extensible Gym-style observation–action interface, allowing us to integrate natural-language policies by adding a custom answer-extraction mechanism (Algorithm [2](https://arxiv.org/html/2605.08516#alg2)). We evaluate our method on two contrasting intersections: CityFlow 1×1, a compact, four-approach intersection in Hangzhou, China, characterized by symmetric inflow patterns and an eight-phase signal plan controlling through and left-turn movements (Figure [7](https://arxiv.org/html/2605.08516#S4.F7)); and Cologne1, a more irregular and structurally complex intersection in Cologne, Germany, featuring mixed through, right-turn, left-turn, and U-turn movements across four approaches (Figure [7](https://arxiv.org/html/2605.08516#S4.F7)). LibSignal incorporates real-world traffic data from each corresponding region—Hangzhou for CityFlow 1×1 and Cologne for Cologne1—generating realistic inflow patterns and driver behaviors and providing a challenging testbed for long-horizon control.

Figure 7: Visualization of the road network at a selected intersection in the (a) CityFlow 1×1 and (b) Cologne1 benchmarks.
#### 4.1.2 Evaluation Metrics

We evaluate performance using four standard traffic-signal metrics following LibSignal Mei et al. ([2024](https://arxiv.org/html/2605.08516#bib.bib7)). Travel time is the average duration required for a vehicle to complete its assigned route. Queue length is computed as the average number of vehicles whose speed falls below 0.1 m/s across all lanes and timesteps. Delay follows the U.S. Federal Highway Administration (FHWA) definition as the additional travel time experienced relative to free-flow conditions [4](https://arxiv.org/html/2605.08516#bib.bib93); that is, the average difference between actual travel time and its free-flow counterpart. Throughput denotes the total number of vehicles that successfully complete their routes during the simulation.

#### 4.1.3 Implementation Details

We fine-tune LoRA adapters Hu et al. ([2022](https://arxiv.org/html/2605.08516#bib.bib15)) with rank 16 on top of pretrained LLM backbones rather than updating all model parameters. All experiments are conducted on a single NVIDIA RTX A6000 GPU. Each traffic simulation runs for 3,600 timesteps, and the policy selects an action every 10 steps. When a phase change is issued, a fixed 5-second yellow interval is inserted to ensure safe transitions. Policy and value networks are optimized using the objective in Equation [6](https://arxiv.org/html/2605.08516#S3.E6). Rewards are computed according to Equation [14](https://arxiv.org/html/2605.08516#S3.E14) with $\alpha=1$ and discount factor $\gamma=0.999$. We perform parameter updates every 360 timesteps using a replay buffer containing the most recent 400 timesteps of state–action–reward tuples, and we save LoRA and value-head checkpoints every 720 timesteps. The selected phase is extracted directly from the LLM's output using Algorithm [2](https://arxiv.org/html/2605.08516#alg2), without additional model calls.
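For reference, the hyperparameters stated in this subsection (and the $G=8$ sampling budget discussed earlier) could be collected in one place as below; this is only a summary sketch, and values not reported in the text, such as the GAE $\lambda$, are omitted rather than guessed.

```python
# Sketch: training configuration collected from the implementation details above.
CONFIG = {
    "lora_rank": 16,
    "episode_timesteps": 3600,
    "decision_interval": 10,       # one action every 10 simulation steps
    "yellow_interval_s": 5,
    "alpha_value_loss": 1.0,       # weight on the value loss in Eq. (6)
    "gamma": 0.999,
    "update_every_timesteps": 360,
    "replay_buffer_timesteps": 400,
    "checkpoint_every_timesteps": 720,
    "num_sampled_responses": 8,    # G = 8 for the semantic-entropy estimate
}
```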

#### 4.1.4 TSC Baselines

We evaluate three backbone LLMs—Qwen3-0.6B, Qwen3-8B, and LLaMA3-8B-Instruct—both before and after reinforcement-learning fine-tuning. All models use identical prompts, reward functions, and training protocols. During inference, we follow the decoding hyperparameters recommended by each model's authors (e.g., temperature, top-$k$, top-$p$), ensuring that downstream performance differences arise from policy optimization rather than decoding artifacts. For context, we compare our RL-finetuned models against widely used non-LLM traffic-signal controllers, including IDQN, PressLight, FRAP, and other fixed-time or rule-based baselines. Unlike these non-explainable controllers, our fine-tuned LLMs produce both high-quality control actions and interpretable chain-of-thought rationales.

### 4.2 Main Results

Table 1: Performance of TSC agents in CityFlow 1×1 (left metric block) and Cologne1 (right metric block). ↓ indicates lower is better; ↑ indicates higher is better. Percentages are changes relative to the corresponding pretrained LLM baseline.

| Method | Travel Time ↓ | Queue ↓ | Delay ↓ | Throughput ↑ | Travel Time ↓ | Queue ↓ | Delay ↓ | Throughput ↑ |
|---|---|---|---|---|---|---|---|---|
| **Fixed-control Transportation Methods** | | | | | | | | |
| FixedTime (t_fixed=10) | 923.67 | 114.42 | 4.83 | 1006 | 206.46 | 51.59 | 4.04 | 1847 |
| FixedTime (t_fixed=30) | 552.72 | 100.34 | 5.19 | 1455 | 156.96 | 46.23 | 4.22 | 1910 |
| MaxPressure | 303.81 | 100.94 | 6.77 | 1717 | 58.34 | 7.69 | 2.77 | 2002 |
| **Non-explainable RL Methods** | | | | | | | | |
| SOTL | 212.22 | 69.11 | 5.42 | 1861 | 1358.62 | 104.61 | 5.94 | 515 |
| IDQN | 116.64 | 28.37 | 0.61 | 1959 | 50.58 | 5.37 | 0.31 | 2000 |
| MAPG | 490.71 | 118.61 | 0.74 | 1514 | 46.73 | 5.06 | 0.30 | 2000 |
| IPPO | 308.87 | 95.38 | 0.76 | 1736 | 55.74 | 7.35 | 0.33 | 1994 |
| PressLight | 105.68 | 23.20 | 0.58 | 1965 | 51.79 | 4.82 | 0.29 | 2001 |
| FRAP | 104.28 | 23.29 | 0.62 | 1979 | 66.67 | 12.09 | 0.31 | 1995 |
| **Explainable LLM Methods** | | | | | | | | |
| Qwen3-0.6B | 506.1 | 104.06 | 0.71 | 1485 | 531.8 | 40.61 | 0.57 | 1548 |
| Qwen3-8B | 468.95 | 98.91 | 0.72 | 1566 | 119.4 | 34.37 | 0.43 | 1919 |
| LLaMA3-8B | 594.9 | 138.63 | 0.81 | 1359 | 74 | 13.52 | 0.36 | 1995 |
| Hurdle and Softmax Probability (Qwen3-0.6B) | 204.1 (–60%) | 70.4 (–32%) | 0.73 (+3%) | 1897 (+28%) | 80.7 (–85%) | 20.89 (–49%) | 0.42 (–26%) | 1990 (+29%) |
| Hurdle and Softmax Probability (Qwen3-8B) | 187.9 (–60%) | 66.26 (–33%) | 0.76 (+6%) | 1902 (+21%) | 84.3 (–29%) | 21.25 (–38%) | 0.42 (–2%) | 1987 (+4%) |
| Hurdle and Softmax Probability (LLaMA3-8B) | 146.9 (–75%) | 45.50 (–67%) | 0.67 (–17%) | 1946 (+43%) | 52.9 (–29%) | 6.73 (–50%) | 0.32 (–11%) | 2001 (+<1%) |

We train policies on two different road network topologies: CityFlow 1×1 and Cologne1. LibSignal provides real-world traffic data from an intersection in Hangzhou with $N_{P}=8$ phases, and from an intersection in Cologne with $N_{P}=4$. Our method trains policies using the objective in Equation [6](https://arxiv.org/html/2605.08516#S3.E6) with $\alpha=1$, where each reward $r_{t}$ incorporates both the environmental reward and the hurdle rate $H_{R}$. For computing the Temperature-scaled Softmax DSE, we set $G=8$ as a balance between computational cost and estimation quality: larger values of $G$ reduce the variance of the estimated output phase distribution but increase wall-clock time and memory usage. To avoid undersampling control modes, we follow the heuristic $G\geq N_{P}$, where $N_{P}$ is the number of admissible traffic phases ($N_{P}=8$ for CityFlow 1×1). We apply the same setting to Cologne1 for consistency. Each traffic simulation episode lasts 3,600 timesteps, and the agent selects an action every 10 timesteps. When the LLM proposes a phase change, a yellow-light period of five seconds is inserted to transition safely from the current phase. We report agent performance using the best checkpoint, selected based on the lowest average queue length during a held-out test episode.

Table [1](https://arxiv.org/html/2605.08516#S4.T1) shows that OracleTSC consistently improves all major metrics across models and intersections. On CityFlow 1×1, applying both the Reward Hurdle and Temperature-Scaled Softmax DSE yields a 60–75% reduction in travel time, a 32–67% reduction in queue length, and substantial gains in throughput. On Cologne1, which is structurally different and has fewer phases, the method again reduces queue length by 29–50% and improves travel time despite the more constrained action space. Importantly, the gains persist across model scales: even the 0.6B-parameter Qwen model benefits significantly, demonstrating that the proposed mechanisms operate independently of model size.

### 4.3 Ablation Studies

We conduct ablation studies to assess how each design choice contributes to the effectiveness of our method. All ablations use the CityFlow 1×1 configuration with LLaMA3-8B as the base LLM.

#### 4.3.1 Reward Shaping

Table [2](https://arxiv.org/html/2605.08516#S4.T2) summarizes an ablation over the key reward components in OracleTSC. We compare: (i) the pretrained LLM without RL finetuning, (ii) finetuning using only the environmental reward, (iii) adding a constant hurdle rate, (iv) applying Temperature-scaled Softmax DSE as an uncertainty regularizer, and (v) their combination.

Using the environmental reward alone yields only limited performance gains. Although average travel time decreases from 594.9 s to 391.1 s, congestion-related metrics remain largely unchanged: queue length is reduced only marginally (138.63 → 117), and delay exhibits negligible change (0.81 → 0.77). These results indicate that, in long-horizon TSC, the raw environmental reward is both weak and noisy, and therefore insufficient to reliably guide policy optimization.

Introducing a hurdle rate fundamentally alters the optimization landscape. By subtracting $H_R$, low-impact actions are explicitly penalized, preventing the policy from reinforcing transitions that induce only marginal improvements. As a result, learning is redirected toward decisions that produce meaningful congestion reduction. This leads to substantial gains across all metrics, most notably a reduction in queue length from 138.63 to 56.9, confirming that reward shaping is critical for mitigating PPO's tendency to overfit to weak positive signals in long-horizon settings.
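As a minimal numeric illustration of this shaping step (the reward values and the hurdle of 3.0 below are invented for illustration, not taken from the paper's configuration):

```python
def hurdle_reward(env_reward: float, hurdle_rate: float) -> float:
    """Subtract the calibrated hurdle from the environmental reward so that
    low-impact actions yield a negative (filtered) learning signal."""
    return env_reward - hurdle_rate

# An action with only a marginal congestion improvement is penalized ...
print(hurdle_reward(env_reward=0.4, hurdle_rate=3.0))   # -> -2.6
# ... while an action that clears the hurdle is reinforced.
print(hurdle_reward(env_reward=5.0, hurdle_rate=3.0))   # ->  2.0
```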

Table 2: Ablation study over reward design. $R^{\text{answer}}_{E}$ is given by Softmax DSE with $\tau = 0$. Base LLM is LLaMA3-8B. Evaluated on CityFlow 1×1.

Uncertainty regularization alone, implemented via temperature-scaled Softmax DSE, also improves performance relative to the baseline. Queue length decreases from 138.63 to 71.1, suggesting that a significant source of PPO instability with LLM-based policies stems from reasoning drift and inconsistent action generation. Penalizing high-entropy response distributions encourages more stable and coherent phase selection, even in the absence of explicit reward shaping.

The strongest performance is achieved when the hurdle rate and uncertainty regularization are combined. This joint objective yields consistent improvements across all metrics: travel time is reduced from 594.9 s to 146.9 s, and queue length drops from 138.63 to 45.5 vehicles. These results demonstrate that the two mechanisms are complementary: the hurdle rate strengthens informative learning signals, while uncertainty regularization stabilizes the LLM's internal reasoning process and reduces gradient variance.

Overall, these results indicate that entropy regularization alone delivers consistent performance improvements, and that Temperature\-scaled Softmax DSE provides a stronger and more stable training signal than naïve DSE\. More broadly, the findings underscore the importance of managing epistemic uncertainty in LLM\-based control policies—particularly in long\-horizon domains where PPO is sensitive to variance in both generated reasoning and trajectory\-level rewards\.

![Refer to caption](https://arxiv.org/html/2605.08516v1/figures/entropy_ablation.png)

Figure 8: Effect of entropy regularization and temperature scaling across training episodes under the $R_{\text{env}}(s^{t}_{O}, a^{t}_{O}) + w_{\text{uncertainty}} R^{\text{answer}}_{E}$ configuration, where $w_{\text{uncertainty}} = 1$ and $H_R = 0$. Each subplot reports one evaluation metric (Travel Time, Queue Length, Delay, and Throughput) across training episodes for LLaMA3-8B on the CityFlow 1×1 intersection. The results indicate that entropy regularization alone is effective, while the sharpest probability distribution over phases ($\tau = 0$) yields the strongest overall performance in terms of Queue Length.
#### 4.3.2 Comparison to Naive Discrete Semantic Entropy

We further analyze the role of entropy regularization by comparing the proposed temperature-scaled Softmax DSE against the previously introduced naive DSE. All methods are evaluated under a unified reward formulation $R = R_{\text{env}} + w_{\text{uncertainty}} R^{\text{answer}}_{E}$, with $w_{\text{uncertainty}} = 1$. Experiments are conducted using LLaMA3-8B-Instruct on the CityFlow 1×1 benchmark, and results are summarized in Figure [8](https://arxiv.org/html/2605.08516#S4.F8).
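To make the two regularizers concrete, the sketch below estimates the phase distribution from the G sampled answers and contrasts its discrete semantic entropy with a temperature-scaled softmax probability of the executed phase. The function names, the handling of the $\tau = 0$ limiting case, and the example counts are our own illustrative choices; the exact forms are those of Equations (11) and (13) in the paper.

```python
import math
from collections import Counter

def phase_counts(sampled_phases, admissible_phases):
    """Histogram of the phases parsed from the G sampled responses."""
    counts = Counter(sampled_phases)
    return [counts.get(p, 0) for p in admissible_phases]

def naive_dse(hist):
    """Naive discrete semantic entropy of the empirical answer distribution
    (entering the reward with a sign/weight that penalizes high entropy)."""
    total = sum(hist)
    probs = [c / total for c in hist if c > 0]
    return -sum(p * math.log(p) for p in probs)

def softmax_dse(hist, chosen_index, tau=2.0):
    """Temperature-scaled softmax probability of the executed phase; tau -> 0
    sharpens toward a one-hot distribution on the majority answer."""
    if tau == 0:
        top = max(hist)
        hot = [1.0 if c == top else 0.0 for c in hist]
        return hot[chosen_index] / sum(hot)
    exps = [math.exp(c / tau) for c in hist]
    return exps[chosen_index] / sum(exps)

# Illustrative example with G = 8 answers over four phase mnemonics:
hist = phase_counts(["ETWT"] * 6 + ["ETEL"] * 2, ["ETWT", "ETEL", "NTST", "STSL"])
print(naive_dse(hist))                    # ~0.56 nats of answer disagreement
print(softmax_dse(hist, chosen_index=0))  # high value: the majority phase was executed
```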

Across all episodes, entropy-based regularization consistently outperforms the pretrained baseline, yielding faster convergence and improved traffic efficiency. In Episode 1, naive DSE reduces average travel time from 594.9 s to 540.4 s, while the temperature-scaled Softmax DSE ($\tau = 2$) further lowers it to 409.4 s and increases throughput from 1,359 to 1,596 vehicles. By Episode 2, both uncertainty-aware methods continue improving performance, with naive DSE achieving 356.5 s travel time and 1,673 throughput, while the $\tau = 2$ variant further improves to 343.8 s and 1,741 vehicles. In Episode 3, uncertainty minimization sustains these gains, reducing travel time to 248.0 s and reaching a peak throughput of 1,771 vehicles.

Notably, the sharpest response distribution ($\tau = 0$) attains the lowest queue length in Episode 3 (91.29 vehicles), despite slightly worse travel time and throughput, suggesting that overly aggressive entropy suppression can favor local consistency over beneficial exploration. This limitation is mitigated by incorporating the hurdle rate, which filters low-impact trajectories and stabilizes optimization under low-entropy regimes.

![Refer to caption](https://arxiv.org/html/2605.08516v1/figures/hurdle_ablation.png)

Figure 9: Effect of the hurdle rate on model performance for the $R_{\text{env}}(s^{t}_{O}, a^{t}_{O}) - H_R + w_{\text{uncertainty}} R^{\text{answer}}_{E}$ configuration, where $w_{\text{uncertainty}} = 1$. Each subplot reports one evaluation metric (Travel Time, Queue Length, Delay, and Throughput) as a function of the hurdle rate for Qwen3-0.6B, Qwen3-8B, and LLaMA3-8B on the CityFlow 1×1 intersection. The results indicate that moderate hurdle rates consistently balance exploration and reward exploitation, leading to shorter travel times and reduced congestion compared with both low and high thresholds, though optimal performance requires careful tuning.

Overall, these results confirm that entropy regularization is a key driver of performance improvement, and that temperature-scaled Softmax DSE provides a more effective and stable learning signal than naive DSE. This highlights the critical role of uncertainty control in LLM-based policies, particularly for long-horizon decision-making problems where PPO is highly sensitive to output variance and compounded reward noise.

#### 4.3.3 Hurdle Rate

We next investigate the influence of the hurdle rate $H_R$ under the reward formulation

$$R = R_{\text{env}} - H_R + w_{\text{uncertainty}} R^{\text{answer}}_{E}, \quad \text{with } w_{\text{uncertainty}} = 1.$$

Figure [9](https://arxiv.org/html/2605.08516#S4.F9) summarizes results across a wide range of hurdle values for three backbone models: Qwen3-0.6B, Qwen3-8B, and LLaMA3-8B.

Across all architectures, a consistent trend emerges: intermediate hurdle rates yield the strongest performance. This behavior aligns with variance-reduction theory in policy gradients Greensmith et al. ([2004](https://arxiv.org/html/2605.08516#bib.bib99)), where the variance of the gradient estimator increases with the squared deviation between the optimal constant baseline $b^{*}$ and a mis-specified baseline $b$. In our setting, the hurdle rate plays a role analogous to a baseline, shaping the effective learning signal.
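For a scalar-parameter score-function estimator $\hat{g}(b) = \nabla_\theta \log \pi_\theta(a \mid s)\,(R - b)$ with a constant baseline $b$, a standard form of this relationship (our notation, following the cited variance analysis) is

$$\operatorname{Var}\big[\hat{g}(b)\big] = \operatorname{Var}\big[\hat{g}(b^{*})\big] + \mathbb{E}\big[(\nabla_\theta \log \pi_\theta)^{2}\big]\,(b - b^{*})^{2}, \qquad b^{*} = \frac{\mathbb{E}\big[(\nabla_\theta \log \pi_\theta)^{2}\, R\big]}{\mathbb{E}\big[(\nabla_\theta \log \pi_\theta)^{2}\big]},$$

so a hurdle rate that sits far from the variance-minimizing threshold inflates the variance of the policy-gradient estimate quadratically, consistent with the U-shaped trends observed in Figure 9.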

For Qwen3-0.6B, performance peaks around $H_R \approx 3.7$, reducing average travel time from 497.8 s to 204.1 s and queue length from 96.2 to 70.4 relative to the lowest tested hurdle ($H_R = 3.5$), while increasing throughput from 1,511 to 1,897. Qwen3-8B achieves optimal performance at slightly lower thresholds ($H_R \approx 2.7$), reaching a travel time of 385.4 s and throughput of 1,675. LLaMA3-8B exhibits similar behavior, attaining its best results at $H_R = 3.0$, with a travel time of 293.1 s in a representative training episode.

These results reveal two complementary failure modes associated with miscalibrated hurdle rates. When $H_R$ is too low, weak or marginal trajectories are insufficiently penalized, allowing noisy rewards to dominate PPO updates and slow convergence. In contrast, overly aggressive hurdle rates suppress informative transitions, impairing coordination and leading to increased travel times and reduced throughput across all models.

![Refer to caption](https://arxiv.org/html/2605.08516v1/figures/metrics_over_time.png)

Figure 10: Training dynamics of the LLM-based traffic signal control agent over two full episodes. Each subplot reports instantaneous metrics for the finetuned LLM (in blue) and a 30-step moving average for (a) delay, (b) queue length, (c) travel time, and (d) throughput change. We also include the baseline LLM's dynamics (in red). The shaded region highlights the onset of a large improvement in traffic efficiency, marked by reduced delay, shorter queues, and larger and more consistent throughput gains. Together, these curves illustrate how the agent transitions from unstable exploratory behavior to stable, reward-aligned control.

Model scale further modulates the optimal hurdle threshold. Smaller models, such as Qwen3-0.6B, benefit from slightly higher hurdle rates, whereas larger models attain peak performance at more moderate values. We attribute this divergence to differences in intrinsic reasoning stability: larger models typically produce more coherent and internally consistent action sequences prior to reinforcement fine-tuning, and are therefore more susceptible to performance degradation when informative trajectories are excessively pruned.

Taken together, these findings demonstrate that the hurdle rate is not a fixed hyperparameter, but rather a model\- and topology\-dependent control mechanism that actively shapes the policy optimization landscape\. Proper calibration is crucial for balancing exploratory learning against the suppression of low\-value updates, thereby enabling stable and efficient optimization in long\-horizon, sparse\-reward traffic signal control\.

### 4.4 Comparison to Other RL Controllers

Table [1](https://arxiv.org/html/2605.08516#S4.T1) compares our LLM-based traffic signal controllers against fixed-time baselines and non-LLM reinforcement learning methods across two intersection settings. On CityFlow 1×1, fixed-time controllers exhibit poor performance, characterized by long average travel times (552–924 s) and severe queueing. Classical deep-RL approaches, including IDQN, PressLight, and FRAP, substantially outperform fixed-time control, achieving travel times of approximately 100–120 s and maintaining short queues of roughly 23–28 vehicles. In contrast, pretrained LLM backbones are initially uncompetitive: for example, LLaMA3-8B yields an average travel time of 594.9 s with a queue length of 138.63 vehicles, demonstrating that naïve prompting alone is insufficient for effective TSC. After applying the proposed reward-hurdle mechanism and entropy-regularized fine-tuning, however, the same backbone improves markedly, reaching 146.9 s average travel time, a queue length of 45.5 vehicles, and a throughput of 1,946 vehicles. This closes most of the performance gap relative to specialized deep-RL controllers, while uniquely preserving interpretable natural-language rationales for each control decision.

On the more irregular Cologne1 intersection, the gap to strong RL baselines narrows further. PressLight and MaxPressure obtain travel times of 51.79 s and 58.34 s, respectively, with near-maximal throughput (≈ 2,000 vehicles). Our fine-tuned LLaMA3-8B controller achieves performance on par with this regime, attaining an average travel time of 52.9 s, a queue length of 6.73 vehicles, and a throughput of 2,001 vehicles, while substantially outperforming its pretrained counterpart (74 s, 13.52 vehicles, and 1,995 vehicles, respectively). Smaller backbones exhibit similar trends: for example, Qwen3-0.6B improves from 531.8 s to 80.7 s travel time and from 1,548 to 1,990 vehicles in throughput on Cologne1. These results show that (i) untrained LLM policies fall far short of established RL controllers, but (ii) after RL fine-tuning with our reward hurdle and uncertainty regularizer, OracleTSC attains performance comparable to strong non-LLM RL baselines on realistic networks, while simultaneously providing interpretable chain-of-thought explanations.

Figure [10](https://arxiv.org/html/2605.08516#S4.F10) illustrates how these gains emerge over time as the agent transitions from unstable exploration to stable, reward-aligned control.

### 4.5 Generalization across Intersections

Table 3: Generalization performance across intersections for LLaMA3-8B under the Environmental Reward $-\,H_R$ + Softmax DSE configuration. Each row reports evaluation metrics when a policy trained on one traffic network is directly applied to another without additional finetuning. Models trained on Cologne1 generalize well to CityFlow 1×1, improving travel time from 594.9 s to 517.5 s and throughput from 1359 to 1427. Conversely, transferring from CityFlow 1×1 to Cologne1 yields further gains in efficiency (travel time 61.3 s vs. 74 s and queue length 8.27 vs. 13.52), suggesting that the learned representations capture transferable signal-phase coordination patterns across distinct network topologies.

We evaluate the cross-topology generalization capabilities of OracleTSC by training policies on one intersection and deploying them on another without any additional finetuning. Table [3](https://arxiv.org/html/2605.08516#S4.T3) reports zero-shot transfer results for LLaMA3-8B under the $R_{\text{env}} - H_R + w_{\text{uncertainty}} R^{\text{answer}}_{E}$ configuration.

A policy trained on Cologne1 transfers effectively to CityFlow 1×1, improving travel time from 594.9 s to 517.5 s and increasing throughput from 1,359 to 1,427. This demonstrates that representations learned on a structurally heterogeneous intersection can generalize to a simpler, more symmetric geometry with measurable gains in efficiency. The reverse transfer (CityFlow 1×1 → Cologne1) yields even stronger improvements: travel time decreases from 74 s to 61.3 s, queue length is reduced by nearly 40% (13.52 → 8.27 vehicles), and throughput increases from 1,995 to 2,002, the highest throughput achieved among all RL and non-RL controllers (explainable or not).

These bidirectional improvements suggest that combining hurdle\-based shaping with uncertainty minimization leads the policy to acquire transferable phase\-selection behaviors rather than memorizing intersection\-specific patterns\. The ability to deploy a model zero\-shot across geometrically distinct networks indicates that OracleTSC learns stable decision\-making heuristics that generalize beyond the training topology\. These findings provide early evidence for the feasibility of scalable, reusable LLM\-based traffic\-signal controllers that require minimal per\-intersection adaptation\.

<think\>

Okay, let's see. The current phase is ETEL. The intersection's state is given for each phase, and I need to figure out which signal is the most effective.</think>

First, I should understand what each signal does. The signal phases are ETL, NTST, ELWL, NLSL, ETEL, WTL, NTNL, STSL. Each phase allows certain lanes through and left-turn lanes. The goal is to choose the phase that best reduces delays, which would depend on the early queued vehicles and the number of approaching vehicles in different segments.

Looking at the data for each phase:

Phase ETL: Allowed lanes are Eastern through and left-turn. Early queued Eastern through: 3, Western through: 38. Segment 1: 1, segment 2: 3, segment 3: 13. Wait, the numbers might be different. Wait, the user provided the data for each phase. Let me check again.

Wait, the user's input for each phase's state is:

Phase: ETWT

Allowed lanes: Eastern and western through lanes.

Early queued Eastern through: 3.0 Western through: 38.0

segment_1 Eastern through: 1 Western through: 1

segment_2 Eastern through: 3 Western through: 3

segment_3 Eastern through: 13 Western through: 51

Wait, that seems conflicting. The initial state had Eastern through: 9, Western through: 38. Then in the second phase, ETL, it's 3 and 38. But the user's current input shows that after ETL, the next phase is ETL again. Wait, no, the user's message is:

The current phase is ETEL. The state of the intersection:

Phase: ETWT

Allowed lanes: Eastern and western through lanes.

Early queued Eastern through: 3.0 Western through: 38.0

segment_1 Eastern through: 1 Western through: 1

segment_2 Eastern through: 3 Western through: 3

segment_3 Eastern through: 13 Western through: 51

Wait, but in the previous interaction, the user had ETL, and then after that, the next phase is ETL again. So maybe the current phase is ETL, and the state for ETL is 3 and 38.

So for phase ETL, allowed lanes are through and left-turn. The early queued for through lanes is 3, and for left-turn is 0. The segments are 1, 3, 13 for through, and 0, 0, 0 for left-turn. Wait, but the initial data had segment_1: 1, segment_2: 3, segment_3: 13. But in the next phase, maybe the segments change? Wait, no, the user's input here is for the current phase, so maybe the segments are as given.

So for phase ETL, the number of early queued vehicles in allowed lanes is 3 (through) and 0 (left). The segments for through are 1, 3, 13. For left, they are 0.

Now, looking at other phases. For example, phase NTST: allowed lanes are northern and southern through. Early queued Northern through: 38, Southern through: 26. Segments for through: 0, 0, 38. So here, there's a high early queued in northern through.

Phase WTL: allowed lanes are western through and left-turn. Early queued: 38, 5. Segments: 0, 0, 38...

Listing 1: Illustrative reasoning trace before fine-tuning. The model demonstrates self-contradictions and hesitation during phase selection (highlighted in red), with several phases misidentified through incorrect mnemonics (highlighted in blue), reflecting weak adherence to prompt instructions. In addition, certain phases (highlighted in orange) are repeatedly re-analyzed, inefficiently consuming the LLM's output token budget.

Step 1: Identify the optimal traffic signal by analyzing the current phase and its state. The current phase is ETEL, which allows Eastern through and left-turn lanes. The highest segment (segment 3) is 4, and there are no early queued vehicles. This suggests a potential for efficient traffic flow by prioritizing the farthest segments.

Step 2: The optimal traffic signal is ETEL. It maximizes segment utilization and minimizes delays by allowing vehicles to flow through the farthest segments without waiting for early queued vehicles.

<signal\>ETEL</signal\>

Listing 2: Output generation after finetuning with the environmental reward, hurdle rate, and Temperature-scaled Softmax DSE.
### 4.6 Qualitative Results

We qualitatively analyze how incorporating environmental rewards, a hurdle rate, and Softmax DSE regularization refines the model’s internal reasoning and decision\-making trajectory\.

Before fine-tuning (Listing [1](https://arxiv.org/html/2605.08516#LST1)), the model's reasoning trace exhibits self-contradictions, vacillation, and redundant phase re-evaluation. It frequently reverses intermediate conclusions (e.g., "Wait…"), misidentifies phase mnemonics (e.g., ETL and WTL should be ETEL and WTWL), and repeatedly re-analyzes the same input block (Phase: ETWT), inefficiently consuming output tokens and demonstrating weak adherence to the prompt. Collectively, these behaviors reflect diffuse credit assignment and elevated epistemic uncertainty in the model's chain-of-thought.

After fine-tuning (Listing [2](https://arxiv.org/html/2605.08516#LST2)), the CoT becomes markedly structured, grounded, and concise. The model explicitly enumerates reasoning steps (e.g., "Step 1: Identify… Step 2: The optimal traffic signal…"), anchors each inference in observable quantities ("(segment 3) is 4, and there are no early queued vehicles"), and concludes decisively with a single aligned action, <signal>ETEL</signal>. The absence of reversals or reanalysis suggests sharper credit propagation under the hurdle objective and reduced output entropy from the Softmax DSE term.

The generated chain-of-thought is not guaranteed to faithfully reflect the model's internal reasoning process, and ensuring mechanistic faithfulness in LLM reasoning remains an open research challenge. Recent surveys highlight the difficulty of scaling inference, supervising reasoning processes, and building reliable agentic systems Ke et al. ([2025](https://arxiv.org/html/2605.08516#bib.bib105)). Similarly, studies of the extractive–abstractive spectrum reveal inherent trade-offs among abstraction, verifiability, and perceived utility in LLM generations Worledge et al. ([2024](https://arxiv.org/html/2605.08516#bib.bib106)). Accordingly, we interpret OracleTSC's explanations as structured, human-readable rationales that improve transparency and behavioral consistency, rather than as formal guarantees of internal causal computation.

Overall, the qualitative shift from hesitant exploration to structured and task\-aligned rationales complements our quantitative improvements in traffic metrics, while acknowledging that reasoning faithfulness verification remains beyond the scope of the present work\.

## 5 Discussion and Conclusion

This paper presents OracleTSC, a reinforcement learning framework for LLM-based TSC that explicitly targets the training instabilities inherent to long-horizon decision-making. The framework integrates two complementary mechanisms: (1) a reward hurdle that suppresses low-impact actions and sharpens long-term credit assignment, and (2) Temperature-scaled Softmax DSE, which regularizes uncertainty in semantic action generation and stabilizes the model's internal reasoning process across responses. Together, these components address fundamental challenges in applying PPO to LLM-based control, including reward sparsity, time-varying traffic patterns, and cross-token variance in semantic action selection.

We demonstrate that reinforcement fine-tuning of LLM controllers is brittle, and show that targeted reward shaping and uncertainty regularization substantially improve stability and performance. On CityFlow 1×1, fine-tuned LLaMA3-8B reduces average travel time by 75% and queue length by 67%, with similar gains observed across multiple backbone models and intersection configurations. Importantly, OracleTSC performs competitively with specialized, black-box reinforcement learning controllers, while uniquely preserving the interpretability and transparency afforded by natural-language decision rationales. Ablation studies further confirm that both the reward hurdle and Softmax DSE are critical to performance, with the strongest results obtained when the two mechanisms are jointly applied.

Despite these gains, several limitations remain\. The optimal hurdle magnitude depends on environment dynamics and currently requires manual tuning, while semantic entropy estimation incurs additional computational cost due to multi\-sample generation\. These limitations motivate several future directions, including adaptive or learned hurdle scheduling, multi\-agent coordination across networked intersections, and robustness evaluation under multimodal sensing and real\-world deployment settings\.

Importantly, our objective is not to claim superiority over all existing RL algorithms\. Highly optimized, non\-explainable RL controllers remain strong on several benchmarks\. Rather, our goal is to improve the performance, stability, and reliability of explainable LLM\-based controllers while preserving natural\-language reasoning and semantic\-level uncertainty modeling\.

In summary, OracleTSC demonstrates that combining reward filtering with uncertainty-aware semantic regularization provides a principled and effective pathway for stabilizing reinforcement fine-tuning of LLM-based controllers. These results suggest a broader role for uncertainty-controlled semantic objectives in long-horizon control tasks, bridging the gap between interpretability and performance in next-generation intelligent transportation systems.

#### Acknowledgment

This work was supported in part by the ORAU Ralph E\. Powe Junior Faculty Enhancement Award, the National Science Foundation \(NSF\) NAIRR Start\-Up Program under Award NAIRR250417, and the Lambda Research Grant\. The authors gratefully acknowledge this support\. Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the supporting organizations\.

## References

- Signal Timing Manual, Second Edition. The National Academies Press, Washington, DC. ISBN 978-0-309-30888-5.
- L. Chen, M. Prabhudesai, K. Fragkiadaki, H. Liu, and D. Pathak (2025). Self-questioning language models. arXiv:2508.03682.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021). Training verifiers to solve math word problems. arXiv:2110.14168.
- Definition, Interpretation, and Calculation of Traffic Analysis Tools Measures of Effectiveness - Executive Summary. https://ops.fhwa.dot.gov/publications/fhwahop08054/execsum.htm.
- S. Farquhar, J. Kossen, L. Kuhn, and Y. Gal (2024). Detecting hallucinations in large language models using semantic entropy. Nature 630, pp. 625–630.
- E. Greensmith, P. L. Bartlett, and J. Baxter (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research 5, pp. 1471–1530.
- P. He, Q. Li, X. Yuan, and B. Zhou (2024). A holistic framework towards vision-based traffic signal control with microscopic simulation. arXiv:2403.06884.
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021). Measuring mathematical problem solving with the MATH dataset. arXiv:2103.03874.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. (2022). LoRA: Low-rank adaptation of large language models. In ICLR.
- Z. Hu, C. Liu, X. Feng, Y. Zhao, S. Ng, A. T. Luu, J. He, P. W. Koh, and B. Hooi (2024). Uncertainty of thoughts: Uncertainty-aware planning enhances information seeking in large language models. arXiv:2402.03271.
- H. Jiang, Z. Li, Z. Li, L. Bai, H. Mao, W. Ketter, and R. Zhao (2024). A general scenario-agnostic reinforcement learning for traffic signal control. IEEE Transactions on Intelligent Transportation Systems 25(9), pp. 11330–11344.
- Z. Ke, F. Jiao, Y. Ming, X. Nguyen, A. Xu, D. X. Long, M. Li, C. Qin, P. Wang, S. Savarese, C. Xiong, and S. Joty (2025). A survey of frontiers in LLM reasoning: Inference scaling, learning to reason, and agentic systems. arXiv:2504.09037.
- L. Kuhn, Y. Gal, and S. Farquhar (2023). Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv:2302.09664.
- S. Lai, Z. Xu, W. Zhang, H. Liu, and H. Xiong (2025a). LLMLight: Large language models as traffic signal control agents. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 (KDD '25), pp. 2335–2346.
- S. Lai, Z. Xu, W. Zhang, H. Liu, and H. Xiong (2025b). LLMLight: Large language models as traffic signal control agents. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, pp. 2335–2346.
- M. Li, J. Wang, B. Du, J. Shen, and Q. Wu (2025a). FuzzyLight: A robust two-stage fuzzy approach for traffic signal control works in real cities. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 (KDD '25), pp. 2393–2404.
- Q. Li, Y. Lin, Q. Luo, and L. Yu (2025b). DreamerV3 for traffic signal control: Hyperparameter tuning and performance. arXiv:2503.02279.
- F. Liu, X. Tong, M. Yuan, X. Lin, F. Luo, Z. Wang, Z. Lu, and Q. Zhang (2024). Evolution of heuristics: Towards efficient automatic algorithm design using large language model. In Proceedings of the 41st International Conference on Machine Learning (ICML '24).
- F. Liu, Y. Yao, P. Guo, Z. Yang, X. Lin, Z. Zhao, X. Tong, K. Mao, Z. Lu, Z. Wang, M. Yuan, and Q. Zhang (2026). A systematic survey on large language models for algorithm design. ACM Computing Surveys 58(8).
- T. Q. Luong, X. Zhang, Z. Jie, P. Sun, X. Jin, and H. Li (2024). ReFT: Reasoning with reinforced fine-tuning. arXiv:2401.08967.
- F. J. Martinez, C. K. Toh, J. Cano, C. T. Calafate, and P. Manzoni (2011). A survey and comparative study of simulators for vehicular ad hoc networks (VANETs). Wireless Communications and Mobile Computing 11(7), pp. 813–828.
- H. Mei, X. Lei, L. Da, B. Shi, and H. Wei (2024). LibSignal: An open library for traffic signal control. Machine Learning 113(8), pp. 5235–5271.
- A. Nikitin, J. Kossen, Y. Gal, and P. Marttinen (2024). Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities. In Advances in Neural Information Processing Systems, Vol. 37, pp. 8901–8929.
- M. Prabhudesai, L. Chen, A. Ippoliti, K. Fragkiadaki, H. Liu, and D. Pathak (2025). Maximizing confidence alone improves reasoning. arXiv:2505.22660.
- D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2024). GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling.
- J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015a). Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
- J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2015b). High-dimensional continuous control using generalized advantage estimation. arXiv:1506.02438.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
- H. Shen (2025). On entropy control in LLM-RL algorithms. arXiv:2509.03493.
- P. Varaiya (2013). Max pressure control of a network of signalized intersections. Transportation Research Part C: Emerging Technologies 36, pp. 177–195.
- T. Vincent, B. Belousov, C. D'Eramo, and J. Peters (2023). Iterated deep Q-network: Efficient learning of Bellman iterations for deep reinforcement learning. In Sixteenth European Workshop on Reinforcement Learning.
- M. Wang, Y. Chen, A. Pang, Y. Cai, C. S. Chen, Y. Kan, and M. Pun (2025). VLMLight: Traffic signal control via vision-language meta-control and dual-branch reasoning. arXiv:2505.19486.
- M. Wang, A. Pang, Y. Kan, M. Pun, C. S. Chen, and B. Huang (2024a). LLM-assisted light: Leveraging large language model capabilities for human-mimetic traffic signal control in complex urban environments. arXiv:2403.08337.
- M. Wang, X. Xiong, Y. Kan, C. Xu, and M. Pun (2024b). UniTSA: A universal reinforcement learning framework for V2X traffic signal control. IEEE Transactions on Vehicular Technology 73(10), pp. 14354–14369.
- M. Wang, Y. Xu, X. Xiong, Y. Kan, C. Xu, and M. Pun (2023). ADLight: A universal approach of traffic signal control with augmented data using reinforcement learning. arXiv:2210.13378.
- H. Wei, C. Chen, G. Zheng, K. Wu, V. Gayah, K. Xu, and Z. Li (2019a). PressLight: Learning max pressure control to coordinate traffic signals in arterial network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '19), pp. 1290–1298.
- H. Wei, N. Xu, H. Zhang, G. Zheng, X. Zang, C. Chen, W. Zhang, Y. Zhu, K. Xu, and Z. Li (2019b). CoLight: Learning network-level cooperation for traffic signal control. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM '19), pp. 1913–1922.
- H. Wei, G. Zheng, H. Yao, and Z. Li (2018). IntelliLight: A reinforcement learning approach for intelligent traffic light control. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18), pp. 2496–2505.
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou (2022a). Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS '22).
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022b). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, pp. 24824–24837.
- T. Worledge, T. Hashimoto, and C. Guestrin (2024). The extractive-abstractive spectrum: Uncovering verifiability trade-offs in LLM generations. arXiv:2411.17375.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025). Qwen3 technical report. arXiv:2505.09388.
- F. Ye, M. Yang, J. Pang, L. Wang, D. F. Wong, E. Yilmaz, S. Shi, and Z. Tu (2025). Benchmarking LLMs via uncertainty quantification. In Proceedings of the 38th International Conference on Neural Information Processing Systems (NIPS '24).
- H. Zhang, S. Feng, C. Liu, Y. Ding, Y. Zhu, Z. Zhou, W. Zhang, Y. Yu, H. Jin, and Z. Li (2019). CityFlow: A multi-agent reinforcement learning environment for large scale city traffic scenario. In The World Wide Web Conference, pp. 3620–3624.
- Q. Zhang, H. Wu, C. Zhang, P. Zhao, and Y. Bian (2025). Right question is already half the answer: Fully unsupervised LLM reasoning incentivization. arXiv:2504.05812.
- Y. Zhu, G. Li, X. Jiang, J. Li, H. Mei, Z. Jin, and Y. Dong (2025). Uncertainty-guided chain-of-thought for code generation with LLMs. arXiv:2503.15341.

## Appendix A

### A.1 Reward Design and PPO Stabilization Sensitivity

This section provides additional empirical and methodological context for the reward design used in OracleTSC\. Specifically, we examine the optimization behavior of PPO under the queue\-length reward\.

Let $\text{queue}_t$ denote the average number of vehicles whose speed falls below 0.1 m/s across all lanes. The standard congestion objective corresponds to the reward

$$R_q(s^{t}_{O}, a^{t}_{O}) = -\text{queue}_t \qquad (15)$$

While this objective is well aligned with long-horizon traffic efficiency, in practice it produces weak action-conditioned gradients during short control intervals. Inflow variation and regime effects can dominate $\text{queue}_t$, producing large baseline shifts in reward that are not attributable to a single phase decision. This results in high-variance advantages and unstable policy updates when fine-tuning language-model controllers with PPO.
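A minimal sketch of this queue-based reward, under the assumption that per-lane vehicle speeds (in m/s) are available from the simulator as a mapping from lane id to a list of speeds:

```python
def queue_length(lane_speeds, stop_threshold=0.1):
    """Average number of (near-)stopped vehicles per lane: a vehicle counts as
    queued when its speed falls below 0.1 m/s, following the definition above."""
    per_lane = [sum(1 for v in speeds if v < stop_threshold)
                for speeds in lane_speeds.values()]
    return sum(per_lane) / len(per_lane)

def queue_reward(lane_speeds):
    """Absolute queue-length reward R_q = -queue_t used in this sensitivity study."""
    return -queue_length(lane_speeds)

# Example: two lanes, three vehicles each (speeds in m/s).
print(queue_reward({"N_through": [0.0, 0.05, 8.2], "E_left": [12.4, 0.02, 6.1]}))  # -1.5
```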

We evaluate whether standard PPO stabilization techniques suffice under $R_q$. Across multiple hyperparameter sweeps (including learning rate, GAE-$\lambda$, reward discount factor $\gamma$, and removal of the critic via REINFORCE), we observe either policy collapse or negligible improvements over the pretrained baseline.

Table 4: Learning Rate Sensitivity on CityFlow1x1 (Reward = Queue Length, Qwen3-0.6B).

Table 5: GAE-$\lambda$ Sensitivity on CityFlow1x1 (Reward = Queue Length, Qwen3-0.6B).

Table 6: Reward Discount Factor Sensitivity on CityFlow1x1 (Reward = Queue Length, Qwen3-0.6B).

Table 7: REINFORCE (No Critic, No Baseline) on CityFlow1x1 (Reward = Queue Length, Qwen3-0.6B).

These results indicate that optimizer-level tuning alone does not reliably stabilize learning under the absolute queue-length reward in our exchange-level fine-tuning regime.

### A.2 Hyper-Parameters for Table [1](https://arxiv.org/html/2605.08516#S4.T1)

The value estimator $V_\phi$ is parameterized as a two-layer feedforward network across all model families, sizes, and intersection configurations. The first hidden layer expands the input dimension from the model's hidden state size to twice that dimension, followed by a Leaky ReLU activation with negative slope 0.01. The second layer produces a scalar value estimate.

For optimization, we set the value estimator learning rate to $10^{-5}$ and weight decay to $5 \times 10^{-7}$ uniformly across experiments. Gradient clipping is applied with a maximum norm of 0.5 for the policy network and 5.0 for the value estimator.
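A PyTorch sketch consistent with this description is shown below. The use of AdamW and the hidden size of 4096 (the LLaMA3-8B hidden width) are assumptions for illustration; the layer shapes, activation, learning rate, weight decay, and clipping norms follow the values stated above.

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Two-layer value estimator: expand the LLM hidden state to twice its
    width, apply LeakyReLU(0.01), then project to a scalar value."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, 2 * hidden_size),
            nn.LeakyReLU(negative_slope=0.01),
            nn.Linear(2 * hidden_size, 1),
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        return self.net(hidden_state).squeeze(-1)

value_head = ValueHead(hidden_size=4096)  # hidden size assumed, not stated in the paper
optimizer = torch.optim.AdamW(value_head.parameters(), lr=1e-5, weight_decay=5e-7)
# At each update, gradients would be clipped, e.g.:
# torch.nn.utils.clip_grad_norm_(value_head.parameters(), max_norm=5.0)  # 0.5 for the policy
```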

The KL divergence penalty uses the K3 approximation:

$$\mathrm{KL}(p \,\|\, q) = \frac{p}{q} - \log\frac{p}{q} - 1$$

weighted by $\beta = 0.05$ in the token-level reward. The policy is optimized using the clipped surrogate objective given in Equation [16](https://arxiv.org/html/2605.08516#A1.E16).
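A small sketch of this penalty as it might enter the per-token reward. Whether $p$ denotes the policy or the reference token probability, and exactly where the penalty is added, follow the paper's token-level reward definition, so the signature below is illustrative only.

```python
import math

def k3_kl(p: float, q: float) -> float:
    """K3 estimator of KL(p || q) from a single probability pair:
    p/q - log(p/q) - 1, which is non-negative and zero when p == q."""
    ratio = p / q
    return ratio - math.log(ratio) - 1.0

def kl_penalized_token_reward(token_reward: float, p_policy: float,
                              p_reference: float, beta: float = 0.05) -> float:
    """Token-level reward minus the beta-weighted K3 KL penalty (illustrative)."""
    return token_reward - beta * k3_kl(p_policy, p_reference)

print(kl_penalized_token_reward(0.0, p_policy=0.30, p_reference=0.25))
```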

Model-specific and intersection-specific hyperparameters are detailed in Table [8](https://arxiv.org/html/2605.08516#A1.T8), where $\epsilon_l$ and $\epsilon_u$ denote the lower and upper PPO clipping bounds, and $H_R$ represents the reward hurdle threshold.

Table 8: Model-specific hyperparameters for experiments in Table [1](https://arxiv.org/html/2605.08516#S4.T1).
### A.3 Supported Phase Mnemonics for Evaluated Intersections

Table 9: Supported Phase Mnemonics for CityFlow1x1.

Table 10: Supported Phase Mnemonics for Cologne1.
### A.4 System Prompt

You are an expert in traffic management. You can use your knowledge of traffic common sense to solve traffic signal control tasks.

A traffic light regulates a four-section intersection with northern, southern, eastern, and western sections, each containing two lanes: one for through traffic and one for left-turns.

Segment 1 is the closest to the intersection. Segment 2 is in the middle. Segment 3 is the farthest. In a lane, there may be early queued vehicles and approaching vehicles traveling in different segments. Early queued vehicles have arrived at the intersection and await passage permission. Approaching vehicles will arrive at the intersection in the future.

The traffic light has 8 signal phases. Each signal relieves vehicles' flow in the two specific lanes. The state of the intersection is listed below. It describes:

- The group of lanes relieving vehicles' flow under each signal phase.

- The number of early queued vehicles of the allowed lanes of each signal.

- The number of approaching vehicles in different segments of the allowed lanes of each signal. The signal phase you choose will persist for 10 seconds.

Please answer: Which is the most effective traffic signal that will most significantly improve the current traffic condition?

Requirements:

- Let's think step by step.

- You can only choose one of the signals listed by the user.

- You must follow the following steps to provide your analysis:

Step 1: Provide your analysis for identifying the optimal traffic signal.

Step 2: Answer your chosen signal.

- Your choice can only be given after finishing the analysis. - Your choice must be identified by the tag: <signal>YOUR_CHOICE</signal>.

Listing 3: Our System Prompt.
### A.5 Textual Phase Representation

Phase: ETWT (Eastern and Western Through Lanes)

**Early Queued Vehicles**:

- Eastern through: 17

- Western through: 3

**Approaching Vehicles**:

- Segment 1: Eastern through: 1, Western through: 2

- Segment 2: Eastern through: 5, Western through: 7

- Segment 3: Eastern through: 17, Western through: 20

Listing 4: Representation of a phase and its traffic observation. Example shown is for Eastern and Western Through (ETWT) lanes in CityFlow1x1.
### A.6 Algorithms

**Algorithm 1: PPO-Based Training for LLM-Guided Traffic Signal Control using Hurdle Rate and Softmax-with-Temperature Discrete Semantic Entropy**

Input: number of responses for entropy computation $G$; entropy weight $\alpha$; clipping threshold $\epsilon$; discount $\gamma$; GAE parameter $\lambda$; episode length $T$; maximum number of output tokens $L$; boolean `use_softmax` indicating whether to use Discrete Semantic Entropy with or without Softmax; optional Softmax temperature $\tau$; number of batches to train over $n_b$; batch size $B$; hurdle rate $H_R$.

1. Initialize buffer $\mathcal{B} \leftarrow \emptyset$.
2. Load pretrained language model $\pi_\theta$ and reference model $\pi_{\text{ref}}$; initialize value head $V_\phi$.
3. For each training iteration:
4. &nbsp;&nbsp;&nbsp;&nbsp;For each environment step $t = 1$ to $T$:
5. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Observe traffic state $s$ and current signal phase $p$; create the verbalized traffic state $s^t_0$ using $s$ and $p$.
6. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Sample a response $\{a^t_0, \dots, a^t_O\} \sim \pi_\theta(\cdot \mid s^t_0)$ with maximum length $O$; parse the selected action $p_{next}$ from $a_t$.
7. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Sample $G$ responses $\{\{a^t_0, \dots, a^t_O\} \sim \pi_\theta(\cdot \mid s_t)\}_{g=1}^{G}$ with maximum length $O$ and compute the phase histogram $\{\text{count}^{\text{answer}}_p\}_{p=1}^{N_P}$ from the $G$ responses using Algorithm 3.
8. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If `use_softmax`: compute $r_E \leftarrow R^{\text{answer}}_{E}$ from $\{\text{count}^{\text{answer}}_{p'}\}_{p'=1}^{N_P}$, $p_{next}$, and $\tau$ using Equation (11); otherwise compute $r_E \leftarrow R^{\text{answer},DSE}_{E}$ from $\{\text{count}^{\text{answer}}_{p'}\}_{p'=1}^{N_P}$ and $p_{next}$ using Equation (13).
9. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Execute action $p_{next}$; observe the reward $R(s^t_O, a^t_O)$ and next state $s^{t+1}_0$.
10. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Compute token-level rewards $r_t$ from the total reward $R_{\text{hurdle}}(s^t_O, a^t_O)$ using Equations (14) and (1).
11. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Compute action-sequence log-probabilities $\{\log \pi_{\theta_{old}}(a^t_o \mid s^t_{o-1})\}_{o=1}^{O}$ for all output tokens in $a^t_0, \dots, a^t_O$ from $\pi_\theta$, and the state-value estimate $V_{old} \leftarrow V_\phi(s^t_0)$.
12. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Store $(s^t_0,\ a_t = \{a^t_0, \dots, a^t_O\},\ r_t,\ \log \pi_{\theta_{old}}(a_t \mid s_t),\ V_{old})$ in buffer $\mathcal{B}$.
13. &nbsp;&nbsp;&nbsp;&nbsp;Compute advantages $\hat{A} = [\hat{A}_t \ \forall t \in [1, T]]$ with Generalized Advantage Estimation$(r_1, \dots, r_T,\ V_\phi(s_1), \dots, V_\phi(s_T),\ \gamma,\ \lambda)$, and the $\lambda$-return $\hat{R} = [\hat{A}_t + V_\phi(s_t) \ \forall t \in [1, T]]$.
14. &nbsp;&nbsp;&nbsp;&nbsp;Shuffle $\mathcal{B}$, $\hat{R}$, $\hat{A}$ in place and split them into $n_b$ batches of size $B$ each.
15. &nbsp;&nbsp;&nbsp;&nbsp;For $b \in [1, n_b]$:
16. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$(s^{\text{batch}}, a^{\text{batch}}, r^{\text{batch}}, \log \pi_{old}^{\text{batch}}, V_{old}^{\text{batch}}),\ \hat{A}^{\text{batch}},\ \hat{R}^{\text{batch}} \leftarrow \text{batch}_b$; standardize all elements of $\hat{A}^{\text{batch}}$ in place.
17. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Compute the action-sequence log-probability $\log \pi_\theta^{\text{batch}}$ for all output tokens in $a^{\text{batch}}$ and the ratio $\text{ratio}^{\text{batch}} = \exp(\log \pi_\theta^{\text{batch}} - \log \pi_{\text{old}}^{\text{batch}})$.
18. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Compute the policy objective at the batch level using Equation (16) and the value objective using Equation (5); update $\theta$ and $\phi$ with $\nabla \mathcal{J}(\theta, \phi)$ using Equation (6).
19. &nbsp;&nbsp;&nbsp;&nbsp;Clear the buffer: $\mathcal{B} \leftarrow \emptyset$.

**Algorithm 2: Extract Phase from Output Text**

Input: output text $s$; valid phase mnemonics $\mathcal{K}$; phase descriptions $\mathcal{D}$; mapping from phase mnemonics to OpenAI Gym action classes `phase_to_code`; default action class `default_code` $\in [0, N_P)$.

1. Extract all matches $\mathcal{M} \leftarrow$ regex search for `<signal>…</signal>` in $s$.
2. For each phase in reverse($\mathcal{M}$): if phase $\in \mathcal{K}$, return `phase_to_code`[phase].
3. $s_{low} \leftarrow$ lowercase($s$); $i \leftarrow -1$; $k^{*} \leftarrow$ None.
4. For all $k \in \mathcal{K}$:
5. &nbsp;&nbsp;&nbsp;&nbsp;$j_k \leftarrow \texttt{rfind}(s_{low}, k.\texttt{lower()})$; $j_d \leftarrow \texttt{rfind}(s_{low}, \mathcal{D}[k].\texttt{lower()})$.
6. &nbsp;&nbsp;&nbsp;&nbsp;If $\max(j_k, j_d) > i$: $k^{*} \leftarrow k$ and $i \leftarrow \max(j_k, j_d)$.
7. If $k^{*} \neq$ None, return `phase_to_code`[$k^{*}$]; otherwise return `default_code`.

**Algorithm 3: Compute Unnormalized Probability of Phase from LLM Responses**

Input: number of responses $G$; policy LLM $\pi_\theta$; input state $s$. Output: histogram $h$ of the phases extracted from the $G$ responses.

1. Generate $G$ output responses $\{a_1, a_2, \dots, a_G\} \leftarrow \pi_\theta(s)$.
2. Initialize an empty list $\mathcal{P} \leftarrow []$.
3. For $g = 1$ to $G$: $p_g \leftarrow$ ExtractPhase($a_g$) (see Algorithm 2); append $p_g$ to $\mathcal{P}$.
4. Build the histogram $[\text{count}^{\text{answer}}_0, \dots, \text{count}^{\text{answer}}_{N_P - 1}]$ over $\mathcal{P}$ with $N_P$ bins for the $N_P$ supported phases.
5. Return the phase histogram $[\text{count}^{\text{answer}}_0, \dots, \text{count}^{\text{answer}}_{N_P - 1}]$.
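For readers who prefer executable form, a compact Python rendering of Algorithms 2 and 3 is given below. The function signatures and container choices are our own; the paper's implementation (and its integration with LibSignal's Gym action space) may differ in detail.

```python
import re
from collections import Counter

def extract_phase(text, valid_phases, descriptions, phase_to_code, default_code=0):
    """Algorithm 2 (sketch): parse the chosen phase from an LLM response.
    Prefer the last well-formed <signal>...</signal> tag; otherwise fall back
    to the last mention of a mnemonic or of its textual description."""
    for phase in reversed(re.findall(r"<signal>(.*?)</signal>", text)):
        if phase in valid_phases:
            return phase_to_code[phase]
    s_low = text.lower()
    best_pos, best_phase = -1, None
    for k in valid_phases:
        pos = max(s_low.rfind(k.lower()), s_low.rfind(descriptions[k].lower()))
        if pos > best_pos:
            best_pos, best_phase = pos, k
    return phase_to_code[best_phase] if best_phase is not None else default_code

def phase_histogram(responses, valid_phases, descriptions, phase_to_code):
    """Algorithm 3 (sketch): histogram of parsed phases over the G sampled
    responses, used as the unnormalized phase distribution for Softmax DSE."""
    codes = [extract_phase(r, valid_phases, descriptions, phase_to_code)
             for r in responses]
    counts = Counter(codes)
    return [counts.get(code, 0) for code in sorted(set(phase_to_code.values()))]
```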

### A.7 Ablation on Upper Clip Limit

We ablate the effect of the upper clip limit $\epsilon_u$ in the clipped surrogate objective (Equation [16](https://arxiv.org/html/2605.08516#A1.E16)) to examine how it influences the policy stability and learning dynamics of language-model-based controllers. In traditional PPO, a smaller clip range ($1 \pm \epsilon$) constrains the policy update and prevents destructive policy shifts, while a larger range permits more aggressive updates but risks over-fitting to noisy or stochastic rewards. In our entropy-controlled formulation, $\epsilon_u$ further interacts with the uncertainty-weighted auxiliary reward $w_{\text{uncertainty}} R^{\text{answer}}_{E}$, effectively scaling the policy's sensitivity to exploration signals. Prior work Shen ([2025](https://arxiv.org/html/2605.08516#bib.bib100)) suggests that larger upper clip limits can sustain higher entropy across training, discouraging premature convergence to deterministic policies.

As shown in Figures [11](https://arxiv.org/html/2605.08516#A1.F11) and [12](https://arxiv.org/html/2605.08516#A1.F12), this trade-off manifests clearly in both models. For Qwen3-0.6B on the Cologne1 intersection ($H_R = 2.7$), increasing $\epsilon_u$ from 0.2 to 0.5 led to degraded traffic efficiency, reflected in higher travel time and queue length. The smaller clip coefficients yielded smoother delay curves and higher throughput, indicating more stable and well-regularized learning. We observe a similar pattern with Qwen3-8B on CityFlow1x1 ($H_R = 3.1$): the strictest clipping ($\epsilon_u \approx 0.2$) achieved the best balance between exploration and control, while larger clip limits again inflated travel time and reduced throughput. However, delay metrics performed slightly better with a marginally higher $\epsilon_u = 0.35$.

Overall, these results reinforce the importance of bounded update magnitudes in language-model-based reinforcement learning. Tight upper clip limits constrain policy drift and encourage gradual adaptation, enabling the agent to maintain interpretable, state-responsive decision patterns rather than collapsing to single-phase policies. However, excessively restrictive clipping can also suppress meaningful exploration. Hence, an intermediate range of $\epsilon_u$ appears most effective: sufficiently permissive to preserve entropy yet conservative enough to maintain reward stability.

$$\mathcal{J}_{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_{\tau \sim \pi_{\text{old}}}\Big[\min\!\big(r_{l}(\theta)\,\hat{A}_{l},\ \operatorname{clip}\big(r_{l}(\theta),\, 1-\epsilon_{l},\, 1+\epsilon_{u}\big)\,\hat{A}_{l}\big)\Big] \qquad (16)$$
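As a concrete illustration of Equation (16), the sketch below computes an asymmetric clipped surrogate loss with separate lower and upper clip limits. The tensor shapes, batching, and averaging are assumptions for the sketch, not the paper's exact training code.

```python
import torch

def clipped_surrogate_loss(
    logprob_new: torch.Tensor,   # log pi_theta(a | s) for the sampled actions, shape [batch]
    logprob_old: torch.Tensor,   # log pi_old(a | s) under the behavior policy, shape [batch]
    advantage: torch.Tensor,     # advantage estimates A_hat_l, shape [batch]
    eps_lower: float = 0.2,      # epsilon_l: lower clip limit
    eps_upper: float = 0.2,      # epsilon_u: upper clip limit (the quantity ablated in A.7)
) -> torch.Tensor:
    """Negative of the asymmetric clipped surrogate objective J_CLIP in Equation (16)."""
    ratio = torch.exp(logprob_new - logprob_old)                       # r_l(theta)
    clipped = torch.clamp(ratio, 1.0 - eps_lower, 1.0 + eps_upper)     # clip(r_l, 1 - eps_l, 1 + eps_u)
    surrogate = torch.minimum(ratio * advantage, clipped * advantage)  # min(...) inside the expectation
    return -surrogate.mean()                                           # minimize the negative objective
```

A larger `eps_upper` lets high-advantage responses move the probability ratio further above 1 before clipping takes effect, which is precisely the exploration-versus-stability trade-off examined in this ablation.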
![Refer to caption](https://arxiv.org/html/2605.08516v1/figures/qwen06b_clip_ablation_first_test.png)

Figure 11: Effect of the upper clip limit $\epsilon_{u}$ in the $R(s^{t}_{O}, a^{t}_{O}) - H_{R} + w_{\text{uncertainty}} R^{\text{answer}}_{E}$ configuration, where $w_{\text{uncertainty}} = 1$ and $H_{R} = 2.7$. Each subplot reports one evaluation metric (travel time, queue length, delay, and throughput) as a function of $\epsilon_{u}$ for Qwen3-0.6B on the Cologne1 intersection.

![Refer to caption](https://arxiv.org/html/2605.08516v1/figures/qwen8b_clip_ablation_first_test.png)

Figure 12: Effect of the upper clip limit $\epsilon_{u}$ in the $R(s^{t}_{O}, a^{t}_{O}) - H_{R} + w_{\text{uncertainty}} R^{\text{answer}}_{E}$ configuration, where $w_{\text{uncertainty}} = 1$ and $H_{R} = 3.1$. Each subplot reports one evaluation metric (travel time, queue length, delay, and throughput) as a function of $\epsilon_{u}$ for Qwen3-8B on the CityFlow1x1 intersection.
### A.8 Experimental Rigor: Repeating with Multiple Seeds

To improve experimental robustness, we evaluate the primary configuration across multiple random seeds\.

Table 11: Performance across three independent random seeds on CityFlow1x1 (Reward = Queue Difference − Hurdle Rate + Temperature-scaled Softmax DSE, LLaMA3-8B).

Across seeds, performance remains stable. Travel time varies within a narrow band (140.2–243.5), and throughput remains consistently high (1844–1975). Queue length shows moderate variation (42.93–82.65), but no run exhibits policy collapse or severe degradation.

These results suggest that the observed gains are not attributable to a single favorable initialization. We acknowledge that larger-scale seed evaluation would provide tighter confidence intervals. However, each full fine-tuning run requires approximately 12 hours of GPU time under our four-episode training protocol. Given the computational cost of long-horizon LLM reinforcement learning, large multi-seed sweeps are uncommon in recent LLM-based traffic signal control studies (Lai et al., [2025b](https://arxiv.org/html/2605.08516#bib.bib10); Wang et al., [2024a](https://arxiv.org/html/2605.08516#bib.bib11)). Nonetheless, the additional runs presented here strengthen the empirical validity of our findings.

### A.9 Ablation on Number of Responses (G)

Table 12: Sensitivity to the number of responses ($G$) on CityFlow1x1 (Reward = Queue Difference − Hurdle Rate + Temperature-scaled Softmax DSE, LLaMA3-8B), Episode 2.

As $G$ increases, performance improves consistently in terms of travel time and throughput. Increasing $G$ strengthens semantic-level uncertainty estimation by aggregating agreement across multiple independently sampled reasoning trajectories. The $G=8$ configuration achieves the strongest results, reducing travel time by approximately 41% compared to $G=2$ and increasing throughput by over 12%.

However, this improvement comes at substantial computational cost. Because each additional response requires a full forward rollout and reward evaluation, GPU memory usage scales approximately linearly with $G$.

These results highlight a clear performance–efficiency trade-off. Smaller values of $G$ reduce the computational burden but weaken semantic consensus regularization, while larger values improve stability and final policy quality at increased cost. In practice, $G$ should be selected based on the available compute budget and the desired stability guarantees.
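To make the role of $G$ concrete, the sketch below shows one plausible way to turn the Algorithm 3 histogram into an agreement-based auxiliary reward via a temperature-scaled softmax, returning the probability mass assigned to the phase that was actually selected. The function name, temperature handling, and the final reward composition are illustrative assumptions rather than the paper's exact formulation.

```python
import math
from typing import List

def agreement_reward(counts: List[int], chosen_phase: int, temperature: float = 1.0) -> float:
    """Temperature-scaled softmax over the phase histogram; probability of the chosen phase."""
    scaled = [c / temperature for c in counts]
    max_s = max(scaled)                             # subtract the max for numerical stability
    exps = [math.exp(s - max_s) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs[chosen_phase]                      # larger when the G responses agree on the chosen phase

# Hypothetical composition of the shaped reward referenced in Tables 11-12:
# shaped_reward = env_reward - hurdle_rate + w_uncertainty * agreement_reward(counts, chosen_phase)
```

Under this reading, a larger $G$ gives the histogram more samples to estimate the model's decision distribution, which is the aggregation effect the ablation attributes to stronger semantic-level uncertainty estimation.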
