reward-shaping

#reward-shaping

OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control

arXiv cs.AI · yesterday

The paper introduces OracleTSC, a method that uses oracle-informed reward hurdles and uncertainty regularization to stabilize reinforcement fine-tuning of LLMs for traffic signal control. Using LLaMA-3-8B on the LibSignal benchmark, it reports significant improvements in traffic flow metrics while maintaining interpretability.


#reward-shaping

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

Hugging Face Daily Papers · yesterday

This paper proposes an empirical "sparse-to-dense" reward principle for language-model post-training. It argues that scarce labeled data should be spent on sparse-reward discovery with a teacher model, and that dense rewards should then be used to compress the teacher into a student via distillation. The authors show that this staged approach, which bridges sparse RL and on-policy distillation, outperforms direct GRPO on deployment-sized models on math benchmarks.


#reward-shaping

Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

arXiv cs.AI · 2d ago

This paper proposes a signal reshaping method for Group Relative Policy Optimization (GRPO) that improves agentic code repair under weak feedback, reporting significant gains in compile and semantic accuracy.


