@ickma2311: CMU Advanced NLP: Reinforcement Learning I had been curious about how RL works on top of LLMs, and this CMU lecture mad…
Summary
A CMU Advanced NLP lecture clarifies how reinforcement learning optimizes a reward on the whole output (correctness, helpfulness, safety) rather than the next-token prediction objective used in pretraining and fine-tuning.
Cached at: 04/22/26, 08:22 AM
CMU Advanced NLP: Reinforcement Learning. I had been curious about how RL works on top of LLMs, and this CMU lecture made it much clearer for me: pretraining and fine-tuning focus on the next token; RL focuses on the reward of the whole output: correctness, helpfulness, safety,
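The contrast between a next-token objective and a whole-output reward can be made concrete with a small sketch. This is not taken from the lecture: it assumes a REINFORCE-style surrogate loss as one common way to train on a sequence-level reward, and the function names, the `baseline` parameter, and the binary `toy_reward` are all hypothetical, chosen only for illustration.

```python
import math


def next_token_loss(token_logprobs):
    """Pretraining / fine-tuning view: average negative log-likelihood,
    with one loss term for every reference token."""
    return -sum(token_logprobs) / len(token_logprobs)


def sequence_reward_loss(token_logprobs, reward, baseline=0.0):
    """RL view (REINFORCE-style surrogate, an assumption of this sketch):
    a single scalar reward for the whole sampled output scales the
    log-probability of the entire sequence."""
    return -(reward - baseline) * sum(token_logprobs)


def toy_reward(output, reference):
    """Hypothetical binary correctness reward: 1.0 on exact match, else 0.0."""
    return 1.0 if output.strip() == reference.strip() else 0.0


# Log-probabilities the model assigned to each token of one output.
logprobs = [math.log(0.5), math.log(0.25)]

# Token-level objective: every token contributes, whatever the final answer is.
print(next_token_loss(logprobs))

# Sequence-level objective: the same tokens matter only through the reward.
print(sequence_reward_loss(logprobs, toy_reward("42", "42")))
print(sequence_reward_loss(logprobs, toy_reward("41", "42")))
```

With a reward of 0.0 the surrogate loss is zero, so an incorrect output gives no gradient signal in this simplified form. Practical methods typically subtract a baseline or use a learned advantage so that poor outputs are actively pushed down.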
Similar Articles
Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
This paper investigates whether reinforcement learning can improve the direct recall of parametric knowledge in LLMs beyond reasoning tasks. It demonstrates that RL with binary rewards yields significant gains in factual QA benchmarks by redistributing probability mass to unlock latent knowledge rather than acquiring new facts.
@jiqizhixin: Awesome blog! State of RL for reasoning LLMs https://aweers.de/blog/2026/rl-for-llms/…
A comprehensive blog post reviewing the state of reinforcement learning for reasoning LLMs, covering methods from REINFORCE and PPO to GRPO and beyond, with connections to key models like InstructGPT and DeepSeek-R1.
@blc_16: MIT just released a new RL method called Pedagogical RL. The main lesson -> correct reasoning traces can still be bad t…
MIT introduces Pedagogical RL, a method that trains a teacher to produce trajectories that are learnable for a student by penalizing surprising steps, improving RL training efficiency.
When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL
This paper frames LLM-generated reward shaping for sparse structured RL as a debugging problem, identifying failure modes like reward flooding and semantic misunderstanding. The authors propose diagnostic-driven iterative refinement, achieving dramatic success rate improvements (e.g., DoorKey-8×8 from 2.3% to 97.6%) compared to one-shot generation.
Not All Tokens Matter: Towards Efficient LLM Reasoning via Token Significance in Reinforcement Learning
This paper proposes a reinforcement learning framework that improves LLM reasoning efficiency by modeling token significance to selectively penalize unimportant tokens while preserving essential reasoning, using both significance-aware and dynamic length rewards to reduce verbosity without sacrificing accuracy.