@sheriyuo: https://x.com/sheriyuo/status/2063295181131247674

Summary

A curated list of 35 key reinforcement learning interview questions covering both algorithm and infrastructure topics, compiled from community experiences and recent trends.

https://t.co/aJEXibSlmN

Cached at: 06/08/26, 01:25 PM

RL Interview Questions 2026

After seeing several people receive PhD offers and then immediately land highly paid industry positions during spring recruiting, I started wondering whether going straight into industry might actually be the better move.

So I went through essentially every RL-related interview experience I could find on Zhihu, combined them with recent discussions and my own observations, and distilled everything into 35 of the most interesting questions.

Think of it as an RL interview benchmark.

Chinese version on Zhihu: https://zhuanlan.zhihu.com/p/2046740446353811230

A few notes:

• The list does not strictly separate LLM RL from Agentic RL. Some questions have very different answers depending on the setting.

• Nearly every question can be extended much further. No reference answers are provided. If you use an LLM, keep asking follow-up questions and search extensively.

• Modern RL hiring increasingly expects full-stack understanding. If you are an algorithm researcher, people will still ask infrastructure questions. The reverse is also true.

• Data-related questions are not included. Those are almost impossible to memorize and depend heavily on your actual experience.

• Memorizing interview questions is not enough. Deep understanding matters far more.

Algorithm

  • Why use Actor-Critic instead of a pure Critic approach?

  • What is the relationship between KL divergence, cross entropy, and MLE?

  • How should rewards be designed in different RL scenarios?

  • How do importance sampling, rejection sampling, and other Monte Carlo methods fit into RL?

  • How is advantage computed in PPO and GRPO? Why subtract a baseline? Is standard deviation normalization really necessary?

  • How do RL training and test-time scaling perform exploration differently?

  • How does PPO clipping work? Why take the minimum objective? What happens without clipping? How does CISPO differ?

  • Why does GRPO include a KL penalty? How is the KL computed? Why do methods such as DAPO and GSPO remove it?

  • During LLM training, what happens if the loss is accidentally all-reduced multiple times?

  • What is the reward function in DPO? Can reward hacking occur? How can it be mitigated?

  • What methods address train-inference mismatch in MoE models, and how do they work?

  • How should group size, learning rate, PPO epochs, and generation length be selected during RL training?

  • Compared with GRPO, how do Dr. GRPO, DAPO, GSPO, CISPO, SAPO, DPPO, MaxRL, and SimKO improve the training process? What are their limitations?

  • How do TRPO, DPPO, and AReaL enforce trust-region constraints on RL objectives?

  • Can RL fundamentally expand the capability frontier of LLMs?

  • Based on works such as ProRL, how should we think about scaling the boundaries of RL training?

  • What improvements does OPD introduce over traditional RL and SFT? What are its applications?

  • At which stage of training does reasoning ability emerge in LLMs?

  • From DeepSeek R1 to V3.2 and future V4 systems, what RL-related improvements have been introduced? How is RL different in MoE models?
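
The list gives no reference answers, and the following is not one. As orientation for the advantage and clipping questions above, here is a minimal Python sketch, assuming the commonly described forms: a group-relative advantage (reward minus the group mean, optionally divided by the group standard deviation) and the PPO clipped surrogate for a single sample. The function names, the `normalize_std` flag, and the `eps` and `clip_eps` defaults are illustrative assumptions, not taken from any particular framework.

```python
import math


def group_relative_advantages(rewards, normalize_std=True, eps=1e-6):
    """Advantage of each response relative to its own sampling group.

    Subtracting the group mean acts as the baseline; dividing by the group
    standard deviation is the optional step the question above asks about.
    """
    mean = sum(rewards) / len(rewards)
    centered = [r - mean for r in rewards]
    if not normalize_std:
        return centered
    std = math.sqrt(sum(c * c for c in centered) / len(rewards))
    return [c / (std + eps) for c in centered]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped objective for one sample (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s). Taking the minimum of the unclipped
    and clipped terms keeps the more pessimistic of the two estimates.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

For example, `group_relative_advantages([1.0, 0.0, 0.0, 1.0], normalize_std=False)` centers the rewards around the group mean of 0.5, and `clipped_surrogate(1.5, 1.0)` stops rewarding the ratio once it exceeds `1 + clip_eps`. How these pieces should be modified, or whether the normalization is needed at all, is exactly what the questions leave open.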

Infrastructure

  • Ignoring CPU offload, how many model copies exist in memory during GRPO training? How much memory can various optimizations save?

  • Distributed inference: KV cache transfer optimization and multi-GPU communication strategies.

  • INT8 versus FP8. What are the tradeoffs? Which precisions are preferred for training and inference?

  • What is the long-tail problem in RL rollouts, and how can it be addressed?

  • What issues does continuous batching introduce in RL training? How do vLLM and SGLang differ?

  • How do you measure utilization in vLLM and SGLang? How do you evaluate KV cache utilization during training?

  • How is backpropagation implemented in large-scale multi-node RL training?

  • What asynchronous RL frameworks exist, and what synchronization bottlenecks do they solve?

  • In AReaL or other partial-rollout frameworks, are KV caches from previous policies preserved?

  • How does Expert Parallelism affect MoE throughput?

  • In long-context training, how should compute-communication overlap be designed? How do Megatron and FSDP differ in parallelism strategies?

  • How do you enable deterministic execution? What is batch invariance, and what causes it to break? Is atomic add involved? Can atomic add solve the issue?

  • How do AReaL and slime differ in their understanding of the RL rollout bottleneck?

  • How should we think about staleness in fully asynchronous RL training? What are typical values in practice?

  • How does data flow through slime? How is it integrated with Megatron? How is the loss computed?

  • If you had to choose among VeRL, TRL, Unsloth, AReaL, and slime, which one would you use and why?
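
As background for the determinism and batch-invariance question above, one fact worth having at hand is that floating-point addition is not associative, so the same numbers reduced in a different grouping can give a different result. The sketch below shows this in plain Python with two hypothetical helper functions; it illustrates the arithmetic only and does not claim to describe how any particular GPU kernel or inference engine reduces its values.

```python
def left_to_right_sum(values):
    """Add values strictly in order, one at a time."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk_size):
    """Reduce fixed-size chunks first, then add the partial sums.

    This mimics a reduction whose grouping depends on a tile or batch size.
    """
    partials = [
        left_to_right_sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return left_to_right_sum(partials)


values = [1.0, 1e16, -1e16, 1.0]
print(left_to_right_sum(values))  # 1.0
print(chunked_sum(values, 2))     # 0.0
```

The exact sum is 2.0, yet the two groupings return 1.0 and 0.0. Whether atomic adds, batch-dependent tiling, or something else changes the grouping in a real system is the part the question asks you to work out.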

Good luck.

And remember: interview preparation helps, but genuine understanding scales much further than memorized answers.
