POLARIS is a training recipe that uses GRPO with LLM-as-judge rewards and human-reference injection to improve long-form story generation in small models. Applied to Qwen3.5-9B, the resulting POLARIS-9B model matches Qwen3.5-27B on creative writing benchmarks while adhering more closely to length instructions.
This paper introduces Micro-Macro Retrieval (M2R), a retrieve-while-generate framework that reduces hallucination in long-form LLM outputs by keeping key information close to the text being generated. It trains retrieval and grounding skills with reinforcement learning under a curriculum, and is reported to be especially effective in long contexts.