RICE-PO is a critic-free policy optimization framework that turns retrieval interactions into localized credit signals for training reasoning agents, outperforming prompt-based and group-based RL baselines on the BRIGHT and BEIR benchmarks.
Co-ReAct is a rubric-guided action-selection framework for ReAct agents that applies rubrics as step-level guidance at inference time, improving trajectory quality and outperforming baselines on DeepResearchBench and SQA-CS-V2.