Tag: #dpo

Cards List

talkie-lm/talkie-1930-13b-it

Hugging Face Models Trending · 2026-04-20

Talkie-1930-13b-it is a 13B-parameter instruction-tuned language model trained on pre-1931 text and preference-aligned with Direct Preference Optimization (DPO), an RL-free alternative to reinforcement learning from human feedback.
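The card mentions DPO fine-tuning. As background, the per-pair DPO objective can be sketched in a few lines; this is a minimal illustration of the standard loss, not this model's actual training code, and the function name is hypothetical:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from summed token log-probabilities.

    pi_*  : log-probs of the chosen/rejected response under the policy
    ref_* : log-probs of the same responses under the frozen reference model
    beta  : temperature on the implicit reward
    """
    # Implicit reward margin: beta times the difference of policy/reference
    # log-ratios between the chosen and rejected responses.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Negative log-sigmoid of the margin (Bradley-Terry likelihood).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With equal log-probs everywhere the margin is zero and the loss is log 2; raising the chosen response's policy log-prob relative to the reference lowers the loss.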


Where does output diversity collapse in post-training?

arXiv cs.CL · 2026-04-20

This paper investigates where and why output diversity collapses during post-training of language models, analyzing three OLMo 3 lineages (Think, Instruct, RL-Zero) across multiple tasks and metrics. The authors find that diversity collapse is primarily determined by training data composition and embedded in model weights during training, not addressable at inference time alone.


GroupDPO: Memory efficient Group-wise Direct Preference Optimization

arXiv cs.CL · 2026-04-20

GroupDPO introduces a memory-efficient algorithm for group-wise direct preference optimization that leverages multiple candidate responses per prompt while reducing peak memory usage through decoupled backpropagation. The method demonstrates consistent improvements over standard DPO across offline and online alignment settings.
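The summary describes leveraging multiple candidate responses per prompt while decoupling backpropagation across them. One plausible group-wise objective, sketched here under the assumption of a pairwise decomposition against the preferred candidate (the function name and the decomposition are illustrative, not the paper's actual formulation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def group_dpo_loss(policy_logps, ref_logps, best_idx, beta=0.1):
    """Average pairwise DPO loss of the preferred candidate against every
    other candidate in its group.

    policy_logps / ref_logps : summed log-probs, one entry per candidate
    best_idx                 : index of the preferred response in the group
    """
    losses = []
    for j, (pi_j, ref_j) in enumerate(zip(policy_logps, ref_logps)):
        if j == best_idx:
            continue
        # One rejected candidate is processed at a time; in a full trainer
        # the backward pass for this pair would run here, so only this
        # candidate's activations need to be live at once, which is the
        # kind of peak-memory saving the summary attributes to decoupling.
        margin = beta * ((policy_logps[best_idx] - ref_logps[best_idx])
                         - (pi_j - ref_j))
        losses.append(-math.log(sigmoid(margin)))
    return sum(losses) / len(losses)
```

With all candidates scored equally the average loss is again log 2, matching the two-candidate DPO case.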
