A user asks how Qwen's 27B dense model can outperform its 397B MoE variant, sparking discussion of MoE efficiency versus dense-model quality.
In an interview with Dwarkesh Patel, Andrej Karpathy claimed that a 1B-parameter model trained on ultra-clean data could match today's 1.8T-parameter frontier models, implying 1,800× effective compression.
ShadowPEFT introduces a centralized parameter-efficient fine-tuning method that uses a depth-shared shadow module to refine transformer layer representations, matching or outperforming LoRA/DoRA with comparable trainable parameters.
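The exact ShadowPEFT formulation isn't given here, but the core idea of a single depth-shared module refining every layer's representation can be sketched. Below is a minimal, hypothetical PyTorch illustration: one small bottleneck module (`ShadowModule`) is hooked onto every transformer block's output, the base model is frozen, and only the shared module is trained. All names (`ShadowModule`, `attach_shared_shadow`, `model.layers`) and design details such as the additive residual and bottleneck size are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ShadowModule(nn.Module):
    """One small bottleneck module shared across all layer depths
    (illustrative sketch, not the paper's exact design)."""
    def __init__(self, hidden_size: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck, bias=False)
        self.up = nn.Linear(bottleneck, hidden_size, bias=False)
        nn.init.zeros_(self.up.weight)  # starts as a no-op refinement

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Additively refine the layer's representation.
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

def attach_shared_shadow(model: nn.Module, hidden_size: int) -> ShadowModule:
    """Freeze the base model and hook the same ShadowModule onto every
    transformer block's output (assumes blocks are in `model.layers`)."""
    shadow = ShadowModule(hidden_size)
    for p in model.parameters():
        p.requires_grad_(False)  # only the shadow module remains trainable

    def refine(_module, _inputs, output):
        # Many transformer blocks return a tuple whose first element
        # is the hidden states; refine only that element.
        if isinstance(output, tuple):
            return (shadow(output[0]),) + output[1:]
        return shadow(output)

    for block in model.layers:  # one shared module, applied at every depth
        block.register_forward_hook(refine)
    return shadow
```

Because a single module serves all depths, the trainable-parameter count stays in the same range as a per-layer LoRA/DoRA adapter of comparable rank, which is the comparison the summary refers to.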