Tag
This paper introduces CDLinear, a block-circulant neural network layer that reduces parameter count and improves Hessian conditioning via FFT diagonalization; the construction is supported by theoretical proofs and validated empirically on MNIST.
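The paper's exact layer isn't reproduced here, but the core idea generalizes: a matrix built from b×b circulant blocks is stored as one length-b vector per block, and each block-vector product becomes an elementwise multiply in the FFT domain. A minimal sketch under those assumptions (the class name `BlockCirculantLinear` and the init scaling are illustrative, not CDLinear's specification):

```python
import torch
import torch.nn as nn

class BlockCirculantLinear(nn.Module):
    """Weight matrix partitioned into b x b circulant blocks, each stored
    as a single length-b defining vector, so parameters drop from
    in_features * out_features to in_features * out_features / b.
    Each block matvec is computed via FFT diagonalization:
    C @ x = irfft(rfft(c) * rfft(x))."""
    def __init__(self, in_features, out_features, block_size):
        super().__init__()
        assert in_features % block_size == 0 and out_features % block_size == 0
        self.b = block_size
        self.p = in_features // block_size   # number of input blocks
        self.q = out_features // block_size  # number of output blocks
        # one defining vector per (output block, input block) pair
        self.w = nn.Parameter(torch.randn(self.q, self.p, block_size)
                              / block_size ** 0.5)  # heuristic init scale

    def forward(self, x):                    # x: (batch, in_features)
        xb = x.view(-1, self.p, self.b)      # split input into blocks
        Xf = torch.fft.rfft(xb, dim=-1)      # (batch, p, b//2 + 1)
        Wf = torch.fft.rfft(self.w, dim=-1)  # (q, p, b//2 + 1)
        # circular convolution per block, summed over input blocks
        Yf = torch.einsum('qpk,npk->nqk', Wf, Xf)
        y = torch.fft.irfft(Yf, n=self.b, dim=-1)
        return y.reshape(-1, self.q * self.b)

# Usage: a 512 -> 512 layer with b = 64 stores 8 * 8 * 64 = 4096 weights
# instead of 512 * 512 = 262144, a 64x parameter reduction.
layer = BlockCirculantLinear(512, 512, block_size=64)
print(layer(torch.randn(3, 512)).shape)  # torch.Size([3, 512])
```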
A user asks how Qwen's 27B dense model can outperform its 397B MoE variant, sparking discussion of MoE efficiency versus dense-model quality.
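The efficiency side of that discussion usually turns on active rather than total parameters: a common rule of thumb puts forward FLOPs per token at roughly 2× the parameters actually used. A rough sketch of that comparison (the active-parameter figure for the MoE is hypothetical, not the actual Qwen config):

```python
# Rule of thumb: forward FLOPs per token ~= 2 * active parameters.
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense_active = 27e9                   # dense model: every parameter is active
moe_total, moe_active = 397e9, 30e9   # MoE: only a few experts fire per token
                                      # (30B active is an assumed figure)

print(f"dense 27B : {flops_per_token(dense_active):.2e} FLOPs/token")
print(f"MoE 397B  : {flops_per_token(moe_active):.2e} FLOPs/token")
# Under these assumptions the two cost about the same compute per token,
# so the quality gap comes down to how well total capacity is used.
```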
In an interview with Dwarkesh Patel, Andrej Karpathy claimed that a 1B-parameter model trained on ultra-clean data could match today's 1.8T-parameter frontier models, implying roughly 1,800× effective compression.
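The compression figure is just the parameter ratio:

```latex
% Back-of-envelope for the implied compression ratio:
\frac{1.8\times 10^{12}\ \text{parameters}}{1\times 10^{9}\ \text{parameters}} = 1800
```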
ShadowPEFT introduces a centralized parameter-efficient fine-tuning method that uses a depth-shared shadow module to refine transformer layer representations, matching or outperforming LoRA/DoRA with a comparable trainable-parameter budget.
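The paper's exact module design isn't given here; below is a minimal sketch of the depth-sharing idea under stated assumptions: one small bottleneck MLP (`SharedShadow`, `d_bottleneck`, and the residual placement are all illustrative) applied at every layer of a frozen base model, so trainable parameters stay constant regardless of depth.

```python
import torch
import torch.nn as nn

class SharedShadow(nn.Module):
    """One bottleneck MLP reused at every depth (hypothetical design,
    not the paper's specification). Applied as a residual correction
    to each layer's hidden states while the base model stays frozen."""
    def __init__(self, d_model: int, d_bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(d_bottleneck, d_model)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping,
        nn.init.zeros_(self.up.bias)    # so fine-tuning begins at the base model

    def forward(self, h):               # h: (batch, seq, d_model)
        return h + self.up(self.act(self.down(h)))

# Usage sketch: only the shared module is trained; the base transformer
# layers (omitted here) would be frozen and interleaved with it.
shadow = SharedShadow(d_model=768)
h = torch.randn(2, 16, 768)
for _ in range(12):                     # the same module at every layer
    # h = frozen_layer(h)               # frozen base layer would go here
    h = shadow(h)
print(h.shape)  # torch.Size([2, 16, 768])
```

Depth-sharing is what keeps this "centralized": unlike LoRA, which adds per-layer adapters, one module's parameters serve the whole stack.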