Tag
The author maps the Kullback-Leibler (KL) divergence introduced by KV cache quantization in the Qwen3.6-35B-A3B and Gemma4-E2B QAT models.
An analysis of the transparency of Google's DiffusionGemma model release and its implications for AI safety and accountability.
This paper analyzes on-policy distillation (OPD), finding that OPD updates are sparse, distributed across layers, and FFN-heavy, and that they retain geometric properties distinct from dense parameter rewriting. The sparse structure is operationally useful, but sparsity-inducing SGD underperforms AdamW due to heterogeneous gradient scales.