DeepSeek V4 paper full version is out, FP4 QAT details and stability tricks [D]

Reddit r/MachineLearning 05/09/26, 08:10 AM Papers

Summary

DeepSeek released the full V4 paper, which details FP4 quantization-aware training, two MoE training stability tricks (anticipatory routing and SwiGLU clamping), and a generative reward model for RLHF. The reported efficiency gains are large: at 1M context length, V4-Flash uses 10% of V3.2's FLOPs and 7% of its KV cache.

DeepSeek dropped the full V4 paper this week. The preview from April was 58 pages; this version adds a lot of technical depth. Here is what stood out for me.

FP4 quantization-aware training. They're running FP4 QAT directly in late-stage training. The MoE expert weights, which are the main GPU memory consumer, are quantized to FP4. The QK path in the CSA indexer uses FP4 activations, which gives a 2x speedup on the QK selector with 99.7% recall preserved. Inference runs directly on the FP4 weights.

The efficiency table is striking:

|Model|1M context FLOPs|KV cache|
|:-|:-|:-|
|V3.2|baseline|baseline|
|V4-Pro|27% of baseline|10% of baseline|
|V4-Flash|10% of baseline|7% of baseline|

Training stability, two mechanisms. Trillion-parameter MoE training has the loss spike problem: divergence and unpredictable failures. They documented two fixes.

Anticipatory routing. They deliberately desync the main model and router updates. The current step uses the latest params for features, but routing uses cached older params. That breaks the feedback loop that amplifies anomalies. It costs 20% overhead but only kicks in during loss spikes.

SwiGLU clamping. Hard limits on the SwiGLU linear path (-10 to 10) and the gate path (max 10). This suppresses extreme values that would otherwise cascade.

Generative reward model. Instead of separate reward models for RLHF, they use the same model to generate and evaluate. Trained on scored data, the model learns to judge its own outputs with reasoning attached. The upsides listed are minimal human labeling, reasoning-grounded eval, and unified training.

Human eval results. Chinese writing: V4-Pro gets a 62.7% win rate vs Gemini 3.1 Pro, and 77.5% on writing quality specifically. White-collar tasks (30 advanced tasks across 13 industries): V4-Pro-Max gets a 63% non-loss rate vs Opus 4.6 Max. Coding agent eval: 52% of users said V4-Pro is ready as their default coding model, 39% leaned yes, and less than 9% said no. That tracks my own use: I swapped V4-Pro into my Verdent runs last week and haven't noticed a quality hit on day-to-day work.

The headline for me is FP4 QAT with minimal quality degradation. If this generalizes, the cost structure of training and inference shifts a lot. That would be especially noticeable on multi-agent setups where one task can spawn 5-10 model calls. Paper link in comments.
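
For anyone who wants a feel for what FP4 weights look like in practice, here is a rough NumPy sketch of the quantize-then-dequantize ("fake quant") forward step that QAT schemes are usually built around. The post does not say which FP4 layout or scaling scheme the paper uses, so everything specific here is an assumption: the E2M1 value grid, one absmax scale per block, the block size of 32, and the function name `fake_quant_fp4`. The backward pass (typically a straight-through estimator) is not shown.

```python
import numpy as np

# Magnitudes representable in the E2M1 FP4 format (sign is handled separately).
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_fp4(w, block_size=32):
    """Round w to FP4 and back, using one absmax scale per block.

    Assumes w.size is a multiple of block_size (an assumption of this sketch).
    """
    flat = w.reshape(-1, block_size)
    absmax = np.abs(flat).max(axis=1, keepdims=True)
    # Map the largest magnitude in each block onto the largest FP4 value (6.0).
    scale = np.where(absmax > 0, absmax / FP4_E2M1[-1], 1.0)
    scaled = flat / scale
    # Pick the nearest grid point for each magnitude; ties go to the smaller one.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_E2M1).argmin(axis=-1)
    quantized = np.sign(scaled) * FP4_E2M1[idx]
    return (quantized * scale).reshape(w.shape)


rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 64))
weights_q = fake_quant_fp4(weights)
print("max abs rounding error:", np.max(np.abs(weights - weights_q)))
```

With only eight magnitudes per sign, the per-block scale does most of the work, which is presumably why training with the rounding in the loop matters more at 4 bits than at 8.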
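
The anticipatory routing idea, as described in the post, comes down to picking experts with an older copy of the router weights while a loss spike is in progress. Below is a toy NumPy sketch of just that selection step. It is not the paper's implementation: the class `StaleRouter`, the `spike` flag, the `refresh_cache` argument, and top-2 routing are all hypothetical names and choices made for illustration, and how a spike gets detected is left to the caller.

```python
import numpy as np


def top_k_experts(x, router_w, k=2):
    """Indices of the k highest-scoring experts for each token."""
    logits = x @ router_w  # shape: (tokens, n_experts)
    return np.argsort(-logits, axis=-1)[:, :k]


class StaleRouter:
    """Hypothetical helper: holds live router weights plus an older cached copy."""

    def __init__(self, router_w):
        self.live = router_w.copy()
        self.cached = router_w.copy()

    def update(self, new_w, refresh_cache):
        # The live weights always follow the optimizer. The cache only
        # catches up (to the previous live weights) when the caller asks.
        if refresh_cache:
            self.cached = self.live.copy()
        self.live = new_w.copy()

    def route(self, x, spike):
        # During a loss spike, route with the older cached weights.
        w = self.cached if spike else self.live
        return top_k_experts(x, w)


router = StaleRouter(np.eye(3))
router.update(np.eye(3)[:, ::-1], refresh_cache=False)
tokens = np.array([[1.0, 0.0, 0.0]])
print("normal step:", router.route(tokens, spike=False)[0, 0])
print("spike step: ", router.route(tokens, spike=True)[0, 0])
```

The token features `x` would still come from the latest main-model parameters. Only the expert choice lags behind, which is the desync the post describes.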
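
The SwiGLU clamping described above is simple enough to sketch in a few lines of NumPy. This is a minimal version under stated assumptions: the clamp is applied to the pre-activation values, the gate path only gets an upper bound, and the names `clamped_swiglu`, `w_gate`, and `w_up` are made up for this example. The paper may place the clamps differently.

```python
import numpy as np


def silu(x):
    # x * sigmoid(x), written with tanh so large negative inputs do not overflow.
    return x * 0.5 * (1.0 + np.tanh(0.5 * x))


def clamped_swiglu(x, w_gate, w_up, limit=10.0):
    """SwiGLU with hard limits on both branches (limit=10 as quoted in the post)."""
    gate = x @ w_gate
    up = x @ w_up
    gate = np.minimum(gate, limit)   # gate path: upper bound only
    up = np.clip(up, -limit, limit)  # linear path: bounded on both sides
    return silu(gate) * up


identity = np.eye(2)
extreme = np.array([[1000.0, -1000.0]])
print(clamped_swiglu(extreme, identity, identity))
```

With both branches bounded, a single unit's output magnitude cannot exceed roughly `limit * limit`, so one outlier activation has a hard ceiling on how much it can contribute downstream.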
