DeepSeek released the full V4 paper detailing FP4 quantization-aware training, MoE training stability tricks (anticipatory routing and SwiGLU clamping), and a generative reward model for RLHF, achieving dramatic efficiency gains—V4-Flash uses only 10% of V3.2's FLOPs and 7% of its KV cache at 1M context length.
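The paper's exact stability recipe isn't reproduced here, but the idea behind SwiGLU clamping can be sketched in a few lines: bound the gate pre-activation so activation magnitudes stay representable under low-precision formats like FP4. The scalar form and the threshold of 7.0 below are illustrative assumptions, not DeepSeek's published values.

```python
import math

def swish(x: float) -> float:
    """swish(x) = x * sigmoid(x), the gating nonlinearity inside SwiGLU."""
    return x / (1.0 + math.exp(-x))

def clamped_swiglu(value: float, gate: float, limit: float = 7.0) -> float:
    """Toy scalar SwiGLU with gate pre-activation clamping.

    `value` and `gate` stand in for the two linear projections of the
    input. Clamping the gate to [-limit, limit] bounds activation
    magnitudes, one way to keep MoE training numerically stable in
    low-precision training. The limit 7.0 is a hypothetical value.
    """
    gate = max(-limit, min(limit, gate))
    return value * swish(gate)
```

Because swish saturates slowly for large inputs, clamping changes almost nothing in the typical range while preventing rare outlier activations from overflowing a narrow number format.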
AI2 released EMO, a Mixture of Experts language model with 1B active parameters out of 14B total, trained on 1 trillion tokens and featuring document-level routing where experts cluster around domains.
Allen AI releases EMO, a mixture-of-experts model where modular structure emerges naturally from data, enabling use of just 12.5% of experts for a task while maintaining near full-model performance.
This technical report introduces ZAYA1-8B, a mixture-of-experts reasoning model trained on AMD hardware that achieves competitive performance on math and coding benchmarks using under 1B active parameters. It also details Markovian RSA, a novel test-time compute method for aggregating parallel reasoning traces.
The paper introduces an information-theoretic framework for communication-efficient expert routing in sparse mixture-of-experts models, treating the gate as a stochastic channel and deriving practical mutual information estimators to analyze accuracy-rate tradeoffs over finite expert banks.
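The channel view can be made concrete with a naive plug-in estimator of the mutual information between inputs and expert assignments, computed from observed routing counts. This is an illustrative stand-in for the paper's estimators, not their construction.

```python
import math
from collections import Counter

def routing_mutual_information(pairs):
    """Plug-in estimate of I(token; expert) in bits from (token, expert) pairs.

    Treating the gate as a stochastic channel, I(token; expert) bounds
    how many bits of routing decision must be communicated per token.
    This naive empirical estimator is biased for small samples; it is
    meant only to illustrate the quantity being analyzed.
    """
    n = len(pairs)
    joint = Counter(pairs)                 # empirical joint distribution
    tok = Counter(t for t, _ in pairs)     # token marginal counts
    exp = Counter(e for _, e in pairs)     # expert marginal counts
    mi = 0.0
    for (t, e), c in joint.items():
        p_te = c / n
        # log p(t,e) - log p(t) - log p(e), in bits
        mi += p_te * math.log2(p_te * n * n / (tok[t] * exp[e]))
    return mi
```

A deterministic router (each token always sent to the same expert) maximizes this quantity, while a router independent of the input drives it to zero, which is the accuracy-rate tradeoff in miniature.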
MACS is a training-free inference framework that mitigates the straggler effect in expert parallelism for MoE-based multimodal LLMs by introducing entropy-weighted load and dynamic modality-adaptive capacity mechanisms.
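One way to read "entropy-weighted load" is sketched below: each token's contribution to its top-1 expert's load is weighted by routing confidence (low router entropy), since confidently routed tokens are hardest to redistribute away from an overloaded expert. The weighting scheme here is an assumption for illustration, not MACS's exact formulation.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def entropy_weighted_load(router_probs, num_experts):
    """Illustrative entropy-weighted load metric for expert parallelism.

    router_probs: per-token routing distributions over experts.
    Each token adds (1 - normalized entropy) to its top-1 expert's load,
    so decisively routed tokens dominate the straggler-risk estimate
    while uncertain tokens, which could be rerouted, count for little.
    """
    max_h = math.log2(num_experts)
    load = [0.0] * num_experts
    for p in router_probs:
        top = max(range(num_experts), key=lambda e: p[e])
        load[top] += 1.0 - entropy(p) / max_h
    return load
```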
EMO is a Mixture-of-Experts model that enables modular deployment by grouping tokens from similar domains under shared experts; it matches standard MoE performance while permitting aggressive expert pruning, with 25% of the experts retaining 99% of full-model performance.
UniPool introduces a shared expert pool architecture for Mixture-of-Experts models, reducing parameter growth with depth while improving efficiency and performance over standard MoE baselines.
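The structural point, that sharing one expert pool across layers decouples parameter count from depth, can be shown with a minimal sketch. The pool size and per-layer routing interface here are illustrative assumptions, not UniPool's exact design.

```python
class SharedExpertPool:
    """Minimal sketch of a cross-layer shared expert pool.

    In a standard MoE, each layer owns its own experts, so expert
    parameters grow linearly with depth. Here every layer draws from
    one shared pool, so parameter count depends only on pool size.
    """

    def __init__(self, pool_size, make_expert):
        # One set of experts, reused by all layers.
        self.experts = [make_expert(i) for i in range(pool_size)]

    def forward(self, layer_routes, x):
        # layer_routes: for each layer in order, the pool index its
        # router selected for this input (top-1 routing for simplicity).
        for idx in layer_routes:
            x = self.experts[idx](x)
        return x
```

Note that a 3-expert pool serves a 3-layer route and a 30-layer route with the same parameter budget; only the routing sequence lengthens with depth.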
Jackrong releases Qwopus3.6-35B-A3B-v1, a reasoning-enhanced fine-tune of Alibaba's Qwen3.6 MoE model, optimized for logic and agentic coding with 35B total parameters and 3B active parameters.
Zyphra released ZAYA1-8B, an 8.4B parameter Mixture-of-Experts model with 760M active parameters, demonstrating high efficiency and strong performance in mathematical and coding reasoning tasks.
The paper introduces PRISM, a method that inserts a distribution-alignment stage between supervised fine-tuning and reinforcement learning to mitigate distributional drift in multimodal models. It uses a black-box adversarial game with an MoE discriminator to improve RLVR performance on models like Qwen3-VL.
NVIDIA announces Nemotron 3 Nano Omni, an open multimodal model that unifies vision, audio, and language processing to enable faster and more efficient AI agents, achieving up to 9x higher throughput compared to other open omni models.
Xiaomi releases MiMo-V2.5-Pro, an open-source MoE language model with 1.02T total parameters and 1M token context, optimized for complex agentic and software engineering tasks.
Poolside releases Laguna XS.2, a 33B parameter MoE model with 3B activated parameters designed for agentic coding and local deployment on Macs with 36GB RAM.
DeepSeek releases V4-Pro and V4-Flash, Mixture-of-Experts models supporting million-token context with hybrid attention and Muon optimizer.
SAMoRA introduces a semantic-aware router and task-adaptive scaling to improve expert specialization and dynamic weighting in MoE-LoRA fine-tuning, outperforming prior methods on multi-task benchmarks.
Ling-2.6-flash is a 104B-total/7.4B-active sparse instruct model optimized for token efficiency, aiming to cut costs and boost throughput on agent tasks.
User reports successfully running a 35B-parameter mixture-of-experts model at 768K context length using Q4_K_M quantization and YaRN on an RTX 3090 via a llama.cpp fork, offloading only 8 experts to CPU while maintaining acceptable performance.
A user benchmarks three Qwen models (Qwen3.5-27B dense, Qwen3.5-122B-A10B MoE, Qwen3.6-35B-A3B MoE) on 4x RTX 3090 GPUs under real agentic workloads. Despite their speed advantage, the MoE models consistently underperform the dense 27B at following strict global rules, with the Qwen3.6-35B leading in generation throughput.
FineSteer is a novel inference-time steering framework that decomposes steering into conditional steering and fine-grained vector synthesis stages, using Subspace-guided Conditional Steering (SCS) and Mixture-of-Steering-Experts (MoSE) mechanisms to improve safety and truthfulness while preserving model utility. Experiments show a 7.6% improvement over state-of-the-art methods on TruthfulQA with minimal utility loss.
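The core idea of conditional steering can be sketched independently of FineSteer's specifics: add a steering vector to the hidden state only when the activation aligns with a condition direction, leaving unrelated inputs untouched. The cosine threshold and single steering vector below are simplifying assumptions; FineSteer's subspace guidance and mixture-of-steering-experts synthesis are not reproduced.

```python
def conditional_steer(hidden, steering_vec, condition_vec,
                      threshold=0.5, alpha=1.0):
    """Toy conditional activation steering in pure Python.

    hidden, steering_vec, condition_vec: equal-length float lists.
    If the cosine similarity between `hidden` and `condition_vec`
    exceeds `threshold`, add alpha * steering_vec; otherwise return
    the hidden state unchanged, which is how conditional methods
    preserve utility on inputs the condition does not match.
    """
    dot = sum(h * c for h, c in zip(hidden, condition_vec))
    norm_h = sum(h * h for h in hidden) ** 0.5
    norm_c = sum(c * c for c in condition_vec) ** 0.5
    cos = dot / (norm_h * norm_c) if norm_h and norm_c else 0.0
    if cos <= threshold:
        return list(hidden)  # condition not met: pass through unmodified
    return [h + alpha * s for h, s in zip(hidden, steering_vec)]
```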