mixture-of-experts

Tag · Cards List

DeepSeek V4 paper full version is out, FP4 QAT details and stability tricks [D]

Reddit r/MachineLearning · 8h ago

DeepSeek released the full V4 paper detailing FP4 quantization-aware training, MoE training stability tricks (anticipatory routing and SwiGLU clamping), and a generative reward model for RLHF. The efficiency gains are dramatic: V4-Flash uses only 10% of V3.2's FLOPs and 7% of its KV cache at 1M context length.
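
The summary only names the stability tricks, so here is a minimal sketch of what "SwiGLU clamping" could look like in practice, assuming it simply bounds the gated activation so low-precision training does not overflow; the threshold value and clamp placement below are illustrative guesses, not details from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClampedSwiGLU(nn.Module):
    """SwiGLU feed-forward with a clamp on the gated product.

    Sketch only: the clamp threshold (30.0) and its placement are
    assumptions for illustration, not the DeepSeek V4 recipe.
    """

    def __init__(self, d_model: int, d_ff: int, clamp: float = 30.0):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)
        self.clamp = clamp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # silu(gate) * up can spike for outlier tokens; bounding it keeps
        # activations inside a range a low-precision format can represent.
        h = F.silu(self.w_gate(x)) * self.w_up(x)
        return self.w_down(h.clamp(-self.clamp, self.clamp))
```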

new MoE from ai2, EMO

Reddit r/LocalLLaMA · 19h ago

AI2 released EMO, a Mixture-of-Experts language model with 1B active parameters out of 14B total, trained on 1 trillion tokens and featuring document-level routing where experts cluster around domains.
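
Document-level routing is the interesting bit here. A minimal sketch of the idea, assuming each document is routed once from a pooled representation so all of its tokens share the same experts (an illustration of the concept, not AI2's actual router):

```python
import torch

def document_level_route(hidden: torch.Tensor, router_w: torch.Tensor, top_k: int = 2):
    """Route per document instead of per token.

    hidden:   (n_docs, seq_len, d_model) token states for each document
    router_w: (d_model, n_experts) router projection
    Mean-pooling and top-k selection here are assumptions for illustration.
    """
    doc_repr = hidden.mean(dim=1)                  # (n_docs, d_model)
    probs = torch.softmax(doc_repr @ router_w, dim=-1)
    weights, experts = torch.topk(probs, top_k)    # one decision per document
    return weights, experts                        # shared by every token in the doc
```

With experts clustering around domains, one routing decision per document is cheap and maps whole domains onto small expert subsets.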

EMO: Pretraining mixture of experts for emergent modularity

Hugging Face Blog · yesterday

Allen AI releases EMO, a mixture-of-experts model where modular structure emerges naturally from the data, enabling a task to use just 12.5% of the experts while maintaining near-full-model performance.

ZAYA1-8B Technical Report

arXiv cs.AI · yesterday

This technical report introduces ZAYA1-8B, a mixture-of-experts reasoning model trained on AMD hardware that achieves competitive performance on math and coding benchmarks using under 1B active parameters. It also details Markovian RSA, a novel test-time compute method for aggregating parallel reasoning traces.

Expert Routing for Communication-Efficient MoE via Finite Expert Banks

arXiv cs.LG · yesterday

The paper introduces an information-theoretic framework for communication-efficient expert routing in sparse mixture-of-experts models, treating the gate as a stochastic channel and deriving practical mutual information estimators to analyze accuracy-rate tradeoffs over finite expert banks.
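
As a rough illustration of the framing, the sketch below computes a plug-in estimate of the mutual information I(X; E) between an input bucket X and the selected expert E from a routing contingency table; the paper derives its own estimators, so treat this as the generic baseline, not their method:

```python
import numpy as np

def routing_mutual_information(joint_counts: np.ndarray) -> float:
    """Plug-in estimate of I(X; E) in bits from (input bucket, expert) counts.

    Viewing the gate as a stochastic channel X -> E, I(X; E) bounds how
    many bits of routing information must cross the network per token.
    """
    p = joint_counts / joint_counts.sum()
    px = p.sum(axis=1, keepdims=True)    # marginal over input buckets
    pe = p.sum(axis=0, keepdims=True)    # marginal over experts
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log2(p / (px * pe)), 0.0)
    return float(terms.sum())

# Hypothetical traffic: 4 input buckets routed over a bank of 8 experts.
counts = np.random.default_rng(0).integers(1, 50, size=(4, 8))
print(f"I(X;E) ~ {routing_mutual_information(counts):.3f} bits")
```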

MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

arXiv cs.LG · yesterday

MACS is a training-free inference framework that mitigates the straggler effect in expert parallelism for MoE-based multimodal LLMs by introducing entropy-weighted load and dynamic modality-adaptive capacity mechanisms.
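
A minimal sketch of the entropy-weighted load idea, under the assumption that a token's contribution to an expert's load is scaled by its routing entropy (ambiguous tokens count more than confident ones); MACS's exact formulation may differ:

```python
import torch

def entropy_weighted_load(router_probs: torch.Tensor) -> torch.Tensor:
    """Estimate per-expert load, weighting tokens by routing entropy.

    router_probs: (n_tokens, n_experts) softmax outputs of the gate.
    Returns a normalized load vector; per-modality capacity could then
    be scaled in proportion to this load (an illustrative assumption).
    """
    eps = 1e-9
    entropy = -(router_probs * (router_probs + eps).log()).sum(dim=-1)
    load = (entropy.unsqueeze(-1) * router_probs).sum(dim=0)
    return load / load.sum().clamp_min(eps)
```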

EMO: Pretraining Mixture of Experts for Emergent Modularity

Hugging Face Daily Papers · 2d ago

EMO is a Mixture-of-Experts model that enables modular deployment by grouping tokens from similar domains onto shared experts. It matches the performance of standard MoEs while permitting aggressive expert pruning: retaining just 25% of the experts preserves 99% of performance.
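
The pruning claim suggests routing mass is heavily concentrated. A sketch of what task-specific expert pruning could look like, assuming usage counts from profiling the task and renormalizing the router over the retained subset (an illustration of the idea, not EMO's procedure):

```python
import torch

def prune_experts(router_logits: torch.Tensor, usage_counts: torch.Tensor,
                  keep_frac: float = 0.25):
    """Keep the most-used experts for a task and re-route over them.

    usage_counts: (n_experts,) routing frequency measured on task traffic
    (a hypothetical profiling step, not something the card specifies).
    """
    k = max(1, int(keep_frac * usage_counts.numel()))
    keep = torch.topk(usage_counts, k).indices
    masked = torch.full_like(router_logits, float("-inf"))
    masked[..., keep] = router_logits[..., keep]    # drop pruned experts
    return torch.softmax(masked, dim=-1), keep
```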

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

Hugging Face Daily Papers · 2d ago

UniPool introduces a shared expert pool architecture for Mixture-of-Experts models, reducing parameter growth with depth while improving efficiency and performance over standard MoE baselines.
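
A minimal sketch of the shared-pool idea, assuming one global expert bank with a separate router per layer (not the paper's actual implementation):

```python
import torch
import torch.nn as nn

class SharedPoolLayer(nn.Module):
    """MoE layer that routes into a globally shared expert pool."""

    def __init__(self, pool: nn.ModuleList, d_model: int):
        super().__init__()
        self.pool = pool                              # shared across layers
        self.router = nn.Linear(d_model, len(pool))   # per-layer router

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        idx = self.router(x).argmax(dim=-1)           # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.pool):
            mask = idx == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out

d_model = 64
pool = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(8))
layers = [SharedPoolLayer(pool, d_model) for _ in range(12)]  # 12 layers, 1 pool
```

Because every layer borrows from the same bank, expert parameters stay constant as depth grows; only the small per-layer routers are added.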

Jackrong/Qwopus3.6-35B-A3B-v1-GGUF

Hugging Face Models Trending · 3d ago

Jackrong releases Qwopus3.6-35B-A3B-v1, a reasoning-enhanced fine-tune of Alibaba's Qwen3.6 MoE model, optimized for logic and agentic coding with 35B total parameters and 3B active parameters.

Zyphra/ZAYA1-8B

Hugging Face Models Trending · 4d ago

Zyphra released ZAYA1-8B, an 8.4B parameter Mixture-of-Experts model with 760M active parameters, demonstrating high efficiency and strong performance in mathematical and coding reasoning tasks.

Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL

Papers with Code Trending · 2026-05-01

The paper introduces PRISM, a method that inserts a distribution-alignment stage between supervised fine-tuning and reinforcement learning to mitigate distributional drift in multimodal models. It uses a black-box adversarial game with an MoE discriminator to improve RLVR performance on models like Qwen3-VL.

NVIDIA Launches Nemotron 3 Nano Omni Model, Unifying Vision, Audio and Language for up to 9x More Efficient AI Agents

NVIDIA Blog · 2026-04-28

NVIDIA announces Nemotron 3 Nano Omni, an open multimodal model that unifies vision, audio, and language processing to enable faster and more efficient AI agents, achieving up to 9x higher throughput compared to other open omni models.

XiaomiMiMo/MiMo-V2.5-Pro

Hugging Face Models Trending · 2026-04-27

Xiaomi releases MiMo-V2.5-Pro, an open-source MoE language model with 1.02T total parameters and 1M token context, optimized for complex agentic and software engineering tasks.

poolside/Laguna-XS.2

Hugging Face Models Trending · 2026-04-23

Poolside releases Laguna XS.2, a 33B parameter MoE model with 3B activated parameters designed for agentic coding and local deployment on Macs with 36GB RAM.

deepseek-ai/DeepSeek-V4-Pro

Hugging Face Models Trending · 2026-04-22

DeepSeek releases V4-Pro and V4-Flash, Mixture-of-Experts models supporting million-token context with hybrid attention and Muon optimizer.

SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning

arXiv cs.CL · 2026-04-22

SAMoRA introduces a semantic-aware router and task-adaptive scaling to improve expert specialization and dynamic weighting in MoE-LoRA fine-tuning, outperforming prior methods on multi-task benchmarks.
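
As a sketch of the MoE-LoRA pattern SAMoRA builds on: a router mixes per-expert low-rank updates on top of a frozen layer, with a learned scalar standing in for task-adaptive scaling. The router features and scaling rule below are assumptions, not the paper's design:

```python
import torch
import torch.nn as nn

class LoRAMixture(nn.Module):
    """Mixture of LoRA experts over a frozen linear layer (illustrative)."""

    def __init__(self, base: nn.Linear, n_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)          # frozen pretrained weight
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, d_out))
        self.router = nn.Linear(d_in, n_experts)        # stand-in for a semantic router
        self.task_scale = nn.Parameter(torch.ones(()))  # stand-in for task-adaptive scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = torch.softmax(self.router(x), dim=-1)               # (..., E)
        delta = torch.einsum("...i,eir,ero->...eo", x, self.A, self.B)
        update = (gates.unsqueeze(-1) * delta).sum(dim=-2)          # mix expert updates
        return self.base(x) + self.task_scale * update
```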

@AntLingAGI: Introducing Ling-2.6-flash, an instruct model with 104B total parameters and 7.4B active parameters. Ling-2.6-flash is …

X AI KOLs Following · 2026-04-21

Ling-2.6-flash is a 104B-total/7.4B-active sparse instruct model optimized for token efficiency, aiming to cut costs and boost throughput on agent tasks.

@ProTekkFZS: Q4_K_M 3.6 35B at 768k with yarn on my 3090 has been a joy, I can't lie. Using the llama.cpp fork from @no_stp_on_snek …

X AI KOLs Following · 2026-04-20

A user reports running a 35B-parameter mixture-of-experts model at 768K context length using Q4_K_M quantization and YaRN on an RTX 3090 via a llama.cpp fork, offloading only 8 experts to CPU while maintaining acceptable performance.

Qwen3.5-27B, Qwen3.5-122B, and Qwen3.6-35B on 4x RTX 3090 — MoEs struggle with strict global rules

Reddit r/LocalLLaMA · 2026-04-20

A user benchmarks three Qwen models (Qwen3.5-27B dense, Qwen3.5-122B-A10B MoE, Qwen3.6-35B-A3B MoE) on 4x RTX 3090 GPUs under real agentic workloads. Despite their speed advantage, the MoE models consistently underperform the dense 27B at following strict global rules, though Qwen3.6-35B leads in generation throughput.

FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models

arXiv cs.CL · 2026-04-20

FineSteer is a novel inference-time steering framework that decomposes steering into conditional steering and fine-grained vector synthesis stages, using Subspace-guided Conditional Steering (SCS) and Mixture-of-Steering-Experts (MoSE) mechanisms to improve safety and truthfulness while preserving model utility. Experiments show a 7.6% improvement over state-of-the-art methods on TruthfulQA with minimal utility loss.
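
A minimal sketch of conditional inference-time steering, assuming a single concept direction and a projection threshold deciding which positions get steered (a crude stand-in for SCS/MoSE, not FineSteer's method):

```python
import torch

def conditional_steer(hidden: torch.Tensor, direction: torch.Tensor,
                      threshold: float = 0.0, alpha: float = 4.0) -> torch.Tensor:
    """Add a steering vector only where the hidden state projects past
    a threshold along the concept direction.

    hidden: (batch, seq, d_model); direction: (d_model,). The threshold
    and strength alpha are illustrative knobs, not tuned values.
    """
    direction = direction / direction.norm()
    proj = hidden @ direction                          # (batch, seq)
    mask = (proj > threshold).unsqueeze(-1).to(hidden.dtype)
    return hidden + alpha * mask * direction           # steer flagged positions only
```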
