Confidence-Adaptive SwiGLU for Mixture-of-Experts
Summary
Proposes Confidence-Aware SwiGLU (κ-SwiGLU) that adjusts expert gate sharpness in Mixture-of-Experts models based on token-level routing confidence, improving performance with minimal computational overhead.
Cached at: 06/02/26, 03:36 PM
Source: https://huggingface.co/papers/2606.00761
Abstract
SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness -- the smoothness and selectivity of the gating function -- is typically fixed throughout training. In this work, we propose Confidence-Aware SwiGLU (κ-SwiGLU), a variant of SwiGLU for Mixture-of-Experts (MoE) models that adjusts expert gate sharpness according to token-level routing confidence. Specifically, κ-SwiGLU parameterizes the SiLU gate sharpness coefficient as a learnable function of the router logit, enabling each expert gate unit to interpolate between smooth, broadly active gating and sharp, selective gating. We evaluate κ-SwiGLU on the FineWeb-Edu dataset across MoE Transformer models ranging from 8 to 28 layers. Across these settings, κ-SwiGLU improves mean CORE performance while adding negligible parameters and incurring only a small computational overhead, demonstrating that confidence-aware gate sharpness is a promising mechanism for improving MoE MLPs. The code is available at https://github.com/askerlee/kappa-swiglu.
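The abstract does not give the exact formula, so the following is only a minimal sketch of the idea for a single gate unit, not the authors' implementation. It assumes the sharpened SiLU takes the form x * sigmoid(beta * x), and that beta is obtained from the router logit through a softplus of an affine map. The names sharpness, a, and b are hypothetical and chosen for illustration; the real parameterization is in the linked repository.

```python
import math


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z: float) -> float:
    # Stable log(1 + exp(z)); used here to keep the sharpness positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sharpness(router_logit: float, a: float, b: float) -> float:
    # ASSUMPTION: beta = softplus(a * router_logit + b), with a and b learnable.
    # The paper only states that beta is a learnable function of the router logit.
    return softplus(a * router_logit + b)


def silu_beta(x: float, beta: float) -> float:
    # SiLU with an explicit sharpness coefficient.
    # beta = 1 gives the standard SiLU; large beta approaches a hard gate
    # (ReLU-like); beta = 0 gives the linear map x / 2.
    return x * sigmoid(beta * x)


def kappa_swiglu_unit(gate_pre: float, up_pre: float,
                      router_logit: float, a: float, b: float) -> float:
    # One hidden unit of a SwiGLU expert: gated branch times linear branch,
    # with the gate sharpness driven by the token's router logit.
    beta = sharpness(router_logit, a, b)
    return silu_beta(gate_pre, beta) * up_pre
```

Under these assumptions, setting a = 0 and b = log(e - 1) makes beta equal to 1 for every token, which recovers ordinary SwiGLU. With a > 0, tokens routed with higher confidence get a sharper, more selective gate.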
Similar Articles
Introducing Laguna XS 2.1 (5 minute read)
Poolside releases Laguna XS 2.1, a 33B parameter Mixture-of-Experts model with 3B activated parameters per token, designed for agentic coding, with improvements on SWE-bench Multilingual and other benchmarks, now available under the permissive OpenMDW-1.1 license.
SHAPE: Coalition-Aware Expert Pruning for Sparse Mixture-of-Experts LLMs
SHAPE proposes a coalition-aware expert pruning framework for sparse MoE LLMs that uses Shapley-style attribution over routing traces to identify essential experts, achieving competitive accuracy under 20-40% pruning and reducing GPU memory footprint.
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
UniPool introduces a shared expert pool architecture for Mixture-of-Experts models, reducing parameter growth with depth while improving efficiency and performance over standard MoE baselines.
Integrating Local and Global Entropy for Uncertainty Quantification in LLMs
This paper proposes Global-Local Uncertainty (GLU), an unsupervised single-pass score that fuses token-level local entropy with hidden-state geometric global entropy for uncertainty quantification in LLMs, showing that the two are near-orthogonal and together capture confident-but-wrong failures.
Grouped Query Experts: Mixture-of-Experts on GQA Self-Attention
Grouped Query Experts (GQE) improves Transformer efficiency by applying a mixture-of-experts layer on top of grouped-query attention, selectively activating query heads per token while keeping key-value cache benefits, matching baseline accuracy with half the query-head compute at 250M parameter scale.