Confidence-Adaptive SwiGLU for Mixture-of-Experts

Hugging Face Daily Papers 05/30/26, 12:00 AM Papers

mixture-of-experts swiglu confidence-adaptive routing transformer gating-function

Summary

Proposes Confidence-Aware SwiGLU (κ-SwiGLU), a SwiGLU variant that adjusts expert gate sharpness in Mixture-of-Experts models according to token-level routing confidence, improving mean CORE performance with negligible added parameters and only a small computational overhead.

SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness -- the smoothness and selectivity of the gating function -- is typically fixed throughout training. In this work, we propose Confidence-Aware SwiGLU (κ-SwiGLU), a variant of SwiGLU for Mixture-of-Experts (MoE) models that adjusts expert gate sharpness according to token-level routing confidence. Specifically, κ-SwiGLU parameterizes the SiLU gate sharpness coefficient as a learnable function of the router logit, enabling each expert gate unit to interpolate between smooth, broadly active gating and sharp, selective gating. We evaluate κ-SwiGLU on the FineWeb-Edu dataset across MoE Transformer models ranging from 8 to 28 layers. Across these settings, κ-SwiGLU improves mean CORE performance while adding negligible parameters and incurring only a small computational overhead, demonstrating that confidence-aware gate sharpness is a promising mechanism for improving MoE MLPs. The code is available at https://github.com/askerlee/kappa-swiglu.
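The abstract does not give the exact parameterization, so the following is only a minimal NumPy sketch of the idea it describes: a SiLU gate g * sigmoid(kappa * g) whose sharpness kappa is a learnable function of the router logit. The affine-plus-softplus form of kappa, the per-unit parameters `a` and `b`, and all function names are assumptions made for illustration; they are not taken from the paper or its repository.

```python
import numpy as np

def sigmoid(x):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def softplus(x):
    # log(1 + exp(x)), computed stably.
    return np.logaddexp(0.0, x)

def silu_with_sharpness(g, kappa):
    # SiLU with a sharpness coefficient: kappa = 1 is the usual SiLU,
    # large kappa approaches ReLU (sharp, selective gating),
    # small kappa approaches g / 2 (smooth, broadly active gating).
    return g * sigmoid(kappa * g)

def kappa_swiglu_expert(x, router_logit, w_gate, w_up, w_down, a, b):
    """One expert MLP with router-dependent gate sharpness (illustrative).

    x:            (tokens, d_model) token activations
    router_logit: (tokens,) this expert's router logit for each token
    w_gate, w_up: (d_model, d_ff); w_down: (d_ff, d_model)
    a, b:         (d_ff,) hypothetical learnable parameters per gate unit
    """
    g = x @ w_gate
    u = x @ w_up
    # ASSUMPTION: kappa is an affine function of the router logit passed
    # through softplus to keep it positive. The paper may use another form.
    kappa = softplus(router_logit[:, None] * a[None, :] + b[None, :])
    return (silu_with_sharpness(g, kappa) * u) @ w_down

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens, d_model, d_ff = 5, 8, 16
    x = rng.normal(size=(tokens, d_model))
    r = rng.normal(size=tokens)
    w_gate = rng.normal(size=(d_model, d_ff))
    w_up = rng.normal(size=(d_model, d_ff))
    w_down = rng.normal(size=(d_ff, d_model))
    a = 0.1 * rng.normal(size=d_ff)
    b = np.full(d_ff, np.log(np.e - 1.0))  # softplus(b) = 1 at r = 0
    print(kappa_swiglu_expert(x, r, w_gate, w_up, w_down, a, b).shape)
```

Under this assumed form, setting `a = 0` and `b = log(e - 1)` gives kappa = 1 for every token, so the expert reduces to a standard SwiGLU MLP. The extra parameters number 2 * d_ff per expert in this sketch, which is small next to the 3 * d_model * d_ff weights of the expert itself.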

Cached at: 06/02/26, 03:36 PM

Source: https://huggingface.co/papers/2606.00761

Similar Articles

Introducing Laguna XS 2.1 (5 minute read)

SHAPE: Coalition-Aware Expert Pruning for Sparse Mixture-of-Experts LLMs

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

Integrating Local and Global Entropy for Uncertainty Quantification in LLMs

Grouped Query Experts: Mixture-of-Experts on GQA Self-Attention
