Tag
Discusses multi-tier caching strategies for MoE models to improve inference speed by keeping frequently activated experts on GPU, referencing existing implementations like PowerInfer and llama.cpp branches.
Prime Intellect releases prime-rl v0.6.0, enabling efficient reinforcement learning at trillion-parameter scale on large Mixture-of-Experts models, with sub-5-minute step times and optimizations for asynchronous RL.
A guide on running Z.ai's open model GLM-5.2 locally using Unsloth Dynamic GGUFs. The model features 744B total parameters (40B active) and a 1M context window, with quantized versions reducing memory to 239GB for 2-bit, enabling local inference on 256GB Macs.
Using TurboQuant, a user reports 20 tokens per second on a Qwen 3.6 35B MoE model running on a 3GB GTX 1060, showcasing impressive performance on outdated hardware.
Z.ai (formerly Zhipu AI) has released GLM-5.2, a 744-billion parameter Mixture-of-Experts AI model designed for agentic tasks like autonomous software engineering, with a 1-million token context window, low moderation, and trained on domestic Huawei Ascend chips.
The article discusses how LLMs have grown increasingly complex, moving beyond simple transformer stacks to incorporate diverse attention variants, mixture-of-experts, and multimodal encoders, drawing parallels with recommendation systems and emphasizing the need for composable kernel optimization like FlexAttention.
Poolside releases Laguna M.1, a 225B parameter Mixture-of-Experts model with 23B activated parameters per token, designed for agentic coding and long-horizon tasks. It achieves competitive results on SWE-bench benchmarks and is released under an Apache 2.0 license.
Proposes a structural pruning framework for MoE models that maximizes channel-score coverage via attribution-based approximation, achieving 50% or 25% pruning with 4-bit quantization and reducing memory footprint by 5.27x on Qwen3-30B-A3B.
The article provides a detailed explanation of Mixture of Experts (MoE) in transformers, covering routing, load balancing, and recent innovations like fine-grained experts. It also highlights the significance of Noam Shazeer's research contributions and his move from Google to OpenAI.
Noam Shazeer, a key researcher behind transformers and MoE, is joining OpenAI as head of architecture research, moving from Google.
Grouped Query Experts (GQE) improves Transformer efficiency by applying a mixture-of-experts layer on top of grouped-query attention, selectively activating query heads per token while keeping key-value cache benefits, matching baseline accuracy with half the query-head compute at 250M parameter scale.
Chinese AI lab Z.ai released GLM-5.2, a 753B parameter open weights LLM with a 1M token context window under MIT license, achieving top scores on the Artificial Analysis Intelligence Index and ranking second on the Code Arena WebDev leaderboard.
A blog post from LMSYS Org details optimizing Ling-2.6-1T, a 1 trillion parameter hybrid MoE model, on TPU v7x using SGLang-JAX, achieving efficient inference by hiding MoE data movement behind computation with a single Pallas kernel.
A blog post discussing the debate on gate normalization in Mixture of Experts (MoE) models.
Moonshot AI has released Kimi K2.7 Code, a coding-focused open-weights model with 1 trillion parameters and 384 experts that outperforms Opus 4.8 on MCP tool calling at one-tenth the cost.
This paper introduces MODE, a modality-decomposed expert-level mixed-precision quantization framework for MoE multimodal LLMs that addresses biases in expert importance estimation by decomposing selection frequency by modality and filtering redundant vision tokens, achieving minimal performance loss under aggressive quantization.
This paper introduces Neuro-JEPA, a foundation model that uses a latent predictive objective and Mixture-of-Experts architecture to encode brain MRI scans across T1w, T2w, and FLAIR sequences, pretrained on a large dataset of 1.55 million scans.
Introduces TimeMoDE, a framework combining Diffusion Transformers with Mixture-of-Experts for generating realistic time series under data scarcity, using pre-training on multi-domain datasets and domain prompts to handle domain-specific features and diffusion timestep signals for adaptive denoising.
Qwable-v1 is an open-weights agentic coding model (35B MoE, 3B active) built by chaining distillations from Claude Opus 4.7 reasoning traces and Claude Fable-5 agentic tool-use traces. It can reason in explicit chain-of-thought steps and act as a Claude-Code-style agent when prompted.
Decoupled Mixture-of-Experts (DMoE) proposes a modular architecture for parametric knowledge injection, decoupling experts and router from the base model to enable efficient auto-regressive inference and mitigate catastrophic forgetting.
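
Several entries above refer to MoE routing and to the debate over gate normalization. As a rough illustration only, the sketch below shows generic top-k routing for a single token, with an optional step that renormalizes the selected gate weights so they sum to 1. It is not taken from any of the listed articles or models; the function names (softmax, route_top_k) and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(logits, k, normalize=True):
    """Pick the k highest-probability experts for one token.

    Returns a dict mapping expert index -> gate weight.
    With normalize=True the k selected weights are rescaled to sum to 1;
    with normalize=False the raw softmax probabilities are kept, so the
    weights sum to less than 1 whenever k < number of experts.
    (Hypothetical helper for illustration, not a library API.)
    """
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    gates = {i: probs[i] for i in top}
    if normalize:
        total = sum(gates.values())
        gates = {i: g / total for i, g in gates.items()}
    return gates


# Example: four experts, route each token to the top two.
example_logits = [2.0, 1.0, 0.5, -1.0]  # made-up router scores
normalized = route_top_k(example_logits, k=2, normalize=True)
raw = route_top_k(example_logits, k=2, normalize=False)
print(sorted(normalized), sum(normalized.values()))
print(sorted(raw), sum(raw.values()))
```

The same experts are selected in both modes; only the scale of their combined output changes, which is the point the normalization debate turns on.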