mixture-of-experts

#mixture-of-experts

Multi Tier MoE Caching

Reddit r/LocalLLaMA ↗ · 12h ago

Discusses multi-tier caching strategies for MoE models to improve inference speed by keeping frequently activated experts on GPU, referencing existing implementations like PowerInfer and llama.cpp branches.

@eliebakouch: every infra piece you need to know to do RL on GLM-5 https://primeintellect.ai/blog/rl-at-1t-scale…

X AI KOLs Timeline ↗ · 17h ago Cached

Prime Intellect releases prime-rl v0.6.0, enabling efficient reinforcement learning at trillion-parameter scale on large Mixture-of-Experts models, with sub-5-minute step times and optimizations for asynchronous RL.

Unsloth GLM-5.2 – How to Run Locally

Hacker News Top ↗ · 22h ago Cached

A guide on running Z.ai's open model GLM-5.2 locally using Unsloth Dynamic GGUFs. The model features 744B total parameters (40B active) and a 1M context window, with quantized versions reducing memory to 239GB for 2-bit, enabling local inference on 256GB Macs.

@BlackRainLabs: Using TurboQuant i was able to push 20 tk/s on qwen 3.6 35b MoE on a GTX1060 3GB. Insane for such a small and old card.…

X AI KOLs Following ↗ · yesterday Cached

Using TurboQuant, the user achieved 20 tokens per second on a Qwen 3.6 35B MoE model running on a GTX1060 3GB, showcasing impressive performance on outdated hardware.

What Is GLM-5.2? Inside Z.ai’s 744B-Parameter Agentic AI Model

Reddit r/AI_Agents ↗ · 2d ago

Z.ai (formerly Zhipu AI) has released GLM-5.2, a 744-billion parameter Mixture-of-Experts AI model designed for agentic tasks like autonomous software engineering, with a 1-million token context window, low moderation, and trained on domestic Huawei Ascend chips.

LLMs Are Complicated Now

Hacker News Top ↗ · 3d ago Cached

The article discusses how LLMs have grown increasingly complex, moving beyond simple transformer stacks to incorporate diverse attention variants, mixture-of-experts, and multimodal encoders, drawing parallels with recommendation systems and emphasizing the need for composable kernel optimization like FlexAttention.

poolside/Laguna-M.1 · Hugging Face - 225B-A23B

Reddit r/LocalLLaMA ↗ · 5d ago Cached

Poolside releases Laguna M.1, a 225B parameter Mixture-of-Experts model with 23B activated parameters per token, designed for agentic coding and long-horizon tasks. It achieves competitive results on SWE-bench benchmarks and is released under an Apache 2.0 license.

Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

arXiv cs.LG ↗ · 5d ago Cached

Proposes a structural pruning framework for MoE models that maximizes channel-score coverage via attribution-based approximation, achieving 50% or 25% pruning with 4-bit quantization and reducing memory footprint by 5.27x on Qwen3-30B-A3B.

@jbhuang0604: Huge! It’s amazing how often Noam’s papers end up at the center of the field. In many tutorial videos I’ve made, they’v…

X AI KOLs Following ↗ · 5d ago Cached

The article provides a detailed explanation of Mixture of Experts (MoE) in transformers, covering routing, load balancing, and recent innovations like fine-grained experts. It also highlights the significance of Noam Shazeer's research contributions and his move from Google to OpenAI.

@markchen90: Very excited to welcome @NoamShazeer to OpenAI as our new lead for architecture research! His work on transformers, MoE…

X AI KOLs Timeline ↗ · 5d ago Cached

Noam Shazeer, a key researcher behind transformers and MoE, is joining OpenAI as head of architecture research, moving from Google.

Grouped Query Experts: Mixture-of-Experts on GQA Self-Attention

Hugging Face Daily Papers ↗ · 5d ago Cached

Grouped Query Experts (GQE) improves Transformer efficiency by applying a mixture-of-experts layer on top of grouped-query attention, selectively activating query heads per token while keeping key-value cache benefits, matching baseline accuracy with half the query-head compute at 250M parameter scale.

GLM-5.2 is probably the most powerful text-only open weights LLM

Simon Willison's Blog ↗ · 5d ago Cached

Chinese AI lab Z.ai released GLM-5.2, a 753B parameter open weights LLM with a 1M token context window under MIT license, achieving top scores on the Artificial Analysis Intelligence Index and ranking second on the Code Arena WebDev leaderboard.

@ying11231: Impressive performance on TPU.

X AI KOLs Timeline ↗ · 6d ago Cached

A blog post from LMSYS Org details optimizing Ling-2.6-1T, a 1 trillion parameter hybrid MoE model, on TPU v7x using SGLang-JAX, achieving efficient inference by hiding MoE data movement behind computation with a single Pallas kernel.

@Jianlin_S: MoE (9): The Gate Normalization Debate https://kexue.fm/archives/11782

X AI KOLs Timeline ↗ · 6d ago

A blog post discussing the debate on gate normalization in Mixture of Experts (MoE) models.

Kimi K2.7 Code: 1T MoE, $0.95/M tokens, MIT license, beats Opus 4.8 on MCP tool-calling

Reddit r/AI_Agents ↗ · 6d ago

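Moonshot AI has released Kimi K2.7 Code, an open-weights model focused on coding, with 1 trillion parameters and 384 experts. It outperforms Opus 4.8 on MCP tool-calling at one-tenth the cost.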

MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs

arXiv cs.LG ↗ · 6d ago Cached

This paper introduces MODE, a modality-decomposed expert-level mixed-precision quantization framework for MoE multimodal LLMs that addresses biases in expert importance estimation by decomposing selection frequency by modality and filtering redundant vision tokens, achieving minimal performance loss under aggressive quantization.

@iScienceLuvr: Learning Sparse Latent Predictive Foundation Model for Multimodal Neuroimaging This paper introduces Neuro-JEPA, a foun…

X AI KOLs Following ↗ · 2026-06-16 Cached

This paper introduces Neuro-JEPA, a foundation model that uses a latent predictive objective and Mixture-of-Experts architecture to encode brain MRI scans across T1w, T2w, and FLAIR sequences, pretrained on a large dataset of 1.55 million scans.

Towards a Unified Generative Model for Scarce Time Series with Domain Experts

arXiv cs.LG ↗ · 2026-06-16 Cached

Introduces TimeMoDE, a framework combining Diffusion Transformers with Mixture-of-Experts for generating realistic time series under data scarcity, using pre-training on multi-domain datasets and domain prompts to handle domain-specific features and diffusion timestep signals for adaptive denoising.

Claude Fable 5 distilled

Reddit r/LocalLLaMA ↗ · 2026-06-16 Cached

Qwable-v1 is an open-weights agentic coding model (35B MoE, 3B active) built by chaining distills from Claude Opus 4.7 reasoning and Claude Fable-5 agentic tool-use traces. It can think in explicit CoT chains and act as a Claude-Code-style agent when prompted.

Decoupled Mixture-of-Experts for Parametric Knowledge Injection

arXiv cs.CL ↗ · 2026-06-15 Cached

Decoupled Mixture-of-Experts (DMoE) proposes a modular architecture for parametric knowledge injection, decoupling experts and router from the base model to enable efficient auto-regressive inference and mitigate catastrophic forgetting.
