A discussion of what advantages Mixture-of-Experts (MoE) models offer over dense models beyond speed, given RAM constraints and scaling limits.
Delta Attention Residuals improve layer-wise routing in transformer models by attending to feature changes (deltas) rather than cumulative hidden states, achieving 1.7–8.2% validation perplexity gains across scales from 220M to 7.6B parameters.
Interfaze introduces a hybrid AI model architecture combining CNN/DNN specialization with transformer capabilities, achieving superior accuracy on deterministic tasks like OCR and translation while maintaining cost efficiency at scale.
This article asks why major LLM providers are not investing in diffusion LLMs despite recent advances such as Mercury 2, and explores whether fundamental issues or hardware bottlenecks are hindering broader adoption.
UniPool introduces a shared expert pool architecture for Mixture-of-Experts models, reducing parameter growth with depth while improving efficiency and performance over standard MoE baselines.
A comprehensive survey reviewing recent advances in intrinsic interpretability for Large Language Models, categorizing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction. The paper addresses the challenge of building transparency directly into model architectures rather than relying on post-hoc explanation methods.