This paper systematically evaluates 11 synthetic time-series generators for foundation model pretraining and finds that generator rankings are not stable across architectures, but that an equal-weight mixture of all generators matches or beats the best individual generator. Blending this mixture with real data yields the strongest pretraining corpora, reframing synthetic pretraining as a corpus composition problem rather than a generator selection problem.
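To make the "corpus composition" framing concrete, here is a minimal sketch of an equal-weight mixture blended with real data. Everything in it is an assumption for illustration: the three toy generators stand in for the paper's 11, and the function names, the placeholder "real" series, and the 50% real fraction are hypothetical, not taken from the paper.

```python
import math
import random

# Three toy generators (hypothetical stand-ins for the paper's 11).
def sine_wave(length, rng):
    freq = rng.uniform(0.01, 0.2)
    phase = rng.uniform(0.0, 2 * math.pi)
    return [math.sin(2 * math.pi * freq * t + phase) for t in range(length)]

def random_walk(length, rng):
    value, series = 0.0, []
    for _ in range(length):
        value += rng.gauss(0.0, 1.0)
        series.append(value)
    return series

def white_noise(length, rng):
    return [rng.gauss(0.0, 1.0) for _ in range(length)]

def sample_equal_mixture(generators, n_series, length, rng):
    """Draw each series from a generator chosen uniformly at random."""
    corpus = []
    for _ in range(n_series):
        gen = rng.choice(generators)  # equal weight for every generator
        corpus.append((gen.__name__, gen(length, rng)))
    return corpus

def blend_with_real(real_series, synthetic_corpus, real_fraction, rng):
    """Build a corpus of the same size in which `real_fraction` is real data."""
    total = len(synthetic_corpus)
    n_real = round(real_fraction * total)
    picked_real = [("real", rng.choice(real_series)) for _ in range(n_real)]
    picked_synth = rng.sample(synthetic_corpus, total - n_real)
    mixed = picked_real + picked_synth
    rng.shuffle(mixed)
    return mixed

rng = random.Random(0)
GENERATORS = [sine_wave, random_walk, white_noise]
synthetic = sample_equal_mixture(GENERATORS, n_series=300, length=64, rng=rng)
real = [[float(t % 7) for t in range(64)]]  # placeholder for real series
corpus = blend_with_real(real, synthetic, real_fraction=0.5, rng=rng)
```

The point of the sketch is that no generator is ranked or selected; the only knobs are the mixture weights and the real-to-synthetic ratio.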
Researchers propose a novel router redesign for Mixture-of-Experts models that aligns router rows with principal singular directions using Manifold Power Iteration, improving model effectiveness.
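For readers unfamiliar with the building block, the sketch below shows ordinary power iteration for the leading right singular vector of a matrix. It is not the paper's Manifold Power Iteration, whose details the summary does not give; the matrix `W` and all function names are hypothetical.

```python
import math

def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_right_singular_vector(A, n_iters=100):
    """Power iteration on A^T A; converges to the leading right singular
    vector when the start vector is not orthogonal to it and the top
    singular value is strictly larger than the rest."""
    At = transpose(A)
    v = normalize([1.0] * len(A[0]))
    for _ in range(n_iters):
        v = normalize(matvec(At, matvec(A, v)))
    return v

W = [[3.0, 0.0], [0.0, 1.0]]  # hypothetical weight matrix
v = top_right_singular_vector(W)
sigma = math.sqrt(sum(x * x for x in matvec(W, v)))  # leading singular value
```

Each step multiplies by `A^T A` and renormalizes, so components along weaker singular directions shrink geometrically relative to the leading one.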
An opinion piece argues that Google's Gemma4-12B model is strategically designed for IoT and mobile devices within the Android ecosystem, not just laptops as marketed, prioritizing low-latency speech and video processing over quality.
A discussion of the advantages of Mixture-of-Experts (MoE) models over dense models beyond speed, considering RAM constraints and scaling limits.
Delta Attention Residuals improve layer-wise routing in transformer models by attending to feature changes (deltas) rather than cumulative hidden states, achieving 1.7–8.2% validation perplexity gains across scales from 220M to 7.6B parameters.
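The distinction between attending to deltas and attending to cumulative hidden states can be illustrated with a toy example. This is only a sketch of the general idea under assumed definitions: a delta is taken to be the difference between consecutive layer outputs, and the query, the scaled dot-product scoring, and all names are hypothetical rather than the paper's formulation.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def layer_deltas(hidden_states):
    """d_l = h_l - h_{l-1} for consecutive layer outputs."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(hidden_states, hidden_states[1:])]

def attend(query, candidates):
    """Scaled dot-product attention of one query over candidate vectors."""
    dim = len(query)
    scores = [sum(q * c for q, c in zip(query, cand)) / math.sqrt(dim)
              for cand in candidates]
    weights = softmax(scores)
    mixed = [sum(w * cand[i] for w, cand in zip(weights, candidates))
             for i in range(dim)]
    return weights, mixed

# Toy per-layer hidden states for one token (2-dimensional).
hidden = [[1.0, 0.0], [1.0, 2.0], [1.5, 2.0]]
deltas = layer_deltas(hidden)
query = [0.0, 1.0]

w_delta, out_delta = attend(query, deltas)  # attends to what each layer changed
w_cum, out_cum = attend(query, hidden)      # attends to accumulated states
```

In the cumulative case, the second and third states both carry the same large second component, so the query cannot tell which layer introduced it; over deltas, the weight concentrates on the layer that actually made the change.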
Interfaze introduces a hybrid AI model architecture combining CNN/DNN specialization with transformer capabilities, achieving superior accuracy on deterministic tasks like OCR and translation while maintaining cost efficiency at scale.
This article asks why major LLM providers are not investing in diffusion LLMs despite recent advances such as Mercury 2, and explores whether fundamental issues or hardware bottlenecks are hindering broader adoption.
UniPool introduces a shared expert pool architecture for Mixture-of-Experts models, reducing parameter growth with depth while improving efficiency and performance over standard MoE baselines.
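The parameter-growth claim can be checked with simple arithmetic. The sketch below assumes a conventional MoE with a separate set of experts in every layer, and contrasts it with a single pool of fixed size shared by all layers. The pool size, dimensions, and function names are hypothetical, and router parameters are ignored; the summary does not specify UniPool's actual configuration.

```python
def expert_params(d_model, d_ff):
    """Weights in one two-matrix feed-forward expert (biases ignored)."""
    return 2 * d_model * d_ff

def per_layer_moe_params(n_layers, experts_per_layer, d_model, d_ff):
    """Standard MoE: every layer owns its own experts, so cost grows with depth."""
    return n_layers * experts_per_layer * expert_params(d_model, d_ff)

def shared_pool_params(pool_size, d_model, d_ff):
    """Shared pool: all layers route into the same experts, independent of depth."""
    return pool_size * expert_params(d_model, d_ff)

d_model, d_ff = 1024, 4096
for n_layers in (12, 24, 48):
    standard = per_layer_moe_params(n_layers, 8, d_model, d_ff)
    shared = shared_pool_params(32, d_model, d_ff)
    print(f"layers={n_layers:>2}  standard={standard:,}  shared_pool={shared:,}")
```

Under these assumptions the standard count doubles each time depth doubles, while the shared-pool count stays constant.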
A comprehensive survey reviewing recent advances in intrinsic interpretability for Large Language Models, categorizing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction. The paper addresses the challenge of building transparency directly into model architectures rather than relying on post-hoc explanation methods.