Tag
A blog post from LMSYS Org details optimizing Ling-2.6-1T, a 1-trillion-parameter hybrid MoE model, on TPU v7x using SGLang-JAX. It achieves efficient inference by hiding MoE data movement behind computation with a single Pallas kernel.
LlamaIndex improved their LiteParse PDF parsing skill for Claude agents, making it 37% cheaper and more accurate by optimizing agent behavior through evaluation traces.
Explains how a dispatch table built with GCC's computed goto extension can make bytecode VM dispatch faster than a traditional switch statement, with a simple example.
Proposes MGUP, a momentum-gradient alignment update policy for selective intra-layer parameter updates in stochastic optimization. It integrates with optimizers such as AdamW, Lion, and Muon, comes with theoretical convergence guarantees, and reports superior performance on large-scale model training tasks.
This paper uses a Transformer-based model on MLB Statcast data to counterfactually optimize baseball pitch sequences, finding that optimizing both final and setup pitches can improve season-level statistics like K/9 by over 1.0.
This paper rethinks the role of grouping in critic-free reinforcement learning for LLMs and proposes negative token filtering to enable stable training with a single rollout per prompt, achieving comparable or better performance on reasoning and agentic tasks.
This paper presents a skill-constrained model predictive control approach for resilient manufacturing supply chains, where training decisions affect future certified capacity. The controller solves a finite-horizon mixed-integer program and is evaluated on synthetic scenarios, showing that predictive control helps when bottlenecks are forecastable but is not universally superior.
The author found that llama.cpp can be compiled with the CUDA and Vulkan backends enabled at the same time, which yielded a roughly 10% improvement in decoding tokens/sec. They plan to run further benchmarks to assess the benefits.
The Reflex team made Python's ast.walk 220x faster for their AI code generation linter by removing generator overhead, inlining functions, and implementing a Rust binding.
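As a rough illustration of the first two steps only (dropping the generator and inlining the child-iteration helper), here is a minimal pure-Python sketch. It is not the Reflex team's code, and the name walk_list is hypothetical. It returns the nodes in the same breadth-first order as ast.walk, but builds a list directly instead of yielding.

```python
import ast


def walk_list(root):
    """Return every node under root, in ast.walk order, as a plain list.

    Hypothetical helper: avoids generator overhead by appending to a list,
    and inlines the work of ast.iter_child_nodes / ast.iter_fields.
    """
    AST = ast.AST          # local alias avoids repeated attribute lookups
    out = [root]
    i = 0
    while i < len(out):    # the list doubles as the breadth-first queue
        node = out[i]
        i += 1
        for name in node._fields:
            value = getattr(node, name, None)
            if isinstance(value, AST):
                out.append(value)
            elif isinstance(value, list):
                for item in value:
                    if isinstance(item, AST):
                        out.append(item)
    return out


tree = ast.parse("total = sum(x * x for x in range(10))")
names = [n.id for n in walk_list(tree) if isinstance(n, ast.Name)]
```

The speedup from a change like this depends on the tree and the interpreter version; the 220x figure in the summary also includes the Rust binding, which this sketch does not attempt.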
A new Nature paper introduces ERA, an AI system that iteratively writes, runs, scores, and improves scientific code through tree search, moving AI for science from text generation to code testing.
A story from a Windows x86 emulator team about encountering a program with a fully unrolled 64KB initialization loop (65,536 instructions) and adding a special optimization to replace it with a tight loop.
This survey categorizes LLM-based optimization into three paradigms (direct, tool-augmented, and tool-creating) and reviews their performance frontiers and limitations.
This paper introduces Spokes, a probabilistic diversification framework that uses the G-Vendi score to jointly optimize quality and diversity in pretraining data selection, achieving significant improvements in downstream task performance on FineWeb and DCLM.
This paper provides guidance on the appropriate use of different Schatten-p norms in deep learning, analyzing their theoretical properties and practical implications for model regularization and optimization.
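For readers unfamiliar with the term, the standard definition of the Schatten-p norm is given here as background; the notation is a common convention and may differ from the paper's. For a matrix A with singular values σ_i(A):

```latex
\|A\|_{S_p} = \Bigl( \sum_{i=1}^{\min(m,n)} \sigma_i(A)^p \Bigr)^{1/p},
\quad 1 \le p < \infty,
\qquad
\|A\|_{S_\infty} = \max_i \sigma_i(A).
```

The familiar special cases are p = 1 (nuclear norm), p = 2 (Frobenius norm), and p = ∞ (spectral norm).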
This paper introduces AdaNAGED, a method that combines zero-order optimization, parameter-free adaptation, and non-Euclidean update geometry for memory-efficient fine-tuning of large language models, with theoretical convergence guarantees and validation on the OPT-1.3B model.
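To make the zero-order ingredient concrete, here is a minimal sketch of a generic two-point zeroth-order update, which estimates a descent direction from two loss evaluations and never calls backpropagation. It is not AdaNAGED itself: the parameter-free adaptation and non-Euclidean geometry are omitted, and the toy loss, step size, and function names are all assumptions made for illustration.

```python
import random


def loss(x):
    # Toy objective standing in for a model's training loss (assumption).
    return sum(v * v for v in x)


def zo_step(f, x, lr=0.05, eps=1e-3, rng=random):
    """One two-point zeroth-order update along a random +/-1 direction."""
    z = [rng.choice((-1.0, 1.0)) for _ in x]
    x_plus = [v + eps * d for v, d in zip(x, z)]
    x_minus = [v - eps * d for v, d in zip(x, z)]
    # Central difference approximates the directional derivative along z.
    slope = (f(x_plus) - f(x_minus)) / (2 * eps)
    return [v - lr * slope * d for v, d in zip(x, z)]


def zo_minimize(f, x0, steps=200, seed=0):
    """Run repeated zeroth-order steps with a seeded random generator."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        x = zo_step(f, x, rng=rng)
    return x


start = [1.0, 2.0, 3.0]
end = zo_minimize(loss, start)
```

Only forward evaluations of the loss are needed, which is where the memory savings of zero-order fine-tuning come from: no activations or gradients have to be stored.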
This paper proposes an α-Fair Individual Solvent Premium (α-FISP) framework for insurance pricing that balances actuarial fairness and solidarity fairness while ensuring solvency, using constrained optimization to yield a continuum of pricing solutions.
Z Lab, SGLang, and Modal release DFlash, a new speculative decoding model for Qwen 3.5 397B-A17B that uses block diffusion and KV injection to achieve over 4x throughput improvement over baseline and 1.5x over native MTP.
A tweet from Song Han highlights continued work on KV cache compression, featuring a blog by Weian Mao that discusses system-level aspects often overlooked in papers.
A new KV cache optimization called kvflash doubles generation speed and reduces VRAM usage for Qwen 3.6-27B on a single RTX 3090 while maintaining accuracy.
This article details how Clojure, with the JVM's Vector API and careful optimization, achieved frame rates within 20% of C for a 3D stress test, demonstrating that a dynamic language can approach low-level performance on hot loops.