Tag
This paper identifies a geometric mismatch in the Dion low-rank spectral optimizer and proposes Orth-Dion, which replaces column normalization with QR orthogonalization to close the convergence gap to full-rank methods like Muon at the same communication cost, validated on large-scale language model pre-training.
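To make the distinction between column normalization and QR orthogonalization concrete, here is a minimal pure-Python sketch. It is not the paper's Dion or Orth-Dion implementation: the helper names (`normalize_columns`, `gram_schmidt`) and the toy matrix are assumptions for illustration, and modified Gram-Schmidt stands in for the Q factor of a thin QR decomposition.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize_columns(cols):
    """Scale each column to unit length; angles between columns are unchanged."""
    return [[x / math.sqrt(dot(c, c)) for x in c] for c in cols]


def gram_schmidt(cols):
    """Modified Gram-Schmidt: returns orthonormal columns (the Q of a thin QR).

    Assumes the input columns are linearly independent.
    """
    q = []
    for c in cols:
        v = list(c)
        for u in q:
            coeff = dot(u, v)
            # Remove the component of v that lies along the earlier direction u.
            v = [a - coeff * b for a, b in zip(v, u)]
        norm = math.sqrt(dot(v, v))
        q.append([a / norm for a in v])
    return q


# Hypothetical toy matrix, stored as a list of columns.
columns = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
normalized = normalize_columns(columns)
orthonormal = gram_schmidt(columns)
```

On this toy input, the normalized columns have unit length but still overlap, while the Gram-Schmidt columns are mutually orthogonal.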
SignMuon is a 1-bit, matrix-aware optimizer for distributed training that combines signSGD's majority-vote sign aggregation with Muon's polar-step framework, achieving 32x bandwidth reduction over float32 while maintaining strong convergence and performance on benchmarks like CIFAR-10/ResNet-50 and nanoGPT.
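The majority-vote sign aggregation inherited from signSGD can be sketched as follows. This covers only the voting step, not SignMuon's polar step or the rest of the optimizer. The function names and the tie-breaking rule (a zero counts as +1) are assumptions made for this example.

```python
def sign(x):
    """Return +1 or -1; zero maps to +1 (an arbitrary tie-break for this sketch)."""
    return 1 if x >= 0 else -1


def majority_vote(worker_grads):
    """Aggregate per-worker gradients by coordinate-wise majority vote on signs.

    Each worker contributes one bit per coordinate (the sign of its gradient);
    the aggregate is the sign of the summed votes.
    """
    n = len(worker_grads[0])
    votes = [0] * n
    for grad in worker_grads:
        for i, value in enumerate(grad):
            votes[i] += sign(value)
    return [sign(v) for v in votes]


# Three hypothetical workers, three coordinates each.
grads = [
    [0.5, -1.0, 2.0],
    [-0.1, -3.0, 0.3],
    [0.2, 0.4, -0.9],
]
aggregated = majority_vote(grads)

# One sign bit per coordinate instead of a 32-bit float.
bandwidth_ratio = 32 // 1
```

Sending one bit per coordinate in place of a 32-bit float is where the 32x figure comes from.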
This paper proposes mirror descent-type algorithms for solving variational inequality problems with functional constraints, proving optimal convergence rates for problems with bounded monotone operators and Lipschitz convex constraints. A modification is introduced to improve efficiency for problems with many constraints.
This paper studies autonomous generative AI agents in multi-echelon supply chains using the MIT Beer Game, identifying four inference-time levers and introducing the concept of agent bullwhip. It shows that a reasoning model can exceed human performance, and proposes GRPO-based post-training to improve reliability.
This blog post explores how LoRA's interaction with weight decay leads to a different optimization objective than full fine-tuning, where weights are regularized towards the initial model rather than zero. It explains the implications for practitioners.
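One way to see the difference is to write both objectives side by side. This is a sketch under assumptions rather than the post's own derivation: weight decay is treated as an L2 penalty with coefficient $\lambda$ (exact for plain SGD, only approximate for decoupled decay as in AdamW), $W_0$ denotes the frozen pretrained weights, and $B$, $A$ are the LoRA factors.

```latex
\begin{align*}
\text{full fine-tuning:}\quad
  & \min_{W}\; \mathcal{L}(W) + \tfrac{\lambda}{2}\,\lVert W \rVert_F^2 \\
\text{LoRA:}\quad
  & \min_{A,\,B}\; \mathcal{L}(W_0 + BA)
    + \tfrac{\lambda}{2}\bigl(\lVert A \rVert_F^2 + \lVert B \rVert_F^2\bigr)
\end{align*}
```

The first penalty is zero only at $W = 0$. The second is zero at $A = 0$, $B = 0$, where the effective weight $W_0 + BA$ equals $W_0$, so stronger decay pulls the LoRA model back toward the initial model instead of toward zero.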
The author describes building FlashRT, a CUDA-first inference runtime that rewrites model inference paths with C++/CUDA kernels to address bottlenecks beyond GEMM for small-batch/realtime workloads, achieving significant latency improvements on Jetson Thor and RTX 5090. The article discusses lessons on precision (FP8 helpful, FP4 mixed) and the need to bypass generic runtimes for realtime inference.
The article argues that the real challenge in AI is not only building smarter models but also making them cost-efficient at scale, highlighting the importance of reducing token usage, improving speed, and optimizing infrastructure.
The author describes using HAProxy caching to reduce unnecessary load on snac threads in the FediMeteo service, following previous similar optimizations with nginx. The approach aims to keep the lightweight ActivityPub server efficient by having the reverse proxy absorb repeated public requests.
This academic paper investigates the asymmetry between pruning and growth in structural plasticity for neural networks, showing that newborn units suffer from weaker gradient signals than incumbent units, and proposes interventions to improve integration.
This paper proposes φ-balancing, a principled framework for load balancing in Mixture-of-Experts models that directly targets population-level expert balance using convex duality and mirror descent, achieving more stable expert utilization and outperforming prior methods on reasoning and code generation benchmarks.
This paper presents a case study using an LLM-driven tree search algorithm (ERA) combined with a coding agent (AntiGravity) to autonomously generate high-efficiency three-dimensional photovoltaic structures, overcoming limitations of flat solar panels at mid-latitudes. The workflow includes iterative patching to eliminate reward hacking and discovers improved designs under various constraints.
Benchmarking the b9200 update of llama.cpp with optimized flags for Qwen 3.6 27B MTP on a single RTX 3090 shows significant performance gains, especially in prompt processing speed, for agentic workflows.
AMD's ROCm 7.13 tech preview adds optimizations for Strix Halo (Ryzen AI Max 300) and open-sources the ROCprof Trace Decoder.
This pull request optimizes llama.cpp by avoiding unnecessary copying of logits during prompt decode in multi-token prediction, improving inference performance.
The article discusses how the KV cache is evolving into a memory hierarchy for LLM inference, optimizing memory management during decoding.
The article explores when C++ compilers can devirtualize virtual function calls, covering cases such as known dynamic types and the final keyword, with comparisons across GCC, Clang, MSVC, and ICC.
The article explains the singleflight pattern in Go, which eliminates redundant concurrent calls to expensive operations by ensuring only one call is in flight at a time, sharing results among all callers.
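The pattern can be sketched in Python with a lock and an event per in-flight key. This is an analogue for illustration, not Go's `golang.org/x/sync/singleflight` API: the `SingleFlight` class, its `do` and `waiting` methods, and the `demo` helper are all hypothetical names introduced here.

```python
import threading
import time


class _Call:
    """State shared by the leader and followers of one in-flight key."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None
        self.dups = 0  # number of followers waiting on the leader


class SingleFlight:
    """Coalesce concurrent calls that share a key into a single execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> in-flight _Call

    def do(self, key, fn):
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._calls[key] = call
            else:
                call.dups += 1
        if leader:
            try:
                call.result = fn()
            except BaseException as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._calls[key]
                call.done.set()  # wake every follower
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.result

    def waiting(self, key):
        """Number of followers currently registered for key (0 if none)."""
        with self._lock:
            call = self._calls.get(key)
            return call.dups if call else 0


def demo(n_callers=8):
    """Start n_callers concurrent calls; return (execution count, results)."""
    group = SingleFlight()
    release = threading.Event()
    executions = []
    results = []

    def expensive():
        executions.append(1)
        release.wait(timeout=5)  # hold the call open until followers arrive
        return "payload"

    def worker():
        results.append(group.do("k", expensive))

    threads = [threading.Thread(target=worker) for _ in range(n_callers)]
    for t in threads:
        t.start()
    deadline = time.monotonic() + 5
    while group.waiting("k") < n_callers - 1 and time.monotonic() < deadline:
        time.sleep(0.001)
    release.set()
    for t in threads:
        t.join()
    return len(executions), results


if __name__ == "__main__":
    print(demo())
```

The first caller for a key becomes the leader and runs the function; later callers for the same key block on the event and receive the leader's result or exception. Removing the key before signalling completion means a call that arrives afterwards starts a fresh execution instead of reusing a stale result.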
The Fil-C optimized calling convention ensures memory safety for C programs even under adversarial misuse, while maintaining efficiency by omitting safety checks in the common case. The article explains the generic and register-passing optimizations, which handle type violations via panics or well-defined behavior.
A Codex skill that analyzes codebases to identify performance hotspots such as loops, repeated lookups, and N+1 patterns.
This paper proposes out-of-place write optimizations for database systems to fully leverage SSD performance, achieving 1.65-2.24x throughput improvement and 6.2-9.8x reduction in flash writes on OLTP benchmarks.