Tag
The author discovered that compiling llama.cpp with both CUDA and Vulkan backends simultaneously is possible, yielding a ~10% improvement in tokens/sec for decoding. They plan to run further benchmarks to assess the benefits.
The Reflex team sped up Python's ast.walk by 220x for their AI code generation linter by removing generator overhead, inlining functions, and implementing a Rust binding.
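To illustrate the first two techniques named in this summary (dropping the generator and inlining the child-iteration helper), here is a minimal pure-Python sketch. It is not the Reflex team's code: the function name `collect_nodes` is hypothetical, the Rust binding is not shown, and no speedup figure is claimed for this version.

```python
import ast


def collect_nodes(tree: ast.AST) -> list:
    """Return every node under `tree` as a list, in the same order as ast.walk.

    Hypothetical helper: it builds a plain list instead of yielding from a
    generator, and inlines the logic of ast.iter_child_nodes.
    """
    nodes = [tree]
    # Appending to a list while iterating over it is well-defined in Python:
    # the loop simply continues onto the newly appended items.
    for node in nodes:
        for name in node._fields:
            value = getattr(node, name, None)
            if isinstance(value, ast.AST):
                nodes.append(value)
            elif isinstance(value, list):
                for item in value:
                    if isinstance(item, ast.AST):
                        nodes.append(item)
    return nodes


tree = ast.parse("total = price * quantity + tax")
names = [n.id for n in collect_nodes(tree) if isinstance(n, ast.Name)]
```

Because the result is a list, a linter that needs several passes over the same tree can reuse it rather than re-walking the tree each time.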
A new Nature paper introduces ERA, an AI system that iteratively writes, runs, scores, and improves scientific code through tree search, moving AI for science from text generation to code testing.
A Windows x86 emulator team recounts encountering a program with a fully unrolled 64KB initialization loop (65,536 instructions) and adding a special-case optimization that replaces it with a tight loop.
This survey categorizes LLM-based optimization into three paradigms—direct, tool-augmented, and tool-creating—and reviews their performance frontiers and limitations.
This paper introduces Spokes, a probabilistic diversification framework using the G-Vendi score to optimize diversity in pretraining data selection, achieving significant improvements in downstream task performance on FineWeb and DCLM by jointly optimizing quality and diversity.
This paper provides guidance on the appropriate use of different Schatten-p norms in deep learning, analyzing their theoretical properties and practical implications for model regularization and optimization.
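For readers unfamiliar with the term, the standard definition of the Schatten-p norm is given below. This is the textbook definition, not a formula taken from the paper; $\sigma_i(A)$ denotes the $i$-th singular value of $A$.

```latex
\|A\|_{S_p} = \Bigl( \sum_{i} \sigma_i(A)^{p} \Bigr)^{1/p}, \qquad 1 \le p < \infty,
\qquad
\|A\|_{S_\infty} = \max_i \sigma_i(A).
% Special cases:
%   p = 1      : nuclear (trace) norm
%   p = 2      : Frobenius norm
%   p = \infty : spectral (operator) norm
```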
This paper introduces AdaNAGED, a method that combines zero-order optimization, parameter-free adaptation, and non-Euclidean update geometry for memory-efficient fine-tuning of large language models, with theoretical convergence guarantees and validation on the OPT-1.3B model.
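As background on the zero-order ingredient only, the sketch below shows a generic two-point zeroth-order gradient estimate driving plain SGD on a toy quadratic. It is not AdaNAGED: it uses a fixed learning rate (not parameter-free) and Euclidean updates, and every name and hyperparameter here (`zo_gradient`, `zo_sgd`, `eps`, `lr`, `steps`) is an assumption chosen for illustration.

```python
import random


def zo_gradient(loss, params, eps=1e-3, rng=random):
    """Two-point zeroth-order gradient estimate along one random direction.

    Only two loss evaluations are needed, so no backward pass (and no
    stored activations) is required.
    """
    direction = [rng.gauss(0.0, 1.0) for _ in params]
    plus = [p + eps * d for p, d in zip(params, direction)]
    minus = [p - eps * d for p, d in zip(params, direction)]
    # Central difference approximates the directional derivative.
    scale = (loss(plus) - loss(minus)) / (2.0 * eps)
    return [scale * d for d in direction]


def zo_sgd(loss, params, lr=0.05, steps=500, seed=0):
    """Plain SGD using the estimate above (fixed step size, Euclidean update)."""
    rng = random.Random(seed)
    for _ in range(steps):
        grad = zo_gradient(loss, params, rng=rng)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params


def quadratic(w):
    """Toy objective with its minimum at (3, -1)."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2


result = zo_sgd(quadratic, [0.0, 0.0])
```

The memory saving comes from replacing backpropagation with forward evaluations; the cost is a noisier gradient, which is why methods in this family typically need more steps than first-order fine-tuning.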
This paper proposes an α-Fair Individual Solvent Premium (α-FISP) framework for insurance pricing that balances actuarial fairness and solidarity fairness while ensuring solvency, using constrained optimization to yield a continuum of pricing solutions.
Z Lab, SGLang, and Modal release DFlash, a new speculative decoding model for Qwen 3.5 397B-A17B that uses block diffusion and KV injection to achieve over 4x throughput improvement over baseline and 1.5x over native MTP.
A tweet from Song Han highlights continued work on KV cache compression, featuring a blog by Weian Mao that discusses system-level aspects often overlooked in papers.
A new KV cache optimization called kvflash doubles generation speed and reduces VRAM usage for Qwen 3.6-27B on a single RTX 3090 while maintaining accuracy.
This article details how Clojure, with the JVM's Vector API and careful optimization, achieved frame rates within 20% of C for a 3D stress test, demonstrating that a dynamic language can approach low-level performance on hot loops.
Presents a Transformer-based scheduling policy trained with reinforcement learning for the open shop scheduling problem, showing that a model trained on small instances can generalize to much larger problems and compete with classical dispatching heuristics.
Proposes FedSPC, a modular correction method for personalized federated learning that applies control-variate correction only to shared parameters, improving performance across various PFL methods on CIFAR-100 and Tiny-ImageNet.
This paper presents a forecast-then-optimize algorithmic pricing tool for fashion e-commerce sales campaigns, using gradient-boosted trees for daily-demand forecasting and multi-objective optimization. A/B tests across 12 markets show the system achieves 6% higher profit while maintaining sales and revenue, and it has been deployed at Zalando.
This master's thesis at Uppsala University, done in collaboration with Oracle, investigates reducing the overhead of weak reference processing in the ZGC garbage collector by proposing three pipeline modifications and an alternative annotated-field mechanism.
A free 57-minute resource by MIT's Applied Math team covers matrix calculations and automatic differentiation for quants and optimization, highlighting Jane Street's high compensation for such skills.
Summarizes multiple community plugins and resources around the Hermes Agent framework, including Chinese practical guides, optimization manuals, visual monitoring tools, a native macOS GUI, and design skill packs, supporting users from beginner setup through advanced optimization.
Published a custom kernel to further optimize LTX-2.3 from Lightricks, achieving 1.52x speedup on GB10, building upon previous torch.compile and cuDNN attention optimizations.