This pull request by ggerganov optimizes the KV cache in llama.cpp, the open-source LLM inference library, to avoid unnecessary copies of KV cells and so improve inference performance.
Steeve Morin reports that after 5 days of work, his implementation is now within 10% of llama.cpp's speed, achieving 64 tok/s vs 70 tok/s, with more work to do.
This paper introduces a general acceleration mechanism for multi-objective Bayesian optimisation that uses Gaussian process predictive gradients as auxiliary signals to augment existing acquisition functions, enabling faster convergence to the global Pareto set under limited evaluation budgets.
This paper addresses the open question of maximum step size for gradient descent convergence on non-L-smooth objectives, introducing adaptive methods that operate at the edge of stability and can minimize sharpness globally.
This book presents a mathematical theory of deep representation learning, aiming to demystify the internal mechanisms of large deep networks using optimization and information theory, making architecture design a matter of linear algebra and calculus.
This paper presents a QUBO-based model for coordinating departure sequencing and track allocation in railway short-term concentrated departure scenarios, evaluated using simulation and hybrid quantum algorithms. Results show quantum-enhanced methods reduce cost and delay under dynamic conditions.
The article coins the term 'dopamine fracking' to describe the process of pumping excessive resources into casual activities to extract maximum dopamine, ignoring long-term harm. It critiques the commodification of online culture, hobbies, and relationships in the digital age.
This article provides a technical breakdown of how the project management tool Linear achieves its fast performance by using a browser-side database (IndexedDB), local-first mutations, and a sync engine, eliminating network latency from user interactions.
A proposal to add spawn templates to the Linux kernel aims to optimize the fork+exec pattern by caching executable information, though the current patch set is unlikely to be accepted as-is.
Apple's dsymutil tool, which links DWARF debug info into self-contained bundles, is adopting a parallel DWARF linker to address the single-threaded bottleneck in type deduplication, despite challenges in qualification due to non-binary-identical output.
This article details practical techniques to speed up terminal startup by avoiding frameworks, caching completions, and lazy-loading tools, achieving a 30ms shell start.
A detailed account of running the Qwen3.6-35B-A3B MoE model on an 8GB laptop GPU, covering effective optimizations like --no-mmap and VRAM headroom, unexpected findings where speculative decoding improved speed by 26% contrary to benchmarks, and pitfalls with Windows and CPU bottlenecks.
This article explains a technique for performing the cycle decomposition behind std::rotate without calculating the greatest common divisor, the same approach used in OpenJDK's Collections.rotate method. It provides a C++ implementation that counts the elements moved so far and stops once that count reaches the length of the sequence, at which point all cycles are complete.
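The counting idea can be sketched in a few lines. The Python version below is only an illustration of the technique, not the article's C++ code; the function name `rotate_left` and its variable names are assumptions made for this example.

```python
def rotate_left(items, k):
    """Rotate `items` left by k positions in place, one cycle at a time.

    Illustrative sketch: instead of computing gcd(n, k) to learn how many
    cycles exist, count the elements placed and stop when all n are done.
    """
    n = len(items)
    if n == 0:
        return items
    k %= n
    if k == 0:
        return items

    moved = 0   # elements already written to their final position
    start = 0   # first index of the cycle currently being walked
    while moved < n:
        carried = items[start]      # value displaced at the cycle start
        i = start
        while True:
            src = (i + k) % n       # index whose value belongs at i
            if src == start:        # cycle closed
                break
            items[i] = items[src]
            moved += 1
            i = src
        items[i] = carried          # last slot of the cycle
        moved += 1
        start += 1                  # next cycle begins at the next index
    return items
```

Each cycle starts at the next consecutive index, so when the move counter reaches the sequence length every cycle has been visited exactly once and no GCD is needed.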
Detailed breakdown of running Qwen 3.5 122B MoE on a single RTX 3090 at 35 t/s using a custom llama.cpp fork (ik_llama.cpp) with fused MoE operations and expert offloading to CPU RAM, significantly outperforming stock llama.cpp MTP.
This paper reveals that zeroth-order fine-tuning of LLMs is dominated by a single decoding layer, which can be identified by activation outliers, and fine-tuning only that layer matches or exceeds full-model fine-tuning with up to 4.52x speedup.
This paper proves sharp dimension-free first-order lower bounds for finding epsilon-stationary points in higher-order smooth nonconvex optimization, resolving open problems for Hessian-Lipschitz and third-order smooth cases.
DP-MacAdam combines adaptive clipping and adaptive momentum to improve differentially private SGD, achieving better model utility without manual tuning of the clipping threshold.
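For readers unfamiliar with the building blocks, the sketch below shows generic per-example clipping with Gaussian noise, plus a simple quantile-tracking rule for adapting the clipping threshold. It is not DP-MacAdam's algorithm: the update rule is a stand-in in the style of earlier adaptive-clipping work, and every name and default here (`clip_gradient`, `private_mean_gradient`, `update_clip_norm`, `target_quantile`, `lr`) is an assumption made for illustration.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale one example's gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for j, g in enumerate(clip_gradient(grad, clip_norm)):
            total[j] += g
    sigma = noise_multiplier * clip_norm   # noise scales with the threshold
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]


def update_clip_norm(clip_norm, per_example_grads, target_quantile=0.5, lr=0.2):
    """Nudge the threshold toward a target quantile of gradient norms.

    Hypothetical stand-in rule. A real private method would also add noise
    to the fraction computed here; that step is omitted for brevity.
    """
    below = 0
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        if norm <= clip_norm:
            below += 1
    frac = below / len(per_example_grads)
    return clip_norm * math.exp(-lr * (frac - target_quantile))
```

The threshold shrinks when more than the target fraction of gradients already fits under it and grows otherwise, which is the sense in which manual tuning of the clipping threshold can be avoided.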
This paper shows that discrete Gradient Descent with large step sizes restores symmetry in multi-pathway Deep Linear Networks, countering the symmetry-breaking predicted by Gradient Flow, and leads to signal re-balancing across pathways. The authors theoretically prove that balanced solutions are flatter (less sharp) than sparse ones, and large learning rates drive the network toward stable, balanced configurations.
A guide to improving AI agent performance by optimizing the harness around the model rather than paying for a more expensive model, with a focus on hill-climbing techniques.
A recommendation for the Algorithmica resource on CPU cache memory organization, which provides detailed experimental analysis and optimization techniques for in-memory algorithms.