Tag
Detailed breakdown of running Qwen 3.5 122B MoE on a single RTX 3090 at 35 t/s using a custom llama.cpp fork (ik_llama.cpp) with fused MoE operations and expert offloading to CPU RAM, significantly outperforming stock llama.cpp MTP.
This paper reveals that zeroth-order fine-tuning of LLMs is dominated by a single decoding layer, which can be identified by activation outliers, and fine-tuning only that layer matches or exceeds full-model fine-tuning with up to 4.52x speedup.
This paper proves sharp dimension-free first-order lower bounds for finding epsilon-stationary points in higher-order smooth nonconvex optimization, resolving open problems for Hessian-Lipschitz and third-order smooth cases.
DP-MacAdam combines adaptive clipping and adaptive momentum to improve differentially private SGD, achieving better model utility without manual tuning of the clipping threshold.
This paper shows that discrete Gradient Descent with large step sizes restores symmetry in multi-pathway Deep Linear Networks, countering the symmetry-breaking predicted by Gradient Flow and re-balancing signal across pathways. The authors prove theoretically that balanced solutions are flatter (less sharp) than sparse ones, and that large learning rates drive the network toward stable, balanced configurations.
A guide to improving AI agent performance by optimizing the harness to offset high model costs, with a focus on hill-climbing techniques.
A recommendation for the Algorithmica resource on CPU cache memory organization, which provides detailed experimental analysis and optimization techniques for in-memory algorithms.
MIT researchers show that the edge of stability (EoS) in neural network training is not merely a global optimization phenomenon but selectively redistributes learning across subsets of the training distribution, amplifying progress on some data groups while suppressing others. They identify two key conditions governing this allocation: gradient alignment with the top Hessian eigenvector and sustained non-vanishing gradient magnitude.
Researchers from the University of Amsterdam propose a tabular reinforcement learning approach to the Metro Network Expansion Problem, showing it achieves comparable performance to Deep RL while reducing training episodes by 18x and carbon emissions by 12x on average. The method also incorporates social equity criteria and is evaluated on real-world metro networks in Xi'an and Amsterdam.
This paper develops a sharp pseudospectral theory for block-triangular Jacobians in coupled gradient descent, proving Kreiss-constant bounds and establishing iteration complexity results. The work exposes non-asymptotic, instance-dependent transient amplification phenomena relevant to bilevel optimization, two-time-scale stochastic approximation, and GAN training.
This paper proposes a principle of 'constraint-enhanced physical search' where temporal correlations in exploration are matched to constraint-induced spatial correlations in update dynamics, demonstrated via a tug-of-war bandit model. The authors show that efficient search emerges not from maximal randomness but from matching temporal correlation to the physical update scale that converts feedback into evidence.
Researchers from Beihang University and Baidu propose 'constraint injection,' a dual verification method for LLM-based optimization modeling that detects spurious or omitted constraints beyond objective equivalence. They develop VRPCoder, an 8B model for translating natural-language vehicle routing problems into Gurobi scripts, achieving 93% average Pass@1 and outperforming Claude Sonnet and prior OR-LLMs by large margins.
A survey of inlining heuristics in method JIT compilers, discussing the challenges of when to inline and the trade-offs involved, with examples from Ruby and Python.
llama.cpp releases version b9495 with optimizations for Qwen3.6/3.5-MTP (Multi-Token Prediction) and asks users to share their benchmark results along with the full commands used.
Manticore Search introduces early termination for HNSW-based KNN vector search, reducing distance computations by up to 80% for large k values while maintaining precision within 2-4% of full search.
A user expresses thanks for the GEPA tool, highlighting its natural workflow for LLM programs, fast iteration, and ability to bias optimization with data-derived priors.
The paper introduces GAMBLe, a framework that decomposes AI-Driven Research Systems into generator, assessor, discovery mechanism, and budget, revealing how component interactions shape optimization landscapes. Experiments on NP-hard problems show no universally best configuration, emphasizing the need for careful component selection.
Introduces SNMPBB, a nonmonotone gradient-based algorithm for symmetric nonnegative matrix factorization that achieves significant speedups over existing methods, with extensions to graph clustering and low-rank approximations.
GRZO is a novel zeroth-order optimization method for fine-tuning large language models that reduces variance by using group-relative normalization, achieving better accuracy and memory efficiency compared to MeZO.
This paper presents an exact decomposition of the curvature exponent α in neural network loss landscapes, explaining why it varies across layer types. It introduces the spectral alignment decomposition and derives a spectral transfer identity linking curvature, gradient rank decay, and Hessian exponents, validated across architectures and datasets.