Tag
A detailed technical exploration of MTP speculative decoding in llama.cpp with Gemma 4 models, showing that assistant model selection and quantization significantly impact speedups, and that not all 'same name' assistants perform equally.
EAGLE3, a speculative decoding method, has been integrated into llama.cpp, enabling faster inference.
This paper reveals that the scaling factor α in LoRA optimization is more influential than the learning rate, and proposes LoRA-α, a framework that improves performance and simplifies hyperparameter search by restoring α to its principled regime.
Arbor introduces structured tree search as a cognition layer for autonomous agents, enabling multi-day, full-stack LLM inference optimization with up to 193% throughput-latency improvement over vendor baselines through a checks-and-balances multi-agent architecture.
This paper introduces NaturalFlow, a fluency-aware optimization framework that reduces disruptive pauses in simultaneous speech-to-speech translation by leveraging model-internal signals, achieving a balance between low latency and natural speech flow.
This blog post presents an algorithm using integer linear programming to compute optimal tokenizers for language models, drawing parallels to solving the Traveling Salesman Problem. It notes that while the result is theoretically interesting, practical tokenizers are already near-optimal and the method may not generalize well.
This paper reveals that Mirror Descent with non-quadratic regularizers can be exponentially more sensitive to initialization than Gradient Descent, even under well-conditioned settings, which has implications for reproducibility in RL and LLM post-training.
SwiftCTS is a physics-informed surrogate framework that uses gradient-boosted ensembles and few-shot calibration to rapidly predict and Pareto-optimize clock tree metrics (power, wirelength, timing skew) across unseen designs, achieving high accuracy with minimal training data.
Introduces Compatibility-Aware Dynamic Fine-Tuning (CADFT), an extension of Dynamic Fine-Tuning that controls sample-level optimization variance in LLM supervised fine-tuning, improving stability and generalization.
Fulcrum Research introduces Inverse Rubric Optimization (IRO), a testbed for studying long-horizon agent behavior where agents must optimize the preferences of a black-box judge. The approach enables smooth scaling and rich behavior analysis, with experiments showing frontier models like Fable 5 and Opus 4.6 have different scaling characteristics.
Browser Use Beta achieved state-of-the-art results on a difficult internal web agent benchmark, using Fable for optimization and analysis.
This paper analyzes on-policy distillation (OPD), finding that OPD updates are sparse, distributed across layers and FFN-heavy, and retain geometric properties distinct from dense parameter rewriting. The sparse structure is operationally useful, but sparsity-inducing SGD underperforms AdamW due to heterogeneous gradient scales.
A pull request for llama.cpp that removes padding and multiple device-to-device copies for Multi-Token Prediction (MTP), improving performance on GPU.
This paper proposes trainable smooth-rotation transforms with quantile-robust scaling and gradient-based optimization to improve post-training quantization of LLMs, achieving significant error reduction on LLaMA-3.2-1B under W4A4 quantization.
This paper introduces Sim2Schedule, a simulator-guided LLM framework for autonomous open-pit mine scheduling that achieves 94-99% of the optimal NPV from MILP while scaling linearly in computation time, operating zero-shot without fine-tuning.
Proposes and compares two mathematical formulations for robust microgrid sizing and power scheduling under uncertainties, using a local reduction algorithm that achieves high feasibility rates in Monte Carlo simulations.
This paper formalizes the problem of ordering filters in sequential filtering pipelines under independent cost and selectivity models, proving that ordering by increasing ratio of cost to rejection probability is optimal. Monte Carlo simulations demonstrate that this ordering dominates common heuristics both in expectation and across the full distribution of outcomes.
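The ordering rule in this summary can be checked on a small example. The sketch below is not the paper's code: it assumes each filter is a `(cost, pass_probability)` pair, that filters are independent, and that every pass probability is strictly below 1 so the rejection probability is nonzero. The names `expected_cost` and `ratio_order` and the sample numbers are hypothetical, chosen for illustration only.

```python
from itertools import permutations

def expected_cost(filters):
    """Expected cost per item when filters run in the given order.

    Each filter is (cost, pass_probability). A filter's cost is paid only
    if the item survived every earlier filter (independence assumed).
    """
    total, reach = 0.0, 1.0
    for cost, p_pass in filters:
        total += reach * cost   # pay this filter's cost with probability `reach`
        reach *= p_pass         # probability the item reaches the next filter
    return total

def ratio_order(filters):
    """Sort by increasing cost / rejection probability (assumes p_pass < 1)."""
    return sorted(filters, key=lambda f: f[0] / (1.0 - f[1]))

# Hypothetical filters: (cost, pass_probability)
filters = [(5.0, 0.9), (1.0, 0.5), (3.0, 0.2), (2.0, 0.7)]

ordered = ratio_order(filters)
# Brute force over all 24 orderings as a sanity check on this instance.
best = min(permutations(filters), key=expected_cost)

print(ordered, expected_cost(ordered), expected_cost(best))
```

On this instance the ratio ordering matches the brute-force minimum. That is a check on one example, not a proof of the general result described in the summary.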
Improves prefill speeds for k-quants and refactors matrix multiplication for Q4/Q5/Q8 and k-quants in llama.cpp's WebGPU backend.
Harvard researchers present AutoScientists, a multi-agent system that forms self-organizing scientific teams without a central coordinator, achieving strong results on BioML-Bench and optimization tasks.
The article explains value numbering, a compiler optimization technique that identifies identical computations to avoid redundancy, building on Static Single Assignment (SSA) form and using hash-consing for efficient comparison.
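A minimal sketch of the idea, under assumptions not taken from the article: instructions are `(dest, op, arg1, arg2)` tuples in SSA form within a single basic block, and a dictionary keyed by `(op, arg1, arg2)` stands in for the hash-consing table. The function name `value_number` and the instruction encoding are hypothetical.

```python
def value_number(instrs):
    """Local value numbering over SSA-style instructions.

    instrs: list of (dest, op, arg1, arg2). Because each name is assigned
    once (SSA), a name can serve directly as the identity of its value.
    Returns (kept_instructions, replacements), where replacements maps a
    redundant dest to the earlier name that computes the same value.
    """
    table = {}   # (op, arg1, arg2) -> name that first computed this expression
    canon = {}   # redundant name -> canonical name
    kept = []
    for dest, op, a, b in instrs:
        # Rewrite operands to their canonical names before hashing, so that
        # expressions built from equal values hash to the same key.
        a = canon.get(a, a)
        b = canon.get(b, b)
        key = (op, a, b)
        if key in table:
            canon[dest] = table[key]    # same computation already seen
        else:
            table[key] = dest
            kept.append((dest, op, a, b))
    return kept, canon

program = [
    ("v1", "add", "x", "y"),
    ("v2", "add", "x", "y"),    # identical to v1
    ("v3", "mul", "v2", "z"),
    ("v4", "mul", "v1", "z"),   # identical to v3 once v2 is known to equal v1
]
kept, canon = value_number(program)
print(kept)
print(canon)
```

This version compares operands in the order written, so `add x y` and `add y x` are treated as different. Sorting the operands of commutative operators before building the key would be one way to catch those cases.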