Tag
The author argues that AI models like GPT and Claude over-optimize human creations, missing the value of imperfection, messiness, and emotional depth in art and life.
Lucebox Hub provides optimized CUDA kernels (Megakernel, DFlash, PFlash) for local LLM inference, achieving significant speedups (2-10x) over llama.cpp on various models and GPUs.
Arthur Pastel optimized the fast_blur function in the Rust image-rs crate, achieving up to 5.9x speedup on u8 images by using box blur approximations for faster Gaussian-like blurs.
Steven Brunton announces his new book 'Optimization: A Bootcamp for Machine Learning, Inverse Problems, and Control', with pre-order available and accompanying free PDF, YouTube videos, and Python code.
A user shares their experience optimizing Qwen3.6-27B inference speed on a Mac using different quantization methods (Unsloth Q5, MLX 6bit + DFlash, MTPLX 4bit), ultimately reaching 43 tok/s.
Hugging Face's kernels project is expanding and seeking contributors for agentic kernel development to provide real optimization value to models.
EnergyLens is an end-to-end framework for predictive energy-aware optimization of multi-GPU LLM inference, validated on Llama3 and Qwen3-MoE, achieving mean absolute percentage errors between 9.25% and 13.19% and revealing significant energy variation across configurations.
This paper develops a principled scaling theory for Mixture-of-Experts (MoE) architectures, introducing the Maximally Scale-Stable Parameterization (MSSP) that ensures stable training and hyperparameter transfer across width, depth, expert width, and number of experts, validated by experiments.
Proposes PPOW, a reinforcement learning framework for optimizing draft models in speculative decoding using window-level objectives and adaptive windowing, achieving significant speedups across multiple benchmarks.
A developer reports progress implementing sparse attention for mlx-swift-lm, with only 4% overhead versus dense attention on an M5 Max.
A user shares a fix for performance bottlenecks when running AI models on AMD GPUs in Windows 11: disabling memory compression with the PowerShell command 'Disable-MMAgent -mc'.
Introduces Bayesian Model Merging (BMM), a plug-and-play bi-level optimization framework for combining multiple task-specific experts into a single model, achieving state-of-the-art performance on vision and language benchmarks.
This paper identifies 'staleness amplification' in bilevel optimization under delayed feedback and proposes IGT-OMD, which uses Implicit Gradient Transport to achieve sublinear regret and improve decision loss on benchmarks like Warcraft shortest-path and LQR.
Fast-Slow Training (FST) interleaves context optimization (via GEPA) with model weight updates via RL, achieving 3× sample efficiency over RL alone on math, code, and physics reasoning while preserving plasticity and enabling continual learning.
The article argues that the primary AI risk may not be superintelligence but rather systems that optimize flawed, incomplete representations of reality, leading to institutional drift, automated misclassification, and invisible governance failures.
TanStack Devtools migrated to OxcProject parser and magic-string, achieving a 3.56× speedup, with per-file transform time dropping from 1.65 ms to 0.46 ms.
Crustimate is a tool that helps optimize your LinkedIn profile to be discovered by AI-powered recruiters.
DeepSpeed is an open-source deep learning optimization library from Microsoft that enables efficient distributed training and inference of large-scale models with features like ZeRO, 3D parallelism, and Mixture-of-Experts.
The article discusses using Google's OR-Tools CP-SAT solver to optimize maintenance scheduling for cloud infrastructure at Akamai, addressing complex constraints like capacity and concurrency.
The article discusses Partial Static Single Information (SSI) form, an extension to SSA in compilers that captures path-dependent type information. It proposes a practical shortcut for implementing Partial SSI during SSA construction in dynamic languages, specifically referencing an implementation in Ruby's ZJIT.