Tag
OTCache is a training-free framework that uses optimal transport to predict caching schedules for diffusion models, achieving up to 4.7x acceleration on FLUX.1, Qwen-Image, and HunyuanVideo while improving generation fidelity.
BlockPilot proposes an instance-adaptive policy that predicts the optimal block size for diffusion-based speculative decoding, achieving significant speedup with minimal overhead.
A tweet highlights four open-source libraries (Unsloth, LLaMA Factory, DeepSpeed, Axolotl) that accelerate fine-tuning of large language models through memory and speed optimizations.
ResilPhase is a training-free acceleration framework for diffusion models that reformulates accelerated inference as stable macro-trajectory extrapolation in ODE space, using derivative-free barycentric Lagrange extrapolation and bounded phase mapping to achieve state-of-the-art fidelity under high acceleration ratios.
Enze Xie announces Sol Video Inference Engine, an agent-native, training-free full-stack accelerator for video diffusion that auto-tunes cache, sparse attention, token pruning, quantization, and kernel fusion, achieving >2× end-to-end speedup on large models like 64B Cosmos3-Super and 22B LTX-2.3.
Elad Gil reflects on the accelerating pace of AI progress, linking to a review of Charles Stross's sci-fi novel Accelerando, which explores singularity themes.
This paper proposes eCNNTO, a CNN with residual connections to accelerate density-based topology optimization by predicting near-optimal densities from early iteration histories, achieving up to 97% reduction in iterations and strong generalization across different boundary conditions, geometries, and mesh resolutions.
AdaPLD is a training-free method that improves model-free speculative decoding by using adaptive retrieval combining lexical and semantic similarity, and constructing branched reuse hypotheses to handle continuation uncertainty, achieving up to 3.10x decoding speedup.
TAPS proposes a target-aware prefix tree selection method for diffusion-drafted speculative decoding, achieving up to 7.9x lossless end-to-end speedup by improving the acceptance-cost tradeoff over prior methods.
An article chronicles the timeline of AI model releases since GPT-2 and highlights the accelerating pace of model launches.
This article argues that AI creates a fast feedback loop where humans and machines mutually shape truth, accelerating consensus shifts and making truth increasingly synthetic and detached from reality.
This paper proposes Speculative Pipeline Decoding (SPD), a framework that uses pipeline parallelism within a single LLM to enable parallel token speculation, avoiding the latency bubbles and accuracy degradation of multi-token prediction in traditional speculative decoding.
Greg Brockman highlights how AI gives researchers like mathematician Terence Tao the freedom to explore bolder, more creative ideas in their work.
RT-Lynx accelerates diffusion models by exploiting activation sparsity instead of weight sparsity, achieving up to 1.55× linear-layer speedup while maintaining generation quality. The paper was accepted at ICML 2026.
Global warming has accelerated to twice the rate of previous decades, with 98% confidence that the acceleration is due to climate change. At this pace, the 1.5°C Paris Agreement limit could be breached by 2028.
Sam Altman shares three areas of excitement for AGI: accelerating research, companies, and personal goals. He also notes recent announcements including a unit distance result and $2M in OpenAI credits for Y Combinator startups.
This paper introduces CATS, a cascaded adaptive tree speculation framework designed to accelerate LLM inference on memory-constrained edge devices by optimizing memory usage while maintaining high token acceptance rates.
This paper introduces PARD-2, a dual-mode speculative decoding framework that uses target-aligned parallel draft models to accelerate LLM inference, achieving up to 6.94x lossless acceleration on Llama 3.1-8B.
This paper introduces DARE, a method for improving the inference efficiency of Diffusion Large Language Models by reusing cached key-value and output activations to reduce computational redundancy with negligible quality loss.
This paper introduces SpecBlock, a block-iterative speculative decoding method that combines path dependence with efficient drafting to accelerate LLM inference. It demonstrates improved speedup over existing methods like EAGLE-3 while maintaining lower drafting costs.