memory-optimization

#memory-optimization

CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration

arXiv cs.LG · 12h ago

This paper introduces CATS, a cascaded adaptive tree speculation framework designed to accelerate LLM inference on memory-constrained edge devices by optimizing memory usage while maintaining high token acceptance rates.


Training-Inference Consistent Segmented Execution for Long-Context LLMs

arXiv cs.CL · 12h ago

This paper proposes a training-inference consistent segmented execution framework for long-context LLMs to address the mismatch between full-context training and restricted inference regimes, achieving comparable performance with significantly reduced memory usage.


The hidden cost of mpsc channels

Lobsters Hottest · 22h ago

This article analyzes unexpected memory allocation costs in Tokio's mpsc channels in Rust, revealing a fixed overhead per channel due to internal block sizing. It demonstrates how this impacts large-scale applications like Agent Gateway and suggests alternatives like futures-channel for memory efficiency.


CERSA: Cumulative Energy-Retaining Subspace Adaptation for Memory-Efficient Fine-Tuning

arXiv cs.LG · yesterday

The paper introduces CERSA, a novel parameter-efficient fine-tuning method that uses singular value decomposition to retain principal components, significantly reducing memory usage while outperforming existing methods like LoRA.


Statistical Inference and Quality Measures of KV Cache Quantisations Inspired by TurboQuant

arXiv cs.LG · yesterday

This paper analyzes KV cache quantization schemes inspired by TurboQuant, using statistical inference and a new 6D error framework to evaluate quality measures like KL divergence and geometric error.


@tom_doerr: Runs 35B models on 16GB RAM Macs https://github.com/walter-grace/mac-code…

X AI KOLs Timeline · 2d ago

A tool that enables running large language models like Qwen3.5-35B on 16GB Macs by streaming model weights from SSD, achieving up to 30 tok/s with an optimal configuration.


Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

arXiv cs.CL · 2d ago

This paper introduces LaProx, a KV cache eviction strategy for long-context LLM inference that reformulates eviction as an output-aware approximation of matrix multiplication, achieving high performance with only 5% cache usage.


LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction

arXiv cs.LG · 2d ago

This paper introduces LKV, a method for end-to-end learning of head-wise budgets and token selection to optimize KV cache eviction in large language models, achieving state-of-the-art performance with high compression rates.


RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory

arXiv cs.LG · 2d ago

This paper introduces RateQuant, a method for optimal mixed-precision KV cache quantization that uses rate-distortion theory to address distortion model mismatch. It significantly reduces perplexity compared to existing methods like KIVI and QuaRot with minimal calibration overhead.


I solved kv-cache

Reddit r/AI_Agents · 2d ago

The author has open-sourced a novel KV-cache solution called catalyst-brain, claiming to dramatically reduce RAM usage for local models and potentially enable infinite context windows.


Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

Hugging Face Daily Papers · 3d ago

This paper introduces a learned global retention-based KV cache eviction method that improves long-context reasoning by selectively retaining useful tokens and reducing attention dilution, while significantly lowering memory usage.


Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

arXiv cs.LG · 5d ago

This paper introduces sparse prefix caching for hybrid and recurrent LLMs, which stores recurrent states at a limited set of checkpoint positions to avoid dense caching while minimizing recomputation. The method outperforms standard heuristics on real-world data, especially when requests share substantial but non-identical prefixes.


@AI_jacksaku: This week’s GitHub dark horse—Unsloth speeds up AI model training 2-5× while cutting VRAM use by 80%. What does that mean? Fine-tuning a large model used to require an A100 cluster and tens of thousands of dollars. Now one RTX 4090 can finish the job in a few hours. How? By optimizing attention compute, eliminating redundant memory copies, and adding QLoRA & Flash Attention support.

X AI KOLs Timeline · 2026-04-23

Unsloth, an open-source tool, reportedly speeds up large-model fine-tuning 2-5× and cuts VRAM use by 80%, letting a single RTX 4090 finish in a few hours what once required an A100 cluster.


@0xSero: Locally Part 1 - Apple Silicon Macs give you large pools of memory to run big models, but the token generation speed wi…

X AI KOLs Following · 2026-04-22

Apple Silicon Macs offer large memory pools for running big models, but token generation is slower; they perform best with large MoE models that have few active parameters.
