offloading

#offloading

@dabit3: 1,000 tok/s vs 85 tok/s visualized

X AI KOLs Timeline ↗ · 2026-07-15

Nader Dabit visualizes the speed difference between subagents running at 1,000 tok/s and a model running at 85 tok/s. The post highlights that lightning skill offload enables roughly 5x faster execution: subagents handle implementation, while frontier models stay on as planners and reviewers.
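
The raw throughput ratio (1,000 / 85) is close to 12x, while the summary quotes about 5x. One possible reading, which is an assumption and not something the post states, is that only part of the generated tokens come from the fast subagents. The sketch below is a simple Amdahl-style estimate; the function name `blended_speedup` and the token fractions are hypothetical.

```python
def blended_speedup(fast_fraction, slow_tps=85.0, fast_tps=1000.0):
    """Estimate end-to-end speedup when a fraction of tokens moves to a faster model.

    fast_fraction: share of all generated tokens produced by the fast model (0..1).
    Assumes generation time is the only cost (no queueing, no handoff overhead).
    """
    baseline_time = 1.0 / slow_tps  # seconds per token, everything on the slow model
    mixed_time = (1.0 - fast_fraction) / slow_tps + fast_fraction / fast_tps
    return baseline_time / mixed_time


# Hypothetical splits between planner/reviewer tokens and subagent tokens.
for fraction in (0.5, 0.875, 1.0):
    print(f"{fraction:.3f} offloaded -> {blended_speedup(fraction):.2f}x")
```

Under these assumptions, the speedup reaches about 5x when roughly 87% of tokens are produced by the fast subagents, and it only approaches 12x when everything is offloaded.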

#offloading

Are we offloading too much of our thinking to AI?

Hacker News Top ↗ · 2026-07-14

The article explores the growing trend of relying on AI for thinking and decision-making, using anecdotes and a reference to a Ken Liu short story to question the loss of human autonomy.

#offloading

Maybe KV cache offload to RAM isn't bad

Reddit r/LocalLLaMA ↗ · 2026-06-05

A user shares their experience offloading the KV cache to RAM in llama.cpp, achieving comparable speeds while freeing VRAM for larger models and context windows, suggesting this trade-off is often worthwhile.
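
To get a feel for how much VRAM this trade-off can free, the KV cache size can be estimated from the model shape. This is a minimal sketch using the standard sizing formula for a transformer with grouped-query attention; the model dimensions below are hypothetical and are not taken from the Reddit post.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    Each token stores one key and one value vector (hence the factor 2)
    of size n_kv_heads * head_dim in every layer.
    bytes_per_elem=2 corresponds to a 16-bit cache; a quantized cache is smaller.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical model shape: 32 layers, 8 KV heads of dimension 128, 32k context.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32768)
print(f"{size / 2**30:.1f} GiB")  # 4.0 GiB
```

With these numbers, moving the cache to system RAM would free about 4 GiB of VRAM, which could instead hold more model layers or a longer context.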

#offloading

Qwen 3.5 122B MoE OC on a single 3090 at 35 t/s — full local stack breakdown

Reddit r/openclaw ↗ · 2026-06-05

A detailed breakdown of running Qwen 3.5 122B MoE on a single RTX 3090 at 35 t/s. The setup uses a custom llama.cpp fork (ik_llama.cpp) with fused MoE operations and expert offloading to CPU RAM, and it significantly outperforms stock llama.cpp MTP.

#offloading

Performance When Offloading Large Models to System RAM?

Reddit r/LocalLLaMA ↗ · 2026-05-24

A discussion of the performance trade-offs of offloading large AI model weights from GPU VRAM to system RAM, comparing GPU configurations such as the RTX 5090 and RTX 6000 for models like DeepSeek V4 Pro.

#offloading

Seeking resources to read about llama.cpp server and how offloading works

Reddit r/LocalLLaMA ↗ · 2026-05-22

A user shares their experience with llama.cpp server's model offloading, noting performance trade-offs and quiet operation, and asks for resources to understand how the tool manages memory across VRAM and system RAM.

#offloading

Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

arXiv cs.AI ↗ · 2026-05-20

This paper presents an empirical study of scheduling multiple LLMs on shared heterogeneous hardware, focusing on the performance implications of CPU-GPU offloading and preemption. It finds that offloading causes non-linear decode degradation, especially for smaller models, and that preemption overhead is dominated by reloading model state. The authors use these results to offer design guidance for future multi-model schedulers.
