offloading

Tag

Cards List
#offloading

Qwen 3.5 122B MoE OC on a single 3090 at 35 t/s — full local stack breakdown

Reddit r/openclaw · 30m ago

Detailed breakdown of running Qwen 3.5 122B MoE on a single RTX 3090 at 35 t/s using a custom llama.cpp fork (ik_llama.cpp) with fused MoE operations and expert offloading to CPU RAM, significantly outperforming stock llama.cpp MTP.

0 favorites 0 likes
#offloading

Performance When Offloading Large Models to System RAM?

Reddit r/LocalLLaMA · 2026-05-24

Discusses performance trade-offs of offloading large AI model weights from GPU VRAM to system RAM, comparing different GPU configurations like RTX 5090 vs RTX6000 for models like DeepSeek V4 Pro.

0 favorites 0 likes
#offloading

Seeking resources to read about llama.cpp server and how offloading works

Reddit r/LocalLLaMA · 2026-05-22

A user shares their experience with llama.cpp server's model offloading, noting performance trade-offs and quiet operation, and asks for resources to understand how the tool manages memory across VRAM and system RAM.

0 favorites 0 likes
#offloading

Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

arXiv cs.AI · 2026-05-20

This paper presents an empirical study on scheduling multiple LLMs on shared heterogeneous hardware, focusing on performance implications of CPU-GPU offloading and preemption. It finds that offloading causes non-linear decode degradation, especially for smaller models, and preemption overhead is dominated by model state reload, providing design guidance for future multi-model schedulers.

0 favorites 0 likes
← Back to home

Submit Feedback