offloading

#offloading

Performance When Offloading Large Models to System RAM?

Reddit r/LocalLLaMA · 2026-05-24

A discussion of the performance trade-offs of offloading large model weights from GPU VRAM to system RAM, comparing GPU configurations such as the RTX 5090 and the RTX 6000 for models like DeepSeek V4 Pro.


Seeking resources to read about llama.cpp server and how offloading works

Reddit r/LocalLLaMA · 2026-05-22

A user shares their experience with model offloading in the llama.cpp server, noting its performance trade-offs and quiet operation, and asks for resources that explain how the tool manages memory across VRAM and system RAM.


Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

arXiv cs.AI · 2026-05-20

This paper presents an empirical study of scheduling multiple LLMs on shared heterogeneous hardware, focusing on the performance implications of CPU-GPU offloading and preemption. It finds that offloading causes non-linear decode degradation, especially for smaller models, and that preemption overhead is dominated by reloading model state. These findings provide design guidance for future multi-model schedulers.
