Tag
This paper investigates whether Large Language Models exhibit the same usage-based linguistic productivity constraints (entrenchment and preemption) as humans, finding that models can reproduce coercion but fail to apply statistical preemption to avoid overgeneralization.
This paper presents an empirical study of scheduling multiple LLMs on shared heterogeneous hardware, focusing on the performance implications of CPU-GPU offloading and preemption. It finds that offloading causes non-linear decode degradation, especially for smaller models, and that preemption overhead is dominated by model state reload. These findings provide design guidance for future multi-model schedulers.