Together AI open-sources OSCAR, an attention-aware 2-bit KV cache quantization system that enables efficient long-context LLM serving by redistributing quantization error according to attention importance.
OSCAR is an offline, spectral, covariance-aware rotation method for 2-bit KV cache quantization: it aligns quantization with the covariance structure of attention, preserving accuracy while keeping long-context LLM serving efficient.
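
To make the general idea of rotating before quantizing concrete, here is a minimal NumPy sketch. It is not OSCAR's implementation, and the exact construction OSCAR uses is not described in this summary. The sketch assumes that the rotation comes from the eigenvectors of a query covariance matrix estimated offline on calibration data, and that keys are quantized per token with a plain 4-level uniform quantizer. The names `calibrate_rotation`, `quantize_2bit`, and `dequantize`, and the synthetic data, are all hypothetical.

```python
import numpy as np

def calibrate_rotation(calib_queries):
    """Offline step (assumed): eigenvectors of the query covariance give an orthogonal rotation."""
    cov = calib_queries.T @ calib_queries / calib_queries.shape[0]
    _, eigvecs = np.linalg.eigh(cov)  # symmetric input, so eigenvectors are orthonormal
    return eigvecs

def quantize_2bit(x):
    """Per-row asymmetric uniform quantization to 4 levels (codes 0..3)."""
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 3.0  # 3 steps between 4 levels
    codes = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Map 2-bit codes back to floating-point values."""
    return codes.astype(np.float64) * scale + lo

# Synthetic, correlated stand-ins for calibration queries and cached keys.
rng = np.random.default_rng(0)
d = 16
mix = rng.normal(size=(d, d))
calib_q = rng.normal(size=(512, d)) @ mix
keys = rng.normal(size=(64, d)) @ mix

R = calibrate_rotation(calib_q)             # computed once, offline
codes, scale, lo = quantize_2bit(keys @ R)  # quantize keys in the rotated basis
keys_hat = dequantize(codes, scale, lo) @ R.T  # rotate back for comparison
```

Because `R` is orthogonal, attention scores are unchanged by the rotation itself (`(q @ R) @ (k @ R).T` equals `q @ k.T`), so the rotation only changes where the 2-bit rounding error lands. Whether that redistribution helps depends on the data and on how the rotation is chosen, which is the part a method like OSCAR is designed to address.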