Chamath explains the two key phases of AI inference: prefill, which is compute-bound and favors highly parallel GPUs like Nvidia's, and decode, which is memory-bandwidth-bound because generating each new token requires re-reading the cached state of all previously generated tokens.
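The compute-bound versus memory-bandwidth-bound distinction can be illustrated with a rough roofline-style estimate. This is a minimal sketch, not anything taken from the talk: the function names, the 2-FLOPs-per-parameter-per-token rule of thumb, and every model and hardware number below are assumptions chosen only for illustration, and attention FLOPs are ignored.

```python
# Roofline-style sketch: compare FLOPs per byte moved against the hardware's
# ratio of peak compute to peak memory bandwidth.
# All names and numbers here are hypothetical placeholders.

def arithmetic_intensity(tokens_per_pass, n_params, bytes_per_param,
                         kv_cache_bytes=0.0):
    """FLOPs performed per byte read in one forward pass.

    Assumes roughly 2 FLOPs per parameter per token, that the weights are
    read once per pass, and that the whole KV cache is read once per pass.
    """
    flops = 2.0 * n_params * tokens_per_pass
    bytes_read = n_params * bytes_per_param + kv_cache_bytes
    return flops / bytes_read


def bound(tokens_per_pass, n_params, bytes_per_param,
          peak_flops, peak_bandwidth, kv_cache_bytes=0.0):
    """Classify a pass as compute-bound or memory-bandwidth-bound."""
    ridge = peak_flops / peak_bandwidth  # FLOPs/byte where the two limits meet
    intensity = arithmetic_intensity(
        tokens_per_pass, n_params, bytes_per_param, kv_cache_bytes
    )
    return "compute-bound" if intensity >= ridge else "memory-bandwidth-bound"


# Hypothetical model and accelerator.
N_PARAMS = 7e9          # parameters
BYTES_PER_PARAM = 2.0   # 16-bit weights
PEAK_FLOPS = 1e15       # FLOP/s
PEAK_BANDWIDTH = 3e12   # bytes/s

# Prefill: the whole prompt (here 4096 tokens) goes through in one pass.
prefill = bound(4096, N_PARAMS, BYTES_PER_PARAM, PEAK_FLOPS, PEAK_BANDWIDTH)

# Decode: one new token per pass, plus a read of the cache built so far.
decode = bound(1, N_PARAMS, BYTES_PER_PARAM, PEAK_FLOPS, PEAK_BANDWIDTH,
               kv_cache_bytes=4e9)

print("prefill:", prefill)
print("decode: ", decode)
```

Under these assumed numbers, prefill reuses each weight across thousands of tokens and lands above the ridge point, while decode does only a few FLOPs per byte read and lands far below it.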
Turbo3 delivers 5% higher decode throughput (tokens per second) than fp8 at 32K context, a gain attributed to quantization or model optimization.
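For reference, a "5% faster" figure of this kind is a relative change in tokens per second against the baseline. The sketch below shows that arithmetic only; the helper name and both throughput values are hypothetical and are not measurements from the source.

```python
def relative_speedup_percent(baseline_tps, candidate_tps):
    """Percent change in throughput of the candidate relative to the baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (candidate_tps / baseline_tps - 1.0) * 100.0


# Hypothetical decode throughputs in tokens per second (illustrative only).
fp8_tps = 100.0
turbo3_tps = 105.0

speedup = relative_speedup_percent(fp8_tps, turbo3_tps)
print(f"{speedup:.1f}% faster")
```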