Tag
A user expresses disappointment with GPT-5.6, claiming it is not better than GLM-5.2.
GLM-5.2 marks a significant milestone for open-weight models, demonstrating strong context retention across long multi-step tasks and more reliable tool calling.
Discussion of recent AI model scores on the Humanity's Last Exam benchmark, noting the improvement from GPT-4o's 2.7% in May 2024 to around 45% by June 2026 and questioning the exam's difficulty.
Opus 4.8 Thinking continues to deteriorate on the Hard Prompts English benchmark on LMArena, scoring 23 points lower than Opus 4.6 Thinking, which retains the top spot.
Discussion of the performance trade-offs of offloading large AI model weights from GPU VRAM to system RAM, comparing GPU configurations such as the RTX 5090 and RTX 6000 for models like DeepSeek V4 Pro.
swyx reflects on Sam Altman's idea of building businesses that improve as AI models improve, linking it to the emerging concept of Agent Labs, and notes a clear correlation with revenue spikes in Q4 2025.
Benchmark results for the Gemini 3.5 Flash model are discussed, likely showcasing its performance across various AI tasks.
A user reports that switching from the highly compressed IQ4_XS quant to the larger IQ4_NL_XL quant of Qwen 3.6 dramatically improves agentic-coding accuracy despite lower tok/s, and urges others to favor bigger quants when VRAM allows.