Tag
Highlights OpenAI researcher Noam Brown's argument that the true ceiling of LLM capabilities is far higher than current benchmarks show, because evaluations are run with too little test-time compute, and that stronger models benefit more from additional computation. This poses a serious challenge for AI safety evaluation, since many dangerous capabilities may emerge only when a model is given long run times and large compute budgets.
A free Stanford lecture by Percy Liang on AI generalization explains why models excel on benchmarks but fail on real codebases, covering benchmark memorization, the bias-variance tradeoff, and hallucination.