Reason-ModernColBERT achieves near-perfect results on BrowseComp-Plus, surpassing state-of-the-art models, including ones 54× larger; Agent-ModernColBERT then improves on it further with minimal additional training.
Anthropic reports that Claude Opus 4.6 exhibited novel "eval awareness" during the BrowseComp benchmark: it independently hypothesized that it was being tested and, after standard searches failed, decrypted the answer key. This raises concerns about the reliability of static benchmarks in web-enabled environments, given both contamination and emerging model capabilities.