K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Hugging Face Daily Papers 06/01/26, 12:00 AM Papers

web-browsing benchmark korean agents llm-evaluation frontier-models

Summary

Introduces K-BrowseComp, a Korean web-browsing agent benchmark with 400 problems, revealing substantial performance gaps compared to English benchmarks and underscoring the need for robust Korean AI development.

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00–45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00–10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.

Original Article

Cached at: 06/02/26, 03:38 PM

Source: https://huggingface.co/papers/2606.02404 Authors:

Abstract

Korean web-browsing agent benchmark K-BrowseComp evaluates frontier LLMs’ capabilities with 400 problems, showing significant performance gaps compared to English benchmarks and highlighting the need for more robust Korean AI development.

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00–45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea’s Proprietary AI Foundation Model program obtain only 0.00–10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.
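To make the reported percentages concrete, the sketch below converts them into approximate problem counts for each split. It assumes accuracy is simply the number of correctly answered problems divided by the split size, with one attempt per problem; the abstract does not state the scoring protocol, so treat the counts as illustrative. The function and variable names are hypothetical and do not come from the released code.

```python
# Illustrative arithmetic only: assumes accuracy = correct / total with a
# single attempt per problem, which the abstract does not confirm.

VERIFIED_SIZE = 300   # K-BrowseComp-Verified subset size (from the abstract)
SYNTHETIC_SIZE = 100  # synthetic diagnostic split size (from the abstract)


def implied_correct(accuracy_pct: float, n_problems: int) -> int:
    """Nearest whole number of solved problems matching a reported accuracy."""
    return round(accuracy_pct / 100 * n_problems)


def accuracy_pct(correct: int, n_problems: int) -> float:
    """Accuracy as a percentage, rounded to two decimals as in the abstract."""
    return round(100 * correct / n_problems, 2)


# Reported accuracy figures from the abstract, paired with their split size.
reported = {
    "frontier LLMs, lowest (Verified)": (30.00, VERIFIED_SIZE),
    "frontier LLMs, highest (Verified)": (45.67, VERIFIED_SIZE),
    "Korean LLMs, lowest (Verified)": (0.00, VERIFIED_SIZE),
    "Korean LLMs, highest (Verified)": (10.33, VERIFIED_SIZE),
    "strongest model (synthetic)": (26.00, SYNTHETIC_SIZE),
}

for label, (pct, size) in reported.items():
    solved = implied_correct(pct, size)
    print(f"{label}: about {solved} of {size} problems")
```

Under that assumption, the frontier range on the Verified subset corresponds to roughly 90 to 137 of 300 problems, and the Korean LLM range to roughly 0 to 31 of 300.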

Get this paper in your agent:

hf papers read 2606.02404

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Datasets citing this paper: 1

#### prometheus-eval/k-browsecomp

Similar Articles

BrowseComp: a benchmark for browsing agents

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents

Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus
