K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts
Summary
Introduces K-BrowseComp, a Korean web-browsing agent benchmark with 400 problems, revealing substantial performance gaps compared to English benchmarks and underscoring the need for robust Korean AI development.
Cached at: 06/02/26, 03:38 PM
Source: https://huggingface.co/papers/2606.02404
Abstract
Korean web-browsing agent benchmark K-BrowseComp evaluates frontier LLMs’ capabilities with 400 problems, showing significant performance gaps compared to English benchmarks and highlighting the need for more robust Korean AI development.
Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00–45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea’s Proprietary AI Foundation Model program obtain only 0.00–10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.
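To put the reported percentages in concrete terms, the sketch below converts them into implied counts of correctly answered problems. It assumes each score is plain accuracy (correct answers divided by split size) from a single evaluation pass, which the abstract does not state explicitly; the function names are illustrative and do not come from the released code.

```python
# Split sizes as stated in the abstract.
VERIFIED_SIZE = 300   # K-BrowseComp-Verified
SYNTHETIC_SIZE = 100  # synthetic diagnostic split


def implied_correct(accuracy_pct: float, n_problems: int) -> int:
    """Number of correct answers implied by an accuracy percentage.

    Assumption: the score is correct / n_problems from one evaluation pass.
    """
    return round(accuracy_pct / 100 * n_problems)


def accuracy_pct(correct: int, n_problems: int) -> float:
    """Accuracy as a percentage, rounded to two decimals as in the abstract."""
    return round(100 * correct / n_problems, 2)


# Reported endpoints from the abstract, paired with their split size.
reported = {
    "frontier, low end (Verified)": (30.00, VERIFIED_SIZE),
    "frontier, high end (Verified)": (45.67, VERIFIED_SIZE),
    "Korean LLMs, high end (Verified)": (10.33, VERIFIED_SIZE),
    "strongest model (synthetic)": (26.00, SYNTHETIC_SIZE),
}

for label, (pct, size) in reported.items():
    n = implied_correct(pct, size)
    # Round-trip check: the implied count reproduces the reported score.
    assert accuracy_pct(n, size) == pct
    print(f"{label}: about {n} of {size} problems")
```

Under that assumption, the best frontier score corresponds to roughly 137 of 300 verified problems, and the best result among the Korean LLMs to roughly 31 of 300.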
Datasets citing this paper: prometheus-eval/k-browsecomp
Similar Articles
BrowseComp: a benchmark for browsing agents
OpenAI released BrowseComp, a benchmark of 1,266 challenging problems designed to measure AI agents' ability to locate hard-to-find information across the internet, available in its simple-evals GitHub repository.
EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge
This paper introduces EvoBrowseComp, a dynamic benchmark of 400 English and 400 Chinese complex questions that are synthesized via live-web traversal to evaluate search agents without test-set contamination, ensuring robustness against parametric memorization.
EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge
EvoBrowseComp is an evolving benchmark with 800 contamination-free questions for evaluating search agents, designed to prevent parametric memorization and maintain temporal freshness through a three-agent framework.
KoALa-Bench: Evaluating Large Audio Language Models on Korean Speech Understanding and Faithfulness
KoALa-Bench introduces a Korean-focused benchmark suite for evaluating large audio language models on six tasks, including novel measures of speech faithfulness and Korea-specific cultural content.
KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context
KMMMU is a native Korean benchmark for evaluating multimodal understanding with 3,466 questions across nine disciplines and visual modality categories, addressing the gap of English-centric benchmarks by testing performance on Korean-specific cultural and institutional contexts.