Tag
This paper introduces HieraRAG, a hierarchical framework for determining the optimal granularity in retrieval-augmented generation (RAG) benchmarks. It generates 5,872 synthetic QA pairs across three dimensions and finds that the ideal granularity varies by dimension, offering practitioners a portable procedure.
This paper introduces Elmes+, an automated framework for constructing fine-grained evaluation rubrics for LLMs in long-tail educational scenarios, and presents Edu-330, a benchmark covering 330 scenarios across 11 subjects. The framework uses a multi-agent engine and a self-evolving module to co-optimize evaluation criteria and test data, and its evaluation reveals differences in educational capability among top LLMs across multiple dimensions.