LLMEval-Logic is a new Chinese benchmark for evaluating logical reasoning in LLMs, featuring solver-verified answers and adversarial hardening. It reveals significant gaps in current models: the best-performing model reaches only 37.5% accuracy on hard items.
Researchers introduce HoWToBench, a large-scale Chinese writing benchmark with 1,302 instructions across 12 genres, and Tree-of-Writing (ToW), a tree-structured evaluation method. ToW achieves a Pearson correlation of 0.93 with human judgments while mitigating biases in LLM writing assessment.