chinese-benchmark

#chinese-benchmark

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

arXiv cs.CL · 2026-05-20

LLMEval-Logic is a new Chinese benchmark for evaluating logical reasoning in LLMs, featuring solver-verified answers and adversarial hardening. The benchmark reveals significant gaps in current models: the best model reaches only 37.5% accuracy on hard items.

#chinese-benchmark

HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing

arXiv cs.CL · 2026-04-22

Researchers introduce HoWToBench, a large-scale Chinese writing benchmark with 1,302 instructions across 12 genres, and Tree-of-Writing (ToW), a tree-structured evaluation method that achieves 0.93 Pearson correlation with human judgments while mitigating biases in LLM writing assessment.
