statistical-test

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

arXiv cs.CL · yesterday

UnpredictaBench is a benchmark for evaluating how well large language models can sample from target distributions, including statistical and natural-language random processes. Experiments show that current models struggle to capture true underlying distributions, with no model exceeding 40% on the KS@100 metric.
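The summary does not define KS@100. As a rough illustration only, the sketch below assumes it involves a one-sample Kolmogorov-Smirnov test applied to 100 samples drawn from a model and compared against a target distribution. The function names, the Uniform(0, 1) target, and the 5% significance level are all hypothetical choices and are not taken from the paper.

```python
import math

def ks_statistic(samples, cdf):
    """One-sample KS statistic: largest gap between the empirical CDF and the target CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides of the jump.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def uniform_cdf(x):
    """CDF of Uniform(0, 1), used here as a stand-in target distribution (assumption)."""
    return min(1.0, max(0.0, x))

def passes_ks(samples, cdf, coeff=1.358):
    """Accept if the statistic is below the asymptotic critical value.

    coeff = 1.358 corresponds to a 5% significance level (assumed, not from the paper).
    """
    n = len(samples)
    return ks_statistic(samples, cdf) <= coeff / math.sqrt(n)

# Hypothetical "model outputs": 100 evenly spread values vs. the same value repeated.
well_spread = [(i + 0.5) / 100 for i in range(100)]
collapsed = [0.5] * 100

print(passes_ks(well_spread, uniform_cdf))
print(passes_ks(collapsed, uniform_cdf))
```

Under these assumptions, a model that keeps returning the same "random" value yields a large KS statistic and fails the test, while evenly spread outputs pass. A pass rate over many target distributions would then give a percentage score of the kind reported above, though the benchmark's actual scoring may differ.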
