Tag
This paper introduces GrowLoop, a self-evolving evaluation system for assessing human-likeness in open-ended conversations. It uses minimal human seed annotations to iteratively refine evaluation rubrics, addressing challenges of tacit knowledge, varying human agreement, and evolving model capabilities.
This paper introduces a register-aware linguistic evaluation framework to assess how human-like large language models (LLMs) are by comparing the distribution of 67 lexico-grammatical features between human and LLM-generated texts using Maximum Mean Discrepancy. Experiments across seven instruction-tuned open-source models and five registers show that no model perfectly matches human baselines, and closeness to human language varies by register rather than model size.
A new study published in PNAS shows that advanced LLMs like GPT-4.5 can pass the Turing Test, with participants finding them more human than actual humans, prompting a reevaluation of what the test measures.
A research paper finds that base language models appear human to AI detectors, unlike instruction-tuned models. The authors propose a paraphrasing pipeline (HIP) that improves human-likeness while preserving semantics across model sizes.
Introduces Persona Policies (PPol), a plug-and-play control layer that uses LLM-driven evolutionary program search to generate diverse, human-like user personas for evaluating LLM agents. Achieves 33–62% fitness gains over baseline, with human-likeness rated at 80.4%, and improves agent robustness with +17% task success.
Research paper examining how large language models express social emotions compared to human cultural norms, finding systematic misalignment where LLMs show inconsistent patterns of engaging vs. disengaging emotion expressivity across cultural personas (European American and Latin American) compared to human responses.