This paper proposes a framework for evaluating LLMs' ability to generate multiple responses to a scientific query, each at a different level of language complexity. The study finds that models often vary complexity inconsistently: the best performer, Claude Sonnet 4.5, shifts complexity correctly only 46% of the time.