@rohanpaul_ai: New Microsoft + York Univ paper argues that LLMs should not be treated as human-like without clear tests and narrower c…


Summary

A Microsoft and York University paper argues that ascribing human-like attributes to LLMs is problematic because the experiments behind such claims are often flawed by design. It uses Age of Empires II as an analogy to expose the measurement problem.


Cached at: 06/20/26, 10:23 PM

New Microsoft + York Univ paper argues that LLMs should not be treated as human-like without clear tests and narrower claims.

Many studies ask whether LLMs have things like understanding, empathy, anxiety, or self-awareness, but they often build those ideas into the test from the start.

The paper shows that, in principle, the old strategy game Age of Empires II can implement logic gates, train a tiny perceptron, and serve as a substrate for computation.

If the same language model could be rebuilt inside a game, with goats moving around as bits, would we still say it “understands,” “feels anxiety,” or “has empathy” when it produces the same sentence?

The point is not that the game is secretly intelligent, but that the same computation can be represented in a very different form.
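
To make the "same computation, different form" point concrete, here is a minimal Python sketch. It is not the paper's in-game construction: the goat-pen encoding and every name in it (`nand_goats`, `to_goat`, `from_goat`, `train_perceptron`) are hypothetical and chosen only for illustration. It builds XOR from NAND gates on two substrates (integers, and pens that hold a goat or sit empty), then trains a tiny perceptron on AND using the classic perceptron rule.

```python
from itertools import product

# Substrate 1: ordinary integers as bits.
def nand_int(a: int, b: int) -> int:
    return 1 - (a & b)

# Substrate 2 (hypothetical encoding): a bit is a pen that holds a goat or is empty.
def nand_goats(a: str, b: str) -> str:
    return "empty" if (a == "goat" and b == "goat") else "goat"

def xor_from_nand(nand, a, b):
    """XOR wired from four NAND gates; works with any substrate's NAND."""
    n1 = nand(a, b)
    return nand(nand(a, n1), nand(b, n1))

def to_goat(bit: int) -> str:
    return "goat" if bit else "empty"

def from_goat(pen: str) -> int:
    return 1 if pen == "goat" else 0

def train_perceptron(samples, epochs=20):
    """Perceptron learning rule with integer weights and a step activation."""
    w1 = w2 = bias = 0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            pred = 1 if w1 * x1 + w2 * x2 + bias > 0 else 0
            err = target - pred
            w1 += err * x1
            w2 += err * x2
            bias += err
    return w1, w2, bias

inputs = list(product([0, 1], repeat=2))

# The same XOR circuit evaluated on both substrates.
xor_int = [xor_from_nand(nand_int, a, b) for a, b in inputs]
xor_goat = [
    from_goat(xor_from_nand(nand_goats, to_goat(a), to_goat(b)))
    for a, b in inputs
]

# A tiny perceptron learning AND, which is linearly separable.
and_samples = [((a, b), a & b) for a, b in inputs]
w1, w2, bias = train_perceptron(and_samples)
and_learned = [1 if w1 * a + w2 * b + bias > 0 else 0 for a, b in inputs]

print("substrates agree on XOR:", xor_int == xor_goat)
print("perceptron reproduces AND:", and_learned == [a & b for a, b in inputs])
```

The two XOR truth tables match even though one run never touches a number, which is the sense in which the output alone does not reveal the substrate.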

If an LLM-like system were rebuilt inside that game, its answers might stay similar, but people would probably find its “feelings” or “understanding” much less convincing.

The authors argue that this shows a big measurement problem: many human-like claims about LLMs may depend on the interface and the observer, not only on the system itself.

The paper is not saying LLMs definitely lack human-like attributes, or that all talk of AI cognition is nonsense.

It is saying that many experiments smuggle the conclusion into the setup: they assume the model has, or cannot have, a human-like property, then interpret behavior through that assumption.


Link – arxiv.org/abs/2605.31514

Title: “If LLMs Have Human-Like Attributes, Then So Does Age of Empires II”

Similar Articles

If LLMs Have Human-Like Attributes, Then So Does Age of Empires II

Reddit r/ArtificialInteligence

This paper argues that ascribing human-like attributes to large language models is problematic because similar claims could be made about simpler systems, such as an AI trained on Age of Empires II. It proposes a null assumption of non-uniqueness to avoid circular reasoning.

Evaluating LLMs as Human Surrogates in Controlled Experiments

arXiv cs.CL

This paper evaluates whether off-the-shelf LLMs can reliably simulate human responses in controlled behavioral experiments by comparing LLM-generated data with human survey responses on accuracy perception. The findings show that while LLMs capture directional effects and aggregate belief-updating patterns, they do not consistently match human-scale effect magnitudes, clarifying when synthetic LLM data can serve as behavioral proxies.