Human Adults and LLMs as Scientists: Who Benefits from Active Exploration?
Summary
This study examines whether active exploration helps adults overcome the 'conjunctive handicap' in causal reasoning, comparing human performance to LLMs in a blicket detector task. Results show that active exploration improves conjunctive reasoning in adults, though some gaps remain, and LLMs approach human accuracy but explore less efficiently.
# Human Adults and LLMs as Scientists: Who Benefits from Active Exploration?
Source: [https://arxiv.org/html/2606.06464](https://arxiv.org/html/2606.06464)
Mandana Samiei \(1,2,∗\), Eunice Yiu \(3,∗\), Anthony GX\-Chen \(4\), Dongyan Lin \(5\), Jocelyn Shen \(6\), Blake A\. Richards \(1,2,7\), Alison Gopnik \(3\), Doina Precup \(1,2\)

\(1\) Mila \- Quebec AI Institute; \(2\) McGill University; \(3\) University of California Berkeley; \(4\) New York University; \(5\) Meta FAIR; \(6\) MIT Media Lab; \(7\) Montreal Neurological Institute

∗ Equal contribution\. Correspondence: mandana\.samiei@mail\.mcgill\.ca, ey242@berkeley\.edu
###### Abstract
A long\-standing finding in the causal learning literature is that adults struggle to identify conjunctive causal rules, where an effect requires the simultaneous presence of multiple causes, while performing better in disjunctive settings\. However, most demonstrations of this “conjunctive handicap” rely on passive observation paradigms with limited evidence, where learners have no control over evidence generation\. This paper asks whether this bias persists when adults are granted agency through active exploration\. Using a modified “blicket detector” task, adult participants freely intervened to identify causal objects under conjunctive or disjunctive rule structures\. We show that active exploration substantially improves adults’ conjunctive causal reasoning, although conjunctive rules still require more tests to infer than disjunctive rules\. We further compare human performance to a range of large language models in the same setting\. While some state\-of\-the\-art models approach human\-level performance on hypothesis inference accuracy, they often exhibit less efficient exploration strategies and similar conjunctive\-disjunctive performance gaps\.
Keywords:Causal Learning; Active Learning; Information Gain; Intervention; Cognitive Development; Language Models
*Accepted at the 48th Annual Conference of the Cognitive Science Society \(CogSci 2026\)\.*

## Introduction
Understanding how agents infer causal structure is a central problem in cognitive science\. Causal learning supports prediction, explanation, intervention, and scientific discovery\[[14](https://arxiv.org/html/2606.06464#bib.bib4)\]\. Developmental work shows that even young children can infer latent causal variables, reason about unobserved mechanisms, and distinguish correlation from causation from limited evidence\[[4](https://arxiv.org/html/2606.06464#bib.bib1),[12](https://arxiv.org/html/2606.06464#bib.bib18),[9](https://arxiv.org/html/2606.06464#bib.bib3)\]\.
Much of this work has relied on the “blicket detector” paradigm\[[4](https://arxiv.org/html/2606.06464#bib.bib1)\], in which learners observe a machine that activates when certain objects, or combinations of objects, are placed on it\. A striking and perhaps counterintuitive finding from this literature is that children sometimes outperform adults in learning abstract causal structure\. Adults often default to “disjunctive” \(OR\) rules, whereas preschool\-aged children readily infer “conjunctive” \(AND\) rules when the training evidence supports them\[[7](https://arxiv.org/html/2606.06464#bib.bib2)\]\. This adult “conjunctive handicap” has been observed across cultures and socioeconomic contexts\[[13](https://arxiv.org/html/2606.06464#bib.bib6)\]\.
However, these demonstrations of adult failure in conjunctive causal reasoning have exclusively relied on *passive* learning paradigms, where learners observe a fixed sequence of evidence without control over which interventions are performed\. This is important because causal learning is shaped by intervention and exploration\. Children learn more effectively when they can design their own tests, and adults also learn worse when interventions are imposed rather than self\-directed \[[9](https://arxiv.org/html/2606.06464#bib.bib3),[10](https://arxiv.org/html/2606.06464#bib.bib8),[11](https://arxiv.org/html/2606.06464#bib.bib9)\]\.
These findings raise a key question: Are adults actually poor conjunctive causal learners, or do they struggle because passive evidence presentation prevents them from generating and evaluating the right evidence to update their priors?
Figure 1: Test structure in the Active Exploration vs\. Passive Observation conditions\. In the active exploration condition, participants can click to add or remove four individual objects from the Nexiom machine, and explicitly test the current combination to observe whether the machine switches ON or OFF\. In the passive observation condition, each participant is paired with an active exploration participant \(yoked control\)\. Passive participants do not perform any actions on their own, but rather observe the actions and test outcomes of their paired active exploration participant\. Participants in both conditions are then asked to classify which object\(s\) are “Nexioms” and the rule under which the objects operate \(conjunctive or disjunctive\) to activate the machine\.

We address this question using a novel “nexiom detector” task, a blicket\-style causal learning paradigm designed to minimize familiar task\-specific priors\. Adult participants could test any object or combination of objects, and decide when they had enough evidence to infer both the causal objects and the underlying rule, allowing us to compare active and passive causal learning\.
We find that active exploration substantially changes adult performance\. In contrast to prior passive studies, active adults show strong performance on conjunctive causal rules and generate relatively small but highly informative sets of tests\. This suggests that adult failures in conjunctive causal reasoning \[[7](https://arxiv.org/html/2606.06464#bib.bib2)\] reflect constraints of passive evidence presentation, not a fixed limitation in causal competence\.
We further show that merely generating informative tests is not enough\. Participants who proposed interventions but observed outcomes from someone else’s tests performed no better than the passive learners\. This suggests that active exploration helps most when intervention choices are tightly coupled to their own contingent outcomes\.
Finally, we situate these findings in a broader computational context by comparing adults to large language models \(LLMs\) on the same causal discovery tasks\. LLMs are increasingly evaluated as general\-purpose reasoning agents capable of hypothesis generation and experimentation\[[8](https://arxiv.org/html/2606.06464#bib.bib10),[16](https://arxiv.org/html/2606.06464#bib.bib11)\], and recent work has placed them in blicket\-style environments where they select interventions to observe outcomes\[[5](https://arxiv.org/html/2606.06464#bib.bib12)\]\. This makes them a natural comparison case for examining whether active intervention is sufficient for successful causal discovery\. We find that LLMs do not consistently gain from choosing their own interventions and still lag behind the top\-performing human explorers on conjunctive rules\.
Together, these results suggest that successful causal discovery depends critically on maintaining a tight coupling between self\-generated interventions and their contingent outcomes\. Active intervention alone is insufficient: effective causal learning requires adaptive search strategies and progressive hypothesis pruning, particularly in conjunctive environments\.
## Methods
### The Nexiom Detector Software\.
We developed a custom web\-based platform, Nexiom Text Adventure \([https://nexiom\-text\-game\.streamlit\.app/](https://nexiom-text-game.streamlit.app/)\), to study active causal reasoning in human adults\. The task is functionally equivalent to the classic blicket detector paradigm, but uses the novel term “nexiom” to minimize prior knowledge from the established blicket literature\. The platform was implemented in Streamlit and supports structured experimental sessions involving text\-based scenarios with interactive object selection and explicit causal testing\. It records detailed behavioral data, including object selections, test sequences, test outcomes, response times, object identification accuracy, and rule inference judgments\. Each session consists of a comprehension phase followed by a main test phase\.
Experiments\. Participants were assigned to either the “Active Exploration” or the “Passive Observation” condition\. Active Exploration participants engaged in unrestricted causal exploration to identify both the causal objects and the underlying causal rule\. They could freely select objects, test arbitrary combinations, and decide when to stop exploring, whereas Passive Observation participants could only observe the actions and test outcomes of a paired Active Exploration participant \(yoked control\) to learn about the underlying causal hypothesis of the machine\. After exploration, participants were first asked to identify which objects they believed were “nexioms” and could switch on the machine\. Next, they were explicitly informed of two possible rule types \(conjunctive vs\. disjunctive\), and were asked to select which rule they believed governed the machine\. Figure [1](https://arxiv.org/html/2606.06464#Sx1.F1) illustrates the experimental flow\. We used the Passive Observation condition to replicate the ambiguous evidence paradigm used in prior work on conjunctive causal learning in adults \[[7](https://arxiv.org/html/2606.06464#bib.bib2),[3](https://arxiv.org/html/2606.06464#bib.bib14)\] \(see Figure [2](https://arxiv.org/html/2606.06464#Sx2.F2)\)\.

We additionally conducted a supplementary “Passive Proposer” experiment to isolate whether the benefits of active exploration arise from intervention planning itself or from receiving contingent feedback from one’s own interventions\. Passive Proposers were allowed to actively generate hypotheses and propose interventions, but did not directly observe the outcomes of their own proposed tests\. Instead, they received the outcomes generated by a matched Active Exploration participant\.

Finally, we extended the same active exploration framework to several large language models\. Models were allowed to sequentially select interventions by adding or removing objects, testing the resulting combination, and observing machine outcomes\. In principle, this gives LLMs access to the same action space as human learners\. Model evaluations used the same conjunctive and disjunctive causal structures as the human experiments, enabling direct comparison of exploration behavior, object identification, and rule inference across agents\.
Figure 2: Rule inference accuracy across age groups\. Performance of children and adults when given ambiguous passive\-observation data in a prior study \[[3](https://arxiv.org/html/2606.06464#bib.bib14)\] \(left of dashed line\) is compared with adults’ performance in the current study \(right of dashed line\)\.
### Data\.
A total of 306 adult participants \(age: 22–35 years, $M_{\mathrm{age}} = 30.41$, $SD_{\mathrm{age}} = 3.94$; 153 female, 153 male\) were recruited on Prolific \(https://www\.prolific\.com\) to complete an online experiment in accordance with IRB\-approved protocols\. Of these, 102 were assigned to the Active Exploration condition, 102 to the Passive Observation condition, and 102 to the Passive Proposer condition\. Participants in both passive conditions were randomly paired with a unique Active Exploration participant to create a passive yoked\-control setup\. Only participants who successfully completed the experiment were included in the analyses\. In each condition, participants were randomly assigned to one of two between\-subjects causal rules in the test phase: $N = 51$ were assigned to the conjunctive condition and another $N = 51$ to the disjunctive condition\. Throughout the paper, “top human” refers to the highest\-performing human participants within each condition, defined as participants achieving perfect full\-hypothesis inference accuracy\.

To compare human causal exploration with artificial agents under matched interaction settings, we conducted experiments with language models using 24 trials per rule type \(conjunctive / disjunctive\), resulting in 48 total trials per model\. The 24 trials consisted of 6 unique causal configurations, corresponding to all possible combinations of 2 blickets selected from 4 objects, with each configuration evaluated four times\. We tested six language models spanning both reasoning\-oriented and non\-reasoning agents: gpt\-5, gpt\-5\-mini, gemini\-2\.5\-flash, deepseek\-reasoner, o4\-mini, and deepseek\-chat\. Across all models and rule conditions, this resulted in a total of 288 LLM evaluation trials \(using temperature $= 0.0$\)\.
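The LLM trial counts above can be checked with a short enumeration\. This is a minimal sketch, not the authors’ evaluation code: the object labels 1 to 4, the record layout, and the variable names are assumptions made for illustration\.

```python
from itertools import combinations, product

# Assumed labels: the paper does not name the four objects.
OBJECTS = (1, 2, 3, 4)
RULES = ("conjunctive", "disjunctive")
MODELS = (
    "gpt-5", "gpt-5-mini", "gemini-2.5-flash",
    "deepseek-reasoner", "o4-mini", "deepseek-chat",
)
REPEATS = 4  # each configuration is evaluated four times

# All ways to pick 2 causal objects out of 4: C(4, 2) = 6 configurations.
configurations = list(combinations(OBJECTS, 2))

# One record per (model, rule, configuration, repeat); layout is hypothetical.
trials = [
    {"model": m, "rule": r, "nexioms": c, "repeat": k}
    for m, r, c, k in product(MODELS, RULES, configurations, range(REPEATS))
]

trials_per_rule = len(configurations) * REPEATS
trials_per_model = trials_per_rule * len(RULES)

print(len(configurations), trials_per_rule, trials_per_model, len(trials))
# prints: 6 24 48 288
```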
### Evaluation Metrics\.
We used three binary accuracy measures \(scored 0 or 1\) to evaluate participants’ causal understanding\. Four additional participants with prior experience were excluded from all analyses\. *Object identification accuracy* is defined as the proportion of participants who selected the exact set of causal objects \(“nexioms”\)\. A trial was scored as correct only if the participant’s selected set matched the ground\-truth nexioms exactly \(e\.g\., if the correct nexioms were \[2,4\], selecting any other combination was scored as incorrect\)\. *Rule inference accuracy* is defined as the proportion of participants who correctly identified the underlying causal rule \(conjunctive vs\. disjunctive\)\. *Full hypothesis accuracy* required both correct object identification and correct rule inference\. For each metric and rule condition, we report mean accuracy and standard error \(see Figure [3](https://arxiv.org/html/2606.06464#Sx2.F3)\)\.

We additionally report exploration\-process measures derived from participants’ test sequences\. *Cumulative information gain* quantifies the reduction in uncertainty \(in bits\) accumulated across tests\. Before each test, we compute the number of hypotheses consistent with prior evidence; after observing the machine’s response \(on/off\), we recompute the number of remaining consistent hypotheses:

$$\text{InfoGain}_{t} = \log_{2}(N_{t-1}) - \log_{2}(N_{t}) = \log_{2}\left(\frac{N_{t-1}}{N_{t}}\right),$$

where $N_{t-1}$ is the number of hypotheses consistent before test $t$, and $N_{t}$ is the number of hypotheses consistent after test $t$\. Cumulative information gain up to test $k$ is defined as

$$\text{CumInfoGain}_{k} = \sum_{t=1}^{k} \text{InfoGain}_{t}.$$

*Number of hypotheses remaining* tracks how many full hypotheses are still consistent with the test outcomes after each test\. The full hypothesis space comprises all pairs of \(Nexiom set, rule\)\. We further compute the average number of tests conducted and the time spent per test across trials\. These measures are reported separately for conjunctive and disjunctive conditions\. The statistics presented in Tables [1](https://arxiv.org/html/2606.06464#Sx3.T1) and [2](https://arxiv.org/html/2606.06464#Sx3.T2) do not consider the time spent in the comprehension phase\.
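The information\-gain measures can be sketched in a few lines\. The code below is an illustration under stated assumptions, not the authors’ analysis pipeline: the object labels, the function names, and in particular the activation semantics in `machine_on` \(a conjunctive machine needs every nexiom present, a disjunctive machine needs at least one, and an empty nexiom set never activates\) are assumptions\.

```python
from itertools import combinations
from math import log2

OBJECTS = (1, 2, 3, 4)  # assumed labels for the four objects
RULES = ("conjunctive", "disjunctive")


def all_hypotheses(objects=OBJECTS):
    """Enumerate every (nexiom set, rule) pair: 2**N subsets times 2 rules."""
    subsets = [
        frozenset(c)
        for size in range(len(objects) + 1)
        for c in combinations(objects, size)
    ]
    return [(s, rule) for s in subsets for rule in RULES]


def machine_on(tested, nexioms, rule):
    """Assumed activation semantics (an illustration, not the paper's code)."""
    tested = frozenset(tested)
    if not nexioms:  # assumption: with no nexioms the machine never turns on
        return False
    if rule == "conjunctive":
        return nexioms <= tested  # every nexiom must be on the machine
    return bool(nexioms & tested)  # at least one nexiom is on the machine


def info_gain_trace(tests, hypotheses=None):
    """Return per-test gains, cumulative gains, and surviving hypotheses.

    tests: sequence of (tested objects, observed on/off outcome) pairs.
    """
    remaining = list(all_hypotheses() if hypotheses is None else hypotheses)
    gains, cumulative = [], []
    for tested, outcome in tests:
        n_before = len(remaining)
        remaining = [h for h in remaining if machine_on(tested, *h) == outcome]
        if not remaining:
            raise ValueError("outcomes are inconsistent with every hypothesis")
        gains.append(log2(n_before) - log2(len(remaining)))
        cumulative.append(sum(gains))
    return gains, cumulative, remaining


# Hypothetical test sequence for a machine whose true rule is "2 AND 4".
tests = [({2}, False), ({2, 4}, True)]
gains, cumulative, remaining = info_gain_trace(tests)
print(len(all_hypotheses()), len(remaining), round(cumulative[-1], 3))
```

Under these assumptions the two tests leave 6 of the 32 hypotheses, so the cumulative gain is $\log_2(32/6) \approx 2.42$ bits\. Because the per\-test gains telescope, the cumulative value depends only on how many hypotheses remain, not on the order in which they were eliminated\.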
### Raw Data Availability\.
Behavioral data and analysis\-ready summary files are available on[the Open Science Framework \(OSF\)](https://osf.io/fsvqb/overview?view_only=55a429715ee24f1bbccfb4d30bb7fe05)\. The repository includes raw interaction logs and processed summary data for the analyses here\.
Figure 3: Performance accuracy in conjunctive vs\. disjunctive causal inference with active exploration evidence\. Error bars are ± standard error of the mean across participants\.
## Results
Beyond overall accuracy, our results show that neither human participants nor language models interact with the Nexiom machine in a random or unguided manner\. Instead, they engage in systematic exploration designed to disambiguate competing causal solutions\. The hypothesis space initially consists of 32 possible explanations \(for $N$ items and two rule types, conjunctive vs\. disjunctive, the total hypothesis space is $2^{N+1}$\)\. Across both conjunctive and disjunctive conditions, participants and stronger models systematically reduced this space through targeted interventions, although conjunctive conditions required substantially longer exploration trajectories to achieve comparable reductions in uncertainty\. As shown in Figures [4](https://arxiv.org/html/2606.06464#Sx3.F4), [6](https://arxiv.org/html/2606.06464#Sx3.F6), and [7](https://arxiv.org/html/2606.06464#Sx3.F7), successful causal discovery was associated not only with higher final accuracy, but also with more efficient hypothesis pruning, greater cumulative information gain, and more adaptive sequential search strategies over the course of exploration\.
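As a worked check of the count of 32, and of the ceiling it places on cumulative information gain, the following uses only the definitions from the Evaluation Metrics section\. It assumes, as the count of 32 implies, that every subset of the four objects is a candidate nexiom set; $|\mathcal{H}|$ is notation introduced here for the size of the hypothesis space, $N$ is the number of objects, and $N_0$, $N_k$ are the numbers of consistent hypotheses before any test and after test $k$\.

```latex
\begin{align*}
|\mathcal{H}| &= \underbrace{2^{N}}_{\text{nexiom sets}} \times \underbrace{2}_{\text{rules}} = 2^{N+1},
  \qquad N = 4 \;\Rightarrow\; |\mathcal{H}| = 2^{5} = 32, \\[4pt]
\text{CumInfoGain}_{k} &= \sum_{t=1}^{k} \log_{2}\!\left(\frac{N_{t-1}}{N_{t}}\right)
  = \log_{2}\!\left(\frac{N_{0}}{N_{k}}\right)
  \le \log_{2}\!\left(\frac{32}{1}\right) = 5 \text{ bits}.
\end{align*}
```

The sum telescopes, so with $N_0 = 32$ and $N_k \ge 1$ no exploration trajectory can accumulate more than 5 bits, whichever rule governs the machine\.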
### Exploration Strategies, Passive vs\. Active Learners\.
Several recent developmental works \[[6](https://arxiv.org/html/2606.06464#bib.bib19),[3](https://arxiv.org/html/2606.06464#bib.bib14)\] asked whether learners can actively generate the data necessary to form correct hypotheses, rather than simply receiving that data from an experimenter\. Our study addresses this by placing adults in an environment where they must determine the governing logic \(disjunctive vs\. conjunctive\) through self\-directed interventions\. Figure [4](https://arxiv.org/html/2606.06464#Sx3.F4) shows that disjunctive environments permit rapid pruning of the hypothesis space, with both humans and stronger models eliminating most competing hypotheses within the first few tests\. In contrast, conjunctive environments require substantially longer exploration trajectories to achieve comparable reductions in uncertainty\. This asymmetry is also reflected in cumulative information gain: information accumulates rapidly in disjunctive settings, whereas conjunctive settings require more gradual evidence accumulation across a larger number of tests\. The top\-performing human adults nevertheless show the most efficient search behavior of all agents, including every model tested, progressively narrowing the hypothesis space through targeted interventions\.
Table 1: Comparison of exploration effort between humans and models across conjunctive and disjunctive rules \($M \pm SE$\)\.

Table 2: Number of tests and time spent per test across successful \(both rule and objects\) trials \($M \pm SE$\)\.
### Conjunctive vs\. Disjunctive Reasoning\.
Our findings show that adults perform more tests and actions in the conjunctive case than in the disjunctive case, and this pattern is similarly observed across language models \(see Table [1](https://arxiv.org/html/2606.06464#Sx3.T1)\)\. Table [2](https://arxiv.org/html/2606.06464#Sx3.T2) further shows that even successful participants and models required more tests to converge on the correct causal hypothesis in conjunctive settings\. These findings suggest that the challenge of conjunctive reasoning lies not simply in representing conjunctive relations, but in resolving ambiguity among competing hypotheses\. In a disjunctive world \($A \lor B$\), any single positive test \(e\.g\., testing $A$ alone and seeing the machine activate\) provides immediate, high\-certainty evidence\. In contrast, in a conjunctive world \($A \land B$\), a single successful test \(testing $A$ and $B$ together\) is insufficient to discriminate among $A$ alone, $B$ alone, and $A \land B$\. Conjunctive conditions therefore require more data points to reach the same level of hypothesis pruning \(see Figure [4](https://arxiv.org/html/2606.06464#Sx3.F4)\)\. The steep drop in the number of hypotheses remaining in Figure [4](https://arxiv.org/html/2606.06464#Sx3.F4) suggests that this exploration is not random\. Like the children in \[[9](https://arxiv.org/html/2606.06464#bib.bib3),[6](https://arxiv.org/html/2606.06464#bib.bib19)\], adults perform rather efficient experiments to prune the hypothesis space\.
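The asymmetry between a single\-object and a two\-object positive test can be made concrete by counting which hypotheses survive each observation\. This is a minimal sketch, assuming objects labeled A to D and the same illustrative activation semantics as elsewhere \(conjunctive: all nexioms present; disjunctive: at least one present; an empty nexiom set never activates\); none of these names come from the paper\.

```python
from itertools import combinations

OBJECTS = ("A", "B", "C", "D")  # assumed labels


def all_hypotheses():
    """Every (nexiom set, rule) pair over the four objects."""
    subsets = [
        frozenset(c)
        for size in range(len(OBJECTS) + 1)
        for c in combinations(OBJECTS, size)
    ]
    return [(s, rule) for s in subsets for rule in ("conjunctive", "disjunctive")]


def machine_on(tested, nexioms, rule):
    """Assumed activation semantics, for illustration only."""
    tested = frozenset(tested)
    if not nexioms:
        return False
    if rule == "conjunctive":
        return nexioms <= tested
    return bool(nexioms & tested)


def consistent(tested, outcome, pool):
    """Keep the hypotheses that predict the observed outcome."""
    return [h for h in pool if machine_on(tested, *h) == outcome]


def is_settled(obj, pool):
    """True if all remaining hypotheses agree on whether obj is a nexiom."""
    return len({obj in nexioms for nexioms, _rule in pool}) == 1


# Positive test of A alone (possible when A is a nexiom in a disjunctive world).
after_single = consistent({"A"}, True, all_hypotheses())
# Positive test of A and B together (the only positive test of two objects
# available when the rule is conjunctive over A and B).
after_pair = consistent({"A", "B"}, True, all_hypotheses())

print(len(after_single), is_settled("A", after_single))
# prints: 9 True
print(len(after_pair), is_settled("A", after_pair), is_settled("B", after_pair))
# prints: 15 False False
```

Under these assumptions, one positive single\-object test already fixes the status of $A$ in every surviving hypothesis, whereas one positive pair test leaves the status of both $A$ and $B$ open\. Neither test alone identifies the rule, which is why further tests are needed in both worlds\.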
### Contingent Feedback vs\. Intervention Planning\.
To isolate whether the benefits of active exploration arise from intervention planning itself or from receiving contingent feedback from one’s own interventions, we introduced a third participant group: Passive Proposers\. Passive Proposers generated interventions in the same manner as active participants, but did not directly observe the outcomes of their proposed tests\. Instead, each Passive Proposer received the outcome produced by a matched Active Explorer’s intervention sequence\. Thus, Passive Proposers engaged in hypothesis generation and intervention planning without maintaining direct control over evidence generation\. In conjunctive environments, Passive Proposers performed substantially worse on object identification accuracy \(0\.12\) than both Passive Observers \(0\.45\) and Active Explorers \(0\.69\), despite actively reasoning about the hypothesis space\. Rule selection accuracy showed a similar but weaker pattern \(Passive Proposer: 0\.57, Passive Observer: 0\.69, Active Explorer: 0\.82\)\. In disjunctive environments, performance was generally higher across all conditions, although Active Explorers still achieved the strongest performance overall\. All reported numbers here are mean performance across participants\. These results suggest that the benefits of active exploration do not arise solely from internally generating hypotheses or planning interventions\. Rather, successful causal discovery appears to depend critically on the tight coupling between intervention selection and contingent outcome observation\.
### Error Structure Across Rule and Object Inference\.
Figure [5](https://arxiv.org/html/2606.06464#Sx3.F5) decomposes performance into four jointly defined outcome categories, reflecting whether participants correctly inferred the causal rule and identified the causal object\(s\)\. Across both conjunctive and disjunctive conditions, the majority of human participants successfully inferred both the underlying rule and the correct causal objects\. However, the structure of errors differs substantially between the two causal environments\. In the disjunctive condition, errors were relatively rare and typically partial: human participants usually identified either the correct rule or the correct objects, with few simultaneous failures on both dimensions\. Most state\-of\-the\-art models similarly showed strong disjunctive performance, with gpt\-5 and deepseek\-reasoner achieving near\-perfect or perfect full\-hypothesis accuracy\. The conjunctive condition revealed a qualitatively different error profile\. Although many participants still correctly inferred both the rule and the causal objects, failures were more common and heterogeneous\. Among average human adults, 16% incorrectly inferred both the rule and the objects, while another 16% correctly inferred the conjunctive rule but misidentified the causal objects\. This pattern suggests that participants often recognized the need for a conjunctive explanation while still struggling to uniquely determine the correct causal set\. Interestingly, when language models made object\-identification errors, these failures were most often driven by misclassifying only a single causal object, a near\-miss pattern that was comparatively less common in human participants\.
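The four joint outcome categories follow mechanically from the binary object and rule scores defined in the Evaluation Metrics section\. The sketch below shows one way to compute them; the function names, category labels, and the example responses are hypothetical and are not data from the study\.

```python
from collections import Counter


def score_response(chosen_objects, chosen_rule, true_objects, true_rule):
    """Three binary accuracy measures for one participant or trial."""
    object_ok = set(chosen_objects) == set(true_objects)  # exact-set match
    rule_ok = chosen_rule == true_rule
    return {
        "object": int(object_ok),
        "rule": int(rule_ok),
        "full": int(object_ok and rule_ok),
    }


def outcome_category(score):
    """Map (rule, object) scores onto the four joint outcome categories."""
    labels = {
        (1, 1): "rule correct, objects correct",
        (1, 0): "rule correct, objects incorrect",
        (0, 1): "rule incorrect, objects correct",
        (0, 0): "rule incorrect, objects incorrect",
    }
    return labels[(score["rule"], score["object"])]


# Made-up responses for illustration only; these are not results from the paper.
truth = ([2, 4], "conjunctive")
responses = [
    ([2, 4], "conjunctive"),
    ([4, 2], "conjunctive"),  # selection order does not matter
    ([2, 3], "conjunctive"),  # right rule, one object wrong
    ([2, 4], "disjunctive"),  # right objects, wrong rule
    ([1], "disjunctive"),     # both wrong
]

scores = [score_response(objs, rule, *truth) for objs, rule in responses]
breakdown = Counter(outcome_category(s) for s in scores)
full_accuracy = sum(s["full"] for s in scores) / len(scores)

print(dict(breakdown))
print(full_accuracy)
```

Averaging the `"object"`, `"rule"`, and `"full"` fields across participants gives the three accuracy measures, while `breakdown` gives the category distribution of the kind plotted in Figure 5\.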
Figure 4: Search efficiency\. Comparing hypothesis pruning & information gain in conjunctive vs\. disjunctive exploration\.

Figure 5: Distribution of rule and object identification outcomes for conjunctive and disjunctive conditions in adults and LLMs with active exploration\.
### Complexity of Hypothesis Space and Search Strategy\.
The analysis of information gain patterns reveals systematic differences in how humans and language models resolve uncertainty across conjunctive and disjunctive environments\. In the disjunctive condition, participants accumulate most available information within only a few tests, and the marginal utility of additional exploration rapidly approaches zero once the rule is identified \(Figure [4](https://arxiv.org/html/2606.06464#Sx3.F4)\)\. In contrast, conjunctive environments require substantially longer exploration trajectories, with informative interventions continuing to emerge later in the sequence\. This pattern is consistent with the behavioral observation that adults perform more tests in conjunctive settings \($M = 9.6$\) than in disjunctive settings \($M = 6.4$; see Table [1](https://arxiv.org/html/2606.06464#Sx3.T1)\), reflecting the greater ambiguity of conjunctive hypothesis spaces\. Figure [6](https://arxiv.org/html/2606.06464#Sx3.F6) further shows that humans and language models adapt their intervention strategies to the underlying causal structure\. In disjunctive environments, both humans and stronger models rely primarily on sparse interventions involving one or two objects, which are often sufficient to identify both the rule and causal objects\. In conjunctive environments, however, agents progressively transition toward multi\-object interventions because single\-object tests are insufficient to disambiguate competing explanations\. Average human adults gradually shift from one\-object to two\-object interventions, while top\-performing humans rapidly converge on targeted multi\-object combinations\. Stronger reasoning\-oriented models such as gpt\-5 exhibit partially similar exploration trajectories, dynamically adjusting intervention complexity in response to remaining uncertainty\.
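One way to make the idea of adjusting interventions to the remaining uncertainty concrete is a greedy explorer that always runs the test with the highest expected information gain\. The sketch below is a hypothetical reference point, not a model of the participants and not an analysis from the paper\. The object labels, the uniform weighting over remaining hypotheses, the stopping rule, and the activation semantics in `machine_on` are all assumptions\.

```python
from itertools import combinations
from math import log2

OBJECTS = ("A", "B", "C", "D")  # assumed labels


def subsets(items):
    return [frozenset(c) for n in range(len(items) + 1) for c in combinations(items, n)]


HYPOTHESES = [(s, rule) for s in subsets(OBJECTS) for rule in ("conjunctive", "disjunctive")]
TESTS = [t for t in subsets(OBJECTS) if t]  # the 15 non-empty object combinations


def machine_on(tested, nexioms, rule):
    """Assumed activation semantics, for illustration only."""
    if not nexioms:
        return False
    if rule == "conjunctive":
        return nexioms <= tested
    return bool(nexioms & tested)


def expected_gain(test, pool):
    """Expected bits gained from a test, weighting remaining hypotheses equally."""
    n = len(pool)
    n_on = sum(machine_on(test, *h) for h in pool)
    gain = 0.0
    for n_outcome in (n_on, n - n_on):
        if n_outcome:  # skip outcomes that no remaining hypothesis predicts
            gain += (n_outcome / n) * log2(n / n_outcome)
    return gain


def greedy_explore(truth):
    """Run the most informative test until no test can split the pool further."""
    pool = list(HYPOTHESES)
    chosen = []
    while True:
        best = max(TESTS, key=lambda t: expected_gain(t, pool))
        if expected_gain(best, pool) == 0.0:
            break  # every remaining hypothesis predicts the same outcomes
        outcome = machine_on(best, *truth)
        pool = [h for h in pool if machine_on(best, *h) == outcome]
        chosen.append((sorted(best), outcome))
    return chosen, pool


for rule in ("disjunctive", "conjunctive"):
    truth = (frozenset({"A", "B"}), rule)
    chosen, pool = greedy_explore(truth)
    print(rule, "tests:", len(chosen), "identified:", pool == [truth])
    for tested, outcome in chosen:
        print("   ", tested, "ON" if outcome else "OFF")
```

Printing the chosen sequences for the two rules shows how many objects such an idealized explorer places on the machine at each step\. Any comparison with the human and model trajectories in Figure 6 should be treated as qualitative, since the greedy policy and uniform weighting are simplifications introduced here\.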
Figure 6: Sequential search strategy across human and language agents during active exploration\. Each color represents the dominant number of objects per test index, with opacity indicating the percentage of trials that have that dominant number of objects within that index\. Grey area shows the terminating point of search\.
### Human and LLM Performance in Causal Discovery\.
Figure [7](https://arxiv.org/html/2606.06464#Sx3.F7) compares full hypothesis accuracy across humans and language models using our active exploration setup\. Performance varies considerably across both models and causal structures\. In the disjunctive condition, several state\-of\-the\-art models achieve near\-human performance\. gpt\-5 and deepseek\-reasoner perform as well as top\-performing human participants\. These results suggest that disjunctive causal environments are comparatively easy to resolve once agents are able to actively generate informative interventions\. The conjunctive condition reveals substantially larger performance differences across agents\. Although top\-performing humans continue to achieve perfect accuracy and average human adults substantially outperform most models, many language models exhibit pronounced declines in conjunctive reasoning performance\. Smaller or less structured models such as deepseek\-chat and o4\-mini perform only modestly above baseline, whereas stronger reasoning\-oriented models such as gpt\-5 and gpt\-5\-mini show substantially greater robustness\. Nevertheless, even strong models generally remain below top human performance in conjunctive settings\.
Figure 7: Comparative analysis of causal reasoning across LLMs and humans\. Error bars indicate ±SEM, and numeric labels above bars report mean accuracy\.
## Discussion
The present findings suggest that adults’ apparent failures in conjunctive causal learning arise less from a fixed limitation in causal reasoning and more from the structure of the learning environment\. Under passive observation, adults may over\-rely on disjunctive priors, consistent with prior demonstrations of a conjunctive handicap\[[7](https://arxiv.org/html/2606.06464#bib.bib2)\]\. Yet when allowed to actively explore, adults generated informative interventions and successfully inferred conjunctive rules and causal objects, demonstrating a flexibility more commonly associated with young children\[[2](https://arxiv.org/html/2606.06464#bib.bib16)\]\.
Our reinterpretation aligns with interventionist accounts of causal learning, which treat causal knowledge as understanding how actions change the world\[[1](https://arxiv.org/html/2606.06464#bib.bib15),[14](https://arxiv.org/html/2606.06464#bib.bib4),[15](https://arxiv.org/html/2606.06464#bib.bib17)\]\. Passive observation paradigms constrain learners to reason over experimenter\-selected evidence, which limits their ability to design tests that directly adjudicate between competing hypotheses they may hold in their minds, leaving familiar disjunctive priors intact rather than prompting the learners to revise them\. By contrast, active exploration lets learners test the hypotheses that they themselves are considering\. In our task, adults used this to propose tests that narrow their own hypothesis space, generating informative interventions that supported successful inference of conjunctive rules where passive observations failed\.
Importantly, the reported improvement in conjunctive causal learning was not cost\-free\. Although time per test was comparable across rule types, participants conducted *more* tests in the conjunctive condition than in the disjunctive condition\. This suggests that conjunctive discovery requires a more extended and strategic search process because conjunctive structures require ruling out multiple simpler alternative explanations \[[12](https://arxiv.org/html/2606.06464#bib.bib18)\]\. Thus, active exploration does not eliminate the informational demands of conjunctive learning, but instead allows adults to meet those demands through targeted intervention rather than heuristic shortcuts or priors\. Active exploration reveals adult competence not by making the task easier, but by giving learners the means to gather the evidence they need\.
The follow\-up Passive Proposer experiment further clarified the source of the active exploration advantage\. Although Passive Proposers actively generated hypotheses and proposed interventions, they did not receive contingent feedback from their own actions\. Instead, they observed outcomes generated by another participant’s intervention sequence, and consequently performed substantially worse than Active Explorers\. This suggests that active learning benefits are not reducible to explicit hypothesis generation or test selection alone\. They depend critically on learners being able to observe the outcome of the intervention they choose\.
These results also help reconcile prior developmental results\. Children’s apparent advantage in conjunctive causal learning has been attributed to weaker prior commitments or greater openness to unlikely hypotheses \[[2](https://arxiv.org/html/2606.06464#bib.bib16)\]\. Our results suggest that adults *do* possess the representational capacity to infer unlikely structures like conjunctive rules too, but they often rely on efficient default heuristics that perform well under sparse or passively presented evidence\. When adults are placed in an environment that supports hypothesis\-driven exploration, their causal learning resembles that of younger children in its flexibility and sensitivity to evidence\. This pattern suggests that previously reported developmental differences reflect differences in the strength of default heuristics and willingness to revise beliefs, and not fundamental differences in causal reasoning mechanisms\.
The present results also clarify how adult performance compares to large language models under active exploration\. Our results show that model performance is heterogeneous: some actively exploring models perform poorly in both disjunctive and conjunctive settings, while others like gpt\-5 achieve accuracy that is close to that of actively exploring adults across both rule types\. Together, these differences suggest that while some models can match humans on full hypothesis accuracy, key aspects of strategic testing, evidence selection, and rule\-level inference may still distinguish human causal learning\.
Collectively, these findings demonstrate that adult causal learning is more adaptive and flexible than prior passive paradigms might have suggested\. Apparent failures in conjunctive reasoning may arise not from an inability to represent complex causal structure, but from the interaction between strong default heuristics and constrained learning environments\. Allowing adults to act like "scientists" by exploring and generating evidence reveals forms of causal competence that passive observation can obscure\. By contrast, the LLM comparison suggests that active intervention alone is insufficient for human\-like causal discovery: successful performance depends on the quality, efficiency, and informativeness of the tests, particularly in conjunctive environments\.
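To make the notion of test informativeness concrete, the following is a minimal illustrative sketch, not the analysis code used in this study\. It assumes a toy detector with three objects, a uniform prior over hypotheses, a disjunctive rule that fires with at least one blicket, and a conjunctive rule that fires with at least two\. All function names \(`detector_on`, `hypotheses`, `expected_info_gain`, `update`, `run_tests`\) are hypothetical and introduced only for illustration\.

```python
from itertools import combinations
from math import log2


def detector_on(rule, blickets, placed):
    """True if the detector fires for the objects placed on it.

    Assumption: disjunctive needs >= 1 blicket, conjunctive needs >= 2.
    """
    n = len(blickets & placed)
    return n >= 1 if rule == "disjunctive" else n >= 2


def hypotheses(objects):
    """Every (rule, blicket set) pair over the given objects."""
    subsets = [frozenset(c)
               for r in range(len(objects) + 1)
               for c in combinations(objects, r)]
    return [(rule, b) for rule in ("disjunctive", "conjunctive") for b in subsets]


def entropy_bits(n):
    """Entropy of a uniform distribution over n hypotheses."""
    return log2(n) if n > 0 else 0.0


def expected_info_gain(hyps, placed):
    """Expected entropy reduction (bits) from testing `placed`, uniform prior."""
    total = len(hyps)
    on = sum(detector_on(r, b, placed) for r, b in hyps)
    off = total - on
    posterior = (on / total) * entropy_bits(on) + (off / total) * entropy_bits(off)
    return entropy_bits(total) - posterior


def update(hyps, placed, outcome):
    """Keep only the hypotheses that predict the observed outcome."""
    return [(r, b) for r, b in hyps if detector_on(r, b, placed) == outcome]


def run_tests(objects, truth, tests):
    """Apply a sequence of tests against a ground truth; return what survives."""
    hyps = hypotheses(objects)
    for placed in tests:
        hyps = update(hyps, placed, detector_on(truth[0], truth[1], placed))
    return hyps


# Single-object tests, then two pair tests (an arbitrary, non-optimized order).
singles = [frozenset("A"), frozenset("B"), frozenset("C")]
pairs = [frozenset("AB"), frozenset("AC")]

disj_left = run_tests("ABC", ("disjunctive", frozenset("A")), singles)
conj_left = run_tests("ABC", ("conjunctive", frozenset("AB")), singles + pairs)
print(len(disj_left), len(conj_left))
```

In this toy setting, single\-object tests carry no information about which objects matter under a conjunctive rule, so a learner must move on to multi\-object tests to separate the remaining hypotheses\. The sketch is meant only to show why the choice of tests, and not merely the ability to intervene, determines how quickly the hypothesis space shrinks\.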
## Acknowledgements
This research was enabled in part by support provided by Calcul Québec \(https://www\.calculquebec\.ca/en/\) and the Digital Research Alliance of Canada \(https://alliancecan\.ca/en\)\. We acknowledge support from NSERC \(Discovery Grant: RGPIN\-2020\-05105; Discovery Accelerator Supplement: RGPAS\-2020\-00031\) and CIFAR \(Canada AI Chair; Learning in Machines and Brains Fellowship\)\.
## References
- \[1\] \(2011\) Where science starts: spontaneous experiments in preschoolers’ exploratory play\. Cognition 120\(3\), pp\. 341–349\.
- \[2\] A\. Gopnik, T\. L\. Griffiths, and C\. G\. Lucas \(2015\) When younger learners can be better \(or at least more open\-minded\) than older ones\. Current Directions in Psychological Science 24\(2\), pp\. 87–92\.
- \[3\] A\. Gopnik, S\. O’Grady, C\. G\. Lucas, T\. L\. Griffiths, A\. Wente, S\. Bridgers, R\. Aboody, H\. Fung, and R\. E\. Dahl \(2017\) Changes in cognitive flexibility and hypothesis search across human life history from childhood to adolescence to adulthood\. Proceedings of the National Academy of Sciences 114\(30\), pp\. 7892–7899\. External Links: [Document](https://dx.doi.org/10.1073/pnas.1700811114)\.
- \[4\] A\. Gopnik and D\. M\. Sobel \(2000\) Detecting blickets: how young children use information about novel causal powers in categorization and induction\. Child Development 71\(5\), pp\. 1205–1222\.
- \[5\] A\. GX\-Chen, D\. Lin, M\. Samiei, D\. Precup, B\. A\. Richards, R\. Fergus, and K\. Marino \(2025\) Language agents mirror human causal reasoning biases\. How can we help them think like scientists? In Second Conference on Language Modeling\. External Links: [Link](https://openreview.net/forum?id=LKINTp7Gdo)\.
- \[6\] E\. Kosoy, A\. Liu, J\. L\. Collins, D\. Chan, J\. B\. Hamrick, N\. R\. Ke, S\. Huang, B\. Kaufmann, J\. Canny, and A\. Gopnik \(2022\) Learning causal overhypotheses through exploration in children and computational models\. In Proceedings of the First Conference on Causal Learning and Reasoning, B\. Schölkopf, C\. Uhler, and K\. Zhang \(Eds\.\), Proceedings of Machine Learning Research, Vol\. 177, pp\. 390–406\. External Links: [Link](https://proceedings.mlr.press/v177/kosoy22a.html)\.
- \[7\] C\. G\. Lucas, S\. Bridgers, T\. L\. Griffiths, and A\. Gopnik \(2014\) When children are better \(or at least more open\-minded\) learners than adults: developmental differences in learning the forms of causal relationships\. Psychological Science 25\(6\), pp\. 1175–1182\.
- \[8\] T\. Piriyakulkij, C\. Langenfeld, T\. A\. Le, and K\. Ellis \(2024\) Doing experiments and revising rules with natural language and probabilistic reasoning\. Advances in Neural Information Processing Systems 37, pp\. 53102–53137\.
- \[9\] L\. E\. Schulz, A\. Gopnik, and C\. Glymour \(2007\) Preschool children learn about causal structure from conditional interventions\. Developmental Psychology 43\(5\), pp\. 1150\.
- \[10\] D\. M\. Sobel and T\. Kushnir \(2006\) The importance of decision making in causal learning from interventions\. Memory & Cognition 34\(2\), pp\. 411–419\.
- \[11\] D\. M\. Sobel and T\. Kushnir \(2013\) Interventions do not solely benefit causal learning: being told what to do results in worse learning than doing it yourself\. In Proceedings of the 25th Annual Cognitive Science Society, pp\. 1100–1105\.
- \[12\] D\. M\. Sobel, J\. B\. Tenenbaum, and A\. Gopnik \(2004\) Children’s causal inferences from indirect evidence: backwards blocking and Bayesian reasoning in preschoolers\. Cognitive Science 28\(3\), pp\. 303–333\.
- \[13\] A\. O\. Wente, K\. Kimura, C\. M\. Walker, N\. Banerjee, M\. Fernández Flecha, B\. MacDonald, C\. Lucas, and A\. Gopnik \(2019\) Causal learning across culture and socioeconomic status\. Child Development 90\(3\), pp\. 859–875\.
- \[14\] J\. Woodward \(2003\) Making things happen: a theory of causal explanation\. Oxford University Press\.
- \[15\] E\. Yiu, K\. Allen, S\. Ginosar, and A\. Gopnik \(2025\) Empowerment gain and causal model construction: children and adults are sensitive to controllability and variability in their causal interventions\. arXiv preprint arXiv:2512\.08230\.
- \[16\] C\. Zhang, B\. Jia, M\. Edmonds, S\. Zhu, and Y\. Zhu \(2021\) ACRE: abstract causal reasoning beyond covariation\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 10643–10653\.