Designing Synthetic Discussion Generation Systems: A Case Study for Online Facilitation
Summary
This paper introduces Synthetic Discussion Generation (SDG), a novel NLP framework for creating simulated discussions to enable cost-effective pilot experiments in social science research. The authors demonstrate that smaller quantized models (7B-8B parameters) can produce effective simulations at 44x lower cost than proprietary models like GPT, and apply this framework to evaluate LLM facilitators in online discussions.
## Designing Synthetic Discussion Generation Systems: A Case Study for Online Facilitation

Source: https://arxiv.org/html/2503.16505

###### Abstract

A critical challenge in social science research is the high cost associated with experiments involving human participants. While some studies have explored the use of Large Language Models (LLMs) as imperfect substitutes, there has been little work on how to define, design, and evaluate such experiments. In this paper, we identify Synthetic Discussion Generation (SDG), a novel Natural Language Processing (NLP) direction aimed at creating simulated discussions that enable cost-effective pilot experiments. Drawing on existing SDG systems and interdisciplinary literature, we develop a theoretical, task-agnostic framework for designing, evaluating, and implementing these simulations. We argue that the use of proprietary models such as the OpenAI GPT family for such experiments is often unjustified in terms of both cost and capability, despite its prevalence in current research. Our experiments demonstrate that smaller quantized models (7B–8B) can produce effective simulations at a cost more than 44 times lower compared to their proprietary counterparts. We use our framework in the context of online facilitation, where humans actively engage in discussions to improve them, unlike more conventional content moderation, which only removes inappropriate content from discussions. The extremely large scale of modern social networks has led researchers to develop LLM facilitators, whose capabilities remain largely unassessed due to the need for costly experiments with human discussants. By treating this problem as a downstream task for our framework, we show that synthetic simulations can yield generalizable results, at least by revealing limitations before engaging human discussants.
A critical limitation of LLM facilitators is that they cannot determine when to intervene in a discussion, which leads to undesirably frequent interventions and, consequently, to derailment patterns similar to those observed in human interactions. Additionally, we find that different facilitation strategies influence conversational dynamics to some extent. Beyond our theoretical SDG framework, we also present a cost-comparison methodology for experimental design, an exploration of available models and algorithms, an open-source Python framework, and a large, publicly available dataset of LLM-generated discussions across multiple models.

Keywords: LLMs, discussions, facilitation, synthetic, experiments

Journal: TSC

CCS concepts: Computing methodologies → Natural language generation; Computing methodologies → Discourse, dialogue and pragmatics; Computing methodologies → Language resources; Human-centered computing → Open source software; Human-centered computing → Social networking sites; Human-centered computing → Social media; Human-centered computing → Collaborative and social computing design and evaluation methods

## 1. Introduction

[Figure 1 images. Left: Venn diagram showing that Synthetic Discussion Generation (SDG) is a combination of the three disciplines. Right: directed acyclic graph linking base concepts (‘Online discussions’, ‘Facilitation’, ‘LLMs’, ‘SDG’) to the use of SDG for experiments in online facilitation.]

Figure 1. Left: Synthetic Discussion Generation (SDG) is a subset of Generative Agent-Based Modelling (GABM) focused exclusively on interactions through text.
Designing SDG systems requires knowledge from three disciplines: social science helps us understand how the system should function, Natural Language Processing (NLP) provides the tools to automate generation and evaluation, and Software Engineering (SWE) enables us to build a scalable and generalizable system. Right: We can use SDG to discover and debug issues in the emerging field of LLM facilitation, without the need for costly human experiments.

Advances in generative Artificial Intelligence (AI) technologies have enabled the study of human behaviors through Large Language Models (LLMs). Since these models are pretrained on extensive human discussion data, researchers have hypothesized that they could replicate certain human behaviors in experiments concerning behavioral social science (Grossmann et al., 2023; Törnberg et al., 2023; Argyle et al., 2023) or human-algorithm interaction (Korre et al., 2025; Cho et al., 2024; Park et al., 2022), usually by creating synthetic simulation platforms (Park et al., 2022; Törnberg et al., 2023; Rossetti et al., 2024; Zhou et al., 2024; Chuang et al., 2024; Li et al., 2025; Balog et al., 2024). However, since the emergent properties of LLMs have only been discovered relatively recently, studies using synthetic agents so far have employed various ad-hoc methodologies to define, run, and evaluate simulated discussion experiments (Mou et al., 2024; Park et al., 2022, 2023; Törnberg et al., 2023; Abdelnabi et al., 2024; Ulmer et al., 2024; Park et al., 2024). In an attempt to consolidate the currently fragmented research landscape of synthetic simulations, we are the first to identify and define this, so far implicit, research direction:

###### Definition 0.

Synthetic Discussion Generation (SDG): Simulating textual human discussions, by using synthetic participants, ultimately aiming to understand and extract representative insights for human discussions.

SDG is a specific case of Generative Agent-Based Modelling (GABM), which is a computational approach that simulates the interactions of agents to study complex behaviors and emergent phenomena.
Unlike generalized GABM systems, SDG actors do not have a predetermined set of actions and behaviors; rather, both are expressed through free text. SDG is typically used as a means to solve research questions in other tasks (called *‘downstream tasks’* in this study) where both human participation and a sustained conversational context are needed, as opposed to single exchanges (in which case, simple prompting would be sufficient). Examples include negotiations (Abdelnabi et al., 2024), social experiments (Park et al., 2022), and human persona replication (Park et al., 2024). Despite the breadth of downstream tasks, we argue that a well-designed SDG methodology can generalize over most such tasks.

The lack of a unified design methodology in studies using SDG makes it harder to create, run, and assess the validity of these experiments. Furthermore, *there has been no study that addresses how to systematically create, evaluate, and use generalizable LLM simulations of discussions to solve research questions* without the need for human participants and evaluators. In this study, we answer the following research question (RQ):

###### RQ 0.

How do we design and evaluate scalable[1] SDG simulations that (a) replicate known representative behaviors found in real-life human discussions, and (b) can be used to obtain further insights on real-world phenomena?

[1] Scalable in this context implies that the system can maintain consistent performance and quality as the scale of the experiments increases, without requiring disproportionate increases in computational resources or financial cost.
We begin by creating a theoretical, downstream-task-agnostic framework for SDG. Since there are no standards or guidelines on how to create SDG systems, we examine task-specific systems in the literature and use lessons from Software Engineering (SWE) to derive design rules (§3.1). Through this analysis, we formalize Synthetic Discussion Generation (SDG) as a distinct research direction and identify four core design principles for effective synthetic simulations. We break down the requirements for SDG and derive a set of isolated components that must be implemented and separately evaluated (§3.3, §3.4), including the construction of LLM personas and the execution of dynamic discussions. To assess whether observed behaviors are representative of real-world patterns, we build an evaluation suite grounded in GABM literature and adapted for SDG, which we use to verify the necessity and integrity of these components (§3.2). Our findings show that persona-driven simulations improve the diversity of generated content, aligning it more closely with human discussions, while careful instruction prompting can induce persistently toxic behavior in safety-aligned LLMs using only conversational context. Additionally, we observe that LLM agents tend not to abstain from participation, showcasing an inherent limitation in modeling realistic conversational dynamics.
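To make the two components named above (persona construction and dynamic discussion execution) more concrete, the following is a minimal Python sketch of a persona-driven discussion loop. It is an illustration under stated assumptions, not the authors' framework: the names `Participant`, `stub_generate`, and `run_discussion`, the random turn-taking rule, and the `context_window` parameter are all hypothetical, and `stub_generate` stands in for a real LLM call.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A comment is a (speaker, text) pair; a generator maps a persona and the
# recent comments to the next comment text. Both types are assumptions.
Comment = Tuple[str, str]
Generator = Callable[[str, List[Comment]], str]


@dataclass
class Participant:
    name: str
    persona: str  # free-text persona description (hypothetical format)


def stub_generate(persona: str, context: List[Comment]) -> str:
    """Placeholder for an LLM call: replies to the most recent speaker."""
    last_speaker = context[-1][0]
    return f"({persona}) reply to {last_speaker}"


def run_discussion(
    participants: List[Participant],
    seed_comment: str,
    turns: int,
    generate: Generator = stub_generate,
    context_window: int = 4,
    rng_seed: int = 0,
) -> List[Comment]:
    """Simulate a discussion and return it as a list of (speaker, text)."""
    if len(participants) < 2:
        raise ValueError("at least two participants are required")
    rng = random.Random(rng_seed)
    history: List[Comment] = [("seed", seed_comment)]
    previous = None
    for _ in range(turns):
        # Turn-taking rule (assumed): anyone except the previous speaker.
        candidates = [p for p in participants if p is not previous]
        speaker = rng.choice(candidates)
        # Only the most recent comments are passed as conversational context.
        context = history[-context_window:]
        history.append((speaker.name, generate(speaker.persona, context)))
        previous = speaker
    return history


if __name__ == "__main__":
    people = [
        Participant("ana", "skeptical engineer"),
        Participant("ben", "optimistic student"),
        Participant("cy", "neutral observer"),
    ]
    for speaker, text in run_discussion(people, "Should bots moderate?", 5):
        print(f"{speaker}: {text}")
```

Keeping the generator as an injected callable means the same loop can be run with a stub for testing or with any locally hosted or hosted model, and the turn-taking rule can be swapped out independently of the persona definitions.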
Having created a theoretical SDG framework, we tackle the issue of scalability: how can we minimize the cost of running simulated experiments? Most recent studies (Abdelnabi et al., 2024; Park et al., 2022, 2023; Balog et al., 2024; Törnberg et al., 2023) use proprietary models such as the OpenAI GPT family. We argue that the use of such models increases inference costs, which inherently limits the number, efficiency, and scalability of experiments conducted using these models in real-world problems (§4.1). Indeed, we show that small (7B-8B parameter) quantized models are capable enough for our purposes (Finding 1) and able to replicate several social dynamics observed in human discussions (Appendix B) at a fraction of the cost. By developing a methodology for cost comparison between experiments with humans, proprietary models, and locally hosted LLMs (§4.1), we estimate that our approach using smaller models achieves a *cost reduction of 1,600 times* compared to experiments with humans and *44 times* compared to a recent, mid-capability proprietary model (GPT-5.1; see Table 8). Next, we investigate whether such simulations can produce insights transferable to human discussions, by using LLM facilitation as a downstream task (§5).
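A cost reduction "of N times" is the ratio between the cost of a baseline setup and the cost of the cheaper one. The sketch below shows that arithmetic in Python. It is not the cost-comparison methodology of §4.1, which is not reproduced in this excerpt; the three cost models and every number in the example are hypothetical placeholders chosen only to make the ratios easy to follow, and they do not reproduce the paper's reported figures.

```python
def api_cost(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of a hosted model billed per token (assumed billing model)."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


def local_cost(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Cost of a locally hosted model billed per GPU hour (assumed)."""
    return gpu_hours * usd_per_gpu_hour


def human_cost(participants: int, hours_each: float, usd_per_hour: float) -> float:
    """Cost of paying human participants by the hour (assumed)."""
    return participants * hours_each * usd_per_hour


def reduction_factor(baseline_usd: float, candidate_usd: float) -> float:
    """How many times cheaper the candidate is than the baseline."""
    if candidate_usd <= 0:
        raise ValueError("candidate cost must be positive")
    return baseline_usd / candidate_usd


if __name__ == "__main__":
    # All inputs below are illustrative placeholders, not values from the paper.
    hosted = api_cost(total_tokens=2_000_000, usd_per_million_tokens=5.0)
    local = local_cost(gpu_hours=2.0, usd_per_gpu_hour=0.5)
    humans = human_cost(participants=10, hours_each=2.0, usd_per_hour=15.0)
    print(f"hosted vs local: {reduction_factor(hosted, local):.0f}x")
    print(f"humans vs local: {reduction_factor(humans, local):.0f}x")
```

Expressing each setup as a function of its own billing unit (tokens, GPU hours, participant hours) keeps the comparison explicit about which assumptions drive the final ratio.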
Platform designers and researchers have traditionally focused on identifying, flagging, and removing problematic content, an approach often called “content moderation” in the literature (Seering, 2020; Cresci et al., 2022). However, fully automatic content moderation is insufficient in practice (Horta Ribeiro et al., 2023; Schaffner et al., 2024; Small et al., 2023; Korre et al., 2025). A more effective approach is using human facilitators,[2] who actively participate in a discussion. Still, this approach does not scale to modern social media sites, where hundreds of thousands of discussions are conducted each day. LLMs have been proposed as scalable facilitators (Korre et al., 2025; Small et al., 2023), yet experimentation and development remain costly and constrained by the need for human discussants (Rossi et al., 2024; Small et al., 2023). The main issue is that facilitation cannot be meaningfully evaluated in isolated, single-turn listen-and-respond settings; it requires sustained conversational context (Cho et al., 2024), with multiple human participants actively conversing with the LLM facilitator. Our study is the first to use multiple participants to evaluate an LLM facilitator, and the first to use synthetic participants in lieu of humans.

[2] Some publications use the term “moderator” with the meaning we have assigned to “facilitator” (Korre et al., 2025).
While the use of LLM participants and the relatively limited recent work on facilitation constrain the scope of our findings, our experiments uncover generalizable results. Specifically, we identify a severe and previously unreported limitation: LLM facilitators cannot choose not to intervene, even when intervention is unnecessary (Finding 2). This inability to remain silent has not, to our knowledge, been documented in prior literature. This behavior has been observed to induce frustration and toxicity among human participants in online communities (Korre et al., 2025).[3] Most importantly, failing to uncover this issue could lead researchers to invest sign

[3] A pattern we replicate in our experiments (App. B; Table 14).

Similar Articles
Multi-Persona Debate System for Automated Scientific Hypothesis Generation
The paper introduces the Multi-Persona Debate System (MPDS), a literature-grounded framework that uses LLMs, persona induction, and structured multi-agent debate to automate the generation of scientific hypotheses, with evaluations in battery materials research showing improved hypothesis quality and cross-perspective integration.
Beyond Static Benchmarks: Synthesizing Harmful Content via Persona-based Simulation for Robust Evaluation
Researchers from KAIST propose a framework that uses persona-guided LLM agents to synthesize diverse harmful content for stress-testing detection systems, addressing limitations of static benchmarks such as scalability, diversity, and data contamination. Both human and LLM evaluations confirm the synthetic scenarios are harder to detect than existing benchmarks while maintaining linguistic and topical diversity.
FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale
FrontierSmith automatically generates diverse open-ended coding problems from closed-ended tasks, improving LLM coding performance on benchmarks through enhanced agent interactions and training data synthesis.
DataArc-SynData-Toolkit: A Unified Closed-Loop Framework for Multi-Path, Multimodal, and Multilingual Data Synthesis
The article introduces DataArc-SynData-Toolkit, an open-source framework designed to simplify multi-path, multimodal, and multilingual synthetic data generation. It aims to lower technical barriers and improve usability for training large language models through a unified, configuration-driven pipeline.
Can Large Language Models Imitate Human Speech for Clinical Assessment? LLM-Driven Data Augmentation for Cognitive Score Prediction
This paper proposes a large language model-driven data augmentation framework using GPT-5 to generate synthetic oral monologues from written anchors for cognitive score prediction from speech. A similarity-guided selection strategy consistently reduces prediction error, particularly for minority low-score participants.