Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

arXiv cs.CL Papers

Summary

This paper evaluates the abilities of large language models (LLMs) and multimodal LLMs for addressee detection, turn-change prediction, and next speaker prediction in multi-party meeting conversations. Results show text-based LLMs outperform supervised models and humans in next speaker prediction, while multimodal LLMs improve over text-only models in other tasks but remain below human performance.

arXiv:2606.17542v1 Announce Type: new Abstract: We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction. We compare supervised models trained for these tasks, text-based LLMs, multimodal LLMs (MM-LLMs), and human subjects. Experiments on the AMI corpus showed that LLMs outperformed supervised models and humans in next speaker prediction, despite not being trained on the target domain and without access to audio or visual information. An MM-LLM performed better than text-based LLMs on addressee detection and turn-change prediction but remained below human performance, indicating difficulty leveraging raw audio-visual signals. Ablation analyses revealed that conversational context was critical, particularly for next speaker prediction. We observed that human and LLM prediction patterns were similar, and intervals with frequent turn changes were difficult for both.

Cached at: 06/17/26, 05:41 AM

# Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings
Source: [https://arxiv.org/html/2606.17542](https://arxiv.org/html/2606.17542)
Ryo Fukuda, Takatomo Kano, Siddhant Arora, Marc Delcroix, Naohiro Tawara, Atsunori Ogawa, Yuya Chiba, Atsushi Ando, William Chen, Shinji Watanabe

1 NTT, Inc\., Japan; 2 Language Technologies Institute, Carnegie Mellon University, USA

ryo\.fukuda@ntt\.com

###### Abstract

We investigate turn\-taking in multimodal multi\-party conversations using large language models \(LLMs\)\. We construct an evaluation framework for three tasks: addressee detection, turn\-change prediction, and next speaker prediction\. We compare supervised models trained for these tasks, text\-based LLMs, multimodal LLMs \(MM\-LLMs\), and human subjects\. Experiments on the AMI corpus showed that LLMs outperformed supervised models and humans in next speaker prediction, despite not being trained on the target domain and without access to audio or visual information\. An MM\-LLM performed better than text\-based LLMs on addressee detection and turn\-change prediction but remained below human performance, indicating difficulty leveraging raw audio\-visual signals\. Ablation analyses revealed that conversational context was critical, particularly for next speaker prediction\. We observed that human and LLM prediction patterns were similar, and intervals with frequent turn changes were difficult for both\.

###### keywords:

turn\-taking modeling, multi\-party conversation, multimodal speech processing, conversational context

## 1 Introduction

Advances in large language models \(LLMs\) have substantially improved the ability of conversational agents to understand and generate natural language\. With the emergence of multimodal LLMs \(MM\-LLMs\) capable of processing audio and visual inputs in addition to text\[zhang\-etal\-2023\-video,10448504\], it is becoming possible to integrate linguistic and non\-linguistic information to understand non\-verbal communication\. These developments have increased interest in systems that can engage in human multi\-party conversations \(MPCs\), such as meeting assistants and collaborative agents\[matsuyama2013four,skantze2015exploring\], as studied in several MPC corpora\[1198793,Koutsombogera2018\-ne,Watanabe2020\-bm,Vinnikov2024\-an,Nguyen2026\-jb\]\.

Conversational agents participating in MPCs are required to understand turn\-taking behaviors, which are more complex than those in dyadic interactions\[traum2003issues\]\. In dyadic conversations, understanding turn\-taking only requires detecting speaker changes, as the addressee of the current utterance and the next speaker are self\-evident \(i\.e\., the interlocutor\)\. In contrast, MPCs involve multiple potential addressees and multiple candidates for the next speaker\. An utterance may be directed to a specific individual, several participants, or the group as a whole\. Successful participation therefore requires inferring who is being addressed and whether and to whom the conversational floor will shift\. Failure to appropriately manage turn\-taking can result in long pauses or inappropriate interruptions that significantly disrupt the fluidity of the interaction\[johansson2015opportunities\]\.

Previous studies have extensively investigated turn\-taking in dyadic conversations\[Skantze2017\-ab,roddy2018multimodal,Ekstedt2022\-up\], and recent work has examined the turn\-taking capabilities of LLMs and audio foundation models\[Ekstedt2020\-ds,aroratalking,lin2025full\]\. In comparison, turn\-taking in MPCs has received less attention, although interest has grown in recent years\[castillo\-lopez\-etal\-2025\-survey\]\. In MPCs, turn\-taking has been studied through individual tasks such as addressee detection\[Jovanovic2004\-tw\], turn\-change prediction\[laskowski\-2010\-modeling\], and next speaker prediction\[Ishii2013\-me\]\. These tasks have been shown to benefit from multimodal information\. For example, visual cues such as gaze contribute to addressee detection and next speaker prediction\[Jovanovic2004\-tw,Ishii2013\-me\], while prosodic features provide cues for turn\-change prediction\[Ohsuga2005\-jc\]\. Earlier work primarily employed supervised models such as conditional random fields\[10\.1145/1647314\.1647332\], support vector machines \(SVM\)\[Malik2019\-pb\], and deep neural networks\[Heo2025\-oo,Elmers2025\-yy\], typically combining textual features with audio\-visual signals\. More recently, LLM\-based approaches have been explored\. Several studies have evaluated the ability of text\-based LLMs to perform addressee detection and next speaker prediction\[Koji2025\-zu,Hilgert2025\-rl\]\. The effectiveness of incorporating visual information, such as gaze or images, into text\-based LLMs and MM\-LLMs has also been investigated\[Koji2025\-zu,Mori2026\-zq\]\.

Despite these efforts, three limitations remain\. First, it is unclear to what extent MM\-LLMs can perform these tasks directly from raw audio and video signals\. Although such models accept multimodal inputs, their ability to exploit these signals effectively for turn\-taking prediction in MPCs has not been systematically examined\. Second, comprehensive evaluations across modeling paradigms, such as supervised models trained to predict these tasks, text\-based LLMs, and MM\-LLMs, under a unified experimental protocol are limited\. Third, and most importantly, human performance on these tasks has not been sufficiently quantified under comparable multimodal conditions\. Understanding the inherent difficulty of these tasks for humans and the factors that contribute to it is essential for developing conversational agents that approximate human behavior\.

Table 1: LLM\-based approaches for turn\-taking in multi\-party conversation\.

In this study, we address these limitations by evaluating addressee detection, turn\-change prediction, and next speaker prediction in a unified framework\. Both models and humans address each task using only past and current conversation information, which reflects real\-time conversational constraints\. We use the AMI Meeting Corpus\[kraaij2005ami\], which consists of natural MPCs in meetings involving four English speakers\. The corpus provides synchronized audio, video, and manual transcriptions, as well as annotations such as addressee and focus\-of\-attention \(FOA\) labels, making it suitable for our experiments\. We compare three classes of models: supervised models, text\-based LLMs, and MM\-LLMs\. As supervised baselines, we implement four conventional models\. For text\-based LLMs, we evaluate three variants of Qwen3\[Yang2025\-vg\]\. For MM\-LLMs, we examine Qwen\-Omni models\[Xu2025\-mb,Xu2025\-vy\] and Gemini 2\.5 Pro\. In addition, we conduct human evaluations under comparable multimodal settings to directly compare humans and models\.

The evaluation revealed several key findings\.

- First, human performance on these tasks was not particularly high\. For example, in next speaker prediction, where four candidates were possible, the F1 score was approximately 60%\. This indicates that turn\-taking in MPCs is difficult to predict, even for humans\. \(Footnote 1: We should note that a limitation of the study is that, as explained later, our human evaluation was carried out by non\-native speakers, who may perform worse than native speakers on these tasks\.\)
- Second, text\-based LLMs outperformed human subjects and supervised models in next speaker prediction\. We confirmed that conversational context was critical for this task, consistent with prior findings\[Hilgert2025\-rl\]\.
- Third, among MM\-LLMs, Gemini 2\.5 Pro achieved higher performance than text\-based LLMs in addressee detection and turn\-change prediction, but still fell below human performance\. Our analysis indicates that current MM\-LLMs still face difficulties in effectively leveraging raw audio\-visual signals for turn\-taking prediction\.
- Finally, we observed that the prediction tendencies of humans and LLMs were broadly similar\. Intervals that were difficult for humans also tended to be difficult for LLMs\. In particular, segments with frequent speaker changes and more balanced participation among speakers were harder to predict\.

## 2 Related Work

### 2\.1 Evaluation of LLMs

Several recent studies have examined the ability of LLMs to understand turn\-taking in MPCs \([Table 1](https://arxiv.org/html/2606.17542#S1.T1)\)\. Inoue et al\.\[Koji2025\-zu\] constructed a benchmark for addressee detection and next speaker prediction using three\-party conversations\. They reported that LLM performance with ground\-truth transcriptions was close to chance level and that incorporating gaze information did not substantially improve performance\. Hilgert and Niehues\[Hilgert2025\-rl\] evaluated text\-based LLMs on next speaker prediction using several MPC datasets\. Their results showed that some models outperformed human subjects and highlighted the importance of conversational context for predicting the next speaker\. Mori et al\.\[Mori2026\-zq\] investigated next speaker prediction using an MM\-LLM with text and image inputs on the TEIDAN dataset\. They found no clear performance improvement from incorporating visual information, suggesting that current models may have a limited ability to exploit visual cues for this task\. Chang et al\.\[chang2025multimodal\] explored whether MM\-LLMs can understand conversational structure from audio and video signals\. Using a television dialogue dataset, they evaluated models on discourse\-related attributes such as addressee prediction\. However, their evaluation assumes an offline setting in which the entire conversation is available, which differs from online prediction scenarios\.

While these studies demonstrate increasing interest in LLM\-based approaches to MPC understanding, most focus on a single turn\-taking task or rely on limited modalities\. Moreover, systematic comparisons across different model types remain limited\. In contrast, our study focuses on an online scenario simulating an agent participating in a meeting, predicting conversational dynamics without access to future information\. Moreover, we compare text\-based and MM\-LLMs across multiple turn\-taking tasks\.

### 2\.2 Human Performance

Several studies have explored human performance on related turn\-taking tasks\. The AMI corpus includes addressee annotations that were carefully produced by trained annotators with access to the entire conversation and guided by dialogue act information\. Inter\-annotator agreement for these annotations has been reported to be moderate \(Krippendorff's $\alpha \approx 0.45$–$0.56$\)\[Akker2009\-zf\], indicating that addressee identification is inherently ambiguous\. However, in natural conversation, participants infer addressees intuitively, without explicitly referring to dialogue act definitions\. The extent to which such online human judgments align with annotations remains unclear\.

Human ability to predict turn changes has also been investigated\. De Ruiter et al\.\[de2006projecting\] reported that listeners predict turn ends approximately 200 milliseconds before the current speaker finishes\. Casillas and Frank\[casillas2013development\] found that even 1–2\-year\-old children can predict turn changes by integrating lexical and prosodic cues\. However, their experiments were limited to dyadic conversations and did not quantify utterance\-level prediction accuracy\. Hilgert and Niehues\[Hilgert2025\-rl\] evaluated human performance on next speaker prediction in MPCs, but their evaluation relied solely on textual information\. This setup may underestimate human performance in natural conversations where multimodal cues such as gaze and prosody are available\.

To the best of our knowledge, no prior study has quantitatively evaluated human performance on addressee detection, turn\-change prediction, and next speaker prediction under a multimodal and online setting in MPCs\. Our study provides the first unified comparison of humans and models on these tasks, enabling direct assessment of their relative performance\.

## 3 Task Definition

In this study, we evaluate turn\-taking prediction in MPCs through three tasks: \(1\) addressee detection, \(2\) turn\-change prediction, and \(3\) next speaker prediction\. In addition to evaluating models, we also measure human performance on the same tasks to clarify the gap between humans and current models\.

In our experiments, systems perform only two tasks: \(1\) addressee detection and \(2\) next speaker prediction at the utterance level, from which turn\-change labels can be directly derived\. This formulation simplifies the prediction process while enabling evaluation of different aspects of turn\-taking\.

### 3\.1 Addressee detection

Let a conversation be represented as an $N$\-length sequence of utterances $(u_1, u_2, \dots, u_N)$\. Each utterance $u_i$ is associated with a speaker ID $s_i \in \mathcal{P}$, a transcription $t_i$, an utterance\-level audio segment $a_i$, and an utterance\-level video segment $v_i$, where $\mathcal{P}$ denotes the set of participants and $|\mathcal{P}| = K$\. In our experiments using the AMI corpus, $K = 4$\. At time step $i$, the goal is to predict the addressee label $y_i \in \mathcal{P} \cup \{\texttt{Group}, \texttt{None}\}$ for the current utterance $u_i$\. `Group` indicates multiple participants, and `None` indicates that the utterance is not directed to any specific participant\. Addressee detection predicts $y_i$ from $(s_i, t_i, a_i, v_i, c_i)$, where $c_i$ is conversational context derived from preceding utterances\. The form of the conversational context $c_i$ depends on the systems \(see [Section 4\.2](https://arxiv.org/html/2606.17542#S4.SS2)\)\.

For evaluation, we report classification accuracy \($\mathrm{Acc}$\)\. To account for class imbalance, we additionally report the macro\-averaged F1 score \($\mathrm{F1}_{\mathrm{ma}}$\) over all classes\.
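The two addressee metrics can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' evaluation code: the function names and the toy labels are hypothetical, and it assumes that a class with no true positives contributes an F1 of zero to the macro average.

```python
LABELS = ["A", "B", "C", "D", "Group", "None"]

def accuracy(gold, pred):
    """Fraction of utterances whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 (assumption: empty classes score 0)."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data (hypothetical): a majority-class predictor that always says Group.
gold = ["Group", "A", "Group", "B", "None", "Group"]
pred = ["Group"] * len(gold)
acc = accuracy(gold, pred)
f1_ma = macro_f1(gold, pred)
```

On this toy input the predictor is right half of the time, yet its macro F1 is far lower because five of the six classes score zero, which is why the macro average is reported alongside accuracy.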

### 3\.2 Turn\-change and next speaker prediction

We consider two closely related tasks: turn\-change prediction and next speaker prediction\. Both tasks are defined based on predicting who will speak immediately after each utterance $u_i$\. Models and humans predict a possibly multi\-valued set of next speaker candidates $\hat{S}_{i+1} \subseteq \mathcal{P}$ from the current inputs $(s_i, t_i, a_i, v_i, c_i)$\. \(Footnote 2: In a pilot study, subjects often found it difficult to commit to a single next speaker candidate\. Therefore, both models and humans are allowed to output one or more candidates\.\)

Turn\-change prediction: Turn\-change prediction is the task of predicting whether the speaker will change in the next utterance\. We define the ground\-truth turn\-change label $r_i$ as

$$
r_i = \begin{cases} \texttt{Hold} & \text{if } s_{i+1} = s_i, \\ \texttt{Shift} & \text{otherwise}. \end{cases}
$$

The predicted label is derived from $\hat{S}_{i+1}$ as

$$
\hat{r}_i = \begin{cases} \texttt{Hold} & \text{if } \hat{S}_{i+1} = \{s_i\}, \\ \texttt{Shift} & \text{otherwise}. \end{cases}
$$

A `Hold` is predicted only when the predicted set contains exclusively the current speaker\. For turn\-change prediction, we report $\mathrm{Acc}$ and $\mathrm{F1}_{\mathrm{ma}}$ over the two classes\.

Next speaker prediction: Next speaker prediction is evaluated only on turns with a speaker change \($r_i = \texttt{Shift}$\)\. We report a strict accuracy that counts a prediction as correct only when the predicted set contains exactly one speaker and it matches the ground\-truth next speaker, i\.e\., $\hat{S}_{i+1} = \{s_{i+1}\}$\. Because this strict criterion does not account for ambiguous situations, we also report $\mathrm{Precision}$, $\mathrm{Recall}$, and $\mathrm{F1}$, computed based on whether the ground\-truth next speaker $s_{i+1}$ is included in $\hat{S}_{i+1}$\.
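The definitions above can be sketched as follows. The function names and toy examples are hypothetical, and the text does not spell out how precision and recall are normalized: this sketch assumes precision is hits divided by the total number of predicted candidates and recall is hits divided by the number of scored turns.

```python
def turn_change_label(current_speaker, next_candidates):
    """Hold only if the predicted set is exactly {current speaker}; else Shift."""
    return "Hold" if set(next_candidates) == {current_speaker} else "Shift"

def next_speaker_scores(examples):
    """examples: list of (s_i, s_next, predicted_set) tuples.

    Only gold Shift turns (s_next != s_i) are scored.
    Assumption: precision = hits / total predicted candidates,
                recall    = hits / number of scored turns.
    """
    shifts = [(nxt, set(pred)) for cur, nxt, pred in examples if nxt != cur]
    hits = sum(nxt in pred for nxt, pred in shifts)
    n_pred = sum(len(pred) for _, pred in shifts)
    strict = sum(pred == {nxt} for nxt, pred in shifts) / len(shifts)
    precision = hits / n_pred if n_pred else 0.0
    recall = hits / len(shifts)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"strict_acc": strict, "precision": precision, "recall": recall, "f1": f1}

examples = [
    ("A", "B", {"B"}),       # exact single-candidate hit
    ("B", "C", {"A", "C"}),  # gold speaker is one of two candidates
    ("C", "C", {"C"}),       # gold Hold: excluded from next-speaker scoring
    ("C", "D", {"A"}),       # miss
]
scores = next_speaker_scores(examples)
```

Under this reading, naming extra candidates raises recall at the cost of precision, while strict accuracy rewards only a single correct candidate.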

## 4 Dataset

We constructed an evaluation set for the above tasks\. We used the AMI corpus, which consists of 100 hours of meeting recordings, as in previous studies\[Malik2019\-pb,Hilgert2025\-rl\]\. The AMI corpus provides synchronized audio recordings, video streams, and manual transcriptions\. This corpus includes scenario\-based meetings where four participants, each playing different roles in a design team, complete a design project over the course of a day\. A subset of the corpus is annotated with dialogue acts, and within a further subset of these sessions, the dialogue act annotations include manually marked addressee information\. Separately, another subset of the corpus is annotated with FOA, which represents the speaker's gaze target\. FOA labels take one of eight values: one of the four participants, `table`, `whiteboard`, `slidescreen`, or `Unspecified`\. Further details on the annotations are provided on the AMI corpus website \(Footnote 3: [https://groups\.inf\.ed\.ac\.uk/ami/corpus/annotation\.shtml](https://groups.inf.ed.ac.uk/ami/corpus/annotation.shtml)\)\.

In the original annotation, addressee labels are assigned to spans of consecutive words\. To obtain utterance\-level labels suitable for our tasks, we construct addressee labels $y_i \in \mathcal{P} \cup \{\texttt{Group}, \texttt{None}\}$ for each utterance as follows\. For each word in a given utterance, we collect all of the original addressee annotations associated with that word\. If no annotation is found, the utterance is labeled as `None`\. If one annotation is found, the corresponding addressee label is assigned to that utterance\. If multiple annotations are found, the label `Group` is assigned\.

For turn\-change and next speaker prediction in [Section 3\.2](https://arxiv.org/html/2606.17542#S3.SS2), we simply use the speaker label $s_{i+1}$ as the ground\-truth target for the given utterance $u_i$, in line with previous work\[Hilgert2025\-rl\]\. Utterances at the end of a session, for which no subsequent speaker exists, are excluded from these evaluations\.
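The word-to-utterance label mapping described above can be sketched as a small function. The function name and input layout are hypothetical, and the sketch assumes that "multiple annotations" means more than one distinct addressee label within the utterance; the AMI annotation files themselves are not parsed here.

```python
def utterance_addressee(word_annotations):
    """Collapse word-level addressee annotations into one utterance label.

    word_annotations: one list of addressee labels per word in the utterance.
    Assumption: "multiple annotations" means more than one distinct label.
    """
    distinct = {label for labels in word_annotations for label in labels}
    if not distinct:
        return "None"               # no annotation on any word
    if len(distinct) == 1:
        return next(iter(distinct))  # a single addressee label
    return "Group"                   # several different labels
```

For example, `utterance_addressee([["B"], ["B"], []])` returns `"B"`, while an utterance whose words carry both `"A"` and `"C"` is mapped to `"Group"`.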

For audio, we use close\-talking headset microphone recordings for each participant\. Due to speech overlap, the microphone of one speaker occasionally captures the voice of another participant\. In some cases, this leakage makes the identity of the next speaker partially observable\. Manual inspection confirmed such leakage in approximately 8\.5% of utterances\. Because our goal is to evaluate continuous turn\-taking prediction in realistic multi\-party interactions, we did not exclude these instances\. Only short utterances completely overlapping other utterances \(typically fillers or backchannels\) were excluded\.

### 4\.1 Data Selection

Table 2: Statistics of the dataset\. †Single refers to labels `A`, `B`, `C`, or `D`\.

We manually inspected the corpus and selected sessions according to the following criteria: \(i\) availability of addressee and FOA labels, \(ii\) a camera view that clearly captures the faces of all participants, and \(iii\) availability of headset microphone recordings for all speakers\. As a result, the following 11 sessions remained: IS1000a, IS1001a, IS1001b, IS1001c, IS1003d, IS1006b, IS1006d, IS1008a, IS1008b, IS1008c, IS1008d\. Of these, ten sessions were used for evaluation; IS1000a was excluded because it was used for debugging\.

In addition, we constructed a subset for human evaluation using IS1001a and IS1008a, due to time and resource constraints\. These sessions involve different groups of participants, ensuring diversity in speaker characteristics\. In the AMI corpus, the session suffixes \(a, b, c, d\) correspond to four phases of the same meeting conducted by a single group \(e\.g\., IS1001 or IS1008\)\. We therefore selected phase\-a sessions, which are the least influenced by preceding conversational context\. [Table 2](https://arxiv.org/html/2606.17542#S4.T2) summarizes the statistics of the selected data\. Except for a higher proportion of individually addressed interactions, the turn\-change ratio and addressee distribution are comparable to those of the full sessions\.

Table 3: Summary of input features\.

### 4\.2 Features

[Table 3](https://arxiv.org/html/2606.17542#S4.T3) summarizes the input features\. In the primary setting, ground\-truth $s_i$ and $t_i$, and the utterance\-aligned raw signals $a_i$ and $v_i$, are provided\. Speaker IDs are overlaid on the video to associate visual streams with $\mathcal{P}$ \(see [Figure 1](https://arxiv.org/html/2606.17542#S5.F1)\)\.

The form of the conversational context $c_i$ varies depending on the systems\. For LLMs \([Section 5\.2](https://arxiv.org/html/2606.17542#S5.SS2) and [5\.3](https://arxiv.org/html/2606.17542#S5.SS3)\), it consists of speaker\-transcription pairs from preceding utterances, $(s_{<i}, t_{<i})$\. For supervised models \([Section 5\.1](https://arxiv.org/html/2606.17542#S5.SS1)\), following prior work\[Malik2019\-pb\], we provide the speaker ID of the preceding utterance, $c_i = s_{i-1}$\. Human subjects \([Section 6](https://arxiv.org/html/2606.17542#S6)\) are not explicitly given contextual information\. However, because they solve the tasks sequentially, they may implicitly rely on contextual information retained in memory from previous utterances\.

We additionally consider optional inputs\. For supervised models, the transcription is represented via a text\-derived feature vector $\phi_i$ computed from $t_i$, rather than raw text\. Specifically, $\phi_i$ consists of second\-person pronoun usage \(you\), presence of conjunctions, and sentence length \(see \[Malik2019\-pb\] for details\)\. To assess robustness under practical conditions, we also consider an ASR setting in which $t_i$ is replaced with the ASR transcript $\tilde{t}_i$ \(see [Section 7\.3](https://arxiv.org/html/2606.17542#S7.SS3)\)\. We also evaluated a modular setting in which FOA labels $g_i$ are provided as visual cues \([Section 7\.4](https://arxiv.org/html/2606.17542#S7.SS4)\)\.

Table 4: Prompt used for the MM\-LLM \(Next Speaker Prediction\)\. The prompt was slightly edited for brevity\.

Description: You are an expert model of multi\-party conversation analysis\. Your task is to predict who will speak next in a meeting with four participants: A, B, C, and D\. You are given the current speaker ID, the current utterance \(transcription\), and an audio\-visual video clip, optionally with conversation context \(previous speakers and utterances\) and focus of attention \(where the current speaker is visually attending\)\. The video shows four participants in a meeting room\. Seating positions: A \(back right\), B \(back left\), C \(front right\), D \(front left\)\. Video frames are overlaid with participant labels \(A–D\)\. Participants may stand or move; use overlaid labels rather than absolute positions\. You may use visual cues \(gaze direction, head orientation, posture, gestures\) and prosodic cues \(intonation, pauses, turn\-final cues\) to infer the next speaker\. The next speaker may be the same as the current speaker\. You may predict one or multiple candidates if ambiguous\.

Output: One or more labels from \{A, B, C, D\}\.

Context \(Speaker: Transcription\):

- C: Oops\.
- C: That's as far as it goes\.
- A: Hi guys uh good morning everybody here\.

Current speaker: A\.

Transcription: I am a project manager for this new project which we are going to discuss now\.

Focus of attention: Table\.

Video: \(video clip path\)
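The prompt layout in Table 4 could be assembled and its answer parsed roughly as below. This is a sketch under assumptions: `build_prompt` and `parse_candidates` are hypothetical helpers, the exact field order and wording sent to the models may differ from the abridged table, and the parser assumes the model answers with bare labels rather than free-form sentences.

```python
import re

def build_prompt(description, context, speaker, transcription, foa=None, video=None):
    """Assemble the fields of Table 4 into one prompt string.

    context: list of (speaker, transcription) pairs for preceding utterances.
    foa and video are optional, mirroring the optional inputs in the table.
    """
    lines = [f"Description: {description}",
             "Output: One or more labels from {A, B, C, D}.",
             "Context (Speaker: Transcription):"]
    lines += [f"{s}: {t}" for s, t in context]
    lines += [f"Current speaker: {speaker}.", f"Transcription: {transcription}"]
    if foa is not None:
        lines.append(f"Focus of attention: {foa}.")
    if video is not None:
        lines.append(f"Video: {video}")
    return "\n".join(lines)

def parse_candidates(answer):
    """Extract the set of standalone participant labels A-D from a model answer."""
    return set(re.findall(r"\b[ABCD]\b", answer))

prompt = build_prompt(
    "You are an expert model of multi-party conversation analysis.",
    [("C", "Oops."), ("C", "That's as far as it goes.")],
    speaker="A",
    transcription="Hi guys uh good morning everybody here.",
    foa="Table",
)
```

Returning a set from the parser matches the task definition, in which a prediction may contain one or several next speaker candidates.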

## 5 Model Evaluation

We evaluate three classes of models: conventional supervised learning models, off\-the\-shelf text\-based LLMs, and off\-the\-shelf MM\-LLMs\. As a naive baseline, we report majority or chance\-level strategies for each task\. For addressee detection, the naive baseline always predicts the `Group` label\. For turn\-change prediction, it always predicts the `Shift` label\. For next speaker prediction, we report the expected accuracy of random selection among the four participants, which is 25\.0%\.
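As a sanity check on the majority-class baseline, its macro-averaged F1 follows directly from its accuracy. The derivation below is a worked sketch rather than something stated in the paper: $p$ (the share of the majority class, which equals the baseline accuracy) and $C$ (the number of classes) are symbols introduced here, and classes that are never predicted are assumed to score an F1 of zero.

$$
\mathrm{Precision} = p, \quad \mathrm{Recall} = 1
\;\Rightarrow\;
\mathrm{F1}_{\text{majority}} = \frac{2p}{1 + p},
\qquad
\mathrm{F1}_{\mathrm{ma}} = \frac{1}{C} \cdot \frac{2p}{1 + p}.
$$

With the subset values in Table 6, addressee detection ($p = 0.284$, $C = 6$) gives $\frac{1}{6} \cdot \frac{0.568}{1.284} \approx 0.074$, and turn-change prediction ($p = 0.646$, $C = 2$) gives $\frac{1}{2} \cdot \frac{1.292}{1.646} \approx 0.392$. Both agree with the reported 7.4 and 39.3 up to rounding of the accuracy.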

### 5\.1 Supervised Learning Models

We implemented four supervised models used in prior work on addressee detection\[Malik2019\-pb\]: Naive Bayes\[rish2001empirical\], Random Forest\[liaw2002classification\], Multi\-layer Perceptron \(MLP\)\[kruse2022multi\], and Support Vector Machine \(SVM\)\[steinwart2008support\]\. Hyperparameters were set following\[Malik2019\-pb\]\. The input consists of the current speaker $s_i$, the text\-derived features $\phi_i$, and the context $c_i = s_{i-1}$\.

All models were implemented using scikit\-learn\[kramer2016scikit\]\. We conducted five\-fold cross\-validation over the ten sessions using `GroupKFold`, ensuring that utterances from the same session did not appear in both training and test sets\. Predictions on the test split of each fold were aggregated, and performance metrics were computed over the combined test data\.
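The session-level split can be illustrated without any dependencies. The sketch below is a pure-Python stand-in for the behaviour the text attributes to scikit-learn's `GroupKFold`, not its exact algorithm: `grouped_folds` is a hypothetical helper, and the three utterances per session are toy data.

```python
from collections import defaultdict

def grouped_folds(session_ids, n_splits=5):
    """Yield (train_idx, test_idx) so that no session is split across the two.

    Sessions are assigned greedily, largest first, to the currently smallest fold.
    This is a stand-in for scikit-learn's GroupKFold, not its exact algorithm.
    """
    by_session = defaultdict(list)
    for idx, sid in enumerate(session_ids):
        by_session[sid].append(idx)
    folds = [[] for _ in range(n_splits)]
    for sid in sorted(by_session, key=lambda s: -len(by_session[s])):
        min(folds, key=len).extend(by_session[sid])
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test

# The ten evaluation sessions, with three toy utterances each (hypothetical counts).
SESSIONS = ["IS1001a", "IS1001b", "IS1001c", "IS1003d", "IS1006b",
            "IS1006d", "IS1008a", "IS1008b", "IS1008c", "IS1008d"]
session_ids = [sid for sid in SESSIONS for _ in range(3)]
splits = list(grouped_folds(session_ids))
```

Because every utterance lands in exactly one test fold, predictions from the five folds can be pooled and scored once over the combined test data, as described above.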

### 5\.2 LLMs

As LLMs, we evaluated three variants of Qwen3\[Yang2025\-vg\], a recently proposed model family with a Transformer\-based decoder architecture\[vaswani2017attention\]: Qwen3\-8B, Qwen3\-14B, and Qwen3\-32B\. For these models, the thinking mode was enabled by default, and its contribution was examined through an ablation study\. Inference was performed in bfloat16 precision using greedy decoding\. For each utterance $u_i$, text\-based LLMs were given the task instruction and the text\-based inputs $(s_i, t_i, c_i)$\. The prompt for text\-based LLMs is the same as that for MM\-LLMs described below, except for the absence of audio\-visual inputs $(a_i, v_i)$\.

### 5\.3 MM\-LLMs

We evaluate MM\-LLMs that jointly process text, audio, and video inputs\. For each utterance $u_i$, the models were given the task instruction and the multimodal inputs $(s_i, t_i, a_i, v_i, c_i)$\. [Table 4](https://arxiv.org/html/2606.17542#S4.T4) presents an example prompt for next speaker prediction\.

We evaluate recent open\-weight end\-to\-end multimodal foundation models, Qwen2\.5\-Omni\[Xu2025\-mb\]and Qwen3\-Omni\[Xu2025\-vy\], which have demonstrated strong performance in both unimodal and multimodal understanding tasks\. These models adopt a Thinker–Talker architecture, where the Thinker module is responsible for text generation and reasoning, and the Talker module generates streaming speech tokens\. In this study, only the Thinker module was used\. From the Qwen2\.5\-Omni family, we selected Qwen2\.5\-Omni\-7B, a 7B\-parameter model\. For Qwen3\-Omni, we evaluated Qwen3\-Omni\-30B\-A3B\-Thinking \(Qwen3\-Omni\-30B\), a 30B\-parameter mixture\-of\-experts model optimized for reasoning\. The instruction\-tuned variant \(Qwen3\-Omni\-30B\-A3B\-Instruct\) was excluded because preliminary experiments showed inferior performance compared to the Thinking model\. Inference was performed in bfloat16 precision using greedy decoding\.

We also evaluated the closed\-source multimodal model Gemini 2\.5 Pro via the official API\. It was given the same task instructions and input features as the local multimodal models\. Inference was conducted with the temperature set to 1\.0\.

![Refer to caption](https://arxiv.org/html/2606.17542v1/figures/mpceval_screenshot.png)

Figure 1: Screenshot of the experiment tool\.

## 6 Human Evaluation

To compare human and model performance, we conducted a human evaluation under the same task formulation described in [Section 3](https://arxiv.org/html/2606.17542#S3)\. Participants simultaneously performed addressee detection, turn\-change prediction, and next speaker prediction in an online setting, without access to future utterances\.

We developed a web\-based interface \([Figure 1](https://arxiv.org/html/2606.17542#S5.F1)\) that plays videos and audio from the beginning and automatically pauses them at the end of each utterance\. At each pause for utterance $u_i$, subjects were given the current speaker ID and the transcription, together with the corresponding audio and video clips, i\.e\., $(s_i, t_i, a_i, v_i)$\. They then answered two questions: Q1\. Who is being addressed? and Q2\. Who is likely to speak next? Formally, they predicted the addressee label $y_i$ \([Section 3\.1](https://arxiv.org/html/2606.17542#S3.SS1)\) and a set of next speaker candidates $\hat{S}_{i+1} \subseteq \mathcal{P}$ \([Section 3\.2](https://arxiv.org/html/2606.17542#S3.SS2)\)\. After answering, playback resumed until the next utterance boundary\.

Subjects were instructed to respond based on their intuition\. To ensure consistency in the annotation criteria, all participants engaged in a practice session before the main experiment and were provided with example responses\. To reduce participant burden, each session was divided into segments of approximately five minutes\. Thus, participants could rely on at most about five minutes of prior conversational context when making predictions\. For a fair comparison, when evaluating models on the human\-evaluation subset, we used the same segmented session chunks, rather than the full sessions\. Twelve non\-native English speakers who regularly use English in professional settings participated in the experiment\. Each utterance received three independent annotations\. It should be noted that our experimental design differs from natural conversational settings in several aspects and may underestimate human performance\. We discuss these limitations in [Section 9](https://arxiv.org/html/2606.17542#S9)\.
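One way to restrict a model's context to the same chunk a human saw is sketched below. It is an illustration only: `context_within_segment` is a hypothetical helper, the utterance timings are toy values, and it assumes fixed 300-second windows, whereas the text says only that segments lasted approximately five minutes.

```python
def context_within_segment(utterances, index, segment_seconds=300.0):
    """Return (speaker, text) pairs that precede utterance `index` in its segment.

    utterances: list of dicts with 'start' (seconds), 'speaker', and 'text'.
    Assumption: segments are fixed windows of `segment_seconds`.
    """
    segment = int(utterances[index]["start"] // segment_seconds)
    return [(u["speaker"], u["text"]) for u in utterances[:index]
            if int(u["start"] // segment_seconds) == segment]

# Toy session (hypothetical timings): the third utterance opens a new segment.
utterances = [
    {"start": 10.0, "speaker": "A", "text": "Good morning."},
    {"start": 290.0, "speaker": "B", "text": "Shall we start?"},
    {"start": 305.0, "speaker": "C", "text": "Yes."},
    {"start": 320.0, "speaker": "A", "text": "Okay."},
]
```

With these timings, the utterance at 305 s starts a segment and therefore receives an empty context, mirroring a participant who begins a new five-minute chunk.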

Inter\-participant agreement, measured by Krippendorff's $\alpha$, was 0\.67 for addressee detection and 0\.48 for next speaker prediction\. The agreement for addressee detection indicates a moderate level of consistency among participants, whereas the lower agreement for next speaker prediction reflects the inherent ambiguity of anticipating future turns in MPCs\. The agreement for addressee detection is slightly higher than the previously reported inter\-annotator agreement for AMI addressee labels \(approximately 0\.45–0\.56\)\[Akker2009\-zf\]\. This may seem counterintuitive, given that the original annotations were produced by trained annotators under stricter guidelines\. One possible explanation is the difference in labeling granularity: AMI annotations were assigned at the word level, whereas our evaluation was conducted at the coarser utterance level, which may have resulted in higher agreement\. For next speaker prediction, our agreement is substantially higher than the Fleiss' kappa of 0\.17 reported in prior work\[Hilgert2025\-rl\]\. This difference may stem from experimental settings, as the previous study relied only on textual information, whereas our participants had access to multimodal cues\.
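For readers unfamiliar with the agreement statistic, Krippendorff's $\alpha$ for nominal labels can be computed from a coincidence matrix, as in the sketch below. The function name and toy ratings are hypothetical. The sketch covers only single-label ratings such as addressee labels; how the set-valued next speaker answers were handled is not specified in the text and is not modeled here.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of label lists, one per utterance (one label per rater).
    Units with fewer than two ratings carry no agreement information and are
    skipped. Undefined (division by zero) if only one label value ever occurs.
    """
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        for a, b in permutations(ratings, 2):  # ordered pairs of distinct raters
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n * (n - 1))
    return 1.0 - observed / expected

perfect = krippendorff_alpha_nominal([["A", "A", "A"], ["B", "B", "B"]])
chance = krippendorff_alpha_nominal([["A", "A"], ["A", "B"]])
```

In the first toy case all raters agree and $\alpha$ is 1; in the second, observed disagreement equals the disagreement expected by chance, so $\alpha$ is 0.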

Table 5: Overall results on the full set. Input features: $a_i$ (audio segment), $v_i$ (video clip), $s_i$ (speaker ID), $t_i$ (transcription), and $c_i=(s_{<i},t_{<i})$ (conversational context). $\mathrm{Acc}$ is reported for addressee detection (Addr.) and turn-change prediction (Turn), and $\mathrm{F1}$ for next speaker prediction (Next). The naive baseline corresponds to predicting the majority class for addressee detection (Group) and turn-change prediction (Shift), and random selection for next speaker prediction.

Table 6: Comparison of models and humans in the subset. † Significantly better than Human; ‡ significantly worse than Human (paired bootstrap, 95% CI).

| | Addr. $\mathrm{Acc}$ | Addr. $\mathrm{F1}_{\mathrm{ma}}$ | Turn $\mathrm{Acc}$ | Turn $\mathrm{F1}_{\mathrm{ma}}$ | Next $\mathrm{Acc}$ | Next $\mathrm{P}$ | Next $\mathrm{R}$ | Next $\mathrm{F1}$ |
|---|---|---|---|---|---|---|---|---|
| Naive baseline | 28.4 | 7.4 | 64.6 | 39.3 | 25.0 | 25.0 | 25.0 | 25.0 |
| Human | 66.6 | 67.5 | 75.0 | 71.5 | 47.2 | 52.2 | 70.7 | 60.1 |
| Qwen3-14B | 51.5‡ | 52.0‡ | 67.4 | 53.8‡ | 53.7 | 60.3 | 81.8 | 69.4† |
| Gemini 2.5 Pro | 61.4 | 57.4 | 69.8 | 60.6‡ | 45.3 | 50.7 | 66.8 | 57.7 |

## 7 Results

### 7.1 Model comparison

Supervised models vs. LLMs: [Table 5](https://arxiv.org/html/2606.17542#S6.T5) shows the overall performance of models. SVM achieved the highest accuracy in addressee detection. In turn-change prediction, it also outperformed all LLMs except Gemini 2.5 Pro. These results indicate that task-specific supervised models can surpass general LLMs in these tasks, even without multimodal information. In contrast, all text-based LLMs and Gemini 2.5 Pro outperformed supervised models in next speaker prediction. This task depends on the conversational context [Hilgert2025-rl]. Unlike supervised models, which rely only on features from the immediately preceding utterance as context, LLMs can leverage longer conversational histories, likely contributing to their advantage in this task.

LLM model size: Among text-based LLMs, we found no monotonic relationship between model size and performance. Qwen3-14B outperformed both the smaller 8B and larger 32B variants across tasks. This suggests that the 14B model already has sufficient reasoning ability for these tasks, and that the additional parameters of the 32B model do not necessarily yield further gains. For example, in addressee detection, the 32B model more frequently predicted individual classes (A–D) when the correct label was the majority class Group, leading to reduced accuracy. This may reflect the tendency of the 32B model to prefer detailed predictions over simple answers.

Comparison with prior work: Text-based LLMs outperformed the naive baseline in both addressee detection and next speaker prediction. This contrasts with the findings of Inoue et al. [Koji2025-zu], who reported near-chance performance of GPT-4o on similar tasks in Japanese three-party conversations. The discrepancy may stem from differences in models, language, and domain (Japanese three-party goal-free discussions vs. English four-party meetings), and the amount of contextual input provided (e.g., five utterances vs. entire conversation history). Investigating how task difficulty varies across languages and domains remains an important direction for future research.

MM-LLM: The local MM-LLMs, Qwen2.5-Omni-7B and Qwen3-Omni-30B, did not outperform the SVM or the text-based LLMs. This suggests that the audio-visual features were not effectively leveraged in these models. In contrast, Gemini 2.5 Pro obtained the highest scores among LLMs in addressee detection and turn-change prediction. We discuss the factors underlying this strong performance of Gemini 2.5 Pro in [Section 8](https://arxiv.org/html/2606.17542#S8). In next speaker prediction, which depends more strongly on conversational context, Gemini 2.5 Pro performed worse than the text-based LLMs. Possibly, incorporating audio-visual inputs increased the complexity of the input, making effective contextual reasoning more difficult, as discussed in [Section 8](https://arxiv.org/html/2606.17542#S8).

Table 7: Ablation results for Qwen3-14B. Input features: $a_i$ (audio segment), $v_i$ (video clip), $s_i$ (speaker ID), $t_i$ (transcription), $\tilde{t}_i$ (ASR transcription), $c_i$ (context), and $g_i$ (FOA). Input columns are grouped as Raw ($a_i$, $v_i$) and Annotated ($s_i$, $t_i$, $\tilde{t}_i$, $c_i$, $g_i$). $\diamond$: Setting without thinking mode. † Significantly better than (a); ‡ significantly worse than (a).

| Setting | $a_i$ | $v_i$ | $s_i$ | $t_i$ | $\tilde{t}_i$ | $c_i$ | $g_i$ | Addr. ($\mathrm{Acc}$) | Turn ($\mathrm{Acc}$) | Next ($\mathrm{F1}$) |
|---|---|---|---|---|---|---|---|---|---|---|
| (a) Primary | | | ✓ | ✓ | | ✓ | | 52.3 | 66.4 | 51.1 |
| (b) + ASR | | | ✓ | | $\hat{\checkmark}$ | ✓ | | 51.2 | 65.6 | 51.1 |
| (c) + FOA | | | ✓ | ✓ | | ✓ | ✓ | 55.0 | 65.8 | 67.7 |
| (d) – Context | | | ✓ | ✓ | | × | | 42.5‡ | 56.2‡ | 36.9‡ |
| (e) – Think $\diamond$ | | | ✓ | ✓ | | ✓ | | 47.3‡ | 64.9 | 48.1‡ |

Table 8: Ablation results for Gemini 2.5 Pro in the subset.

### 7.2 Humans vs. Models

[Table 6](https://arxiv.org/html/2606.17542#S6.T6) compares human performance with the best-performing LLMs in the subset. Human performance was computed by aggregating predictions across all participants. For addressee detection and turn-change prediction, humans significantly outperformed both Qwen3-14B and Gemini 2.5 Pro. Multimodal cues such as gaze direction and prosody are known to play an important role in these tasks [Jovanovic2004-tw, Ishii2013-me, Ohsuga2005-jc], and human subjects may have been able to utilize these features more effectively, resulting in higher accuracy. In contrast, for next speaker prediction, the text-based LLM achieved higher performance, and the MM-LLM achieved performance comparable to humans. Hilgert and Niehues [Hilgert2025-rl] similarly reported that text-based LLMs outperformed humans in next speaker prediction under text-only conditions. Our results are consistent with their findings and further show that this advantage persists even when humans have access to multimodal information. This may be attributed to the LLMs' stronger ability to model the conversational context $c_i=(s_{<i},t_{<i})$, which appears to be critical for identifying the next speaker, as discussed in [Section 8](https://arxiv.org/html/2606.17542#S8). Note that both humans and LLMs were evaluated under the same constraint described in [Section 6](https://arxiv.org/html/2606.17542#S6), i.e., the available preceding context was limited to at most about five minutes of video.

Importantly, these results and the moderate inter-participant agreement (reported in [Section 4](https://arxiv.org/html/2606.17542#S4)) indicate that all three tasks, particularly next speaker prediction, are challenging even for humans. To better understand what makes these tasks difficult, we further analyze how prediction performance varies across conversational segments in [Section 8.3](https://arxiv.org/html/2606.17542#S8.SS3).
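
The significance markers in Table 6 are based on a paired bootstrap with a 95% CI. A minimal sketch of such a test for accuracy is given below, assuming per-utterance correctness indicators (1 or 0) for the two systems on the same utterances. The function name, the number of resamples, and the percentile interval are illustrative assumptions, not the authors' exact procedure.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, level=0.95, seed=0):
    """Percentile CI for accuracy(a) - accuracy(b) on paired utterances.

    correct_a, correct_b: lists of 0/1 flags, aligned by utterance.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample utterances with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((1 - level) / 2 * n_resamples)]
    upper = diffs[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical correctness flags for a model and for humans.
    model = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
    human = [1, 1, 0, 1, 0, 0, 0, 1, 1, 0]
    print(paired_bootstrap_ci(model, human))
```

Under this reading, system a would be marked as significantly better than system b when the whole interval lies above zero, and significantly worse when it lies below zero.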

### 7.3 Effect of estimated transcriptions

The primary setting assumed access to ground-truth speaker ID ($s_i$) and manual transcriptions ($t_i$). To evaluate model robustness under more practical conditions, we replaced manual transcriptions with ASR outputs ($\tilde{t}_i$). We used Whisper large-v3 ([https://huggingface.co/openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)) [radford2022robustspeechrecognitionlargescale] to transcribe the individual headset microphones, simulating a meeting scenario in which each participant uses an individual microphone. The word error rate (WER) was 24.12%. For Qwen3-14B ([Table 7](https://arxiv.org/html/2606.17542#S7.T7) (a) vs. (b)), replacing manual transcripts with ASR slightly decreased performance in addressee detection and turn-change prediction, while next speaker prediction performance remained unchanged. For Gemini 2.5 Pro ([Table 8](https://arxiv.org/html/2606.17542#S7.T8) (a) vs. (b)), addressee detection performance slightly decreased, while turn-change prediction and next speaker prediction improved (due to cost constraints, the ablation study for Gemini 2.5 Pro was conducted on the subset only). Despite the relatively high WER, the overall impact was limited, and in some cases performance improved. Many ASR errors resulted from the omission of fillers and minor disfluencies that were included in the ground-truth transcripts. While these omissions contributed to the WER, they often had little impact on the underlying meaning of the utterances. These results suggest that the LLM-based approach is reasonably robust in practical environments. However, in more general scenarios without individual headset microphones, i.e., where speaker diarization and multi-talker ASR are required, these tasks may be more challenging. Evaluating performance under such conditions is part of our future work.
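
To make the point about fillers concrete, the sketch below computes WER as the word-level edit distance divided by the reference length. The function name and the example sentences are invented for illustration; text normalization here is limited to lowercasing and whitespace splitting, which may differ from the scoring setup used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example: the ASR output drops two fillers but keeps the meaning.
    manual = "um yeah I think uh we should start"
    asr = "yeah I think we should start"
    print(word_error_rate(manual, asr))  # 0.25
```

In this toy case, two dropped fillers out of eight reference words already give a WER of 25%, even though the content of the utterance is intact.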

### 7.4 Effect of gaze information

We also evaluated a modular setting in which FOA labels ($g_i$) were provided as explicit visual cues. For both Qwen3-14B ([Table 7](https://arxiv.org/html/2606.17542#S7.T7) (a) vs. (c)) and Gemini 2.5 Pro ([Table 8](https://arxiv.org/html/2606.17542#S7.T8) (a) vs. (c)), incorporating FOA consistently improved performance in addressee detection and next speaker prediction. This finding aligns with prior research showing that gaze information is a useful signal for identifying the addressee and anticipating the next speaker [Jovanovic2006-xa, Ishii2013-me]. In contrast, the effect of FOA on turn-change prediction was mixed. Performance slightly decreased for Qwen3-14B but slightly improved for Gemini 2.5 Pro. Prior work has shown that turn-change prediction is more strongly associated with gaze transition patterns than with the speaker's gaze direction at a given moment. Since FOA represents only the current gaze target at each utterance, it may not capture these dynamic transition patterns.

Gemini 2\.5 Pro showed improved performance when FOA was added, despite already receiving raw audio and video inputs\. These gains suggest that the model does not fully extract or utilize gaze cues from the raw video\. Thus, explicitly annotated visual attention appears to provide complementary information even for an MM\-LLM\. We analyze the contribution of multimodal cues in this model in the next sections\.

![Refer to caption](https://arxiv.org/html/2606.17542v1/x1.png) Figure 2: Effect of context size on Qwen3-14B performance.

## 8 Analysis

### 8.1 Important features

[Table 7](https://arxiv.org/html/2606.17542#S7.T7) and [Table 8](https://arxiv.org/html/2606.17542#S7.T8) include ablation studies examining the contribution of input features. First, removing conversational context ((a) vs. (d) in [Table 7](https://arxiv.org/html/2606.17542#S7.T7) and [Table 8](https://arxiv.org/html/2606.17542#S7.T8)) led to a substantial performance degradation for Qwen3-14B and Gemini 2.5 Pro, particularly in addressee detection and next speaker prediction. These results indicate that conversational context is important in all three tasks. [Figure 2](https://arxiv.org/html/2606.17542#S7.F2) shows the performance of Qwen3-14B with different context sizes, $C=\{0,1,2,5,10,20,40,80,160\}$ utterances. Introducing even a very short context ($C=1$ or $C=2$) led to large gains over the no-context setting across all tasks. For addressee detection and next speaker prediction, performance continued to increase as context expanded up to $C=40$, after which it was largely maintained or slightly declined. In contrast, turn-change prediction peaked at $C=5$ and did not benefit from longer context. These results suggest that addressee detection and next speaker prediction rely more heavily on longer conversational context, whereas turn-change prediction depends primarily on local cues. Moreover, the improvement with increasing context was most pronounced for next speaker prediction, indicating a particularly strong dependence on contextual information.
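
The context-size sweep can be pictured as truncating the history $(s_{<i},t_{<i})$ to the last $C$ utterances before it is placed in the prompt. The sketch below shows one way to do this. The `Utterance` alias, the `build_context` helper, and the "speaker: text" line format are hypothetical and are not taken from the paper's prompts.

```python
from typing import List, Tuple

# (speaker ID, transcription) for one past utterance
Utterance = Tuple[str, str]


def build_context(history: List[Utterance], context_size: int) -> str:
    """Format the last `context_size` utterances as one line each."""
    if context_size <= 0:
        # Guard needed because history[-0:] would return the full history.
        return ""
    recent = history[-context_size:]
    return "\n".join(f"{speaker}: {text}" for speaker, text in recent)


if __name__ == "__main__":
    # Made-up meeting excerpt.
    history = [
        ("A", "Shall we start?"),
        ("B", "Yes."),
        ("C", "One moment."),
    ]
    for size in (0, 1, 2, 5):
        print(f"C={size}")
        print(build_context(history, size))
```

A context size larger than the available history simply returns everything seen so far, which matches the situation at the start of a session chunk.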

Second, we examined the ablation of transcriptions in the MM-LLM ([Table 8](https://arxiv.org/html/2606.17542#S7.T8) (e)), where predictions relied only on raw audio-visual signals and the current speaker ID. Compared to removing context alone ([Table 8](https://arxiv.org/html/2606.17542#S7.T8) (d)), performance for addressee detection and turn-change prediction was comparable, and next speaker prediction was slightly improved. This indicates that textual transcription is not the sole source of information for these tasks. However, it is difficult to determine how much the model relies on raw audio-visual cues, as the speaker ID may already provide a strong prior. When audio input was removed ([Table 8](https://arxiv.org/html/2606.17542#S7.T8) (f)), performance changed only marginally across tasks. Similarly, removing both audio and video ([Table 8](https://arxiv.org/html/2606.17542#S7.T8) (g)) resulted in comparable or even higher performance for next speaker prediction. These results suggest that, in the current setting, raw audio-visual signals do not consistently provide complementary information beyond text and speaker identity. Overall, while multimodal inputs are available, their effective integration with textual and speaker information remains limited in the present model.

Table 9: Example of thought trace of Qwen3-14B with focus-of-attention in the next speaker prediction task.

> Okay, let's see. The current speaker is D, and their transcription is "Yes yes is it like that."
> The focus of attention is the whiteboard.
> I need to figure out who is likely to speak next.
> Looking at the conversation context, there's been a discussion about drawing animals on the whiteboard. B drew an orangutan, and now D is responding to something, possibly about the drawing. The last few lines involve D saying it's a cat and others reacting.
> Since D just confirmed something about the drawing (maybe agreeing it's a cat), the next speaker might be someone reacting to that.
> A has been leading the meeting, so maybe A would continue. B was discussing the orangutan earlier, so might comment again. C hasn't spoken much lately. D might not speak again immediately if they just confirmed.
> Also, the focus is on the whiteboard, so someone might be pointing out something else on it.
> A might want to move the meeting forward, or B could have another comment. C might have an opinion but hasn't been active.
> So possible candidates are A or B. Maybe both. I'll go with A and B.

### 8.2 Reasoning ability of LLMs

[Table 9](https://arxiv.org/html/2606.17542#S8.T9) shows an example of the thought trace of Qwen3-14B with FOA labels for the next speaker prediction task. This example provides some insights about the predictive capabilities of LLMs. The model leverages the broader conversational context to infer the flow of the discussion (text in italics), the roles of participants, and their speaking tendencies (text in bold italics), rather than relying only on the most recent utterance. It also explicitly incorporates the FOA labels to ground its situational reasoning (text in bold), suggesting that visual information is actively used in the prediction. These behaviors may partly explain why LLMs sometimes outperform human subjects in our experiment. Humans were not given the text-based context, and they took up to twice the duration of the original video to complete the task. In contrast, the LLM had access to the long context for each utterance and could perform step-by-step reasoning, which likely contributed to its higher accuracy. As shown in [Table 7](https://arxiv.org/html/2606.17542#S7.T7) (e), disabling the thinking mode significantly degraded the performance, supporting this hypothesis.

### 8.3 Task difficulty

In this section, we examine how task difficulty varies over time and how it relates to prediction accuracy. [Figure 3](https://arxiv.org/html/2606.17542#S8.F3) visualizes the temporal changes in addressee detection and next speaker prediction accuracy for humans and Gemini 2.5 Pro using one-minute windows. Human and Gemini performance follow similar temporal trends. In both sessions, there appear to be "easy" intervals where accuracy is high for both, and "difficult" intervals where performance drops. This indicates that task difficulty fluctuates considerably across conversational intervals within the dataset, and suggests a promising direction for future research. Instead of aiming for uniform accuracy across an entire dialogue, we should design models that account for the inherent uncertainty in human conversational dynamics, improving reliability in predictable moments while allowing for more fluid, stochastic behavior in ambiguous cases.

![Refer to caption](https://arxiv.org/html/2606.17542v1/x2.png) Figure 3: Temporal variation of addressee detection and next speaker prediction accuracies across 1-minute windows. Shaded region indicates the min–max range across participants.

![Refer to caption](https://arxiv.org/html/2606.17542v1/x3.png) Figure 4: Temporal variation on the Multisimo subset using 15-second windows.

To investigate what determines task difficulty across segments, we computed correlations between turn-taking characteristics and human performance in [Table 10](https://arxiv.org/html/2606.17542#S8.T10). We considered four turn-taking-related features: (1) Turn-change probability, i.e., the proportion of adjacent utterance pairs in which the speaker changes within a one-minute window; (2) Speaker dispersion, i.e., the entropy of utterance counts across speakers; (3) Mean gap, i.e., the average temporal gap between consecutive utterances; and (4) Silence ratio, i.e., the total duration of positive gaps (silences) normalized by the window duration. Turn-change probability had a weak negative correlation with both addressee detection and next speaker prediction, indicating that conversations with more frequent speaker transitions tend to be more difficult. Speaker dispersion also exhibited a weak negative correlation with performance, suggesting that conversations with evenly distributed participation are harder, whereas those dominated by a single speaker are relatively easier. This tendency was more pronounced for next speaker prediction. In contrast, mean gap and silence ratio showed no clear correlation with any task. Turn-change prediction did not exhibit meaningful correlations with any of these features. Although these observations show that the difficulty of turn-taking in MPC is related to structural factors such as speaker distribution and the frequency of speaker transitions, these factors alone are not sufficient to fully explain performance variability. A variety of factors, such as discourse function, topic, personality, and social context, are considered to influence turn-taking [Sacks1974-ff, cox2025social, onishi2025modeling], and further investigation is required to clarify how these factors relate to task difficulty.
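
The four window-level features and the correlation with accuracy can be sketched as follows. This is an illustration under stated assumptions: utterances are given as hypothetical `(speaker, start_sec, end_sec)` tuples sorted by start time, the entropy is measured in bits, gaps may be negative when utterances overlap, and the function names are invented for this example.

```python
import math
from collections import Counter


def turn_taking_features(utterances, window_duration=60.0):
    """Features for one window of (speaker, start_sec, end_sec) tuples."""
    if len(utterances) < 2:
        raise ValueError("need at least two utterances in the window")
    pairs = list(zip(utterances, utterances[1:]))

    # (1) Share of adjacent utterance pairs with a speaker change.
    turn_change = sum(a[0] != b[0] for a, b in pairs) / len(pairs)

    # (2) Entropy (in bits) of utterance counts across speakers.
    counts = Counter(u[0] for u in utterances)
    total = len(utterances)
    dispersion = -sum((c / total) * math.log2(c / total) for c in counts.values())

    # (3) Mean gap between the end of one utterance and the start of the next.
    gaps = [b[1] - a[2] for a, b in pairs]
    mean_gap = sum(gaps) / len(gaps)

    # (4) Total positive gap (silence) normalized by the window duration.
    silence_ratio = sum(g for g in gaps if g > 0) / window_duration

    return {
        "turn_change_probability": turn_change,
        "speaker_dispersion": dispersion,
        "mean_gap": mean_gap,
        "silence_ratio": silence_ratio,
    }


def pearson_r(xs, ys):
    """Pearson correlation; raises ZeroDivisionError for constant inputs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


if __name__ == "__main__":
    # Made-up window with two speakers and one overlap.
    window = [("A", 0.0, 2.0), ("B", 2.5, 4.0), ("B", 4.0, 5.0), ("A", 4.5, 6.0)]
    print(turn_taking_features(window))
```

Computing one feature value and one accuracy value per window, then passing the two lists to `pearson_r`, gives a correlation of the kind reported in Table 10.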

### 8.4 Pilot study on three-party conversation

Our experiments used only the AMI corpus, and the findings may therefore be domain-dependent. As a pilot study, we applied our evaluation to Multisimo [Koutsombogera2018-ne], a multimodal three-party conversation corpus. We analyzed the first $\sim$60 utterances of the S02 session for which addressee annotations were available. [Figure 4](https://arxiv.org/html/2606.17542#S8.F4) visualizes temporal changes in human and Gemini 2.5 Pro performance using 15-second windows. Although this analysis is small-scale and limited to one session, the trends of model and human performance appear similar to those observed on AMI. Extending the evaluation to broader domains and languages remains future work, but this pilot result suggests that some of our findings may generalize beyond AMI.

Table 10: Pearson correlation $r$ between turn-taking-related features (per one-minute window) and human performance.

## 9 Limitation

Our evaluation differs from natural conversational participation in several respects\. First, humans and MM\-LLMs performed the tasks by watching fixed\-angle recorded videos, which do not reflect the first\-person perspective of a meeting participant\. Second, textual transcripts and explicit current speaker information were provided\. Such information is rarely available in natural interactions and may have led humans to rely more on linguistic cues than on visual and audio signals\. Third, some experimental parts began from the start of the meeting, while others began mid\-session\. As a result, the amount of accessible conversational context varied across segments, potentially affecting difficulty\. Finally, although all subjects use English in professional settings, they were non\-native speakers\. This gap between the experimental setting and real\-world interaction may have led to an underestimation of human performance\.

In addition, our experiments were conducted solely on the AMI corpus, which consists of role\-based scenario meetings with four participants\. Different languages, domains, or interaction settings may yield different results\. Evaluating turn\-taking abilities under more natural participation conditions and across diverse conversational settings remains for future research\.

## 10 Conclusion

We conducted a unified evaluation of turn\-taking in multimodal MPCs\. We compared supervised models, text\-based LLMs, multimodal LLMs, and human participants on addressee detection, turn\-change prediction, and next speaker prediction under online constraints\. Our results showed that multimodal LLMs underperformed humans in addressee detection and turn\-change prediction, suggesting current limitations in effectively leveraging raw audio and visual signals without task\- or domain\-specific adaptation\. In contrast, text\-based LLMs outperformed humans in next speaker prediction despite lacking access to multimodal information, suggesting strong contextual reasoning over conversational history\. These findings highlight both the progress and the limitations of current models in handling MPCs\. Future work should investigate how multimodal LLMs can more effectively integrate audio, video, and text information to better understand MPCs\. For practical deployment, it will also be necessary to develop systems that operate jointly with multi\-talker automatic speech recognition and speaker diarization in realistic environments\. Finally, further investigation of human turn\-taking behavior under comparable conditions will help inform the design and evaluation of MPC agents\.

## 11 Generative AI Use Disclosure

This manuscript was edited and polished with the assistance of generative AI\. Generative AI models were also used as comparison systems in the experimental evaluation\. All experimental design, implementation, and analysis were conducted by the authors who take full responsibility for the content\.

## References
