Tag
This paper investigates seemingly contradictory findings on whether large vision-language models (LVLMs) can coordinate efficient referring expressions. The authors show that models can achieve efficiency when explicitly prompted, but fail to infer the need for efficiency from implicit prompts, revealing key differences between human and AI communication.
This paper analyzes synchronization and turn-taking dynamics in full-duplex speech dialogue models by simulating conversations between two instances of the Moshi model, measuring representational alignment with centered kernel alignment (CKA) and predicting turn boundaries with LSTM probes.
This paper analyzes spontaneous dyadic Zoom conversations using multimodal features (acoustic, facial, turn-taking) to identify markers of perceived conversational success, finding that entrainment in speech and facial movements correlates with higher interaction quality.