MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

arXiv cs.CL 05/11/26, 04:00 AM Papers
iot smart-home multimodal-llm voice-assistant tool-calling dataset
Summary
The paper introduces MIST, a synthetic dataset and framework for training multimodal voice assistants to control IoT devices in smart homes. It highlights significant performance gaps between open and closed-weight models in handling complex, speech-based tool-calling tasks.
arXiv:2605.06897v1 Announce Type: new Abstract: The rise of Internet of Things (IoT) devices in the physical world necessitates voice-based interfaces capable of handling complex user experiences. While modern Large Language Models (LLMs) already demonstrate strong tool-usage capabilities, modeling real-world IoT devices presents a difficult, understudied challenge which combines modeling spatiotemporal constraints with speech inputs, dynamic state tracking, and mixed-initiative interaction patterns. We introduce MIST (the Multimodal Interactive Speech-based Tool-calling Dataset), a synthetic multi-turn, voice-driven code generation task that operates over IoT devices. We find that there is a significant gap between open- and closed-weight multimodal LLMs on MIST, and that even frontier closed-weight LLMs have substantial headroom. We release MIST and an extensible data generation framework to build related datasets in order to facilitate research on mixed-initiative voice assistants which reason about physical world constraints.
# Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
Source: [https://arxiv.org/html/2605.06897](https://arxiv.org/html/2605.06897)
Maximillian Chen¹, Xuanming Zhang¹∗, Michael Peng, Zhou Yu¹, Alexandros Papangelis, Yohan Jo²

¹Columbia University, ²Seoul National University

{maxchen, billyzhang}@cs.columbia.edu, yohan.jo@snu.ac.kr

∗ denotes equal contribution. YJ is the corresponding author. MC is now at Google and AP is now at Apple.

###### Abstract

The rise of Internet of Things (IoT) devices in the physical world necessitates voice-based interfaces capable of handling complex user experiences. While modern Large Language Models (LLMs) already demonstrate strong tool-usage capabilities, modeling real-world IoT devices presents a difficult, understudied challenge which combines modeling spatiotemporal constraints with speech inputs, dynamic state tracking, and mixed-initiative interaction patterns. We introduce MIST (the Multimodal Interactive Speech-based Tool-calling Dataset), a synthetic multi-turn, voice-driven code generation task that operates over IoT devices. We find that there is a significant gap between open- and closed-weight multimodal LLMs on MIST, and that even frontier closed-weight LLMs have substantial headroom. We release MIST and an extensible data generation framework to build related datasets in order to facilitate research on mixed-initiative voice assistants which reason about physical world constraints.

Footnote 1: billyzhang24kobe.github.io/mist-smarthome

## 1 Introduction & Related Work

![Refer to caption](https://arxiv.org/html/2605.06897v1/x1.png)

Figure 1: Example conversation from MIST. Users issue voice commands with natural disfluencies and varied accents. The assistant must generate structured API calls while managing ambiguity, corrections, redundancy, and stateful device tracking across turns.

The Internet of Things serves as an interface between the physical and the virtual world through a network of interconnected devices. IoT adoption continues to accelerate with recent advances in bringing large language models to virtual assistants (e.g., Alexa+, Gemini for Home), and by 2030 there are expected to be nearly 40 billion connected IoT devices Iji and Gurung ([2024](https://arxiv.org/html/2605.06897#bib.bib31)). As these systems include increasingly complex capabilities, rigid rule-based interfaces become insufficient. Multimodal Large Language Models (MLLMs) capable of reasoning over both spoken and textual modalities offer a promising path toward developing agents that can navigate diverse physical constraints and user interaction patterns.

![Refer to caption](https://arxiv.org/html/2605.06897v1/x2.png)

Figure 2: Overview of the data generation framework to construct MIST. We first sample from a diverse set of possible user personas, IoT devices, and rooms to form home configurations, then repeatedly sample valid conversational actions and tool calls conditioned on these configurations to form goal-oriented conversations.

Developing a modern multimodal conversational assistant for real-world IoT devices necessitates going beyond traditional Task-Oriented Dialogue (TOD) tasks such as slot filling and intent detection Coucke et al. ([2018](https://arxiv.org/html/2605.06897#bib.bib22)); Hemphill et al. ([1990](https://arxiv.org/html/2605.06897#bib.bib23)); Schuster et al. ([2019](https://arxiv.org/html/2605.06897#bib.bib24)). Modern challenges include managing a stateful representation of the physical world Rivkin et al. ([2024](https://arxiv.org/html/2605.06897#bib.bib41)), executing tool calls Goel et al. ([2023](https://arxiv.org/html/2605.06897#bib.bib15)) to orchestrate actions across various devices, modeling multi-turn conversational histories Budzianowski et al. ([2018](https://arxiv.org/html/2605.06897#bib.bib14)); Rastogi et al. ([2020](https://arxiv.org/html/2605.06897#bib.bib16)), and maintaining robustness when presented with disfluent users Goel et al. ([2023](https://arxiv.org/html/2605.06897#bib.bib15)); Qin et al. ([2024](https://arxiv.org/html/2605.06897#bib.bib40)).

In this paper, we build on a rich history of work in TOD and conversational task synthesis Bae et al. ([2022](https://arxiv.org/html/2605.06897#bib.bib18)); Qian et al. ([2025](https://arxiv.org/html/2605.06897#bib.bib39)), alongside growing bodies of work in digital text-based tool-calling Qin et al. ([2024](https://arxiv.org/html/2605.06897#bib.bib40)) and speech-based TOD Zhang et al. ([2023](https://arxiv.org/html/2605.06897#bib.bib45)); Faisal et al. ([2021](https://arxiv.org/html/2605.06897#bib.bib47)); Si et al. ([2023](https://arxiv.org/html/2605.06897#bib.bib46)). We introduce MIST (Multimodal Interactive Speech-based Tool-calling Dataset), a novel benchmark task requiring MLLMs to jointly model spoken requests in multi-turn dialogues with mixed-initiative conversation dynamics, while understanding API calls with physical world implications and spatiotemporal constraints. To construct MIST, we created a neuro-symbolic data generation framework.

## 2 MIST Overview

MIST features 10,000 conversations with 88.1 hours of spoken dialogue. MIST includes 50 of the most common unique IoT devices spanning 27 unique capabilities/API functions, both sourced from online articles Zell ([2025](https://arxiv.org/html/2605.06897#bib.bib66)); ESHP ([2025](https://arxiv.org/html/2605.06897#bib.bib65)); BHHS ([2025](https://arxiv.org/html/2605.06897#bib.bib67)). Each conversation features an average of 5.6 user turns. As in Figure [1](https://arxiv.org/html/2605.06897#S1.F1), each conversation involves a user asking a virtual assistant to interact with physical IoT devices.

### 2.1 Data Generation Framework

Figure [2](https://arxiv.org/html/2605.06897#S1.F2) presents an overview of the data generation framework for MIST. In the first phase, we define a set of possible values for each of three components: rooms, IoT devices, and user traits. We define “room types” (e.g., “kitchen” or “patio”) according to an ontology defined in Table [A3](https://arxiv.org/html/2605.06897#A3.T3). Each of these rooms is mapped to a set of plausible IoT devices. Each IoT device has its own unique capabilities (e.g., “color” or “brightness” on a smart bulb), which can be interacted with using function calls. The supported IoT devices with their capabilities and placement constraints are defined in Table [A4](https://arxiv.org/html/2605.06897#A3.T4). Lastly, we define possible values for user traits in terms of personality (e.g., “cheerful”; see Table [A7](https://arxiv.org/html/2605.06897#A3.T7)), expertise (e.g., “novice”; see Table [A7](https://arxiv.org/html/2605.06897#A3.T7)), speaking accent (e.g., “Australian”), speaking pitch, speaking rate, and equipment noise (which maps to Gaussian noise; see Table [A5](https://arxiv.org/html/2605.06897#A3.T5)).

The second phase entails conversation generation managed by a probabilistic orchestrator. For each conversation, the framework samples a unique home configuration and a consistent user profile. The home configuration parameterizes a Home State object which serves as a “Digital Twin” of the physical-world device state VanDerHorn and Mahadevan ([2021](https://arxiv.org/html/2605.06897#bib.bib62)) that tracks the real-time status of every device capability and routine. The orchestrator probabilistically samples a target interaction intent at each turn. Once an intent is selected, the system performs a symbolic check against the Home State to ground the interaction. Our framework supports six core interaction patterns (i.e., dialogue actions).
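To make the idea of a symbolic check against the Home State concrete, here is a minimal Python sketch. It is not the authors' implementation: the `HomeState` class, the `(room, device, capability)` key layout, and the action labels returned by `ground_request` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple

# Hypothetical key layout: (room, device, capability),
# e.g. ("Blue Bedroom", "lamp", "power").
CapabilityKey = Tuple[str, str, str]


@dataclass
class HomeState:
    """Illustrative digital twin: maps each device capability to its value."""
    values: Dict[CapabilityKey, Any] = field(default_factory=dict)

    def is_valid(self, key: CapabilityKey) -> bool:
        # A request can only be grounded if the capability exists in this home.
        return key in self.values

    def is_redundant(self, key: CapabilityKey, value: Any) -> bool:
        # A "no-op" request asks for a value the device already has.
        return self.is_valid(key) and self.values[key] == value

    def apply(self, key: CapabilityKey, value: Any) -> None:
        # Update the twin after a tool call has been executed.
        if not self.is_valid(key):
            raise KeyError(f"unknown capability: {key}")
        self.values[key] = value


def ground_request(state: HomeState, key: CapabilityKey, value: Any) -> str:
    """Return a coarse (assumed) action label for a requested state change."""
    if not state.is_valid(key):
        return "reject_invalid"
    if state.is_redundant(key, value):
        return "reject_redundant"
    return "execute"
```

Under these assumptions, a request to turn on an oven that is already on would be labeled `reject_redundant`, while the same request against an oven that is off would be labeled `execute`.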

1. Action Executions: Users request an action to be executed over devices in real-time (e.g., “turn off everything on the second floor”) and the agent must identify that it is a valid request and produce the correct tool call.
2. Routine Updates: Users may request combinations of actions, triggers, and conditions, which can be created, updated, and deleted (e.g., “turn on the patio light on weekends at 7am”), and the agent must identify whether it is valid and produce the correct call to update the Smart Home’s routine manager.
3. Correction Loops: The agent applies a user-requested correction (e.g., “actually, I meant to set the volume to 30”) through multiple tool calls while “undoing” previous actions if necessary.
4. Ambiguity Resolution: The orchestrator identifies potential collisions at three levels: device name duplicates, room type ambiguity (e.g., two bedrooms), or intra-room device type duplicates. In these cases, it generates a clarification sub-dialogue where the user poses an underspecified request and the agent must ask a clarifying question (e.g., in Figure [1](https://arxiv.org/html/2605.06897#S1.F1), there are multiple rooms of the same type).
5. Redundancy: The user may ask for a redundant “no-op” request and the agent needs to be capable of recognizing and rejecting such requests by evaluating the current Home State.
6. Status Updates: The user may ask for the current status of the smart home and the agent should form a tool call to retrieve the state of all of the devices.

Footnote 2: We assign randomized colors to differentiate rooms of the same type (e.g., “Blue Bedroom” vs. “Red Bedroom”).

After each of these interactions, the home state is updated based on the code execution.
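The three collision levels named under Ambiguity Resolution can be sketched as simple counting checks over a home's device list. This is an illustrative sketch only: the `Device` record and the `find_collisions` helper are hypothetical, and the real orchestrator may detect collisions differently.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Set


class Device(NamedTuple):
    name: str         # user-facing name, e.g. "lamp"
    device_type: str  # catalog type, e.g. "smart_bulb"
    room: str         # room instance, e.g. "Blue Bedroom"
    room_type: str    # ontology type, e.g. "bedroom"


def find_collisions(devices: List[Device]) -> Dict[str, list]:
    """Collect the three kinds of collision described in the text."""
    # Level 1: the same device name appears more than once in the home.
    name_counts = Counter(d.name for d in devices)

    # Level 2: more than one room instance shares a room type.
    rooms_by_type: Dict[str, Set[str]] = {}
    for d in devices:
        rooms_by_type.setdefault(d.room_type, set()).add(d.room)

    # Level 3: one room holds several devices of the same type.
    type_in_room = Counter((d.room, d.device_type) for d in devices)

    return {
        "device_name": sorted(n for n, c in name_counts.items() if c > 1),
        "room_type": sorted(t for t, r in rooms_by_type.items() if len(r) > 1),
        "intra_room_device_type": sorted(
            k for k, c in type_in_room.items() if c > 1
        ),
    }
```

Any non-empty list in the result marks a situation where an underspecified user request would call for a clarifying question rather than an immediate tool call.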

Each of these interaction patterns maps to a pair containing a fixed user-side dialogue action and an “optimal” agent-side dialogue action, both of which have default templated utterances. The user-side dialogue is paraphrased according to the sampled behavioral traits for that conversation using Gemini 2.5 Flash-Lite. To reflect naturalistic interaction, a rule-based injector then randomly adds speech disfluencies, including word repetitions and revisions Shriberg ([1994](https://arxiv.org/html/2605.06897#bib.bib64)); Passali et al. ([2022](https://arxiv.org/html/2605.06897#bib.bib7)). Finally, the text is synthesized into audio using the Google Cloud TTS API according to the sampled acoustic profile, with Gaussian noise injected to simulate recording noise (following Chen et al. ([2025a](https://arxiv.org/html/2605.06897#bib.bib42))). Implementation details are in Appendix [C](https://arxiv.org/html/2605.06897#A3).
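A rule-based disfluency injector of the kind described above could look roughly like the following Python sketch. The function names, the repetition probability `p_repeat`, and the "I mean" repair phrase are assumptions for illustration; the paper does not specify the actual rules.

```python
import random
from typing import List, Optional


def inject_disfluencies(text: str, p_repeat: float = 0.15,
                        rng: Optional[random.Random] = None) -> str:
    """Randomly repeat words, e.g. "turn on" -> "turn turn on"."""
    rng = rng or random.Random()
    out: List[str] = []
    for word in text.split():
        if rng.random() < p_repeat:
            out.append(word)  # extra copy of the word mimics a repetition
        out.append(word)
    return " ".join(out)


def inject_revision(text: str, wrong: str, right: str) -> str:
    """Replace the first `right` with a false start followed by a repair."""
    return text.replace(right, f"{wrong}, I mean, {right}", 1)
```

For example, `inject_revision("set the volume to 30", "40", "30")` yields "set the volume to 40, I mean, 30". Passing a seeded `random.Random` keeps the word repetitions reproducible across runs.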

To vet the dataset quality, we randomly sampled 300 examples and asked expert annotators to listen to the spoken request and read the existing smart home context. The annotators were tasked with verifying correctness with respect to the dataset’s stated golden dialogue actions and tool calls. We find that over 92% of both the dialogue actions and the proposed tool calls are correct, and there is over 90% agreement between annotators for these tasks. Full human evaluation details are in Appendix [F](https://arxiv.org/html/2605.06897#A6).

## 3 Experiments

In MIST, the following text inputs are provided to an MLLM: the smart home layout (including all IoT devices with their capabilities), the existing Home State, and existing conversation history. The MLLM also receives the user’s current request (i.e., the target) as speech. The prompt used to aggregate each of these inputs is in Appendix [G](https://arxiv.org/html/2605.06897#A7).

#### Evaluation

Models are evaluated along two dimensions. The first is Code Intelligence, measured by Execution Match (percentage of turns where the generated tool calls result in the correct final home state) and Exact Match (character-level match of the generated code), as in Yu et al. ([2019](https://arxiv.org/html/2605.06897#bib.bib63)). These metrics are computed for examples that require tool calls. The second is Conversational Intelligence: the agent’s ability to recognize ambiguities, redundancies, and other phenomena by producing responses with the correct dialogue action. We measure the Macro F1 and Accuracy of the inferred actions (implementation details in Appendix [D](https://arxiv.org/html/2605.06897#A4)), following the Action-level evaluation setting proposed in Chen et al. ([2025b](https://arxiv.org/html/2605.06897#bib.bib44)).
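The metrics above can be sketched in a few lines of Python. This is a simplified illustration, not the paper's evaluation code: `toy_execute` is a hypothetical stand-in for the real tool-call executor, stripping outer whitespace before Exact Match is an assumption, and `macro_f1` averages over the union of gold and predicted labels.

```python
from typing import Callable, Dict, Sequence

State = Dict[str, object]


def toy_execute(code: str, state: State) -> State:
    # Stand-in executor: each line "key=value" overwrites one state entry.
    for line in code.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            state[key.strip()] = value.strip()
    return state


def exact_match(pred_code: str, gold_code: str) -> bool:
    # Character-level comparison (outer whitespace stripped by assumption).
    return pred_code.strip() == gold_code.strip()


def execution_match(pred_code: str, gold_code: str, state: State,
                    execute: Callable[[str, State], State]) -> bool:
    # Run both programs on copies of the same state; compare final states.
    return execute(pred_code, dict(state)) == execute(gold_code, dict(state))


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    # Unweighted mean of per-label F1 over all labels seen in gold or pred.
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

The sketch also shows why Execution Match can exceed Exact Match: two programs that set the same capabilities in a different order reach the same final state but differ character by character.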

#### Baselines

We contextualize MLLM performance using several baselines. For code generation, we use a baseline that treats the initial home state as the predicted state at every turn of the conversation and computes Execution Match against it (“Initial State”). We also consider a baseline which assumes no change from the previous turn’s home state (“Previous State”). For conversational intelligence, we present a baseline that assumes that the candidate response always follows the most common dialogue action in MIST (“Constant Prediction”).
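The three baselines can be written down directly, given per-turn gold home states and gold dialogue actions. The sketch below assumes a hypothetical data layout in which `states[0]` is the initial home state and `states[t]` is the gold state after turn `t`; the function names are illustrative.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

State = Dict[str, object]


def initial_state_baseline(states: List[State]) -> float:
    """Fraction of turns whose gold final state equals the initial state."""
    turns = len(states) - 1
    hits = sum(states[t] == states[0] for t in range(1, len(states)))
    return hits / turns if turns else 0.0


def previous_state_baseline(states: List[State]) -> float:
    """Fraction of turns whose gold final state equals the previous state."""
    turns = len(states) - 1
    hits = sum(states[t] == states[t - 1] for t in range(1, len(states)))
    return hits / turns if turns else 0.0


def constant_prediction(gold_actions: Sequence[str]) -> Tuple[str, float]:
    """Always predict the most common action; return it with its accuracy."""
    label, count = Counter(gold_actions).most_common(1)[0]
    return label, count / len(gold_actions)
```

Read this way, the “Previous State” score doubles as a measure of how often no device action is needed at a turn.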

#### Models

We consider several competitive open-weight MLLMs: Qwen Audio Chu et al. ([2023](https://arxiv.org/html/2605.06897#bib.bib52)), Qwen 2 Audio Chu et al. ([2024](https://arxiv.org/html/2605.06897#bib.bib53)), Soundwave Zhang et al. ([2025b](https://arxiv.org/html/2605.06897#bib.bib54)), and Qwen 3 Omni Xu et al. ([2025](https://arxiv.org/html/2605.06897#bib.bib55)). We also evaluated a frontier closed-weight model family: Gemini 2.5 Flash-Lite, Flash, and Pro Comanici et al. ([2025](https://arxiv.org/html/2605.06897#bib.bib56)).

![Refer to caption](https://arxiv.org/html/2605.06897v1/x3.png)

![Refer to caption](https://arxiv.org/html/2605.06897v1/x4.png)

Figure 3: Error analysis characterizing the types of errors by proportion for each MLLM. The most common tool execution error for frontier models is selecting the “Wrong Value”, whereas open-weight models more often trigger a tool call at the wrong time or target the wrong device.

### 3.1 Results & Discussion

Table 1: Code Generation results indicate Gemini 2.5 Pro achieves the strongest Exact Match, with a substantial gap over leading open-weight models.

#### Code Intelligence

Table [1](https://arxiv.org/html/2605.06897#S3.T1) shows there are clear gaps between closed-weight frontier MLLMs and leading open-weight audio models. Open-weight models achieve moderate Execution Match scores (ranging from 48.76% to 60.94%), yet all but Qwen 3 Omni fail almost entirely on the Exact Match metric ($\leq 2.26\%$). The “Previous State” baseline reveals that in 71.6% of examples, performing an action over the IoT devices is not required (e.g., the agent should elicit more information or reject the request). The Code Intelligence error analysis in Figure [3](https://arxiv.org/html/2605.06897#S3.F3) shows at least 46% of the erroneous function calls involve “overtriggering” for all open-weight MLLMs, meaning the agent performs an unnecessary code action. The second most common error for open-weight MLLMs is targeting the “wrong device.” This suggests models are not effective at understanding complex contexts which may feature similar devices, which has serious physical world implications (e.g., leaving the wrong door unlocked, turning on the wrong oven). In contrast, the closed-weight MLLMs achieve decent performance. Gemini 2.5 Pro achieves the strongest performance with a 79.53% Execution Match and a 65.56% Exact Match. The overall number of errors for closed-weight models is much lower, evidenced by the lower rates of overtriggering and selecting the wrong device. Instead, the most common error type is producing the “wrong value” in code (e.g., setting the speaker to the incorrect volume setting). Lastly, model performance appears to improve with model scale, suggesting that the task is climbable and there is substantial opportunity to bridge the gap in cross-modal reasoning capabilities between open- and closed-weight MLLMs.

Table 2: Conversational Intelligence based on inferred dialogue actions in terms of F1 and Accuracy.

#### Conversational Intelligence

Table [2](https://arxiv.org/html/2605.06897#S3.T2) demonstrates models’ mixed-initiative interaction skills by assessing whether they are able to correctly identify when to confirm an action request, elicit clarifying information from a user, and more. We see that open-weight models struggle severely with producing the right conversational action, posting F1 scores which underperform a constant prediction baseline (9.13 F1). This suggests that current open-weight MLLMs cannot reliably interpret the smart home context to determine when to ask for clarification or reject a redundant request. The Gemini 2.5 models perform substantially better, with Pro achieving 46.00 F1 and 66.73% Accuracy. However, Figure [A1](https://arxiv.org/html/2605.06897#A4.F1) shows that Gemini 2.5 Pro still fails to recognize 73.0% of cases where the golden action is to confirm a valid request. The large headroom even among frontier models underscores the inherent difficulty of the MIST benchmark.

## 4 Conclusion

MIST is a novel benchmark for MLLMs’ ability to act as code-generating agents which interpret complex user intents with spatial constraints. We find that MIST is a valuable benchmark to climb, given the large gap between the abilities of open-weight and closed-weight models and the remaining headroom for frontier closed-weight models. Coupled with its extensible data generation framework which can be used to produce synthetic training data, MIST will serve as a resource to accelerate the development of open-source MLLMs and agentic experiences for the physical world.

## References

- Bae et al. (2022) Building a role specified open-domain dialogue system leveraging large-scale language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2128–2150.
- BHHS (2025) Must-have smart home devices for 2025. https://www.bhhsamericanheritage.com/blog/blog-detail/2025/4/must-have-smart-home-devices-for-2025.html
- P. Budzianowski, T. Wen, B. Tseng, I. Casanueva, S. Ultes, O. Ramadan, and M. Gasic (2018) MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 5016–5026.
- M. Chen, R. Sun, and S. O. Arik (2025a) Data-centric improvements for enhancing multi-modal understanding in spoken conversation modeling. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 1366–1387.
- M. Chen, R. Sun, T. Pfister, and S. O. Arik (2025b) Learning to clarify: multi-turn conversations with action-based contrastive self-training. In The Thirteenth International Conference on Learning Representations.
- Y. Chen, X. Yue, C. Zhang, X. Gao, R. T. Tan, and H. Li (2024) Voicebench: benchmarking llm-based voice assistants. arXiv preprint arXiv:2410.17196.
- Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Lin, et al. (2024) Qwen2-audio technical report. arXiv preprint arXiv:2407.10759.
- Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou (2023) Qwen-audio: advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919.
- G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. (2025) Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
- A. Coucke, A. Saade, A. Ball, T. Bluche, A. Caulier, D. Leroy, C. Doumouro, T. Gisselbrecht, F. Caltagirone, T. Lavril, et al. (2018) Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.
- ESHP (2025) The future of smart homes: top technology trends in 2025. https://ecosmarthomepros.com/the-future-of-smart-homes-top-technology-trends-in-2025/
- F. Faisal, S. Keshava, M. M. I. Alam, and A. Anastasopoulos (2021) SD-qa: spoken dialectal question answering for the real world. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 3296–3315.
- R. Goel, W. Ammar, A. Gupta, S. Vashishtha, M. Sano, F. Surani, M. Chang, H. Choe, D. Greene, K. He, et al. (2023) PRESTO: a multilingual dataset for parsing realistic task-oriented dialogs. arXiv preprint arXiv:2303.08954.
- C. T. Hemphill, J. J. Godfrey, and G. R. Doddington (1990) The atis spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.
- M. Iji and R. Gurung (2024) IoT market forecast to 2030: connections by region and vertical. GSMA Intelligence.
- H. Liu, Y. Hou, H. Liu, Y. Wang, Y. Wang, and Y. Wang (2025) VocalBench-df: a benchmark for evaluating speech llm robustness to disfluency. arXiv preprint arXiv:2510.15406.
- T. Passali, T. Mavropoulos, G. Tsoumakas, G. Meditskos, and S. Vrochidis (2022) LARD: large-scale artificial disfluency generation. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, pp. 2327–2336. [Link](https://aclanthology.org/2022.lrec-1.249)
- K. Qian, M. Chen, S. Li, A. Sharma, and Z. Yu (2025) Bottom-up synthesis of knowledge-grounded task-oriented dialogues with iteratively self-refined prompts. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pp. 827–844.
- Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, et al. (2024) ToolLLM: facilitating large language models to master 16000+ real-world apis. In ICLR.
- A. Rastogi, X. Zang, S. Sunkara, R. Gupta, and P. Khaitan (2020) Towards scalable multi-domain conversational agents: the schema-guided dialogue dataset. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34, pp. 8689–8696.
- D. Rivkin, F. Hogan, A. Feriani, A. Konar, A. Sigal, X. Liu, and G. Dudek (2024) Aiot smart home via autonomous llm agents. IEEE Internet of Things Journal.
- S. Schuster, S. Gupta, R. Shah, and M. Lewis (2019) Cross-lingual transfer learning for multilingual task oriented dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3795–3805.
- E. E. Shriberg (1994) Preliminaries to a theory of speech disfluencies. Doctoral dissertation, University of California at Berkeley.
- S. Si, W. Ma, H. Gao, Y. Wu, T. Lin, Y. Dai, H. Li, R. Yan, F. Huang, and Y. Li (2023) Spokenwoz: a large-scale speech-text benchmark for spoken task-oriented dialogue agents. Advances in Neural Information Processing Systems 36, pp. 39088–39118.
- E. VanDerHorn and S. Mahadevan (2021) Digital twin: generalization, characterization and implementation. Decision support systems 145, pp. 113524.
- J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu, et al. (2025) Qwen3-omni technical report. arXiv preprint arXiv:2509.17765.
- W. Yang, Y. Li, Y. Wei, M. Fang, and L. Chen (2025) Speechr: a benchmark for speech reasoning in large audio-language models. arXiv preprint arXiv:2508.02018.
- Y. Yang, H. Liu, F. Kang, M. Zhang, Z. Lian, H. Tang, and H. Chen (2026) SayNext-bench: why do llms struggle with next-utterance prediction?. arXiv preprint arXiv:2602.00327.
- T. Yu, R. Zhang, H. Er, S. Li, E. Xue, B. Pang, X. V. Lin, Y. C. Tan, T. Shi, Z. Li, et al. (2019) Cosql: a conversational text-to-sql challenge towards cross-domain natural language interfaces to databases. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp. 1962–1979.
- A. Zell (2025) Must-have smart home devices for 2025. https://bostonautomations.com/must-have-smart-home-devices-for-2025/
- L. Zhang, J. Zhang, B. Lei, C. Wu, A. Liu, W. Jia, and X. Zhou (2025a) WildSpeech-bench: benchmarking end-to-end speechllms in the wild. arXiv preprint arXiv:2506.21875.
- X. Zhang, R. Divekar, R. Ubale, and Z. Yu (2023) GrounDialog: a dataset for repair and grounding in task-oriented spoken dialogues for language learning. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pp. 300–314.
- Y. Zhang, Z. Liu, F. Bu, R. Zhang, B. Wang, and H. Li (2025b) Soundwave: less is more for speech-text alignment in llms. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 18718–18738.

## Appendix A Additional Related Work

Many recent works have proposed benchmarks to assess MLLM speech understanding. Yang et al. ([2026](https://arxiv.org/html/2605.06897#bib.bib57)) introduce SayNext-Bench to assess the ability of multimodal large language models to accurately predict a speaker’s next conversational utterance by leveraging non-verbal contextual cues. Other works focus on a similar setting called spoken question answering. Chen et al. ([2024](https://arxiv.org/html/2605.06897#bib.bib58)) propose VoiceBench to evaluate the general knowledge, instruction-following capabilities, and safety compliance of LLM-based voice assistants under diverse acoustic and speaker variations. Chen et al. ([2025a](https://arxiv.org/html/2605.06897#bib.bib42)) propose ASK-QA, a synthetic dataset to understand mixed-initiative speech in conversational QA. Addressing disfluency and naturalness, Liu et al. ([2025](https://arxiv.org/html/2605.06897#bib.bib59)) present VocalBench to comprehensively assess speech interaction models across semantic precision, acoustic quality, free-form dialogue, and environmental robustness. To benchmark open-ended speech interactions, Zhang et al. ([2025a](https://arxiv.org/html/2605.06897#bib.bib60)) develop WildSpeech-Bench, evaluating the end-to-end capabilities of audio LLMs using real-world spoken queries that incorporate speech-specific phenomena. Yang et al. ([2025](https://arxiv.org/html/2605.06897#bib.bib61)) look at cross-modal understanding and introduce SpeechR to measure the complex reasoning capabilities of large audio-language models across factual retrieval, procedural inference, and normative judgment.

## Appendix B Additional Experimental Results

Table A1: Few-shot code generation results on MIST.

We investigated whether the gaps in performance between open-weight and closed-weight models were simply a matter of insufficient domain adaptation. Specifically, at inference time, we experimented with providing three randomly sampled few-shot exemplars to each MLLM, ensuring that the evaluation example is never one of the randomly sampled exemplars. The results are presented in Tables [A1](https://arxiv.org/html/2605.06897#A2.T1) and [A2](https://arxiv.org/html/2605.06897#A2.T2).
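The exemplar selection described above amounts to sampling without replacement from a pool that excludes the evaluation example. A minimal sketch, assuming each example is a dictionary with a hypothetical unique `"id"` field:

```python
import random
from typing import Dict, List, Optional


def sample_exemplars(pool: List[Dict], eval_id: str, k: int = 3,
                     rng: Optional[random.Random] = None) -> List[Dict]:
    """Pick k distinct few-shot exemplars, never the evaluation example."""
    rng = rng or random.Random()
    # Drop the example under evaluation so it cannot leak into the prompt.
    candidates = [ex for ex in pool if ex["id"] != eval_id]
    if len(candidates) < k:
        raise ValueError("not enough candidates for k exemplars")
    return rng.sample(candidates, k)
```

Using a seeded `random.Random` would make the chosen exemplars reproducible. Whether the original experiments fixed a seed is not stated.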

Table A2: Few-shot intent recognition results on MIST.

The results indicate that few-shot prompting does yield some performance improvements, but it is insufficient to bridge the gap: the difference between open-weight models and the Gemini 2.5 family remains substantial. Table [A1](https://arxiv.org/html/2605.06897#A2.T1) shows that the best-performing open-weight model, Qwen 3 Omni, achieves 52.36% exact match for code generation, still falling short of Gemini 2.5 Pro’s 65.56% under a zero-shot setting. Table [A2](https://arxiv.org/html/2605.06897#A2.T2) similarly shows that in terms of recognizing the optimal dialogue action, Qwen 3 Omni achieves the best open-weight performance (30.43% accuracy) whereas Gemini 2.5 Pro achieves 66.73% zero-shot accuracy. These results underscore both the difficulty of MIST and the persistent performance disparity between open-source and proprietary models.

## Appendix C Data Generation Implementation Details

To construct the MIST dataset, we programmatically sample from predefined value spaces across different stages of the generation pipeline\. This structured sampling ensures a highly diverse, expansive, and realistic set of home configurations and user personas\. The set of actions that forms the basis of the conversation is controlled by configurable probabilities, which determine, for instance, whether to incorporate redundancies or how to handle naming collisions \(either by introducing user\-side ambiguity or specificity\)\.

Table A3: The hierarchical environment structure used to generate unique world configurations\.

Table A4: Complete catalog of 50 device types used in MIST, grouped by category\. Capabilities define the action space, while constraints limit valid room placements\.

### C\.1 Environment Simulation

The physical environment of each simulated home is constructed hierarchically\. As detailed in Table [A3](https://arxiv.org/html/2605.06897#A3.T3), each environment begins with a root House that contains up to three Floors, which are further divided into specific Rooms\. The total number of floors and the room types are each randomly sampled\. Once the container hierarchy is established, rooms are populated with smart devices randomly sampled from the catalog presented in Table [A4](https://arxiv.org/html/2605.06897#A3.T4)\. This catalog defines 50 distinct device types categorized by their function \(e\.g\., Lighting, Climate, Security\)\. Crucially, device placement is heavily constrained by logical room assignments to maintain realism \(e\.g\., ovens only appear in kitchens, and sprinklers are restricted to outdoor areas\)\.
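The hierarchical sampling described above can be sketched as follows\. This is a hedged illustration rather than the released generator: the PLACEMENTS table is a truncated excerpt of the placement constraints listed in the MIST System Prompt, and generate\_home together with its max\_rooms and max\_devices limits are hypothetical\.

```python
import random

# Truncated excerpt of the placement constraints from the MIST System Prompt.
PLACEMENTS = {
    "Smart Bulb": ["Living Room", "Bedroom", "Kitchen"],
    "Thermostat": ["Living Room", "Bedroom"],
    "Oven": ["Kitchen"],
    "Sprinkler": ["Backyard / Patio"],
}
ROOM_TYPES = ["Living Room", "Bedroom", "Kitchen", "Backyard / Patio"]


def generate_home(rng, max_floors=3, max_rooms=3, max_devices=2):
    """Sample a House -> Floors -> Rooms -> devices hierarchy.

    max_rooms and max_devices are assumed limits, not values from the paper.
    """
    home = {"floors": []}
    for floor_index in range(rng.randint(1, max_floors)):
        rooms = []
        for room_type in rng.sample(ROOM_TYPES, rng.randint(1, max_rooms)):
            # Only devices whose placement constraint allows this room type.
            allowed = [d for d, ok in PLACEMENTS.items() if room_type in ok]
            count = rng.randint(0, min(max_devices, len(allowed)))
            rooms.append({"type": room_type, "devices": rng.sample(allowed, count)})
        home["floors"].append({"index": floor_index, "rooms": rooms})
    return home


home = generate_home(random.Random(0))
```

Filtering the catalog by room type before sampling means an invalid placement, such as an oven in a bedroom, can never be generated\.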

Table A5: Acoustic Profile \(A\) parameters used for TTS synthesis and noise injection\.

### C\.2 User Persona and Acoustic Profiles

LLM Paraphraser System Prompt

You are a persona engine designed to create realistic user dialogue\. Your task is to paraphrase a simple, direct command into a more natural and expressive utterance that reflects a specific user profile\. The user profile consists of an ‘expertise level’ and a ‘personality trait’\. You must combine these two aspects to create a believable character\. For example, a ‘novice’ and ‘friendly’ user might say, ‘Hi there, could you please do me a favor and turn on the living room light? Thanks so much\!’ An ‘expert’ and ‘direct’ user might say, ‘Living room light on\.’ Only output the final paraphrased command\. Do not add any extra conversational text or labels\. Ensure the output sounds like natural speech and does not contain code artifacts like underscores or raw boolean values \(e\.g\., use ‘turn on’ instead of ‘power=true’\)\.

Table A6: The system prompt used for the Gemini 2\.5 Flash\-Lite paraphraser to generate persona\-driven user utterances based on their assigned behavioral profile\.

To simulate a diverse user base, each conversation is grounded in a unique user profile composed of both acoustic and behavioral traits\. Table [A5](https://arxiv.org/html/2605.06897#A3.T5) details the acoustic profile parameters used for Text\-to\-Speech synthesis with the Google Cloud TTS API\. By randomizing the TTS accent, pitch shift, and speaking rate, and by overlaying Gaussian noise, we simulate the varying acoustic challenges and environments that a real\-world multimodal assistant would encounter\. Finally, Table [A7](https://arxiv.org/html/2605.06897#A3.T7) outlines the behavioral profile, which combines one of three expertise levels with a personality trait sampled from over 100 descriptors\. Both are randomly sampled\. The LLM paraphraser \(Gemini 2\.5 Flash\-Lite\) is conditioned on these profiles to introduce variance into the user’s dialogue, ensuring that MIST captures a broad set of behaviors\. The prompt is provided in Table [A6](https://arxiv.org/html/2605.06897#A3.T6)\.
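The Gaussian noise overlay mentioned above can be illustrated with a short sketch\. The paper does not state how the noise level is parameterized, so the target signal\-to\-noise ratio \(snr\_db\) and the function name add\_gaussian\_noise are assumptions made purely for illustration\.

```python
import math
import random


def add_gaussian_noise(samples, snr_db, rng):
    """Overlay zero-mean Gaussian noise on a waveform at a target SNR in dB.

    The SNR parameterization is an assumption; the paper only states that
    Gaussian noise is overlaid on the synthesized speech.
    """
    signal_power = sum(s * s for s in samples) / len(samples)
    # SNR_dB = 10 * log10(P_signal / P_noise), solved for the noise power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


clean = [math.sin(0.1 * i) for i in range(1000)]  # stand-in for TTS audio
noisy = add_gaussian_noise(clean, snr_db=10, rng=random.Random(0))
```

Scaling the noise relative to the signal power keeps the perceived difficulty comparable across utterances of different loudness\.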

Table A7: Behavioral Profile attributes used to condition the LLM paraphraser\.

Table A8: The system prompt used for the Gemini 2\.5 Flash\-Lite paraphraser to generate persona\-driven user utterances based on their assigned behavioral profile\.

## Appendix D Evaluation Implementation Details

We developed an automated evaluation engine that assesses both the correctness of the generated API calls and the semantic correctness of the dialogue actions\.

### D\.1 Code Generation Evaluation

Code generation is evaluated along two primary dimensions: exact match and execution match\. This follows prior work such as CoSQL Yu et al\. \([2019](https://arxiv.org/html/2605.06897#bib.bib63)\) and AmbigSQL Chen et al\. \([2025b](https://arxiv.org/html/2605.06897#bib.bib44)\)\.

#### Exact Match Accuracy

This metric measures strict syntactic adherence\. A prediction is marked as an exact match only if the generated tool\_code string is identical to the ground truth code at the character level\. This metric rigorously penalizes hallucinated parameters, incorrect device IDs, and missing API calls\.

#### Execution Match Accuracy

Because multiple valid API sequences can theoretically result in the same physical state, we measure functional correctness by executing the generated code against a local simulator\. For each conversation turn, the evaluation engine initializes a Home State instance using the smart home configuration and the state snapshot from the preceding turn\. The engine parses the model’s generated tool\_code and executes the corresponding smarthome\.devices and smarthome\.routines commands to mutate the simulator’s state\. A turn is considered an Execution Match if, after all predicted code is executed, both the device state dictionary and the routines dictionary of the simulator perfectly match the ground truth state\_after\_turn and routines\_after\_turn provided in the dataset\.
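The difference between the two metrics can be seen in a small sketch\. It is a simplified stand\-in for the evaluation engine, not the authors’ implementation: it handles only smarthome\.devices \.set\(\) calls via a regular expression, omits routines entirely, and every function name and the flat state dictionary layout are assumptions\.

```python
import ast
import copy
import re

# Matches calls of the form smarthome.devices.get(id='<id>').<capability>.set(<value>)
CALL = re.compile(r"smarthome\.devices\.get\(id='([^']+)'\)\.(\w+)\.set\((.*?)\)")


def exact_match(pred_code, gold_code):
    """Character-level comparison of predicted and ground-truth code."""
    return pred_code == gold_code


def execute(tool_code, state):
    """Apply every recognized device call to a copy of the state (hypothetical layout)."""
    new_state = copy.deepcopy(state)
    for device_id, capability, raw_value in CALL.findall(tool_code):
        new_state[device_id][capability] = ast.literal_eval(raw_value)
    return new_state


def execution_match(pred_code, state_before, state_after_turn):
    """True if executing the prediction reproduces the ground-truth state."""
    try:
        return execute(pred_code, state_before) == state_after_turn
    except (KeyError, ValueError, SyntaxError):
        return False  # unknown device or unparsable value


before = {"light_0": {"power": "off"}, "light_1": {"power": "off"}}
after = {"light_0": {"power": "on"}, "light_1": {"power": "on"}}
gold = ("smarthome.devices.get(id='light_0').power.set('on')\n"
        "smarthome.devices.get(id='light_1').power.set('on')")
pred = ("smarthome.devices.get(id='light_1').power.set('on')\n"
        "smarthome.devices.get(id='light_0').power.set('on')")

em = exact_match(pred, gold)
xm = execution_match(pred, before, after)
```

In this example the prediction issues the same two calls in a different order, so it fails exact match but reaches the same final state and passes execution match\.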

### D\.2 Dialogue Action Evaluation

Evaluating the dialogue response with traditional metrics such as BLEU/ROUGE is insufficient for measuring agentic behavior, because they do not capture the nuances of optimal conversational actions \(e\.g\., two sentences can have high token overlap yet differ in meaning\)\. Conversely, semantically equivalent phrases can be expressed using different words\. Thus, we evaluate a model’s conversational intelligence by formulating it as a multi\-label intent classification task using an LLM\-as\-a\-judge, following Chen et al\. \([2025b](https://arxiv.org/html/2605.06897#bib.bib44)\)\.

#### Intent Classifier Setup

We utilize Gemini 2\.5 Flash\-Lite via Vertex AI as a zero\-shot intent classifier\. To ensure reproducible and deterministic evaluations, the model is configured with a decoding temperature of T=0\.0 and strict safety filter overrides\. For each turn, the classifier is provided with the assistant’s generated natural language response and the predicted tool\_code\. It is prompted to map the response to a subset of six valid dialogue actions: confirm\_action, clarify, inform\_redundant, inform\_not\_found, inform\_status, and apologize\_correct\.

#### Metrics

The model’s predicted intents are compared against the ground truth dialogue actions using set operations\. We report three key metrics:

- Accuracy: The percentage of turns where the predicted set of intents exactly equals the ground truth set of intents\.
- Micro F1: Calculated globally by aggregating the True Positives, False Positives, and False Negatives over all intent predictions and all turns\. This provides an overall measure of dialogue action reliability that is heavily weighted toward the most frequent intents\.
- Macro F1: Calculated by computing the F1 score independently for each of the six intent classes and averaging the results\. This ensures that performance on rare but critical mixed\-initiative dynamics \(e\.g\., inform\_not\_found, apologize\_correct\) is equally weighted\.
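The three metrics above can be computed from per\-turn intent sets as in the following sketch\. It is a minimal illustration, not the released evaluation code: the function names are hypothetical, and scoring a class with no true or predicted instances as F1 = 0 is an assumed convention that the paper does not specify\.

```python
LABELS = ["confirm_action", "clarify", "inform_redundant",
          "inform_not_found", "inform_status", "apologize_correct"]


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score (assumption)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def intent_metrics(preds, golds):
    """preds and golds are equal-length lists of sets of intent labels, one per turn."""
    counts = {label: [0, 0, 0] for label in LABELS}  # tp, fp, fn per class
    exact = 0
    for pred, gold in zip(preds, golds):
        exact += pred == gold  # set equality for turn-level accuracy
        for label in LABELS:
            counts[label][0] += label in pred and label in gold
            counts[label][1] += label in pred and label not in gold
            counts[label][2] += label not in pred and label in gold
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    return {
        "accuracy": exact / len(golds),
        "micro_f1": f1(tp, fp, fn),
        "macro_f1": sum(f1(*c) for c in counts.values()) / len(LABELS),
    }


preds = [{"confirm_action"}, {"clarify"}, {"confirm_action"}]
golds = [{"confirm_action"}, {"clarify"}, {"inform_redundant"}]
metrics = intent_metrics(preds, golds)
```

In this toy example, accuracy and micro F1 are both 2/3, while macro F1 is much lower because four of the six classes contribute an F1 of zero, which illustrates how macro averaging exposes weak performance on rare intents\.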

![Refer to caption](https://arxiv.org/html/2605.06897v1/x5.png)

Figure A1: Error analysis of attempted dialogue actions for the Gemini 2\.5 model family\.

## Appendix E Dialogue Action Error Analysis

In addition to the analysis of code generation errors in Section [3\.1](https://arxiv.org/html/2605.06897#S3.SS1), we examine existing gaps in conversational intelligence\.

Figure [A1](https://arxiv.org/html/2605.06897#A4.F1) reveals distinct behavioral profiles across the Gemini model family\. The Flash\-Lite model acts as an eager agent which prioritizes answering over other conversational strategies\. It fails to recognize ambiguity, missing 93\.3% of ambiguous requests, and fails to recognize redundancy, missing 94\.0% of redundant requests\. Conversely, Gemini 2\.5 Flash and Pro are either more capable of recognizing ambiguity and redundancy, or more capable of obeying the instruction that the proper behavior is to ask a clarifying question or reject the user’s request as opposed to blindly executing an action\. Gemini 2\.5 Pro misses only 10\.6% of the examples where asking a clarification question was the golden action\. However, all models struggle significantly with inform\_not\_found \(\>70% error\), which suggests that even frontier models have difficulty recognizing when a requested device is entirely absent from the ontology, often attempting to force an execution rather than gracefully rejecting a request\.

For the open\-weight models, as indicated in the main results in Table [2](https://arxiv.org/html/2605.06897#S3.T2), the vast majority of attempted dialogue actions are incorrect\.

## Appendix F Human Evaluation

![Refer to caption](https://arxiv.org/html/2605.06897v1/figs/Human_Eval_Instructions.png)

Figure A2: Screenshot of the interface shown to expert human annotators\.

#### Setup Details

To evaluate the quality of MIST data, we conducted an expert human evaluation with a pool of 14 human raters who all have working proficiency in English and at least a graduate background in Computer Science\. We randomly sampled 300 examples and randomly assigned three unique raters per example\. As shown in Figure [A2](https://arxiv.org/html/2605.06897#A6.F2), we ask the raters to determine the correctness of the dialogue action and the code\. At the top of the tool, the annotators are also presented with the following instruction:

> Evaluator Goal: Your goal is to judge the correctness of the Candidate Dialogue Action and the Candidate Tool Code\. Please cross\-check the Home Configuration, Current State, and Existing Routines below in order to determine the correctness of the action and the code based on the AI’s instructions\.
>
> Assistant Instructions: See “MIST System Prompt” in Section [G](https://arxiv.org/html/2605.06897#A7)\.
>
> Dialogue Action Definitions: When evaluating the candidate action, ensure the generated dialogue action correctly matches the following intent categories:
>
> - confirm\_action: Assistant successfully executed a device action or routine change \(creation, update, or deletion\)\.
> - clarify: Assistant asked a question to resolve ambiguity or request missing parameters\.
> - inform\_redundant: Assistant took no action because the target device or routine was already in the requested state \(a No\-op\)\.
> - inform\_not\_found: Assistant informed the user that a requested device or routine does not exist\.
> - inform\_status: Assistant answered a question about the current state of the home or specific devices\.
> - apologize\_correct: Assistant reverted a previous mistake and executed the corrected action in response to a user’s self\-correction\.
> - other: General chitchat or any intent not covered by the primary actions\.

As seen in Section [G](https://arxiv.org/html/2605.06897#A7), these instructions are shown to the MLLM at inference time\.

#### Annotation Results

We computed a majority vote over the rater\-assigned labels for both dialogue actions and proposed code\. By majority vote, the dialogue action was judged correct in 92\.33% of examples, with 90\.61% agreement between the raters\. Of the 141 examples that required code, the code was judged correct by majority vote in 92\.91% of cases, with 93\.57% agreement between the raters\.
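The aggregation over three raters per example can be sketched as follows\. The majority vote follows the description above; the agreement statistic is not defined in the text, so the pairwise percent agreement used here is only one plausible reading and should be treated as an assumption, as should the function names and the toy ratings\.

```python
from collections import Counter
from itertools import combinations


def majority_vote(labels):
    """Most common label among an odd number of binary ratings."""
    return Counter(labels).most_common(1)[0][0]


def pairwise_agreement(ratings):
    """Fraction of rater pairs that agree, pooled over all examples (assumed metric)."""
    agree = total = 0
    for labels in ratings:
        for a, b in combinations(labels, 2):
            agree += a == b
            total += 1
    return agree / total


# Toy ratings: one row per example, one boolean "is correct" judgment per rater.
ratings = [[True, True, True], [True, True, False], [False, False, True]]
votes = [majority_vote(r) for r in ratings]
fraction_correct = sum(votes) / len(votes)
agreement = pairwise_agreement(ratings)
```

With three raters and binary labels a strict majority always exists, so no tie\-breaking rule is needed\.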

## Appendix G MIST Task Prompt Details

As the following MIST System Prompt shows, the same instructions provided in Section [F](https://arxiv.org/html/2605.06897#A6) as context for human annotators to understand the task are provided to the MLLM\. The prompt provides detailed instructions for the MIST task, including an exhaustive list of all of the possible capabilities\. This is a fixed prompt prefix that is always shown to the MLLM\.

The following Conversational Input Template demonstrates how the example\-specific attributes are provided to the MLLM\. Variables such as the Smart Home configuration, current Home State, Conversation History, and Current User Request are provided as input to the MLLM\. The resulting prompt is directly appended to the MIST System Prompt\.

MIST System Prompt

\#\# Task Instructions

You are a sophisticated, stateful AI assistant for a smart home\. Your primary goal is to help users control their devices and manage routines by generating precise API calls\.

\*\*Core Principles:\*\*

1\. \*\*State Awareness:\*\* You are aware of the current state of all devices\. Do not perform redundant actions\. If a user asks to turn on a light that is already on, inform them that no action is needed and do not generate a tool call\.

2\. \*\*Contextual Understanding:\*\* Pay close attention to the entire conversation history\. Users may refer to devices using pronouns \(e\.g\., "it", "that one"\) after mentioning them explicitly\. They may also correct a previous command\.

3\. \*\*Ambiguity Resolution:\*\* If a command is ambiguous, you MUST ask for clarification\. Never guess\.
  \- If a command could refer to multiple devices \(e\.g\., "turn on the light" when there are several\), list the specific options for the user \(e\.g\., "Which one did you mean, the Blue Bedroom Smart Bulb or the Living Room Smart Bulb?"\)\.
  \- If a command is missing a required parameter \(e\.g\., "change the thermostat"\), ask for the missing value \(e\.g\., "What temperature would you like to set it to?"\)\.

4\. \*\*Handling Corrections:\*\* When a user corrects a previous command \(e\.g\., "My mistake, please make it 72 degrees\."\), you must first generate an API call to revert the mistaken action before generating the second API call for the corrected action\. This may involve two separate tool calls\.

\*\*Tool API Reference:\*\*

You have access to a ‘smarthome‘ API with two main modules: ‘devices‘ and ‘routines‘\.

\*\*1\. Device Control \(‘smarthome\.devices‘\)\*\*
  \- \*\*Syntax:\*\* ‘smarthome\.devices\.get\(id=’<device\_id\>’\)\.<capability\>\.set\(<value\>\)‘
  \- \*\*Scoped Actions:\*\* If a user refers to a location \(e\.g\., "the first floor", "the whole house"\), you must generate a separate ‘devices\.get…‘ call for \*\*every single device\*\* that matches the request\.
  \- \*\*Status Check:\*\* To get the status of all devices, use ‘smarthome\.devices\.get\_all\_states\(\)‘\.

\*\*2\. Routine Management \(‘smarthome\.routines‘\)\*\*
  \- \*\*Create:\*\* ‘smarthome\.routines\.create\(name=’<routine\_name\>’, trigger=’<trigger\>’, condition=<condition\>, actions=\[…\]\)‘
    \- ‘condition‘ can be ’weekdays’, ’weekends’, or None\.
    \- ‘actions‘ is a list of device action dictionaries, e\.g\., ‘\["device\_id": "light\_0", "capability": "power", "value": "on"\]‘\.
  \- \*\*Update:\*\* ‘smarthome\.routines\.update\(name=’<routine\_name\>’, updates=’<property\>’: <new\_value\>\)‘
    \- You can update the ’trigger’ or ’condition’\.
    \- If the routine is already set to the requested value, inform the user it’s redundant\.
  \- \*\*Delete:\*\* ‘smarthome\.routines\.delete\(name=’<routine\_name\>’\)‘
    \- If you cannot find a routine with the given name, you must inform the user\.

\*\*Home Configuration Reference:\*\*

The set of possible rooms is listed as follows: "Living Room", "Bedroom", "Kitchen", "Office / Study", "Bathroom", "Garage", "Dining Room", "Home Gym", "Backyard / Patio", "Home Theater"

The exhaustive set of capabilities for each possible device or appliance are provided as the following mapping:

\# Lighting
"Smart Bulb": "placements": \["Living Room", "Bedroom", "Kitchen", "Office / Study", "Dining Room", "Hallway", "Home Theater"\], "capabilities": "power": \["on", "off"\], "brightness": list\(range\(10, 101, 10\)\), "color": \["red", "green", "blue", "white", "purple"\],
"Light Strip": "placements": \["Living Room", "Bedroom", "Kitchen", "Home Theater"\], "capabilities": "power": \["on", "off"\], "brightness": list\(range\(10, 101, 10\)\), "scene": \["ocean", "forest", "sunset"\],
"Dimmer Switch": "placements": \["Living Room", "Bedroom", "Dining Room", "Home Theater"\], "capabilities": "power": \["on", "off"\], "brightness": list\(range\(10, 101, 10\)\),
"Outdoor Floodlight": "placements": \["Backyard / Patio", "Garage"\], "capabilities": "power": \["on", "off"\], "brightness": list\(range\(50, 101, 10\)\), "motion\_detection": \["enabled", "disabled"\],

\# Climate
"Thermostat": "placements": \["Living Room", "Bedroom", "Hallway"\], "capabilities": "temperature": list\(range\(60, 81\)\), "mode": \["heat", "cool", "fan\_only", "off"\],
"Air Purifier": "placements": \["Bedroom", "Living Room", "Office / Study"\], "capabilities": "power": \["on", "off"\], "fan\_speed": \["auto", "low", "high"\],
"Ceiling Fan": "placements": \["Bedroom", "Living Room"\], "capabilities": "power": \["on", "off"\], "speed": \["low", "medium", "high"\],
"Smart Blinds": "placements": \["Living Room", "Bedroom", "Office / Study", "Home Theater"\], "capabilities": "position": \["open", "closed", "halfway"\],
"Air Conditioner": "placements": \["Living Room", "Bedroom"\], "capabilities": "power": \["on", "off"\], "temperature": list\(range\(65, 80\)\), "fan\_speed": \["low", "medium", "high"\],

\# Kitchen & Appliances
"Refrigerator": "placements": \["Kitchen"\], "capabilities": "mode": \["eco", "normal"\], "ice\_maker": \["on", "off"\],
"Oven": "placements": \["Kitchen"\], "capabilities": "power": \["on", "off"\], "temperature": list\(range\(200, 451, 25\)\), "mode": \["bake", "broil", "convection"\],
"Microwave": "placements": \["Kitchen"\], "capabilities": "power": \["on", "off"\], "duration\_seconds": \[30, 60, 90, 120\],
"Coffee Maker": "placements": \["Kitchen", "Office / Study"\], "capabilities": "power": \["on", "off"\], "brew\_strength": \["mild", "medium", "strong"\],
"Dishwasher": "placements": \["Kitchen"\], "capabilities": "power": \["on", "off"\], "cycle": \["normal", "heavy", "rinse"\],

\# Entertainment
"TV": "placements": \["Living Room", "Bedroom", "Home Theater"\], "capabilities": "power": \["on", "off"\], "volume": list\(range\(0, 51, 5\)\), "source": \["HDMI 1", "Netflix", "Hulu"\],
"Soundbar": "placements": \["Living Room", "Home Theater"\], "capabilities": "power": \["on", "off"\], "volume": list\(range\(0, 51, 5\)\), "eq\_mode": \["movie", "music", "dialogue"\],
"Speaker": "placements": \["Living Room", "Bedroom", "Kitchen", "Office / Study", "Home Gym"\], "capabilities": "power": \["on", "off"\], "volume": list\(range\(0, 71, 10\)\), "playback": \["play", "pause", "skip"\],
"Projector": "placements": \["Home Theater"\], "capabilities": "power": \["on", "off"\], "source": \["HDMI 1", "Apple TV"\],
"AV Receiver": "placements": \["Home Theater", "Living Room"\], "capabilities": "power": \["on", "off"\], "volume": list\(range\(0, 61, 5\)\), "sound\_mode": \["stereo", "surround"\],

\# Security
"Door Lock": "placements": \["Living Room", "Garage"\], "capabilities": "lock": \["locked", "unlocked"\],
"Security Camera": "placements": \["Living Room", "Backyard / Patio", "Garage"\], "capabilities": "power": \["on", "off"\], "privacy\_mode": \["on", "off"\],
"Video Doorbell": "placements": \["Living Room"\], "capabilities": "chime": \["on", "off"\], "check\_events": \["true"\],
"Smoke Detector": "placements": \["Kitchen", "Hallway", "Bedroom"\], "capabilities": "check\_status": \["true"\],
"Garage Door Opener": "placements": \["Garage"\], "capabilities": "position": \["open", "closed"\],
"Window Sensor": "placements": \["Living Room", "Bedroom", "Kitchen"\], "capabilities": "check\_status": \["true"\],
"Water Leak Sensor": "placements": \["Bathroom", "Kitchen", "Garage"\], "capabilities": "check\_status": \["true"\],

\# General & Outdoor
"Smart Plug": "placements": \["Living Room", "Bedroom", "Office / Study"\], "capabilities": "power": \["on", "off"\],
"Robot Vacuum": "placements": \["Living Room", "Kitchen", "Hallway"\], "capabilities": "dock": \["true"\], "clean": \["true"\], "pause": \["true"\],
"Sprinkler": "placements": \["Backyard / Patio"\], "capabilities": "power": \["on", "off"\], "duration\_minutes": \[5, 10, 15\],
"Pet Feeder": "placements": \["Kitchen", "Living Room"\], "capabilities": "dispense\_food": \["true"\],
"Smart Curtains": "placements": \["Living Room", "Bedroom", "Home Theater"\], "capabilities": "position": \["open", "closed", "halfway"\],

\# Home Gym
"Treadmill": "placements": \["Home Gym"\], "capabilities": "power": \["on", "off"\], "speed": list\(range\(1, 11\)\), "incline": list\(range\(0, 16\)\),
"Smart Scale": "placements": \["Home Gym", "Bathroom"\], "capabilities": "get\_last\_reading": \["true"\],
"Adjustable Dumbbells": "placements": \["Home Gym"\], "capabilities": "weight": \[10, 20, 30, 40, 50\],

\# Miscellaneous
"Diffuser": "placements": \["Bedroom", "Bathroom", "Living Room"\], "capabilities": "power": \["on", "off"\], "intensity": \["low", "medium", "high"\],
"Smart Plant Pot": "placements": \["Living Room", "Office / Study", "Kitchen"\], "capabilities": "check\_moisture": \["true"\], "water\_plant": \["true"\],
"Smart Mirror": "placements": \["Bathroom", "Bedroom"\], "capabilities": "show\_weather": \["true"\], "show\_calendar": \["true"\],

\#\# Current Conversation

Conversational Input Template

The user’s smart home is defined by the following configuration: SMART\_HOME

The current state of the smarthome is: CURRENT\_STATE

The previous turns in the conversation is: CONVERSATION\_HISTORY

The user’s current request is provided by the audio\. If the user’s request is valid and actionable, you must write the code for the API call\.

Then, write the AI Assistant response that elicits additional information, rejects the user’s request, or confirms the execution of the request\.

You must always first provide the appropriate API Call \(or write None\), then the Assistant response\. Structure your output as follows\.

API Call: \[api call here\]

A: \[dialogue response here\]

\[Assistant Output\]
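The output structure requested at the end of the template \(an API Call field followed by an Assistant field\) lends itself to a simple parser\. The sketch below is an assumption about how such output could be split before scoring; the paper does not describe its parsing code, and parse\_output is a hypothetical helper\.

```python
import re

# Splits "API Call: ... Assistant: ..." into its two fields (assumed format handling).
OUTPUT = re.compile(r"API Call:\s*(.*?)\s*Assistant:\s*(.*)", flags=re.DOTALL)


def parse_output(text):
    """Return (tool_code or None, assistant_response) from a raw model output."""
    match = OUTPUT.search(text)
    if match is None:
        # No recognizable structure: treat the whole text as the dialogue response.
        return None, text.strip()
    code, reply = match.group(1).strip(), match.group(2).strip()
    # The prompt asks the model to write None when no call is needed.
    return (None if code == "None" else code), reply


code, reply = parse_output(
    "API Call: smarthome.devices.get(id='light_0').power.set('on')\n"
    "Assistant: Done, the light is on."
)
no_code, question = parse_output("API Call: None\nAssistant: Which light did you mean?")
```

Mapping the literal None to a missing tool call keeps turns such as clarification questions distinct from turns that execute code\.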