Towards Next-Generation Healthcare: A Survey of Medical Embodied AI for Perception, Decision-Making, and Action

arXiv cs.AI 06/16/26, 04:00 AM Papers
survey medical embodied-ai healthcare perception decision-making foundation-models
Summary
This paper systematically surveys the core components of medical embodied AI, emphasizing the coordinated integration of perception, decision-making, and action in clinical environments, and reviews representative applications, datasets, and future research directions.
arXiv:2606.15647v1 Announce Type: new Abstract: Foundation models have demonstrated impressive performance in enhancing healthcare efficiency across a wide range of medical applications. Nevertheless, their limited ability to perceive, understand, and interact with the physical world significantly constrains their effectiveness in real-world clinical workflows, where safety-critical decision-making and physical execution are tightly coupled. Recently, embodied artificial intelligence (AI) has emerged as a promising physical-interactive paradigm for intelligent healthcare, enabling agents to operate in complex medical environments. As research in this area rapidly expands, understanding how intelligent agents function as integrated, end-to-end systems in clinical environments becomes increasingly critical. However, existing surveys on medical embodied AI largely emphasize individual aspects or functional components, lacking a unified system-level organization of the field. To support and consolidate recent advances, we systematically survey the core components of medical embodied AI, with a particular emphasis on the coordinated integration of perception, decision-making, and action. We further review representative medical applications and relevant datasets, and we analyze the major challenges encountered in real-world clinical practice. Finally, we discuss key directions for future research in this rapidly evolving field. The associated project can be found at https://github.com/VMVLab/Medical_Embodied_AI_Paper_List.
# Towards Next-Generation Healthcare: A Survey of Medical Embodied AI for Perception, Decision-Making, and Action
Source: [https://arxiv.org/html/2606.15647](https://arxiv.org/html/2606.15647)
Cheng Zhang, Qing Cai, Xingzheng Wu, Xun Yang, Xiaojun Chang, Bingkun Bao, Liqiang Nie, Xinwang Liu, Yi Yang

Cheng Zhang and Xingzheng Wu are with the School of Information Science and Engineering, Ocean University of China, Qingdao, Shandong 266100, China \(e\-mail: zhangcheng@stu\.ouc\.edu\.cn, wuxingzheng@stu\.ouc\.edu\.cn\)\. Qing Cai is with the Innovation School of Artificial Intelligence, Hefei University of Technology, Hefei 230009, China \(e\-mail: caiqing1617@gmail\.com\)\. Xun Yang and Xiaojun Chang are with the School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China \(e\-mail: xyang21@ustc\.edu\.cn, xjchang@ustc\.edu\.cn\)\. Bing\-Kun Bao is with the School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China \(e\-mail: bingkunbao@hfut\.edu\.cn\)\. Liqiang Nie is with the School of Computer Science and Technology, Harbin Institute of Technology \(Shenzhen\), Shenzhen 518055, China \(e\-mail: nieliqiang@gmail\.com\)\. Xinwang Liu is with the College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China \(e\-mail: xinwangliu@nudt\.edu\.cn\)\. Yi Yang is with the ReLER Laboratory, CCAI, Zhejiang University, Zhejiang 310027, China \(e\-mail: yangyics@zju\.edu\.cn\)\.

###### Abstract

Foundation models have demonstrated impressive performance in enhancing healthcare efficiency across a wide range of medical applications\. Nevertheless, their limited ability to perceive, understand, and interact with the physical world significantly constrains their effectiveness in real\-world clinical workflows, where safety\-critical decision\-making and physical execution are tightly coupled\. Recently, embodied artificial intelligence \(AI\) has emerged as a promising physical\-interactive paradigm for intelligent healthcare, enabling agents to operate in complex medical environments\. As research in this area rapidly expands, understanding how intelligent agents function as integrated, end\-to\-end systems in clinical environments becomes increasingly critical\. However, existing surveys on medical embodied AI largely emphasize individual aspects or functional components, lacking a unified system\-level organization of the field\. To support and consolidate recent advances, we systematically survey the core components of medical embodied AI, with a particular emphasis on the coordinated integration of perception, decision\-making, and action\. We further review representative medical applications and relevant datasets, and we analyze the major challenges encountered in real\-world clinical practice\. Finally, we discuss key directions for future research in this rapidly evolving field\. The associated project can be found at https://github\.com/VMVLab/Medical\_Embodied\_AI\_Paper\_List\.

## I Introduction

![Refer to caption](https://arxiv.org/html/2606.15647v1/x1.png)
Figure 1: Overview of a medical embodied artificial intelligence framework\. Medical embodied agents interact with simulated and real clinical environments via a perception–decision–action loop\.

The widespread adoption of artificial intelligence \(AI\) in medicine has significantly improved the efficiency and accuracy of clinical diagnosis\[[1](https://arxiv.org/html/2606.15647#bib.bib1)\]\. Convolutional Neural Networks \(CNNs\) have achieved strong performance in disease classification and lesion segmentation\[[2](https://arxiv.org/html/2606.15647#bib.bib2),[3](https://arxiv.org/html/2606.15647#bib.bib3)\], while Large Language Models \(LLMs\) and their multimodal extensions have recently shown promise in medical report generation and clinical decision support\[[4](https://arxiv.org/html/2606.15647#bib.bib4),[5](https://arxiv.org/html/2606.15647#bib.bib5),[6](https://arxiv.org/html/2606.15647#bib.bib6),[7](https://arxiv.org/html/2606.15647#bib.bib7)\]\. However, these approaches are largely confined to a “perception and decision” paradigm based on static data, lacking the capability for physical interaction with real\-world clinical environments, which limits their applicability in realistic medical scenarios\.

In contrast, embodied artificial intelligence \(Embodied AI\) enables perception, decision\-making, and action within physical environments, opening new avenues for medical AI\[[8](https://arxiv.org/html/2606.15647#bib.bib8),[9](https://arxiv.org/html/2606.15647#bib.bib9)\]\. As illustrated in Fig\.[1](https://arxiv.org/html/2606.15647#S1.F1), medical embodied AI systems typically follow a closed\-loop perception–decision–action framework\. Embodied AI has been applied to a range of medical tasks, including surgical robotics\[[10](https://arxiv.org/html/2606.15647#bib.bib10),[11](https://arxiv.org/html/2606.15647#bib.bib11)\], surgical navigation\[[12](https://arxiv.org/html/2606.15647#bib.bib12),[13](https://arxiv.org/html/2606.15647#bib.bib13),[14](https://arxiv.org/html/2606.15647#bib.bib14),[15](https://arxiv.org/html/2606.15647#bib.bib15),[16](https://arxiv.org/html/2606.15647#bib.bib16)\], rehabilitation assistance\[[17](https://arxiv.org/html/2606.15647#bib.bib17),[18](https://arxiv.org/html/2606.15647#bib.bib18),[19](https://arxiv.org/html/2606.15647#bib.bib19)\], and mobile clinical support\[[20](https://arxiv.org/html/2606.15647#bib.bib20),[21](https://arxiv.org/html/2606.15647#bib.bib21),[22](https://arxiv.org/html/2606.15647#bib.bib22),[23](https://arxiv.org/html/2606.15647#bib.bib23)\], demonstrating clear advantages in complex and dynamic clinical scenarios\. Despite this potential, significant challenges remain across embodied perception, decision\-making, and action, including data scarcity, uncertainty modeling, and high\-precision control sensitivity\.

![Refer to caption](https://arxiv.org/html/2606.15647v1/x2.png)
Figure 2: Overview structure of this survey\.

Recent surveys have examined medical embodied AI from complementary perspectives\. Some provide broad overviews of embodied AI in healthcare, summarizing functional components, application domains, datasets, and ethical considerations to outline the overall research landscape\[[24](https://arxiv.org/html/2606.15647#bib.bib24)\]\. Others emphasize system\-level design, such as hierarchical or modular architectures, to integrate perception, planning, and execution with a focus on clinical reliability and safety\[[25](https://arxiv.org/html/2606.15647#bib.bib25)\]\. Additional works narrow the scope to specific aspects, including representative applications \(e\.g\., surgical robotics and rehabilitation\)\[[26](https://arxiv.org/html/2606.15647#bib.bib26)\], core perceptual technologies such as 3D medical image segmentation\[[27](https://arxiv.org/html/2606.15647#bib.bib27)\], and specialty\-driven viewpoints exemplified by ophthalmology\[[28](https://arxiv.org/html/2606.15647#bib.bib28)\]\. Building on these efforts, this work unifies prior perspectives within a closed\-loop framework of perception, decision\-making, and action, offering a complementary system\-level view of medical embodied AI\.

![Refer to caption](https://arxiv.org/html/2606.15647v1/x3.png)
Figure 3: Conceptual foundations of embodied AI and its relevance to medical embodied intelligence\. a, Publication trends, temporal evolution over the past decade, and representative keywords of embodied AI based on Google Scholar statistics\. b, Four developmental stages of embodied AI: conceptual germination, paradigm shift, learning\-driven, and large\-model–empowered stages\. c, Comparison between disembodied intelligence and embodied AI, highlighting the latter’s ability to interact with the environment\. d, Core components of embodied AI, including agents and environments at the system level, and embodied perception, decision\-making, and action at the technical level\.

As illustrated in Fig\.[2](https://arxiv.org/html/2606.15647#S1.F2), the remainder of this survey is organized as follows\. Section 2 provides background on embodied AI as the conceptual foundation for medical embodied AI, introducing its development and core components\. Section 3 examines medical embodied AI applications, while Section 4 introduces relevant datasets\. Section 5 discusses key challenges and future perspectives, and Section 6 concludes the survey with key insights and implications for intelligent healthcare systems\.

## II Background: Embodied AI

In this section, we briefly review embodied AI as the conceptual foundation of medical embodied AI, focusing on its core ideas, developmental evolution, and system\-level components\. Rather than providing an exhaustive survey of embodied AI, we aim to establish a concise background that facilitates understanding of subsequent discussions on medical embodied AI\.

### II\-A Foundations and Evolution

Embodied AI has received growing research attention in recent years \(Fig\.[3](https://arxiv.org/html/2606.15647#S1.F3)a\) and has evolved through four major developmental stages \(Fig\.[3](https://arxiv.org/html/2606.15647#S1.F3)b\)\. The early Conceptual Germination Stage established the foundations of artificial intelligence through symbolic reasoning, followed by a Paradigm Shift Stage that emphasized learning mechanisms and neural networks\. The subsequent Learning\-Driven Stage leveraged deep reinforcement and imitation learning to enable autonomous decision\-making\. In the recent Large\-Model\-Empowered Stage, large language and multimodal models have substantially enhanced perception, cognition, and interaction, exposing the limitations of disembodied AI and motivating embodied systems capable of acting in physical environments\.

TABLE I: Overview of the core components, their respective functions, and sub\-directions in embodied AI\.

| Components | Function | Sub\-Direction |
| --- | --- | --- |
| Embodied Perception | Provides multimodal understanding of the environment\. | Object Perception; Scene Perception; Behavior Perception; Expression Perception |
| Embodied Decision\-Making | Converts perception into adaptive strategies\. | Task Planning; Embodied Navigation; Embodied Question Answering \(EQA\) |
| Embodied Action | Executes decisions through physical interaction\. | Imitation Learning\-Based Action; Reinforcement Learning\-Based Action; Large Model\-Driven Action |

### II\-B Core Components

Specifically, conventional expert systems and language models operate primarily on abstract or symbolic representations and lack direct interaction with the physical environment, i\.e\., they are typically disembodied \(Fig\.[3](https://arxiv.org/html/2606.15647#S1.F3)c\)\. As a result, their adaptability and generalization to complex real\-world scenarios remain inherently constrained\. In contrast, embodied AI enables agents to perceive, decide, and act in a closed\-loop manner with the environment\. As illustrated in Fig\.[3](https://arxiv.org/html/2606.15647#S1.F3)d, embodied AI systems are typically composed of three core components—embodied perception, embodied decision\-making, and embodied action—which jointly support multimodal understanding, planning and reasoning, and autonomous interaction\[[29](https://arxiv.org/html/2606.15647#bib.bib29),[30](https://arxiv.org/html/2606.15647#bib.bib30),[31](https://arxiv.org/html/2606.15647#bib.bib31)\]\. In addition, sim\-to\-real transfer is commonly employed to bridge the gap between simulated training and real\-world deployment\[[32](https://arxiv.org/html/2606.15647#bib.bib32),[33](https://arxiv.org/html/2606.15647#bib.bib33),[34](https://arxiv.org/html/2606.15647#bib.bib34),[35](https://arxiv.org/html/2606.15647#bib.bib35)\]\.

Embodied AI typically operates in a closed\-loop perception–decision–action paradigm\[[30](https://arxiv.org/html/2606.15647#bib.bib30)\]\. As summarized in Table[I](https://arxiv.org/html/2606.15647#S2.T1), embodied perception extracts multimodal representations from heterogeneous sensory inputs \(e\.g\., vision, depth, audio, and touch\), supporting object, scene, behavior, and expression understanding for downstream interaction, planning, navigation, and question answering\[[36](https://arxiv.org/html/2606.15647#bib.bib36),[37](https://arxiv.org/html/2606.15647#bib.bib37)\]\. Building on perceptual representations, embodied decision\-making maps observations to adaptive strategies through task planning, navigation, and embodied question answering, enabling goal\- and language\-aware reasoning\[[38](https://arxiv.org/html/2606.15647#bib.bib38),[39](https://arxiv.org/html/2606.15647#bib.bib39),[40](https://arxiv.org/html/2606.15647#bib.bib40),[41](https://arxiv.org/html/2606.15647#bib.bib41)\]\. Finally, embodied action executes decisions via physical interaction, commonly realized through imitation\-based, reinforcement\-based, and large\-model–driven approaches\[[42](https://arxiv.org/html/2606.15647#bib.bib42)\]\. These properties make embodied AI particularly suitable for safety\-critical and environment\-dependent medical scenarios\.
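
The closed loop described above can be sketched as a minimal control loop. This is an illustrative toy only: the one-dimensional "distance to target" state, the `Observation` type, and the names `perceive`, `decide`, `act`, and `run_loop` are assumptions introduced here and do not correspond to any system covered in this survey.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """Hypothetical perceptual representation: tool-to-target distance in mm."""
    distance_mm: float


def perceive(true_distance_mm: float) -> Observation:
    # Perception: turn the raw environment state into a representation.
    # A real system would fuse vision, depth, audio, or touch at this step.
    return Observation(distance_mm=true_distance_mm)


def decide(obs: Observation, tolerance_mm: float = 0.5, max_step_mm: float = 2.0) -> float:
    # Decision-making: map the observation to a bounded motion command.
    if obs.distance_mm <= tolerance_mm:
        return 0.0  # within tolerance, so command a stop
    return min(obs.distance_mm, max_step_mm)


def act(true_distance_mm: float, step_mm: float) -> float:
    # Action: executing the command changes the environment state.
    return true_distance_mm - step_mm


def run_loop(start_distance_mm: float, max_iters: int = 20) -> List[float]:
    """Repeat perceive -> decide -> act until the decision step commands a stop."""
    distance = start_distance_mm
    trace = [distance]
    for _ in range(max_iters):
        step = decide(perceive(distance))
        if step == 0.0:
            break
        distance = act(distance, step)
        trace.append(distance)
    return trace


if __name__ == "__main__":
    print(run_loop(5.0))
```

The point of the sketch is the feedback path: each action changes the state that the next perception step observes, which is what separates this paradigm from one-shot prediction on static data.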

![Refer to caption](https://arxiv.org/html/2606.15647v1/x4.png)
Figure 4: Overview of medical embodied AI and its hierarchical organization with representative methods\.

![Refer to caption](https://arxiv.org/html/2606.15647v1/x5.png)
Figure 5: Overview of medical embodied perception, including medical instrument and organ recognition, surgical and clinical environment perception and modeling, medical operation behavior detection, and emotional interaction understanding\.

## III Medical Embodied AI

Building on embodied AI, medical embodied AI has emerged as a paradigm for interactive, task\-oriented clinical operation\. As illustrated in Fig\.[4](https://arxiv.org/html/2606.15647#S2.F4), it follows a closed\-loop perception–decision–action framework, comprising medical embodied perception, decision\-making, and action\. Integrated application scenarios instantiate this framework at the system level by combining these components to support real\-world medical tasks\. Accordingly, this section reviews representative advances across these aspects to provide an overview of medical embodied AI\.

### III\-A Medical Embodied Perception

Medical embodied perception enables semantic understanding of critical elements in complex medical environments with high object complexity and strict operational constraints\. As illustrated in Fig\.[5](https://arxiv.org/html/2606.15647#S2.F5), this section reviews four key aspects: medical instrument and organ recognition, surgical and clinical environment perception, medical operation behavior detection, and emotional interaction understanding\.

#### III\-A1 Medical Instrument and Organ Recognition

Medical instrument and organ recognition is a fundamental capability for ensuring operational safety and diagnostic accuracy\[[43](https://arxiv.org/html/2606.15647#bib.bib43),[44](https://arxiv.org/html/2606.15647#bib.bib44)\]\. Agents must reliably identify diverse surgical tools and complex anatomical structures under challenging conditions, including cluttered scenes, occlusion, blood contamination, unstable illumination, and significant organ deformation, which impose high demands on robustness and real\-time performance\.

Existing methods can be broadly categorized into three groups, reflecting different strategies for balancing robustness, data dependency, and computational efficiency\. Convolution\-based image modeling methods are widely used for two\- and three\-dimensional segmentation of instruments and organs via multi\-scale spatial feature modeling\. Representative architectures such as U\-Net\[[45](https://arxiv.org/html/2606.15647#bib.bib45)\] and Transformer\-based variants \(e\.g\., SwinPA\-Net\[[46](https://arxiv.org/html/2606.15647#bib.bib46)\]\) perform well under controlled conditions but, from a robustness perspective, remain sensitive to occlusion, illumination variation, and tissue deformation\. Spatio\-temporal video modeling methods exploit temporal continuity to capture surgical dynamics and improve stability under motion and transient occlusion; however, compared with convolution\-based methods, they typically require large\-scale annotated video data and incur higher computational cost\[[47](https://arxiv.org/html/2606.15647#bib.bib47),[48](https://arxiv.org/html/2606.15647#bib.bib48)\]\. Multimodal fusion–based semantic modeling methods integrate complementary modalities such as vision and language\. For example, SurgVLM\[[49](https://arxiv.org/html/2606.15647#bib.bib49)\] enables prompt\-driven recognition of instruments and anatomical structures but, in contrast to purely visual or spatio\-temporal methods, faces challenges in cross\-modal alignment and inference efficiency\.
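
To make the idea of exploiting temporal continuity concrete, the sketch below smooths per-frame instrument labels with a sliding-window majority vote, so that a single occluded frame does not flip the prediction. This is a deliberately simple stand-in, not the learned spatio-temporal models cited above; the function name `smooth_labels`, the label strings, and the window size are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def smooth_labels(frame_labels: List[str], window: int = 3) -> List[str]:
    """Majority vote over a centered sliding window of per-frame labels."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i, raw in enumerate(frame_labels):
        # Clip the window at the start and end of the sequence.
        lo = max(0, i - half)
        hi = min(len(frame_labels), i + half + 1)
        counts = Counter(frame_labels[lo:hi])
        top = max(counts.values())
        winners = [label for label, c in counts.items() if c == top]
        # On a tie, keep the raw per-frame label if it is among the winners.
        smoothed.append(raw if raw in winners else winners[0])
    return smoothed


if __name__ == "__main__":
    # Hypothetical per-frame outputs with one transient miss at frame 2.
    frames = ["grasper", "grasper", "none", "grasper", "grasper"]
    print(smooth_labels(frames))
```

A fixed window trades latency for stability: a wider window suppresses longer occlusions but delays the response to a genuine instrument change, which is one reason learned temporal models are preferred in practice.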

Discussion: Overall, existing methods address instrument and organ recognition from complementary perspectives, yet their reliance on isolated modeling assumptions limits robustness, efficiency, and generalization under dynamic, safety\-critical surgical conditions\.

#### III\-A2 Surgical and Clinical Environment Perception and Modeling

Surgical and clinical environment perception and modeling aim to provide embodied agents with a structured and global understanding of operating spaces, including room layout, devices, personnel, and dynamic interactions\[[50](https://arxiv.org/html/2606.15647#bib.bib50),[51](https://arxiv.org/html/2606.15647#bib.bib51)\], thereby supporting navigation, collaboration, and task planning\.

Existing approaches can be broadly categorized into three groups, reflecting trade\-offs among geometric fidelity, semantic abstraction, and computational efficiency\. Reconstruction\-based methods recover geometric structures from multi\-view images, depth data, or point clouds\. Approaches such as NeRF\-OR\[[52](https://arxiv.org/html/2606.15647#bib.bib52)\] and Deform3DGS\[[53](https://arxiv.org/html/2606.15647#bib.bib53)\] achieve high\-fidelity reconstruction of surgical environments under static or quasi\-static assumptions, but are limited by occlusion, restricted viewpoints, and high computational cost\. Recent extensions adopt deformation\-aware dynamic 3D Gaussian representations to model non\-rigid scenes\. Methods such as Endo\-HDR\[[54](https://arxiv.org/html/2606.15647#bib.bib54)\] and SurgicalGS\[[55](https://arxiv.org/html/2606.15647#bib.bib55)\] provide temporally consistent geometry under dynamic motion, improving spatial fidelity for navigation and planning, while still facing challenges in real\-time performance, limited supervision, and complex tool–tissue interactions\. Graph\-based relational methods abstract surgical entities and their interactions into semantic scene graphs\. For example, 4D\-OR\[[56](https://arxiv.org/html/2606.15647#bib.bib56)\] models participants, instruments, and spatial relations, while LABRAD\-OR\[[57](https://arxiv.org/html/2606.15647#bib.bib57)\] incorporates temporal memory to capture evolving surgical semantics\. These approaches offer more structured and interpretable representations than reconstruction\-based methods but depend heavily on accurate entity and relation extraction\. Large\-model–based semantic understanding methods leverage vision–language or embodied foundation models to infer spatial layouts and semantic relations\. For instance, Spatial\-ORMLLM\[[58](https://arxiv.org/html/2606.15647#bib.bib58)\] predicts operating room structure directly from RGB inputs, enabling stronger generalization; however, challenges remain in training cost, cross\-modal alignment, and controllability compared with reconstruction\- and graph\-based approaches\.
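
As a rough illustration of what a semantic scene graph stores, the sketch below represents operating-room entities as nodes and their interactions as (subject, predicate, object) triples. It is a minimal data-structure sketch only: the `SceneGraph` class, its methods, and every entity and predicate name are hypothetical and are not taken from the cited datasets or methods.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class SceneGraph:
    """Entities are nodes; each relation is a directed, labeled edge."""
    entities: Set[str] = field(default_factory=set)
    relations: List[Triple] = field(default_factory=list)

    def add_relation(self, subject: str, predicate: str, obj: str) -> None:
        # Adding an edge implicitly registers both endpoints as entities.
        self.entities.update((subject, obj))
        self.relations.append((subject, predicate, obj))

    def query(self, subject: Optional[str] = None,
              predicate: Optional[str] = None) -> List[Triple]:
        # Return all triples matching the given filters; None matches anything.
        return [r for r in self.relations
                if (subject is None or r[0] == subject)
                and (predicate is None or r[1] == predicate)]


def build_example() -> SceneGraph:
    # Hypothetical snapshot of one time step in an operating room.
    graph = SceneGraph()
    graph.add_relation("head_surgeon", "holding", "drill")
    graph.add_relation("drill", "touching", "patient")
    graph.add_relation("nurse", "close_to", "instrument_table")
    return graph


if __name__ == "__main__":
    print(build_example().query(subject="head_surgeon"))
```

Because downstream queries operate on these extracted triples, any error in entity or relation extraction propagates directly into the answers, which is the dependency noted above.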

Discussion: Overall, existing environment perception and modeling methods offer complementary geometric and semantic representations, yet their reliance on isolated reconstruction, relational abstraction, or language\-based inference limits robustness and efficiency in dynamic, cluttered surgical environments\.

#### III\-A3 Medical Operation Behavior Detection

Medical operation behavior detection aims to enable embodied agents to recognize and interpret surgical and clinical actions, providing semantic understanding of procedural workflows, operator intent, and task progression\[[59](https://arxiv.org/html/2606.15647#bib.bib59)\]\. This capability supports real\-time feedback, skill assessment, safety monitoring, and decision support in complex medical procedures\.

Existing approaches can be broadly categorized into three groups, reflecting different strategies for balancing action granularity, temporal context, and semantic richness\. Vision\-based action recognition methods identify atomic surgical actions by extracting spatiotemporal features from videos\. Representative approaches such as MGRFormer\[[60](https://arxiv.org/html/2606.15647#bib.bib60)\] and 3D CNN or SlowFast\-based models\[[61](https://arxiv.org/html/2606.15647#bib.bib61)\] effectively capture fine\-grained motion patterns but, from a robustness perspective, remain sensitive to occlusion, illumination variation, and operator diversity\. Spatiotemporal modeling–based surgical phase inference methods emphasize long\-term temporal dependencies to segment procedural workflows\. For example, TransSG\[[62](https://arxiv.org/html/2606.15647#bib.bib62)\] employs spatiotemporal Transformers to recognize gesture sequences, while STANet\[[63](https://arxiv.org/html/2606.15647#bib.bib63)\] integrates multi\-scale temporal features to improve phase recognition; however, compared with vision\-based action recognition methods, cross\-procedure generalization remains challenging\. Multimodal fusion–based semantic behavior understanding methods jointly analyze visual, haptic, auditory, or physiological signals to infer higher\-level surgical intent and operator state\[[64](https://arxiv.org/html/2606.15647#bib.bib64),[65](https://arxiv.org/html/2606.15647#bib.bib65)\]\. In contrast to purely vision\-based or spatiotemporal methods, these approaches enhance semantic interpretation but introduce increased sensing complexity and system integration challenges\.

Discussion: Overall, existing behavior detection approaches capture complementary aspects of surgical actions, yet their reliance on isolated visual, temporal, or multimodal cues limits robustness and generalization across procedures and operators\.

#### III\-A4 Emotional Interaction Understanding

Emotional interaction understanding enables embodied agents to perceive affective and intention\-related cues in clinical environments, supporting natural and context\-aware communication among healthcare staff, patients, and intelligent systems\[[66](https://arxiv.org/html/2606.15647#bib.bib66),[67](https://arxiv.org/html/2606.15647#bib.bib67)\]\. Such cues are conveyed through speech, facial expressions, body posture, and physiological signals, and are critical for human\-centered medical interaction\[[68](https://arxiv.org/html/2606.15647#bib.bib68),[69](https://arxiv.org/html/2606.15647#bib.bib69)\]\.

Existing approaches can be broadly categorized into three groups, reflecting different strategies for balancing perceptual sensitivity, robustness, and semantic interpretability\. Audio\-visual–based emotion recognition methods infer emotional states by jointly modeling speech and facial cues\. Representative approaches such as DEP\-former\[[70](https://arxiv.org/html/2606.15647#bib.bib70)\] and MSER\[[71](https://arxiv.org/html/2606.15647#bib.bib71)\] capture dynamic emotional variations in clinical communication but, from a robustness perspective, remain sensitive to noise and occlusion caused by protective equipment\. Physiological–behavioral–based affective state estimation methods integrate signals such as heart rate variability, electrodermal activity, and body motion to reflect underlying emotional and stress responses\. For example, dual\-stream representation learning frameworks\[[72](https://arxiv.org/html/2606.15647#bib.bib72)\] and multimodal physiological models\[[73](https://arxiv.org/html/2606.15647#bib.bib73)\] improve recognition accuracy; however, compared with audio\-visual methods, they are influenced by sensor noise and individual variability\. Language\-semantic–based cognitive emotion understanding methods focus on interpreting emotional intent and contextual sentiment in clinical dialogues\. Methods such as MedVLM\-R1\[[74](https://arxiv.org/html/2606.15647#bib.bib74)\] and DialogueLLM\[[75](https://arxiv.org/html/2606.15647#bib.bib75)\] leverage vision–language or large language models to support emotion\-aware reasoning and interaction, but in contrast to perception\-driven approaches, their effectiveness depends on high\-quality linguistic cues and explicit emotional expression\.

Discussion: Overall, existing emotion understanding approaches capture complementary affective cues from audio\-visual, physiological, and linguistic signals, yet their reliance on isolated modalities and assumptions limits robustness and consistency in real\-world clinical interactions\.

![Refer to caption](https://arxiv.org/html/2606.15647v1/x6.png)
Figure 6: Overview of medical embodied decision\-making, including medical workflow modeling and task planning, medical navigation systems, and clinical question\-answering and decision\-support mechanisms\.

### III\-B Medical Embodied Decision\-Making

Medical embodied decision\-making builds on perceptual representations to enable reasoning and planning for clinical tasks\[[76](https://arxiv.org/html/2606.15647#bib.bib76),[77](https://arxiv.org/html/2606.15647#bib.bib77)\], while involving complex temporal dependencies and strong domain constraints\. As illustrated in Fig\.[6](https://arxiv.org/html/2606.15647#S3.F6), this section reviews three directions: medical workflow modeling and task planning, medical navigation, and clinical question answering and decision support\.

#### III\-B1 Medical Workflow Modeling and Task Planning

Medical workflow modeling and task planning aim to capture procedural structure and task dependencies in surgical or diagnostic processes, enabling agents to infer workflow states and generate high\-level action plans\[[78](https://arxiv.org/html/2606.15647#bib.bib78),[79](https://arxiv.org/html/2606.15647#bib.bib79)\]\. This capability bridges perception and action by supporting temporal reasoning, task decomposition, and policy generation\.

Existing approaches can be broadly categorized into three groups, reflecting different strategies for balancing structural explicitness, temporal reasoning, and semantic generalization\. Supervised stage\-based modeling methods learn workflow segmentation and stage recognition from annotated data using temporal convolution or Transformer architectures\. Representative approaches such as Trans\-SVNet\[[80](https://arxiv.org/html/2606.15647#bib.bib80)\] and TeCNO\[[81](https://arxiv.org/html/2606.15647#bib.bib81)\] achieve stable stage recognition but, from a structural modeling perspective, encode workflow structure implicitly in model parameters, limiting explicit representation of long\-term task dependencies\. Temporal graph–based task planning methods explicitly model procedural structure by representing stages, tools, or actions as graph nodes with temporal and semantic relations\. For example, PATG\[[82](https://arxiv.org/html/2606.15647#bib.bib82)\] captures cross\-stage dependencies via position\-aware temporal graphs, while graph\-based interaction modeling methods\[[83](https://arxiv.org/html/2606.15647#bib.bib83)\] encode instrument trajectories over time; however, compared with stage\-based methods, they often rely on manually designed graph schemas and task priors\. Multimodal semantic planning methods leverage large language or vision–language models to perform high\-level task reasoning and decomposition across visual and linguistic modalities\. Methods such as SurgVLM\[[49](https://arxiv.org/html/2606.15647#bib.bib49)\] and LLaVA\-Med\[[84](https://arxiv.org/html/2606.15647#bib.bib84)\] enable language\-driven planning with stronger semantic generalization, but in contrast to graph\-based approaches, face challenges in interpretability and integrating explicit medical knowledge constraints\.
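
The contrast between implicit and explicit workflow structure can be illustrated with a small dependency graph. The sketch below encodes stage prerequisites by hand and derives a valid execution order with Python's standard `graphlib` module (Python 3.9+). It is not the PATG method or any cited approach; the stage names, the `PREREQS` mapping, and the helpers `plan` and `ready_stages` are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical workflow: each stage maps to the stages that must precede it.
PREREQS: Dict[str, Set[str]] = {
    "preparation": set(),
    "dissection": {"preparation"},
    "clipping": {"dissection"},
    "irrigation": {"dissection"},
    "closure": {"clipping", "irrigation"},
}


def plan(prereqs: Dict[str, Set[str]]) -> List[str]:
    """Return one stage ordering that respects every dependency."""
    return list(TopologicalSorter(prereqs).static_order())


def ready_stages(prereqs: Dict[str, Set[str]], completed: Set[str]) -> List[str]:
    """Stages not yet done whose prerequisites have all been completed."""
    return sorted(stage for stage, deps in prereqs.items()
                  if stage not in completed and deps <= completed)


if __name__ == "__main__":
    print(plan(PREREQS))
    print(ready_stages(PREREQS, {"preparation", "dissection"}))
```

The hand-written `PREREQS` mapping is exactly the kind of manually designed schema and task prior that the paragraph identifies as a limitation of graph-based planning: the structure is inspectable, but someone has to author it.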

Discussion: Overall, existing workflow modeling and task planning approaches capture complementary aspects of procedural structure and semantic reasoning, yet their reliance on isolated implicit, graph\-based, or language\-driven representations limits interpretability and robustness in complex clinical workflows\.

#### III\-B2 Medical Navigation Systems

Medical navigation systems connect spatial perception with action execution, enabling embodied agents to perform localization, registration, and path planning in diverse clinical scenarios, including surgical robotics, interventional procedures, and in\-hospital guidance\.

##### Surgical Robotics and Intraoperative Navigation

In surgical settings, navigation systems emphasize precise localization and registration among patients, instruments, and preoperative images, with geometric and image\-registration–based methods remaining dominant\. Systems such as the BrainLab VectorVision Neuronavigation System\[[85](https://arxiv.org/html/2606.15647#bib.bib85)\] and augmented\-reality–based platforms\[[86](https://arxiv.org/html/2606.15647#bib.bib86)\] achieve high accuracy but remain sensitive to tissue deformation, occlusion, and real\-time constraints\.
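
For readers unfamiliar with geometric registration, the sketch below shows the closed-form least-squares solution for aligning paired landmark points under a rigid (rotation plus translation) transform, simplified to 2D. It is a textbook simplification and makes no claim about how the cited systems are implemented; clinical registration works in 3D and must handle noise and unpaired data. The function names and the example coordinates are assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def centroid(points: List[Point]) -> Point:
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def register_rigid_2d(source: List[Point], target: List[Point]) -> Tuple[float, Point]:
    """Least-squares rotation angle (radians) and translation mapping source onto target."""
    if len(source) != len(target) or len(source) < 2:
        raise ValueError("need at least two paired points")
    cs, ct = centroid(source), centroid(target)
    s_dot = 0.0
    s_cross = 0.0
    for (sx, sy), (tx, ty) in zip(source, target):
        # Center both point sets so that only the rotation remains to be solved.
        ax, ay = sx - cs[0], sy - cs[1]
        bx, by = tx - ct[0], ty - ct[1]
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    # The translation carries the rotated source centroid onto the target centroid.
    translation = (ct[0] - (c * cs[0] - s * cs[1]), ct[1] - (s * cs[0] + c * cs[1]))
    return theta, translation


def apply_rigid_2d(points: List[Point], theta: float, translation: Point) -> List[Point]:
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + translation[0], s * x + c * y + translation[1])
            for x, y in points]


def rms_error(a: List[Point], b: List[Point]) -> float:
    """Root-mean-square distance between paired points after alignment."""
    total = sum((ax - bx) ** 2 + (ay - by) ** 2 for (ax, ay), (bx, by) in zip(a, b))
    return math.sqrt(total / len(a))


if __name__ == "__main__":
    # Hypothetical landmarks in image space and the same landmarks in patient space.
    image_pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    patient_pts = [(2.0, 3.0), (2.0, 4.0), (1.0, 3.0)]
    theta, t = register_rigid_2d(image_pts, patient_pts)
    print(theta, t, rms_error(apply_rigid_2d(image_pts, theta, t), patient_pts))
```

The rigid model is also why such methods are sensitive to tissue deformation: once the anatomy deforms, no single rotation and translation can align all landmarks, and the residual error grows.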

##### Interventional Navigation

For minimally invasive and image\-guided interventions, navigation systems must adapt to dynamic anatomy and limited sensing conditions\. Learning\- and optimization\-based approaches introduce machine learning and reinforcement learning to improve adaptability\. For example, RL\-USRegi\[[87](https://arxiv.org/html/2606.15647#bib.bib87)\] enables autonomous ultrasound registration, while inverse reinforcement learning–based methods support catheter and guidewire navigation by imitating expert behavior\[[88](https://arxiv.org/html/2606.15647#bib.bib88)\]\. These approaches enhance flexibility but typically incur high training cost and face sim\-to\-real transfer challenges\.

##### In\-hospital Guidance and Semantic Navigation

Beyond procedural navigation, embodied agents are increasingly used for in\-hospital guidance\. Multimodal semantic methods integrate visual, linguistic, and spatial cues to support language\-driven, environment\-aware navigation\. Systems such as SurgVLM\[[49](https://arxiv.org/html/2606.15647#bib.bib49)\], NavGPT\[[89](https://arxiv.org/html/2606.15647#bib.bib89)\], and NavGPT\-2\[[90](https://arxiv.org/html/2606.15647#bib.bib90)\]demonstrate this capability but face challenges in semantic ambiguity, cross\-modal alignment, and real\-time performance\. Recent studies combine bird’s\-eye\-view perception or scene maps with large language models to improve instruction generation and controllability\[[91](https://arxiv.org/html/2606.15647#bib.bib91),[92](https://arxiv.org/html/2606.15647#bib.bib92)\]\.

Discussion: Overall, existing medical navigation systems address complementary aspects of localization, planning, and semantic guidance, yet their reliance on isolated geometric, learning\-based, or language\-driven paradigms limits robustness and real\-time reliability in dynamic clinical environments\.

![Refer to caption](https://arxiv.org/html/2606.15647v1/x7.png) Figure 7: Overview of medical embodied action, including medical imitation\-based action, medical reinforcement\-based action, and medical large\-model\-driven action\.

#### III\-B3 Clinical Question Answering and Decision Support

Clinical question answering and decision support enable embodied agents to reason over multimodal clinical information and provide interpretable recommendations or action guidance, bridging task planning and execution in real\-world medical settings\[[93](https://arxiv.org/html/2606.15647#bib.bib93),[94](https://arxiv.org/html/2606.15647#bib.bib94)\]\.

Existing approaches can be broadly categorized into three groups, reflecting different strategies for balancing predictive accuracy, interpretability, and interaction flexibility\. Prediction\-based decision support methods apply machine learning models to clinical records or imaging data for risk assessment, outcome prediction, and treatment planning\. While providing quantifiable decision cues, they often lack transparency and interactive reasoning capability\[[95](https://arxiv.org/html/2606.15647#bib.bib95)\]\. Language\-model–based decision support methods leverage natural language question answering to assist case interpretation and clinical decision\-making, improving human–machine communication; however, compared with prediction\-based methods, they remain constrained by medical specialization, interpretability, and safety concerns\[[96](https://arxiv.org/html/2606.15647#bib.bib96)\]\. Multimodal fusion–based decision support methods integrate imaging, clinical text, behavioral, and physiological signals to enable comprehensive reasoning across patient states and clinical workflows\[[97](https://arxiv.org/html/2606.15647#bib.bib97),[98](https://arxiv.org/html/2606.15647#bib.bib98)\]\. In contrast to unimodal prediction or language\-based approaches, these methods enhance contextual completeness but face challenges in data standardization, real\-time processing, and deployment reliability\.

Discussion: Overall, existing clinical question answering and decision support approaches provide complementary predictive and reasoning capabilities, yet their reliance on isolated predictive, language\-based, or multimodal frameworks limits interpretability, safety, and reliable deployment in real\-world clinical settings\.

### III\-C Medical Embodied Action

Medical embodied action focuses on executing perceptual and decision outputs through physical interaction, enabling embodied agents to autonomously perform medical procedures under strict precision and safety constraints\[[99](https://arxiv.org/html/2606.15647#bib.bib99),[100](https://arxiv.org/html/2606.15647#bib.bib100)\]\. As illustrated in Fig\.[7](https://arxiv.org/html/2606.15647#S3.F7), this section reviews three representative paradigms: medical imitation\-based action, medical reinforcement\-based action, and medical large\-model\-driven action\.

#### III\-C1 Medical Imitation\-Based Action

Medical imitation\-based action aims to transfer expert surgical skills to embodied agents through demonstrations, providing a safe and sample\-efficient alternative to trial\-and\-error learning in high\-risk clinical settings\[[101](https://arxiv.org/html/2606.15647#bib.bib101)\]\.

Existing approaches can be broadly categorized into three groups, reflecting trade\-offs among sample efficiency, robustness, and adaptability\. Behavior cloning \(BC\) learns state–action mappings from expert demonstrations and is widely used for basic surgical skills\. Methods such as the Surgical Robot Transformer \(SRT\)\[[102](https://arxiv.org/html/2606.15647#bib.bib102)\]and Intermittent Visual Servoing\[[103](https://arxiv.org/html/2606.15647#bib.bib103)\]achieve effective imitation via relative\-action representations and visual closed\-loop control, while SuFIA\-BC\[[104](https://arxiv.org/html/2606.15647#bib.bib104)\]improves generalization through synthetic demonstrations; however, BC remains sensitive to demonstration quality and distributional shift\. Inverse reinforcement learning \(IRL\) and adversarial imitation learning infer implicit expert objectives to produce more robust policies\. Examples include steerable\-needle path planning with learned reward functions\[[105](https://arxiv.org/html/2606.15647#bib.bib105)\]and adversarial trajectory imitation for catheter insertion\[[106](https://arxiv.org/html/2606.15647#bib.bib106)\]\. Compared with BC, these methods improve robustness but incur higher training complexity and computational cost\. Hybrid imitation–reinforcement learning combines demonstration\-based initialization with reinforcement refinement to enhance adaptability\. Systems such as surgeon\-preference\-aware ophthalmic assistants\[[107](https://arxiv.org/html/2606.15647#bib.bib107)\]and the ILLC framework for laparoscope control\[[108](https://arxiv.org/html/2606.15647#bib.bib108)\]enable personalized adaptation and cross\-scenario generalization, but introduce additional system complexity and tuning overhead\.

Discussion: Overall, imitation\-based action approaches capture complementary strengths in sample efficiency, robustness, and adaptability, yet their reliance on fixed demonstrations or implicit reward assumptions limits generalization and stability under distributional shift in real\-world clinical scenarios\.
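
The behavior cloning idea can be sketched in a few lines\. The example below is a toy under stated assumptions, not the method of SRT or any other cited system: the state is a single tool\-to\-target offset, the "expert" is a hypothetical proportional controller, and the policy is a linear least\-squares fit rather than a neural network\.

```python
def expert_action(offset):
    """Hypothetical expert: move the tool halfway toward the target each step."""
    return -0.5 * offset


def fit_linear_policy(demos):
    """Ordinary least-squares fit of action = w * state + b on (state, action) pairs."""
    n = len(demos)
    mean_s = sum(s for s, _ in demos) / n
    mean_a = sum(a for _, a in demos) / n
    var_s = sum((s - mean_s) ** 2 for s, _ in demos)
    cov = sum((s - mean_s) * (a - mean_a) for s, a in demos)
    w = cov / var_s
    return w, mean_a - w * mean_s


def rollout(w, b, offset, steps=10):
    """Apply the cloned policy in closed loop and return the final offset."""
    for _ in range(steps):
        offset += w * offset + b
    return offset


# Demonstrations: states visited by the expert and the actions it took there.
demos = [(s, expert_action(s)) for s in (-2.0, -1.0, 0.0, 1.0, 2.0)]
w, b = fit_linear_policy(demos)
final_offset = rollout(w, b, offset=4.0)
```

The rollout starts outside the demonstrated range and still converges only because the toy expert is exactly linear\. With a learned nonlinear policy there is no such guarantee, which is the distributional\-shift sensitivity discussed above\.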

#### III\-C2 Medical Reinforcement\-Based Action

Medical reinforcement\-based action leverages reinforcement learning to optimize control policies through interaction with medical environments, enabling embodied agents to achieve adaptive and autonomous execution beyond imitation\-based strategies\[[109](https://arxiv.org/html/2606.15647#bib.bib109)\]\. This paradigm is particularly suited for complex surgical tasks requiring continuous control and dynamic adaptation\.

Existing approaches can be broadly categorized into three groups, reflecting different strategies for balancing sample efficiency, control stability, and scalability\. Value\-based methods learn state–action value functions to guide policy optimization and are suitable for relatively low\-dimensional decision problems\. Representative work such as Collaborative Suturing\[[110](https://arxiv.org/html/2606.15647#bib.bib110)\]applies Q\-learning to enable autonomous handover actions but, from a reward design perspective, remains sensitive to handcrafted rewards and limited sample efficiency\. Policy\-based methods directly optimize control policies for continuous action spaces and exhibit stable learning behavior\. For example, LapGym\[[111](https://arxiv.org/html/2606.15647#bib.bib111)\]employs PPO to learn laparoscopic manipulation policies, while A3C\-based approaches support navigation and instrument control in virtual intervention settings\[[112](https://arxiv.org/html/2606.15647#bib.bib112)\]; however, compared with value\-based methods, they typically require substantial training data and computational resources\. Actor–Critic methods combine value estimation with policy learning to improve stability and efficiency in high\-dimensional environments\. Representative approaches such as AC\-SSIL\[[113](https://arxiv.org/html/2606.15647#bib.bib113)\]and CASOG\[[114](https://arxiv.org/html/2606.15647#bib.bib114)\]integrate imitation guidance or conservative value estimation to enhance robustness and sample efficiency, but in contrast to purely policy\-based approaches, demand careful tuning to balance dual learning objectives\.

Discussion: Overall, reinforcement\-based action methods provide strong adaptability and control flexibility, yet their reliance on carefully designed rewards and large\-scale interaction limits sample efficiency, stability, and practical deployment in real\-world clinical environments\.
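
For readers less familiar with value\-based methods, the following minimal sketch shows tabular Q\-learning on a toy one\-dimensional task\. It is not the Collaborative Suturing setup or any cited environment: the five\-state chain, the sparse handcrafted reward, and all hyperparameters are assumptions chosen only to make the update rule visible\.

```python
import random

N_STATES = 5      # positions 0..4; position 4 is the goal (terminal)
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic toy dynamics with reward 1.0 only on reaching the goal."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy action selection; ties are broken at random.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            nxt, reward, done = step(state, action)
            # Q-learning update: bootstrap from the best action in the next state.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

Even in this tiny problem the agent needs many episodes of interaction driven by a single handcrafted reward, which hints at why reward design and sample efficiency become limiting factors on real surgical tasks\.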

#### III\-C3 Medical Large\-Model\-Driven Action

Medical large\-model\-driven action leverages vision–language–action foundation models to map high\-level semantic understanding to executable medical actions, enabling embodied agents to perform complex tasks with strong generalization under limited supervision\[[115](https://arxiv.org/html/2606.15647#bib.bib115),[116](https://arxiv.org/html/2606.15647#bib.bib116),[117](https://arxiv.org/html/2606.15647#bib.bib117)\]\.

Existing approaches can be broadly categorized into three paradigms, reflecting different strategies for balancing alignment fidelity, planning flexibility, and adaptation efficiency\. Multimodal perception–action alignment methods jointly model visual observations and action trajectories to enable end\-to\-end action generation\. Representative models such as SurgicalGPT\[[118](https://arxiv.org/html/2606.15647#bib.bib118)\]and SurgVLM\[[49](https://arxiv.org/html/2606.15647#bib.bib49)\]integrate surgical videos and language supervision to predict action sequences and procedural steps but, from a controllability perspective, remain sensitive to cross\-modal misalignment and execution uncertainty\. Language\-conditioned task planning methods utilize natural language instructions or structured knowledge to guide action generation under semantic constraints\. For example, LLaVA\-Med\[[84](https://arxiv.org/html/2606.15647#bib.bib84)\]and Med\-Flamingo\[[119](https://arxiv.org/html/2606.15647#bib.bib119)\]enable instruction\-driven execution for tasks such as instrument alignment and path planning; however, compared with perception–action alignment methods, they rely more heavily on accurate language grounding and task specification\. Cross\-modal transfer–based few\-shot execution methods focus on rapid adaptation to unseen tasks through parameter\-efficient tuning or sim\-to\-real transfer\. Systems such as RoboNurse\-VLA\[[120](https://arxiv.org/html/2606.15647#bib.bib120)\]demonstrate effective real\-time instrument manipulation under verbal guidance, but in contrast to planning\-driven approaches, often trade explicit task reasoning for adaptability and scalability\.

Discussion: Overall, large\-model\-driven action methods offer strong semantic generalization and planning flexibility, yet their reliance on cross\-modal alignment and implicit reasoning limits controllability, reliability, and safe execution in real\-world clinical settings\.

TABLE II: Integrated application systems of medical embodied AI\.

| Category | System/Robot | Setting | Core functions | Autonomy |
| --- | --- | --- | --- | --- |
| Surgical Robot | da Vinci\[[121](https://arxiv.org/html/2606.15647#bib.bib121)\] | OR | Surgeon\-controlled manipulation; improves precision and consistency | Teleop |
| | EMARO\[[122](https://arxiv.org/html/2606.15647#bib.bib122)\] | OR | Endoscope stabilization and positioning; reduces assistant burden | Shared |
| | ROSA\[[123](https://arxiv.org/html/2606.15647#bib.bib123)\] | OR | Image\-guided navigation and positioning; improves placement accuracy | Assisted |
| | PRECEYES\[[124](https://arxiv.org/html/2606.15647#bib.bib124)\] | OR | Micron\-level manipulation; tremor suppression | Teleop |
| | ROBODOC\[[125](https://arxiv.org/html/2606.15647#bib.bib125)\] | OR | High\-precision bone preparation for improved implant fit | Supervised |
| | Symani\[[126](https://arxiv.org/html/2606.15647#bib.bib126)\] | OR | Enhanced micro\-scale dexterity and consistency | Teleop |
| | CyberKnife\[[127](https://arxiv.org/html/2606.15647#bib.bib127)\] | RT suite | Image\-guided adaptive radiation delivery | Automated |
| | Monarch\[[128](https://arxiv.org/html/2606.15647#bib.bib128)\] | Endoscopy | Stable access for peripheral lung sampling | Assisted |
| | Flex\[[129](https://arxiv.org/html/2606.15647#bib.bib129)\] | OR | Flexible access in confined anatomy | Teleop |
| | CorPath GRX\[[130](https://arxiv.org/html/2606.15647#bib.bib130)\] | Cath lab | Precise device manipulation; reduced radiation exposure | Teleop |
| Intelligent Caregiving & Companion Robot | PARO\[[131](https://arxiv.org/html/2606.15647#bib.bib131)\] | Ward/Home | Emotional interaction and engagement | Interactive |
| | AIBO\[[132](https://arxiv.org/html/2606.15647#bib.bib132)\] | Home | Social companionship and engagement | Interactive |
| | Pepper\[[133](https://arxiv.org/html/2606.15647#bib.bib133)\] | Hospital | Social interaction and basic assistance | Interactive |
| | ElliQ\[[134](https://arxiv.org/html/2606.15647#bib.bib134)\] | Home | Daily support, reminders, and engagement | Proactive |
| | Arash\[[135](https://arxiv.org/html/2606.15647#bib.bib135)\] | Hospital | Affective interaction; anxiety reduction | Interactive |
| | Robear\[[136](https://arxiv.org/html/2606.15647#bib.bib136)\] | Care | Physical assistance \(e\.g\., lifting, transfer\) | Assisted |
| | Giraff\[[137](https://arxiv.org/html/2606.15647#bib.bib137)\] | Telecare | Remote presence for distributed care | Telepresence |
| | Telenoid\[[138](https://arxiv.org/html/2606.15647#bib.bib138)\] | Telecare | Embodied communication proxy | Telepresence |
| Immersive Medical Education Platform | Touch Surgery\[[139](https://arxiv.org/html/2606.15647#bib.bib139)\] | Training | Procedural cognition and decision rehearsal | – |
| | Body Interact\[[140](https://arxiv.org/html/2606.15647#bib.bib140)\] | Training | Decision\-making training with feedback | – |
| | VirtaMed ArthroS\[[141](https://arxiv.org/html/2606.15647#bib.bib141)\] | Training | Operative skill learning and assessment | – |
| | Vimedix 3\.2\[[142](https://arxiv.org/html/2606.15647#bib.bib142)\] | Training | Scanning skill training and evaluation | – |
| | OSSO\[[143](https://arxiv.org/html/2606.15647#bib.bib143)\] | Training | Skill acquisition and competency tracking | – |
| | 3D Organon VR\[[144](https://arxiv.org/html/2606.15647#bib.bib144)\] | Training | Spatial anatomy understanding | – |
| Telecollaborative Diagnostic Treatment System | Teladoc Mini Cart\[[145](https://arxiv.org/html/2606.15647#bib.bib145)\] | Remote | Remote examination and consultation | Telepresence |
| | Mercy Telehealth\[[146](https://arxiv.org/html/2606.15647#bib.bib146)\] | Tele\-ICU | Distributed monitoring and decision support | Telemedicine |
| | VSee\[[147](https://arxiv.org/html/2606.15647#bib.bib147)\] | Telehealth | Multimodal communication and care coordination | Telemedicine |
| | Creyos\[[148](https://arxiv.org/html/2606.15647#bib.bib148)\] | Remote | Cognitive assessment and longitudinal monitoring | Workflow |

### III\-D Integrated Application Scenarios in Healthcare

Integrated application scenarios demonstrate how medical embodied AI realizes closed\-loop perception, decision\-making, and action in real clinical settings, marking a key step toward practical healthcare deployment\. This section reviews four application domains: surgical robots, intelligent caregiving and companion robots, immersive medical education platforms, and telecollaborative diagnostic and treatment systems \(Table[II](https://arxiv.org/html/2606.15647#S3.T2)\)\.

#### III\-D1 Surgical Robot

Surgical robots represent the most mature and widely deployed form of medical embodied AI, integrating multimodal perception, intelligent decision\-making, and precise action execution to support complex clinical procedures\. These systems are extensively applied in minimally invasive surgery, neurosurgery, orthopedics, ophthalmology, and interventional medicine, where high precision and safety are required\.

Representative platforms include the da Vinci Surgical System\[[121](https://arxiv.org/html/2606.15647#bib.bib121),[149](https://arxiv.org/html/2606.15647#bib.bib149),[150](https://arxiv.org/html/2606.15647#bib.bib150)\], which dominates minimally invasive surgery through teleoperated multi\-degree\-of\-freedom manipulation and high\-definition 3D vision; EMARO\[[122](https://arxiv.org/html/2606.15647#bib.bib122)\] and ROSA\[[123](https://arxiv.org/html/2606.15647#bib.bib123)\], which provide image\-guided assistance and autonomous positioning in endoscopic and neurosurgical procedures; PRECEYES\[[124](https://arxiv.org/html/2606.15647#bib.bib124)\] and Symani\[[126](https://arxiv.org/html/2606.15647#bib.bib126)\], which enable microscale manipulation via motion scaling and tremor suppression; and CyberKnife\[[127](https://arxiv.org/html/2606.15647#bib.bib127)\], which integrates real\-time imaging with robotic radiosurgery\. Flexible and catheter\-based systems, such as Monarch\[[128](https://arxiv.org/html/2606.15647#bib.bib128)\], Flex\[[129](https://arxiv.org/html/2606.15647#bib.bib129)\], and CorPath GRX\[[130](https://arxiv.org/html/2606.15647#bib.bib130)\], further extend embodied intelligence to pulmonary, transoral, and cardiovascular interventions\. Together, these systems illustrate how embodied AI principles are instantiated in real\-world surgical practice\.
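
Motion scaling and tremor suppression can be illustrated with a very small signal\-processing sketch\. The code below is not the control law of PRECEYES, Symani, or any commercial system: the 10:1 scale factor, the exponential moving\-average filter, the 10 Hz tremor sampled at 100 Hz, and the one\-dimensional setting are all assumptions for illustration\.

```python
import math


def scale_and_filter(hand_positions, scale=0.1, smoothing=0.2):
    """Map 1D surgeon hand positions to tool positions.

    scale: motion-scaling factor (0.1 means 10:1 down-scaling, an assumed value).
    smoothing: weight of the newest sample in an exponential moving average;
    smaller values suppress more high-frequency motion at the cost of more lag.
    """
    tool_positions = []
    filtered = hand_positions[0]
    for x in hand_positions:
        filtered = smoothing * x + (1.0 - smoothing) * filtered
        tool_positions.append(scale * filtered)
    return tool_positions


def peak_to_peak(values):
    return max(values) - min(values)


# Hypothetical pure tremor: 0.5 mm amplitude at 10 Hz, sampled at 100 Hz for 1 s.
tremor = [0.5 * math.sin(2.0 * math.pi * 10.0 * i / 100.0) for i in range(100)]
tool = scale_and_filter(tremor)

# Fraction of the tremor that survives filtering, beyond what scaling alone removes.
residual_ratio = peak_to_peak(tool) / (0.1 * peak_to_peak(tremor))
```

Scaling alone shrinks every motion by the same factor, intended or not; the low\-pass stage is what attenuates the oscillation further\. A moving average is the simplest possible choice and introduces lag on deliberate motion, so it should be read as a placeholder for more careful filter design\.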

![Refer to caption](https://arxiv.org/html/2606.15647v1/x8.png) Figure 8: Intelligent Caregiving and Companion Robot\. \(a\) Pepper\[[133](https://arxiv.org/html/2606.15647#bib.bib133)\]\. \(b\) Arash\[[135](https://arxiv.org/html/2606.15647#bib.bib135)\]\. \(c\) Giraff\[[137](https://arxiv.org/html/2606.15647#bib.bib137)\]\. \(d\) AIBO\[[132](https://arxiv.org/html/2606.15647#bib.bib132)\]\. \(e\) PARO\[[131](https://arxiv.org/html/2606.15647#bib.bib131)\]\. \(f\) Robear\[[136](https://arxiv.org/html/2606.15647#bib.bib136)\]\. \(g\) ElliQ\[[134](https://arxiv.org/html/2606.15647#bib.bib134)\]\. \(h\) Telenoid\[[138](https://arxiv.org/html/2606.15647#bib.bib138)\]\.

#### III\-D2 Intelligent Caregiving and Companion Robot

Intelligent caregiving and companion robots extend medical embodied AI from clinical procedures to daily care and long\-term assistance, providing continuous support in hospital wards, rehabilitation centers, and eldercare environments\. As shown in Fig\.[8](https://arxiv.org/html/2606.15647#S3.F8), these systems emphasize robust human–robot interaction, contextual awareness, and emotional engagement while complying with medical safety and ethical constraints\.

Existing systems can be broadly categorized into three groups\. Emotional companion robots focus on social interaction and affective support through multimodal perception and dialogue\. Representative platforms such as PARO\[[131](https://arxiv.org/html/2606.15647#bib.bib131)\], AIBO\[[132](https://arxiv.org/html/2606.15647#bib.bib132)\], Pepper\[[133](https://arxiv.org/html/2606.15647#bib.bib133)\], ElliQ\[[134](https://arxiv.org/html/2606.15647#bib.bib134)\], and Arash\[[135](https://arxiv.org/html/2606.15647#bib.bib135)\]have been deployed in dementia care, pediatric wards, and home\-based eldercare to enhance emotional well\-being and engagement\. Physical assistance robots provide direct support for patient transfer, posture adjustment, and mobility assistance using compliant control and safe human–robot interaction mechanisms\. Robear\[[136](https://arxiv.org/html/2606.15647#bib.bib136)\]exemplifies this category by enabling stable lifting and transfer of patients with limited mobility\. Telepresence companion robots facilitate remote caregiving and social connection by integrating communication interfaces with embodied mobility\. Systems such as Giraff\[[137](https://arxiv.org/html/2606.15647#bib.bib137)\]and Telenoid\[[138](https://arxiv.org/html/2606.15647#bib.bib138)\]support remote ward rounds, monitoring, and emotional communication, extending the reach of caregivers and clinicians\.

#### III\-D3 Immersive Medical Education Platform

Immersive medical education platforms leverage virtual, augmented, and mixed reality technologies to provide safe, repeatable, and standardized training for medical education, overcoming the limitations of traditional experience\-based instruction\. By reconstructing anatomical structures, clinical procedures, and pathological processes, these platforms support efficient skill acquisition, remote learning, and objective assessment\.

Existing systems can be broadly categorized into three groups\. Cognitive training platforms emphasize clinical reasoning and decision\-making through interactive simulations; representative systems such as Touch Surgery\[[139](https://arxiv.org/html/2606.15647#bib.bib139)\]and Body Interact\[[140](https://arxiv.org/html/2606.15647#bib.bib140)\]support rehearsal of procedural logic, case analysis, and emergency decision\-making in virtual scenarios\. Operative skill training platforms focus on hands\-on procedural practice with realistic visual and haptic feedback\. Platforms including VirtaMed ArthroS\[[141](https://arxiv.org/html/2606.15647#bib.bib141)\]and Vimedix 3\.2\[[142](https://arxiv.org/html/2606.15647#bib.bib142)\]enable arthroscopy and ultrasound training via high\-fidelity simulation and automated performance evaluation\. Comprehensive simulation platforms integrate anatomy visualization with multidisciplinary education\. Systems such as OSSO\[[143](https://arxiv.org/html/2606.15647#bib.bib143)\]and 3D Organon VR Anatomy\[[144](https://arxiv.org/html/2606.15647#bib.bib144)\]provide immersive exploration of anatomical structures and spatial relationships, supporting standardized education and foundational skill development across medical disciplines\.

#### III\-D4 Telecollaborative Diagnostic and Treatment System

Telecollaborative diagnostic and treatment systems enable distributed clinical collaboration by integrating sensing, communication, and interaction technologies, thereby overcoming geographic constraints and improving access to high\-quality healthcare services\. These systems support remote consultation, monitoring, and decision\-making in scenarios such as emergency response, primary care assistance, and multidisciplinary collaboration\.

Representative platforms include Teladoc Mini Cart\[[145](https://arxiv.org/html/2606.15647#bib.bib145)\], which facilitates real\-time clinician–patient interaction for telemedicine services; Mercy Telehealth\[[146](https://arxiv.org/html/2606.15647#bib.bib146)\], which enables continuous remote monitoring and guidance in intensive care settings; and VSee\[[147](https://arxiv.org/html/2606.15647#bib.bib147)\], which supports secure, low\-bandwidth video\-based collaboration for pathology consultation and medical education\. In addition, Creyos\[[148](https://arxiv.org/html/2606.15647#bib.bib148)\] provides remote cognitive assessment and quantitative analysis, illustrating the extension of telecollaborative systems to specialized diagnostic tasks\. Together, these systems demonstrate how embodied AI technologies can be integrated into distributed healthcare workflows to enhance collaboration efficiency and resource sharing\.

![Refer to caption](https://arxiv.org/html/2606.15647v1/x9.png) Figure 9: The datasets of medical embodied AI, which are categorized into perception, decision\-making, action, and simulation & synthetic\.

## IV Datasets and Benchmarks

High\-quality datasets are essential for advancing medical embodied AI\. This section reviews representative publicly available datasets, organized according to the three technical layers of perception, decision\-making, and action, together with simulation and synthetic data resources \(Fig\.[9](https://arxiv.org/html/2606.15647#S3.F9)\)\.

### IV\-A Perception Datasets

#### IV\-A1 Organ and Instrument Recognition Datasets

Organ recognition datasets primarily support segmentation and classification across diverse anatomical regions\. Public benchmarks cover major organs and pathological structures in the brain, thorax, abdomen, musculoskeletal system, and skin\. Representative datasets include VerSe, SPIDER, CTSpine1K, and CTPelvic1K for spinal and skeletal analysis\[[151](https://arxiv.org/html/2606.15647#bib.bib151),[152](https://arxiv.org/html/2606.15647#bib.bib152),[153](https://arxiv.org/html/2606.15647#bib.bib153),[154](https://arxiv.org/html/2606.15647#bib.bib154)\], ISIC and PH2 for dermatological lesion analysis\[[155](https://arxiv.org/html/2606.15647#bib.bib155),[156](https://arxiv.org/html/2606.15647#bib.bib156)\], and TotalSegmentator\[[157](https://arxiv.org/html/2606.15647#bib.bib157)\] for large\-scale multi\-organ annotation\. For whole\-body and oncological analysis, datasets such as ULS\[[158](https://arxiv.org/html/2606.15647#bib.bib158)\] and AutoPET\[[159](https://arxiv.org/html/2606.15647#bib.bib159)\] support automated tumor segmentation and systemic disease assessment\.

Medical instrument recognition datasets focus on detecting, segmenting, and tracking surgical tools, forming a foundation for surgical automation and intraoperative assistance\. Dedicated datasets include RIS\[[160](https://arxiv.org/html/2606.15647#bib.bib160)\]for robotic instruments, UW\-Sinus\-Surgery\-C/L\[[161](https://arxiv.org/html/2606.15647#bib.bib161)\]for endoscopic sinus surgery, and SegSTRONG\-C2024\[[162](https://arxiv.org/html/2606.15647#bib.bib162)\], which emphasizes robustness under challenging imaging conditions\. In addition, large\-scale surgical video datasets such as Cholec80\[[163](https://arxiv.org/html/2606.15647#bib.bib163)\], CholecTrack20\[[164](https://arxiv.org/html/2606.15647#bib.bib164)\], Endoscapes\[[165](https://arxiv.org/html/2606.15647#bib.bib165)\], and m2caiSeg\[[166](https://arxiv.org/html/2606.15647#bib.bib166)\]provide multi\-task annotations for instrument recognition, tracking, and scene understanding\.

#### IV\-A2 Medical Scene Modeling Datasets

Medical scene modeling datasets support the perception and understanding of spatial layouts, equipment distribution, and human activities in clinical environments such as operating rooms and hospital wards\. These datasets enable research ranging from semantic perception to 3D reconstruction and environment\-aware interaction\.

Representative datasets include MM\-OR\[[167](https://arxiv.org/html/2606.15647#bib.bib167)\], a large\-scale multimodal benchmark for operating room scene understanding that supports tasks such as semantic segmentation and scene graph construction\. The xawAR16 dataset\[[168](https://arxiv.org/html/2606.15647#bib.bib168)\]provides RGB\-D images and precise pose annotations for evaluating visual localization and 3D mapping in mixed\-reality surgical environments\. HIOD\[[169](https://arxiv.org/html/2606.15647#bib.bib169)\]and MCIndoor20000\[[170](https://arxiv.org/html/2606.15647#bib.bib170)\]focus on object detection and structural recognition in hospital interiors, supporting indoor perception and navigation research\. In addition, MYNursingHome\[[171](https://arxiv.org/html/2606.15647#bib.bib171)\]targets elderly care scenarios, enabling scene understanding and assistive interaction in long\-term care environments\.

#### IV\-A3 Clinical Action and Pose Estimation Datasets

Clinical action and pose estimation datasets support the modeling of human motion, posture, and interactive behaviors in medical environments, enabling embodied agents to understand clinical activities during diagnosis, nursing, and surgical procedures\.

Representative datasets include MVOR\[[172](https://arxiv.org/html/2606.15647#bib.bib172)\], which provides multi\-view RGB\-D recordings from real operating rooms for 3D pose estimation and multi\-person tracking; PatientPose\[[173](https://arxiv.org/html/2606.15647#bib.bib173)\], which offers upper\-body pose annotations from long\-term clinical recordings to support patient motion analysis; and MMD\-MSD\[[174](https://arxiv.org/html/2606.15647#bib.bib174)\], which integrates vision and wearable sensor data to model posture and physiological states in healthcare\-related activities\. In addition, Instrument3D\[[175](https://arxiv.org/html/2606.15647#bib.bib175)\] supports precise 3D tracking of surgical instruments, facilitating fine\-grained analysis of surgical actions\.

#### IV\-A4 Multimodal Affective Perception Datasets

Multimodal affective perception datasets provide synchronized signals from physiological, behavioral, and visual modalities to support emotion recognition, affective modeling, and human–computer interaction in embodied AI\.

Representative datasets include AFFEC\[[176](https://arxiv.org/html/2606.15647#bib.bib176)\], which integrates eye tracking, facial action units, galvanic skin response, and personality traits for multimodal emotion classification; MERR\[[177](https://arxiv.org/html/2606.15647#bib.bib177)\] and Mixed Emotion Recognition\[[178](https://arxiv.org/html/2606.15647#bib.bib178)\], which support coarse\- and fine\-grained emotion recognition from multimodal signals; and SEED\-VII\[[179](https://arxiv.org/html/2606.15647#bib.bib179)\], which offers EEG and eye\-tracking data with continuous emotion intensity labels\. In addition, ASCERTAIN\[[180](https://arxiv.org/html/2606.15647#bib.bib180)\] and DREAMER\[[181](https://arxiv.org/html/2606.15647#bib.bib181)\] combine EEG, ECG, and peripheral physiological signals with affective annotations, enabling research on emotion–personality relationships and practical emotion recognition in real\-world settings\.

### IV\-B Decision\-Making Datasets

#### IV\-B1 Surgical Workflow Annotation Datasets

Surgical workflow annotation datasets provide structured temporal labels for modeling procedural phases, task dependencies, and decision logic, forming a key data foundation for surgical automation and decision\-making\.

Representative datasets include the CHOLECT dataset series\[[164](https://arxiv.org/html/2606.15647#bib.bib164)\], which supports fine\-grained action recognition through triplet annotations of instruments, actions, and targets in laparoscopic cholecystectomy; OphNet\[[182](https://arxiv.org/html/2606.15647#bib.bib182)\], which offers hierarchical phase and action annotations for ophthalmic surgeries; and AutoLaparo\[[183](https://arxiv.org/html/2606.15647#bib.bib183)\], which integrates workflow recognition with motion prediction and image segmentation for hysterectomy procedures\. In addition, LapEx\[[184](https://arxiv.org/html/2606.15647#bib.bib184)\] focuses on sleeve gastrectomy with activity, scene, and skill assessment labels, while MISAW\[[185](https://arxiv.org/html/2606.15647#bib.bib185)\] combines synchronized video and kinematic data with phase\-level annotations to support multimodal analysis of minimally invasive vascular anastomosis\.
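
To show what a triplet annotation of instruments, actions, and targets might look like as a data structure, here is a minimal sketch\. The class name, field names, label strings, and frame layout are hypothetical and do not reproduce the schema or label set of any cited dataset\.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionTriplet:
    """One <instrument, action, target> label; frozen so it can be counted and hashed."""
    instrument: str
    action: str
    target: str


# Hypothetical per-frame labels; a frame may contain several concurrent triplets.
frames = {
    0: [ActionTriplet("grasper", "retract", "gallbladder")],
    1: [
        ActionTriplet("grasper", "retract", "gallbladder"),
        ActionTriplet("hook", "dissect", "cystic_duct"),
    ],
    2: [ActionTriplet("clipper", "clip", "cystic_duct")],
}


def triplet_counts(frames):
    """Count how often each full triplet occurs across all frames."""
    return Counter(t for labels in frames.values() for t in labels)


def frames_with_target(frames, target):
    """Return sorted frame indices in which any instrument acts on `target`."""
    return sorted(i for i, labels in frames.items() if any(t.target == target for t in labels))


counts = triplet_counts(frames)
duct_frames = frames_with_target(frames, "cystic_duct")
```

Keeping the three components as separate fields makes it possible to evaluate a model on each component alone as well as on the full triplet\.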

#### IV\-B2 Medical Navigation Datasets

Medical navigation datasets support localization, path planning, and spatial understanding for embodied agents in surgical and clinical environments\.

Representative datasets include the Portable 6D Surgical Instrument Magnetic Localization Dataset\[[186](https://arxiv.org/html/2606.15647#bib.bib186)\], which provides six\-degree\-of\-freedom instrument tracking data for minimally invasive surgical navigation; the Head Model Collection for Mixed Reality Navigation\[[187](https://arxiv.org/html/2606.15647#bib.bib187)\], which offers CT/MRI\-derived anatomical models for mixed\-reality neurosurgical guidance; and large\-scale 3D environment datasets such as Gibson\[[188](https://arxiv.org/html/2606.15647#bib.bib188)\]and Habitat\-Matterport 3D\[[189](https://arxiv.org/html/2606.15647#bib.bib189)\], which provide realistic indoor digital twins to support research on navigation, spatial reasoning, and autonomous planning in medical and hospital\-like settings\.

#### IV\-B3 Medical Question Answering Datasets

Medical question answering datasets provide high\-level semantic understanding and reasoning supervision for embodied AI, supporting clinical decision\-making and multimodal interaction\.

Representative datasets include SSG\-VQA\[[190](https://arxiv.org/html/2606.15647#bib.bib190)\], which constructs visual question answering benchmarks from laparoscopic videos using structured surgical scene graphs; and ERVQA\[[191](https://arxiv.org/html/2606.15647#bib.bib191)\], which focuses on emergency room scenarios to evaluate vision–language reasoning in real clinical environments\. Knowledge\-centric reasoning datasets such as MedReason\[[192](https://arxiv.org/html/2606.15647#bib.bib192)\] and ReasonMed\[[193](https://arxiv.org/html/2606.15647#bib.bib193)\] emphasize stepwise logical inference and interpretable medical reasoning based on structured knowledge and large language models\. In addition, ORQA\[[194](https://arxiv.org/html/2606.15647#bib.bib194)\] integrates multimodal data from operating room environments to support multitask surgical question answering, while Surg\-QA\[[195](https://arxiv.org/html/2606.15647#bib.bib195)\] provides large\-scale instruction\-based video question answering to enable semantic understanding of complex surgical workflows\.

TABLE III: Representative simulation platforms for medical embodied AI\.

| Platform | Representative Tasks | Learning Paradigms | Strength | Limitation |
| --- | --- | --- | --- | --- |
| SurRoL\[[196](https://arxiv.org/html/2606.15647#bib.bib196)\] | Grasping, cutting, suturing, general robotic surgery | Imitation learning; reinforcement learning | Strong physical realism; diverse surgical tasks | Moderate visual fidelity; high compute cost |
| ORBIT\-Surgical\[[197](https://arxiv.org/html/2606.15647#bib.bib197)\] | Surgical dexterity learning; active perception | Reinforcement learning; imitation learning | Excellent visual realism; efficient GPU\-parallel simulation | High hardware demand; complex system setup |
| Surgical Gym\[[198](https://arxiv.org/html/2606.15647#bib.bib198)\] | Massive RL training; rapid policy iteration | Reinforcement learning | Extremely high training speed; scalable optimization | Lower physical and visual realism |
| LapGym\[[111](https://arxiv.org/html/2606.15647#bib.bib111)\] | Laparoscopic manipulation; path planning; human\-in\-the\-loop control | Imitation learning; reinforcement learning | Highly extensible; multimodal sensing support | Limited task diversity; user\-dependent fidelity |
| SonoGym\[[199](https://arxiv.org/html/2606.15647#bib.bib199)\] | Ultrasound navigation; bone reconstruction; intervention planning | Reinforcement learning; imitation learning | High anatomical realism for ultrasound procedures | Modality\-specific; limited generality |

### IV\-C Action Datasets

Action datasets provide essential supervision for surgical action modeling, imitation learning, and execution\-level reasoning in medical embodied AI\.

Representative datasets include JIGSAWS\[[200](https://arxiv.org/html/2606.15647#bib.bib200)\], which combines kinematic and video data for benchmarking imitation learning of fundamental surgical skills; CholecTrack20\[[164](https://arxiv.org/html/2606.15647#bib.bib164)\] and m2cai16\-workflow\[[201](https://arxiv.org/html/2606.15647#bib.bib201)\], which support multi\-tool tracking and surgical phase recognition in laparoscopic cholecystectomy; and Endoscapes\[[165](https://arxiv.org/html/2606.15647#bib.bib165)\], which enables scene understanding and safety\-aware analysis in real surgical videos\. MultiBypass140\[[202](https://arxiv.org/html/2606.15647#bib.bib202)\] emphasizes hierarchical modeling of phases, steps, and adverse events in complex procedures, while SurgVU24\[[203](https://arxiv.org/html/2606.15647#bib.bib203)\] provides long\-horizon robotic surgery recordings for instrument recognition and strategy modeling\. Large\-scale resources such as GenSurgery\[[204](https://arxiv.org/html/2606.15647#bib.bib204)\] extend action modeling across diverse surgical types, and multimodal datasets including MM\-OR\[[167](https://arxiv.org/html/2606.15647#bib.bib167)\] and MITI\[[205](https://arxiv.org/html/2606.15647#bib.bib205)\] support semantic action understanding and intraoperative localization through integrated sensory signals\. In addition, CoPESD\[[206](https://arxiv.org/html/2606.15647#bib.bib206)\] targets fine\-grained manipulation in endoscopic submucosal dissection, enabling detailed modeling of complex surgical actions\.

### IV\-D Simulation Platforms and Synthetic Datasets

#### IV\-D1 Surgical Simulation Platforms

High\-fidelity surgical simulation platforms provide essential infrastructure for training and evaluating medical embodied agents, particularly for reinforcement learning and imitation learning in safety\-critical scenarios\. Recent open\-source platforms enable scalable policy learning, realistic physical interaction, and reproducible experimentation \(Table [III](https://arxiv.org/html/2606.15647#S4.T3)\)\.

Representative platforms include SurRoL\[[196](https://arxiv.org/html/2606.15647#bib.bib196)\], which offers high\-fidelity surgical interaction with collision modeling, haptics, and demonstration collection for imitation and reinforcement learning; ORBIT\-Surgical\[[197](https://arxiv.org/html/2606.15647#bib.bib197)\], which emphasizes photorealistic rendering and GPU\-parallel training for dexterous manipulation; and Surgical Gym\[[198](https://arxiv.org/html/2606.15647#bib.bib198)\], a fully GPU\-based simulator that significantly accelerates large\-scale reinforcement learning\. LapGym\[[111](https://arxiv.org/html/2606.15647#bib.bib111)\] focuses on robot\-assisted laparoscopic surgery and supports multimodal perception, path planning, and human\-in\-the\-loop learning, while SonoGym\[[199](https://arxiv.org/html/2606.15647#bib.bib199)\] targets ultrasound\-guided navigation and intervention using anatomically realistic models\. Together, these platforms form a critical foundation for scalable training, safe policy optimization, and sim\-to\-real transfer in medical embodied AI\.

#### IV\-D2 Synthetic Datasets

Due to limited access to real\-world medical data and strict privacy constraints, synthetic datasets have become an important complement for training and evaluating medical embodied AI models\. Recent synthetic resources span imaging, pathology, and structured clinical data, providing scalable and privacy\-preserving alternatives\.

Representative datasets include SynFundus\-1M\[[207](https://arxiv.org/html/2606.15647#bib.bib207)\], a large\-scale synthetic fundus dataset covering multiple disease categories with fine\-grained anatomical quality annotations; SNOW\[[208](https://arxiv.org/html/2606.15647#bib.bib208)\], which provides densely annotated synthetic pathology images for nuclei segmentation in breast cancer; and COVID\-19 10K/100K\[[209](https://arxiv.org/html/2606.15647#bib.bib209)\], synthetic electronic health record datasets generated with Synthea to model disease progression and clinical workflows\. The Coherent dataset\[[210](https://arxiv.org/html/2606.15647#bib.bib210)\] further integrates synthetic multimodal data across FHIR\-based clinical records, medical imaging, genomics, and physiological signals, supporting end\-to\-end multimodal learning and system\-level evaluation\. Together, these datasets enable scalable model training, benchmarking, and sim\-to\-real analysis under realistic privacy constraints\.

## V Challenges and Outlook

Despite recent advances, medical embodied AI still faces fundamental challenges in achieving safe, robust, and reliable deployment within complex clinical environments\. This section systematically analyzes these challenges from three core perspectives—perception, decision\-making, and action—and discusses promising research directions toward clinically deployable embodied AI systems\.

### V\-A Challenges and Outlook in Medical Embodied Perception

#### V\-A1 Insufficient Training Data and Perception Discrepancy

Medical embodied perception relies on large\-scale, high\-quality annotations for safety\-critical tasks such as intraoperative navigation, instrument recognition, and tissue segmentation\. Ethical and privacy constraints and the high cost of expert annotation limit data availability, leading to class imbalance and reduced domain diversity, particularly for rare pathologies and cross\-institutional deployment, where domain gaps degrade robustness and increase risk\.

Existing solutions include synthetic data generation, domain adaptation, semi\-supervised learning, and federated learning\. Synthetic data scale efficiently\[[207](https://arxiv.org/html/2606.15647#bib.bib207),[208](https://arxiv.org/html/2606.15647#bib.bib208)\] but fail to capture complex intraoperative variability, while domain adaptation mitigates distribution mismatch\[[211](https://arxiv.org/html/2606.15647#bib.bib211)\] yet depends on target\-domain access and struggles with unseen classes\. Semi\-supervised and federated learning leverage unlabeled or distributed data\[[212](https://arxiv.org/html/2606.15647#bib.bib212),[213](https://arxiv.org/html/2606.15647#bib.bib213)\] but remain constrained by annotation quality and domain heterogeneity\. Future work should emphasize generalized perception via unified multimodal backbones, uncertainty\-aware learning, and closed\-loop synthetic–real co\-training toward deployable clinical systems\.
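
Federated learning, one of the strategies listed above, can be illustrated with a FedAvg\-style aggregation step, in which a coordinating server averages locally trained parameters weighted by each site's sample count so that raw patient data never leave the institution\. The sketch below is a toy under stated assumptions: the function name `federated_average`, the flat parameter lists, and the hospital values are hypothetical, and the cited works may use different aggregation rules\.

```python
def federated_average(site_weights, site_sizes):
    """Average per-site parameter vectors, weighted by local sample count."""
    if not site_weights or len(site_weights) != len(site_sizes):
        raise ValueError("need one sample count per site")
    total = sum(site_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(site_weights[0])
    if any(len(w) != dim for w in site_weights):
        raise ValueError("all sites must share the same parameter shape")
    # Each global parameter is the sample-weighted mean of the local ones.
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(dim)
    ]


# Hypothetical two-parameter models from two institutions (illustration only).
hospital_a = [0.0, 2.0]   # trained on 100 local samples
hospital_b = [4.0, 6.0]   # trained on 300 local samples
global_weights = federated_average([hospital_a, hospital_b], [100, 300])
```

Because the larger site contributes three times as many samples, the aggregated parameters sit closer to its local model, which also hints at why heterogeneous sites can bias the shared model\.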

#### V\-A2 Semantic Ambiguity and Multimodal Knowledge Fusion Difficulties

Medical embodied AI integrates heterogeneous perceptual inputs, including visual observations, clinical texts, electronic health records, voice commands, and haptic signals\. Disparities in semantic granularity, spatial resolution, and temporal alignment make effective multimodal fusion challenging, often causing semantic ambiguity and intermodal conflicts in real\-time clinical settings\.

To bridge semantic gaps, existing approaches employ structured medical knowledge graphs and large medical language models\[[49](https://arxiv.org/html/2606.15647#bib.bib49),[74](https://arxiv.org/html/2606.15647#bib.bib74)\]\. However, knowledge bases suffer from delayed updates and uneven coverage\[[214](https://arxiv.org/html/2606.15647#bib.bib214),[215](https://arxiv.org/html/2606.15647#bib.bib215)\], while current alignment strategies rely on static mappings that struggle with dynamic contexts and evolving intraoperative semantics\. Future research should therefore emphasize context\-aware and adaptive semantic fusion, including graph\-based multimodal reasoning, causal modeling for uncertainty\-aware interpretability, and adaptive alignment via reinforcement and meta\-learning, enabling a transition from task\-specific sensing toward lifelong and context\-aware clinical perception systems\.

### V\-B Challenges and Outlook in Medical Embodied Decision\-making

#### V\-B1 Medical Reasoning Complexity and Uncertainty Modeling

Medical embodied agents operate in dynamic clinical environments with incomplete and noisy information\. Intraoperative emergencies, anatomical variability, and equipment failures require multi\-step reasoning under uncertainty, making transparent and interpretable decision processes essential for clinical trust\.

Most existing systems rely on end\-to\-end policy learning or deep reinforcement learning\[[216](https://arxiv.org/html/2606.15647#bib.bib216)\], which perform well in controlled settings but lack explicit reasoning paths\. Recent advances in stepwise medical reasoning, graph neural networks, and causal inference improve traceability and robustness; however, constructing comprehensive causal knowledge graphs remains costly and incomplete, especially for rare or individualized conditions\. Future research should therefore pursue hybrid reasoning frameworks that integrate causal knowledge, expert\-defined rules, and learning\-based models, supported by expert\-in\-the\-loop feedback and safety\-aware assurance mechanisms\. In the long term, medical embodied decision\-making should evolve from static policy optimization toward explainable and human\-aligned clinical reasoning agents\.

#### V\-B2 Lack of Mechanisms for Decision Pathway Generation and Validation

Translating reasoning outcomes into safe and executable action policies remains challenging due to the complexity and high\-risk nature of clinical workflows\. Although reinforcement and imitation learning show promise, most approaches lack systematic mechanisms for decision pathway validation and execution risk assessment\.

Future directions include adversarial or multi\-agent evaluation frameworks, digital twin–based validation platforms, and the integration of formal verification with learning\-based policies to support predictable and trustworthy decision\-making in safety\-critical clinical settings\.

### V\-C Challenges and Outlook in Medical Embodied Action

#### V\-C1 Error Sensitivity in High\-Precision Action Control

High\-precision medical actions impose stringent requirements on trajectory accuracy and control latency, particularly in minimally invasive procedures near sensitive anatomical structures\. Errors arise from mechanical compliance, sensor latency, calibration inaccuracies, and complex tissue–instrument interactions, while most systems lack sufficient fault tolerance\.

Current approaches based on visual servoing and imitation learning enable closed\-loop adjustment but remain constrained by perception–control latency and limited sensing resolution\. Future research should emphasize tightly coupled hardware–software frameworks, multimodal sensor fusion, and hybrid control strategies that combine model\-based control with learning\-based uncertainty estimation\.
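
One way to read the hybrid strategy suggested above is as a model\-based controller whose output is attenuated by a learned uncertainty estimate\. The sketch below assumes a proportional position controller and an uncertainty score in \[0, 1\] supplied by some external estimator; the function `hybrid_step`, its gain, and its thresholds are hypothetical placeholders rather than values from any surveyed system\.

```python
def hybrid_step(target_mm, position_mm, uncertainty, gain=0.5,
                max_step_mm=1.0, halt_threshold=0.8):
    """Return a commanded displacement in mm, or None to request a halt.

    `uncertainty` is assumed to lie in [0, 1] and to come from a learned
    estimator that is outside the scope of this sketch.
    """
    if not 0.0 <= uncertainty <= 1.0:
        raise ValueError("uncertainty must lie in [0, 1]")
    # Too uncertain to move at all: defer to the operator.
    if uncertainty >= halt_threshold:
        return None
    # Model-based part: proportional control on the position error.
    step = gain * (target_mm - position_mm)
    # Learning-based part: shrink the step as estimated uncertainty grows.
    step *= 1.0 - uncertainty
    # Hard per-cycle limit, applied regardless of the controller output.
    return max(-max_step_mm, min(max_step_mm, step))
```

For example, `hybrid_step(10.0, 8.0, 0.5)` halves the nominal 1 mm correction to 0\.5 mm, whereas an uncertainty of 0\.9 returns `None` so that a supervising layer can pause the motion\.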

#### V\-C2 Lack of General\-Purpose Medical Simulation Platforms

Simulation platforms are critical for training and validating medical embodied actions, yet existing simulators lack sufficient physical fidelity and support for diverse procedures, rare events, and hierarchical task modeling\. Limitations in tissue deformation modeling and instrument–tissue interaction hinder reliable sim\-to\-real transfer\.

Future work should prioritize general\-purpose simulation platforms with real\-world closed\-loop validation, integrating personalized anatomical models, procedural scene generation, and digital twin frameworks\. Ultimately, medical embodied action must advance from precise but brittle execution toward adaptive, intention\-aware, and safety\-certified clinical actuation\.

### V\-D Cross\-Cutting Challenges in Safety, Ethics, and Reliable Deployment

#### V\-D1 Safety Assurance

Ensuring safety in medical embodied AI extends beyond improving perception accuracy or control precision\[[1](https://arxiv.org/html/2606.15647#bib.bib1)\]\. Risks may arise from the interaction of perception uncertainty, delayed decision\-making, and actuation errors under dynamic clinical conditions\[[14](https://arxiv.org/html/2606.15647#bib.bib14)\]\. While simulation\-based evaluation and empirical testing are widely adopted, they provide limited guarantees under rare or unforeseen scenarios\[[12](https://arxiv.org/html/2606.15647#bib.bib12)\]\. Future research should integrate runtime monitoring, fail\-safe mechanisms, and human\-in\-the\-loop supervision to support bounded\-risk operation in safety\-critical settings\.
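
A runtime monitor of the kind proposed above could, in its simplest form, gate every planned action on a few measurable quantities\. The following sketch is illustrative only: the signals \(perception confidence, control\-loop latency, clearance to a protected structure\), the thresholds, and the names `Decision` and `monitor` are assumptions, not a validated safety mechanism\.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"
    ASK_CLINICIAN = "ask_clinician"
    HALT = "halt"


def monitor(confidence, latency_ms, clearance_mm,
            min_confidence=0.9, max_latency_ms=50.0, min_clearance_mm=2.0):
    """Gate one planned action; all thresholds are hypothetical placeholders."""
    # Fail-safe: hard limits on timing and geometry always win.
    if latency_ms > max_latency_ms or clearance_mm < min_clearance_mm:
        return Decision.HALT
    # Human-in-the-loop: low perception confidence defers to the clinician.
    if confidence < min_confidence:
        return Decision.ASK_CLINICIAN
    return Decision.EXECUTE
```

Checking the hard limits before the confidence test means a confident but late or poorly positioned action is still stopped, which reflects the point that risk arises from the interaction of perception, timing, and actuation rather than from any one of them\.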

#### V\-D2 Ethical and Regulatory Considerations

Medical embodied AI introduces ethical challenges related to patient autonomy, informed consent, accountability, and data governance\. The increasing use of multimodal clinical data and autonomous decision\-making complicates responsibility attribution and regulatory compliance\[[5](https://arxiv.org/html/2606.15647#bib.bib5)\]\. Transparent decision processes, traceable action logs, and alignment with existing medical regulations are essential to ensure ethical deployment and clinical acceptance\.

#### V\-D3 Reliable Clinical Deployment

Reliable deployment requires medical embodied AI systems to remain robust across institutions, patient populations, and long\-term operation\. Domain shifts, rare events, and system drift can degrade performance and erode clinical trust\[[21](https://arxiv.org/html/2606.15647#bib.bib21)\]\. Addressing these challenges calls for standardized benchmarks, continuous post\-deployment monitoring, and shared\-autonomy frameworks that allow clinicians to retain ultimate control while benefiting from intelligent assistance\.
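
Continuous post\-deployment monitoring can be sketched as a statistical check that compares a recent window of a logged quantity, such as model confidence, against a baseline collected at validation time\. The example assumes a simple z\-test on the window mean; the function `drift_alarm`, the threshold, and the numbers are hypothetical, and real deployments would likely need more robust drift tests\.

```python
from math import sqrt
from statistics import mean, stdev


def drift_alarm(baseline, recent, z_threshold=3.0):
    """Flag drift when the recent mean deviates strongly from the baseline mean."""
    if len(baseline) < 2 or not recent:
        raise ValueError("need at least two baseline values and one recent value")
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has zero variance")
    # Standard error of the mean for a window of len(recent) samples.
    standard_error = sigma / sqrt(len(recent))
    z = (mean(recent) - mean(baseline)) / standard_error
    return abs(z) > z_threshold


# Hypothetical confidence scores, for illustration only.
baseline_scores = [0.90, 0.92, 0.88, 0.91, 0.89]
stable_week = [0.90, 0.91, 0.89]
shifted_week = [0.70, 0.72, 0.68]
```

Here `drift_alarm(baseline_scores, shifted_week)` raises the alarm while the stable window does not; in a shared\-autonomy setting such an alarm would prompt clinician review rather than an automatic model update\.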

## VI Conclusion

Embodied AI introduces a transformative paradigm to healthcare, bridging the gap between computational foundation models and the physical clinical world\. This survey has systematically analyzed the core components of perception, decision\-making, and action, and has cataloged representative medical applications and essential datasets\. Alongside this progress, we have highlighted the significant challenges that remain in clinical settings\. By elucidating the current landscape and identifying key bottlenecks, this work aims to serve as a foundational roadmap\. We hope that this survey will help researchers address these limitations and accelerate the transition of intelligent agents from theoretical frameworks to practical, reliable assistants in real\-world medical workflows\.

## References

- \[1\]C\. Varghese, E\. M\. Harrison, G\. O’Grady, and E\. J\. Topol, “Artificial intelligence in surgery,”*Nat\. Med\.*, vol\. 30, no\. 5, pp\. 1257–1268, 2024\.
- \[2\]F\. Isensee, P\. F\. Jaeger, S\. A\. A\. Kohl, J\. Petersen, and K\. H\. Maier\-Hein, “nnU\-Net: A self\-configuring method for deep learning\-based biomedical image segmentation,”*Nat\. Methods*, vol\. 18, no\. 2, pp\. 203–211, 2021\.
- \[3\]A\. H\. Thieme, Y\. Zheng, G\. Machiraju, C\. Sadee, M\. Mittermaier, M\. Gertler, J\. L\. Salinas, K\. Srinivasan, P\. Gyawali, and F\. C\.\-P\. et al\., “A deep\-learning algorithm to classify skin lesions from mpox virus infection,”*Nat\. Med\.*, vol\. 29, no\. 3, pp\. 738–747, 2023\.
- \[4\]D\. Ma, J\. Pang, M\. B\. Gotway, and J\. Liang, “A fully open AI foundation model applied to chest radiography,”*Nature*, pp\. 1–11, 2025\.
- \[5\]F\. Liu, H\. Zhou, B\. Gu, X\. Zou, J\. Huang, J\. Wu, Y\. Li, S\. S\. Chen, Y\. Hua, and P\. Z\. et al\., “Application of large language models in medicine,”*Nat\. Rev\. Bioeng\.*, pp\. 1–20, 2025\.
- \[6\]K\. Singhal, T\. Tu, J\. Gottweis, R\. Sayres, E\. Wulczyn, M\. Amin, L\. Hou, K\. Clark, S\. R\. Pfohl, and H\. C\.\-L\. et al\., “Toward expert\-level medical question answering with large language models,”*Nat\. Med\.*, vol\. 31, no\. 3, pp\. 943–950, 2025\.
- \[7\]X\. Liu, H\. Liu, G\. Yang, Z\. Jiang, S\. Cui, Z\. Zhang, H\. Wang, L\. Tao, Y\. Sun, and Z\. S\. et al\., “A generalist medical language model for disease diagnosis assistance,”*Nat\. Med\.*, vol\. 31, no\. 3, pp\. 932–942, 2025\.
- \[8\]Y\. Liu, W\. Chen, Y\. Bai, X\. Liang, G\. Li, W\. Gao, and L\. Lin, “Aligning cyber space with physical world: A comprehensive survey on embodied AI,”*IEEE/ASME Trans\. Mechatronics*, 2025\.
- \[9\]J\. Li, Z\. Xu, N\. Li, K\. Zhang, G\. Xiong, M\. Sun, C\. Hou, J\. Ji, F\. Zhang, and J\. Z\. et al\., “AI\-embodied multimodal flexible electronic robots with programmable sensing, actuating, and self\-learning,”*Nat\. Commun\.*, vol\. 16, no\. 1, p\. 8818, 2025\.
- \[10\]Y\. Long, A\. Lin, D\. H\. C\. Kwok, L\. Zhang, Z\. Yang, K\. Shi, L\. Song, J\. Fu, H\. Lin, and W\. W\. et al\., “Surgical embodied intelligence for generalized task autonomy in laparoscopic robot\-assisted surgery,”*Sci\. Robot\.*, vol\. 10, no\. 104, p\. eadt3093, 2025\.
- \[11\]P\. Fiorini, K\. Y\. Goldberg, Y\. Liu, and R\. H\. Taylor, “Concepts and trends in autonomy for robot\-assisted surgery,”*Proc\. IEEE*, vol\. 110, no\. 7, pp\. 993–1011, 2022\.
- \[12\]T\. Yao, H\. Wang, B\. Lu, J\. Ge, Z\. Pei, M\. Kowarschik, L\. Sun, L\. Seneviratne, and P\. Qi, “Sim\-to\-real learning with domain randomization for autonomous guidewire navigation in robot\-assisted endovascular procedures,”*IEEE Trans\. Autom\. Sci\. Eng\.*, 2025\.
- \[13\]T\. Yao, Y\. Xu, H\. Wang, X\. Qiu, K\. Althoefer, and P\. Qi, “Multi\-agent fuzzy reinforcement learning with LLM for cooperative navigation of endovascular robotics,”*IEEE Trans\. Fuzzy Syst\.*, 2025\.
- \[14\]A\. Pore, Z\. Li, D\. Dall’Alba, A\. Hernansanz, E\. D\. Momi, A\. Menciassi, A\. C\. Gelpi, J\. Dankelman, P\. Fiorini, and E\. V\. Poorten, “Autonomous navigation for robot\-assisted intraluminal and endovascular procedures: A systematic review,”*IEEE Trans\. Robot\.*, vol\. 39, no\. 4, pp\. 2529–2548, 2023\.
- \[15\]J\. Song, K\. Yang, H\. Chen, J\. Liu, Y\. Gu, Q\. Hui, Y\. Huang, M\. Li, Z\. Zhang, and T\. C\. et al\., “VascularPilot3D: Toward a 3D fully autonomous navigation for endovascular robotics,” in*Proc\. IEEE Int\. Conf\. Robot\. Autom\. \(ICRA\)*, 2025, pp\. 9318–9324\.
- \[16\]J\. Song, R\. Zhang, W\. Zhang, H\. Zhou, and M\. Ghaffari, “SLAM\-assisted 3D tracking system for laparoscopic surgery,” in*Proc\. IEEE Int\. Conf\. Robot\. Autom\. \(ICRA\)*, 2025, pp\. 6868–6874\.
- \[17\]W\. Arreola, J\. J\. Rivas, L\. Castrejon, and L\. E\. Sucar, “Affective embodied agent for patient assistance in virtual rehabilitation,”*IEEE Trans\. Affect\. Comput\.*, 2025\.
- \[18\]C\. Zhang and S\. Yu, “Virtual co\-embodiment rehabilitation: An innovative method integrating virtual co\-embodiment and action observation therapy in virtual reality rehabilitation,” in*Proc\. Int\. Conv\. Rehabil\. Eng\. Assistive Technol\. \(i\-CREATe\)*, 2024, pp\. 1–6\.
- \[19\]Z\. Jiang, X\. Huang, Z\. Wang, Y\. Liu, L\. Huang, and X\. Luo, “Embodied conversational agents for chronic diseases: Scoping review,”*J\. Med\. Internet Res\.*, vol\. 26, p\. e47134, 2024\.
- \[20\]G\. Fragapane, H\.\-H\. Hvolby, F\. Sgarbossa, and J\. O\. Strandhagen, “Autonomous mobile robots in hospital logistics,” in*Adv\. Prod\. Manage\. Syst\.*, 2020, pp\. 672–679\.
- \[21\]L\. Bernhard, P\. Schwingenschlögl, J\. Hofmann, D\. Wilhelm, and A\. Knoll, “Boosting the hospital by integrating mobile robotic assistance systems: A comprehensive classification of the risks to be addressed,”*Auton\. Robots*, vol\. 48, no\. 1, p\. 1, 2024\.
- \[22\]W\. Ding, Q\. Tian, Y\. Xia, Y\. Yang, Y\. Wang, and Y\. Zhang, “Research on multirobot collaboration platform for logistic distribution of medical consumables in the operating room,” in*Proc\. SPIE Conf\. Biomed\. Intell\. Syst\. \(IC\-BIS\)*, vol\. 13208, 2024, pp\. 637–642\.
- \[23\]O\. Palinko, R\. Wendlandt, S\. Udby, F\. Uhing, J\. H\. Fog, E\. Hansen, R\. P\. Junge, D\. G\. Holm, M\. Kipp, and L\. Bodenhagen, “Interaction matters when it comes to hand disinfection using robots at hospitals,” in*Proc\. Int\. Conf\. Social Robot\.*, 2024, pp\. 74–85\.
- \[24\]Y\. Liu, X\. Cao, T\. Chen, Y\. Jiang, J\. You, M\. Wu, X\. Wang, M\. Feng, Y\. Jin, and J\. Chen, “From screens to scenes: A survey of embodied AI in healthcare,”*Inf\. Fusion*, vol\. 119, p\. 103033, 2025\.
- \[25\]Z\. Zhong, “Hierarchical frameworks for embodied medical AI,”*ITM Web of Conferences*, vol\. 80, p\. 01037, 2025\.
- \[26\]S\. N\. Kumar, J\. Joy, A\. J\. James, and A\. Dixen, “Health care industry use cases of embodied AI,”*Building Embodied AI Systems*, pp\. 223–239, 2025\.
- \[27\]Y\. Tian, M\. Shi, X\. Zhang, B\. Zhang, M\. Wang, and Y\. Shi, “Assisting embodied AI: A survey of 3D segmentation models for medical CT images,”*CCF Transactions on Pervasive Computing and Interaction*, pp\. 1–22, 2025\.
- \[28\]Y\. Qiu, X\. Chen, X\. Wu, Y\. Li, P\. Xu, K\. Jin, X\. Shang, P\. Chotcomwongse, M\. He, and D\. Shi, “Embodied artificial intelligence in ophthalmology,”*npj Digital Medicine*, vol\. 8, no\. 1, p\. 351, 2025\.
- \[29\]J\. Liu, X\. Shi, T\. D\. Nguyen, H\. Zhang, T\. Zhang, W\. Sun, Y\. Li, A\. V\. Vasilakos, G\. Iacca, and A\. A\. K\. et al\., “Neural brain: A neuroscience\-inspired framework for embodied agents,”*arXiv preprint arXiv:2505\.07634*, 2025\.
- \[30\]H\. Liu, D\. Guo, and A\. Cangelosi, “Embodied intelligence: A synergy of morphology, action, perception, and learning,”*ACM Comput\. Surv\.*, vol\. 57, no\. 7, pp\. 1–36, 2025\.
- \[31\]G\. Paolo, J\. Gonzalez\-Billandon, and B\. Kégl, “Position: A call for embodied AI,” in*Proc\. Int\. Conf\. Mach\. Learn\. \(ICML\)*, 2024\.
- \[32\]B\. Wang, X\. Meng, X\. Wang, Z\. Zhu, A\. Ye, Y\. Wang, Z\. Yang, C\. Ni, G\. Huang, and X\. Wang, “EmbodiedDreamer: Advancing real\-to\-sim\-to\-real transfer for policy training via embodied world modeling,”*arXiv preprint arXiv:2507\.05198*, 2025\.
- \[33\]T\. Jiang, Y\. Guan, L\. Ma, J\. Xu, J\. Meng, W\. Chen, Z\. Zeng, L\. Li, D\. Wu, and R\. Chen, “DexSim2Real2: Building explicit world model for precise articulated object dexterous manipulation,”*arXiv preprint arXiv:2409\.08750*, 2024\.
- \[34\]Y\. Yardi, S\. Biruduganti, and L\. Ankile, “Bridging the sim\-to\-real gap: Vision encoder pre\-training for visuomotor policy transfer,”*arXiv preprint arXiv:2501\.16389*, 2025\.
- \[35\]G\. Liu, Y\. Deng, R\. Zhao, H\. Zhou, J\. Chen, J\. Chen, R\. Xu, Y\. Tai, and K\. Jia, “DexScale: Automating data scaling for Sim2Real generalizable robot control,” in*Proc\. Int\. Conf\. Mach\. Learn\. \(ICML\)*, 2025\.
- \[36\]L\. Fan, M\. Liang, Y\. Li, G\. Hua, and Y\. Wu, “Evidential active recognition: Intelligent and prudent open\-world embodied perception,” in*Proc\. IEEE/CVF Conf\. Comput\. Vis\. Pattern Recognit\. \(CVPR\)*, 2024, pp\. 16 351–16 361\.
- \[37\]Y\. Sun, N\. Cheng, S\. Zhang, W\. Li, L\. Yang, S\. Cui, H\. Liu, F\. Sun, J\. Zhang, and G\. D\. et al\., “Tactile data generation and applications based on visuo\-tactile sensors: A review,”*Inf\. Fusion*, p\. 103162, 2025\.
- \[38\]W\. Jin, H\. Du, B\. Zhao, X\. Tian, B\. Shi, and G\. Yang, “A comprehensive survey on multi\-agent cooperative decision\-making: Scenarios, approaches, challenges, and perspectives,”*arXiv preprint arXiv:2503\.13415*, 2025\.
- \[39\]R\. Liu, W\. Wang, and Y\. Yang, “Volumetric environment representation for vision\-language navigation,” in*Proc\. IEEE/CVF Conf\. Comput\. Vis\. Pattern Recognit\. \(CVPR\)*, 2024, pp\. 16 317–16 328\.
- \[40\]R\. Liu, X\. Wang, W\. Wang, and Y\. Yang, “Bird’s\-eye\-view scene graph for vision\-language navigation,” in*Proc\. IEEE/CVF Int\. Conf\. Comput\. Vis\. \(ICCV\)*, 2023, pp\. 10 968–10 980\.
- \[41\]R\. Liu, W\. Wang, and Y\. Yang, “Vision\-language navigation with energy\-based policy,” in*Proc\. Adv\. Neural Inf\. Process\. Syst\. \(NeurIPS\)*, 2024, pp\. 108 208–108 230\.
- \[42\]S\. Saxena, B\. Buchanan, C\. Paxton, P\. Liu, B\. Chen, N\. Vaskevicius, L\. Palmieri, J\. Francis, and O\. Kroemer, “GraphEQA: Using 3D semantic scene graphs for real\-time embodied question answering,”*arXiv preprint arXiv:2412\.14480*, 2024\.
- \[43\]Y\. Lei, Y\. Fu, T\. Wang, R\. L\. J\. Qiu, W\. J\. Curran, T\. Liu, and X\. Yang, “Deep learning in multi\-organ segmentation,”*arXiv preprint arXiv:2001\.10619*, 2020\.
- \[44\]F\. A\. Ahmed, M\. Yousef, M\. A\. Ahmed, H\. O\. Ali, A\. Mahboob, H\. Ali, Z\. Shah, O\. Aboumarzouk, A\. Al Ansari, and S\. Balakrishnan, “Deep learning for surgical instrument recognition and segmentation in robotic\-assisted surgeries: a systematic review,”*Artif\. Intell\. Rev\.*, vol\. 58, no\. 1, p\. 1, 2024\.
- \[45\]O\. Ronneberger, P\. Fischer, and T\. Brox, “U\-Net: Convolutional networks for biomedical image segmentation,” in*Proc\. Int\. Conf\. Med\. Image Comput\. Comput\.\-Assist\. Interv\. \(MICCAI\)*, 2015, pp\. 234–241\.
- \[46\]H\. Du, J\. Wang, M\. Liu, Y\. Wang, and E\. Meijering, “Swinpa\-net: Swin transformer\-based multiscale feature pyramid aggregation network for medical image segmentation,”*IEEE Trans\. Neural Netw\. Learn\. Syst\.*, vol\. 35, no\. 4, pp\. 5355–5366, 2022\.
- \[47\]M\. Islam, V\. S\. Vibashan, C\. M\. Lim, and H\. Ren, “St\-mtl: Spatio\-temporal multitask learning model to predict scanpath while tracking instruments in robotic surgery,”*Med\. Image Anal\.*, vol\. 67, p\. 101837, 2021\.
- \[48\]E\. Colleoni, S\. Moccia, X\. Du, E\. De Momi, and D\. Stoyanov, “Deep learning based robotic tool detection and articulation estimation with spatio\-temporal layers,”*IEEE Robot\. Autom\. Lett\.*, vol\. 4, no\. 3, pp\. 2714–2721, 2019\.
- \[49\]Z\. Zeng, Z\. Zhuo, X\. Jia, E\. Zhang, J\. Wu, J\. Zhang, Y\. Wang, C\. H\. Low, J\. Jiang, Z\. Zheng *et al\.*, “Surgvlm: A large vision\-language model and systematic evaluation benchmark for surgical intelligence,”*arXiv preprint arXiv:2506\.02555*, 2025\.
- \[50\]Z\. Li, A\. Shaban, J\.\-G\. Simard, D\. Rabindran, S\. DiMaio, and O\. Mohareri, “A robotic 3D perception system for operating room environment awareness,”*arXiv preprint arXiv:2003\.09487*, 2020\.
- \[51\]G\. Erol, A\. Güngör, U\. T\. Sevgi, B\. Gülsuna, Y\. Doğruel, H\. Emmez, and U\. Türe, “Creation of a microsurgical neuroanatomy laboratory and virtual operating room: a preliminary study,”*Neurosurg\. Focus*, vol\. 56, no\. 1, p\. E6, 2024\.
- \[52\]B\. G\. A\. Gerats, J\. M\. Wolterink, and I\. A\. M\. J\. Broeders, “NeRF\-or: Neural radiance fields for operating room scene reconstruction from sparse\-view RGB\-D videos,”*Int\. J\. Comput\. Assist\. Radiol\. Surg\.*, vol\. 20, no\. 1, pp\. 147–156, 2025\.
- \[53\]S\. Yang, Q\. Li, D\. Shen, B\. Gong, Q\. Dou, and Y\. Jin, “Deform3DGS: Flexible deformation for fast surgical scene reconstruction with gaussian splatting,” in*Proc\. Int\. Conf\. Med\. Image Comput\. Comput\.\-Assist\. Interv\. \(MICCAI\)*\. Springer, 2024, pp\. 132–142\.
- \[54\]W\. Xie, Y\. Ye, Q\. Hong, J\. Yao, S\. Wu, R\. Zhou, X\. Dong, and X\. Guo, “Endo\-hdr: Dynamic endoscopic reconstruction with deformable 3d gaussians and hierarchical depth regularization\.” Elsevier, 2025, p\. 114914\.
- \[55\]J\. Chen, X\. Zhang, M\. I\. Hoque, F\. Vasconcelos, D\. Stoyanov, D\. S\. Elson, and B\. Huang, “Surgicalgs: Dynamic 3d gaussian splatting for accurate robotic\-assisted surgical scene reconstruction,” in*Proc\. Int\. Conf\. Med\. Image Comput\. Comput\.\-Assist\. Interv\. \(MICCAI\)*\. Springer, 2025, pp\. 572–582\.
- \[56\]E\. Özsoy, E\. P\. Örnek, U\. Eck, T\. Czempiel, F\. Tombari, and N\. Navab, “4D\-or: Semantic scene graphs for OR domain modeling,” in*Proc\. Int\. Conf\. Med\. Image Comput\. Comput\.\-Assist\. Interv\. \(MICCAI\)*\. Springer, 2022, pp\. 475–485\.
- \[57\]E\. Özsoy, T\. Czempiel, F\. Holm, C\. Pellegrini, and N\. Navab, “Labrad\-or: Lightweight memory scene graphs for accurate bimodal reasoning in dynamic operating rooms,” in*Proc\. Int\. Conf\. Med\. Image Comput\. Comput\.\-Assist\. Interv\. \(MICCAI\)*\. Springer, 2023, pp\. 302–311\.
- \[58\]P\. He, Z\. Zhang, Y\. Zhang, X\. Zhao, and S\. Peng, “Spatial\-ORMLLM: Improve spatial relation understanding in the operating room with multimodal large language models,”*arXiv preprint arXiv:2508\.08199*, 2025\.
- \[59\]K\. C\. Demir, H\. Schieber, T\. Weise, D\. Roth, M\. May, A\. Maier, and S\. H\. Yang, “Deep learning in surgical workflow analysis: A review of phase and step recognition,”*IEEE J\. Biomed\. Health Inform\.*, vol\. 27, no\. 11, pp\. 5405–5417, 2023\.
- \[60\]K\. Feghoul, D\. S\. Maia, M\. E\. Amrani, M\. Daoudi, and A\. Amad, “MGRFormer: A multimodal transformer approach for surgical gesture recognition,” in*Proc\. IEEE Int\. Conf\. Autom\. Face Gesture Recognit\. \(FG\)*, 2024, pp\. 1–10\.
- \[61\]Y\. Men, J\. Luo, Z\. Zhao, H\. Wu, F\. Luo, G\. Zhang, and M\. Yu, “Surgical gesture recognition in open surgery based on 3DCNN and SlowFast,” in*Proc\. IEEE Int\. Conf\. Inf\. Technol\. Netw\. Electron\. Autom\. Control \(ITNEC\)*, 2024, pp\. 429–433\.
- \[62\]L\. Ma, H\. Kang, N\. Magnenat\-Thalmann, and K\. Wac, “TransSG: A spatio\-temporal transformer for surgical gesture recognition,” in*Proc\. Comput\. Graph\. Int\. Conf\.*, 2024, pp\. 151–165\.
- \[63\]B\. Jia, W\. Wang, X\. Tian, and X\. Wang, “STANet: A surgical gesture recognition method based on spatiotemporal fusion,”*Ann\. N\. Y\. Acad\. Sci\.*, 2025\.
- \[64\]S\. Cristina, V\. Despotovic, R\. Pérez\-Rodríguez, and S\. Aleksic, “Audio\- and video\-based human activity recognition systems in healthcare,”*IEEE Access*, vol\. 12, pp\. 8230–8245, 2024\.
- \[65\]B\. V\. Amsterdam, I\. Funke, E\. Edwards, S\. Speidel, J\. Collins, A\. Sridhar, J\. Kelly, M\. J\. Clarkson, and D\. Stoyanov, “Gesture recognition in robotic surgery with multimodal attention,”*IEEE Trans\. Med\. Imaging*, vol\. 41, no\. 7, pp\. 1677–1687, 2022\.
- \[66\]S\. Li and W\. Deng, “Deep facial expression recognition: A survey,”*IEEE Trans\. Affect\. Comput\.*, vol\. 13, no\. 3, pp\. 1195–1215, 2020\.
- \[67\]Y\. Li, J\. Wei, Y\. Liu, J\. Kauttonen, and G\. Zhao, “Deep learning for micro\-expression recognition: A survey,”*IEEE Trans\. Affect\. Comput\.*, vol\. 13, no\. 4, pp\. 2028–2046, 2022\.
- \[68\]L\. Zhang, Y\. Qian, O\. Arandjelović, T\. Zhu, and H\. Xiao, “Multimodal latent emotion recognition from micro\-expression and physiological signals,”*Pattern Recognit\.*, p\. 111963, 2025\.
- \[69\]F\. Zhang, Y\. Liu, X\. Yu, Z\. Wang, Q\. Zhang, J\. Wang, and Q\. Zhang, “Towards facial micro\-expression detection and classification using modified multimodal ensemble learning,”*Inf\. Fusion*, vol\. 115, p\. 102735, 2025\.
- \[70\]J\. Ye, Y\. Yu, L\. Lu, H\. Wang, Y\. Zheng, Y\. Liu, and Q\. Wang, “DEP\-former: Multimodal depression recognition based on facial expressions and audio features via emotional changes,”*IEEE Trans\. Circuits Syst\. Video Technol\.*, 2024\.
- \[71\]M\. Khan, W\. Gueaieb, A\. E\. Saddik, and S\. Kwon, “MSER: Multimodal speech emotion recognition using cross\-attention with deep fusion,”*Expert Syst\. Appl\.*, vol\. 245, p\. 122946, 2024\.
- \[72\]H\. Gao, Z\. Cai, X\. Wang, M\. Wu, and C\. Liu, “Multimodal fusion of behavioral and physiological signals for enhanced emotion recognition via feature decoupling and knowledge transfer,”*IEEE J\. Biomed\. Health Inform\.*, 2025\.
- \[73\]P\. S\. Kumar, P\. K\. Govarthan, A\. A\. S\. Gadda, N\. Ganapathy, and J\. F\. A\. Ronickom, “Deep learning\-based automated emotion recognition using multimodal physiological signals and time\-frequency methods,”*IEEE Trans\. Instrum\. Meas\.*, vol\. 73, pp\. 1–12, 2024\.
- \[74\]J\. Pan, C\. Liu, J\. Wu, F\. Liu, J\. Zhu, H\. B\. Li, C\. Chen, C\. Ouyang, and D\. Rueckert, “MedVLM\-R1: Incentivizing medical reasoning capability of vision\-language models via reinforcement learning,” in*Proc\. Int\. Conf\. Med\. Image Comput\. Comput\.\-Assist\. Interv\. \(MICCAI\)*, 2025, pp\. 337–347\.
- \[75\]Y\. Zhang, M\. Wang, Y\. Wu, P\. Tiwari, Q\. Li, B\. Wang, and J\. Qin, “DialogueLLM: Context\- and emotion\-knowledge\-tuned large language models for emotion recognition in conversations,”*arXiv preprint arXiv:2310\.11374*, 2023\.
- \[76\]C\. Garcia\-Vidal, G\. Sanjuan, P\. Puerta\-Alcalde, E\. Moreno\-García, and A\. Soriano, “Artificial intelligence to support clinical decision\-making processes,”*EBioMedicine*, vol\. 46, pp\. 27–29, 2019\.
- \[77\]T\. J\. Loftus, P\. J\. Tighe, A\. C\. Filiberto, P\. A\. Efron, S\. C\. Brakenridge, A\. M\. Mohr, P\. Rashidi, G\. R\. Upchurch, and A\. Bihorac, “Artificial intelligence and surgical decision\-making,”*JAMA Surg\.*, vol\. 155, no\. 2, pp\. 148–158, 2020\.
- \[78\]F\. G\. Mangano, O\. Admakin, H\. Lerner, and C\. Mangano, “Artificial intelligence and augmented reality for guided implant surgery planning: A proof of concept,”*J\. Dent\.*, vol\. 133, p\. 104485, 2023\.
- \[79\]E\. Aspland, D\. Gartner, and P\. Harper, “Clinical pathway modelling: A literature review,”*Health Syst\.*, vol\. 10, no\. 1, pp\. 1–23, 2021\.
- \[80\]X\. Gao, Y\. Jin, Y\. Long, Q\. Dou, and P\.\-A\. Heng, “Trans\-SVNet: Accurate phase recognition from surgical videos via hybrid embedding aggregation transformer,” in*Proc\. Int\. Conf\. Med\. Image Comput\. Comput\.\-Assist\. Interv\. \(MICCAI\)*, 2021, pp\. 593–603\.
- \[81\]T\. Czempiel, M\. Paschali, M\. Keicher, W\. Simson, H\. Feussner, S\. T\. Kim, and N\. Navab, “TeCNO: Surgical phase recognition with multi\-stage temporal convolutional networks,” in*Proc\. Int\. Conf\. Med\. Image Comput\. Comput\.\-Assist\. Interv\. \(MICCAI\)*, 2020, pp\. 343–352\.
- \[82\]A\. Kadkhodamohammadi, I\. Luengo, and D\. Stoyanov, “PATG: Position\-aware temporal graph networks for surgical phase recognition on laparoscopic videos,”*Int\. J\. Comput\. Assist\. Radiol\. Surg\.*, vol\. 17, no\. 5, pp\. 849–856, 2022\.
- \[83\]F\. X\. Zhang, N\. A\. Moubayed, and H\. P\. H\. Shum, “Towards graph representation learning\-based surgical workflow anticipation,” in*Proc\. IEEE\-EMBS Int\. Conf\. Biomed\. Health Inform\. \(BHI\)*, 2022, pp\. 1–4\.
- \[84\]C\. Li, C\. Wong, S\. Zhang, N\. Usuyama, H\. Liu, J\. Yang, T\. Naumann, H\. Poon, and J\. Gao, “LLaVA\-Med: Training a large language\-and\-vision assistant for biomedicine in one day,”*Adv\. Neural Inf\. Process\. Syst\.*, vol\. 36, pp\. 28 541–28 564, 2023\.
- \[85\]H\. K\. Gumprecht, D\. C\. Widenka, and C\. B\. Lumenta, “BrainLab VectorVision neuronavigation system: Technology and clinical experiences in 131 cases,”*Neurosurgery*, vol\. 44, no\. 1, pp\. 97–104, 1999\.
- \[86\]P\. Liu, C\. Li, C\. Xiao, Z\. Zhang, J\. Ma, J\. Gao, P\. Shao, I\. Valerio, T\. M\. Pawlik, and C\. D\. et al\., “A wearable augmented reality navigation system for surgical telementoring based on Microsoft HoloLens,”*Ann\. Biomed\. Eng\.*, vol\. 49, no\. 1, pp\. 287–298, 2021\.
- \[87\]A\. Li, J\. Han, Y\. Zhao, M\. Q\.\-H\. Meng, and L\. Liu, “RL\-USRegi: Autonomous ultrasound registration for radiation\-free spinal surgical navigation using reinforcement learning,”*IEEE Trans\. Autom\. Sci\. Eng\.*, 2025\.
- \[88\]H\. Robertshaw, L\. Karstensen, B\. Jackson, A\. Granados, and T\. C\. Booth, “Autonomous navigation of catheters and guidewires in mechanical thrombectomy using inverse reinforcement learning,”*Int\. J\. Comput\. Assist\. Radiol\. Surg\.*, vol\. 19, no\. 8, pp\. 1569–1578, 2024\.
- \[89\]G\. Zhou, Y\. Hong, and Q\. Wu, “NavGPT: Explicit reasoning in vision\-and\-language navigation with large language models,” in*Proc\. AAAI Conf\. Artif\. Intell\.*, vol\. 38, no\. 7, 2024, pp\. 7641–7649\.
- \[90\]G\. Zhou, Y\. Hong, Z\. Wang, X\. E\. Wang, and Q\. Wu, “NavGPT\-2: Unleashing navigational reasoning capability for large vision\-language models,” in*Proc\. Eur\. Conf\. Comput\. Vis\. \(ECCV\)*, 2024, pp\. 260–278\.
- \[91\]S\. Fan, R\. Liu, W\. Wang, and Y\. Yang, “Navigation instruction generation with BEV perception and large language models,” in*Proc\. Eur\. Conf\. Comput\. Vis\. \(ECCV\)*, 2024, pp\. 368–387\.
- \[92\]S\. Fan, R\. Liu, W\. Wang, and Y\. Yang, “Scene map\-based prompt tuning for navigation instruction generation,” in*Proc\. IEEE/CVF Conf\. Comput\. Vis\. Pattern Recognit\. \(CVPR\)*, 2025, pp\. 6898–6908\.
- \[93\]Z\. Lin, D\. Zhang, Q\. Tao, D\. Shi, G\. Haffari, Q\. Wu, M\. He, and Z\. Ge, “Medical visual question answering: A survey,”*Artif\. Intell\. Med\.*, vol\. 143, p\. 102611, 2023\.
- \[94\]Q\. Jin, Z\. Yuan, G\. Xiong, Q\. Yu, H\. Ying, C\. Tan, M\. Chen, S\. Huang, X\. Liu, and S\. Yu, “Biomedical question answering: A survey of approaches and challenges,”*ACM Comput\. Surv\.*, vol\. 55, no\. 2, pp\. 1–36, 2022\.
- \[95\]A\. W\. Rosen, I\. Ose, M\. Gögenur, L\. P\. K\. Andersen, R\. D\. Bojesen, R\. P\. Vogelsang, M\. H\. Rose, P\. W\. Steenfos, L\. B\. Hansen, and H\. S\. S\. et al\., “Clinical implementation of an AI\-based prediction model for decision support for patients undergoing colorectal cancer surgery,”*Nat\. Med\.*, pp\. 1–12, 2025\.
- \[96\]A\. Ş\. Çiftçi and A\. H\. Acar, “Artificial intelligence\-based chatbot assistance in clinical decision\-making for medically complex patients in oral surgery: A comparative study,”*BMC Oral Health*, vol\. 25, no\. 1, p\. 351, 2025\.
- \[97\]A\. Patil, V\. Patil, S\. Sankpal, T\. S\. Patankar, and H\. Bhute, “Multimodal decision support system for improved diagnosis and healthcare decision making,”*J\. Biol\. Health Sci\.*, 2025\.
- \[98\]L\. Gong, J\. Yang, S\. Han, and Y\. Ji, “MedBLIP: A multimodal method of medical question answering based on fine\-tuning large language models,”*Comput\. Med\. Imaging Graph\.*, p\. 102581, 2025\.
- \[99\]S\. Schmidgall, J\. D\. Opfermann, J\. W\. Kim, and A\. Krieger, “Will your next surgeon be a robot? Autonomy and AI in robotic surgery,”*Sci\. Robot\.*, vol\. 10, no\. 104, p\. eadt0187, 2025\.
- \[100\]A\. Attanasio, B\. Scaglioni, E\. D\. Momi, P\. Fiorini, and P\. Valdastri, “Autonomy in surgical robotics,”*Annu\. Rev\. Control Robot\. Auton\. Syst\.*, vol\. 4, no\. 1, pp\. 651–679, 2021\.
- \[101\]A\. Peloso, R\. Damiano, X\. Zhang, A\. Bicchi, E\. Votta, and E\. D\. Momi, “Imitation learning for path planning in cardiac percutaneous interventions,”*IEEE Trans\. Biomed\. Eng\.*, 2025\.
- \[102\]J\. W\. Kim, T\. Z\. Zhao, S\. Schmidgall, A\. Deguet, M\. Kobilarov, C\. Finn, and A\. Krieger, “Surgical robot transformer \(SRT\): Imitation learning for surgical tasks,”*arXiv preprint arXiv:2407\.12998*, 2024\.
- \[103\]S\. Paradis, M\. Hwang, B\. Thananjeyan, J\. Ichnowski, D\. Seita, D\. Fer, T\. Low, J\. E\. Gonzalez, and K\. Goldberg, “Intermittent visual servoing: Efficiently learning policies robust to instrument changes for high\-precision surgical manipulation,” in*Proc\. IEEE Int\. Conf\. Robot\. Autom\. \(ICRA\)*, 2021, pp\. 7166–7173\.
- \[104\]M\. Moghani, N\. Nelson, M\. Ghanem, A\. Diaz\-Pinto, K\. Hari, M\. Azizian, K\. Goldberg, S\. Huver, and A\. Garg, “SuFIA\-BC: Generating high\-quality demonstration data for visuomotor policy learning in surgical subtasks,”*arXiv preprint arXiv:2504\.14857*, 2025\.
- \[105\]A\. Segato, M\. D\. Marzo, S\. Zucchelli, S\. Galvan, R\. Secoli, and E\. D\. Momi, “Inverse reinforcement learning intra\-operative path planning for steerable needles,”*IEEE Trans\. Biomed\. Eng\.*, vol\. 69, no\. 6, pp\. 1995–2005, 2021\.
- \[106\]W\. Chi, G\. Dagnino, T\. M\. Y\. Kwok, A\. Nguyen, D\. Kundrat, M\. E\. M\. K\. Abdelaziz, C\. Riga, C\. Bicknell, and G\.\-Z\. Yang, “Collaborative robot\-assisted endovascular catheterization with generative adversarial imitation learning,” in*Proc\. IEEE Int\. Conf\. Robot\. Autom\. \(ICRA\)*, 2020, pp\. 2414–2420\.
- \[107\]A\. Gomaa, B\. Mahdy, N\. Kleer, and A\. Krüger, “Towards a surgeon\-in\-the\-loop ophthalmic robotic apprentice using reinforcement and imitation learning,” in*Proc\. IEEE/RSJ Int\. Conf\. Intell\. Robots Syst\. \(IROS\)*, 2024, pp\. 6939–6946\.
- \[108\]B\. Li, R\. Wei, J\. Xu, B\. Lu, C\. H\. Yee, C\. F\. Ng, P\.\-A\. Heng, Q\. Dou, and Y\.\-H\. Liu, “3D perception\-based imitation learning under limited demonstration for laparoscope control in robotic surgery,” in*Proc\. IEEE Int\. Conf\. Robot\. Autom\. \(ICRA\)*, 2022, pp\. 7664–7670\.
- \[109\]C\. Yu, J\. Liu, S\. Nemati, and G\. Yin, “Reinforcement learning in healthcare: A survey,”*ACM Comput\. Surv\.*, vol\. 55, no\. 1, pp\. 1–36, 2021\.
- \[110\]V\. M\. Varier, D\. K\. Rajamani, N\. Goldfarb, F\. Tavakkolmoghaddam, A\. Munawar, and G\. S\. Fischer, “Collaborative suturing: A reinforcement learning approach to automate hand\-off tasks in suturing for surgical robots,” in*Proc\. IEEE Int\. Conf\. Robot Human Interact\. Commun\. \(RO\-MAN\)*, 2020, pp\. 1380–1386\.
- \[111\]P\. M\. Scheikl, B\. Gyenes, R\. Younis, C\. Haas, G\. Neumann, M\. Wagner, and F\. Mathis\-Ullrich, “LapGym: An open\-source framework for reinforcement learning in robot\-assisted laparoscopic surgery,”*J\. Mach\. Learn\. Res\.*, vol\. 24, no\. 368, pp\. 1–42, 2023\.
- \[112\]F\. Meng, S\. Guo, W\. Zhou, and Z\. Chen, “Evaluation of an autonomous navigation method for vascular interventional surgery in virtual environments,” in*Proc\. IEEE Int\. Conf\. Mechatronics Autom\. \(ICMA\)*, 2022, pp\. 1599–1604\.
- \[113\]J\. Liu, A\. Andres, Y\. Jiang, X\. Luo, W\. Shu, and S\. A\. Tsaftaris, “Surgical task automation using actor–critic frameworks and self\-supervised imitation learning,”*arXiv preprint arXiv:2409\.02724*, 2024\.
- \[114\]H\. Li, X\.\-H\. Zhou, X\.\-L\. Xie, S\.\-Q\. Liu, Z\.\-Q\. Feng, and Z\.\-G\. Hou, “CASOG: Conservative actor–critic with smooth gradient for skill learning in robot\-assisted intervention,”*IEEE Trans\. Ind\. Electron\.*, vol\. 71, no\. 7, pp\. 7722–7731, 2023\.
- \[115\]Z\. Min, J\. Lai, and H\. Ren, “Innovating robot\-assisted surgery through large vision models,”*Nat\. Rev\. Electr\. Eng\.*, pp\. 1–14, 2025\.
- \[116\]S\. Schmidgall, J\. Cho, C\. Zakka, and W\. Hiesinger, “GP\-VLS: A general\-purpose vision–language model for surgery,”*arXiv preprint arXiv:2407\.19305*, 2024\.
- \[117\]A\. Moglia, K\. Georgiou, P\. Cerveri, L\. Mainardi, R\. M\. Satava, and A\. Cuschieri, “Large language models in healthcare: From a systematic review on medical examinations to a comparative analysis on fundamentals of robotic surgery online test,”*Artif\. Intell\. Rev\.*, vol\. 57, no\. 9, p\. 231, 2024\.
- \[118\]L\. Seenivasan, M\. Islam, G\. Kannan, and H\. Ren, “SurgicalGPT: End\-to\-end language–vision GPT for visual question answering in surgery,” in*Proc\. Int\. Conf\. Med\. Image Comput\. Comput\.\-Assist\. Interv\. \(MICCAI\)*, 2023, pp\. 281–290\.
- \[119\]M\. Moor, Q\. Huang, S\. Wu, M\. Yasunaga, Y\. Dalmia, J\. Leskovec, C\. Zakka, E\. P\. Reis, and P\. Rajpurkar, “Med\-Flamingo: A multimodal medical few\-shot learner,” in*Proc\. Mach\. Learn\. Health \(ML4H\)*, 2023, pp\. 353–367\.
- \[120\]S\. Li, J\. Wang, R\. Dai, W\. Ma, W\. Y\. Ng, Y\. Hu, and Z\. Li, “RoboNurse\-VLA: Robotic scrub nurse system based on vision–language–action model,”*arXiv preprint arXiv:2409\.19590*, 2024\.
- \[121\]C\. D’Ettorre, A\. Mariani, A\. Stilli, F\. Rodriguez y Baena, P\. Valdastri, A\. Deguet, P\. Kazanzides, R\. H\. Taylor, G\. S\. Fischer, S\. P\. DiMaio*et al\.*, “Accelerating surgical robotics research: A review of 10 years with the da Vinci Research Kit,”*IEEE Robot\. Autom\. Mag\.*, vol\. 28, no\. 4, pp\. 56–78, 2021\.
- \[122\]K\. Tadano and K\. Kawashima, “A pneumatic laparoscope holder controlled by head movement,”*Int\. J\. Med\. Robot\. Comput\. Assist\. Surg\.*, vol\. 11, no\. 3, pp\. 331–340, 2015\.
- \[123\]L\. Chenin, J\. Peltier, and M\. Lefranc, “Minimally invasive transforaminal lumbar interbody fusion with the ROSA spine robot and intraoperative flat\-panel CT guidance,”*Acta Neurochir\.*, vol\. 158, no\. 6, pp\. 1125–1128, 2016\.
- \[124\]T\. L\. Edwards, K\. Xue, H\. C\. M\. Meenink, M\. J\. Beelen, G\. J\. L\. Naus, M\. P\. Simunovic, M\. Latasiewicz, A\. D\. Farmery, M\. D\. De Smet, and R\. E\. MacLaren, “First\-in\-human study of the safety and viability of intraocular robotic surgery,”*Nat\. Biomed\. Eng\.*, vol\. 2, no\. 9, pp\. 649–656, 2018\.
- \[125\]W\. L\. Bargar, A\. Bauer, and M\. Börner, “Primary and revision total hip replacement using the ROBODOC® system,”*Clin\. Orthop\. Relat\. Res\.*, vol\. 354, pp\. 82–91, 1998\.
- \[126\]N\. Dastagir, D\. Obed, M\. Tamulevicius, K\. Dastagir, and P\. M\. Vogt, “The use of the Symani surgical system® in emergency hand trauma care,”*Surg\. Innov\.*, vol\. 31, no\. 5, pp\. 460–465, 2024\.
- \[127\]W\. Kilby, J\. R\. Dooley, G\. Kuduvalli, S\. Sayeh, and C\. R\. Maurer, “The CyberKnife® robotic radiosurgery system in 2010,”*Technol\. Cancer Res\. Treat\.*, vol\. 9, no\. 5, pp\. 433–452, 2010\.
- \[128\]C\. F\. Graetzel, A\. Sheehy, and D\. P\. Noonan, “Robotic bronchoscopy drive mode of the Auris Monarch platform,” in*Proc\. IEEE Int\. Conf\. Robot\. Autom\. \(ICRA\)*, 2019, pp\. 3895–3901\.
- \[129\]M\. Remacle, V\. M\. N\. Prasad, G\. Lawson, L\. Plisson, V\. Bachy, and S\. Van der Vorst, “Transoral robotic surgery \(TORS\) with the Medrobotics Flex™ system: First surgical application on humans,”*Eur\. Arch\. Oto\-Rhino\-Laryngol\.*, vol\. 272, no\. 6, pp\. 1451–1455, 2015\.
- \[130\]G\. W\. Britz, S\. S\. Panesar, P\. Falb, J\. Tomas, V\. Desai, and A\. Lumsden, “Neuroendovascular\-specific engineering modifications to the CorPath GRX robotic system,”*J\. Neurosurg\.*, vol\. 133, no\. 6, pp\. 1830–1836, 2019\.
- \[131\]T\. Shibata, “Therapeutic seal robot as biofeedback medical device: Qualitative and quantitative evaluations of robot therapy in dementia care,”*Proc\. IEEE*, vol\. 100, no\. 8, pp\. 2527–2538, 2012\.
- \[132\]K\. Tanaka, H\. Makino, K\. Nakamura, A\. Nakamura, M\. Hayakawa, H\. Uchida, M\. Kasahara, H\. Kato, and T\. Igarashi, “Pilot study of group robot intervention on pediatric inpatients and their caregivers using New Aibo,”*Eur\. J\. Pediatr\.*, vol\. 181, no\. 3, pp\. 1055–1061, 2022\.
- \[133\]A\. K\. Pandey and R\. Gelin, “A mass\-produced sociable humanoid robot: Pepper, the first machine of its kind,”*IEEE Robot\. Autom\. Mag\.*, vol\. 25, no\. 3, pp\. 40–48, 2018\.
- \[134\]E\. Broadbent, K\. Loveys, G\. Ilan, G\. Chen, M\. M\. Chilukuri, S\. G\. Boardman, P\. M\. Doraiswamy, and D\. Skuler, “ElliQ: An AI\-driven social robot to alleviate loneliness: Progress and lessons learned,”*J\. Aging Res\. Lifestyle*, vol\. 13, pp\. 22–28, 2024\.
- \[135\]A\. Meghdari, A\. Shariati, M\. Alemi, G\. R\. Vossoughi, A\. Eydi, E\. Ahmadi, B\. Mozafari, A\. Amoozandeh Nobaveh, and R\. Tahami, “ARASH: A social robot buddy to support children with cancer in a hospital environment,”*Proc\. Inst\. Mech\. Eng\. H*, vol\. 232, no\. 6, pp\. 605–618, 2018\.
- \[136\]Z\. H\. Khan, A\. Siddique, and C\. W\. Lee, “Robotics utilization for healthcare digitization in global COVID\-19 management,”*Int\. J\. Environ\. Res\. Public Health*, vol\. 17, no\. 11, p\. 3819, 2020\.
- \[137\]J\. González\-Jiménez, C\. Galindo, and J\. R\. Ruiz\-Sarmiento, “Technical improvements of the Giraff telepresence robot based on users’ evaluation,” in*Proc\. IEEE Int\. Symp\. Robot Human Interact\. Commun\. \(RO\-MAN\)*, 2012, pp\. 827–832\.
- \[138\]K\. Ogawa, S\. Nishio, K\. Koda, K\. Taura, T\. Minato, C\. T\. Ishii, and H\. Ishiguro, “Telenoid: Tele\-presence android for communication,” in*ACM SIGGRAPH Emerging Technologies*, 2011, p\. 1\.
- \[139\]R\. E\. Clark, D\. F\. Feldon, J\. J\. G\. van Merriënboer, K\. A\. Yates, and S\. Early, “Cognitive task analysis,” in*Handbook of Research on Educational Communications and Technology*, 2008, pp\. 577–593\.
- \[140\]M\. Y\. Kolesnyk, “First experience of using the Body Interact simulation platform in intern attestation,” 2020\.
- \[141\]K\. Gallagher, S\. Bahadori, J\. Antonis, T\. Immins, T\. W\. Wainwright, and R\. Middleton, “Validation of the hip arthroscopy module of the VirtaMed virtual reality arthroscopy trainer,”*Surg\. Technol\. Int\.*, vol\. 34, pp\. 430–436, 2019\.
- \[142\]V\. A\. Vasilev and S\. N\. Kondrichina, “Possibilities for using the Vimedix 3\.2 virtual simulator to train ultrasound specialists,”*Digit\. Diagn\.*, vol\. 5, no\. 1, pp\. 41–52, 2024\.
- \[143\]M\. Keller, S\. Zuffi, M\. J\. Black, and S\. Pujades, “OSSO: Obtaining skeletal shape from outside,” in*Proc\. IEEE/CVF Conf\. Comput\. Vis\. Pattern Recognit\. \(CVPR\)*, 2022, pp\. 20 492–20 501\.
- \[144\]J\. Lilly, “3D Organon VR Anatomy,”*J\. Med\. Libr\. Assoc\.*, vol\. 110, no\. 2, p\. 276, 2022\.
- \[145\]“Teladoc Health,” Online, 2025, accessed: Nov\. 14, 2025\. Available: https://www\.teladochealth\.com\.
- \[146\]L\. Klingensmith and L\. Knodel, “Mercy virtual nursing: An innovative care delivery model,”*Nurse Leader*, vol\. 14, no\. 4, pp\. 275–279, 2016\.
- \[147\]D\. Guo, W\. Liu, X\. Zhang, M\. Zhao, B\. Zhu, T\. Hou, and H\. He, “Duck egg white–derived peptide VSEE regulates bone and lipid metabolism via Wnt/β\-catenin signaling and gut microbiota,”*Mol\. Nutr\. Food Res\.*, vol\. 63, no\. 24, p\. 1900525, 2019\.
- \[148\]E\. D\. Kirby, B\. Beyst, J\. Beyst, S\. M\. Brodie, and R\. C\. N\. D’Arcy, “A retrospective observational study of real\-world clinical data from the cognitive function development therapy program,”*Front\. Hum\. Neurosci\.*, vol\. 18, p\. 1508815, 2024\.
- \[149\]C\. Freschi, V\. Ferrari, F\. Melfi, M\. Ferrari, F\. Mosca, and A\. Cuschieri, “Technical review of the da Vinci surgical telemanipulator,”*Int\. J\. Med\. Robot\. Comput\. Assist\. Surg\.*, vol\. 9, no\. 4, pp\. 396–406, 2013\.
- \[150\]Intuitive Surgical, “da Vinci surgical system,” Online, 2013\. Available: http://www\.intusurg\.com/html/davinci\.html\.
- \[151\]A\. Sekuboyina, M\. E\. Husseini, A\. Bayat, M\. Löffler, H\. Liebl, H\. Li, G\. Tetteh, J\. Kukačka, C\. Payer, D\. Štern*et al\.*, “VerSe: A vertebrae labelling and segmentation benchmark for multi\-detector CT,”*Med\. Image Anal\.*, vol\. 73, p\. 102166, 2021\.
- \[152\]J\. W\. van der Graaf, M\. L\. van Hooff, C\. F\. M\. Buckens, M\. Rutten, J\. L\. C\. van Susante, R\. J\. Kroeze, M\. de Kleuver, B\. van Ginneken, and N\. Lessmann, “Lumbar spine segmentation in MR images: A dataset and public benchmark,”*Sci\. Data*, vol\. 11, no\. 1, p\. 264, 2024\.
- \[153\]Y\. Deng, C\. Wang, Y\. Hui, Q\. Li, J\. Li, S\. Luo, M\. Sun, Q\. Quan, S\. Yang, Y\. Hao*et al\.*, “CTSpine1K: A large\-scale dataset for spinal vertebrae segmentation in computed tomography,”*arXiv preprint arXiv:2105\.14711*, 2021\.
- \[154\]P\. Liu, H\. Han, Y\. Du, H\. Zhu, Y\. Li, F\. Gu, H\. Xiao, J\. Li, C\. Zhao, L\. Xiao*et al\.*, “Deep learning for pelvic bone segmentation: Large\-scale CT datasets and baseline models,”*Int\. J\. Comput\. Assist\. Radiol\. Surg\.*, vol\. 16, no\. 5, pp\. 749–756, 2021\.
- \[155\]D\. Gutman, N\. C\. F\. Codella, E\. Celebi, B\. Helba, M\. Marchetti, N\. Mishra, and A\. Halpern, “ISIC 2016 challenge: Skin lesion analysis toward melanoma detection,”*arXiv preprint arXiv:1605\.01397*, 2016\.
- \[156\]T\. Mendonça, M\. E\. Celebi, T\. Mendonca, and J\. Marques, “PH2: A public database for dermoscopic image analysis,”*Dermoscopic Image Anal\.*, vol\. 2, 2015\.
- \[157\]J\. Wasserthal, H\.\-C\. Breit, M\. T\. Meyer, M\. Pradella, D\. Hinck, A\. W\. Sauter, T\. Heye, D\. T\. Boll, J\. Cyriac, S\. Yang*et al\.*, “TotalSegmentator: Robust segmentation of 104 anatomical structures in CT images,”*Radiol\. Artif\. Intell\.*, vol\. 5, no\. 5, p\. e230024, 2023\.
- \[158\]M\. J\. J\. de Grauw, E\. T\. Scholten, E\. J\. Smit, M\. J\. C\. M\. Rutten, M\. Prokop, B\. van Ginneken, and A\. Hering, “ULS23 challenge: A benchmark for universal 3D lesion segmentation in CT,”*Med\. Image Anal\.*, p\. 103525, 2025\.
- \[159\]S\. Gatidis, T\. Hepp, M\. Früh, C\. La Fougère, K\. Nikolaou, C\. Pfannenberg, B\. Schölkopf, T\. Küstner, C\. Cyran, and D\. Rubin, “A whole\-body FDG\-PET/CT dataset with manually annotated tumor lesions,”*Sci\. Data*, vol\. 9, no\. 1, p\. 601, 2022\.
- \[160\]M\. Allan, A\. Shvets, T\. Kurmann, Z\. Zhang, R\. Duggal, Y\.\-H\. Su, N\. Rieke, I\. Laina, N\. Kalavakonda, S\. Bodenstedt*et al\.*, “The 2017 robotic instrument segmentation challenge,”*arXiv preprint arXiv:1902\.06426*, 2019\.
- \[161\]S\. Lin, F\. Qin, Y\. Li, R\. A\. Bly, K\. S\. Moe, and B\. Hannaford, “LC\-GAN: Image\-to\-image translation based on generative adversarial network for endoscopic images,” in*Proc\. IEEE/RSJ Int\. Conf\. Intell\. Robots Syst\. \(IROS\)*, 2020, pp\. 2914–2920\.
- \[162\]H\. Ding, Y\. Zhang, T\. Lu, R\. Liang, H\. Shu, L\. Seenivasan, Y\. Long, Q\. Dou, C\. Gao, Y\. Leng*et al\.*, “SegSTRONG\-C: Segmenting surgical tools robustly on non\-adversarial generated corruptions—an EndoVis’24 challenge,”*arXiv preprint arXiv:2407\.11906*, 2024\.
- \[163\]A\. P\. Twinanda, S\. Shehata, D\. Mutter, J\. Marescaux, M\. De Mathelin, and N\. Padoy, “EndoNet: A deep architecture for recognition tasks on laparoscopic videos,”*IEEE Trans\. Med\. Imaging*, vol\. 36, no\. 1, pp\. 86–97, 2016\.
- \[164\]C\. I\. Nwoye, K\. Elgohary, A\. Srinivas, F\. Zaid, J\. L\. Lavanchy, and N\. Padoy, “CholecTrack20: A multi\-perspective tracking dataset for surgical tools,” in*Proc\. IEEE/CVF Conf\. Comput\. Vis\. Pattern Recognit\. \(CVPR\)*, 2025, pp\. 8942–8952\.
- \[165\]A\. Murali, D\. Alapatt, P\. Mascagni, A\. Vardazaryan, A\. Garcia, N\. Okamoto, G\. Costamagna, D\. Mutter, J\. Marescaux, B\. Dallemagne*et al\.*, “The endoscapes dataset for surgical scene segmentation, object detection, and critical view of safety assessment: Official splits and benchmark,”*arXiv preprint arXiv:2312\.12429*, 2023\.
- \[166\]S\. Maqbool, A\. Riaz, H\. Sajid, and O\. Hasan, “m2caiseg: Semantic segmentation of laparoscopic images using convolutional neural networks,”*arXiv preprint arXiv:2008\.10134*, 2020\.
- \[167\]E\. Özsoy, C\. Pellegrini, T\. Czempiel, F\. Tristram, K\. Yuan, D\. Bani\-Harouni, U\. Eck, B\. Busam, M\. Keicher, and N\. Navab, “MM\-OR: A large multimodal operating room dataset for semantic understanding of high\-intensity surgical environments,” in*Proc\. IEEE/CVF Conf\. Comput\. Vis\. Pattern Recognit\. \(CVPR\)*, 2025, pp\. 19 378–19 389\.
- \[168\]N\. L\. Rodas, F\. Barrera, and N\. Padoy, “See it with your own eyes: Markerless mobile augmented reality for radiation awareness in the hybrid room,”*IEEE Trans\. Biomed\. Eng\.*, vol\. 64, no\. 2, pp\. 429–440, 2016\.
- \[169\]D\. Hu, S\. Li, and M\. Wang, “Object detection in hospital facilities: A comprehensive dataset and performance evaluation,”*Eng\. Appl\. Artif\. Intell\.*, vol\. 123, p\. 106223, 2023\.
- \[170\]F\. S\. Bashiri, E\. LaRose, P\. Peissig, and A\. P\. Tafti, “Mcindoor20000: A fully\-labeled image dataset to advance indoor objects detection,”*Data Brief*, vol\. 17, pp\. 71–75, 2018\.
- \[171\]A\. Ismail, S\. A\. Ahmad, A\. C\. Soh, M\. K\. Hassan, and H\. H\. Harith, “Mynursinghome: A fully\-labelled image dataset for indoor object classification,”*Data Brief*, vol\. 32, p\. 106268, 2020\.
- \[172\]V\. Srivastav, T\. Issenhuth, A\. Kadkhodamohammadi, M\. de Mathelin, A\. Gangi, and N\. Padoy, “MVOR: A multi\-view RGB\-D operating room dataset for 2D and 3D human pose estimation,”*arXiv preprint arXiv:1808\.08180*, 2018\.
- \[173\]K\. Chen, P\. Gabriel, A\. Alasfour, C\. Gong, W\. K\. Doyle, O\. Devinsky, D\. Friedman, P\. Dugan, L\. Melloni, T\. Thesen*et al\.*, “Patient\-specific pose estimation in clinical environments,”*IEEE J\. Transl\. Eng\. Health Med\.*, vol\. 6, pp\. 1–11, 2018\.
- \[174\]V\. Markova, T\. Ganchev, S\. Filkova, and M\. Markov, “MMD\-MSD: A multimodal multisensory dataset in support of research and technology development for musculoskeletal disorders,”*Algorithms*, vol\. 17, no\. 5, p\. 187, 2024\.
- \[175\]J\. Wu, Z\. Chen, and M\. Xu, “Surgtrack: CAD\-free 3D tracking of real\-world surgical instruments,” in*Proc\. Int\. Conf\. Med\. Image Comput\. Comput\.\-Assist\. Interv\. Workshops \(MICCAI Workshops\)*, vol\. 15274, 2025, p\. 168\.
- \[176\]M\. J\. Sekiavandi, L\. Dixen, J\. Fimland, S\. K\. Desu, A\.\-B\. Zserai, Y\. S\. Lee, M\. Barrett, and P\. Burelli, “Advancing face\-to\-face emotion communication: A multimodal dataset \(AFFEC\),”*arXiv preprint arXiv:2504\.18969*, 2025\.
- \[177\]Z\. Cheng, Z\.\-Q\. Cheng, J\.\-Y\. He, K\. Wang, Y\. Lin, Z\. Lian, X\. Peng, and A\. Hauptmann, “Emotion\-LLaMA: Multimodal emotion recognition and reasoning with instruction tuning,”*Adv\. Neural Inf\. Process\. Syst\. \(NeurIPS\)*, vol\. 37, pp\. 110 805–110 853, 2024\.
- \[178\]P\. Yang, N\. Liu, X\. Liu, Y\. Shu, W\. Ji, Z\. Ren, J\. Sheng, M\. Yu, R\. Yi, D\. Zhang*et al\.*, “A multimodal dataset for mixed emotion recognition,”*Sci\. Data*, vol\. 11, no\. 1, p\. 847, 2024\.
- \[179\]W\.\-B\. Jiang, X\.\-H\. Liu, W\.\-L\. Zheng, and B\.\-L\. Lu, “Seed\-VII: A multimodal dataset of six basic emotions with continuous labels for emotion recognition,”*IEEE Trans\. Affect\. Comput\.*, 2024\.
- \[180\]R\. Subramanian, J\. Wache, M\. K\. Abadi, R\. L\. Vieriu, S\. Winkler, and N\. Sebe, “ASCERTAIN: Emotion and personality recognition using commercial sensors,”*IEEE Trans\. Affect\. Comput\.*, vol\. 9, no\. 2, pp\. 147–160, 2016\.
- \[181\]S\. Katsigiannis and N\. Ramzan, “DREAMER: A database for emotion recognition through EEG and ECG signals from wireless low\-cost off\-the\-shelf devices,”*IEEE J\. Biomed\. Health Inform\.*, vol\. 22, no\. 1, pp\. 98–107, 2017\.
- \[182\]M\. Hu, P\. Xia, L\. Wang, S\. Yan, F\. Tang, Z\. Xu, Y\. Luo, K\. Song, J\. Leitner, X\. Cheng*et al\.*, “OphNet: A large\-scale video benchmark for ophthalmic surgical workflow understanding,” in*Proc\. Eur\. Conf\. Comput\. Vis\. \(ECCV\)*, 2024, pp\. 481–500\.
- \[183\]Z\. Wang, B\. Lu, Y\. Long, F\. Zhong, T\.\-H\. Cheung, Q\. Dou, and Y\.\-H\. Liu, “AutoLaparo: A dataset of integrated multi\-tasks for image\-guided surgical automation in laparoscopic hysterectomy,” in*Proc\. Int\. Conf\. Med\. Image Comput\. Comput\.\-Assist\. Interv\. \(MICCAI\)*, 2022, pp\. 486–496\.
- \[184\]A\. Derathé, F\. Reche, S\. Guy, K\. Charrière, B\. Trilling, P\. Jannin, A\. Moreau\-Gaudry, B\. Gibaud, and S\. Vórös, “LapEx: A multimodal dataset for context recognition and practice assessment in laparoscopic surgery,”*Sci\. Data*, vol\. 12, no\. 1, p\. 342, 2025\.
- \[185\]A\. Huaulmé, D\. Sarikaya, K\. Le Mut, F\. Despinoy, Y\. Long, Q\. Dou, C\.\-B\. Chng, W\. Lin, S\. Kondo, L\. Bravo\-Sánchez*et al\.*, “Micro\-surgical anastomose workflow recognition challenge report,”*Comput\. Methods Programs Biomed\.*, vol\. 212, p\. 106452, 2021\.
- \[186\]Z\. Wu, D\. Tong, H\. Xie, L\. Sun, X\. Fan, and Z\. Yang, “A portable 6D surgical instrument magnetic localization system with dynamic error correction,”*IEEE Sens\. J\.*, 2025\.
- \[187\]Z\. Qi, H\. Jin, X\. Xu, Q\. Wang, Z\. Gan, R\. Xiong, S\. Zhang, M\. Liu, J\. Wang, X\. Ding*et al\.*, “Head model dataset for mixed reality navigation in neurosurgical interventions for intracranial lesions,”*Sci\. Data*, vol\. 11, no\. 1, p\. 538, 2024\.
- \[188\]F\. Xia, A\. R\. Zamir, Z\. He, A\. Sax, J\. Malik, and S\. Savarese, “Gibson Env: Real\-world perception for embodied agents,” in*Proc\. IEEE/CVF Conf\. Comput\. Vis\. Pattern Recognit\. \(CVPR\)*, 2018, pp\. 9068–9079\.
- \[189\]S\. K\. Ramakrishnan, A\. Gokaslan, E\. Wijmans, O\. Maksymets, A\. Clegg, J\. Turner, E\. Undersander, W\. Galuba, A\. Westbury, A\. X\. Chang*et al\.*, “HM3D: 1000 large\-scale 3D environments for embodied AI,”*arXiv preprint arXiv:2109\.08238*, 2021\.
- \[190\]K\. Yuan, M\. Kattel, J\. L\. Lavanchy, N\. Navab, V\. Srivastav, and N\. Padoy, “Advancing surgical VQA with scene graph knowledge,”*Int\. J\. Comput\. Assist\. Radiol\. Surg\.*, vol\. 19, no\. 7, pp\. 1409–1417, 2024\.
- \[191\]S\. Ray, K\. Gupta, S\. Kundu, P\. A\. Kasat, S\. Aditya, and P\. Goyal, “ERVQA: A dataset to benchmark the readiness of large vision–language models in hospital environments,”*arXiv preprint arXiv:2410\.06420*, 2024\.
- \[192\]J\. Wu, W\. Deng, X\. Li, S\. Liu, T\. Mi, Y\. Peng, Z\. Xu, Y\. Liu, H\. Cho, C\.\-I\. Choi*et al\.*, “MedReason: Eliciting factual medical reasoning steps in LLMs via knowledge graphs,”*arXiv preprint arXiv:2504\.00993*, 2025\.
- \[193\]Y\. Sun, X\. Qian, W\. Xu, H\. Zhang, C\. Xiao, L\. Li, D\. Zhao, W\. Huang, T\. Xu, Q\. Bai*et al\.*, “ReasonMed: A 370k multi\-agent generated dataset for advancing medical reasoning,” in*Proc\. Conf\. Empir\. Methods Nat\. Lang\. Process\. \(EMNLP\)*, 2025, pp\. 26 457–26 478\.
- \[194\]E\. Özsoy, C\. Pellegrini, D\. Bani\-Harouni, K\. Yuan, M\. Keicher, and N\. Navab, “ORQA: A benchmark and foundation model for holistic operating room modeling,”*arXiv preprint arXiv:2505\.12890*, 2025\.
- \[195\]J\. Li, G\. Skinner, G\. Yang, B\. R\. Quaranto, S\. D\. Schwaitzberg, P\. C\. W\. Kim, and J\. Xiong, “LLaVA\-Surg: Towards a multimodal surgical assistant via structured surgical video learning,”*arXiv preprint arXiv:2408\.07981*, 2024\.
- \[196\]J\. Xu, B\. Li, B\. Lu, Y\.\-H\. Liu, Q\. Dou, and P\.\-A\. Heng, “SurRoL: An open\-source reinforcement learning centered and dVRK\-compatible platform for surgical robot learning,” in*Proc\. IEEE/RSJ Int\. Conf\. Intell\. Robots Syst\. \(IROS\)*, 2021, pp\. 1821–1828\.
- \[197\]Q\. Yu, M\. Moghani, K\. Dharmarajan, V\. Schorp, W\. C\.\-H\. Panitch, J\. Liu, K\. Hari, H\. Huang, M\. Mittal, K\. Goldberg*et al\.*, “ORBIT\-Surgical: An open simulation framework for learning surgical augmented dexterity,” in*Proc\. IEEE Int\. Conf\. Robot\. Autom\. \(ICRA\)*, 2024, pp\. 15 509–15 516\.
- \[198\]S\. Schmidgall, A\. Krieger, and J\. Eshraghian, “Surgical Gym: A high\-performance GPU\-based platform for reinforcement learning with surgical robots,” in*Proc\. IEEE Int\. Conf\. Robot\. Autom\. \(ICRA\)*, 2024, pp\. 13 354–13 361\.
- \[199\]Y\. Ao, M\. Moghani, M\. Mittal, M\. Prajapat, L\. Wu, F\. Giraud, F\. Carrillo, A\. Krause, and P\. Fürnstahl, “SonoGym: High\-performance simulation for challenging surgical tasks with robotic ultrasound,”*arXiv preprint arXiv:2507\.01152*, 2025\.
- \[200\]Y\. Gao, S\. S\. Vedula, C\. E\. Reiley, N\. Ahmidi, B\. Varadarajan, H\. C\. Lin, L\. Tao, L\. Zappella, B\. Béjar, D\. D\. Yuh*et al\.*, “JIGSAWS: A surgical activity dataset for human motion modeling,” in*MICCAI Workshop on Modeling and Monitoring of Computer Assisted Interventions \(M2CAI\)*, 2014\.
- \[201\]R\. Stauder, D\. Ostler, M\. Kranzfelder, S\. Koller, H\. Feußner, and N\. Navab, “The TUM LapChole dataset for the M2CAI 2016 workflow challenge,”*arXiv preprint arXiv:1610\.09278*, 2016\.
- \[202\]J\. L\. Lavanchy, S\. Ramesh, D\. Dall’Alba, C\. Gonzalez, P\. Fiorini, B\. P\. Müller\-Stich, P\. C\. Nett, J\. Marescaux, D\. Mutter, and N\. Padoy, “Challenges in multi\-centric generalization: Phase and step recognition in Roux\-en\-Y gastric bypass surgery,”*Int\. J\. Comput\. Assist\. Radiol\. Surg\.*, vol\. 19, no\. 11, pp\. 2249–2257, 2024\.
- \[203\]A\. Zia, M\. Berniker, R\. Nespolo, C\. Perreault, Z\. Wang, B\. Mueller, R\. Schmidt, K\. Bhattacharyya, X\. Liu, and A\. Jarc, “SurgVU: Surgical visual understanding dataset,”*arXiv preprint arXiv:2501\.09209*, 2025\.
- \[204\]S\. Schmidgall, J\. W\. Kim, J\. Jopling, and A\. Krieger, “General surgery vision transformer: A video pre\-trained foundation model for general surgery,”*arXiv preprint arXiv:2403\.05949*, 2024\.
- \[205\]R\. Hartwig, D\. Ostler, J\.\-C\. Rosenthal, H\. Feußner, D\. Wilhelm, and D\. Wollherr, “MITI: SLAM benchmark for laparoscopic surgery,”*arXiv preprint arXiv:2202\.11496*, 2022\.
- \[206\]G\. Wang, H\. Xiao, R\. Zhang, H\. Gao, L\. Bai, X\. Yang, Z\. Li, H\. Li, and H\. Ren, “CoPESD: A multi\-level surgical motion dataset for training large vision–language models to co\-pilot endoscopic submucosal dissection,” in*Proc\. ACM Int\. Conf\. Multimedia \(ACM MM\)*, 2025, pp\. 12 636–12 643\.
- \[207\]F\. Shang, J\. Fu, Y\. Yang, H\. Huang, J\. Liu, and L\. Ma, “SynFundus\-1M: A high\-quality million\-scale synthetic fundus image dataset with fifteen types of annotation,”*arXiv preprint arXiv:2312\.00377*, 2023\.
- \[208\]K\. Ding, M\. Zhou, H\. Wang, O\. Gevaert, D\. Metaxas, and S\. Zhang, “A large\-scale synthetic pathological dataset for deep learning\-enabled segmentation of breast cancer,”*Sci\. Data*, vol\. 10, no\. 1, p\. 231, 2023\.
- \[209\]J\. Walonoski, S\. Klaus, E\. Granger, D\. Hall, A\. Gregorowicz, G\. Neyarapally, A\. Watson, and J\. Eastman, “Synthea: Novel coronavirus \(COVID\-19\) model and synthetic dataset,”*Intell\.\-Based Med\.*, vol\. 1, p\. 100007, 2020\.
- \[210\]J\. Walonoski, D\. Hall, K\. M\. Bates, M\. H\. Farris, J\. Dagher, M\. E\. Downs, R\. T\. Sivek, B\. Wellner, A\. Gregorowicz, M\. Hadley*et al\.*, “The “coherent dataset”: Combining patient data and imaging in a comprehensive synthetic health record,”*Electronics*, vol\. 11, no\. 8, p\. 1199, 2022\.
- \[211\]H\. Guan and M\. Liu, “Domain adaptation for medical image analysis: A survey,”*IEEE Trans\. Biomed\. Eng\.*, vol\. 69, no\. 3, pp\. 1173–1185, 2021\.
- \[212\]D\. Wang, Y\. Zhang, K\. Zhang, and L\. Wang, “FocalMix: Semi\-supervised learning for 3D medical image detection,” in*Proc\. IEEE/CVF Conf\. Comput\. Vis\. Pattern Recognit\. \(CVPR\)*, 2020, pp\. 3951–3960\.
- \[213\]I\. Dayan, H\. R\. Roth, A\. Zhong, A\. Harouni, A\. Gentili, A\. Z\. Abidin, A\. Liu, A\. B\. Costa, B\. J\. Wood, C\.\-S\. Tsai*et al\.*, “Federated learning for predicting clinical outcomes in patients with COVID\-19,”*Nat\. Med\.*, vol\. 27, no\. 10, pp\. 1735–1743, 2021\.
- \[214\]Z\. Sun, H\. Yin, H\. Chen, T\. Chen, L\. Cui, and F\. Yang, “Disease prediction via graph neural networks,”*IEEE J\. Biomed\. Health Inform\.*, vol\. 25, no\. 3, pp\. 818–826, 2020\.
- \[215\]H\. Wu, W\. Shi, A\. Choudhary, and M\. D\. Wang, “Clinical decision making under uncertainty: A bootstrapped counterfactual inference approach,”*BMC Med\. Inform\. Decis\. Mak\.*, vol\. 24, no\. 1, p\. 275, 2024\.
- \[216\]M\. Sharifi, S\. Tripathi, Y\. Chen, Q\. Zhang, and M\. Tavakoli, “Reinforcement learning methods for assistive and rehabilitation robotic systems: A survey,”*IEEE Trans\. Syst\., Man, Cybern\., Syst\.*, 2025\.