Large AI Models in Dental Healthcare: From General-Purpose Systems to Domain-Specific Foundation Models
Summary
This systematic scoping review examines three categories of large AI models in dental healthcare: language-generative models, discriminative vision foundation models, and dental-specific foundation models, analyzing 97 studies to show that general-purpose and domain-specific models play complementary roles, with integrated pipelines outperforming single-model approaches.
# Large AI Models in Dental Healthcare: From General-Purpose Systems to Domain-Specific Foundation Models

Source: [https://arxiv.org/html/2606.02914](https://arxiv.org/html/2606.02914)

###### Abstract

Background\. Oral diseases affect nearly 3\.5 billion people worldwide, yet the clinical potential of large\-scale AI models in dentistry remains poorly understood in comparative terms\. Recent years have seen rapid growth in three distinct model categories: language\-generative models, discriminative vision foundation models, and dental\-specific foundation models\. No unified review has examined how these categories relate to one another, where they complement or compete, and what their collective limitations mean for clinical deployment\.

Methods\. Following PRISMA\-ScR guidelines, we conducted a systematic search across four databases \(PubMed, Google Scholar, Scopus, arXiv\), independently screened by two reviewers\. After applying predefined inclusion and exclusion criteria, 97 studies published between 2020 and 2026 were included\. We propose a two\-dimensional classification framework organizing models by architectural paradigm and degree of dental specialization, and apply this framework to critically analyze model performance, methodology, and limitations across clinical, educational, and imaging applications\.

Results\. Language\-generative models perform strongly on text\-based tasks including clinical reasoning, licensing examinations, and patient communication, but show inconsistent performance on image\-dependent diagnostic tasks\. Discriminative vision foundation models, particularly adapted variants of SAM and CLIP, achieve strong results in tooth segmentation and lesion detection when fine\-tuned on dental data\. Dental\-specific foundation models spanning both pretrained from scratch systems such as DentVFM and heavily fine\-tuned multimodal systems such as DentVLM and OralGPT demonstrate the strongest performance on complex multimodal clinical tasks\. Across all categories, integrated pipelines combining model types consistently outperform single\-model approaches\. A data availability asymmetry is also observed: dental\-specific pretraining is concentrated almost entirely in the vision family, reflecting the relative scarcity of large dental text corpora compared to imaging datasets\.

Conclusions\. General\-purpose and dental\-specific models play complementary rather than competing roles, and the most effective dental AI systems combine both within structured pipelines\. Safe autonomous deployment remains conditional on resolving three persistent barriers: hallucination in generative models, limited availability of annotated dental datasets, and the absence of standardized clinical evaluation benchmarks\.

###### keywords: large language models, foundation models, dental AI, vision\-language models, clinical decision support, oral healthcare, systematic scoping review

Journal: Artificial Intelligence in Medicine

Affiliations:

- Department of Computer Science and Software Engineering, College of Information Technology, United Arab Emirates University, Al Ain 15551, UAE
- Department of Oral and Craniofacial Health Sciences, College of Dental Medicine, University of Sharjah, Sharjah, UAE
- Weill Cornell Medicine\-Qatar, Doha, Education City, Qatar
- Faculty of Dental Medicine and Oral Health Sciences, McGill University, Montreal, Quebec, Canada

## 1 Introduction

Oral diseases affect nearly 3\.5 billion people, close to half the world’s population, making them the most widespread non\-communicable diseases globally\[[112](https://arxiv.org/html/2606.02914#bib.bib11)\]\. The scale of this burden has driven decades of interest in applying artificial intelligence \(AI\) to dental practice\.
Early efforts, however, ran into a consistent set of problems\. Models built on classical machine learning and later on convolutional neural networks \(CNNs\) were narrow by design: each required its own large annotated dataset, broke down when imaging equipment or acquisition protocols changed, and could not connect across the multiple data types that real clinical work involves\[[61](https://arxiv.org/html/2606.02914#bib.bib117)\]\. Foundation models changed the terms of this problem\. Pretrained on large, diverse corpora, these systems can be directed toward new tasks through prompting or lightweight fine\-tuning rather than training from scratch\. In dentistry, this has made tractable a range of applications that earlier architectures could not handle: clinical reasoning, imaging analysis, patient communication, and educational assessment\. Figure [1](https://arxiv.org/html/2606.02914#S1.F1) traces how this transition unfolded across three generations of dental AI\.

Figure 1: The evolution of AI technology in dental practice, illustrating the progression from early rule\-based and classical machine learning systems through the deep learning era to modern foundation models\. Advances in model architecture, data utilization, and learning strategies have progressively increased generalizability, multimodal capability, and the range of clinical applications\.

The first generation of dental AI relied on hand\-crafted features fed into support vector machines, decision trees, and rule\-based expert systems\. These worked well on narrow, well\-defined problems such as caries screening on bitewing radiographs or rule\-based classification of periodontal status, but generalized poorly beyond the conditions they were built for\[[108](https://arxiv.org/html/2606.02914#bib.bib119),[11](https://arxiv.org/html/2606.02914#bib.bib120)\]\.
CNNs raised the performance ceiling considerably, achieving results competitive with specialists on tasks such as tooth detection and numbering, periapical lesion identification, and restoration segmentation\[[105](https://arxiv.org/html/2606.02914#bib.bib121),[61](https://arxiv.org/html/2606.02914#bib.bib117),[64](https://arxiv.org/html/2606.02914#bib.bib122)\]\. Even so, each application still required its own large labeled dataset, performance dropped when imaging hardware or protocols changed, and the models offered no natural language output that clinicians could work with directly\. The shift to transformer\-based foundation models addressed these limitations more directly: trained at scale on heterogeneous data, they support zero\-shot and few\-shot transfer, handle multiple modalities within a single architecture, and interact through natural language\[[107](https://arxiv.org/html/2606.02914#bib.bib123),[14](https://arxiv.org/html/2606.02914#bib.bib118)\]\. This period has produced both general\-purpose models applied to dental tasks through prompting and fine\-tuning, and domain\-specific foundation models pretrained on curated dental corpora\[[37](https://arxiv.org/html/2606.02914#bib.bib22),[48](https://arxiv.org/html/2606.02914#bib.bib107)\]\. Despite the volume of work now published in this area, the field still lacks a coherent comparative account of these model classes\. Studies have largely addressed one category at a time: LLMs evaluated on clinical question answering, vision foundation models adapted for radiographic segmentation, dental\-specific architectures benchmarked on narrow imaging tasks\[[120](https://arxiv.org/html/2606.02914#bib.bib93),[73](https://arxiv.org/html/2606.02914#bib.bib104),[119](https://arxiv.org/html/2606.02914#bib.bib106)\]\. 
How these categories relate to each other, where they are complementary, where one outperforms another, and what their shared limitations mean for safe deployment, has not been examined within a single framework\. Without that perspective, choosing the right model class for a given clinical application, or deciding when to combine approaches, is difficult to do on principled grounds\. This review addresses that problem\. We conducted a systematic search of four major databases following PRISMA\-ScR guidelines\[[80](https://arxiv.org/html/2606.02914#bib.bib125)\], ultimately analyzing 97 studies published between 2020 and 2026 covering general\-purpose LLMs, vision\-language models \(VLMs\), and dental\-specific foundation models across clinical, educational, and imaging applications\. The work aims to gather and synthesize this literature in a reproducible and transparent manner; to propose a classification scheme that organizes these model classes by architectural paradigm and degree of dental specialization; to critically analyze and compare their performance, methods, and limitations across application domains; and to identify the open challenges that must be resolved before these systems can be deployed autonomously in clinical practice\.

The contributions of this work are as follows:

- 1\. Systematic literature synthesis: A reproducible search and selection of 97 studies across four major databases following PRISMA\-ScR guidelines, covering the full range of large AI model types applied in dental healthcare from 2020 to 2026\.
- 2\. Classification framework: A two\-dimensional taxonomy organizing large dental AI models by architectural paradigm \(generative vs\. vision/multimodal\) and degree of dental specialization \(general\-purpose, adapted, domain\-specific\), providing a principled basis for comparing model classes\.
- 3\. Critical performance analysis: A detailed critical review of reported model performance, experimental methodology, and stated limitations across all three model categories and multiple clinical application domains, going beyond descriptive summary to evaluate the quality and generalizability of reported findings\.
- 4\. Cross\-model complementarity: An analysis of how general\-purpose and dental\-specific models interact in practice, with evidence that integrated multi\-model pipelines consistently outperform single\-model approaches across several task categories\.
- 5\. Challenges and future directions: A structured account of the barriers that currently prevent clinical deployment, including hallucination in generative models, limited availability of annotated dental data, and the absence of standardized evaluation benchmarks, with specific recommendations for how each might be addressed\.

## 2 Search Strategy and Study Selection

This review sought to identify studies published between 2020 and early 2026 that developed, evaluated, or applied large\-scale AI models in dental healthcare contexts\. The study selection process followed the PRISMA\-ScR framework\[[80](https://arxiv.org/html/2606.02914#bib.bib125)\], covering four stages: identification, screening, eligibility, and inclusion\. A list of abbreviations used throughout this review is provided in Table [8](https://arxiv.org/html/2606.02914#S8.T8)\.

Identification\. A comprehensive literature search was conducted across four databases: PubMed, Google Scholar, Scopus, and arXiv\. These databases were selected to provide complementary coverage: PubMed captures peer\-reviewed clinical and biomedical literature; arXiv ensures inclusion of recent AI preprints not yet formally published; and Scopus and Google Scholar provide broad interdisciplinary reach across engineering, computer science, and health informatics\.
The search strategy focused on three main domains: large language models in dental applications, multimodal and vision\-language models in dental healthcare, and dental\-specific foundation models\. Search terms were combined using Boolean operators \(AND, OR\) across three categories to capture the full range of relevant work\. AI model terms included ”large language model,” ”GPT,” ”ChatGPT,” ”Claude,” ”Gemini,” ”foundation model,” ”vision\-language model,” ”multimodal model,” ”SAM,” ”Segment Anything Model,” ”transformer,” ”BERT,” and ”vision transformer\.” Dental\-related terms included ”dental,” ”dentistry,” ”oral health,” ”tooth,” ”teeth,” ”periodontal,” ”orthodontic,” ”endodontic,” ”oral surgery,” ”maxillofacial,” ”caries,” ”panoramic radiograph,” ”CBCT,” and ”dental imaging\.” Application\-related terms included ”diagnosis,” ”detection,” ”segmentation,” ”classification,” ”clinical decision support,” ”treatment planning,” ”education,” and ”patient communication\.” Screening\.The initial search identified 1,129 records across the four databases\. After removing duplicates, 823 papers remained\. Title and abstract screening was conducted independently by two reviewers \(S\.H\. and R\.D\.\) using predefined inclusion and exclusion criteria\. Disagreements were resolved through discussion and consensus\. Studies were considered eligible if they investigated, developed, evaluated, or applied large\-scale foundation models in dental or oral health contexts\. Eligible study types included original research, technical reports, validation studies, and comparative studies published between 2020 and 2026 and written in English\. Studies were excluded if they were review papers, pilot studies, unrelated to dentistry, or focused solely on traditional machine learning or conventional deep learning without a foundation model component\. After this stage, 214 papers remained\. 
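The Boolean search described under Identification, and the record counts reported under Screening, can be illustrated with a minimal sketch. The term lists below are only a subset of the terms quoted above, and the function names and the generic quoted-term query syntax are assumptions for illustration; the exact query strings submitted to each database are not given in this review.

```python
# Subset of the search terms quoted in the review (abbreviated for brevity).
MODEL_TERMS = ["large language model", "ChatGPT", "foundation model",
               "vision-language model", "Segment Anything Model"]
DENTAL_TERMS = ["dental", "dentistry", "oral health", "periodontal", "CBCT"]
APPLICATION_TERMS = ["diagnosis", "segmentation", "clinical decision support"]


def or_group(terms):
    """Quote each term and join the group with OR inside parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(*groups):
    """Combine one OR-group per term category with AND (hypothetical helper)."""
    return " AND ".join(or_group(g) for g in groups)


query = build_query(MODEL_TERMS, DENTAL_TERMS, APPLICATION_TERMS)
print(query)

# Record counts as reported in the Screening paragraph.
records_identified = 1129
after_deduplication = 823
after_title_abstract_screening = 214

# Differences implied by those counts.
duplicates_removed = records_identified - after_deduplication
excluded_at_screening = after_deduplication - after_title_abstract_screening
print(duplicates_removed, excluded_at_screening)
```

Keeping one OR-group per category mirrors the three term categories named in the search strategy, so a record must match at least one model term, one dental term, and one application term.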
Eligibility\.The remaining 214 studies underwent independent full\-text screening by both reviewers \(S\.H\. and R\.D\.\)\. After attempting full\-text retrieval, 30 reports could not be accessed due to unavailable or restricted full texts, leaving 184 studies for assessment\. Papers were excluded if they reported no assessable or sufficiently rigorous evaluation of model performance \(n = 25\), lacked a clear and direct dental application \(n = 22\), or were redundant or overlapping with already included work \(n = 34\)\. Studies with very small sample sizes that did not allow for meaningful performance assessment \(n = 6\) were also excluded\. Any disagreements at this stage were resolved through discussion, with a third reviewer consulted where consensus could not be reached\. Data extraction\.Data were extracted from each included study using a standardized charting form covering: study aim, model type and architecture, dental application domain, dataset characteristics, evaluation metrics, key results, and reported limitations\. Extraction was performed by S\.H\. and verified by R\.D\. Discrepancies were resolved through discussion\. Quality appraisal\.Formal quality appraisal of individual studies was not conducted\. In accordance with JBI methodology for scoping reviews, which aim to map the available evidence rather than assess its quality, appraisal was considered outside the scope of this review\. Methodological limitations of individual studies are instead noted narratively within the relevant results sections\. Included studies\.After applying these criteria, 97 papers were included in the final analysis, as illustrated in the PRISMA flow diagram in Figure[2](https://arxiv.org/html/2606.02914#S2.F2)\. Figure 2:PRISMA 2020 flow diagram summarizing the systematic process of study identification, screening, eligibility assessment, and inclusion\. 
Records were retrieved from four databases, duplicates were removed, and remaining studies were independently screened by two reviewers \(S\.H\. and R\.D\.\) based on predefined inclusion and exclusion criteria\.

The classification framework used to organize these 97 studies by architectural paradigm and degree of dental specialization is described in the following section\.

## 3 Conceptual Framework of Large AI Models in Dental Healthcare

Classifying large AI models in dental healthcare is not straightforward\. These systems vary in how they process data, what they produce as output, and how closely they are tied to dental knowledge\. Rather than attempting an exhaustive taxonomy across all possible dimensions, this review organizes the 97 studies identified in Section [2](https://arxiv.org/html/2606.02914#S2) along two axes that best explain the differences in what models can do and where they are applicable: architectural paradigm, which determines whether a model produces language or structured visual representations; and degree of dental specialization, which reflects how far a model has been adapted from its general\-purpose origins toward clinical dental use\. The relationships between model categories and their clinical roles are summarized in Figure [3](https://arxiv.org/html/2606.02914#S3.F3)\.

Figure 3: Large AI models in dental healthcare organized by architectural paradigm \(left\) and degree of dental specialization \(right\)\. The left panel distinguishes language\-generative models, which produce language, from discriminative vision foundation models, which produce structured visual outputs such as embeddings, masks, or bounding boxes\.
The right panel shows how general\-purpose models can be progressively adapted to dental tasks through inference\-time strategies, parameter\-efficient fine\-tuning, or full domain\-specific pretraining\.

### 3\.1 Architectural Paradigm

All models reviewed here are built on the transformer architecture\[[107](https://arxiv.org/html/2606.02914#bib.bib123)\], but they split into two families based on what they produce\.

Language\-generative models produce language as output\. Large language models use decoder\-only transformers trained autoregressively on text: the model learns to predict the next token from all preceding context\[[14](https://arxiv.org/html/2606.02914#bib.bib118)\], and at sufficient scale this produces systems capable of reasoning, instruction following, and in\-context generalization\. Vision\-language models extend this by coupling a visual encoder \(typically a Vision Transformer that converts an image into a sequence of patch embeddings\[[29](https://arxiv.org/html/2606.02914#bib.bib115)\]\) to a language decoder through a learned projection layer\[[67](https://arxiv.org/html/2606.02914#bib.bib129),[49](https://arxiv.org/html/2606.02914#bib.bib126)\]\. The result accepts both images and text as input while generating language as output, which makes these models suitable for visual question answering, image\-grounded reasoning, and radiographic report generation\. The defining property of this family, whether text\-only or image\-conditioned, is that they generate language\. Although vision\-language models such as GPT\-4o and Gemini accept image input, they remain within the language\-generative family because their primary output is generated language; the addition of a visual encoder does not alter the generative nature of the underlying decoder\.

Discriminative vision foundation models learn visual representations without a language generation objective\.
Contrastive models such as CLIP align image and text encoders in a shared embedding space\[[83](https://arxiv.org/html/2606.02914#bib.bib98)\], enabling zero\-shot classification through similarity comparison\. Promptable segmentation models such as SAM use a Vision Transformer encoder paired with a lightweight mask decoder, accepting point or bounding\-box prompts to generate segmentation masks across image domains\[[54](https://arxiv.org/html/2606.02914#bib.bib89)\]\. Open\-vocabulary detection models such as GroundingDINO incorporate language into the detection pipeline via cross\-modal attention, allowing object localization from arbitrary text queries rather than fixed category sets\[[68](https://arxiv.org/html/2606.02914#bib.bib102)\]\. Architecturally diverse as these models are, they share one property that separates them from the generative family: their outputs are structured visual representations \(embeddings, masks, or bounding boxes\), not language\.

This distinction matters clinically\. Language\-generative models are the natural fit for tasks that require explanation, dialogue, or text output: clinical question answering, patient communication, and report drafting\. Discriminative vision foundation models are better suited to tasks that require spatial precision: tooth segmentation, lesion localization, and anatomical landmark detection\. Many effective dental AI pipelines combine both, using a vision model to extract structured findings and a generative model to interpret and communicate them\.

### 3\.2 Degree of Dental Specialization

General\-purpose models are not trained on dental data\. Getting them to perform well on clinical tasks requires some form of adaptation, and the appropriate strategy depends on how much labeled dental data is available, what computational resources exist, and how large the gap is between the model’s general pretraining distribution and the target dental task\.
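As a point of reference for the zero\-shot starting position, the similarity\-based classification described for contrastive models in Section 3\.1 can be sketched in a few lines. This is a minimal illustration, not any reviewed system: the three\-dimensional vectors are toy stand\-ins for real image and text encoder outputs, the label names are hypothetical, and the temperature value is an assumed setting.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def zero_shot_classify(image_emb, text_embs, labels, temperature=0.07):
    """Pick the label whose text embedding is most similar to the image embedding.

    image_emb: shape (d,), toy stand-in for an image encoder output.
    text_embs: shape (n_labels, d), toy stand-ins for text encoder outputs.
    temperature: assumed softmax temperature (hypothetical setting).
    """
    sims = l2_normalize(text_embs) @ l2_normalize(image_emb)  # cosine similarities
    logits = sims / temperature
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return labels[int(np.argmax(probs))], probs


# Hypothetical labels and toy embeddings, for illustration only.
labels = ["caries", "periapical lesion", "healthy tooth"]
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
image_emb = np.array([0.2, 0.9, 0.1])

predicted, probs = zero_shot_classify(image_emb, text_embs, labels)
print(predicted, probs.round(3))
```

No weights are updated here: changing the candidate labels only changes the text embeddings being compared, which is why this mode of use needs no annotated dental data.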
At the lightest end, models are used without any change to their weights\. Prompt engineering shapes model behavior through carefully constructed inputs; chain\-of\-thought prompting, for instance, encourages the model to reason through intermediate steps before producing a final answer, which consistently improves performance on clinical reasoning tasks\[[111](https://arxiv.org/html/2606.02914#bib.bib137)\]\. In\-context learning goes further by conditioning the model on a small number of worked examples within the prompt, enabling task\-specific behavior without any training\[[14](https://arxiv.org/html/2606.02914#bib.bib118)\]\. Both prompt engineering and in\-context learning require no infrastructure beyond the model itself, making them the lowest\-cost entry point for clinical deployment of general LLMs\. Retrieval\-augmented generation shares this property of requiring no parameter updates, but it is substantively more complex to deploy: it requires an external retrieval system, a curated and maintained knowledge base, and often a separate embedding model to index documents\[[62](https://arxiv.org/html/2606.02914#bib.bib136)\]\. The trade\-off is meaningful: RAG gives the model access to current clinical guidelines or institutional protocols without retraining and substantially reduces hallucination on knowledge\-intensive queries, but it introduces infrastructure dependencies that prompt engineering and in\-context learning do not\. Within the inference\-time tier, these two sub\-levels \(prompt\-based strategies and retrieval\-augmented strategies\) therefore differ in deployment cost even though neither requires parameter updates\. None of these approaches require annotated dental data, which makes the inference\-time tier as a whole the default starting point for most clinical applications of general LLMs\. When labeled dental data is available, parameter\-efficient fine\-tuning offers a middle ground between no adaptation and full retraining\. 
Adapter modules insert small trainable bottleneck layers into a frozen model\[[45](https://arxiv.org/html/2606.02914#bib.bib134)\]; Low\-Rank Adaptation \(LoRA\) achieves similar efficiency by representing weight updates as low\-rank matrix products\[[46](https://arxiv.org/html/2606.02914#bib.bib135)\]\. Both approaches update a small fraction of parameters while preserving general knowledge, which matters when annotated dental datasets are too small to support full fine\-tuning without overfitting\. For vision foundation models in dental imaging, adapter\-based strategies have proven particularly effective at bridging the domain gap between general visual pretraining and the specific characteristics of dental radiographs\[[109](https://arxiv.org/html/2606.02914#bib.bib90)\]\. Dental\-specific foundation models take specialization the furthest, replacing general\-purpose pretraining with large\-scale training on curated dental corpora\. Masked image modeling trains these models to reconstruct missing regions of dental radiographs from context, learning structural representations directly from imaging data without manual annotation\[[44](https://arxiv.org/html/2606.02914#bib.bib131)\]\. Instruction tuning then aligns outputs with clinical tasks by training on expert\-annotated question\-answer pairs\[[77](https://arxiv.org/html/2606.02914#bib.bib133)\]\. The resulting models carry dental knowledge in their weights rather than acquiring it at inference time, which gives them a consistent advantage on tasks requiring stable anatomical understanding across imaging protocols and patient populations\. The clinical implication of this spectrum is not that more specialization is always better\. Inference\-time strategies are flexible, low\-cost, and easy to update as guidelines change\. 
Fully specialized models are harder to maintain and require substantial data to build, but outperform general models on tasks where dental\-specific visual or linguistic knowledge is decisive\. The spectrum should also not be read as a set of mutually exclusive choices\. In practice, effective dental AI systems frequently combine strategies from different levels rather than relying on a single tier\. A LoRA\-fine\-tuned language model can be further augmented with RAG at inference time; a dental\-specific vision foundation model can be coupled with a general\-purpose language decoder to produce natural language reports; and prompt engineering is routinely applied on top of fine\-tuned models to steer output format and reasoning style\. These combinations are not exceptions to the framework but reflect how specialization actually works in deployed systems\. The evidence reviewed in Sections[4](https://arxiv.org/html/2606.02914#S4),[5](https://arxiv.org/html/2606.02914#S5), and[6](https://arxiv.org/html/2606.02914#S6)shows that the most consistently strong\-performing systems tend to be those that combine complementary model types within integrated pipelines, a pattern that is examined explicitly in the discussion\. It should be noted that the two axes are analytically useful but not fully independent\. In practice, dental\-specific foundation models on Axis 2 are concentrated almost entirely within the discriminative vision family on Axis 1\. This reflects a data availability asymmetry: large curated dental imaging datasets exist in sufficient volume to support domain\-specific visual pretraining, whereas comparably sized dental text corpora for language\-generative pretraining remain scarce\. As a result, language\-generative models in dentistry are most commonly found at the general\-purpose or adapted end of the specialization spectrum, while discriminative vision models span the full range from zero\-shot application to dental\-specific pretraining\. 
The two axes therefore describe partially overlapping rather than fully orthogonal dimensions, and this concentration of dental\-specific work in the vision family is itself an important finding about the current state of the field\.

## 4 General Generative Models in Dentistry

### 4\.1 Clinical Reasoning and Diagnostic Support

Many dental studies examine whether general\-purpose AI can support clinical reasoning across various oral health fields\. While these models show strong diagnostic potential, their performance remains inconsistent and varies by specialty and task complexity, as reflected in the studies summarized in Table [1](https://arxiv.org/html/2606.02914#S4.T1)\.

#### 4\.1\.1 Oral Pathology & Mucosal Lesions

Oral pathology is one of the most extensively evaluated application domains for language\-generative models in dentistry, with studies spanning text\-based differential diagnosis, image\-based lesion recognition, and clinical decision support\. Across these applications, a consistent pattern emerges: performance is substantially stronger in text\-based diagnostic tasks than in clinical image interpretation\. Evaluations of oral and maxillofacial disease cases showed that GPT\-4 approached specialist\-level performance while substantially outperforming GPT\-3\.5\[[103](https://arxiv.org/html/2606.02914#bib.bib12)\]\. A clinically important finding was the strong dependence on input quality, with diagnostic accuracy declining from 80\.18% to 61\.80% when case descriptions were provided using non\-standardized language\. This suggests that reliable performance depends heavily on the quality and structure of clinical information provided to the model\. Image\-based evaluations present a more challenging picture\.
Assessments of oral and labial mucosal lesions found that diagnostic performance remained unreliable, although treatment recommendations were more accurate when a correct diagnosis had been established\[[99](https://arxiv.org/html/2606.02914#bib.bib13)\]\. Errors were also reported to be systematic across repeated queries rather than random\. Similar findings were reported in oral lichen planus detection, where seven LLMs were evaluated using histopathologically confirmed photographs across multiple experimental settings\[[85](https://arxiv.org/html/2606.02914#bib.bib14)\]\. Example\-guided prompting improved performance compared with zero\-shot approaches, demonstrating that additional contextual guidance can enhance image\-based reasoning\. Nevertheless, all evaluated CNN models consistently outperformed all evaluated LLMs, indicating that purpose\-trained image\-analysis systems remain more effective for lesion\-classification tasks than general\-purpose generative models\.

#### 4\.1\.2 Periapical and Endodontic Lesions

Language\-generative models have been evaluated in endodontics across three distinct task types: radiographic lesion detection, clinical question answering, and evaluation using questions derived from professional endodontic guidelines, each with markedly different performance outcomes\. Radiographic detection of periapical lesions remains challenging\. Evaluations of ChatGPT 5\.0 and Gemini Flash 2\.0 on periapical radiographs reported high sensitivity but very low specificity, resulting in frequent false\-positive findings and errors in tooth localization\[[15](https://arxiv.org/html/2606.02914#bib.bib15)\]\. Clinical question answering has produced considerably stronger results\. In evaluations of endodontic questions, GPT\-4o achieved the highest overall accuracy \(82\.5%\), while Gemini demonstrated significant improvement over a four\-day testing period\[[16](https://arxiv.org/html/2606.02914#bib.bib16)\]\.
Similar findings were reported in open\-ended endodontic assessments, where GPT\-4 outperformed earlier\-generation models while demonstrating the lowest misinformation rate\[[78](https://arxiv.org/html/2606.02914#bib.bib17)\]\. Despite differences in question formats and evaluation methods, GPT\-4\-based models consistently demonstrated the strongest overall performance\. Guideline\-based benchmarking across 11 LLMs reported the highest accuracies among the evaluated endodontic applications\. Using questions derived from AAE and ESE position statements, ChatGPT\-4o and Claude Opus 4 achieved the highest accuracy \(95\.0%\), while DeepSeek achieved 63\.3%\[[27](https://arxiv.org/html/2606.02914#bib.bib18)\]\. Importantly, even the highest\-performing models produced confidently incorrect answers in some cases, highlighting a hallucination risk that is not fully captured by accuracy metrics alone\. Comparisons with dental students provide additional perspective\. In standardized endodontic scenarios, ChatGPT achieved 99% diagnostic accuracy and outperformed both junior and senior dental students\[[82](https://arxiv.org/html/2606.02914#bib.bib19)\]\. Taken together, these findings highlight important considerations beyond accuracy, including false\-positive findings, variation across repeated interactions, and the challenges of translating performance from standardized assessments to real clinical settings\.

#### 4\.1\.3 Periodontal and Peri\-implant Disease Diagnosis

Research in periodontology and peri\-implant disease reveals a wide performance range across models and a clear signal that domain\-adapted systems outperform general\-purpose chatbots on specialist clinical tasks\. In peri\-implant disease, eight chatbots were evaluated on simulated cases of mucositis and peri\-implantitis, with GPT\-4o achieving the highest diagnostic accuracy at 88\.8% and Copilot the lowest at 49\.9%\[[4](https://arxiv.org/html/2606.02914#bib.bib20)\]\.
A similar pattern emerged in periodontology, where ChatGPT\-4\.0 produced the most accurate and comprehensive responses to open\-ended periodontal questions compared with other evaluated models\[[22](https://arxiv.org/html/2606.02914#bib.bib21)\]\. These findings suggest considerable variation in periodontal knowledge among general\-purpose LLMs\. However, domain adaptation can substantially improve performance, as illustrated by PerioGPT, an instruction\-tuned GPT\-4o built on a RAG framework and a periodontal knowledge base, which achieved 81\.16% accuracy on specialist periodontal questions and AAP In\-Service Examination items while outperforming base chatbot models in accuracy, question complexity, and precision\[[37](https://arxiv.org/html/2606.02914#bib.bib22)\]\. This finding demonstrates how knowledge integration through RAG can substantially improve performance without retraining the underlying model\. #### 4\.1\.4Supernumerary Teeth and Structural Anomalies The use of Large AI for analyzing dental radiographs and intraoral photographs has recently been investigated, with performance varying considerably according to the clinical task and degree of model customization\. In the detection of supernumerary teeth, a customized GPT\-4V model achieved a diagnostic accuracy of 91%, substantially outperforming standard GPT\-4o \(77%\) and GPT\-4V \(63%\) while producing fewer false\-positive findings\[[6](https://arxiv.org/html/2606.02914#bib.bib24)\]\. These findings demonstrate the potential benefits of domain\-specific customization for radiographic interpretation\. In contrast, studies evaluating superior labial frenulum classification from intraoral photographs reported inconsistent performance and limited agreement with expert assessments across all tested models, with overall accuracy remaining low\[[52](https://arxiv.org/html/2606.02914#bib.bib23)\]\. 
Together, these findings suggest that performance remains highly dependent on both the imaging task and the degree of model adaptation\.

#### 4\.1\.5 Dental Trauma

The use of language\-generative models in dental traumatology has focused primarily on clinical reasoning and guideline\-based decision support, with most studies based on IADT guidelines\. Across these studies, performance was generally moderate to high, although substantial differences were observed across evaluation criteria and clinical tasks\. In primary dentition trauma questions, significant differences were reported in accuracy, completeness, readability, and response time\. ChatGPT\-4o achieved the highest accuracy, DeepSeek performed best in completeness and readability, while ChatGPT\-4o and Gemini generated the fastest responses, with no single model consistently outperforming others across all criteria\[[93](https://arxiv.org/html/2606.02914#bib.bib27)\]\. Similarly, a simulation\-based study incorporating clinical findings, radiographs, and pulp vitality data found no significant overall differences in diagnostic accuracy among models\. However, Gemini achieved perfect diagnostic accuracy, ChatGPT\-4o achieved 97% accuracy in antibiotic recommendations, and DeepSeek demonstrated the greatest response variability across trials\[[57](https://arxiv.org/html/2606.02914#bib.bib26)\]\. A larger benchmark of 125 IADT\-based questions reported overall accuracies ranging from 73\.8% to 86\.4%, although performance declined in more complex cases such as luxation injuries, supporting the use of these tools as educational aids rather than diagnostic systems\[[102](https://arxiv.org/html/2606.02914#bib.bib25)\]\.

#### 4\.1\.6 Dental Implantology and Oral and Maxillofacial Surgery

Research on language\-generative models in dental implantology and oral and maxillofacial surgery has primarily evaluated their performance in clinical question answering, case analysis, and radiographic interpretation\.
Studies in implantology generally report strong performance in question answering and case analysis tasks, with advanced models achieving high scores across both applications\[[113](https://arxiv.org/html/2606.02914#bib.bib29),[114](https://arxiv.org/html/2606.02914#bib.bib28)\]\. However, performance varied across clinical tasks\. While diagnostic capabilities consistently exceeded treatment\-planning performance across tested models, substantial variability remained in treatment recommendations despite relatively strong diagnostic performance\[[114](https://arxiv.org/html/2606.02914#bib.bib28)\]\. Image interpretation remains challenging despite recent improvements in multimodal models\. In implant fixture localization, reasoning\-focused multimodal models substantially outperformed GPT\-4o, achieving sensitivities of approximately 66–69% compared with 16\.97% for GPT\-4o while also reducing false\-positive findings\[[74](https://arxiv.org/html/2606.02914#bib.bib30)\]\. Nevertheless, moderate sensitivity, restoration\-driven errors, and considerable run\-to\-run variability remained important limitations\. Similar challenges were observed in the classification of impacted mandibular third molars on panoramic radiographs, where performance varied across classification tasks and radiographic features\. Although individual models demonstrated strengths in specific assessments, none achieved acceptable agreement for Pell and Gregory classification\[[35](https://arxiv.org/html/2606.02914#bib.bib31)\]\. Studies in oral and maxillofacial surgery further highlight the influence of prompting strategies on model performance\. In oral and maxillofacial disease questioning, chain\-of\-thought prompting improved multiple\-choice accuracy by approximately 3\.1% and enhanced the structure and completeness of responses\[[50](https://arxiv.org/html/2606.02914#bib.bib32)\]\. 
Collectively, the evidence indicates that model performance varies substantially across implantology and oral and maxillofacial applications, with strengths in some clinical tasks but persistent limitations in more complex decision\-making and image\-based assessments\. #### 4\.1\.7AI\-Assisted Clinical Decision Support in Dentistry Language\-generative models are emerging as clinical decision\-support tools in dentistry, with applications spanning dental history analysis, medication safety assessment, guideline interpretation, and treatment planning\. Studies evaluating dental history analysis suggest that GPT\-4\-based systems can support clinical decision\-making while substantially reducing the time required for patient assessment compared with conventional dentist\-based evaluation\[[51](https://arxiv.org/html/2606.02914#bib.bib36)\]\. Similar interest has been directed toward medication safety, where LLMs have demonstrated the ability to identify clinically significant drug–drug interactions, although important trade\-offs between sensitivity and specificity remain evident across models\[[101](https://arxiv.org/html/2606.02914#bib.bib33)\]\. For example, ChatGPT\-5 achieved very high sensitivity \(98%\) but lower specificity, whereas DeepSeek\-Chat demonstrated perfect specificity \(100%\) while missing a large proportion of critical alerts\[[101](https://arxiv.org/html/2606.02914#bib.bib33)\]\. Guideline\-based decision support has also been investigated through RAG systems incorporating the 2021 AHA guideline for infective endocarditis prophylaxis\. Across models and prompting strategies, performance varied considerably, with reported accuracies ranging from approximately 42% to 90%, while guideline retrieval integration reduced hallucinations\[[86](https://arxiv.org/html/2606.02914#bib.bib35)\]\. 
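
The sensitivity-specificity trade-off described above for drug–drug interaction alerts can be illustrated with a minimal Python sketch. The labels below are hypothetical toy data, not results from the cited study, and the function name `sensitivity_specificity` is introduced here purely for illustration. The sketch shows how a model that flags almost everything reaches high sensitivity at the cost of specificity, while a conservative model does the reverse.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    1 = clinically significant interaction present / flagged, 0 = absent / not flagged.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true alerts raised
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # true alerts missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly silent
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical toy data: four true interactions, four non-interactions.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
over_alerting = [1, 1, 1, 1, 1, 1, 1, 0]  # flags nearly every pair
conservative = [1, 0, 0, 0, 0, 0, 0, 0]   # flags almost nothing

over_alerting_scores = sensitivity_specificity(truth, over_alerting)
conservative_scores = sensitivity_specificity(truth, conservative)
print(over_alerting_scores, conservative_scores)
```

Reporting both values side by side, rather than a single accuracy figure, makes it visible whether a model's errors are mostly false alarms or mostly missed alerts.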
Similar variability was observed in acute dental pain management, where models differed significantly in scientific accuracy, clarity, and comprehensiveness, with Claude and ChatGPT\-4o achieving the highest overall performance\[[104](https://arxiv.org/html/2606.02914#bib.bib37)\]\. Treatment\-planning studies further demonstrate both the potential and limitations of current models\. In endodontic restoration planning, Gemini achieved the highest performance across repeated evaluations, while DeepSeek demonstrated the lowest\[[94](https://arxiv.org/html/2606.02914#bib.bib34)\]\. Improvements following exposure to example answers were also observed in some models, suggesting that performance may be influenced by prior examples and prompting approaches\[[94](https://arxiv.org/html/2606.02914#bib.bib34)\]\. Nevertheless, none of the evaluated models achieved perfect repeatability, and incomplete or partially accurate responses remained common\[[94](https://arxiv.org/html/2606.02914#bib.bib34)\]\. Together, these findings suggest that LLMs can support a range of clinical decision\-support tasks, although performance remains dependent on the specific application, model, and evaluation context\. Table 1:Summary of studies evaluating Clinical Reasoning and Diagnostic Support\. ### 4\.2Educational Applications Dental education represents the most extensively studied application domain for general generative models in dentistry\. Research has followed two main trajectories: benchmarking model performance on licensing and specialty examinations \(Table[2](https://arxiv.org/html/2606.02914#S4.T2)\), and evaluating LLMs as active learning tools for students and practitioners \(Table[3](https://arxiv.org/html/2606.02914#S4.T3)\)\. #### 4\.2\.1Licensing and Specialty Examinations Language\-generative models have been increasingly investigated in dental licensing examinations across diverse international contexts\. 
Across most examination systems, performance is strongest on text\-based and knowledge\-oriented questions\. GPT\-4o achieved 81\.1% accuracy on the Korean National Dental Licensing Examination\[[96](https://arxiv.org/html/2606.02914#bib.bib40)\], while studies based on the INBDE, ADAT, DAT, and U\.S\. board\-style question banks similarly reported strong performance in knowledge\-based domains\[[26](https://arxiv.org/html/2606.02914#bib.bib41),[76](https://arxiv.org/html/2606.02914#bib.bib46)\]\. However, this pattern is not universal\. The Taiwan National Dental Licensing Examination reported substantially lower performance, with overall accuracy ranging from 44\.63–54\.89%, particularly in clinically oriented sections, highlighting persistent limitations in clinical reasoning\[[66](https://arxiv.org/html/2606.02914#bib.bib42)\]\. The most consistent challenge across the licensing literature is performance on image\-dependent questions\. This contrast is particularly evident in studies based on the Japanese National Dental Examination, where text\-based accuracy reached 79\.9–92\.2% while image\-based performance declined to 45\.6–67\.8%\[[75](https://arxiv.org/html/2606.02914#bib.bib38)\]\. Similar findings were reported in image\-dependent oral pathology evaluations, where diagnostic accuracy remained moderate at 45\.4–61\.4%\[[110](https://arxiv.org/html/2606.02914#bib.bib39)\]\. Together, these studies suggest that examination performance remains strongly influenced by question modality\. Specialty examinations largely reinforce this picture\. 
High performance has been reported in oral pathology, restorative dentistry, orthodontics, and oral radiology, with several studies reporting accuracies exceeding 80% and, in some cases, approaching or exceeding 90%\[[31](https://arxiv.org/html/2606.02914#bib.bib47),[40](https://arxiv.org/html/2606.02914#bib.bib48),[117](https://arxiv.org/html/2606.02914#bib.bib50),[20](https://arxiv.org/html/2606.02914#bib.bib49),[100](https://arxiv.org/html/2606.02914#bib.bib52)\]\. In oral pathology, accuracy reached up to 90%, reflecting the benefits of domain specificity, although prompting strategies produced only modest improvements\[[115](https://arxiv.org/html/2606.02914#bib.bib43)\]\. Nevertheless, several studies noted declining performance in complex procedural questions, clinical reasoning tasks, and visually oriented assessments\[[40](https://arxiv.org/html/2606.02914#bib.bib48),[20](https://arxiv.org/html/2606.02914#bib.bib49),[1](https://arxiv.org/html/2606.02914#bib.bib51)\]\. Dental technician examinations demonstrated more moderate performance and only limited evidence of self\-learning improvement in some models\[[47](https://arxiv.org/html/2606.02914#bib.bib45),[38](https://arxiv.org/html/2606.02914#bib.bib44)\]\. Although some models approached professional standards in selected specialties, direct comparisons with human candidates indicate that LLMs continue to underperform in overall scores and clinical domains, particularly in endodontics and orthodontics\[[95](https://arxiv.org/html/2606.02914#bib.bib53)\]\. #### 4\.2\.2Language\-Generative Models as Dental Education Support Tools Language\-generative models are increasingly being integrated into dental education for learning, self\-assessment, and clinical reasoning\. Across dental education studies, performance is generally moderate to high on exam\-style questions, although outcomes vary across domains, question formats, and levels of complexity\. 
High performance has been reported in dental occlusion and endodontics, where advanced models achieved accuracies exceeding 90% on undergraduate\-level questions, diagnostic case\-based assessments, and dental occlusion evaluations, in some cases outperforming dental students on diagnostic tasks\[[3](https://arxiv.org/html/2606.02914#bib.bib55),[5](https://arxiv.org/html/2606.02914#bib.bib56),[32](https://arxiv.org/html/2606.02914#bib.bib57)\]\. Similar findings have been reported in dental caries education, although performance declined when addressing more complex content\[[9](https://arxiv.org/html/2606.02914#bib.bib54)\]\. Additional studies in practitioner\-level endodontics indicate moderate but consistent reliability, supporting their use for guided learning rather than independent clinical decision\-making\[[69](https://arxiv.org/html/2606.02914#bib.bib58)\]\. Clinical knowledge assessments show greater variability\. Performance has been reported at levels approaching those of clinically experienced students, while studies in traumatic dental injury management and dental traumatology demonstrated moderate to high knowledge accuracy but also identified variability across question formats and occasional inaccuracies in generated explanations\[[58](https://arxiv.org/html/2606.02914#bib.bib59),[90](https://arxiv.org/html/2606.02914#bib.bib60),[59](https://arxiv.org/html/2606.02914#bib.bib61)\]\. Beyond conventional question\-answering tasks, LLMs have shown potential in simulation\-based education\. AI\-generated clinical case simulations in temporomandibular disorder training achieved diagnostic outcomes comparable to real patient interactions while offering greater standardization, information density, and transparency of reasoning\[[87](https://arxiv.org/html/2606.02914#bib.bib62)\]\. 
LLMs have also demonstrated utility in generating reflective assignments and supporting qualitative analysis, producing outputs that were largely comparable to human\-generated work\[[13](https://arxiv.org/html/2606.02914#bib.bib63)\]\. Across dental specialties, performance remains variable\. Studies in paediatric dentistry, orthodontics, periodontology, prosthodontics, restorative dentistry, and endodontics report generally accurate and relevant responses or high accuracy, although limitations related to reliability, consistency across repeated testing, and understanding of closely related concepts remain evident\[[28](https://arxiv.org/html/2606.02914#bib.bib64),[41](https://arxiv.org/html/2606.02914#bib.bib65),[56](https://arxiv.org/html/2606.02914#bib.bib66),[60](https://arxiv.org/html/2606.02914#bib.bib68)\]\. Several studies also demonstrate that optimization strategies such as retrieval\-augmented generation, in\-context learning, and majority voting can improve performance and reduce knowledge\-based errors\[[39](https://arxiv.org/html/2606.02914#bib.bib67)\]\. Taken together, these studies indicate that LLMs provide accessible, high\-quality informational content for students and professionals, although challenges related to readability and reliability remain\[[79](https://arxiv.org/html/2606.02914#bib.bib69)\]\. Table 2:Summary of studies evaluating LLMs in Licensing and Specialty Examinations\.Table 3:Summary of studies evaluating LLMs as Dental Education Support Tools\. ### 4\.3Patient Communication and Report Generation The third major use of language\-generative models in dentistry is patient communication, spanning two areas: patient question answering and radiology report generation, with findings summarized in Tables[4](https://arxiv.org/html/2606.02914#S4.T4)and[5](https://arxiv.org/html/2606.02914#S4.T5)\. This includes answering questions, creating consent forms, simplifying reports, and helping patients make decisions\. 
It is a sensitive area because clear, accurate, and empathetic information affects safety and how well patients follow treatment\. #### 4\.3\.1Patient Question Answering Patients increasingly turn to online resources for dental information and self\-management, and LLMs have emerged as a natural candidate for this role, offering immediate, accessible responses with demonstrated advantages over traditional search engines in quality, readability, empathy, and overall user satisfaction\[[84](https://arxiv.org/html/2606.02914#bib.bib70)\]\. Across dental specialties, LLMs generally provide accurate, informative, and comprehensive responses to patient questions, with newer models such as ChatGPT\-4 and Gemini consistently outperforming earlier systems\[[91](https://arxiv.org/html/2606.02914#bib.bib71),[33](https://arxiv.org/html/2606.02914#bib.bib73),[34](https://arxiv.org/html/2606.02914#bib.bib74)\]\. Performance is strongest for common patient\-oriented queries, whereas more complex or expert\-level questions remain more challenging\[[121](https://arxiv.org/html/2606.02914#bib.bib75),[92](https://arxiv.org/html/2606.02914#bib.bib79)\]\. Although response quality is often rated highly, readability remains inconsistent across models and applications\[[116](https://arxiv.org/html/2606.02914#bib.bib72),[33](https://arxiv.org/html/2606.02914#bib.bib73),[55](https://arxiv.org/html/2606.02914#bib.bib76)\]\. Beyond information provision, LLMs show promise for patient communication and follow\-up care\. Studies report high ratings for the accuracy, appropriateness, and empathy of responses in postoperative settings, while retrieval\-enhanced systems further improve response clarity and accuracy\[[18](https://arxiv.org/html/2606.02914#bib.bib77),[12](https://arxiv.org/html/2606.02914#bib.bib78)\]\. 
However, evaluations reveal a persistent gap between professional and lay perceptions, with clinicians generally favoring expert\-generated responses while patients and parents sometimes rate chatbot responses as comparable or superior\[[21](https://arxiv.org/html/2606.02914#bib.bib80)\]\. Despite these encouraging findings, performance remains variable across platforms, and responses may contain errors, biases, inconsistencies, or fabricated information\. Consequently, current evidence supports the use of LLMs as supplementary patient communication tools rather than independent sources of dental advice\[[23](https://arxiv.org/html/2606.02914#bib.bib82),[81](https://arxiv.org/html/2606.02914#bib.bib81)\]\. #### 4\.3\.2Radiology Report Generation Language\-generative models have been increasingly explored for automated dental radiology report generation, driven by the recognition that manual report writing is time\-consuming and prone to variability\[[97](https://arxiv.org/html/2606.02914#bib.bib83)\]\. Early studies demonstrated that LLMs such as ChatGPT can generate reports with high textual similarity to reference reports and good readability, although critical diagnostic details may be omitted despite largely error\-free language\[[97](https://arxiv.org/html/2606.02914#bib.bib83)\]\. More recent studies have explored combining vision\-based models with LLMs, reporting good overall performance, improved report quality, and reductions in hallucinations through the separation of image interpretation and language generation tasks\[[25](https://arxiv.org/html/2606.02914#bib.bib85),[10](https://arxiv.org/html/2606.02914#bib.bib86)\]\. Beyond report generation, LLMs have also been used to simplify radiology reports for patients\. These simplified versions improve readability, understanding, and engagement, facilitating patient participation in clinical discussions and decision\-making\[[98](https://arxiv.org/html/2606.02914#bib.bib84)\]\. 
More broadly, studies evaluating patient communication suggest that LLM\-generated responses are often more structured, clear, and empathetic, and may be preferred over traditional responses or conventional information sources\[[72](https://arxiv.org/html/2606.02914#bib.bib87),[84](https://arxiv.org/html/2606.02914#bib.bib70)\]\. However, readability remains inconsistent, and some responses may lack clear actionable guidance for patients\[[2](https://arxiv.org/html/2606.02914#bib.bib88)\]\. Figure [4](https://arxiv.org/html/2606.02914#S4.F4) summarizes the main points across clinical, educational, and communication aspects of language\-generative models in dentistry\.

Figure 4: Summary of the main themes across clinical, educational, and patient communication domains\.

Table 4: Summary of studies evaluating LLMs in Patient Question Answering\.

Table 5: Summary of studies evaluating LLMs in Report Generation\.

## 5 Discriminative Vision Foundation Models

Where the models reviewed in Section [4](https://arxiv.org/html/2606.02914#S4) are designed to produce language, the models examined here learn structured visual representations: embeddings, segmentation masks, or bounding boxes\. In dentistry, three architectures have been most prominently explored: SAM for prompt\-based segmentation\[[54](https://arxiv.org/html/2606.02914#bib.bib89)\], CLIP for contrastive image\-text alignment\[[83](https://arxiv.org/html/2606.02914#bib.bib98)\], and GroundingDINO for open\-vocabulary object detection\[[68](https://arxiv.org/html/2606.02914#bib.bib102)\]\. These models share a common challenge: none were trained on dental data, and all require meaningful adaptation before they perform adequately on clinical dental imaging tasks, as reflected in the studies summarized in Table [6](https://arxiv.org/html/2606.02914#S5.T6)\.
### 5\.1Tooth and Anatomical Segmentation SAM is the most extensively adapted foundation model in dental imaging, with six distinct adaptation strategies represented in the reviewed literature\. These strategies address different failure modes of zero\-shot SAM application to dental images, and their diversity reflects both the richness of the domain challenge and the absence of a consensus approach\. Zero\-shot SAM performance on dental images is weak\. Dental radiographs present low contrast between tooth structures and surrounding bone, complex overlapping anatomies in panoramic views, fine boundary details at enamel\-pulp interfaces, and three\-dimensional structures compressed into two\-dimensional projections\[[120](https://arxiv.org/html/2606.02914#bib.bib93),[109](https://arxiv.org/html/2606.02914#bib.bib90),[65](https://arxiv.org/html/2606.02914#bib.bib94)\]\. These characteristics differ substantially from the natural image domain on which SAM was trained\. Each of the six reviewed adaptation papers acknowledges this domain gap and proposes a different engineering response to it\. PPA\-SAM addresses the gap for CBCT\-based tooth segmentation by integrating SAM with a 3D VNet and adversarial learning in a dual\-encoder architecture\[[65](https://arxiv.org/html/2606.02914#bib.bib94)\]\. The adversarial component is designed to improve robustness and generalization across CBCT imaging conditions\. Tooth\-ASAM takes a broader modality\-agnostic approach, introducing adapter modules into SAM’s encoder and mask decoder to support simultaneous adaptation across CBCT, panoramic radiographs, and natural tooth images\[[109](https://arxiv.org/html/2606.02914#bib.bib90)\]\. Experimental results show improved Dice, IoU, HD95, and ASSD compared to baseline SAM across all three modalities, though specific improvement margins are not uniformly reported across modalities, which limits precise cross\-modality comparison\. 
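
Because Dice and IoU recur throughout the segmentation results in this section, a minimal sketch of how the two overlap metrics are computed on binary masks may help. The masks below are hypothetical toy arrays, the function name `dice_and_iou` is introduced here for illustration, and the reviewed papers may use different implementations or averaging schemes.

```python
def dice_and_iou(pred, target):
    """Compute Dice and IoU for two binary masks given as nested lists of 0/1."""
    pred_flat = [bool(v) for row in pred for v in row]
    target_flat = [bool(v) for row in target for v in row]
    if len(pred_flat) != len(target_flat):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(1 for p, t in zip(pred_flat, target_flat) if p and t)
    pred_area = sum(pred_flat)
    target_area = sum(target_flat)
    union = pred_area + target_area - intersection

    # Convention assumed here: two empty masks count as a perfect match.
    dice = 2 * intersection / (pred_area + target_area) if (pred_area + target_area) else 1.0
    iou = intersection / union if union else 1.0
    return dice, iou


# Hypothetical 2x3 masks: predicted tooth region vs. reference annotation.
predicted = [[1, 1, 0],
             [0, 1, 0]]
reference = [[1, 0, 0],
             [0, 1, 1]]

dice, iou = dice_and_iou(predicted, reference)
print(f"Dice={dice:.3f}, IoU={iou:.3f}")
```

For a single pair of masks the two metrics are monotonically related, Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU for the same prediction. This is worth keeping in mind when comparing papers that report only one of them.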
EASAM tackles boundary imprecision through a dual\-branch architecture that combines SAM’s high\-level encoder features with edge information from a parallel CNN\-based branch\[[120](https://arxiv.org/html/2606.02914#bib.bib93)\]\. This is a technically distinct approach: rather than modifying SAM’s parameters, it augments the feature pipeline with domain\-specific edge signals\. The resulting system shows improved segmentation accuracy on panoramic dental X\-rays, though generalization across modalities was not evaluated\. ToothSC\-SAM addresses annotation cost rather than accuracy, introducing a two\-stage framework where region\-of\-interest extraction precedes prompt\-based 3D SAM segmentation with skip connections\[[63](https://arxiv.org/html/2606.02914#bib.bib96)\]\. It achieves approximately 93% of fully supervised performance while reducing annotation time from several hours to minutes, a practically important result for clinical deployment where expert annotation is expensive and scarce\. 3DTeethSAM extends SAM to three\-dimensional dental mesh data through multi\-view rendering and reconstruction, achieving mean IoU of 91\.90% on the 3DTeethSeg benchmark\[[70](https://arxiv.org/html/2606.02914#bib.bib95)\]\. The approach projects 3D mesh data into 2D views for SAM processing, then reconstructs the segmentation in 3D, producing strong benchmark results but introducing multi\-stage processing complexity and reliance on accurate 3D\-to\-2D projection\. The detection\-guided pipeline of Atchibay et al\. places SAM within a three\-stage pipeline where YOLOv11 first localises teeth, SAM generates initial masks, and U\-Net refines boundaries\[[7](https://arxiv.org/html/2606.02914#bib.bib97)\]\. The Dice score improves from 0\.672 for raw SAM to 0\.903 after the full pipeline, with YOLOv11 achieving mAP@0\.5 of 0\.993 in the detection stage\. 
This pipeline achieves the strongest reported segmentation performance among the reviewed SAM adaptations but introduces the highest system complexity and multiple failure propagation points\. These six adaptation strategies are compared in Figure [5](https://arxiv.org/html/2606.02914#S5.F5)\.

Figure 5: SAM adaptation strategies and adapted models for dental image segmentation, illustrating the range of architectural approaches used to address domain shift from natural image pretraining to clinical dental imaging\.

### 5\.2 Detection and Diagnosis

Beyond segmentation, discriminative vision foundation models have been applied to two further task types: open\-vocabulary abnormality detection and contrastive diagnosis\. Du et al\. applied GroundingDINO for dental abnormality detection using text prompts of abnormality class names\[[30](https://arxiv.org/html/2606.02914#bib.bib99)\]\. The system enhances detection through FDI\-based tooth notation and a multi\-level strategy combining global image\-level and tooth\-level local detection, achieving 37\.0% mAP and 66\.3% AP50, both meaningful improvements over the baseline\. Kim et al\. applied a CLIP\-based contrastive approach to TMJOA diagnosis on panoramic radiographs, aligning images with diagnostic text labels \(e\.g\., “Right TMJOA”, “Left normal”\) through shared embedding similarity\[[53](https://arxiv.org/html/2606.02914#bib.bib101)\]\. The system incorporates semi\-supervised pseudo\-labeling to leverage unlabeled data, achieving accuracy and F1\-score of $0.929 \pm 0.012$ on a binary diagnostic task\. The CLIP\-based approach produces diagnostic labels rather than spatial masks, making it appropriate for image classification but unsuitable for tasks requiring anatomical localisation\.
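
The idea of assigning a diagnostic label through shared embedding similarity can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the method of Kim et al.: the three-dimensional vectors are hypothetical stand-ins for the outputs of image and text encoders, the label set is an illustrative pair, and the helper names `cosine` and `classify_by_similarity` are invented for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_by_similarity(image_embedding, label_embeddings):
    """Pick the text label whose embedding is most similar to the image embedding."""
    scores = {label: cosine(image_embedding, emb) for label, emb in label_embeddings.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical embeddings; a real system would obtain these from trained encoders.
label_embeddings = {
    "Right TMJOA": [1.0, 0.0, 0.2],
    "Right normal": [0.0, 1.0, 0.2],
}
image_embedding = [0.9, 0.1, 0.3]

predicted_label, scores = classify_by_similarity(image_embedding, label_embeddings)
print(predicted_label)
```

The output is a label chosen from a fixed prompt set, with no spatial information attached, which is why this style of model suits image-level classification rather than localisation.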
Despite meaningful progress in segmentation and detection, discriminative vision foundation models consistently require task\-specific adaptation to perform adequately on dental data, and sensitivity to prompt quality or upstream inputs limits their robustness in fully automated clinical workflows\. Table 6:Summary of studies evaluating Discriminative Vision Foundation Models in dentistry\. ## 6Dental\-Specific Foundation Models The models reviewed in Sections[4](https://arxiv.org/html/2606.02914#S4)and[5](https://arxiv.org/html/2606.02914#S5)are general\-purpose systems adapted to dental tasks\. This section examines models built with dental knowledge at their core, either through large\-scale domain\-specific pretraining or through extensive supervised training on curated dental datasets\. As established in Section[3](https://arxiv.org/html/2606.02914#S3), these two approaches differ fundamentally in how dental knowledge is acquired and what trade\-offs they carry, as reflected in the studies summarized in Table[7](https://arxiv.org/html/2606.02914#S6.T7)\. ### 6\.1Pretrained\-from\-Scratch: DentVFM DentVFM is the only model in the reviewed literature that constitutes a true dental foundation model in the architectural sense: trained from scratch on a large curated dental imaging dataset without relying on a general\-purpose checkpoint\[[48](https://arxiv.org/html/2606.02914#bib.bib107)\]\. The model uses a Vision Transformer backbone in both 2D and 3D variants, trained via self\-supervised learning on DentVista, a dataset of approximately 1\.6 million multimodal dental radiographic images spanning panoramic radiographs, intraoral X\-rays, anteroposterior and lateral X\-rays, MRI, CT, and CBCT from multiple medical centers\. The self\-supervised objective allows the model to learn dental\-specific visual representations without requiring manual annotation at scale\. The breadth of DentVFM’s evaluation is one of its distinguishing features\. 
It is assessed through DentBench\. Results show DentVFM significantly outperforms supervised, self\-supervised, and weakly supervised baselines, achieves competitive performance with as little as 25% labeled data, and in some tasks, such as cyst diagnosis and TMJ abnormality detection, reaches performance comparable to or exceeding that of experienced dentists\. Cross\-modality diagnostic capability is also demonstrated: the model performs reliably even when certain imaging modalities are unavailable, which has practical value in resource\-limited settings\. These results position DentVFM as a label\-efficient, adaptable, and scalable foundation for advancing intelligent dental healthcare, with particular promise for bridging access gaps in resource\-limited settings\.

### 6\.2 Heavily Fine\-Tuned Dental Models

The majority of dental\-specific foundation models in the reviewed literature are heavily fine\-tuned systems: general\-purpose VLMs adapted through large\-scale instruction tuning, reinforcement learning, or multi\-stage training on curated dental corpora\.

CephGPT\-4 is the earliest and most narrowly scoped model in this group, focused on a single well\-defined task: cephalometric analysis for orthodontic diagnosis\[[71](https://arxiv.org/html/2606.02914#bib.bib108)\]\. It fine\-tunes MiniGPT\-4 and VisualGLM on a dataset combining cephalometric images with doctor\-patient dialogue, using U\-Net for automated landmark detection and then aligning the outputs with diagnostic reports\. The approach is clinically sensible: cephalometric analysis is time\-consuming, prone to inter\-operator variability, and well\-suited to automation\. However, the evaluation is not benchmarked against existing cephalometric AI systems or against expert clinicians, which limits assessment of its clinical contribution\.

DentalGPT addresses a broader and harder problem: interpreting subtle visual patterns in dental images for disease classification and visual question answering\[[19](https://arxiv.org/html/2606.02914#bib.bib109)\]\. It is built on a 7B\-parameter backbone trained on more than 120,000 dental images with detailed textual descriptions highlighting diagnostically relevant features, followed by a reinforcement learning stage to strengthen complex reasoning\. The use of reinforcement learning for dental reasoning is a methodological novelty: the training signal pushes the model beyond recognition toward step\-by\-step diagnostic inference\. Results show competitive performance on dental VQA benchmarks despite the compact architecture, suggesting that data quality and reasoning training are more influential than model scale for this task\.

DentVLM represents the most ambitious clinical evaluation among the heavily fine\-tuned models\[[73](https://arxiv.org/html/2606.02914#bib.bib104)\]\. Trained on 110,447 images and 2\.46 million visual question\-answer pairs covering 36 diagnostic tasks across seven imaging modalities, it follows a two\-stage pipeline: visual\-text alignment followed by diagnostic instruction tuning with anatomical localisation output\. The model’s clinical evaluation is exceptional by dental AI standards: 25 dentists, 1,946 patients, and 3,105 QA pairs, with DentVLM outperforming junior dentists on 21 of 36 tasks and senior dentists on 12 tasks while reducing diagnostic time by 15–22%\. It also achieved 19\.6% higher accuracy for oral diseases and 27\.9% higher for malocclusions compared to leading proprietary and open\-source models\. This is one of the strongest clinical validation studies in the dental AI literature and provides evidence that heavily fine\-tuned models with sufficient data can reach clinically relevant performance thresholds on well\-defined tasks\.
OralGPT targets a specific and clinically important challenge: oral mucosal disease diagnosis and description, where data scarcity and lesion heterogeneity are severe\[[119](https://arxiv.org/html/2606.02914#bib.bib106)\]\. It adopts a two\-stage framework: first learning disease\-related visual features from classification labels, then training on expert\-authored captions to enable clinically meaningful natural language descriptions\. To address data scarcity, it introduces a similarity\-guided pseudo\-captioning mechanism that transfers descriptive knowledge from well\-annotated images to weakly labeled ones, effectively augmenting supervision without additional expert annotation\. The model achieves competitive diagnostic performance on four common oral conditions and generates clinically meaningful natural language descriptions\. This approach is novel and practically important: it provides a pathway for building specialist dental AI systems in data\-sparse clinical areas where large annotated datasets cannot be assembled\.

OralGPT\-Omni is the most comprehensive model in scope, integrating approximately 3\.21 million text tokens, 59,658 images, and 90 videos across tasks spanning eight dental imaging modalities\[[42](https://arxiv.org/html/2606.02914#bib.bib105)\]\. Its key methodological contribution is TRACE\-CoT, a chain\-of\-thought dataset capturing dentists’ step\-by\-step diagnostic reasoning including image inspection, hypothesis generation, knowledge reference, and verification\. The model is trained through a multi\-stage pipeline that progressively enhances image understanding, domain knowledge integration, and transparent reasoning, allowing it not only to generate answers but also to explain how those answers were derived\. On the MMOral\-Uni and MMOral\-OPG benchmarks, OralGPT\-Omni achieves scores of 51\.84 and 45\.31 respectively, substantially outperforming GPT\-5 and other proprietary models\.
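The similarity\-guided pseudo\-captioning idea described for OralGPT earlier in this subsection can be illustrated with a small nearest\-neighbour sketch\. This is a minimal sketch under stated assumptions, not the authors’ implementation: it assumes precomputed image embeddings, cosine similarity as the similarity measure, a restriction to donors with the same classification label, and a hypothetical threshold `min_sim`\. All names and values are illustrative\.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Sample:
    embedding: List[float]         # visual feature vector from some image encoder
    label: str                     # classification label (available for every image)
    caption: Optional[str] = None  # expert caption, present only for well-annotated images

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pseudo_caption(weak, captioned, min_sim=0.8):
    """Borrow the caption of the most similar captioned image sharing the same label.

    Returns None when no same-label donor exists or the best match is below `min_sim`.
    """
    candidates = [c for c in captioned if c.caption and c.label == weak.label]
    if not candidates:
        return None
    best = max(candidates, key=lambda c: cosine(weak.embedding, c.embedding))
    if cosine(weak.embedding, best.embedding) < min_sim:
        return None
    return best.caption

if __name__ == "__main__":
    # Toy data: labels and captions are placeholders, not clinical content.
    captioned = [
        Sample([1.0, 0.0], "class_1", "caption A"),
        Sample([0.0, 1.0], "class_1", "caption B"),
        Sample([1.0, 0.1], "class_2", "caption C"),
    ]
    weak = Sample([0.9, 0.1], "class_1")
    print(pseudo_caption(weak, captioned))
```

The threshold keeps dissimilar images from inheriting captions that may not describe them; where such a cut\-off should sit, and whether the original system uses one at all, is not stated in the reviewed description\.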
OralGPT\-Plus takes a different approach to clinical alignment, introducing an agentic architecture that mirrors the iterative, symmetry\-aware diagnostic workflow of dentists examining panoramic radiographs\[[36](https://arxiv.org/html/2606.02914#bib.bib110)\]\. The model follows an inspect\-zoom\-compare loop, using reinforcement learning to encourage clinically meaningful re\-examination steps\. It is trained on DentalProbe, a dataset of 5,000 expert\-curated diagnostic trajectories, and evaluated on MMOral\-X, a new benchmark of 300 open\-ended panoramic QA pairs\. The agentic design is conceptually important: it acknowledges that single\-pass VLM inference is structurally misaligned with how clinicians actually read panoramic radiographs, and it operationalises a more realistic diagnostic process\.

### 6\.3 LLM\-Driven Tools and Systems

Beyond standalone models, several systems integrate language\-generative or vision\-language models as components within structured clinical pipelines\.

ArchMap is a training\-free framework for structured understanding of 3D intraoral scans\[[118](https://arxiv.org/html/2606.02914#bib.bib111)\]\. It combines a geometry\-aware arch\-flattening module with a dental knowledge base encoding hierarchical tooth ontology and a schema\-constrained vision\-language inference pipeline, converting raw 3D mesh data into deterministic, clinically structured outputs\. Validated on 1,060 pre\- and post\-orthodontic cases, it achieves strong performance in tooth counting, anatomical partitioning, and dentition\-stage classification\. The training\-free design is a practical advantage: it requires no labeled dental data and can be updated by modifying the knowledge base without retraining\.

GumAgent uses a VLM as an input validation gate within a gum disease detection pipeline\[[24](https://arxiv.org/html/2606.02914#bib.bib112)\]\.
The VLM classifies input images into intraoral, denture, or non\-relevant categories before downstream segmentation, ensuring that the segmentation model only receives appropriate inputs\. This design addresses a practical deployment problem: real\-world patient\-facing systems receive heterogeneous image inputs that specialized models cannot handle\.

ClinicGPT is a proof\-of\-concept retrieval system for dental clinic administration, integrating a fine\-tuned LLM with an institutional knowledge base\[[122](https://arxiv.org/html/2606.02914#bib.bib113)\]\. Developed at the Schulich School of Medicine and Dentistry, it responds to administrative and protocol\-related queries using embeddings and a vector database\. While it does not address clinical diagnosis, it represents an important non\-diagnostic use case: using LLMs to reduce administrative burden in dental settings, where 15% of appointment time is spent on administrative tasks and 21% on waiting for instructor assistance\[[122](https://arxiv.org/html/2606.02914#bib.bib113)\]\.

### 6\.4 Datasets and Benchmarks

The development of dental\-specific foundation models has revealed a critical infrastructure gap: the absence of large\-scale, well\-curated datasets and standardized evaluation benchmarks for dental AI\. MMOral addresses this gap for panoramic X\-ray analysis with 20,563 annotated images paired with approximately 1\.3 million instruction\-following instances across attribute extraction, report generation, visual question answering, and image\-grounded dialogue\[[43](https://arxiv.org/html/2606.02914#bib.bib114)\]\. The associated MMOral\-Bench evaluates models across five clinical dimensions: teeth condition, pathological findings, historical treatments, jawbone observations, and clinical recommendations\.
Evaluation of 64 LVLMs on this benchmark reveals that GPT\-4o achieves only 41\.45% accuracy, confirming that strong general\-purpose performance does not transfer to dental\-specific tasks without domain adaptation\. A single epoch of supervised fine\-tuning on MMOral data yields a 24\.73% improvement for OralGPT, demonstrating the value of domain\-specific instruction data\. DentBench\[[48](https://arxiv.org/html/2606.02914#bib.bib107)\] provides complementary coverage for the full spectrum of dental radiology, spanning eight subspecialties, more than 40 diseases, and seven radiographic modalities from 15 global regions, supporting evaluation of DentVFM across diverse downstream tasks\.

Table 7: Summary of studies evaluating Dental Vision and Multimodal Foundation Models\.

## 7 Discussion, Limitations, and Future Directions

Figure 6: Transition from narrow to foundation models and key model categories with their complementary roles in dental healthcare\.

### 7\.1 Comparison with Existing Reviews

The transition from narrow task\-specific systems toward integrated foundation model ecosystems, and the complementary roles of each model category, are illustrated in Figure[6](https://arxiv.org/html/2606.02914#S7.F6)\. This review occupies a distinct position among secondary literature on AI in dentistry: prior reviews have addressed specific subsets of the topic, but none has examined language\-generative models, discriminative vision foundation models, and dental\-specific foundation models within a single unified framework\.

Within the LLM subset, Umer et al\.\[[106](https://arxiv.org/html/2606.02914#bib.bib1)\] \(BDJ Open, 2024\) identified 17 studies, with ChatGPT dominant, Likert scales the most frequently used evaluation metric, and advanced prompting in only two studies\.
The present review’s 97 studies through early 2026 reflect a phase transition rather than a difference in search strategy: model diversity has expanded to include Gemini, Claude, DeepSeek, and domain\-specific models, and advanced prompting and RAG are now common rather than exceptional\. In the educational domain specifically, Aura\-Tormos et al\.\[[8](https://arxiv.org/html/2606.02914#bib.bib3)\] conducted a systematic review of 60 studies and found that GPT\-4 outperformed GPT\-3\.5, that curricular integration remained informal, and that misinformation and overreliance concerns were frequently reported\. This review confirms those findings and extends them with two additions: a documented text\-image performance divide across multiple examination systems, and temporal instability of model outputs as a systematic gap\. For oral and maxillofacial surgery, Ronsivalle et al\.\[[88](https://arxiv.org/html/2606.02914#bib.bib4)\] likewise concluded that performance is constrained in complex and emotionally sensitive scenarios, based on only four eligible studies; the present review grounds this conclusion in a substantially larger evidence base\.

In the broader healthcare context, Busch et al\.\[[17](https://arxiv.org/html/2606.02914#bib.bib5)\] reviewed 89 studies across 29 specialties, identifying design limitations \(lack of domain optimization, data transparency\) and output limitations \(non\-reproducibility, incorrectness\) as the two primary domains\. The dental findings here map directly onto this framework: absent dental domain optimization corresponds to the image\-dependent performance gap, and output limitations including hallucination and inconsistency are documented extensively across the clinical reasoning and patient communication sections\. What the dental literature adds is that the text\-image divide is sharper and more consequential in dentistry than in most specialties, given the centrality of radiographic interpretation to dental diagnosis\.
For discriminative vision models, no comparable dental review exists\. Ryu et al\.\[[89](https://arxiv.org/html/2606.02914#bib.bib7)\] reviewed medical vision\-language models from 2022 to 2024, confirming promise alongside generalization and integration challenges, both of which are replicated here\. This review adds that no shared evaluation benchmark currently exists for discriminative vision model comparison in dentistry\.

### 7\.2 The Complementarity Pattern and Its Clinical Implications

The most practically significant finding of this review is that language\-generative models, discriminative vision models, and dental\-specific foundation models are complementary rather than competing systems\. Each category covers task types where the others are weakest\. Language\-generative models excel at reasoning, education, and dialogue, but fail at image\-dependent diagnosis\. Discriminative vision models achieve strong spatial precision in segmentation and detection but produce no language output\. Dental\-specific foundation models achieve the strongest performance on complex multimodal tasks but require substantial training data and compute, which limits their availability\. As illustrated in Figure[7](https://arxiv.org/html/2606.02914#S7.F7), this complementarity is not incidental: the three model categories occupy distinct performance profiles across task types, with no single category dominating all columns\.

Figure 7: Performance heatmap of large AI models across dental task categories\. Ratings \(Moderate, High, Very High\) reflect a qualitative synthesis by the authors based on the range of quantitative performance metrics reported across included studies for each model–task combination; they are not derived from a single threshold but represent the authors’ assessment of the overall evidence\. Empty cells indicate the task was not evaluated or reported for that model\.

This complementarity is reflected in the integrated pipeline literature\.
Studies of vision models combined with language decoders\[[25](https://arxiv.org/html/2606.02914#bib.bib85),[10](https://arxiv.org/html/2606.02914#bib.bib86)\], RAG\-augmented LLMs\[[37](https://arxiv.org/html/2606.02914#bib.bib22),[86](https://arxiv.org/html/2606.02914#bib.bib35)\], and agentic vision\-language systems\[[36](https://arxiv.org/html/2606.02914#bib.bib110),[118](https://arxiv.org/html/2606.02914#bib.bib111)\] consistently show that combinations outperform single\-model approaches\. This is not a peripheral finding: it is a structural property of the current state of the field\. No single architecture covers the full range of tasks that clinical dental practice involves\. The field’s current focus on single\-model evaluation is therefore misaligned with how effective dental AI systems will actually be built\. Investment in integration infrastructure, standardized pipeline interfaces, and multi\-model evaluation protocols would accelerate practical progress more than continued optimization of individual models on narrow benchmarks\.

### 7\.3 Where the Field Is Heading

Several directional trends are visible in the reviewed literature that point toward where the field is most likely to develop over the next three to five years\.

Agentic reasoning architectures\. OralGPT\-Plus’s iterative inspect\-zoom\-compare framework\[[36](https://arxiv.org/html/2606.02914#bib.bib110)\] represents a conceptual shift from single\-pass inference toward diagnostic workflows that mimic clinical reasoning\. This direction is likely to be extended as reinforcement learning methods become more capable of supervising multi\-step clinical reasoning chains\.
The DentalGPT reinforcement learning stage\[[19](https://arxiv.org/html/2606.02914#bib.bib109)\] and OralGPT\-Omni’s TRACE\-CoT dataset\[[42](https://arxiv.org/html/2606.02914#bib.bib105)\] indicate that reasoning supervision, rather than additional training data, may be the decisive factor in moving from information retrieval to genuine diagnostic support\.

Annotation\-efficient pretraining\. DentVFM’s 25% label efficiency result\[[48](https://arxiv.org/html/2606.02914#bib.bib107)\] and ToothSC\-SAM’s annotation time reduction from hours to minutes\[[63](https://arxiv.org/html/2606.02914#bib.bib96)\] point toward a future in which dental AI systems can be built and adapted with substantially less expert annotation than current approaches require\. Self\-supervised and semi\-supervised methods that leverage the structure of dental imaging data without manual labels are likely to be the dominant pretraining strategy for dental\-specific models within this period\.

Knowledge\-guided deterministic pipelines\. ArchMap’s training\-free, ontology\-guided architecture\[[118](https://arxiv.org/html/2606.02914#bib.bib111)\] and the infective endocarditis RAG system\[[86](https://arxiv.org/html/2606.02914#bib.bib35)\] both demonstrate that clinical knowledge structured in a knowledge base can guide and constrain AI outputs without requiring the knowledge to be learned from data\. This approach is particularly valuable for regulatory and deployment contexts, where the provenance and accuracy of model knowledge must be auditable\. Knowledge\-guided systems offer a pathway to deployable dental AI tools that is distinct from, and potentially faster than, large\-scale data\-driven pretraining\.

Multimodal clinical data integration\. Current dental foundation models primarily rely on imaging data, imaging\-text pairs, or limited clinical metadata\.
Although recent systems incorporate selected clinical text and treatment\-planning information, none of the reviewed studies demonstrates comprehensive integration and joint reasoning over dental images, structured electronic health records, longitudinal patient history, and clinical notes\. Consequently, a substantial gap remains between current dental AI systems and the multimodal clinical decision\-support tools required for routine clinical deployment\.

### 7\.4 Global Scope Gaps

The reviewed literature has pronounced geographic, demographic, and clinical coverage gaps that limit the generalizability of reported findings\. Geographically, studies are concentrated in East Asia \(particularly China, Japan, South Korea, and Taiwan\), Turkey, and the United States\. This concentration is not neutral: dental imaging hardware, clinical protocols, patient demographics, and disease prevalence patterns differ substantially across these regions and the rest of the world\. Model performance reported in these settings may not transfer to sub\-Saharan Africa, South Asia, Latin America, or rural settings in any region, where the WHO oral disease burden is highest\. No reviewed study evaluated model performance in low\-income country clinical settings, despite the WHO data cited in the introduction indicating that the global oral health burden falls disproportionately on middle\- and low\-income populations\.

Clinically, several condition categories are sparsely represented or absent\. Rare oral diseases, head and neck oncology, cleft lip and palate, and oral manifestations of systemic diseases are not represented in any reviewed study\. Paediatric dentistry is covered in the educational literature but largely absent from clinical reasoning and imaging studies\. Geriatric oral health, which represents a growing burden globally, is not addressed\.
By imaging modality, panoramic radiography dominates the reviewed literature, followed by periapical radiographs, CBCT, and intraoral photography\. Cone beam CT for maxillofacial applications, cephalometric radiographs beyond CephGPT\-4, and digital impressions beyond ChatIOS are underrepresented\. The integration of multiple imaging modalities within a single diagnostic workflow, which is routine in complex clinical cases, has not been evaluated in any reviewed study\.

### 7\.5 Regulatory and Ethical Considerations

The reviewed literature pays limited attention to the regulatory and ethical frameworks that govern clinical AI deployment, yet these frameworks will determine whether and how the systems reviewed can reach clinical practice\. In the United States, dental AI diagnostic tools would likely fall under the FDA’s Software as a Medical Device \(SaMD\) framework, requiring pre\-market notification or approval depending on intended use and risk classification\. No reviewed study discusses regulatory pathways, risk classification, or post\-market surveillance design\. In the European Union, dental AI systems would be classified under the EU AI Act \(2024\), which establishes high\-risk AI system requirements for medical diagnosis, including transparency obligations, human oversight requirements, and conformity assessment\. The AI Act’s prohibition on AI systems that pose unacceptable risks and its requirements for explainability and bias documentation apply directly to the clinical decision\-support applications reviewed in Sections 4 through 6\.

Hallucination, the generation of plausible but factually incorrect content, is acknowledged in the reviewed literature but never systematically quantified\. For patient communication and clinical decision support applications, hallucination rates, types, and severity must be characterized before deployment can be considered responsible\.
Bias in training data, which affects whose dental conditions are represented and whose imaging hardware characteristics are captured in training sets, is also unaddressed across the reviewed literature\. Models trained predominantly on East Asian patient populations may perform differently on patients with different dental anatomy, disease prevalence, or imaging characteristics\.

### 7\.6 Limitations of This Review

Several limitations of this review should be acknowledged in line with PRISMA\-ScR reporting requirements\. The search window of 2020 to early 2026 was chosen to capture the foundation model era, but it excludes earlier work on deep learning in dentistry that provides important context for the performance gains reported here\. The inclusion of arXiv preprints, which account for a meaningful portion of the dental\-specific foundation model literature, introduces a risk of including results that have not yet undergone peer review\. Although both reviewers \(S\.H\. and R\.D\.\) screened independently, formal inter\-rater reliability statistics were not calculated for the screening process, which limits the precision of the claim of systematic reproducibility\. The two\-dimensional classification framework proposed in Section 3 is a theoretical contribution of this review and has not been empirically validated as a taxonomy\. Its categories are analytically useful, as argued throughout the review, but the partial correlation between the two axes, documented in Section 3 itself, means the framework describes the literature rather than explaining it\. An English\-language restriction was applied, which may exclude relevant work published in Chinese, Japanese, Turkish, or other languages, particularly given the geographic concentration of the reviewed literature in non\-English\-speaking regions\.
Finally, the heterogeneity of outcome measures across the reviewed studies precluded quantitative synthesis: no meta\-analysis was possible, and all comparisons are narrative\.

## 8 Conclusion

This review examined 97 studies published between 2020 and 2026, covering language\-generative models, discriminative vision foundation models, and dental\-specific foundation models, organized through a two\-dimensional classification framework based on architectural paradigm and degree of dental specialization\. Language\-generative models show strong performance on text\-based tasks, with accuracy consistently reaching 80–95% on licensing examinations and clinical question answering, but fall to 45–68% on image\-dependent tasks, a gap that is replicated independently across five national examination systems\. Discriminative vision models, adapted from SAM and CLIP, achieve competitive segmentation and classification performance but have not been evaluated against a shared benchmark\. Dental\-specific foundation models, particularly DentVLM and DentVFM, deliver the strongest performance on complex multimodal clinical tasks, with DentVLM outperforming junior dentists on 21 of 36 tasks in a rigorous clinical study\.

The dominant pattern across all categories is complementarity rather than competition\. No single model class covers the full range of clinical dental tasks\. The strongest performing systems in the reviewed literature are those that combine model types within structured pipelines, using language\-generative models for reasoning and communication, discriminative vision models for spatial analysis, and dental\-specific models for domain\-critical multimodal tasks\. This finding has direct implications for how the field allocates research effort: optimizing individual models on narrow benchmarks is less likely to produce clinically deployable systems than investing in the integration infrastructure that allows complementary model types to work together\.
Safe autonomous deployment of any of these systems in clinical dental practice remains conditional on resolving three persistent barriers: hallucination in generative models, which has been acknowledged but not systematically quantified in the reviewed literature; the scarcity of large, diverse, well\-annotated dental datasets, which constrains both training and evaluation; and the absence of standardized clinical evaluation benchmarks, which prevents meaningful cross\-study performance comparison\. Until these barriers are addressed through sustained data curation, shared benchmark development, and prospective clinical validation studies, the appropriate role for these systems is as supervised assistive tools that augment rather than replace clinical expertise\.

The field is advancing rapidly\. Agentic reasoning architectures, annotation\-efficient self\-supervised pretraining, knowledge\-guided deterministic pipelines, and chain\-of\-thought clinical supervision are all active directions with early but promising evidence\. Whether these directions produce clinically deployable tools within a five\-year horizon will depend as much on regulatory clarity, data infrastructure, and the development of shared evaluation standards as on further model innovation\.

Table 8: List of abbreviations used in this review\.
## CRediT Author Contribution Statement

Sema Helali: Conceptualization, Methodology, Data curation, Formal analysis, Investigation, Writing – original draft, Writing – review and editing\. Lina Abu Nada: Clinical expertise, Validation of clinical relevance and implications of AI in dental practice, Writing – review and editing\. Alaa Abd\-Alrazaq: Methodology, Data extraction mechanism, Writing – review and editing \(Discussion\)\. Sausan Alqawas: Clinical expertise, Validation of clinical relevance and implications of AI in dental practice, Writing – review and editing\. Faleh Tamimi: Clinical expertise, Guidance of analysis, Writing – review and editing\. Rafat Damseh: Conceptualization, Methodology, Data curation, Formal analysis, Validation, Writing – review and editing, Supervision, Project administration\.

## Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper\.

## Funding

This work was supported by United Arab Emirates University through the UAEU Strategic Research Grant 12R315\.

## Data Availability

The data supporting the findings of this review are available from the corresponding author upon reasonable request\.

## Ethics Approval

This study is a scoping review of publicly available literature and did not involve human participants, animal subjects, or identifiable personal data\. Ethical approval was therefore not required\.

## References

- \[1\] F\. Akkoca, M\. Özdede, G\. İlhan, E\. Koyuncu, and H\. Ellidokuz \(2025\-06\) Assessing the success of chatgpt\-4o in oral radiology education and practice: a pioneering research\. Cumhuriyet Dental Journal 28\(2\), pp\. 210–215\.
- \[2\] M\. M\. Alnsour, W\. Al\-Ghannam, I\. A\. Alraheam, A\. H\. A\. Sabrah, Y\.
Oweis, and M\. K\. Al\-Omiri \(2025\) Assessing the suitability of chatgpt in responding to public inquiries about dental crown restorations\. BMC Oral Health 25\(1\), pp\. 1949\. External Links: [Document](https://dx.doi.org/10.1186/s12903-025-07281-8)
- \[3\] H\. Alqahtani \(2025\) Assessment of artificial intelligence chatbots in responding to dental occlusion questions: a comparative study\. BMC Oral Health 26\(1\), pp\. 201\. External Links: [Document](https://dx.doi.org/10.1186/s12903-025-07573-z)
- \[4\] I\. Amador Barbosa, M\. Sergio Almeida Alves, P\. Rayse Zagalo de Almeida, P\. de Almeida Rodrigues, R\. Pimentel de Oliveira, S\. Augusto Fernades de Menezes, J\. D\. Mendonça de Moura, and R\. Roberto de Souza Fonseca \(2025\) Assessing the diagnostic and treatment accuracy of large language models \(llms\) in peri\-implant diseases: a clinical experimental study\. Journal of Dentistry 162, pp\. 106091\. External Links: [Document](https://dx.doi.org/10.1016/j.jdent.2025.106091)
- \[5\] E\. Arılı Öztürk, C\. Turan Gökduman, and B\. C\. Çanakçi \(2025\) Evaluation of the performance of chatgpt\-4 and chatgpt\-4o as a learning tool in endodontics\. International Endodontic Journal\. Note: Advance online publication\. External Links: [Document](https://dx.doi.org/10.1111/iej.14217)
- \[6\] E\. M\. Aşar, İ\. İpek, and K\.
Bilge \(2025\) Customized gpt\-4v\(ision\) for radiographic diagnosis: can large language model detect supernumerary teeth?\. BMC Oral Health 25\(1\), pp\. 756\. External Links: [Document](https://dx.doi.org/10.1186/s12903-025-06163-3)
- \[7\] S\. Atchibay, A\. Alchaar, F\. Alizadeh\-Shabdiz, and P\. Liao \(2026\) Advancing sam for dental imaging: a detection\-prompted pipeline for high\-accuracy tooth segmentation\. Note: Under review \(MIDL 2026 submission\)\.
- \[8\] J\. I\. Aura\-Tormos, M\. Llacer\-Martinez, and I\. Torres\-Osca \(2026\) Educational applications of chatgpt in university\-based dental education\. a systematic review\. European Journal of Dental Education 30\(2\), pp\. 644–660\. External Links: [Document](https://dx.doi.org/10.1111/eje.70011)
- \[9\] A\. A\. Azhari, W\. M\. Ahmed, A\. Alhamadani, A\. Alfaraj, M\. Zhang, and C\. T\. Lu \(2026\) Assessing the efficacy of artificial intelligence platforms in answering dental caries multiple\-choice questions: a comparative study of chatgpt and google gemini language models\. Dentistry Journal 14\(2\), pp\. 72\. External Links: [Document](https://dx.doi.org/10.3390/dj14020072)
- \[10\] Y\. Balel, K\. Sağtaş, F\. Teke, and M\. A\.
Kurt \(2026\) A novel hybrid large language model approach for reporting panoramic radiographs and performance comparison with current large language models\. Journal of Imaging Informatics in Medicine\. Note: Advance online publication\. External Links: [Document](https://dx.doi.org/10.1007/s10278-026-01880-9), [Link](https://doi.org/10.1007/s10278-026-01880-9)
- \[11\] N\. Bashir, Z\. Ur Rahman, and S\. Chen \(2022\-07\) Systematic comparison of machine learning algorithms to develop and validate predictive models for periodontitis\. Journal of Clinical Periodontology 49\. External Links: [Document](https://dx.doi.org/10.1111/jcpe.13692)
- \[12\] I\. Batool, N\. Naved, S\. M\. R\. Kazmi, and F\. Umer \(2024\) Leveraging large language models in the delivery of post\-operative dental care: a comparison between an embedded gpt model and chatgpt\. BDJ Open 10\(1\), pp\. 48\. External Links: [Document](https://dx.doi.org/10.1038/s41405-024-00226-3)
- \[13\] M\. Brondani, C\. Alves, C\. Ribeiro, M\. M\. Braga, R\. C\. M\. Garcia, T\. Ardenghi, and K\. Pattanaporn \(2024\) Artificial intelligence, chatgpt, and dental education: implications for reflective assignments and qualitative research\. Journal of Dental Education 88\(12\), pp\. 1671–1680\. External Links: [Document](https://dx.doi.org/10.1002/jdd.13663)
- \[14\] T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\.
Neelakantan, et al\. \(2020\) Language models are few\-shot learners\. Advances in Neural Information Processing Systems 33, pp\. 1877–1901\.
- \[15\] D\. P\. Bubna, N\. H\. R\. Mattos, L\. B\. D\. P\. Luiz, F\. Baratto\-Filho, M\. T\. Mattos\-Calil, Y\. T\. C\. Silva\-Sousa, E\. C\. Küchler, Â\. G\. D\. Schroder, C\. M\. Araujo, and B\. M\. M\. Araujo \(2026\) Can large language models detect periapical lesions in anterior teeth? a comparative study\. Brazilian Dental Journal 36, pp\. e256861\. External Links: [Document](https://dx.doi.org/10.1590/0103-644020256861)
- \[16\] M\. Büker, M\. Sümbüllü, and H\. Arslan \(2025\) Comparative performance of chatbots in endodontic clinical decision support: a 4\-day accuracy and consistency study\. International Dental Journal 75\(5\), pp\. 100920\. External Links: [Document](https://dx.doi.org/10.1016/j.identj.2025.100920)
- \[17\] F\. Busch, L\. Hoffmann, C\. Rueger, E\. van Dijk, R\. Kader, E\. Ortiz\-Prado, M\. Makowski, L\. Saba, M\. Hadamitzky, J\. Kather, D\. Truhn, R\. Cuocolo, L\. Adams, and K\. Bressem \(2025\-01\) Current applications and challenges in large language models for patient care: a systematic review\. Communications Medicine 5\. External Links: [Document](https://dx.doi.org/10.1038/s43856-024-00717-2)
- \[18\] Y\. Cai, R\. Zhao, H\. Zhao, Y\. Li, and L\.
Gou\(2024\)Exploring the use of chatgpt/gpt\-4 for patient follow\-up after oral surgeries\.International Journal of Oral and Maxillofacial Surgery53\(10\),pp\. 867–872\.External Links:[Document](https://dx.doi.org/10.1016/j.ijom.2024.04.002)Cited by:[§4\.3\.1](https://arxiv.org/html/2606.02914#S4.SS3.SSS1.p3.1),[Table 4](https://arxiv.org/html/2606.02914#S4.T4.19.19.19.19.19.19.19.22.2.1.1.1)\. - \[19\]Z\. Cai, J\. Zhang, J\. Zhao, Z\. Zeng, Y\. Li, J\. Liang, others, and B\. Wang\(2025\)DentalGPT: incentivizing multimodal complex reasoning in dentistry\.arXiv preprint arXiv:2512\.11558\.Cited by:[§6\.2](https://arxiv.org/html/2606.02914#S6.SS2.p3.1),[Table 7](https://arxiv.org/html/2606.02914#S6.T7.3.1.1.1.1.1.1.6.5.1.1.1),[§7\.3](https://arxiv.org/html/2606.02914#S7.SS3.p2.1)\. - \[20\]B\. Çakmak, T\. Sökmen, and B\. Baloş Tuncer\(2025\)Artificial intelligence\-powered chatbots’ responses to orthodontic questions from the dentistry specialization examination: accuracy and source evaluation\.Journal of Dental Sciences\.External Links:ISSN 1991\-7902,[Document](https://dx.doi.org/10.1016/j.jds.2025.11.027),[Link](https://doi.org/10.1016/j.jds.2025.11.027)Cited by:[§4\.2\.1](https://arxiv.org/html/2606.02914#S4.SS2.SSS1.p3.1),[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.tab1.1.1.1.1.1.1.1.1.3.2.1.1.1)\. - \[21\]İ\. H\. Çelik, H\. Camcı, and F\. Salmanpour\(2026\)Bridging the information gap in pediatric dentistry: a comparison of chatgpt\-4o, google gemini advanced, and expert responses based on evaluations by parents and pediatric dentists\.Journal of Clinical Pediatric Dentistry50\(1\),pp\. 147–155\.Cited by:[§4\.3\.1](https://arxiv.org/html/2606.02914#S4.SS3.SSS1.p3.1),[Table 4](https://arxiv.org/html/2606.02914#S4.T4.18.18.18.18.18.18.18.18.2.1.1)\. - \[22\]G\. S\. Chatzopoulos, V\. P\. Koidou, L\. Tsalikis, and E\. G\. 
Kaklamanos\(2025\)Large language models in periodontology: assessing their performance in clinically relevant questions\.The Journal of Prosthetic Dentistry134\(6\),pp\. 2328–2336\.External Links:[Document](https://dx.doi.org/10.1016/j.prosdent.2024.10.020)Cited by:[§4\.1\.3](https://arxiv.org/html/2606.02914#S4.SS1.SSS3.p1.1),[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.1.1.1.1.1.1.1.1.3.1.1.1.1)\. - \[23\]J\. Chen, X\. Ge, C\. Yuan, Y\. Chen, X\. Li, X\. Zhang, S\. Chen, W\. Zheng, and C\. Miao\(2025\)Comparing orthodontic pre\-treatment information provided by large language models\.BMC Oral Health25\(1\),pp\. 838\.External Links:[Document](https://dx.doi.org/10.1186/s12903-025-06246-1)Cited by:[§4\.3\.1](https://arxiv.org/html/2606.02914#S4.SS3.SSS1.p4.1),[Table 4](https://arxiv.org/html/2606.02914#S4.T4.19.19.19.19.19.19.19.23.3.1.1.1)\. - \[24\]C\. Cheng, H\. S\. Tsang, R\. T\. Hsung, Y\. Chan, W\. Lo, and W\. Lam\(2025\)GumAgent: towards an accessible gum disease detection tool leveraging vision language model\.In2025 8th International Conference on Information Communication and Signal Processing \(ICICSP\),pp\. 513–518\.External Links:[Document](https://dx.doi.org/10.1109/ICICSP66564.2025.11338496)Cited by:[§6\.3](https://arxiv.org/html/2606.02914#S6.SS3.p3.1),[Table 7](https://arxiv.org/html/2606.02914#S6.T7.3.1.1.1.1.1.1.10.9.1.1.1)\. - \[25\]C\. Dasanayaka, K\. Dandeniya, M\. Dissanayake, C\. Gunasena, and R\. Jayasinghe\(2025\)Multimodal ai and large language models for orthopantomography radiology report generation and q&a\.Applied System Innovation8\(2\),pp\. 39\.External Links:[Document](https://dx.doi.org/10.3390/asi8020039),[Link](https://doi.org/10.3390/asi8020039)Cited by:[§4\.3\.2](https://arxiv.org/html/2606.02914#S4.SS3.SSS2.p1.1),[Table 5](https://arxiv.org/html/2606.02914#S4.T5.3.3.3.3.3.3.3.6.2.1.1.1),[§7\.2](https://arxiv.org/html/2606.02914#S7.SS2.p3.1)\. - \[26\]M\. Dashti, S\. Ghasemi, N\. Ghadimi, D\. Hefzi, A\. 
Karimian, N\. Zare, A\. Fahimipour, Z\. Khurshid, M\. M\. Chafjiri, and S\. Ghaedsharaf\(2024\)Performance of chatgpt 3\.5 and 4 on u\.s\. dental examinations: the inbde, adat, and dat\.Imaging science in dentistry54\(3\),pp\. 271–275\.External Links:[Document](https://dx.doi.org/10.5624/isd.20240037)Cited by:[§4\.2\.1](https://arxiv.org/html/2606.02914#S4.SS2.SSS1.p1.1),[Table 2](https://arxiv.org/html/2606.02914#S4.T2.1.1.1.1.1.1.1.1.2.1.1)\. - \[27\]L\. P\. de Araújo, L\. B\. Moreno, B\. C\. C\. de Araújo, E\. T\. Chaves, T\. M\. Botero, and V\. H\. D\. Romero\(2026\)From evidence\-based endodontics to generative ai: a comparative study of 11 large language models\.Journal of Endodontics,pp\. S0099–2399\(26\)00010–5\.Note:Advance online publicationExternal Links:[Document](https://dx.doi.org/10.1016/j.joen.2026.01.009)Cited by:[§4\.1\.2](https://arxiv.org/html/2606.02914#S4.SS1.SSS2.p4.1),[Table 1](https://arxiv.org/html/2606.02914#S4.T1.3.1.1.1.1.1.1.8.7.1.1.1)\. - \[28\]A\. Dermata, A\. Arhakis, M\. A\. Makrygiannakis, K\. Giannakopoulos, and E\. G\. Kaklamanos\(2025\)Evaluating the evidence\-based potential of six large language models in paediatric dentistry: a comparative study on generative artificial intelligence\.European Archives of Paediatric Dentistry26\(3\),pp\. 527–535\.External Links:[Document](https://dx.doi.org/10.1007/s40368-025-01012-x)Cited by:[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.3.3.3.3.3.3.3.3.8.4.1.1.1),[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.p3.1)\. - \[29\]A\. Dosovitskiy, L\. Beyer, A\. Kolesnikov, D\. Weissenborn, X\. Zhai, T\. Unterthiner, N\. Houlsby,et al\.\(2020\)An image is worth 16x16 words: transformers for image recognition at scale\.arXiv preprint arXiv:2010\.11929\.Cited by:[§3\.1](https://arxiv.org/html/2606.02914#S3.SS1.p2.1)\. - \[30\]C\. Du, X\. Chen, J\. Wang, J\. Wang, Z\. Li, Z\. Zhang, and Q\. 
Lao\(2024\)Prompting vision\-language models for dental notation aware abnormality detection\.InMedical Image Computing and Computer Assisted Intervention – MICCAI 2024,Berlin, Heidelberg,pp\. 687–697\.External Links:[Document](https://dx.doi.org/10.1007/978-3-031-72390-2%5F64),[Link](https://doi.org/10.1007/978-3-031-72390-2_64)Cited by:[§5\.2](https://arxiv.org/html/2606.02914#S5.SS2.p2.1),[Table 6](https://arxiv.org/html/2606.02914#S5.T6.2.2.2.2.2.2.10.7.1.1.1)\. - \[31\]M\. B\. Dundar Sari and B\. Sezer\(2026\)Comparative performance evaluation of chatgpt\-4 omni and gemini advanced in the turkish dentistry specialization exam\.BMC Medical Education26,pp\. 251\.External Links:[Document](https://dx.doi.org/10.1186/s12909-026-08621-0),[Link](https://doi.org/10.1186/s12909-026-08621-0)Cited by:[§4\.2\.1](https://arxiv.org/html/2606.02914#S4.SS2.SSS1.p3.1),[Table 2](https://arxiv.org/html/2606.02914#S4.T2.2.2.2.2.2.2.2.11.8.1.1.1)\. - \[32\]P\. M\. Durmazpinar and E\. Ekmekci\(2025\)Comparing diagnostic skills in endodontic cases: dental students versus chatgpt\-4o\.BMC Oral Health25\(1\),pp\. 457\.External Links:[Document](https://dx.doi.org/10.1186/s12903-025-05857-y)Cited by:[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.p1.1),[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.tab1.2.1.1.1.1.1.1.4.3.1.1.1)\. - \[33\]D\. Dursun and R\. Bilici Geçer\(2024\)Can artificial intelligence models serve as patient information consultants in orthodontics?\.BMC Medical Informatics and Decision Making24\(1\),pp\. 211\.External Links:[Document](https://dx.doi.org/10.1186/s12911-024-02619-8),[Link](https://doi.org/10.1186/s12911-024-02619-8)Cited by:[§4\.3\.1](https://arxiv.org/html/2606.02914#S4.SS3.SSS1.p2.1),[Table 4](https://arxiv.org/html/2606.02914#S4.T4.19.19.19.19.19.19.19.21.1.1.1.1)\. - \[34\]B\. Erdem, M\. Özcan, and Ç\. 
Şar\(2025\)Comparative analysis of artificial intelligence chatbots in orthodontic emergency scenarios: chatgpt\-3\.5, chatgpt\-4\.0, copilot, and gemini\.The Angle Orthodontist96\(1\),pp\. 100–105\.External Links:[Document](https://dx.doi.org/10.2319/021825-146.1),[Link](https://doi.org/10.2319/021825-146.1)Cited by:[§4\.3\.1](https://arxiv.org/html/2606.02914#S4.SS3.SSS1.p2.1),[Table 4](https://arxiv.org/html/2606.02914#S4.T4.5.5.5.5.5.5.5.5.2.1.1)\. - \[35\]M\. B\. Erden, M\. G\. Kanmaz, and G\. A\. Sabah\(2025\)Can chatbots replace experts? diagnostic accuracy of ai models in classifying impacted mandibular third molars\.Odontology\.Note:Advance online publicationExternal Links:[Document](https://dx.doi.org/10.1007/s10266-025-01214-1),[Link](https://doi.org/10.1007/s10266-025-01214-1)Cited by:[§4\.1\.6](https://arxiv.org/html/2606.02914#S4.SS1.SSS6.p2.1),[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.tab1.1.1.1.1.1.1.1.4.3.1.1.1)\. - \[36\]Y\. Fan, J\. Hao, H\. Chen, J\. Bao, Y\. Shao, Y\. Liang, K\. F\. Hung, and H\. Tang\(2026\)OralGPT\-plus: learning to use visual tools via reinforcement learning for panoramic x\-ray analysis\.arXiv preprint arXiv:2603\.06366\.Cited by:[§6\.2](https://arxiv.org/html/2606.02914#S6.SS2.p7.1),[Table 7](https://arxiv.org/html/2606.02914#S6.T7.3.1.1.1.1.1.1.8.7.1.1.1),[§7\.2](https://arxiv.org/html/2606.02914#S7.SS2.p3.1),[§7\.3](https://arxiv.org/html/2606.02914#S7.SS3.p2.1)\. - \[37\]F\. Fanelli, M\. Saleh, P\. Santamaria, K\. Zhurakivska, L\. Nibali, and G\. Troiano\(2025\)Development and comparative evaluation of a reinstructed gpt\-4o model specialized in periodontology\.Journal of Clinical Periodontology52\(5\),pp\. 
707–716\.External Links:[Document](https://dx.doi.org/10.1111/jcpe.14101)Cited by:[§1](https://arxiv.org/html/2606.02914#S1.p3.1),[§4\.1\.3](https://arxiv.org/html/2606.02914#S4.SS1.SSS3.p2.1),[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.1.1.1.1.1.1.1.1.4.2.1.1.1),[§7\.2](https://arxiv.org/html/2606.02914#S7.SS2.p3.1)\. - \[38\]H\. Fukuda, M\. Morishita, K\. Muraoka, S\. Yamaguchi, T\. Nakamura, M\. Habu, I\. Yoshioka, S\. Awano, and K\. Ono\(2025\)Evaluating the accuracy and performance of chatgpt\-4o in solving japanese national dental technician examination\.International Dental Journal75\(4\),pp\. 100847\.External Links:[Document](https://dx.doi.org/10.1016/j.identj.2025.100847)Cited by:[§4\.2\.1](https://arxiv.org/html/2606.02914#S4.SS2.SSS1.p3.1),[Table 2](https://arxiv.org/html/2606.02914#S4.T2.2.2.2.2.2.2.2.10.7.1.1.1)\. - \[39\]S\. Gao, Z\. A\. Wang, Z\. Gao, J\. Liu, H\. Zhang, S\. Pan, and Y\. Zhou\(2026\)Performance of enhanced large language models on prosthodontic multiple\-choice questions\.International Dental Journal76\(2\),pp\. 109441\.External Links:[Document](https://dx.doi.org/10.1016/j.identj.2026.109441)Cited by:[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.3.3.3.3.3.3.3.3.11.7.1.1.1),[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.p3.1)\. - \[40\]M\. Haberal and D\. Hançerlioğulları\(2026\)Can artificial intelligence chatbots think like dentists? a comparative analysis based on dental specialty examination questions in restorative dentistry\.BMC Oral Health26\(1\),pp\. 231\.External Links:[Document](https://dx.doi.org/10.1186/s12903-025-07612-9),[Link](https://doi.org/10.1186/s12903-025-07612-9)Cited by:[§4\.2\.1](https://arxiv.org/html/2606.02914#S4.SS2.SSS1.p3.1),[Table 2](https://arxiv.org/html/2606.02914#S4.T2.2.2.2.2.2.2.2.12.9.1.1.1)\. - \[41\]Z\. Hakami, S\. A\. K\. Saheb, and O\. A\. Bawazeer\(2026\)Orthodontic knowledge assessment: a comparison of five ai chatbots\.Saudi Dental Journal38,pp\. 
20\.External Links:[Document](https://dx.doi.org/10.1007/s44445-025-00091-2)Cited by:[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.3.3.3.3.3.3.3.3.9.5.1.1.1),[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.p3.1)\. - \[42\]J\. Hao, Y\. Liang, L\. Lin, Y\. Fan, W\. Zhou, K\. Guo,et al\.\(2025\)OralGPT\-omni: a versatile dental multimodal large language model\.arXiv preprint arXiv:2511\.22055\.Cited by:[§6\.2](https://arxiv.org/html/2606.02914#S6.SS2.p6.1),[Table 7](https://arxiv.org/html/2606.02914#S6.T7.3.1.1.1.1.1.1.7.6.1.1.1),[§7\.3](https://arxiv.org/html/2606.02914#S7.SS3.p2.1)\. - \[43\]J\. Hao, Y\. Fan, Y\. Sun, K\. Guo, L\. Lin, J\. Yang, Q\. Y\. H\. Ai, L\. M\. Wong, H\. Tang, and K\. F\. Hung\(2025\)Towards better dental ai: a multimodal benchmark and instruction dataset for panoramic x\-ray analysis\.arXiv preprint arXiv:2509\.09254\.Cited by:[§6\.4](https://arxiv.org/html/2606.02914#S6.SS4.p2.1)\. - \[44\]K\. He, X\. Chen, S\. Xie, Y\. Li, P\. Dollár, and R\. Girshick\(2022\)Masked autoencoders are scalable vision learners\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 16000–16009\.Cited by:[§3\.2](https://arxiv.org/html/2606.02914#S3.SS2.p4.1)\. - \[45\]N\. Houlsby, A\. Giurgiu, S\. Jastrzebski, B\. Morrone, Q\. De Laroussilhe, A\. Gesmundo, M\. Attariyan, and S\. Gelly\(2019\)Parameter\-efficient transfer learning for nlp\.InInternational conference on machine learning,pp\. 2790–2799\.Cited by:[§3\.2](https://arxiv.org/html/2606.02914#S3.SS2.p3.1)\. - \[46\]E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, W\. Chen,et al\.\(2022\)Lora: low\-rank adaptation of large language models\.ICLR1\(2\),pp\. 3\.Cited by:[§3\.2](https://arxiv.org/html/2606.02914#S3.SS2.p3.1)\. - \[47\]C\. Huang, Y\. Lee, A\. Sun, and C\. 
Chiang\(2025\)Performance of chatgpt\-4, gemini, and deepseek\-v3 on answering the multiple choice questions from taiwan national dental technician licensing examinations and their self\-learning abilities over a three\-week period\.Journal of Dental Sciences20\(4\),pp\. 2154–2162\.External Links:ISSN 1991\-7902,[Document](https://dx.doi.org/10.1016/j.jds.2025.07.011)Cited by:[§4\.2\.1](https://arxiv.org/html/2606.02914#S4.SS2.SSS1.p3.1),[Table 2](https://arxiv.org/html/2606.02914#S4.T2.2.2.2.2.2.2.2.9.6.1.1.1)\. - \[48\]X\. Huang, F\. Xiao, D\. He, A\. Gao, D\. Li, X\. Zhang, S\. Zhang, and X\. Wang\(2025\)Towards generalist intelligence in dentistry: vision foundation models for oral and maxillofacial radiology\.arXiv preprint arXiv:2510\.14532\.Cited by:[§1](https://arxiv.org/html/2606.02914#S1.p3.1),[§6\.1](https://arxiv.org/html/2606.02914#S6.SS1.p1.1),[§6\.4](https://arxiv.org/html/2606.02914#S6.SS4.p3.1),[Table 7](https://arxiv.org/html/2606.02914#S6.T7.3.1.1.1.1.1.1.2.1.1.1.1),[§7\.3](https://arxiv.org/html/2606.02914#S7.SS3.p3.1)\. - \[49\]A\. Hurst, A\. Lerer, A\. P\. Goucher, A\. Perelman, A\. Ramesh, A\. Clark, A\. Ostrow, A\. Welihinda, A\. Hayes, A\. Radford,et al\.\(2024\)Gpt\-4o system card\.arXiv preprint arXiv:2410\.21276\.Cited by:[§3\.1](https://arxiv.org/html/2606.02914#S3.SS1.p2.1)\. - \[50\]K\. Ji, Z\. Wu, J\. Han, G\. Zhai, and J\. Liu\(2025\)Evaluating chatgpt\-4’s performance on oral and maxillofacial queries: chain of thought and standard method\.Frontiers in Oral Health6,pp\. 1541976\.External Links:[Document](https://dx.doi.org/10.3389/froh.2025.1541976),[Link](https://doi.org/10.3389/froh.2025.1541976)Cited by:[§4\.1\.6](https://arxiv.org/html/2606.02914#S4.SS1.SSS6.p3.1),[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.tab1.1.1.1.1.1.1.1.5.4.1.1.1)\. - \[51\]O\. B\. Kandaz, T\. Teksoz, C\. 
Avlayici,et al\.\(2026\)Using ai large language models to assess dental history in systemic conditions\.Discover Artificial Intelligence6,pp\. 103\.External Links:[Document](https://dx.doi.org/10.1007/s44163-025-00816-6)Cited by:[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.p1.1),[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.tab1.1.1.1.1.1.1.1.6.5.1.1.1)\. - \[52\]M\. G\. Kanmaz and G\. Agani Sabah\(2025\)Diagnostic accuracy of large language models in the classification of superior labial frenulum attachments\.Odontology\.External Links:[Document](https://dx.doi.org/10.1007/s10266-025-01283-2)Cited by:[§4\.1\.4](https://arxiv.org/html/2606.02914#S4.SS1.SSS4.p1.1),[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.1.1.1.1.1.1.1.1.6.4.1.1.1)\. - \[53\]D\. Kim, J\. Y\. Han, S\. Kim, H\. Yun, and W\. J\. Yi\(2025\)DentalCLIP: semi\-supervised contrastive learning for tmj osteoarthritis diagnosis\.InAnnual International Conference of the IEEE Engineering in Medicine and Biology Society \(EMBC\),pp\. 1–6\.External Links:[Document](https://dx.doi.org/10.1109/EMBC58623.2025.11254620)Cited by:[§5\.2](https://arxiv.org/html/2606.02914#S5.SS2.p3.1),[Table 6](https://arxiv.org/html/2606.02914#S5.T6.2.2.2.2.2.2.2.3.1.1)\. - \[54\]A\. Kirillov, E\. Mintun, N\. Ravi, H\. Mao, C\. Rolland, L\. Gustafson, T\. Xiao, S\. Whitehead, A\. C\. Berg, W\. Lo, P\. Dollár, and R\. Girshick\(2023\)Segment anything\.Proceedings of the IEEE/CVF International Conference on Computer Vision \(ICCV\),pp\. 3992–4003\.External Links:[Document](https://dx.doi.org/10.1109/ICCV51070.2023.00371)Cited by:[§3\.1](https://arxiv.org/html/2606.02914#S3.SS1.p3.1),[§5](https://arxiv.org/html/2606.02914#S5.p1.1)\. - \[55\]G\. Kofos, A\. Fardi, T\. Lillis, F\. Ioannis, and N\. 
Dabarakis\(2025\)Evaluation of artificial intelligence conversational models in providing information on dental implants: a comparative analysis of chatgpt, gemini and medgebra\.Journal of Evaluation in Clinical Practice31\(8\),pp\. e70304\.External Links:[Document](https://dx.doi.org/10.1111/jep.70304)Cited by:[§4\.3\.1](https://arxiv.org/html/2606.02914#S4.SS3.SSS1.p2.1),[Table 4](https://arxiv.org/html/2606.02914#S4.T4.6.6.6.6.6.6.6.6.2.1.1)\. - \[56\]C\. Z\. Koyuncuoglu, A\. H\. Selcuker, and E\. Ozyilmaz\(2025\)The reliability of answers from four different ai chatbots on periodontology theoretical exam questions: an evaluation in dental education\.BMC Oral Health26\(1\),pp\. 114\.External Links:[Document](https://dx.doi.org/10.1186/s12903-025-07387-z)Cited by:[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.3.3.3.3.3.3.3.3.10.6.1.1.1),[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.p3.1)\. - \[57\]Ö\. Küçük Keleş and Z\. B\. Arslan\(2026\)Performance of artificial intelligence chatbots in the diagnosis and management of simulated dental trauma cases: an evaluation based on iadt guidelines\.Clinical Oral Investigations\.Note:Published online: 23 December 2025External Links:[Document](https://dx.doi.org/10.1007/s00784-025-06716-4)Cited by:[§4\.1\.5](https://arxiv.org/html/2606.02914#S4.SS1.SSS5.p2.1),[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.1.1.1.1.1.1.1.1.7.5.1.1.1)\. - \[58\]Ö\. Kurt and E\. Şimsek\(2025\)Knowledge\-level comparison in pulpal and periapical diseases: dental students versus artificial intelligence models \(gemini, microsoft copilot, chatgpt\-3\.5, chatgpt\-4o\): cross\-sectional study\.BMC Medical Education25\(1\),pp\. 1657\.External Links:[Document](https://dx.doi.org/10.1186/s12909-025-08263-8)Cited by:[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.3.3.3.3.3.3.3.3.6.2.1.1.1),[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.p2.1)\. - \[59\]H\. E\. Kuru, A\. Aşık, and D\. M\. 
Demir\(2025\)Can artificial intelligence language models effectively address dental trauma questions?\.Dental Traumatology41\(5\),pp\. 567–580\.External Links:[Document](https://dx.doi.org/10.1111/edt.13063)Cited by:[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.2.2.2.2.2.2.2.2.2.2.1.1),[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.p2.1)\. - \[60\]C\. Lafourcade, O\. Kérourédan, B\. Ballester, and R\. Richert\(2025\)Accuracy, consistency, and contextual understanding of large language models in restorative dentistry and endodontics\.Journal of Dentistry157,pp\. 105764\.External Links:[Document](https://dx.doi.org/10.1016/j.jdent.2025.105764)Cited by:[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.3.3.3.3.3.3.3.3.12.8.1.1.1),[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.p3.1)\. - \[61\]S\. Lee, S\. I\. Oh, J\. Jo, S\. Kang, Y\. Shin, and J\. W\. Park\(2021\)Deep learning for early dental caries detection in bitewing radiographs\.Scientific Reports11\(1\),pp\. 16807\.External Links:[Document](https://dx.doi.org/10.1038/s41598-021-96368-7)Cited by:[§1](https://arxiv.org/html/2606.02914#S1.p1.1),[§1](https://arxiv.org/html/2606.02914#S1.p3.1)\. - \[62\]P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel,et al\.\(2020\)Retrieval\-augmented generation for knowledge\-intensive nlp tasks\.Advances in neural information processing systems33,pp\. 9459–9474\.Cited by:[§3\.2](https://arxiv.org/html/2606.02914#S3.SS2.p2.1)\. - \[63\]C\. Li, S\. Li, P\. Chen, L\. Li, C\. Wang, and Z\. 
Cai\(2025\)ToothSC\-sam: a novel network model based on skip\-connections and sam for tooth segmentation in cbct images\.Preprints\.org\.Note:Preprint, not peer\-reviewedExternal Links:[Document](https://dx.doi.org/10.20944/preprints202504.1562.v1),[Link](https://doi.org/10.20944/preprints202504.1562.v1)Cited by:[§5\.1](https://arxiv.org/html/2606.02914#S5.SS1.p5.1),[Table 6](https://arxiv.org/html/2606.02914#S5.T6.2.2.2.2.2.2.7.4.1.1.1),[§7\.3](https://arxiv.org/html/2606.02914#S7.SS3.p3.1)\. - \[64\]C\. Li et al\.\(2021\)Detection of dental apical lesions using cnns on periapical radiograph\.Sensors21\(21\),pp\. 7049\.External Links:[Document](https://dx.doi.org/10.3390/s21217049)Cited by:[§1](https://arxiv.org/html/2606.02914#S1.p3.1)\. - \[65\]J\. Liao, H\. Wang, H\. Gu, and Y\. Cai\(2024\)PPA\-sam: plug\-and\-play adversarial segment anything model for 3d tooth segmentation\.Applied Sciences14\(8\),pp\. 3259\.External Links:[Document](https://dx.doi.org/10.3390/app14083259),[Link](https://doi.org/10.3390/app14083259)Cited by:[§5\.1](https://arxiv.org/html/2606.02914#S5.SS1.p2.1),[§5\.1](https://arxiv.org/html/2606.02914#S5.SS1.p3.1),[Table 6](https://arxiv.org/html/2606.02914#S5.T6.2.2.2.2.2.2.6.3.1.1.1)\. - \[66\]C\. C\. Lin, J\. Sun, C\. Chang, Y\. Chang, and J\. Z\. Chang\(2025\)Performance of artificial intelligence chatbots in national dental licensing examination\.Journal of Dental Sciences20\(4\),pp\. 2307–2314\.External Links:ISSN 1991\-7902,[Document](https://dx.doi.org/10.1016/j.jds.2025.05.012)Cited by:[§4\.2\.1](https://arxiv.org/html/2606.02914#S4.SS2.SSS1.p1.1),[Table 2](https://arxiv.org/html/2606.02914#S4.T2.2.2.2.2.2.2.2.5.2.1.1.1)\. - \[67\]H\. Liu, C\. Li, Q\. Wu, and Y\. J\. Lee\(2023\)Visual instruction tuning\.Advances in neural information processing systems36,pp\. 34892–34916\.Cited by:[§3\.1](https://arxiv.org/html/2606.02914#S3.SS1.p2.1)\. - \[68\]S\. Liu, Z\. Zeng, T\. Ren, F\. Li, H\. Zhang, J\. Yang, others, and L\. 
Zhang\(2024\)Grounding dino: marrying dino with grounded pre\-training for open\-set object detection\.InEuropean Conference on Computer Vision,pp\. 38–55\.Cited by:[§3\.1](https://arxiv.org/html/2606.02914#S3.SS1.p3.1),[§5](https://arxiv.org/html/2606.02914#S5.p1.1)\. - \[69\]M\. Llorente de Pedro, A\. Suárez, J\. Algar, V\. Díaz\-Flores García, C\. Andreu\-Vázquez, and Y\. Freire\(2025\)Assessing chatgpt’s reliability in endodontics: implications for ai\-enhanced clinical learning\.Applied Sciences15\(10\),pp\. 5231\.External Links:[Document](https://dx.doi.org/10.3390/app15105231)Cited by:[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.3.3.3.3.3.3.3.3.5.1.1.1.1),[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.p1.1)\. - \[70\]Z\. Lu, J\. Lou, M\. Ma, H\. Jin, Y\. Zheng, and K\. Zhou\(2026\-03\)3DTeethSAM: taming sam2 for 3d teeth segmentation\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 7609–7617\.Cited by:[§5\.1](https://arxiv.org/html/2606.02914#S5.SS1.p6.1),[Table 6](https://arxiv.org/html/2606.02914#S5.T6.2.2.2.2.2.2.8.5.1.1.1)\. - \[71\]L\. Ma, J\. Han, Z\. Wang, and D\. Zhang\(2023\)Cephgpt\-4: an interactive multimodal cephalometric measurement and diagnostic system with visual large language model\.arXiv preprint arXiv:2307\.07518\.Cited by:[§6\.2](https://arxiv.org/html/2606.02914#S6.SS2.p2.1),[Table 7](https://arxiv.org/html/2606.02914#S6.T7.3.1.1.1.1.1.1.5.4.1.1.1)\. - \[72\]E\. E\. Maruska, A\. Elgreatly, W\. Madaio, K\. Razoky, C\. Bay, and A\. Mahrous\(2025\)Comparing dentist and chatbot answers to dental questions for quality and empathy\.JADA Foundational Science4,pp\. 100044\.External Links:ISSN 2772\-414X,[Document](https://dx.doi.org/10.1016/j.jfscie.2025.100044),[Link](https://doi.org/10.1016/j.jfscie.2025.100044)Cited by:[§4\.3\.2](https://arxiv.org/html/2606.02914#S4.SS3.SSS2.p2.1),[Table 5](https://arxiv.org/html/2606.02914#S4.T5.3.3.3.3.3.3.3.8.4.1.1.1)\. - \[73\]Z\. Meng, J\. Hao, X\. 
Dai, Y\. Feng, J\. Liu, B\. Feng,et al\.\(2025\)DentVLM: a multimodal vision\-language model for comprehensive dental diagnosis and enhanced clinical practice\.arXiv preprint arXiv:2509\.23344\.Cited by:[§1](https://arxiv.org/html/2606.02914#S1.p4.1),[§6\.2](https://arxiv.org/html/2606.02914#S6.SS2.p4.1),[Table 7](https://arxiv.org/html/2606.02914#S6.T7.3.1.1.1.1.1.1.3.2.1.1.1)\. - \[74\]Y\. Mine, T\. Taji, S\. Takeda, S\. Okazaki, T\. Y\. Peng, N\. Kakimoto, and T\. Murayama\(2026\)Assessing multimodal large language models for localizing dental implant fixtures on panoramic radiographs\.Journal of Dentistry168,pp\. 106580\.Note:Advance online publicationExternal Links:[Document](https://dx.doi.org/10.1016/j.jdent.2026.106580),[Link](https://doi.org/10.1016/j.jdent.2026.106580)Cited by:[§4\.1\.6](https://arxiv.org/html/2606.02914#S4.SS1.SSS6.p2.1),[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.tab1.1.1.1.1.1.1.1.3.2.1.1.1)\. - \[75\]Y\. Mine, S\. Okazaki, T\. Taji, H\. Kawaguchi, N\. Kakimoto, and T\. Murayama\(2025\)Benchmarking multimodal large language models on the dental licensing examination: challenges with clinical image interpretation\.Journal of Dental Sciences20\(4\),pp\. 2427–2435\.External Links:[Document](https://dx.doi.org/10.1016/j.jds.2025.03.018)Cited by:[§4\.2\.1](https://arxiv.org/html/2606.02914#S4.SS2.SSS1.p2.1),[Table 2](https://arxiv.org/html/2606.02914#S4.T2.2.2.2.2.2.2.2.6.3.1.1.1)\. - \[76\]H\. C\. Nguyen, H\. P\. Dang, T\. L\. Nguyen, V\. Hoang, and V\. A\. Nguyen\(2025\)Accuracy of latest large language models in answering multiple choice questions in dentistry: a comparative study\.PLOS ONE20\(1\),pp\. e0317423\.External Links:[Document](https://dx.doi.org/10.1371/journal.pone.0317423)Cited by:[§4\.2\.1](https://arxiv.org/html/2606.02914#S4.SS2.SSS1.p1.1),[Table 2](https://arxiv.org/html/2606.02914#S4.T2.2.2.2.2.2.2.2.2.2.1.1)\. - \[77\]L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. 
Agarwal, K\. Slama, A\. Ray,et al\.\(2022\)Training language models to follow instructions with human feedback\.Advances in neural information processing systems35,pp\. 27730–27744\.Cited by:[§3\.2](https://arxiv.org/html/2606.02914#S3.SS2.p4.1)\. - \[78\]Y\. Özbay, D\. Erdoğan, and G\. A\. Dinçer\(2025\)Evaluation of the performance of large language models in clinical decision\-making in endodontics\.BMC Oral Health25,pp\. 648\.External Links:[Document](https://dx.doi.org/10.1186/s12903-025-06050-x)Cited by:[§4\.1\.2](https://arxiv.org/html/2606.02914#S4.SS1.SSS2.p3.1),[Table 1](https://arxiv.org/html/2606.02914#S4.T1.3.1.1.1.1.1.1.7.6.1.1.1)\. - \[79\]Z\. Öztürk, C\. Bal, and B\. N\. Çelikkaya\(2025\)Evaluation of information provided by chatgpt versions on traumatic dental injuries for dental students and professionals\.Dental Traumatology41\(4\),pp\. 427–436\.External Links:[Document](https://dx.doi.org/10.1111/edt.13042)Cited by:[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.3.3.3.3.3.3.3.3.13.9.1.1.1),[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.p3.1)\. - \[80\]M\. J\. Page et al\.\(2021\)The prisma 2020 statement: an updated guideline for reporting systematic reviews\.BMJ372,pp\. n71\.External Links:[Document](https://dx.doi.org/10.1136/bmj.n71)Cited by:[§1](https://arxiv.org/html/2606.02914#S1.p5.1),[§2](https://arxiv.org/html/2606.02914#S2.p1.1)\. - \[81\]S\. Prasad, M\. Koseoglu, S\. Antonopoulou, H\. M\. Huber, A\. Azarbal, S\. Kurniawan, and C\. Sukotjo\(2025\)Assessing readability and accuracy of content produced by the american college of prosthodontists and large language models for patient education in prosthodontics\.Journal of Prosthodontics\.Note:Advance online publicationExternal Links:[Document](https://dx.doi.org/10.1111/jopr.70022)Cited by:[§4\.3\.1](https://arxiv.org/html/2606.02914#S4.SS3.SSS1.p4.1),[Table 4](https://arxiv.org/html/2606.02914#S4.T4.19.19.19.19.19.19.19.19.2.1.1)\. - \[82\]A\. Qutieshat, A\. 
Al Rusheidi, S\. Al Ghammari, A\. Alarabi, A\. Salem, and M\. Zelihic\(2024\)Comparative analysis of diagnostic accuracy in endodontic assessments: dental students vs\. artificial intelligence\.Diagnosis11\(3\),pp\. 259–265\.External Links:[Document](https://dx.doi.org/10.1515/dx-2024-0034)Cited by:[§4\.1\.2](https://arxiv.org/html/2606.02914#S4.SS1.SSS2.p5.1),[Table 1](https://arxiv.org/html/2606.02914#S4.T1.3.1.1.1.1.1.1.9.8.1.1.1)\. - \[83\]A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark, G\. Krueger, and I\. Sutskever\(2021\-18–24 Jul\)Learning transferable visual models from natural language supervision\.InProceedings of the 38th International Conference on Machine Learning,M\. Meila and T\. Zhang \(Eds\.\),Proceedings of Machine Learning Research, Vol\.139,pp\. 8748–8763\.External Links:[Link](https://proceedings.mlr.press/v139/radford21a.html)Cited by:[§3\.1](https://arxiv.org/html/2606.02914#S3.SS1.p3.1),[§5](https://arxiv.org/html/2606.02914#S5.p1.1)\. - \[84\]Y\. Ren and J\. Sun\(2026\)Comparing large language models and search engine responses to common orthodontic questions\.PLOS ONE21\(1\),pp\. e0339908\.External Links:[Document](https://dx.doi.org/10.1371/journal.pone.0339908),[Link](https://doi.org/10.1371/journal.pone.0339908)Cited by:[§4\.3\.1](https://arxiv.org/html/2606.02914#S4.SS3.SSS1.p1.1),[§4\.3\.2](https://arxiv.org/html/2606.02914#S4.SS3.SSS2.p2.1),[Table 4](https://arxiv.org/html/2606.02914#S4.T4.1.1.1.1.1.1.1.1.2.1.1)\. - \[85\]P\. Rewthamrongsris, J\. Burapacheep, E\. Phattarataratip, P\. Kulthanaamondhita, A\. Tichy, F\. Schwendicke, T\. Osathanon, and K\. Sappayatosok\(2025\)Image\-based diagnostic performance of llms vs cnns for oral lichen planus: example\-guided and differential diagnosis\.International Dental Journal75\(4\),pp\. 
100848\.External Links:[Document](https://dx.doi.org/10.1016/j.identj.2025.100848)Cited by:[§4\.1\.1](https://arxiv.org/html/2606.02914#S4.SS1.SSS1.p4.1),[Table 1](https://arxiv.org/html/2606.02914#S4.T1.3.1.1.1.1.1.1.4.3.1.1.1)\. - \[86\]P\. Rewthamrongsris, V\. Thongchotchat, J\. Burapacheep, V\. Trachoo, Z\. Khurshid, and T\. Porntaveetus\(2026\)Evaluating retrieval\-augmented generation\-large language models for infective endocarditis prophylaxis: clinical accuracy and efficiency\.International Dental Journal76\(1\),pp\. 109344\.External Links:[Document](https://dx.doi.org/10.1016/j.identj.2025.109344)Cited by:[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.p2.1),[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.tab1.1.1.1.1.1.1.1.8.7.1.1.1),[§7\.2](https://arxiv.org/html/2606.02914#S7.SS2.p3.1),[§7\.3](https://arxiv.org/html/2606.02914#S7.SS3.p4.1)\. - \[87\]P\. Rodrigues\-Pereira, M\. A\. P\. Dias\-Calças, A\. Moreira Mélo, M\. O\. Melchior, L\. Gaspar Ribeiro, A\. Pazin\-Filho, J\. F\. Mazzi\-Chaves, and L\. V\. Magri\(2025\)Generative artificial intelligence\-driven clinical case simulation in temporomandibular disorder education: chatgpt versus real patients\.Journal of Dental Education\.Note:Advance online publicationExternal Links:[Document](https://dx.doi.org/10.1002/jdd.70104)Cited by:[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.3.3.3.3.3.3.3.3.3.2.1.1),[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.p2.1)\. - \[88\]V\. Ronsivalle, S\. Santonocito, U\. Cammarata, E\. Lo Muzio, and M\. Cicciù\(2025\)Current applications of chatbots powered by large language models in oral and maxillofacial surgery: a systematic review\.Dentistry Journal13\(6\)\.External Links:[Link](https://www.mdpi.com/2304-6767/13/6/261),ISSN 2304\-6767,[Document](https://dx.doi.org/10.3390/dj13060261)Cited by:[§7\.1](https://arxiv.org/html/2606.02914#S7.SS1.p2.1)\. - \[89\]J\. Ryu, H\. Y\. Kang, Y\. S\. Chu, and S\. 
Yang\(2025\-09\-01\)Vision\-language foundation models for medical imaging: a review of current practices and innovations\.Biomedical Engineering Letters15\(5\),pp\. 809–830\.External Links:[Document](https://dx.doi.org/10.1007/s13534-025-00484-6)Cited by:[§7\.1](https://arxiv.org/html/2606.02914#S7.SS1.p3.1)\. - \[90\]H\. Sağlam, G\. P\. Sezgin, T\. Kaplan, and S\. S\. Kaplan\(2026\)Artificial intelligence chatbots versus dentists: a comparative knowledge assessment on traumatic dental injury management\.BMC Oral Health26\(1\),pp\. 313\.External Links:[Document](https://dx.doi.org/10.1186/s12903-026-07728-6)Cited by:[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.1.1.1.1.1.1.1.1.1.2.1.1),[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.p2.1)\. - \[91\]F\. Salmanpour, H\. Camcı, and Ö\. Geniş\(2025\)Comparative analysis of ai chatbot \(chatgpt\-4\.0 and microsoft copilot\) and expert responses to common orthodontic questions: patient and orthodontist evaluations\.BMC Oral Health25\(1\),pp\. 896\.External Links:[Document](https://dx.doi.org/10.1186/s12903-025-06194-w),[Link](https://doi.org/10.1186/s12903-025-06194-w)Cited by:[§4\.3\.1](https://arxiv.org/html/2606.02914#S4.SS3.SSS1.p2.1),[Table 4](https://arxiv.org/html/2606.02914#S4.T4.3.3.3.3.3.3.3.3.3.1.1)\. - \[92\]B\. Sezer and A\. E\. Okutan\(2025\)Evaluation of chatgpt\-4’s performance on pediatric dentistry questions: accuracy and completeness analysis\.BMC Oral Health25\(1\),pp\. 1427\.External Links:[Document](https://dx.doi.org/10.1186/s12903-025-06791-9)Cited by:[§4\.3\.1](https://arxiv.org/html/2606.02914#S4.SS3.SSS1.p2.1),[Table 4](https://arxiv.org/html/2606.02914#S4.T4.17.17.17.17.17.17.17.17.7.1.1)\. - \[93\]B\. Sezer and T\. 
Aydoğdu\(2025\)Performance of advanced artificial intelligence models in traumatic dental injuries in primary dentition: a comparative evaluation of chatgpt\-4 omni, deepseek, gemini advanced, and claude 3\.7 in terms of accuracy, completeness, response time, and readability\.Applied Sciences15\(14\)\.External Links:[Link](https://www.mdpi.com/2076-3417/15/14/7778),ISSN 2076\-3417,[Document](https://dx.doi.org/10.3390/app15147778)Cited by:[§4\.1\.5](https://arxiv.org/html/2606.02914#S4.SS1.SSS5.p1.1),[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.1.1.1.1.1.1.1.1.1.2.1.1)\. - \[94\]M\. Shirani and M\. Emami\(2025\)Performance comparison of large language models in treatment planning for the restoration of endodontically treated teeth over time\.Journal of Dentistry161,pp\. 105998\.External Links:[Document](https://dx.doi.org/10.1016/j.jdent.2025.105998)Cited by:[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.p3.1),[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.tab1.1.1.1.1.1.1.1.10.9.1.1.1)\. - \[95\]S\. Sismanoglu and B\. S\. Capan\(2025\)Performance of artificial intelligence on turkish dental specialization exam: can chatgpt\-4\.0 and gemini advanced achieve comparable results to humans?\.BMC Medical Education25\(1\),pp\. 214\.External Links:[Document](https://dx.doi.org/10.1186/s12909-024-06389-9)Cited by:[§4\.2\.1](https://arxiv.org/html/2606.02914#S4.SS2.SSS1.p3.1),[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.tab1.1.1.1.1.1.1.1.1.6.5.1.1.1)\. - \[96\]E\. S\. Song, G\. H\. Kim, and S\. Lee\(2026\)Evaluation of gpt\-4o and gemini advanced on the korean national dental licensing examination: accuracy, consistency, and question generation\.Journal of Dental Sciences21\(1\),pp\. 96–102\.External Links:[Document](https://dx.doi.org/10.1016/j.jds.2025.07.020)Cited by:[§4\.2\.1](https://arxiv.org/html/2606.02914#S4.SS2.SSS1.p1.1),[Table 2](https://arxiv.org/html/2606.02914#S4.T2.2.2.2.2.2.2.2.4.1.1.1.1)\. - \[97\]D\. 
Stephan, A\. Bertsch, M\. Burwinkel, S\. Vinayahalingam, B\. Al\-Nawas, P\. W\. Kämmerer, and D\. G\. Thiem\(2024\)AI in dental radiology—improving the efficiency of reporting with chatgpt: comparative study\.Journal of Medical Internet Research26,pp\. e60684\.External Links:[Document](https://dx.doi.org/10.2196/60684),[Link](https://doi.org/10.2196/60684)Cited by:[§4\.3\.2](https://arxiv.org/html/2606.02914#S4.SS3.SSS2.p1.1),[Table 5](https://arxiv.org/html/2606.02914#S4.T5.3.3.3.3.3.3.3.5.1.1.1.1)\. - \[98\]D\. Stephan, A\. S\. Bertsch, S\. Schumacher, B\. Puladi, M\. Burwinkel, B\. Al\-Nawas, P\. W\. Kämmerer, and D\. G\. Thiem\(2025\)Improving patient communication by simplifying ai\-generated dental radiology reports with chatgpt: comparative study\.Journal of Medical Internet Research27,pp\. e73337\.External Links:[Document](https://dx.doi.org/10.2196/73337),[Link](https://doi.org/10.2196/73337)Cited by:[§4\.3\.2](https://arxiv.org/html/2606.02914#S4.SS3.SSS2.p2.1),[Table 5](https://arxiv.org/html/2606.02914#S4.T5.3.3.3.3.3.3.3.3.4.1.1)\. - \[99\]A\. Suárez, Y\. Freire, M\. Suárez, V\. Díaz\-Flores García, C\. Andreu\-Vázquez, I\. J\. Thuissard Vasallo, A\. I\. Castillo Varón, and C\. Martín\(2025\)Diagnostic performance of multimodal large language models in the analysis of oral pathology\.Oral Diseases31\(12\),pp\. 3344–3354\.External Links:[Document](https://dx.doi.org/10.1111/odi.70009)Cited by:[§4\.1\.1](https://arxiv.org/html/2606.02914#S4.SS1.SSS1.p3.1),[Table 1](https://arxiv.org/html/2606.02914#S4.T1.3.1.1.1.1.1.1.3.2.1.1.1)\. - \[100\]M\. Tassoker\(2025\)ChatGPT\-4 omni’s superiority in answering multiple\-choice oral radiology questions\.BMC Oral Health25\(1\),pp\. 173\.External Links:[Document](https://dx.doi.org/10.1186/s12903-025-05554-w)Cited by:[§4\.2\.1](https://arxiv.org/html/2606.02914#S4.SS2.SSS1.p3.1),[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.tab1.1.1.1.1.1.1.1.1.4.3.1.1.1)\. - \[101\]S\. Tayeb, C\. Barausse, G\. 
Pellegrino, M\. Sansavini, R\. Pistilli, and P\. Felice\(2025\)Comparing artificial intelligence \(chatgpt, gemini, deepseek\) and oral surgeons in detecting clinically relevant drug–drug interactions in dental therapy\.Applied Sciences15\(23\),pp\. 12851\.External Links:[Document](https://dx.doi.org/10.3390/app152312851)Cited by:[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.p1.1),[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.tab1.1.1.1.1.1.1.1.7.6.1.1.1)\. - \[102\]K\. Termteerapornpimol, S\. Kulvitit, S\. Prommanee, Z\. Khurshid, and T\. Porntaveetus\(2025\)Comparative benchmark of seven large language models for traumatic dental injury knowledge\.European Journal of Dentistry\.Note:Advance online publicationExternal Links:[Document](https://dx.doi.org/10.1055/s-0045-1812064)Cited by:[§4\.1\.5](https://arxiv.org/html/2606.02914#S4.SS1.SSS5.p3.1),[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.1.1.1.1.1.1.1.1.8.6.1.1.1)\. - \[103\]S\. Tomo, J\. R\. Lechien, H\. S\. Bueno, D\. F\. Cantieri\-Debortoli, and L\. E\. Simonato\(2024\)Accuracy and consistency of chatgpt\-3\.5 and \-4 in providing differential diagnoses in oral and maxillofacial diseases: a comparative diagnostic performance analysis\.Clinical Oral Investigations28\(10\),pp\. 544\.External Links:[Document](https://dx.doi.org/10.1007/s00784-024-05939-1)Cited by:[§4\.1\.1](https://arxiv.org/html/2606.02914#S4.SS1.SSS1.p2.1),[Table 1](https://arxiv.org/html/2606.02914#S4.T1.3.1.1.1.1.1.1.2.1.1.1.1)\. - \[104\]B\. Tosun and Z\. Öztürk\(2025\)Performance of five large language models in managing acute dental pain: a comprehensive analysis\.Turk Endod J10\(1\),pp\. 
39–49\.External Links:[Document](https://dx.doi.org/10.14744/TEJ.2025.27147)Cited by:[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.p2.1),[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.tab1.1.1.1.1.1.1.1.9.8.1.1.1)\. - \[105\]D\. V\. Tuzoff et al\.\(2019\)Tooth detection and numbering in panoramic radiographs using convolutional neural networks\.Dentomaxillofacial Radiology48\(4\),pp\. 20180051\.External Links:[Document](https://dx.doi.org/10.1259/dmfr.20180051)Cited by:[§1](https://arxiv.org/html/2606.02914#S1.p3.1)\. - \[106\]F\. Umer, I\. Batool, and N\. Naved\(2024\-12\)Innovation and application of large language models \(llms\) in dentistry – a scoping review\.BDJ Open10\.External Links:[Document](https://dx.doi.org/10.1038/s41405-024-00277-6)Cited by:[§7\.1](https://arxiv.org/html/2606.02914#S7.SS1.p2.1)\. - \[107\]A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin\(2017\)Attention is all you need\.Advances in Neural Information Processing Systems30\.Cited by:[§1](https://arxiv.org/html/2606.02914#S1.p3.1),[§3\.1](https://arxiv.org/html/2606.02914#S3.SS1.p1.1)\. - \[108\]G\. Virupaiah and A\. Sathyanarayana\(2020\)Analysis of image enhancement techniques for dental caries detection using texture analysis and support vector machine\.International Journal of Applied Science and Engineering17,pp\. 75–86\.External Links:[Link](https://api.semanticscholar.org/CorpusID:231800088)Cited by:[§1](https://arxiv.org/html/2606.02914#S1.p3.1)\. - \[109\]P\. Wang, H\. Gu, and Y\. Sun\(2025\)Tooth segmentation on multimodal images using adapted segment anything model\.Scientific Reports15,pp\.
13874\.External Links:[Document](https://dx.doi.org/10.1038/s41598-025-96301-2)Cited by:[§3\.2](https://arxiv.org/html/2606.02914#S3.SS2.p3.1),[§5\.1](https://arxiv.org/html/2606.02914#S5.SS1.p2.1),[§5\.1](https://arxiv.org/html/2606.02914#S5.SS1.p3.1),[Table 6](https://arxiv.org/html/2606.02914#S5.T6.2.2.2.2.2.2.5.2.1.1.1)\. - \[110\]H\. Watanabe, O\. Uehara, T\. Morikawa, T\. Kojima, T\. Suga, A\. Toyofuku, S\. Takada, and Y\. Abiko\(2025\)Performance of large language models on image\-based oral pathology questions from the japanese national dental examination\.Journal of Dental Sciences\.External Links:ISSN 1991\-7902,[Document](https://dx.doi.org/10.1016/j.jds.2025.08.037),[Link](https://www.sciencedirect.com/science/article/pii/S1991790225003113)Cited by:[§4\.2\.1](https://arxiv.org/html/2606.02914#S4.SS2.SSS1.p2.1),[Table 2](https://arxiv.org/html/2606.02914#S4.T2.2.2.2.2.2.2.2.7.4.1.1.1)\. - \[111\]J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems35,pp\. 24824–24837\.Cited by:[§3\.2](https://arxiv.org/html/2606.02914#S3.SS2.p2.1)\. - \[112\]World Health Organization\(2022\)Global oral health status report: towards universal health coverage for oral health by 2030\.Technical reportWorld Health Organization,Geneva\.External Links:[Link](https://www.who.int/publications/i/item/9789240061484)Cited by:[§1](https://arxiv.org/html/2606.02914#S1.p1.1)\. - \[113\]X\. Wu, G\. Cai, B\. Guo, L\. Ma, S\. Shao, J\. Yu, Y\. Zheng, L\. Wang, and F\. Yang\(2025\)A multi\-dimensional performance evaluation of large language models in dental implantology: comparison of chatgpt, deepseek, grok, gemini and qwen across diverse clinical scenarios\.BMC Oral Health25\(1\),pp\. 
1272\.External Links:[Document](https://dx.doi.org/10.1186/s12903-025-06619-6),[Link](https://doi.org/10.1186/s12903-025-06619-6)Cited by:[§4\.1\.6](https://arxiv.org/html/2606.02914#S4.SS1.SSS6.p1.1),[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.1.1.1.1.1.1.1.1.9.7.1.1.1)\. - \[114\]Y\. Wu, Y\. Zhang, M\. Xu, C\. Jinzhi, Y\. Xue, and Y\. Zheng\(2025\)Effectiveness of various general large language models in clinical consensus and case analysis in dental implantology: a comparative study\.BMC Medical Informatics and Decision Making25\(1\),pp\. 147\.External Links:[Document](https://dx.doi.org/10.1186/s12911-025-02972-2),[Link](https://doi.org/10.1186/s12911-025-02972-2)Cited by:[§4\.1\.6](https://arxiv.org/html/2606.02914#S4.SS1.SSS6.p1.1),[§4\.1\.7](https://arxiv.org/html/2606.02914#S4.SS1.SSS7.tab1.1.1.1.1.1.1.1.2.1.1.1.1)\. - \[115\]Y\. Wu, K\. Tso, and C\. Chiang\(2025\)Performance of chatgpt in answering the oral pathology questions of various types or subjects from taiwan national dental licensing examinations\.Journal of Dental Sciences20\(3\),pp\. 1709–1715\.External Links:[Document](https://dx.doi.org/10.1016/j.jds.2025.03.030)Cited by:[§4\.2\.1](https://arxiv.org/html/2606.02914#S4.SS2.SSS1.p3.1),[Table 2](https://arxiv.org/html/2606.02914#S4.T2.2.2.2.2.2.2.2.8.5.1.1.1)\. - \[116\]B\. B\. Yamaç, R\. Akçar, İ\. E\. Şarkan, and C\. Arslan\(2025\)Assessment of the information quality of chatbot technologies on orthodontic miniscrews\.Dental Press Journal of Orthodontics30\(5\),pp\. e2524255\.External Links:[Document](https://dx.doi.org/10.1590/2177-6709.30.5.e2524255.oar),[Link](https://doi.org/10.1590/2177-6709.30.5.e2524255.oar)Cited by:[§4\.3\.1](https://arxiv.org/html/2606.02914#S4.SS3.SSS1.p2.1),[Table 4](https://arxiv.org/html/2606.02914#S4.T4.4.4.4.4.4.4.4.4.2.1.1)\. - \[117\]B\. E\. Yilmaz, B\. N\. Gokkurt Yilmaz, and F\. 
Ozbey\(2025\)Artificial intelligence performance in answering multiple\-choice oral pathology questions: a comparative analysis\.BMC Oral Health25\(1\),pp\. 573\.External Links:[Document](https://dx.doi.org/10.1186/s12903-025-05926-2),[Link](https://doi.org/10.1186/s12903-025-05926-2)Cited by:[§4\.2\.1](https://arxiv.org/html/2606.02914#S4.SS2.SSS1.p3.1),[§4\.2\.2](https://arxiv.org/html/2606.02914#S4.SS2.SSS2.tab1.1.1.1.1.1.1.1.1.2.1.1.1.1)\. - \[118\]B\. Zhang, Y\. Miao, T\. Wu, T\. Chen, J\. Jiang, Z\. Li, Z\. Tang, L\. Yu, and J\. Su\(2025\)ArchMap: arch\-flattening and knowledge\-guided vision language model for tooth counting and structured dental understanding\.In2025 IEEE International Conference on Big Data \(BigData\),pp\. 7529–7538\.External Links:[Document](https://dx.doi.org/10.1109/BigData66926.2025.11402150)Cited by:[§6\.3](https://arxiv.org/html/2606.02914#S6.SS3.p2.1),[Table 7](https://arxiv.org/html/2606.02914#S6.T7.3.1.1.1.1.1.1.9.8.1.1.1),[§7\.2](https://arxiv.org/html/2606.02914#S7.SS2.p3.1),[§7\.3](https://arxiv.org/html/2606.02914#S7.SS3.p4.1)\. - \[119\]J\. Zhang, B\. Du, Y\. Miao, D\. Sun, and X\. Cao\(2025\)OralGPT: a two\-stage vision\-language model for oral mucosal disease diagnosis and description\.arXiv preprint arXiv:2510\.13911\.Cited by:[§1](https://arxiv.org/html/2606.02914#S1.p4.1),[§6\.2](https://arxiv.org/html/2606.02914#S6.SS2.p5.1),[Table 7](https://arxiv.org/html/2606.02914#S6.T7.3.1.1.1.1.1.1.4.3.1.1.1)\. - \[120\]J\. Zhang, M\. Lin, H\. Hou, B\. Sun, F\. Hu, Y\. Yu, and M\. Li\(2025\)EASAM: an edge\-aware sam\-based paradigm for tooth segmentation\.Signal, Image and Video Processing19,pp\. 
673\.External Links:[Document](https://dx.doi.org/10.1007/s11760-025-04208-2),[Link](https://doi.org/10.1007/s11760-025-04208-2)Cited by:[§1](https://arxiv.org/html/2606.02914#S1.p4.1),[§5\.1](https://arxiv.org/html/2606.02914#S5.SS1.p2.1),[§5\.1](https://arxiv.org/html/2606.02914#S5.SS1.p4.1),[Table 6](https://arxiv.org/html/2606.02914#S5.T6.2.2.2.2.2.2.4.1.1.1.1)\. - \[121\]Q\. Zhang, Z\. Wu, J\. Song, S\. Luo, and Z\. Chai\(2025\)Comprehensiveness of large language models in patient queries on gingival and endodontic health\.International Dental Journal75\(1\),pp\. 151–157\.External Links:[Document](https://dx.doi.org/10.1016/j.identj.2024.06.022)Cited by:[§4\.3\.1](https://arxiv.org/html/2606.02914#S4.SS3.SSS1.p2.1),[Table 4](https://arxiv.org/html/2606.02914#S4.T4.10.10.10.10.10.10.10.10.5.1.1)\. - \[122\]K\. X\. Zhou\(2024\)Introducing clinicgpt: a custom large language model for institutional dental clinics\.Journal of Dental Education88\(S3\),pp\. 1979–1981\.External Links:[Document](https://dx.doi.org/10.1002/jdd.13348),[Link](https://onlinelibrary.wiley.com/doi/abs/10.1002/jdd.13348)Cited by:[§6\.3](https://arxiv.org/html/2606.02914#S6.SS3.p4.1),[Table 7](https://arxiv.org/html/2606.02914#S6.T7.3.1.1.1.1.1.1.11.10.1.1.1)\.