Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models
Summary
This paper presents a large-scale assessment of medical LLMs, including custom MedGPTs and open-source models, finding 25-30% exhibit low factual accuracy and 33.6-54.3% violate operational thresholds, highlighting systemic safety risks.
Cached at: 05/21/26, 06:34 AM
# Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models
Source: [https://arxiv.org/html/2605.20591](https://arxiv.org/html/2605.20591)
###### Abstract
Medical large language models \(LLMs\), including custom medical GPTs \(MedGPTs\) and open\-source models, are increasingly deployed on web platforms to provide clinical guidance\. However, they pose risks of hallucination, policy noncompliance, and unsafe design\. We conduct a large\-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open\-source LLMs\. We introduce two frameworks: MedGPT\-HEval for hallucination detection and an LLM\-based pipeline for assessing policy violations and developer intent\. Our results show that 25–30% of MedGPTs exhibit low factual accuracy, with bottom\- and middle\-tier models at highest risk; 33\.6–54\.3% violate operational thresholds, and 57\.06% of Action\-enabled models lack adequate privacy disclosures\. Compared with open\-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open\-source models are more stable\. These results reveal systemic gaps in hallucination and compliance, highlighting the need for multi\-metric evaluation and stronger safeguards\. We release HAA\-MedGPT, a structured dataset that supports future research on the safety of web\-facing medical LLMs\.
## I Introduction
Large language models \(LLMs\) are increasingly deployed as user\-centric applications on commercial platforms, where users interact with configurable agents, such as those in the OpenAI store\. Among these, custom\-built medical GPTs \(MedGPTs\) provide diagnostic suggestions, health advice, treatment explanations, and symptom checks\[[5](https://arxiv.org/html/2605.20591#bib.bib21)\]\. The GPT Store\[[36](https://arxiv.org/html/2605.20591#bib.bib137)\]hosts thousands of such models, marketed for consultation, triage, and education\[[46](https://arxiv.org/html/2605.20591#bib.bib47)\]\. Despite automated and human reviews, unsafe or abusive MedGPTs continue to appear\[[34](https://arxiv.org/html/2605.20591#bib.bib45)\], underscoring persistent safety risks in platform\-level deployments\. In parallel, open\-source medical LLMs such as Galactica\[[48](https://arxiv.org/html/2605.20591#bib.bib113)\], PMC\-LLaMA\[[53](https://arxiv.org/html/2605.20591#bib.bib116)\], and MedAlpaca\[[18](https://arxiv.org/html/2605.20591#bib.bib115)\]prioritize transparency and flexibility\. However, they often fall short on diagnostic performance\[[16](https://arxiv.org/html/2605.20591#bib.bib675),[3](https://arxiv.org/html/2605.20591#bib.bib56),[38](https://arxiv.org/html/2605.20591#bib.bib57)\], showing a trade\-off between accessibility and clinical accuracy\. These ecosystems illustrate a key tension in medical LLM deployment: platform\-hosted MedGPTs deliver high factual accuracy but carry safety risks, while open\-source models offer transparency but may lack reliability\. This motivates systematic evaluation of factual correctness, consistency, and safety in medical LLMs\.
MedGPTs expose users to two critical but often overlooked risks\. The first is clinical hallucination – confident but false or fabricated medical information\[[22](https://arxiv.org/html/2605.20591#bib.bib18),[7](https://arxiv.org/html/2605.20591#bib.bib20)\], such as unsafe treatments or incorrect drug advice\. The second is design\-level abuse, where GPT creators violate OpenAI’s privacy policies\[[41](https://arxiv.org/html/2605.20591#bib.bib201)\]or manipulate users through deceptive names, descriptions, or by using tools that circumvent safeguards\. These risks are amplified by platform trust indicators \(such as ratings and reviews\) that falsely legitimize unsafe GPTs\. Harm arises from both outputs and model configuration\. Developers craft names, descriptions, and conversation starters that convey trust or authority and may enable tools such as web browsing\[[21](https://arxiv.org/html/2605.20591#bib.bib76)\], while linking to vague or broken privacy policies\. These surface\-level cues shape users’ trust and introduce risk before they interact with models\. Unlike general domains, where users can spot LLM errors, patients often lack the medical knowledge to evaluate LLM\-generated advice\[[26](https://arxiv.org/html/2605.20591#bib.bib32)\]\.
Recent studies show that clinical hallucinations persist in LLMs despite domain tuning and expert evaluation\. Kim et al\.\[[26](https://arxiv.org/html/2605.20591#bib.bib32)\] proposed a hallucination taxonomy and found that Med\-PaLM\[[24](https://arxiv.org/html/2605.20591#bib.bib14)\] and GPT\-4\[[35](https://arxiv.org/html/2605.20591#bib.bib9)\] still produce misleading outputs, even with retrieval\-augmented prompting\. Asgari et al\.\[[6](https://arxiv.org/html/2605.20591#bib.bib15)\] similarly observed factually unsupported but fluent medical claims from GPT\-4, raising concerns about trust and verifiability\. Broader GPT ecosystem studies have focused on jailbreaks, usage patterns, and configuration extraction\[[34](https://arxiv.org/html/2605.20591#bib.bib45),[21](https://arxiv.org/html/2605.20591#bib.bib76),[60](https://arxiv.org/html/2605.20591#bib.bib227)\], but treat these issues as isolated failures\. No prior work has systematically analyzed clinical hallucination or actor\-level abuse in real\-world MedGPTs on web GPT marketplaces\[[11](https://arxiv.org/html/2605.20591#bib.bib228),[12](https://arxiv.org/html/2605.20591#bib.bib8),[36](https://arxiv.org/html/2605.20591#bib.bib137)\]\. To fill this gap, we introduce HAA\-MedGPT, the first large\-scale dataset and evaluation framework for detecting hallucinations and intent\-level misuse in MedGPTs\. Our approach combines multi\-metric scoring with policy\-aligned analysis to uncover structural risks both in model outputs and in how models are built, presented, and deployed\.
We crawled metadata from 6,233 MedGPTs on the OpenAI Store\. We selected a stratified sample of 1,500 models to balance platform coverage while respecting OpenAI’s query limits and maintaining responsible, non\-disruptive interaction with the GPT Store\. Models were grouped by conversation count into three tiers: Top 1000, Middle 250 \(random\), and Bottom 250\. Each GPT was evaluated using standardized clinical prompts and a multi\-metric scoring framework\. In parallel, we included 10 open\-source medical LLMs – such as Galactica\[[48](https://arxiv.org/html/2605.20591#bib.bib113)\], PMC\-LLaMA\[[53](https://arxiv.org/html/2605.20591#bib.bib116)\], and MedAlpaca\[[18](https://arxiv.org/html/2605.20591#bib.bib115)\]– in the analysis to compare platform\-hosted models with transparent, community\-driven alternatives\. We also assessed actor\-level risk using a rubric\-based analysis of names, descriptions, conversation starters, and policy statements, enabling a detailed evaluation of platform\-hosted and open\-source medical LLMs\.
Unlike prior studies that examine hallucination in base LLMs\[[6](https://arxiv.org/html/2605.20591#bib.bib15)\]or misuse in general GPT applications\[[46](https://arxiv.org/html/2605.20591#bib.bib47)\], our work is the first to empirically analyze deployed, user\-facing, healthcare GPTs at web scale, capturing both content\-level hallucinations and developer\-driven misuse signals\. We ask the following four core questions\.
- RQ1: How do rates of clinical hallucination in MedGPTs vary across popularity tiers on the OpenAI Store?
- RQ2: Are users able to discern hallucinated content in MedGPTs’ outputs?
- RQ3: How do developer\-defined design choices enable abusive behaviors or weaken privacy safeguards in MedGPTs?
- RQ4: How do clinical hallucinations differ between MedGPTs and open\-source medical LLMs?
To assess RQ1, we introduce MedGPT\-HEval \(§[V\-A](https://arxiv.org/html/2605.20591#S5.SS1)\), a structured multi\-metric framework for evaluating clinical hallucination in MedGPTs\. We queried each model using a clinical scenario extracted from the MedQA benchmark\[[23](https://arxiv.org/html/2605.20591#bib.bib112)\] and repeated the same question five times to obtain multiple responses\. This approach captures variability in the model’s outputs and allows for a more robust assessment of hallucination\. The responses were then evaluated using four metrics: G\-Eval\[[54](https://arxiv.org/html/2605.20591#bib.bib110)\], BARTScore\[[58](https://arxiv.org/html/2605.20591#bib.bib109)\], semantic entropy\[[31](https://arxiv.org/html/2605.20591#bib.bib108)\], and cosine similarity\[[27](https://arxiv.org/html/2605.20591#bib.bib107)\], capturing both factual alignment and consistency\. We found that 25–30% of MedGPTs across top, middle, and bottom tiers score below 0\.8 in G\-Eval, with only 37\.27% achieving BARTScore $\geq -3.5$ and 41\.07% reaching cosine similarity $\geq 0.4$\. At the same time, 59\.87% have semantic entropy less than 2, indicating moderate response stability\. Bottom\- and middle\-tier MedGPTs pose the highest hallucination risk and weakest contextual alignment\. This shows that model popularity is not a reliable indicator of factual accuracy or safety in web\-deployed medical GPTs\.
To evaluate RQ2, we analyzed conversation volume, star ratings, and review sentiment for the Top 1000 MedGPTs \(§[V\-B](https://arxiv.org/html/2605.20591#S5.SS2)\)\. Analysis shows that user engagement does not reflect awareness of clinical hallucinations, with near\-zero correlations for conversation counts \(G\-Eval: \-0\.0347; BART: \-0\.0196; entropy: 0\.0057; cosine: \-0\.0318\) and reviews \(\-0\.0449 to 0\.0732\)\. In contrast, reviews correlate strongly with usage \($r = 0.9999$ for positive reviews, $r = 0.9656$ for negative reviews\)\. This indicates that user feedback reflects activity rather than accuracy, revealing a gap between perceived trust and actual reliability in web\-deployed MedGPTs\.
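The correlation analysis above can be illustrated with a minimal sketch\. It assumes Pearson's correlation coefficient, which the $r$ notation suggests but the text does not state explicitly; the function name and the toy values are hypothetical and are not taken from the dataset\.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-deviation and squared deviations around each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data (illustrative only, not from HAA-MedGPT).
conversations = [100, 200, 300, 400]
positive_reviews = [10, 20, 30, 40]
usage_vs_reviews = pearson_r(conversations, positive_reviews)
```

A value near 1 means review volume simply tracks usage, whereas a value near 0 between usage and a hallucination metric means popularity carries little information about accuracy\.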
To assess RQ3, we applied an automated scoring pipeline based on OpenAI’s operationalized usage policies\. We used K\-means clustering to determine a cut\-off to classify MedGPTs as compliant or noncompliant \(§[V\-C](https://arxiv.org/html/2605.20591#S5.SS3)\)\. Misuse is common among MedGPTs: 54\.3% of Top 1000, 48\.0% of Middle 250, and 33\.6% of Bottom 250 models exceed the 0\.45 risk threshold, and two\- or three\-violation cases affect 33–64% across popularity tiers\. Privacy and compliance gaps are also prevalent among 170 MedGPTs with Actions enabled \(§[V\-D](https://arxiv.org/html/2605.20591#S5.SS4)\): only 42\.94% had accessible privacy policies \(57\.06% lacked documentation\) and nearly 70% of extracted policies scored below the threshold, exposing users to unsafe data practices and regulatory violations\.
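As an illustration of deriving a compliance cut\-off by clustering, the sketch below runs a two\-cluster, one\-dimensional K\-means over risk scores and takes the midpoint between the two centroids as the threshold\. The number of clusters, the midpoint rule, and the function name are assumptions made for illustration; the text only states that K\-means was used to determine the cut\-off\.

```python
def kmeans_1d_cutoff(scores, iterations=100):
    """Two-cluster 1-D k-means; returns the midpoint between the centroids."""
    if len(set(scores)) < 2:
        raise ValueError("need at least two distinct scores")
    low, high = min(scores), max(scores)  # initial centroids
    for _ in range(iterations):
        boundary = (low + high) / 2
        # Assign each score to the nearer centroid.
        low_cluster = [s for s in scores if s <= boundary]
        high_cluster = [s for s in scores if s > boundary]
        new_low = sum(low_cluster) / len(low_cluster)
        new_high = sum(high_cluster) / len(high_cluster)
        if new_low == low and new_high == high:
            break  # converged
        low, high = new_low, new_high
    return (low + high) / 2


# Hypothetical risk scores in [0, 1] (illustrative only).
risk_scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
cutoff = kmeans_1d_cutoff(risk_scores)
noncompliant = [s for s in risk_scores if s > cutoff]
```

Under this assumed rule, a model counts as noncompliant when its risk score exceeds the learned cut\-off\.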
To assess RQ4, we investigate clinical hallucination in the open\-source medical LLMs and compare the results with the MedGPTs \(§[V\-E](https://arxiv.org/html/2605.20591#S5.SS5)\)\. Open\-source medical LLMs show a trade\-off between accuracy and consistency: Galactica \(G\-Eval: 0\.6480\) and Aloe\-Alpha \(G\-Eval: 0\.5948\) have the highest factual accuracy, while MedAlpaca \(0\.4863\), Apollo \(0\.4863\), and MentalHealthChatbot \(0\.4354\) achieve stronger semantic alignment, and BioMistral \(2\.3978\) is the most variable\. Compared with MedGPTs, which exhibit higher G\-Eval \(0\.9238\) and cosine similarity \(0\.4054\) but greater entropy \(1\.9272\), these results indicate that MedGPTs provide superior factuality and semantic coherence\. In contrast, open\-source models remain more stable and predictable, underscoring the need for multi\-metric evaluation of hallucination risk\.
Our Contributions\. Specifically, the contributions of this paper are as follows\.
- Web\-scale audit of MedGPTs\. We design and deploy the first scalable measurement pipeline for auditing web\-deployed medical LLM ecosystems, combining automated discovery, interaction\-based inference probing, metadata extraction, and policy\-compliance analysis across 6,233 deployed agents\.
- Dual\-layer safety analysis\. We introduce MedGPT\-HEval for clinical hallucination detection and a complementary actor\-level misuse evaluator, jointly exposing risks overlooked in prior work\.
- Evidence of structural governance failures\. We show that platform trust indicators \(ratings, reviews, conversation counts\) do not correlate with safety, and that 49\.8% of models violate operational policy\.
- Privacy\-risk quantification\. We provide the first empirical assessment of privacy policy alignment for Action\-enabled GPTs, showing that 57\.06% lack functional policy disclosures\.
- Public dataset and tools\. We release HAA\-MedGPT \([https://anonymous\.4open\.science/r/HAA\-MedGPT\-2E78](https://anonymous.4open.science/r/HAA-MedGPT-2E78)\), the first large\-scale dataset of 6,233 custom web\-deployed MedGPTs from the OpenAI Store, enabling future Web\-safety, platform\-governance, and public\-health research\.
## II Related Work
In this section, we discuss previous studies in the literature that are related to our work\.
### II\-A LLM Deployment in Healthcare
Recent studies have examined LLMs in clinical settings, focusing primarily on foundation models or isolated tasks in controlled environments\. Ahmed et al\.\[[4](https://arxiv.org/html/2605.20591#bib.bib6)\] and Shekhar et al\.\[[45](https://arxiv.org/html/2605.20591#bib.bib5)\] explored ChatGPT’s potential in cardiovascular care and ambulance triage, respectively, but their works remain conceptual\. Benchmarking efforts by Gumilar et al\.\[[15](https://arxiv.org/html/2605.20591#bib.bib59)\] and Pagano et al\.\[[38](https://arxiv.org/html/2605.20591#bib.bib57)\] evaluated LLMs such as GPT\-4, GPT\-4o, LLaMA\-3\.1, and Copilot on oncology and orthopedics, focusing solely on accuracy\. Extensive reviews\[[20](https://arxiv.org/html/2605.20591#bib.bib51),[32](https://arxiv.org/html/2605.20591#bib.bib64)\] raised concerns around hallucination and ethical opacity, but lacked empirical deployment analysis\. Governance frameworks such as Health\-LLM\[[56](https://arxiv.org/html/2605.20591#bib.bib54)\] and Polaris\[[33](https://arxiv.org/html/2605.20591#bib.bib62)\] offer safety\-aware designs but operate in tightly controlled or simulated environments\. To our knowledge, none of these efforts evaluates the real\-world risks of custom MedGPTs\. In contrast, our work systematically evaluates deployed MedGPTs for hallucination and actor\-driven abuse\.
### II\-B Clinical Hallucinations in LLMs
Several studies examine hallucinations in medical LLMs using expert benchmarks, taxonomies, and error analyses\. Kim et al\.\[[25](https://arxiv.org/html/2605.20591#bib.bib16)\] proposed a typology \(diagnostic, factual, outdated\) across models \(e\.g\., GPT\-4o, PMC\-LLaMA\[[53](https://arxiv.org/html/2605.20591#bib.bib116)\], MedAlpaca\-13B\[[18](https://arxiv.org/html/2605.20591#bib.bib115)\]\)\. Asgari et al\.\[[6](https://arxiv.org/html/2605.20591#bib.bib15)\] and Vishwanarth et al\.\[[50](https://arxiv.org/html/2605.20591#bib.bib37)\] built clinician\-in\-the\-loop audit tools \(e\.g\., CREOLA\) for summarization\. Qin et al\.\[[40](https://arxiv.org/html/2605.20591#bib.bib38)\] used entropy\-based dialogue modeling to address misinformation but did not target LLM\-originated hallucinations or unsafe design\. Zhu et al\.\[[62](https://arxiv.org/html/2605.20591#bib.bib35)\] offered a unified taxonomy linking hallucinations to data, training, and inference, but not to deployed GPTs or developer misuse\. Agarwal et al\.\[[1](https://arxiv.org/html/2605.20591#bib.bib1729)\]’s MEDHALU emphasizes output faithfulness over author intent\. Collectively, these studies frame hallucination as a content problem, overlooking structural risks tied to GPT authorship and deployment\. Our work fills this gap by analyzing both output\-level hallucinations and actor\-driven factors in deployed MedGPTs\.
### II\-C Actor\-Level Abuse and Intent in GPT Deployments
Recent large\-scale audits probe misuse in public LLM apps but rarely target healthcare\. Zhang et al\.\[[59](https://arxiv.org/html/2605.20591#bib.bib43)\]analyzed 10,000 GPT apps for prompt and configuration leaks; Hou et al\.\[[21](https://arxiv.org/html/2605.20591#bib.bib76)\]surveyed 786,000 apps, revealing data over\-collection and deceptive logic for general\-purpose abuse\. Shen et al\.\[[46](https://arxiv.org/html/2605.20591#bib.bib47)\]’s GPTracker flagged over 2,000 Store violations without assessing hallucination or health\-specific harm\. Rodríguez et al\.\[[41](https://arxiv.org/html/2605.20591#bib.bib201)\]proposed red\-teaming compliance but focused on non\-medical, single\-turn prompts\. These studies demonstrate scalable assessment methods but overlook how developer choices create clinical risk\. Our work pairs hallucination detection with design\-level analysis to expose how author decisions drive real\-world clinical failures\.
In summary, prior work either examines hallucinations in base models or analyzes generic misuse, without addressing risks in custom MedGPTs\. No existing study evaluates how developer choices drive clinical hallucination or abuse\. To fill this gap, we present HAA\-MedGPT, the first large\-scale dataset and framework for detecting Hallucination and Actor\-level Abuse in OpenAI\-deployed Medical GPTs\. In the next section, we detail our data collection and analysis methods\.
## III HAA\-MedGPT: Dataset for Hallucination and Actor\-level Abuse Detection in MedGPTs
In this section, before discussing the data collection and analysis methods, we briefly describe clinical hallucination and actor\-abuse vectors, and conclude by highlighting the ethical considerations\.
### III\-A Clinical Hallucination in Medical LLMs
Clinical hallucinations are outputs from medical LLMs that are fluent but factually incorrect, incomplete, or irrelevant\[[25](https://arxiv.org/html/2605.20591#bib.bib16),[42](https://arxiv.org/html/2605.20591#bib.bib33)\]\. These errors fall into two types: factuality \(false, outdated, or unverifiable content\) and faithfulness \(contradictions or deviations from the user prompt\)\[[43](https://arxiv.org/html/2605.20591#bib.bib130)\]\. Unlike obvious falsehoods in general LLMs, clinical hallucinations use domain\-specific language and coherent logic, making inaccuracies appear plausible\[[22](https://arxiv.org/html/2605.20591#bib.bib18)\]\. This increases their danger, as such errors can misguide diagnoses, treatments, or prescriptions\[[25](https://arxiv.org/html/2605.20591#bib.bib16)\]\. Prior work has developed taxonomies, benchmarks\[[25](https://arxiv.org/html/2605.20591#bib.bib16),[1](https://arxiv.org/html/2605.20591#bib.bib1729)\], and tools like CREOLA\[[6](https://arxiv.org/html/2605.20591#bib.bib15)\]to assess impact\. However, detection remains difficult, especially for non\-experts, since subtle flaws often pass unnoticed\[[6](https://arxiv.org/html/2605.20591#bib.bib15)\]\. When accepted uncritically, these outputs erode trust and pose real clinical risks\[[39](https://arxiv.org/html/2605.20591#bib.bib1728)\]\. As medical LLMs gain real\-world use, addressing hallucinations is not just a technical task but a clinical safety imperative\.
### III\-B Actor\-level Abuse in MedGPTs
Actor\-level abuse refers to intentional design choices by GPT creators that predispose MedGPTs to unsafe or unethical use before any user interaction\. Unlike hallucinations, which emerge during generation, these risks are embedded in metadata–names, descriptions, conversation starters, and privacy policies–which shape user expectations and may falsely signal clinical authority\. Prior work has shown that such configurations can encode malicious intent and bypass safeguards\[[46](https://arxiv.org/html/2605.20591#bib.bib47),[21](https://arxiv.org/html/2605.20591#bib.bib76),[41](https://arxiv.org/html/2605.20591#bib.bib201)\]\. In healthcare, this risk is magnified\. For example, models named “Instant Diagnosis AI” or “Cancer Risk Estimator” may promote unverified treatments or simulate diagnostic authority, encouraging unsafe reliance without disclaimers or expert review\. The threat grows with Actions\-enabled GPTs, which connect to external APIs and process real\-time data\. As public\-facing MedGPTs proliferate, detecting and mitigating actor\-level abuse is essential for safety, trust, and responsible deployment\.
### III\-C Open\-source Medical LLMs
Recently, there has been significant interest in developing and deploying LLMs tailored to medical domains\. These models show promise for reliable diagnosis and clinical decision\-making\[[61](https://arxiv.org/html/2605.20591#bib.bib50)\]\. In this work, we considered 10 open\-source medical LLMs, which differ in training data, fine\-tuning methods, and evaluation benchmarks, as shown in Table [I](https://arxiv.org/html/2605.20591#S3.T1)\. These models were selected for their popularity, recency, and support for text generation\.
TABLE I: Overview of open\-source medical LLMs\[[19](https://arxiv.org/html/2605.20591#bib.bib100),[30](https://arxiv.org/html/2605.20591#bib.bib101)\]\. Here, PT – Pre\-training, SFT – Supervised Fine\-tuning, DPO – Direct Preference Optimization, RLHF – Reinforcement Learning with Human Feedback\.
### III\-D Data Collection Pipeline
To assess the landscape of MedGPTs, we developed a scalable extractor to overcome the visibility limitations of the OpenAI GPT Store, which provides access only to a curated subset\. Since GPTs are not fully indexed, our extractor used OpenAI’s public search API to collect metadata at scale\. We developed a domain\-specific taxonomy of 171 medical keywords spanning specialties, roles, conditions, tasks, and general terminology\. This taxonomy, generated via a Python\-based deduplication pipeline, ensures broad and consistent retrieval of medically relevant GPTs\. Using a Selenium\-based crawler, we automated searches for each keyword\. We extracted metadata fields, including model name, description, author, category, conversation starters, reviews, and capabilities \(e\.g\., browsing, code interpreter, Actions\)\. The pipeline ran from January 20–22, 2026, yielding 9,245 unique GPTs\. This dataset supports our large\-scale analysis of hallucination and actor\-level abuse in deployed MedGPTs\.
### III\-E Data Analysis
#### III\-E1 Metadata Curation
To ensure reliability and domain specificity, we manually reviewed all 9,245 collected GPTs based on their names, descriptions, and conversation starters, excluding 3,012 irrelevant entries\. This yielded a final set of 6,233 \($\approx$67\.42%\) MedGPTs relevant to healthcare contexts\. We then cleaned the metadata by removing formatting artifacts \(e\.g\., extraneous “\+” symbols\) and standardizing numeric shorthand \(e\.g\., “10K” to 10000, “1M” to 1000000\)\. The resulting corpus provides a structured, machine\-readable record of MedGPTs, including creator IDs, usage metrics, capabilities, and interface elements\. This curated dataset underpins our analyses of hallucination, actor\-level misuse, and policy compliance across deployed MedGPTs in the OpenAI Store\.
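The numeric clean\-up step can be sketched as follows\. This is a minimal illustration under assumptions: the function name, the handling of decimal values such as “1\.5K”, and the removal of thousands separators are hypothetical choices, not the curation code itself\.

```python
import re

# Multipliers for the shorthand suffixes mentioned in the text.
_SUFFIX_MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def normalize_count(raw):
    """Convert strings such as '10K+', '1M', or '250' to an integer."""
    # Drop the extraneous '+' artifact and any thousands separators.
    text = raw.strip().replace("+", "").replace(",", "").upper()
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KM]?)", text)
    if match is None:
        raise ValueError(f"unrecognized count format: {raw!r}")
    number, suffix = match.groups()
    return round(float(number) * _SUFFIX_MULTIPLIERS.get(suffix, 1))
```

Rejecting unrecognized formats with an error, rather than silently returning zero, keeps malformed entries visible during curation\.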
#### III\-E2 Overall Distribution
As shown in Table[II](https://arxiv.org/html/2605.20591#S3.T2), usage and feedback data for MedGPTs in the OpenAI Store reveal major disparities in visibility and oversight\. The vast majority \(85\.58%, 5,334 GPTs\) have under 100 conversations, and 90\.81% \(5,660 GPTs\) lack any user ratings, suggesting minimal public scrutiny\. Engagement is highly skewed: only 15 GPTs have more than 100,000 interactions, and just 3 have more than 5,000 reviews\. Even among the reviewed GPTs, quality indicators are weak: only 571 \(9\.16%\) average above 3\.0 stars, raising concerns about reliability\. Based on review and conversation counts, we estimate total interactions range from 148,761 to 11\.40 million\. This imbalance reveals a systemic blind spot, where thousands of MedGPTs remain untested, unvetted, and potentially unsafe\. In clinical contexts, where inaccuracies can harm patients, the lack of community feedback makes passive moderation inadequate\. These findings underscore the urgent need for proactive safety and vulnerability assessments tailored to MedGPTs\.
#### III\-E3 Authorship
Fig\. [1](https://arxiv.org/html/2605.20591#S4.F1) shows that medical GPT authorship in the OpenAI Store is highly unbalanced\. Most developers have published only one GPT: these 3,346 single\-GPT authors account for 53\.68% of all MedGPTs, indicating many are isolated or experimental efforts\. In contrast, a small number of creators dominate the ecosystem – one author alone published 690 GPTs \(11\.07%\), and the top 20 account for 1,620 GPTs \(26% of the total\) \(Fig\. [2](https://arxiv.org/html/2605.20591#S4.F2)\)\. This concentration raises safety and governance concerns\. While some high\-volume contributors may aim to expand content, the platform provides no visibility into their identity, expertise, or quality control\. Without transparency or verification, mass\-produced GPTs \(especially in healthcare\) risk spreading low\-quality or misleading advice unchecked\.
TABLE II: Distribution of MedGPTs metadata\.
#### III\-E4 Capabilities
Table [III](https://arxiv.org/html/2605.20591#S4.T3) shows a wide variation in medical GPT capabilities, raising safety and privacy concerns in clinical contexts\. While all models support basic conversation, many include advanced tools\. Web Search is most common \(71\.15%\), enabling real\-time external queries\. Over half \(53\.63%\) support DALL·E image generation, 30\.16% support code interpretation for data analysis or computation, and 19\.27% support canvas, enabling collaborative writing and coding tasks\. Actions, which call external APIs, account for just 2\.73%, likely because they are more sensitive\. GPT\-4o’s image tool appears in only 14\.13%, likely due to its recent launch\. Nearly 25% are text\-only, while just 0\.08% support all six features\. Most fall in the mid\-range: 32\.547% support two tools, 29\.41% support three\. This trend toward moderately equipped but increasingly connected models elevates privacy risks, especially when tools process sensitive health data\. Safety analysis must therefore examine both outputs and underlying technical configurations\.
#### III\-E5 Ethical Considerations
We conducted this assessment in accordance with data ethics, platform policies, and institutional standards\. All data were sourced from publicly available metadata on the OpenAI GPT Store, including GPT names, descriptions, conversation starters, categories, usage metrics, ratings, reviews, and capabilities\. No user conversations, personal data, or private content were accessed\. Engagement metrics \(e\.g\., review and conversation counts\) were treated as aggregate and non\-identifiable, with no attempt to trace or link data to individuals\. Actor\-level abuse was evaluated using criteria derived from OpenAI’s published policies, applied only to public GPT metadata and accessible privacy statements\. We accessed no gated content, and data collection was performed via rate\-limited queries in compliance with OpenAI’s usage guidelines\. All data are stored on institution\-approved servers with access limited to authorized researchers\.
## IV Measurement Methodology
In this section, we introduce two frameworks to evaluate clinical hallucination and actor\-intent misuse in MedGPTs across popularity tiers in the OpenAI Store\.
To support large\-scale interaction, we developed an automated tool using Selenium\[[44](https://arxiv.org/html/2605.20591#bib.bib217)\] to simulate user behavior through the web interface\. This allowed us to prompt GPTs and collect responses as users would\. Given OpenAI’s account rate limits, we used both ChatGPT Plus and ChatGPT Pro subscriptions, allowing around 100 prompts every three hours\. \(Footnote: Auditing all 6,233 models is practically and ethically infeasible under platform rate limits; our stratified design ensures representative coverage while following responsible Web auditing norms\.\) Testing all 6,233 MedGPTs was infeasible, as it would take over a year\. Instead, we sampled 1,500 models: the top 1,000 by conversation count, 250 from the middle tier, and 250 from the bottom tier, capturing behavior across the usage spectrum\.
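The tiered sampling can be sketched as below\. This is an illustration under assumptions: the record layout \(a `conversations` field\), the definition of the middle pool as everything between the top and bottom tiers, and the fixed random seed are hypothetical; the text specifies only the tier sizes and that the middle tier is drawn at random\.

```python
import random


def stratified_sample(models, top_n=1000, middle_n=250, bottom_n=250, seed=0):
    """Split models into top / middle (random) / bottom tiers by conversation count."""
    if len(models) < top_n + middle_n + bottom_n:
        raise ValueError("not enough models for the requested tier sizes")
    # Rank from most to least used.
    ranked = sorted(models, key=lambda m: m["conversations"], reverse=True)
    top = ranked[:top_n]
    bottom = ranked[-bottom_n:] if bottom_n else []
    # Assumed middle pool: everything not in the top or bottom tier.
    middle_pool = ranked[top_n:len(ranked) - bottom_n]
    middle = random.Random(seed).sample(middle_pool, middle_n)
    return {"top": top, "middle": middle, "bottom": bottom}
```

Seeding the random generator makes the middle\-tier draw repeatable, which helps when the same sample must be re\-queried across sessions\.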
Following prior work\[[43](https://arxiv.org/html/2605.20591#bib.bib130)\], we restrict our interaction with LLMs – both MedGPTs and open\-source models – to inference\-only queries, excluding parameters such as temperature, gradients, or log\-probabilities\. This configuration mirrors the type of access available to typical users or patients, ensuring that our evaluation reflects realistic usage conditions\.
Figure 1: ECDF of number of MedGPTs per author/developer\.
Figure 2: Top 20 authors of MedGPTs\.
TABLE III: Capabilities of MedGPTs\.
### IV\-A MedGPT\-HEval: A Clinical Hallucination Evaluation Framework
We introduce MedGPT\-HEval, a structured multi\-metric framework for evaluating clinical hallucination in MedGPTs\. We queried each GPT with diagnostic questions from the MedQA benchmark\[[23](https://arxiv.org/html/2605.20591#bib.bib112)\], framed within a clinical vignette to ensure contextual consistency\. To capture variability in LLM outputs, we repeated each question five times and evaluated the resulting responses using four complementary metrics: G\-Eval\[[54](https://arxiv.org/html/2605.20591#bib.bib110)\], BARTScore\[[58](https://arxiv.org/html/2605.20591#bib.bib109)\], Semantic Entropy\[[31](https://arxiv.org/html/2605.20591#bib.bib108)\], and Cosine Similarity\[[27](https://arxiv.org/html/2605.20591#bib.bib107)\]\.
#### IV\-A1 G\-Eval
We use the G\-Eval framework\[[54](https://arxiv.org/html/2605.20591#bib.bib110)\]to assess response faithfulness based on the clinical prompt and standardized context\. Leveraging Gemini’s chain\-of\-thought reasoning, each response is independently scored using a five\-part rubric: \(1\) consistency with the prompt, \(2\) factual accuracy, \(3\) completeness, \(4\) citation reliability, and \(5\) inference justification\. Scores \(0–1 scale\) are averaged across all responses\. G\-Eval uses Gemini 3\.1 Pro as a judgment engine – chosen for its advanced reasoning and intelligence – to evaluate outputs through explicit steps: checking factual consistency, validating cited sources, and assessing logical coherence\. We incorporate token\-level confidence to mitigate bias from verbosity\. The final score is the mean of the rubric components\.
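The aggregation described above \(mean of the five rubric components, averaged across responses\) can be sketched as follows\. The dictionary keys and function name are hypothetical, the per\-component scores are assumed to come from the judge model, and the token\-level confidence adjustment is not modeled here\.

```python
from statistics import mean

# Hypothetical keys for the five rubric components named in the text.
RUBRIC = (
    "consistency",
    "factual_accuracy",
    "completeness",
    "citation_reliability",
    "inference_justification",
)


def g_eval_score(responses):
    """Mean rubric score per response, then averaged over all responses."""
    per_response = []
    for scores in responses:
        missing = [key for key in RUBRIC if key not in scores]
        if missing:
            raise ValueError(f"missing rubric components: {missing}")
        per_response.append(mean(scores[key] for key in RUBRIC))
    return mean(per_response)
```

With five repeated responses per model, `responses` would hold five such dictionaries, each with component scores on the 0–1 scale\.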
#### IV\-A2 BARTScore
BARTScore\[[58](https://arxiv.org/html/2605.20591#bib.bib109)\] estimates the likelihood of a generated text \($y$\) given a reference input \($x$\) using the pretrained BART model\[[57](https://arxiv.org/html/2605.20591#bib.bib1)\]\. It assesses faithfulness by computing the weighted log\-probability of $y$ conditioned on $x$, reflecting how closely the generated response aligns with the prompt’s meaning and structure\. In our case, $x$ is the clinical vignette and question, and $y$ is the response generated by each MedGPT\. Higher scores indicate better alignment and lower hallucination risk\.
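For reference, the weighted log\-probability can be written out using the general form from the BARTScore formulation\. Here $m$ is the number of tokens in $y$, $\omega_t$ is the weight of token $t$, and $\theta$ denotes the BART parameters; the specific weighting used in this evaluation is not stated, so uniform weights are only an assumption\.

$$\mathrm{BARTScore}(x, y) = \sum_{t=1}^{m} \omega_t \log p\left(y_t \mid y_{<t}, x, \theta\right)$$

Because each term is a log\-probability, scores are non\-positive for non\-negative weights, which is why thresholds on this metric take negative values\.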
#### IV\-A3 Semantic Entropy
Semantic Entropy \(SE\)\[[31](https://arxiv.org/html/2605.20591#bib.bib108)\]quantifies the model’s uncertainty in generating outputs\. Unlike prior formulations that require multiple samples and semantic\-equivalence comparisons\[[8](https://arxiv.org/html/2605.20591#bib.bib34)\], we compute SE directly from a single forward pass using token\-level output probabilities:
SE\(x\)=−∑ipilogpiSE\(x\)=\-\\sum\_\{i\}p\_\{i\}\\log p\_\{i\}
wherepip\_\{i\}is the softmax probability of tokeniifor the generated sequence\. This approach captures uncertainty and potential risk of hallucination\. A higherSE\(x\)SE\(x\)indicates greater uncertainty in the model’s predictions, which may signal a higher likelihood of hallucination\.
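A minimal sketch of the entropy computation, assuming the probabilities form a single normalized distribution and that the natural logarithm is used\. How probabilities are aggregated across token positions is not specified in the text, so the function `token_entropy` is illustrative only\.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy -sum(p_i * log p_i) in nats; zero-probability terms contribute 0."""
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probs), 1.0, rel_tol=0.0, abs_tol=1e-6):
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A distribution concentrated on one token yields an entropy of zero, while a uniform distribution over $n$ tokens yields $\log n$, the maximum for that vocabulary size\.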
#### IV\-A4 Cosine Similarity
We use cosine similarity \[[27](https://arxiv.org/html/2605.20591#bib.bib107)\] to assess semantic alignment between a medical GPT’s response and its clinical input, defined as the concatenation of case context and question\. Embeddings are generated with BioBERT, a transformer pretrained on biomedical text to ensure accurate encoding of clinical semantics\. Scores range from 0 \(no similarity\) to 1 \(perfect alignment\)\. Lower scores indicate semantic drift or fabrication, key indicators of hallucination\.
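The similarity itself can be sketched in a few lines, assuming the BioBERT embeddings of the clinical input and the response have already been computed and are passed in as plain vectors\. The function name `cosine_similarity` is illustrative\.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)
```

In general the cosine lies in $[-1, 1]$; the 0 to 1 range reported above presumably reflects the values observed for these embeddings\.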
### IV\-B Actor\-level Abuse Evaluation Framework
#### IV\-B1 Method
To evaluate design\-intent risks, we developed an automated LLM\-based scoring framework to assess MedGPT metadata\. We focused on models with sufficient descriptive detail \(specifically, those providing at least two of name, description, and conversation starters\), as these best reflect developer intent\. Each model was evaluated for compliance with OpenAI’s usage policies \[[37](https://arxiv.org/html/2605.20591#bib.bib106)\], focusing on four misuse categories: health consultations, scams, privacy violations, and illicit activities\.
A key challenge is the vagueness of OpenAI’s policies\. For instance, while “providing medical advice” is prohibited, the role of disclaimers or scope limitations remains unclear\. To resolve this, we translated policy language into concrete criteria using LLM\-as\-a\-judge methods \[[13](https://arxiv.org/html/2605.20591#bib.bib103)\], embedding these rules into structured Gemini 3\.1 Pro prompts to ensure consistent and reproducible assessments\. Gemini 3\.1 Pro was chosen for its recency, its strong reasoning on complex problems, and its low reported hallucination rate \[[10](https://arxiv.org/html/2605.20591#bib.bib2)\]\.
Each prompt included the GPT’s metadata, a relevant policy clause, and a structured scoring rubric \(available at [https://anonymous\.4open\.science/r/medical\_llms\-appendix\-10DA](https://anonymous.4open.science/r/medical_llms-appendix-10DA)\)\. Gemini 3\.1 Pro returned a misuse risk score and, when applicable, the specific policy or provision violated\.
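The metadata filter and prompt assembly can be sketched as below\. Everything here is hypothetical scaffolding: the field names, the prompt wording, the JSON reply format, and the functions `has_sufficient_metadata`, `build_judge_prompt`, and `parse_judge_reply` are assumptions for illustration, and the call to the judge model is deliberately left out\.

```python
import json
from typing import Optional

# Hypothetical field names for the three metadata items named in the text.
METADATA_FIELDS = ("name", "description", "conversation_starters")


def has_sufficient_metadata(gpt: dict) -> bool:
    """Keep a model only if at least two of the three metadata fields are non-empty."""
    return sum(1 for field in METADATA_FIELDS if gpt.get(field)) >= 2


def build_judge_prompt(gpt: dict, policy_clause: str, rubric: str) -> str:
    """Combine metadata, one policy clause, and the scoring rubric into one prompt."""
    metadata = json.dumps({f: gpt.get(f) for f in METADATA_FIELDS}, indent=2)
    return (
        "You are auditing a custom GPT against a usage policy.\n"
        f"GPT metadata:\n{metadata}\n\n"
        f"Policy clause:\n{policy_clause}\n\n"
        f"Scoring rubric:\n{rubric}\n\n"
        'Reply with JSON: {"risk_score": <0..1>, "violated_provision": <string or null>}'
    )


def parse_judge_reply(reply: str) -> tuple[float, Optional[str]]:
    """Extract the misuse risk score and the violated provision, if any."""
    data = json.loads(reply)
    score = float(data["risk_score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError("risk_score must lie in [0, 1]")
    return score, data.get("violated_provision")
```

Requesting a fixed reply format is one way to keep judge outputs machine\-readable across thousands of models; the actual prompts are in the appendix linked above\.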
#### IV\-B2 Determining Threshold
To separate non\-compliant MedGPTs from compliant ones, we applied K\-means clustering with $k=2$ to the misuse risk scores\. This unsupervised approach avoids reliance on arbitrary cutoffs and aligns with prior work \[[46](https://arxiv.org/html/2605.20591#bib.bib47)\] that uses similar methods to flag unsafe GPT behavior\. The clustering produced a decision boundary at 0\.45, which we adopt as the threshold for flagging non\-compliant behavior\. To validate this threshold, we conducted an expert review on models near the boundary \(misuse scores 0\.35–0\.55\) and compared the expert judgments with the clustering results\. This yielded a Cohen’s Kappa score of 0\.8188, indicating almost perfect agreement\. The silhouette score for the clustering was 0\.7555, indicating that the two clusters are well separated and cohesive\. These results confirm that the 0\.45 threshold reliably separates non\-compliant and compliant MedGPTs\.
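For one\-dimensional scores, two\-cluster K\-means reduces to finding two centroids whose midpoint is the decision boundary\. The sketch below is a plain\-Python illustration of that idea under an assumed initialization at the minimum and maximum score; the function `kmeans_threshold` is hypothetical and is not the authors' code\.

```python
def kmeans_threshold(scores: list[float], iterations: int = 100) -> float:
    """Two-cluster k-means on 1-D scores; returns the midpoint between the centroids."""
    if len(set(scores)) < 2:
        raise ValueError("need at least two distinct scores")
    low, high = min(scores), max(scores)  # initialise centroids at the extremes
    for _ in range(iterations):
        boundary = (low + high) / 2
        # Assign each score to the nearer centroid.
        low_cluster = [s for s in scores if s <= boundary]
        high_cluster = [s for s in scores if s > boundary]
        new_low = sum(low_cluster) / len(low_cluster)
        new_high = sum(high_cluster) / len(high_cluster)
        if new_low == low and new_high == high:
            break  # centroids stopped moving
        low, high = new_low, new_high
    return (low + high) / 2
```

Scores above the returned value fall in the higher\-risk cluster\. The boundary depends entirely on the score distribution, which is why the expert review near the boundary is a useful check\.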
## V Results
In this section, we present the empirical findings of our study, demonstrating how clinical hallucination and actor\-level misuse manifest in MedGPTs across different popularity tiers\. Afterwards, we show the pervasiveness of clinical hallucination in open\-source medical LLMs and a comparative analysis with MedGPTs\.
### V\-A Clinical Hallucination in MedGPTs
Fig\.[6\(a\)](https://arxiv.org/html/2605.20591#S5.F6.sf1) shows the distribution of hallucination scores across MedGPTs grouped by popularity tier\. In the G\-Eval panel, 25%, 22%, and 30% of MedGPTs in the top\-, middle\-, and bottom\-tier categories score below 0\.8, respectively\. Similarly, 30%, 22%, and 32% of models in the top, middle, and bottom tiers score below 0\.9\. Consequently, 70%, 78%, and 68% of MedGPTs in these tiers achieve G\-Eval scores of at least 0\.9\. Across all evaluated MedGPTs, 25\.13% score 0\.8 or lower, indicating a non\-trivial level of hallucination risk in the ecosystem\. We also identify four MedGPTs in the top tier and one in the middle tier with a G\-Eval score of zero\. Our manual inspection shows that these models refused to respond to the diagnostic prompt, stating that they could not provide medical advice\. Although such refusals reduce hallucination risk, they also reveal inconsistent safety or capability boundaries among deployed MedGPTs\. Overall, bottom\-tier MedGPTs exhibit the highest hallucination risk on G\-Eval, followed by top\-tier models, while mid\-tier systems demonstrate comparatively lower hallucination rates\. These results suggest that popularity does not necessarily correlate with factual reliability, raising concerns about the trust users may place in widely used web\-deployed medical GPT applications\.
The BARTScore panel presents the score distribution for MedGPTs across the three categories\. Approximately 5\.9%, 5\.6%, and 0\.2% of MedGPTs in the top, middle, and bottom categories, respectively, score below \-4, indicating weak alignment between the generated responses and the reference context\. The majority of MedGPTs fall within the \-4 to \-3\.5 range, accounting for 52%, 64\.4%, and 74\.4% of models in the top, middle, and bottom categories, respectively\. Only 42\.1%, 30%, and 24\.8% of MedGPTs in the top, middle, and bottom categories achieve scores of \-3\.5 or higher, suggesting stronger semantic alignment with the reference content\. Overall, only 37\.27% of MedGPTs reach this threshold\. These results indicate that many MedGPTs generate responses with limited semantic alignment to the clinical context, including several top\-rated ones\. This questions the reliability of MedGPTs for context\-grounded clinical information and reveals the risk of inaccurate outputs that could influence medical guidance\.
The semantic entropy panel reports scores across the three popularity categories\. About 21%, 66\.8%, and 90% of MedGPTs in the top, middle, and bottom categories, respectively, record entropy scores of 2 or higher\. In contrast, 79%, 33\.3%, and 10% achieve scores below 2\. Because semantic entropy is a cost metric, where lower values indicate more stable, consistent outputs, top\-tier MedGPTs exhibit substantially lower uncertainty than middle\- and bottom\-tier MedGPTs\. Overall, 59\.87% of MedGPTs obtain entropy scores below 2, suggesting moderate response stability across the ecosystem but clear performance gaps between popularity tiers\. The middle\- and bottom\-tier MedGPTs show greater variability in generated responses, which may lead to inconsistent or unreliable clinical content in real\-world healthcare applications\.
The cosine similarity panel presents scores across the three popularity categories and shows clear tier differences\. The middle category has the largest share of low\-alignment models \(54\.4% below 0\.4\), compared with 43\.6% in the bottom category and 37\.1% in the top category\. The bottom category is concentrated most heavily in the 0\.4–0\.45 band \(48\.8%\), exceeding the top \(35\.1%\) and middle \(36%\) categories\. Models with strong alignment \(above 0\.45\) appear far more often in the top category \(27\.8%\) than in the middle \(9\.68%\) or bottom \(7\.6%\)\. Overall, 41\.07% of MedGPTs score below 0\.4\. Lower cosine similarity outside the top tier suggests that many models provide responses that do not align with the given context, increasing the risk of irrelevant or misleading content\.
Human annotation: To validate our automated scoring, we manually reviewed a stratified subset of 300 MedGPTs \(200 from the Top 1000 tier and 50 each from the middle and bottom tiers, $\approx$20% per tier\), since reviewing all samples was impractical given the large number of models\. Each response was independently assessed against the ground truth using the same metrics as the automated system, and discrepancies were resolved by consensus\. Human reviewers agreed with the automated scoring in 90\.5–94\.0% of cases \(Cohen’s kappa = 0\.8163\), supporting the reliability of our automated evaluation framework\.
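Cohen's kappa corrects the raw agreement rate for the agreement expected by chance\. The following sketch shows the standard two\-rater computation; the function `cohens_kappa` and the label encoding are illustrative, and no study data are reproduced here\.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)
```

Values above 0\.8 are conventionally read as almost perfect agreement, which is the interpretation applied to the kappa values reported in this paper\.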
### V\-B Users’ Perception of Clinical Hallucination
We evaluated whether user engagement reflects awareness of clinical hallucinations in MedGPTs\. Fig\.[6\(b\)](https://arxiv.org/html/2605.20591#S5.F6.sf2) shows correlations between hallucination metrics and conversation counts for the Top 1000 MedGPTs, the only tier with sufficient review data\. The four hallucination metrics are largely decoupled from user activity, with rank correlations of \-0\.0347, \-0\.0196, 0\.0057, and \-0\.0318 for G\-Eval, BARTScore, semantic entropy, and cosine similarity, respectively, suggesting that users rarely recognize or penalize clinical errors\. Table [IV](https://arxiv.org/html/2605.20591#S5.T4) confirms that hallucination scores are uncorrelated with both positive and negative reviews \(ranging from \-0\.0449 to 0\.0732\), whereas reviews correlate strongly with usage \($r=0.9999$ positive, $r=0.9656$ negative\), as shown in Fig\.[8\(a\)](https://arxiv.org/html/2605.20591#S5.F8.sf1)\. These results reveal that user feedback reflects activity rather than accuracy, highlighting a critical gap between perceived and actual reliability in healthcare applications\.
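A rank correlation of this kind can be sketched as follows, assuming Spearman's coefficient \(Pearson correlation applied to average ranks\); the text does not name the exact coefficient, and the helper names are illustrative\.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average rank for the tie group i..j
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Coefficients near zero, as reported for all four hallucination metrics, mean that ordering MedGPTs by conversation count says almost nothing about their ordering by hallucination score\.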
\(a\) Cumulative distribution of MedGPTs hallucination scores\. \(b\) Correlation between hallucination scores and conversation counts of Top 1000 MedGPTs\.
TABLE IV: Correlation between hallucination scores and reviews of Top 1000 MedGPTs\.
\(a\) Correlation between conversation counts and reviews of Top 1000 MedGPTs\.
### V\-C Actor\-defined Design Intent
To assess risks beyond clinical hallucinations, we analyzed whether developer\-defined design intent contributes to abusive behavior in MedGPTs\. Fig\.[8](https://arxiv.org/html/2605.20591#S5.F8) shows the distribution of risk scores across popularity tiers\. Abuse is present across all categories\. Using the 0\.45 threshold \(Section [IV\-B2](https://arxiv.org/html/2605.20591#S4.SS2.SSS2)\), we identify misuse in 54\.3% of Top 1000, 48\.0% of Middle 250, and 33\.6% of Bottom 250, resulting in an overall misuse rate of 49\.8%\. These findings indicate that misuse is a systemic issue stemming from how developers encode design intent\. The breakdown of abuse types shows that most flagged MedGPTs involve two or three violation cases, affecting 33\.15% to 63\.72% of MedGPTs across tiers\. Two\-violation cases are most common in the bottom tier, while three\-violation cases dominate the top and middle tiers\. A small fraction of MedGPTs violate one case \(Top: 1\.84%, Middle: 1\.67%, Bottom: 2\.38%\) or four cases \(Top: 1\.28%, Middle: 1\.67%\)\. Because single\-violation cases are rare, most flagged MedGPTs carry cumulative violations, which amplify the risk, especially in popular MedGPTs\.
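The overall misuse rate follows from weighting each tier's rate by its sample size\. Assuming the tier sizes implied by the tier names \(1000, 250, and 250, matching the stratified sample of 1,500\), the arithmetic is:

$$\frac{0.543 \times 1000 + 0.480 \times 250 + 0.336 \times 250}{1500} = \frac{543 + 120 + 84}{1500} = \frac{747}{1500} = 0.498$$

The top tier dominates the overall figure because it contributes two\-thirds of the evaluated models\.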
Figure 8: Cumulative distribution of misuse in MedGPTs\.
\(a\) Statistical distribution of misused MedGPTs\.
TABLE V: Guidelines for disagreed cases\.
Human annotation: To validate our automated scoring system and confirm that misused MedGPTs violated OpenAI’s usage policies, we manually reviewed a random 20% of models using the same procedure described in Section [V\-A](https://arxiv.org/html/2605.20591#S5.SS1)\. Each model’s name, description, and conversation starters were evaluated according to the policy taxonomy in Table [IX](https://arxiv.org/html/2605.20591#S6.T9)\. Human assessments agreed with the automated scoring in 89\.5–92\.0% of cases \(Cohen’s kappa = 0\.9479\)\. Most discrepancies, detailed in Table [V](https://arxiv.org/html/2605.20591#S5.T5), arose from vague descriptions, missing conversation starters, or ambiguous phrasing, demonstrating both the reliability of our system and its limitations in instances with sparse metadata\.
### V\-D MedGPTs with Actions capability
We analyzed MedGPTs with Actions capability to assess whether their published privacy policies reduce user risk or reflect non\-compliance\. Across the 170 MedGPTs with Actions enabled, we identified 108 unique third\-party domains\. As shown in Table [VI](https://arxiv.org/html/2605.20591#S5.T6), only 73 \(42\.94%\) had accessible privacy policies\. Seven MedGPTs linked to an unrelated policy\. Most \(66, 38\.82%\) listed a single domain with one accessible policy, while four listed two domains, two listed three, and one listed four, often reusing the same policy\. These inconsistencies suggest that while some creators follow disclosure standards, many offer minimal or mismatched documentation, undermining transparency\.
The remaining 97 MedGPTs \(57\.06%\) lacked accessible policies\. Of these, 62 \(36\.47%\) disclosed no third\-party domains\. In contrast, 35 listed domains with nonfunctional links: 13 were invalid, 8 redirected to generic homepages \(including one OpenAI Enterprise instance\), 11 led to suspended services, and 3 failed due to a DNS error \(Table[VI](https://arxiv.org/html/2605.20591#S5.T6)\)\. These failures prevent users from understanding how their data is collected or shared\. In clinical contexts, where inputs may include sensitive health information, the absence of clear policies undermines consent, compliance, and trust\.
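The accounting in the two paragraphs above can be checked with a short tally\. The counts are taken directly from the text; the dictionary keys and the helper `share` are illustrative names\.

```python
TOTAL_ACTION_GPTS = 170

# MedGPTs with an accessible policy, grouped by number of listed domains.
accessible = {"one_domain": 66, "two_domains": 4, "three_domains": 2, "four_domains": 1}

# MedGPTs without an accessible policy, grouped by failure type.
inaccessible = {
    "no_domain_disclosed": 62,
    "invalid_link": 13,
    "redirect_to_homepage": 8,
    "suspended_service": 11,
    "dns_error": 3,
}


def share(count: int, total: int = TOTAL_ACTION_GPTS) -> float:
    """Percentage of all Actions-enabled MedGPTs, rounded to two decimals."""
    return round(100 * count / total, 2)


n_accessible = sum(accessible.values())
n_inaccessible = sum(inaccessible.values())
```

The two groups sum to 170, and the resulting shares match the percentages reported in the text\.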
Noncompliant GPT authorship is highly concentrated \(see the author distribution for noncompliant MedGPTs\)\. While $\approx$80% were single\-authored, two authors alone account for 35 and 10 models, respectively\. These models are not obscure: conversation counts range from 0 to over 25,000, averaging 984 \(see the conversation\-count distribution for noncompliant MedGPTs\)\. This indicates widespread use despite poor privacy protections\.
Our automated analysis of 38 extracted privacy policies reveals major compliance gaps\. As shown in Fig\.[12](https://arxiv.org/html/2605.20591#S5.F12), nearly 70% scored below 0\.45, despite high Gemini 3\.1 Pro confidence scores \(0\.713–0\.968\)\. Common lapses include missing disclosures on data sharing, cookies, analytics, user rights, and legal processing bases\. These lapses reflect widespread neglect of privacy standards, increasing regulatory exposure and embedding unsafe data practices into clinical workflows\.
Human Review: To validate the automated assessment, we manually labeled these 38 policy statements and compared the automated scores with the human labels\. This yielded a Cohen’s Kappa of 0\.8143, indicating almost perfect agreement\. Thresholds for non\-compliance were calibrated based on this validation, ensuring robust detection of privacy risks across Actions\-enabled MedGPTs\.
TABLE VI: Distribution of privacy policy accessibility\.
\(a\) Distribution of authors and conversation counts for noncompliant MedGPTs\.
Figure 12: Distribution of privacy policy alignment scores\.
### V\-E Clinical Hallucination in Open\-source Medical LLMs
We evaluate clinical hallucination in open\-source medical LLMs using the methodology described in Section[IV](https://arxiv.org/html/2605.20591#S4), with results summarized in Table[VII](https://arxiv.org/html/2605.20591#S5.T7)\. Galactica and Aloe\-Alpha achieve the highest G\-Eval scores \(0\.6480 and 0\.5948\), indicating stronger factual accuracy, whereas BioMistral and ChatDoctor score the lowest, suggesting a higher risk of hallucination\. Cosine similarity highlights semantic consistency: MedAlpaca \(0\.4863\), Apollo \(0\.4863\), and MentalHealthChatbot \(0\.4354\) outperform Aloe\-Alpha \(0\.2298\), revealing that high factual accuracy does not always align with coherent responses\. Entropy values further capture response uncertainty, with MentalHealthChatbot \(1\.2077\) and Galactica \(1\.2345\) producing more stable outputs, while BioMistral \(2\.3978\) is the most variable\. BART scores reinforce these distinctions, with MedAlpaca and Apollo \(\-1\.6549\) exhibiting comparatively lower hallucination tendencies\. Overall, the results reveal a trade\-off between factual accuracy and response consistency, emphasizing the need for multi\-metric evaluation when assessing hallucination in medical LLMs\.
We compare MedGPTs with open\-source medical LLMs using four metrics, as shown in Table [VIII](https://arxiv.org/html/2605.20591#S5.T8)\. MedGPTs achieve a substantially higher average G\-Eval \(0\.9238 vs\. 0\.4558\), indicating stronger factual accuracy, though their scores range from 0 to 1, revealing occasional low\-quality outputs\. BART scores show that open\-source models maintain slightly higher average \(\-3\.5310 vs\. \-3\.6307\) and maximum \(\-1\.6540 vs\. \-1\.8115\) values, indicating closer alignment with the input\. Semantic entropy is lower for open\-source models \(1\.6129 vs\. 1\.9272\), suggesting that their outputs are more consistent and predictable\. Cosine similarity is higher on average for MedGPTs \(0\.4054 vs\. 0\.3731\), reflecting better semantic alignment\. MedGPTs therefore provide superior factual accuracy and semantic alignment, while open\-source medical LLMs remain more consistent on BARTScore and semantic entropy, which underlines the need to consider multiple evaluation dimensions when assessing hallucination in medical LLMs\.
TABLE VII: Open\-source LLMs evaluation metrics\.
TABLE VIII: MedGPTs’ Metrics versus Open\-source LLMs\.
## VI Discussion and Recommendations
### VI\-A Discussion
This subsection outlines the broader implications of our findings and proposes concrete steps to improve safety, accountability, and patient trust in medical LLMs\.
Hallucination is systemic, not incidental\. MedGPT\-HEval reveals that 25–30% of models across all tiers score below 0\.8 in G\-Eval, with fewer than 42% achieving adequate BARTScore or cosine similarity, and only $\approx$60% producing stable outputs \(semantic entropy $<2$\)\. Bottom\- and middle\-tier models pose the greatest risk, demonstrating that popularity is a poor predictor of factual reliability\. Without multi\-metric evaluation before deployment, hallucination remains a pervasive safety concern in web\-deployed MedGPTs\.
User unawareness of clinical hallucination\. Engagement data reveal a structural failure: users consistently reward usage, not accuracy\. While conversation counts and reviews are tightly correlated \($r\approx 1$\), their correlation with clinical hallucination is negligible \($r<0.06$\), indicating that users rarely recognize clinical errors\. In fact, less reliable MedGPTs often received more positive feedback, suggesting that fluency is mistaken for trustworthiness\. This mismatch allows unsafe models to gain prominence even when they offer misleading guidance\. To address this, platforms should integrate standardized safety assessments into model profiles, aligning trust signals with actual reliability\.
Abuse by design and unsafe by intent\. Misuse in MedGPTs is often deliberate\. Over 49% violated OpenAI’s usage policies, many explicitly promoting unsafe consultations, scams, or illicit advice in their metadata\. Over 33% combined multiple violations, amplifying risk\. These were not borderline cases, as our manual review confirmed the intent\. This points to a developer ecosystem where harmful behavior is embedded at design time\. Without enforced red\-teaming and stronger accountability, such models will persist and scale\.
Privacy as collateral damage\. Privacy violations were widespread\. More than half of MedGPTs with Actions capability lacked accessible policies; among those available, $\approx$70% failed basic compliance checks\. Many were generic templates without key disclosures on data sharing, tracking, or user rights\. These noncompliant GPTs are often widely used, exposing users to unsafe data handling\. Enforceable policy standards, routine audits, and model takedowns are urgently needed to safeguard patient privacy in LLM\-driven healthcare\.
Clinical hallucination varies across medical LLMs\. Open\-source medical LLMs show a 25–65% trade\-off between factual accuracy and semantic coherence: some models produce more accurate outputs, while others are more consistent or aligned with context\. Compared with MedGPTs, which generally achieve higher factual accuracy and semantic alignment \(G\-Eval $\approx$0\.92, cosine $\approx$0\.41\), open\-source models are more stable but may sacrifice correctness\. These findings emphasize that popularity or perceived quality does not guarantee reliability, underscoring the need for multi\-metric evaluation to ensure safe clinical outputs from medical LLMs\.
Because the OpenAI Store operates as a general\-purpose AI marketplace, our findings likely generalize to other Web\-scale LLM deployment ecosystems \(Hugging Face Spaces, Poe bots, Replit agents\), making our framework broadly applicable\.
### VI\-B Recommendations
Here, we outline concrete safeguards to mitigate clinical hallucination and unsafe design in medical LLMs\.
- Hallucinations in medical LLMs \(confident but unfounded claims\) pose serious safety risks\. Mitigation requires evidence grounding, model\-level refinement, and human oversight\. Retrieval\-augmented generation \(RAG\) anchors biomedical outputs, while hallucination\-aware fine\-tuning and reinforcement learning from human feedback \(RLHF\) improve factuality \[[14](https://arxiv.org/html/2605.20591#bib.bib27)\]\. Real\-time systems such as detection\-and\-repair \[[49](https://arxiv.org/html/2605.20591#bib.bib24)\] and RARR \[[9](https://arxiv.org/html/2605.20591#bib.bib25)\] revise outputs based on retrieved evidence\. Human\-in\-the\-loop frameworks \[[2](https://arxiv.org/html/2605.20591#bib.bib26)\] can flag low\-confidence responses, escalate to clinicians, and guide model updates\. These approaches shift hallucination control from reactive to proactive, trustworthy design\.
- Our analysis showed that 49\.8% of MedGPTs encode abusive behaviors, including unsafe consultation and illicit advice\. Addressing this requires shifting from broad policy language to enforceable, developer\-focused safeguards\. The OpenAI Platform should translate usage policies \[[41](https://arxiv.org/html/2605.20591#bib.bib201)\] into clear, auditable rules \(as our framework demonstrates\), automatically evaluate models at submission \[[46](https://arxiv.org/html/2605.20591#bib.bib47)\], and flag high\-risk cases for review\. Routine assessments of high\-risk GPTs \[[46](https://arxiv.org/html/2605.20591#bib.bib47)\] and transparency dashboards showing model intent, compliance, and known risks can deter misuse\. Without such measures, harmful models continue to scale, embedding risk into the foundations of healthcare LLMs\.
- MedGPTs with Actions capability pose serious privacy risks: 57\.06% lacked accessible policies, and $\approx$70% of retrieved ones failed basic compliance\. Creators often reuse vague templates or link to broken or irrelevant pages, despite many models logging thousands of interactions\. The OpenAI Platform should use our evaluation framework to flag noncompliant models, enforce policy validation at submission, reject invalid disclosures, and show privacy scores \[[46](https://arxiv.org/html/2605.20591#bib.bib47)\]\. In clinical contexts, missing or misleading policies undermine consent, erode trust, and risk regulatory violations\. Automated filters should detect breaches at scale, supplemented by human review to catch nuanced issues, exploitation, or unsafe behavior that automated systems may miss\.
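As an illustration of the submission\-time checks recommended above, the sketch below combines the misuse threshold with the privacy\-policy checks into a list of review flags\. It is a hypothetical design, not an existing platform feature: the `Submission` fields, the flag names, and the use of inclusive or exclusive comparisons at the 0\.45 boundary are all assumptions\.

```python
from dataclasses import dataclass
from typing import Optional

MISUSE_THRESHOLD = 0.45            # decision boundary reported for misuse risk scores
POLICY_ALIGNMENT_THRESHOLD = 0.45  # policies scoring below this were treated as gaps


@dataclass
class Submission:
    name: str
    misuse_score: float                   # output of the metadata judge, in [0, 1]
    uses_actions: bool                    # whether the GPT calls third-party APIs
    policy_accessible: bool               # privacy policy link resolves to a real policy
    policy_score: Optional[float] = None  # alignment score of the retrieved policy


def review_flags(sub: Submission) -> list[str]:
    """Return the reasons, if any, for routing a submission to human review."""
    flags = []
    if sub.misuse_score >= MISUSE_THRESHOLD:
        flags.append("metadata_misuse_risk")
    if sub.uses_actions and not sub.policy_accessible:
        flags.append("privacy_policy_inaccessible")
    if (
        sub.uses_actions
        and sub.policy_accessible
        and sub.policy_score is not None
        and sub.policy_score < POLICY_ALIGNMENT_THRESHOLD
    ):
        flags.append("privacy_policy_low_alignment")
    return flags
```

An empty list means the automated checks found nothing; any flag would send the model to the human review step that the recommendations pair with automated filtering\.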
### VI\-C Concluding Remarks
We conducted the first large\-scale audit of MedGPTs on the OpenAI Store, analyzing 6,233 models via automated extraction, structured prompting, and multi\-metric scoring\. We also evaluated hallucination risks in 10 open\-source medical LLMs for comparison\. Our framework identifies two systemic risks: clinical hallucinations and actor\-driven misuse \(abusive intent, privacy violations, unsafe Actions\)\. Results show 25–30% of MedGPTs score below 0\.8 in G\-Eval, 37\.27% reach a BART score $\geq -3.5$, 41\.07% have cosine similarity $<0.4$, and 59\.87% have semantic entropy $<2$\. Misuse affects 54\.3% \(top\), 48\.0% \(middle\), and 33\.6% \(bottom\) tiers, and only 42\.94% of 170 Action\-enabled models had accessible privacy policies; nearly 70% of the retrieved policies scored below the threshold\. Compared with MedGPTs, open\-source models have lower G\-Eval \(0\.456 vs\. 0\.924\) but more stable semantic entropy \(1\.613 vs\. 1\.927\)\. User engagement shows near\-zero correlation with hallucination metrics, highlighting gaps in accuracy, consistency, and perceived reliability\.
This study has several limitations\. First, it reflects a snapshot of the OpenAI GPT Store \(January 20–22, 2026\), and trends may change\. Second, although we extracted 6,233 MedGPTs, coverage may be incomplete, and open\-source evaluations focused on prominent models\. Third, we considered only clinical hallucinations and actor\-driven risks \(consultation abuse, scams, privacy violations, illicit activity, and Actions\); other vectors, such as adversarial prompts or data leakage, remain unexplored\. Fourth, intent analysis relied on static metadata; dynamic abuses via third\-party APIs may evade detection\. Fifth, due to OpenAI query limits, we analyzed a stratified subset rather than the full dataset\. While representative, broader coverage may yield different results\. Future work should extend to continuous, scalable audits, dynamic red\-teaming, and standardized safety benchmarks for trustworthy medical LLMs\.
## References
- \[1\]V\. Agarwal, Y\. Jin, M\. Chandra, M\. D\. Choudhury, S\. Kumar, and N\. Sastry\(2025\)MedHalu: hallucinations in responses to healthcare queries by large language models\.External Links:2409\.19492,[Link](https://arxiv.org/abs/2409.19492)Cited by:[§II\-B](https://arxiv.org/html/2605.20591#S2.SS2.p1.1),[§III\-A](https://arxiv.org/html/2605.20591#S3.SS1.p1.1)\.
- \[2\]M\. A\. Ahmad, I\. Yaramis, and T\. D\. Roy\(2023\)Creating trustworthy llms: dealing with hallucinations in healthcare ai\.External Links:2311\.01463,[Link](https://arxiv.org/abs/2311.01463)Cited by:[1st item](https://arxiv.org/html/2605.20591#S6.I1.i1.p1.1)\.
- \[3\]M\. Ahmed, J\. Lam, A\. Chow, and C\. Chow\(2025\)A primer on large language models \(llms\) and chatgpt for cardiovascular healthcare professionals\.CJC Open\.External Links:ISSN 2589\-790X,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.cjco.2025.02.012),[Link](https://www.sciencedirect.com/science/article/pii/S2589790X25001106)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p1.1)\.
- \[4\]M\. Ahmed, J\. Lam, A\. Chow, and C\. Chow\(2025\-12\)A primer on large language models \(llms\) and chatgpt for cardiovascular healthcare professionals\.CJC Open7\(5\),pp\. 660–666\(en\)\.External Links:ISSN,[Link](https://doi.org/10.1016/j.cjco.2025.02.012),[Document](https://dx.doi.org/10.1016/j.cjco.2025.02.012)Cited by:[§II\-A](https://arxiv.org/html/2605.20591#S2.SS1.p1.1)\.
- \[5\]Y\. Alsabawi, P\. R\. Quesada, and D\. T\. Rouse\(2025\)Readability of custom chatbot vs\. gpt\-4 responses to otolaryngology\-related patient questions\.American Journal of Otolaryngology46\(5\),pp\. 104717\.External Links:ISSN 0196\-0709,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.amjoto.2025.104717),[Link](https://www.sciencedirect.com/science/article/pii/S0196070925001206)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p1.1)\.
- \[6\]E\. Asgari, N\. Montaña\-Brown, M\. Dubois, S\. Khalil, J\. Balloch, J\. A\. Yeung, and D\. Pimenta\(2025\-05\)A framework to assess clinical safety and hallucination rates of llms for medical text summarisation\.npj Digital Medicine8\(1\),pp\. 274\.External Links:ISSN 2398\-6352,[Link](https://doi.org/10.1038/s41746-025-01670-7),[Document](https://dx.doi.org/10.1038/s41746-025-01670-7)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p3.1),[§I](https://arxiv.org/html/2605.20591#S1.p5.1),[§II\-B](https://arxiv.org/html/2605.20591#S2.SS2.p1.1),[§III\-A](https://arxiv.org/html/2605.20591#S3.SS1.p1.1)\.
- \[7\]Y\. Bang, Z\. Ji, A\. Schelten, A\. Hartshorn, T\. Fowler, C\. Zhang, N\. Cancedda, and P\. Fung\(2025\)HalluLens: llm hallucination benchmark\.External Links:2504\.17550,[Link](https://arxiv.org/abs/2504.17550)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p2.1)\.
- \[8\]S\. Farquhar, J\. Kossen, L\. Kuhn, and Y\. Gal\(2024\)Detecting hallucinations in large language models using semantic entropy\.Nature630\(8017\),pp\. 625–630\.External Links:[Document](https://dx.doi.org/10.1038/s41586-024-07421-0),ISSN 1476\-4687,[Link](https://doi.org/10.1038/s41586-024-07421-0)Cited by:[§IV\-A3](https://arxiv.org/html/2605.20591#S4.SS1.SSS3.p1.1)\.
- \[9\]L\. Gao, Z\. Dai, P\. Pasupat, A\. Chen, A\. T\. Chaganty, Y\. Fan, V\. Zhao, N\. Lao, H\. Lee, D\. Juan, and K\. Guu\(2023\-07\)RARR: researching and revising what language models say, using language models\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 16477–16508\.External Links:[Link](https://aclanthology.org/2023.acl-long.910/),[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.910)Cited by:[1st item](https://arxiv.org/html/2605.20591#S6.I1.i1.p1.1)\.
- \[10\]\(2026\-02\)Gemini 3\.1 Pro: A smarter model for your most complex tasks\.\(en\-us\)\.External Links:[Link](https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/)Cited by:[§IV\-B1](https://arxiv.org/html/2605.20591#S4.SS2.SSS1.p2.1)\.
- \[11\]GPTApps\.io\.Note:[https://gptsapp\.io/trending\-gpts/top\-1000\-gpts\-ranked](https://gptsapp.io/trending-gpts/top-1000-gpts-ranked)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p3.1)\.
- \[12\]GPTStore\.AI\(2025\)GPTStore\.ai\.External Links:[Link](https://gptstore.ai/)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p3.1)\.
- \[13\]J\. Gu, X\. Jiang, Z\. Shi, H\. Tan, X\. Zhai, C\. Xu, W\. Li, Y\. Shen, S\. Ma, H\. Liu, S\. Wang, K\. Zhang, Y\. Wang, W\. Gao, L\. Ni, and J\. Guo\(2025\)A survey on llm\-as\-a\-judge\.External Links:2411\.15594,[Link](https://arxiv.org/abs/2411.15594)Cited by:[§IV\-B1](https://arxiv.org/html/2605.20591#S4.SS2.SSS1.p2.1)\.
- \[14\]E\. Gumaan\(2025\)Theoretical foundations and mitigation of hallucination in large language models\.External Links:2507\.22915,[Link](https://arxiv.org/abs/2507.22915)Cited by:[1st item](https://arxiv.org/html/2605.20591#S6.I1.i1.p1.1)\.
- \[15\]K\. E\. Gumilar, B\. R\. Indraprasta, A\. S\. Faridzi, B\. M\. Wibowo, A\. Herlambang, E\. Rahestyningtyas, B\. Irawan, Z\. Tambunan, A\. F\. Bustomi, B\. N\. Brahmantara, Z\. Yu, Y\. Hsu, H\. Pramuditya, V\. G\. E\. Putra, H\. Nugroho, P\. Mulawardhana, B\. A\. Tjokroprawiro, T\. Hedianto, I\. H\. Ibrahim, J\. Huang, D\. Li, C\. Lu, J\. Yang, L\. Liao, and M\. Tan\(2024\)Assessment of large language models \(llms\) in decision\-making support for gynecologic oncology\.Computational and Structural Biotechnology Journal23,pp\. 4019–4026\.External Links:ISSN 2001\-0370,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.csbj.2024.10.050),[Link](https://www.sciencedirect.com/science/article/pii/S2001037024003702)Cited by:[§II\-A](https://arxiv.org/html/2605.20591#S2.SS1.p1.1)\.
- \[16\]K\. E\. Gumilar, B\. R\. Indraprasta, A\. S\. Faridzi, B\. M\. Wibowo, A\. Herlambang, E\. Rahestyningtyas, B\. Irawan, Z\. Tambunan, A\. F\. Bustomi, B\. N\. Brahmantara, Z\. Yu, Y\. Hsu, H\. Pramuditya, V\. G\. E\. Putra, H\. Nugroho, P\. Mulawardhana, B\. A\. Tjokroprawiro, T\. Hedianto, I\. H\. Ibrahim, J\. Huang, D\. Li, C\. Lu, J\. Yang, L\. Liao, and M\. Tan\(2024\)Assessment of large language models \(llms\) in decision\-making support for gynecologic oncology\.Computational and Structural Biotechnology Journal23,pp\. 4019–4026\.External Links:ISSN 2001\-0370,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.csbj.2024.10.050),[Link](https://www.sciencedirect.com/science/article/pii/S2001037024003702)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p1.1)\.
- \[17\]A\. K\. Gururajan, E\. Lopez\-Cuena, J\. Bayarri\-Planas, A\. Tormos, D\. Hinjos, P\. Bernabeu\-Perez, A\. Arias\-Duart, P\. A\. Martin\-Torres, L\. Urcelay\-Ganzabal, M\. Gonzalez\-Mallo, S\. Alvarez\-Napagao, E\. Ayguadé\-Parra, and U\. C\. D\. Garcia\-Gasulla\(2024\)Aloe: a family of fine\-tuned open healthcare llms\.External Links:2405\.01886Cited by:[TABLE I](https://arxiv.org/html/2605.20591#S3.T1.9.1.9.8.1),[TABLE VII](https://arxiv.org/html/2605.20591#S5.T7.4.4.12.8.1)\.
- \[18\]T\. Han, L\. C\. Adams, J\. Papaioannou, P\. Grundmann, T\. Oberhauser, A\. Figueroa, A\. Löser, D\. Truhn, and K\. K\. Bressem\(2025\)MedAlpaca – an open\-source collection of medical conversational ai models and training data\.External Links:2304\.08247,[Link](https://arxiv.org/abs/2304.08247)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p1.1),[§I](https://arxiv.org/html/2605.20591#S1.p4.1),[§II\-B](https://arxiv.org/html/2605.20591#S2.SS2.p1.1),[TABLE I](https://arxiv.org/html/2605.20591#S3.T1.9.1.4.3.1),[TABLE VII](https://arxiv.org/html/2605.20591#S5.T7.4.4.7.3.1)\.
- \[19\]K\. He, R\. Mao, Q\. Lin, Y\. Ruan, X\. Lan, M\. Feng, and E\. Cambria\(2025\)A survey of large language models for healthcare: from data, technology, and applications to accountability and ethics\.Information Fusion118,pp\. 102963\.External Links:ISSN 1566\-2535,[Document](https://dx.doi.org/10.1016/j.inffus.2025.102963),[Link](https://www.sciencedirect.com/science/article/pii/S1566253525000363)Cited by:[TABLE I](https://arxiv.org/html/2605.20591#S3.T1),[TABLE I](https://arxiv.org/html/2605.20591#S3.T1.8.2)\.
- \[20\]K\. He, R\. Mao, Q\. Lin, Y\. Ruan, X\. Lan, M\. Feng, and E\. Cambria\(2025\)A survey of large language models for healthcare: from data, technology, and applications to accountability and ethics\.Information Fusion118,pp\. 102963\.External Links:ISSN 1566\-2535,[Document](https://dx.doi.org/10.1016/j.inffus.2025.102963),[Link](https://www.sciencedirect.com/science/article/pii/S1566253525000363)Cited by:[§II\-A](https://arxiv.org/html/2605.20591#S2.SS1.p1.1)\.
- \[21\]X\. Hou, Y\. Zhao, and H\. Wang\(2024\-07\)On the \(In\)Security of LLM App Stores\.arXiv\(en\)\.Note:arXiv:2407\.08422 \[cs\]External Links:[Link](http://arxiv.org/abs/2407.08422)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p2.1),[§I](https://arxiv.org/html/2605.20591#S1.p3.1),[§II\-C](https://arxiv.org/html/2605.20591#S2.SS3.p1.1),[§III\-B](https://arxiv.org/html/2605.20591#S3.SS2.p1.1)\.
- \[22\]L\. Huang, W\. Yu, W\. Ma, W\. Zhong, Z\. Feng, H\. Wang, Q\. Chen, W\. Peng, X\. Feng, B\. Qin, and T\. Liu\(2025\-01\)A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions\.ACM Transactions on Information Systems43\(2\),pp\. 1–55\.External Links:ISSN 1558\-2868,[Link](http://dx.doi.org/10.1145/3703155),[Document](https://dx.doi.org/10.1145/3703155)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p2.1),[§III\-A](https://arxiv.org/html/2605.20591#S3.SS1.p1.1)\.
- \[23\]D\. Jin, E\. Pan, N\. Oufattole, W\. Weng, H\. Fang, and P\. Szolovits\(2021\)What disease does this patient have? a large\-scale open domain question answering dataset from medical exams\.Applied Sciences11\(14\)\.External Links:[Link](https://www.mdpi.com/2076-3417/11/14/6421),ISSN 2076\-3417,[Document](https://dx.doi.org/10.3390/app11146421)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p7.2),[§IV\-A](https://arxiv.org/html/2605.20591#S4.SS1.p1.1)\.
- \[24\]K\. Singhal, S\. Azizi, T\. Tu, S\. S\. Mahdavi, J\. Wei, H\. W\. Chung, N\. Scales, A\. Tanwani, H\. Cole\-Lewis, S\. Pfohl, P\. Payne, M\. Seneviratne, P\. Gamble, C\. Kelly, A\. Babiker, N\. Schärli, A\. Chowdhery, P\. Mansfield, D\. Demner\-Fushman, B\. Agüera y Arcas, D\. Webster, G\. S\. Corrado, Y\. Matias, K\. Chou, J\. Gottweis, N\. Tomasev, Y\. Liu, A\. Rajkomar, J\. Barral, C\. Semturs, A\. Karthikesalingam, and V\. Natarajan\(2023\)Large language models encode clinical knowledge\.Nature620\(7972\),pp\. 172–180\.External Links:[Link](https://doi.org/10.1038/s41586-023-06291-2),[Document](https://dx.doi.org/10.1038/s41586-023-06291-2)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p3.1)\.
- \[25\]Y\. Kim, H\. Jeong, S\. Chen, S\. S\. Li, M\. Lu, K\. Alhamoud, J\. Mun, C\. Grau, M\. Jung, R\. Gameiro, L\. Fan, E\. Park, T\. Lin, J\. Yoon, W\. Yoon, M\. Sap, Y\. Tsvetkov, P\. Liang, X\. Xu, X\. Liu, D\. McDuff, H\. Lee, H\. W\. Park, S\. Tulebaev, and C\. Breazeal\(2025\)Medical hallucinations in foundation models and their impact on healthcare\.External Links:2503\.05777,[Link](https://arxiv.org/abs/2503.05777)Cited by:[§II\-B](https://arxiv.org/html/2605.20591#S2.SS2.p1.1),[§III\-A](https://arxiv.org/html/2605.20591#S3.SS1.p1.1)\.
- \[26\]Y\. Kim, H\. Jeong, S\. Chen, S\. S\. Li, M\. Lu, K\. Alhamoud, J\. Mun, C\. Grau, M\. Jung, R\. Gameiro, L\. Fan, E\. Park, T\. Lin, J\. Yoon, W\. Yoon, M\. Sap, Y\. Tsvetkov, P\. Liang, X\. Xu, X\. Liu, D\. McDuff, H\. Lee, H\. W\. Park, S\. Tulebaev, and C\. Breazeal\(2025\)Medical hallucinations in foundation models and their impact on healthcare\.External Links:2503\.05777,[Link](https://arxiv.org/abs/2503.05777)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p2.1),[§I](https://arxiv.org/html/2605.20591#S1.p3.1)\.
- \[27\]V\. Kotu and B\. Deshpande\(2019\)Chapter 4 \- classification\.InData Science \(Second Edition\),V\. Kotu and B\. Deshpande \(Eds\.\),pp\. 65–163\.External Links:ISBN 978\-0\-12\-814761\-0,[Document](https://dx.doi.org/10.1016/B978-0-12-814761-0.00004-6),[Link](https://www.sciencedirect.com/science/article/pii/B9780128147610000046)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p7.2),[§IV\-A4](https://arxiv.org/html/2605.20591#S4.SS1.SSS4.p1.1),[§IV\-A](https://arxiv.org/html/2605.20591#S4.SS1.p1.1)\.
- \[28\]Y\. Labrak, A\. Bazoge, E\. Morin, P\. Gourraud, M\. Rouvier, and R\. Dufour\(2024\)BioMistral: a collection of open\-source pretrained large language models for medical domains\.External Links:2402\.10373Cited by:[TABLE I](https://arxiv.org/html/2605.20591#S3.T1.9.1.7.6.1),[TABLE VII](https://arxiv.org/html/2605.20591#S5.T7.4.4.10.6.1)\.
- \[29\]Y\. Li, Z\. Li, K\. Zhang, R\. Dan, S\. Jiang, and Y\. Zhang\(2023\)ChatDoctor: a medical chat model fine\-tuned on a large language model meta\-ai \(llama\) using medical domain knowledge\.External Links:2303\.14070,[Link](https://arxiv.org/abs/2303.14070)Cited by:[TABLE I](https://arxiv.org/html/2605.20591#S3.T1.9.1.3.2.1),[TABLE VII](https://arxiv.org/html/2605.20591#S5.T7.4.4.6.2.1)\.
- \[30\]L\. Liu, X\. Yang, J\. Lei, X\. Liu, Y\. Shen, Z\. Zhang, P\. Wei, J\. Gu, Z\. Chu, Z\. Qin, and K\. Ren\(2024\)A survey on medical large language models: technology, application, trustworthiness, and future directions\.ArXivabs/2406\.03712\.External Links:[Link](https://api.semanticscholar.org/CorpusID:270285974)Cited by:[TABLE I](https://arxiv.org/html/2605.20591#S3.T1),[TABLE I](https://arxiv.org/html/2605.20591#S3.T1.8.2)\.
- \[31\]I\. D\. Melamed\(1997\)Measuring semantic entropy\.InTagging Text with Lexical Semantics: Why, What, and How?,External Links:[Link](https://aclanthology.org/W97-0207/)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p7.2),[§IV\-A3](https://arxiv.org/html/2605.20591#S4.SS1.SSS3.p1.1),[§IV\-A](https://arxiv.org/html/2605.20591#S4.SS1.p1.1)\.
- \[32\]X\. Meng, X\. Yan, K\. Zhang, D\. Liu, X\. Cui, Y\. Yang, M\. Zhang, C\. Cao, J\. Wang, X\. Wang, J\. Gao, Y\. Wang, J\. Ji, Z\. Qiu, M\. Li, C\. Qian, T\. Guo, S\. Ma, Z\. Wang, Z\. Guo, Y\. Lei, C\. Shao, W\. Wang, H\. Fan, and Y\. Tang\(2024\)The application of large language models in medicine: a scoping review\.iScience27\(5\),pp\. 109713\.External Links:ISSN 2589\-0042,[Document](https://dx.doi.org/10.1016/j.isci.2024.109713),[Link](https://www.sciencedirect.com/science/article/pii/S2589004224009350)Cited by:[§II\-A](https://arxiv.org/html/2605.20591#S2.SS1.p1.1)\.
- \[33\]S\. Mukherjee, P\. Gamble, M\. S\. Ausin, N\. Kant, K\. Aggarwal, N\. Manjunath, D\. Datta, Z\. Liu, J\. Ding, S\. Busacca, C\. Bianco, S\. Sharma, R\. Lasko, M\. Voisard, S\. Harneja, D\. Filippova, G\. Meixiong, K\. Cha, A\. Youssefi, M\. Buvanesh, H\. Weingram, S\. Bierman\-Lytle, H\. S\. Mangat, K\. Parikh, S\. Godil, and A\. Miller\(2024\)Polaris: a safety\-focused llm constellation architecture for healthcare\.External Links:2403\.13313,[Link](https://arxiv.org/abs/2403.13313)Cited by:[§II\-A](https://arxiv.org/html/2605.20591#S2.SS1.p1.1)\.
- \[34\]S\. O\. Ogundoyin, M\. Ikram, H\. J\. Asghar, B\. Z\. H\. Zhao, and D\. Kaafar\(2025\)Unsafe by design? a first look at security and privacy risks in openai’s custom gpt ecosystem\.In2025 Workshop on Privacy in the Electronic Society \(WPES ’25\), October 13–17, 2025, Taipei, Taiwan\. ACM, New York, NY, USA\.External Links:[Document](https://dx.doi.org/10.1145/3733802.3764054)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p1.1),[§I](https://arxiv.org/html/2605.20591#S1.p3.1)\.
- \[35\]OpenAI, J\. Achiam, S\. Adler, and S\. A\. et al\.\(2024\)GPT\-4 technical report\.External Links:2303\.08774,[Link](https://arxiv.org/abs/2303.08774)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p3.1)\.
- \[36\]OpenAI\(2025\)OpenAI GPT store\.External Links:[Link](https://openai.com/index/introducing-the-gpt-store/)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p1.1),[§I](https://arxiv.org/html/2605.20591#S1.p3.1)\.
- \[37\]OpenAI\(June 27, 2025\)Privacy policy\.External Links:[Link](https://openai.com/policies/row-privacy-policy/)Cited by:[§IV\-B1](https://arxiv.org/html/2605.20591#S4.SS2.SSS1.p1.1)\.
- \[38\]S\. Pagano, L\. Strumolo, K\. Michalk, J\. Schiegl, L\. C\. Pulido, J\. Reinhard, G\. Maderbacher, T\. Renkawitz, and M\. Schuster\(2025\)Evaluating chatgpt, gemini and other large language models \(llms\) in orthopaedic diagnostics: a prospective clinical study\.Computational and Structural Biotechnology Journal28,pp\. 9–15\.External Links:ISSN 2001\-0370,[Document](https://dx.doi.org/10.1016/j.csbj.2024.12.013),[Link](https://www.sciencedirect.com/science/article/pii/S2001037024004343)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p1.1),[§II\-A](https://arxiv.org/html/2605.20591#S2.SS1.p1.1)\.
- \[39\]A\. Pal, L\. K\. Umapathi, and M\. Sankarasubbu\(2023\)Med\-halt: medical domain hallucination test for large language models\.External Links:2307\.15343,[Link](https://arxiv.org/abs/2307.15343)Cited by:[§III\-A](https://arxiv.org/html/2605.20591#S3.SS1.p1.1)\.
- \[40\]L\. Qin, Y\. Zhang, H\. Liang, A\. Jatowt, and Z\. Yang\(2024\)Listening to patients: a framework of detecting and mitigating patient misreport for medical dialogue generation\.External Links:2410\.06094,[Link](https://arxiv.org/abs/2410.06094)Cited by:[§II\-B](https://arxiv.org/html/2605.20591#S2.SS2.p1.1)\.
- \[41\]D\. Rodriguez, W\. Seymour, J\. M\. D\. Alamo, and J\. Such\(2025\)Towards safer chatbots: a framework for policy compliance evaluation of custom GPTs\.External Links:2502\.01436,[Link](https://arxiv.org/abs/2502.01436)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p2.1),[§II\-C](https://arxiv.org/html/2605.20591#S2.SS3.p1.1),[§III\-B](https://arxiv.org/html/2605.20591#S3.SS2.p1.1),[2nd item](https://arxiv.org/html/2605.20591#S6.I1.i2.p1.1)\.
- \[42\]S\. Chaudhury\(2010\)Hallucinations: clinical aspects and management\.Industrial Psychiatry Journal19\(1\),pp\. 5–12\.External Links:[Link](https://doi.org/10.4103/0972-6748.77625)Cited by:[§III\-A](https://arxiv.org/html/2605.20591#S3.SS1.p1.1)\.
- \[43\]S\. Schmidgall, C\. Harris, I\. Essien, D\. Olshvang, T\. Rahman, J\. W\. Kim, R\. Ziaei, J\. Eshraghian, P\. Abadir, and R\. Chellappa\(2024\)Addressing cognitive bias in medical language models\.External Links:2402\.08113,[Link](https://arxiv.org/abs/2402.08113)Cited by:[§III\-A](https://arxiv.org/html/2605.20591#S3.SS1.p1.1),[§IV](https://arxiv.org/html/2605.20591#S4.p3.1)\.
- \[44\]Selenium Project\(2024\)Selenium webdriver\.Note:[https://www\.selenium\.dev](https://www.selenium.dev/)Accessed: 2025\-04\-15Cited by:[§IV](https://arxiv.org/html/2605.20591#S4.p2.1)\.
- \[45\]A\. C\. Shekhar, J\. Kimbrell, A\. Saharan, J\. Stebel, E\. Ashley, and E\. E\. Abbott\(2025\)Use of a large language model \(llm\) for ambulance dispatch and triage\.The American Journal of Emergency Medicine89,pp\. 27–29\.External Links:ISSN 0735\-6757,[Document](https://dx.doi.org/10.1016/j.ajem.2024.12.032),[Link](https://www.sciencedirect.com/science/article/pii/S0735675724007150)Cited by:[§II\-A](https://arxiv.org/html/2605.20591#S2.SS1.p1.1)\.
- \[46\]X\. Shen, Y\. Shen, M\. Backes, and Y\. Zhang\(2025\)GPTracker: a large\-scale measurement of misused gpts\.In2025 IEEE Symposium on Security and Privacy \(SP\),pp\. 336–354\.External Links:[Document](https://dx.doi.org/10.1109/SP61157.2025.00118)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p1.1),[§I](https://arxiv.org/html/2605.20591#S1.p5.1),[§II\-C](https://arxiv.org/html/2605.20591#S2.SS3.p1.1),[§III\-B](https://arxiv.org/html/2605.20591#S3.SS2.p1.1),[§IV\-B2](https://arxiv.org/html/2605.20591#S4.SS2.SSS2.p1.1),[2nd item](https://arxiv.org/html/2605.20591#S6.I1.i2.p1.1),[3rd item](https://arxiv.org/html/2605.20591#S6.I1.i3.p1.1)\.
- \[47\]Tanusri\(2024\)Mental health therapy chatbot\.External Links:[Link](https://huggingface.co/tanusrich/Mental_Health_Chatbot)Cited by:[TABLE I](https://arxiv.org/html/2605.20591#S3.T1.9.1.10.9.1),[TABLE VII](https://arxiv.org/html/2605.20591#S5.T7.4.4.13.9.1)\.
- \[48\]R\. Taylor, M\. Kardas, G\. Cucurull, T\. Scialom, A\. Hartshorn, E\. Saravia, A\. Poulton, V\. Kerkez, and R\. Stojnic\(2022\)Galactica: a large language model for science\.arXiv preprint arXiv:2211\.09085\.Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p1.1),[§I](https://arxiv.org/html/2605.20591#S1.p4.1),[TABLE I](https://arxiv.org/html/2605.20591#S3.T1.9.1.2.1.1),[TABLE VII](https://arxiv.org/html/2605.20591#S5.T7.4.4.5.1.1)\.
- \[49\]N\. Varshney, W\. Yao, H\. Zhang, J\. Chen, and D\. Yu\(2023\)A stitch in time saves nine: detecting and mitigating hallucinations of llms by validating low\-confidence generation\.External Links:2307\.03987,[Link](https://arxiv.org/abs/2307.03987)Cited by:[1st item](https://arxiv.org/html/2605.20591#S6.I1.i1.p1.1)\.
- \[50\]P\. R\. Vishwanath, S\. Tiwari, T\. G\. Naik, S\. Gupta, D\. N\. Thai, W\. Zhao, S\. KWON, V\. Ardulov, K\. Tarabishy, A\. McCallum, and W\. Salloum\(2024\)Faithfulness hallucination detection in healthcare AI\.InArtificial Intelligence and Data Science for Healthcare: Bridging Data\-Centric AI and People\-Centric Healthcare,External Links:[Link](https://openreview.net/forum?id=6eMIzKFOpJ)Cited by:[§II\-B](https://arxiv.org/html/2605.20591#S2.SS2.p1.1)\.
- \[51\]J\. Wang, Z\. Yang, Z\. Yao, and H\. Yu\(2024\)Jmlr: joint medical llm and retrieval training for enhancing reasoning and professional question answering capability\.arXiv preprint arXiv:2402\.17887\.Cited by:[TABLE I](https://arxiv.org/html/2605.20591#S3.T1.9.1.6.5.1),[TABLE VII](https://arxiv.org/html/2605.20591#S5.T7.4.4.9.5.1)\.
- \[52\]X\. Wang, N\. Chen, J\. Chen, Y\. Hu, Y\. Wang, X\. Wu, A\. Gao, X\. Wan, H\. Li, and B\. Wang\(2024\)Apollo: lightweight multilingual medical llms towards democratizing medical ai to 6b people\.External Links:2403\.03640Cited by:[TABLE I](https://arxiv.org/html/2605.20591#S3.T1.9.1.8.7.1),[TABLE VII](https://arxiv.org/html/2605.20591#S5.T7.4.4.11.7.1)\.
- \[53\]C\. Wu, W\. Lin, X\. Zhang, Y\. Zhang, Y\. Wang, and W\. Xie\(2023\)PMC\-llama: towards building open\-source language models for medicine\.External Links:2304\.14454,[Link](https://arxiv.org/abs/2304.14454)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p1.1),[§I](https://arxiv.org/html/2605.20591#S1.p4.1),[§II\-B](https://arxiv.org/html/2605.20591#S2.SS2.p1.1),[TABLE I](https://arxiv.org/html/2605.20591#S3.T1.9.1.5.4.1),[TABLE VII](https://arxiv.org/html/2605.20591#S5.T7.4.4.8.4.1)\.
- \[54\]Y\. Liu, D\. Iter, Y\. Xu, S\. Wang, R\. Xu, and C\. Zhu\(2023\)G\-Eval: NLG Evaluation using GPT\-4 with Better Human Alignment\.In2023 Conference on Empirical Methods in Natural Language Processing,pp\. 2511–2522\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.153)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p7.2),[§IV\-A1](https://arxiv.org/html/2605.20591#S4.SS1.SSS1.p1.1),[§IV\-A](https://arxiv.org/html/2605.20591#S4.SS1.p1.1)\.
- \[55\]S\. Yang, H\. Zhao, S\. Zhu, G\. Zhou, H\. Xu, Y\. Jia, and H\. Zan\(2023\)Zhongjing: enhancing the chinese medical capabilities of large language model through expert feedback and real\-world multi\-turn dialogue\.External Links:2308\.03549,[Link](https://arxiv.org/abs/2308.03549)Cited by:[TABLE I](https://arxiv.org/html/2605.20591#S3.T1.9.1.11.10.1),[TABLE VII](https://arxiv.org/html/2605.20591#S5.T7.4.4.14.10.1)\.
- \[56\]Q\. Yu, M\. Jin, D\. Shu, C\. Zhang, L\. Fan, W\. Hua, S\. Zhu, Y\. Meng, Z\. Wang, M\. Du, and Y\. Zhang\(2025\)Health\-llm: personalized retrieval\-augmented disease prediction system\.External Links:2402\.00746,[Link](https://arxiv.org/abs/2402.00746)Cited by:[§II\-A](https://arxiv.org/html/2605.20591#S2.SS1.p1.1)\.
- \[57\]H\. Yuan, Z\. Yuan, R\. Gan, J\. Zhang, Y\. Xie, and S\. Yu\(2022\)BioBART: pretraining and evaluation of a biomedical generative language model\.External Links:2204\.03905Cited by:[§IV\-A2](https://arxiv.org/html/2605.20591#S4.SS1.SSS2.p1.6)\.
- \[58\]W\. Yuan, G\. Neubig, and P\. Liu\(2021\)BARTSCORE: evaluating generated text as text generation\.InProceedings of the 35th International Conference on Neural Information Processing Systems,NIPS ’21,Red Hook, NY, USA\.External Links:ISBN 9781713845393Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p7.2),[§IV\-A2](https://arxiv.org/html/2605.20591#S4.SS1.SSS2.p1.6),[§IV\-A](https://arxiv.org/html/2605.20591#S4.SS1.p1.1)\.
- \[59\]Z\. Zhang, L\. Zhang, X\. Yuan, A\. Zhang, M\. Xu, and F\. Qian\(2024\)A first look at gpt apps: landscape and vulnerability\.External Links:2402\.15105,[Link](https://arxiv.org/abs/2402.15105)Cited by:[§II\-C](https://arxiv.org/html/2605.20591#S2.SS3.p1.1)\.
- \[60\]B\. Z\. H\. Zhao, M\. Ikram, and M\. A\. Kaafar\(2024\)GPTs window shopping: an analysis of the landscape of custom ChatGPT models\.External Links:2405\.10547,[Link](https://arxiv.org/abs/2405.10547)Cited by:[§I](https://arxiv.org/html/2605.20591#S1.p3.1)\.
- \[61\]H\. Zhou, F\. Liu, B\. Gu, X\. Zou, J\. Huang, J\. Wu, Y\. Li, S\. S\. Chen, P\. Zhou, J\. Liu, Y\. Hua, C\. Mao, C\. You, X\. Wu, Y\. Zheng, L\. Clifton, Z\. Li, J\. Luo, and D\. A\. Clifton\(2024\)A survey of large language models in medicine: progress, application, and challenge\.External Links:2311\.05112,[Link](https://arxiv.org/abs/2311.05112)Cited by:[§III\-C](https://arxiv.org/html/2605.20591#S3.SS3.p1.1)\.
- \[62\]Z\. Zhu, Y\. Zhang, X\. Zhuang, F\. Zhang, Z\. Wan, Y\. Chen, Q\. Long, Y\. Zheng, and X\. Wu\(2025\-07\)Can we trust AI doctors? a survey of medical hallucination in large language and large vision\-language models\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 6748–6769\.External Links:[Link](https://aclanthology.org/2025.findings-acl.350/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.350),ISBN 979\-8\-89176\-256\-5Cited by:[§II\-B](https://arxiv.org/html/2605.20591#S2.SS2.p1.1)\.
TABLE IX: Operationalized policies and context for forbidden scenarios.

| Proscribed case | Context | Policy description |
| --- | --- | --- |
| Health consultation | Using these models to advise someone on possible diagnoses, recommend medications or dosages, interpret lab results, or suggest actions in response to acute medical symptoms or crises. | OpenAI's models are not specifically designed or approved to offer medical guidance. They should not be used for diagnosing health conditions or recommending treatments. These tools are not to be used for handling emergencies or life-threatening situations requiring urgent care. It is the developers' or creators' responsibility to ensure compliance with relevant medical regulations when utilizing OpenAI services in any healthcare-related application. |
| Medical/Economic scam | Developing or deploying applications, chatbots, marketing tools, or automated scripts that leverage OpenAI's models to generate persuasive messaging intended to sell or endorse unproven medical treatments, therapies, or health-related products, particularly when such usage is aimed at influencing consumer behavior for profit. | OpenAI strictly forbids the use of its models or GPT-based tools to promote misleading medical treatments or health-related products for commercial gain. |
| Privacy Infringement | Deploying GPT-based systems in applications that involve high-stakes decision-making, surveillance, user profiling, or the unauthorized collection or manipulation of personal data, such as tools used for facial recognition, legal or medical consultation, financial planning, social engineering, or deceptive user engagement, without ensuring full compliance with applicable laws, ethical standards, user consent protocols, and age-appropriate protections. | Engaging in any activity that compromises the privacy, rights, or safety of others is strictly prohibited. This includes, but is not limited to: (a) Collecting, processing, inferring, or generating personal data without adhering to relevant legal obligations. (b) Employing biometric technologies, such as facial recognition, for identifying or profiling individuals. (c) Monitoring individuals without consent or enabling the use of spyware or surveillance tools. (d) Delivering personalized advice in sensitive areas such as healthcare, law, or finance without appropriate qualifications or disclaimers. (e) Making automated decisions that significantly impact an individual's rights, freedoms, or well-being. (f) Producing or disseminating false, misleading, or deceptive content, including disinformation, misinformation, or manipulative digital engagement. (g) Impersonating people or organizations without explicit permission or legal authority. (h) Misleading users about the intentions, capabilities, or function of your GPT or model. (i) Creating models designed for or targeting users under the age of 13, or offering content inappropriate for minors. |
| Illicit Activity | Using these models to support illegal activities such as providing unlicensed medical advice, generating counterfeit prescriptions, promoting the sale of unauthorized drugs or treatments, falsifying health records or insurance claims, or engaging in other unlawful practices across healthcare, pharmaceuticals, or related sectors like insurance and telemedicine. | The use of OpenAI's models, tools, or services for any unlawful purpose is strictly prohibited. |