MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills
Summary
This paper introduces MedSkillAudit, a domain-specific framework for auditing the safety and quality of medical research AI agent skills before deployment. In a preliminary evaluation of 75 skills, system-expert agreement (ICC = 0.449) exceeded the human inter-rater baseline (ICC = 0.300), although agreement varied by category.
Source: https://huggingface.co/papers/2604.20441
Abstract
A domain-specific audit framework for medical research agent skills demonstrates reliable assessment consistency compared to expert review, supporting governance of specialized AI capabilities in healthcare applications.
Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, with a focus on reliability against expert review.
Methods: We developed MedSkillAudit ([email protected]), a layered framework assessing skill release readiness before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0-100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag. System-expert agreement was quantified using ICC(2,1) and linearly weighted Cohen’s kappa, benchmarked against the human inter-rater baseline.
Results: The mean consensus quality score was 72.4 (SD = 13.0); 57.3% of skills fell below the Limited Release threshold. MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250-0.610), exceeding the human inter-rater ICC of 0.300. System-consensus score divergence (SD = 9.5) was smaller than inter-expert divergence (SD = 12.4), with no directional bias (Wilcoxon p = 0.613). Protocol Design showed the strongest category-level agreement (ICC = 0.551); Academic Writing showed a negative ICC (-0.567), reflecting a structural rubric-expert mismatch.
Conclusions: Domain-specific pre-deployment audit may provide a practical foundation for governing medical research agent skills, complementing general-purpose quality checks with structured audit workflows tailored to scientific use cases.
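The two agreement statistics named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes the ICC is the two-way random-effects, absolute-agreement, single-rater form usually written ICC(2,1), and that the four release dispositions are scored on an equally spaced ordinal scale. The function names `icc_2_1` and `linear_weighted_kappa` and the toy ratings in the usage section are hypothetical.

```python
from statistics import mean

# Ordinal release dispositions named in the abstract, ordered worst to best.
# Equal spacing between adjacent levels is an assumption of this sketch.
DISPOSITIONS = ["Reject", "Beta Only", "Limited Release", "Production Ready"]


def icc_2_1(scores):
    """ICC(2,1) for a table of n subjects (rows) by k raters (columns).

    Assumed form: two-way random effects, absolute agreement, single rater.
    Raises ZeroDivisionError if every score in the table is identical.
    """
    n, k = len(scores), len(scores[0])
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]

    # Two-way ANOVA sums of squares: total, between subjects, between raters.
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def linear_weighted_kappa(a, b, n_categories):
    """Linearly weighted Cohen's kappa for two lists of integer category codes.

    Codes run from 0 to n_categories - 1. Disagreement weight is |i - j|
    scaled by the largest possible distance. Returns NaN when chance
    disagreement is zero (both raters used one identical category).
    """
    n = len(a)
    span = n_categories - 1
    pa = [a.count(c) / n for c in range(n_categories)]
    pb = [b.count(c) / n for c in range(n_categories)]

    # Observed and chance-expected weighted disagreement.
    observed = sum(abs(x - y) for x, y in zip(a, b)) / (n * span)
    expected = sum(
        pa[i] * pb[j] * abs(i - j)
        for i in range(n_categories)
        for j in range(n_categories)
    ) / span

    if expected == 0:
        return float("nan")
    return 1 - observed / expected


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the paper.
    system_scores = [82, 64, 71, 90, 55]
    expert_scores = [78, 70, 65, 88, 60]
    print(icc_2_1([list(p) for p in zip(system_scores, expert_scores)]))

    system_disp = ["Beta Only", "Reject", "Limited Release", "Production Ready"]
    expert_disp = ["Limited Release", "Reject", "Limited Release", "Beta Only"]
    to_code = DISPOSITIONS.index
    print(linear_weighted_kappa(
        [to_code(d) for d in system_disp],
        [to_code(d) for d in expert_disp],
        len(DISPOSITIONS),
    ))
```

Linear weights penalize a Production Ready versus Reject disagreement three times as heavily as a disagreement between adjacent dispositions, which is why a weighted kappa suits an ordinal outcome better than the unweighted form. Because ICC(2,1) measures absolute agreement, a rater who is consistently stricter or more lenient lowers the value even when the rank order of skills matches.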
Similar Articles
Skill Inspector
Skill Inspector is a developer tool that audits AI agent skills to help prevent malware risks.
SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing
SkillAudit introduces a framework for evolving LLM agent skills without ground-truth feedback by using paired trajectory auditing and contrastive evaluation. It achieves 73.9% average task reward across 89 tasks, outperforming baseline methods.
Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task
This exploratory study evaluates whether augmenting AI agents with a medical research skill package improves the quality of transcriptomic research analysis outputs compared to native AI, using a multi-model human evaluation in an NSCLC biomarker task. Results show a directional but statistically non-significant improvement, highlighting the need for larger, more robust evaluations.
OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents
OpenSkillEval is an automatic evaluation framework for auditing open-source skills used by LLM agents across multiple downstream tasks. Using over 600 dynamically generated tasks and 30 skills, the authors find that skill availability does not guarantee effective usage and that benefits depend heavily on the model and framework.
Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory
This paper introduces SkeMex, a self-evolving framework that enhances medical agents by distilling interaction trajectories into structured skill memory, enabling better long-term clinical reasoning through context-dependent utility estimation and governance.