MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills
Summary
This paper introduces MedSkillAudit, a domain-specific framework for auditing the safety and quality of medical research AI agent skills before deployment. The study demonstrates that the system achieves reliable assessment consistency comparable to or better than human expert review.
Source: https://huggingface.co/papers/2604.20441
Abstract
A domain-specific audit framework for medical research agent skills demonstrates reliable assessment consistency against expert review, supporting governance of specialized AI capabilities in healthcare applications.
Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, with a focus on reliability against expert review.
Methods: We developed MedSkillAudit ([email protected]), a layered framework assessing skill release readiness before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0-100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag. System-expert agreement was quantified using ICC(2,1) and linearly weighted Cohen's kappa, benchmarked against the human inter-rater baseline.
Results: The mean consensus quality score was 72.4 (SD = 13.0); 57.3% of skills fell below the Limited Release threshold. MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250-0.610), exceeding the human inter-rater ICC of 0.300. System-consensus score divergence (SD = 9.5) was smaller than inter-expert divergence (SD = 12.4), with no directional bias (Wilcoxon p = 0.613). Protocol Design showed the strongest category-level agreement (ICC = 0.551); Academic Writing showed a negative ICC (-0.567), reflecting a structural rubric-expert mismatch.
Conclusions: Domain-specific pre-deployment audit may provide a practical foundation for governing medical research agent skills, complementing general-purpose quality checks with structured audit workflows tailored to scientific use cases.
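The agreement statistics named in the abstract (ICC(2,1) for the 0-100 quality scores, linearly weighted Cohen's kappa for the ordinal release dispositions, and a Wilcoxon signed-rank check for directional bias) can be computed from paired system/expert ratings as sketched below. This is a minimal illustrative sketch: the synthetic data, variable names, and the 0-3 disposition coding are assumptions for demonstration, not values or code from the paper.

```python
# Sketch of the agreement metrics described in the abstract, on synthetic data.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import cohen_kappa_score


def icc_2_1(ratings: np.ndarray) -> float:
    """Two-way random-effects, absolute-agreement, single-measure ICC(2,1).

    `ratings` has shape (n_subjects, k_raters); here the two "raters" would be
    the audit-system score and the expert consensus score for each skill.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * np.sum((ratings.mean(axis=1) - grand) ** 2)   # between-skill
    ss_cols = n * np.sum((ratings.mean(axis=0) - grand) ** 2)   # between-rater
    ss_total = np.sum((ratings - grand) ** 2)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


rng = np.random.default_rng(0)
# Hypothetical paired quality scores for 75 skills (0-100 scale).
expert = rng.integers(40, 100, size=75).astype(float)
system = np.clip(expert + rng.normal(0, 10, size=75), 0, 100)
print("ICC(2,1):", round(icc_2_1(np.column_stack([expert, system])), 3))

# Ordinal release dispositions, coded 0=Reject ... 3=Production Ready,
# compared with linearly weighted Cohen's kappa.
disp_expert = rng.integers(0, 4, size=75)
disp_system = np.clip(disp_expert + rng.integers(-1, 2, size=75), 0, 3)
print("weighted kappa:",
      round(cohen_kappa_score(disp_expert, disp_system, weights="linear"), 3))

# Directional bias between paired scores (Wilcoxon signed-rank test).
print("Wilcoxon p:", round(wilcoxon(expert, system).pvalue, 3))
```

The ICC(2,1) form is used here because each skill receives one system score and one consensus score, so agreement is assessed for single measurements under a two-way random-effects, absolute-agreement model.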
Get this paper in your agent:
hf papers read 2604.20441
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Skill Inspector
Skill Inspector is a developer tool that audits AI agent skills to help prevent malware risks.
SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
This paper introduces SkillRet, a large-scale benchmark for evaluating skill retrieval in LLM agents, addressing the challenge of selecting relevant skills from large libraries. It provides a dataset of over 17,000 skills and demonstrates that task-specific fine-tuning significantly improves retrieval performance.
SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
SkillFlow introduces a benchmark of 166 tasks across 20 families for evaluating autonomous agents' ability to discover, repair, and maintain skills over time through a lifelong learning protocol. Experiments reveal a substantial capability gap among leading models, with Claude Opus 4.6 improving significantly while others show limited or negative gains from skill evolution.
SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks
SkillLearnBench introduces the first benchmark for evaluating continual skill learning in LLM agents across 20 real-world tasks, revealing that no method dominates and scaling LLMs does not guarantee better skills.
SkillOS: Learning Skill Curation for Self-Evolving Agents
This paper introduces SkillOS, a reinforcement learning framework that enables LLM agents to learn long-term skill curation policies for self-evolution, improving performance and generalization across tasks.