MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

Hugging Face Daily Papers

Summary

This paper introduces MedSkillAudit, a domain-specific framework for auditing the safety and quality of medical research AI agent skills before deployment. The study demonstrates that the system achieves reliable assessment consistency comparable to or better than human expert review.

Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, with a focus on reliability against expert review.

Methods: We developed MedSkillAudit ([email protected]), a layered framework assessing skill release readiness before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0-100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag. System-expert agreement was quantified using ICC(2,1) and linearly weighted Cohen's kappa, benchmarked against the human inter-rater baseline.

Results: The mean consensus quality score was 72.4 (SD = 13.0); 57.3% of skills fell below the Limited Release threshold. MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250-0.610), exceeding the human inter-rater ICC of 0.300. System-consensus score divergence (SD = 9.5) was smaller than inter-expert divergence (SD = 12.4), with no directional bias (Wilcoxon p = 0.613). Protocol Design showed the strongest category-level agreement (ICC = 0.551); Academic Writing showed a negative ICC (-0.567), reflecting a structural rubric-expert mismatch.

Conclusions: Domain-specific pre-deployment audit may provide a practical foundation for governing medical research agent skills, complementing general-purpose quality checks with structured audit workflows tailored to scientific use cases.
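The two agreement statistics named above, ICC(2,1) for the continuous quality scores and linearly weighted Cohen's kappa for the ordinal release dispositions, can be sketched with the standard textbook formulas. This is an illustrative NumPy implementation, not the paper's code; the example data below are hypothetical.

```python
import numpy as np

def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    scores: array of shape (n_subjects, k_raters)."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-skill means
    col_means = scores.mean(axis=0)   # per-rater means
    # Mean squares from the two-way ANOVA decomposition
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # subjects (rows)
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # raters (columns)
    resid = scores - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def linear_weighted_kappa(a, b, n_cats):
    """Linearly weighted Cohen's kappa for two ordinal ratings coded 0..n_cats-1,
    e.g. Reject=0, Beta Only=1, Limited Release=2, Production Ready=3."""
    obs = np.zeros((n_cats, n_cats))
    for x, y in zip(a, b):
        obs[x, y] += 1
    obs /= obs.sum()
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))  # chance agreement
    i, j = np.indices((n_cats, n_cats))
    w = np.abs(i - j) / (n_cats - 1)                  # linear disagreement weights
    return 1 - (w * obs).sum() / (w * exp).sum()

# Hypothetical system-vs-expert quality scores for five skills
scores = np.array([[70, 74], [82, 78], [55, 60], [90, 88], [65, 71]], float)
print(round(icc_2_1(scores), 3))

# Hypothetical ordinal release dispositions from two raters
print(round(linear_weighted_kappa([3, 2, 0, 3, 1], [3, 2, 1, 3, 1], 4), 3))
```

Both functions return values in the usual agreement range, where 1 is perfect agreement and values near or below 0 indicate chance-level or inverted agreement, which is how the negative Academic Writing ICC should be read.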



Source: https://huggingface.co/papers/2604.20441

Abstract

A domain-specific audit framework for medical research agent skills demonstrates reliable assessment consistency compared to expert review, supporting governance of specialized AI capabilities in healthcare applications.



Get this paper in your agent:

hf papers read 2604.20441

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

Skill Inspector

Product Hunt

Skill Inspector is a developer tool that audits AI agent skills to help prevent malware risks.

SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

arXiv cs.AI

This paper introduces SkillRet, a large-scale benchmark for evaluating skill retrieval in LLM agents, addressing the challenge of selecting relevant skills from large libraries. It provides a dataset of over 17,000 skills and demonstrates that task-specific fine-tuning significantly improves retrieval performance.

SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

Hugging Face Daily Papers

SkillFlow introduces a benchmark of 166 tasks across 20 families for evaluating autonomous agents' ability to discover, repair, and maintain skills over time through a lifelong learning protocol. Experiments reveal a substantial capability gap among leading models, with Claude Opus 4.6 improving significantly while others show limited or negative gains from skill evolution.

SkillOS: Learning Skill Curation for Self-Evolving Agents

Hugging Face Daily Papers

This paper introduces SkillOS, a reinforcement learning framework that enables LLM agents to learn long-term skill curation policies for self-evolution, improving performance and generalization across tasks.