MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

Hugging Face Daily Papers

Summary

This paper introduces MedSkillAudit, a domain-specific framework for auditing the safety and quality of medical research AI agent skills before deployment. In a preliminary evaluation of 75 skills, system-expert agreement (ICC(2,1) = 0.449) exceeded the human inter-rater baseline (ICC = 0.300), although agreement varied widely by category.

Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, with a focus on reliability against expert review.

Methods: We developed MedSkillAudit ([email protected]), a layered framework assessing skill release readiness before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0-100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag. System-expert agreement was quantified using ICC(2,1) and linearly weighted Cohen's kappa, benchmarked against the human inter-rater baseline.

Results: The mean consensus quality score was 72.4 (SD = 13.0); 57.3% of skills fell below the Limited Release threshold. MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250-0.610), exceeding the human inter-rater ICC of 0.300. System-consensus score divergence (SD = 9.5) was smaller than inter-expert divergence (SD = 12.4), with no directional bias (Wilcoxon p = 0.613). Protocol Design showed the strongest category-level agreement (ICC = 0.551); Academic Writing showed a negative ICC (-0.567), reflecting a structural rubric-expert mismatch.

Conclusions: Domain-specific pre-deployment audit may provide a practical foundation for governing medical research agent skills, complementing general-purpose quality checks with structured audit workflows tailored to scientific use cases.
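
The two agreement statistics named in the Methods can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the integer coding of the release dispositions, and the toy scores are all assumptions made here, and none of the numbers come from the paper. ICC(2,1) is computed from the two-way ANOVA mean squares (two-way random effects, absolute agreement, single rater); the kappa uses linear disagreement weights |i - j| / (C - 1).

```python
from collections import Counter

# Hypothetical ordinal coding, worst to best (an assumption, not from the paper).
DISPOSITIONS = ["Reject", "Beta Only", "Limited Release", "Production Ready"]


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one per skill, each holding one score per rater.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for skills (rows), raters (columns), and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def linear_weighted_kappa(a, b, n_categories):
    """Cohen's kappa with linear disagreement weights |i - j| / (C - 1)."""
    n = len(a)
    observed = Counter(zip(a, b))
    marg_a, marg_b = Counter(a), Counter(b)
    obs_dis = exp_dis = 0.0
    for i in range(n_categories):
        for j in range(n_categories):
            w = abs(i - j) / (n_categories - 1)
            obs_dis += w * observed[(i, j)] / n
            exp_dis += w * (marg_a[i] / n) * (marg_b[j] / n)
    # Undefined (division by zero) if both raters only ever use one identical category.
    return 1.0 - obs_dis / exp_dis


# Toy data, invented for illustration: [system score, expert consensus score] per skill.
toy_scores = [[78, 74], [55, 62], [90, 85], [40, 52], [68, 70]]
toy_system = [3, 1, 2, 0, 2]   # indices into DISPOSITIONS
toy_expert = [3, 2, 2, 0, 1]

print("ICC(2,1):", round(icc_2_1(toy_scores), 3))
print("weighted kappa:",
      round(linear_weighted_kappa(toy_system, toy_expert, len(DISPOSITIONS)), 3))
```

Because ICC(2,1) measures absolute agreement, a rater who is consistently offset from the other lowers the coefficient, and the estimate can fall below zero when residual disagreement exceeds the variance between skills, as in the negative ICC reported for Academic Writing.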

Cached at: 05/08/26, 08:13 AM

Source: https://huggingface.co/papers/2604.20441

Abstract

A domain-specific audit framework for medical research agent skills demonstrates reliable assessment consistency compared to expert review, supporting governance of specialized AI capabilities in healthcare applications.

Similar Articles

Skill Inspector

Product Hunt

Skill Inspector is a developer tool that audits AI agent skills to help prevent malware risks.

Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task

arXiv cs.AI

This exploratory study evaluates whether augmenting AI agents with a medical research skill package improves the quality of transcriptomic research analysis outputs compared to native AI, using a multi-model human evaluation in an NSCLC biomarker task. Results show a directional but statistically non-significant improvement, highlighting the need for larger, more robust evaluations.

OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

arXiv cs.CL

OpenSkillEval is an automatic evaluation framework for auditing open-source skills used by LLM agents across multiple downstream tasks. Using over 600 dynamically generated tasks and 30 skills, the authors find that skill availability does not guarantee effective usage and that benefits depend heavily on the model and framework.