MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills
Summary
This paper introduces MedSkillAudit, a domain-specific framework for auditing the safety and quality of medical research AI agent skills before deployment. The study demonstrates that the system achieves reliable assessment consistency comparable to or better than human expert review.
Source: https://huggingface.co/papers/2604.20441
Abstract
A domain-specific audit framework for medical research agent skills demonstrates reliable assessment consistency against expert review, supporting governance of specialized AI capabilities in healthcare applications.
Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, with a focus on reliability against expert review.
Methods: We developed MedSkillAudit ([email protected]), a layered framework assessing skill release readiness before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0-100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag. System-expert agreement was quantified using ICC(2,1) and linearly weighted Cohen's kappa, benchmarked against the human inter-rater baseline.
Results: The mean consensus quality score was 72.4 (SD = 13.0); 57.3% of skills fell below the Limited Release threshold. MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250-0.610), exceeding the human inter-rater ICC of 0.300. System-consensus score divergence (SD = 9.5) was smaller than inter-expert divergence (SD = 12.4), with no directional bias (Wilcoxon p = 0.613). Protocol Design showed the strongest category-level agreement (ICC = 0.551); Academic Writing showed a negative ICC (-0.567), reflecting a structural rubric-expert mismatch.
Conclusions: Domain-specific pre-deployment audit may provide a practical foundation for governing medical research agent skills, complementing general-purpose quality checks with structured audit workflows tailored to scientific use cases.
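The agreement statistics named in the abstract (ICC(2,1) for the 0-100 quality scores, linearly weighted Cohen's kappa for the ordinal release dispositions, and a Wilcoxon signed-rank check for directional bias) can be computed from paired system/expert ratings as sketched below. This is a minimal illustrative sketch: the synthetic data, variable names, and the 0-3 disposition coding are assumptions for demonstration, not values or code from the paper.

```python
# Sketch of the agreement metrics described in the abstract, on synthetic data.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import cohen_kappa_score


def icc_2_1(ratings: np.ndarray) -> float:
    """Two-way random-effects, absolute-agreement, single-measure ICC(2,1).

    `ratings` has shape (n_subjects, k_raters); here the two "raters" would be
    the audit-system score and the expert consensus score for each skill.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * np.sum((ratings.mean(axis=1) - grand) ** 2)   # between-skill
    ss_cols = n * np.sum((ratings.mean(axis=0) - grand) ** 2)   # between-rater
    ss_total = np.sum((ratings - grand) ** 2)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


rng = np.random.default_rng(0)
# Hypothetical paired quality scores for 75 skills (0-100 scale).
expert = rng.integers(40, 100, size=75).astype(float)
system = np.clip(expert + rng.normal(0, 10, size=75), 0, 100)
print("ICC(2,1):", round(icc_2_1(np.column_stack([expert, system])), 3))

# Ordinal release dispositions, coded 0=Reject ... 3=Production Ready,
# compared with linearly weighted Cohen's kappa.
disp_expert = rng.integers(0, 4, size=75)
disp_system = np.clip(disp_expert + rng.integers(-1, 2, size=75), 0, 3)
print("weighted kappa:",
      round(cohen_kappa_score(disp_expert, disp_system, weights="linear"), 3))

# Directional bias between paired scores (Wilcoxon signed-rank test).
print("Wilcoxon p:", round(wilcoxon(expert, system).pvalue, 3))
```

The ICC(2,1) form is used here because each skill receives one system score and one consensus score, so agreement is assessed for single measurements under a two-way random-effects, absolute-agreement model.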
Get this paper in your agent:
hf papers read 2604.20441
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Skill Inspector
Skill Inspector is a developer tool that audits AI agent skills to help prevent malware risks.
SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
This paper introduces SkillRet, a large-scale benchmark for evaluating skill retrieval in LLM agents, addressing the challenge of selecting relevant skills from large libraries. It provides a dataset of over 17,000 skills and demonstrates that task-specific fine-tuning significantly improves retrieval performance.
SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
SkillFlow introduces a benchmark of 166 tasks across 20 families for evaluating autonomous agents' ability to discover, repair, and maintain skills over time through a lifelong learning protocol. Experiments reveal a substantial capability gap among leading models, with Claude Opus 4.6 improving significantly while others show limited or negative gains from skill evolution.
SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks
SkillLearnBench introduces the first benchmark for evaluating continual skill learning in LLM agents across 20 real-world tasks, revealing that no method dominates and scaling LLMs does not guarantee better skills.
SkillOS: Learning Skill Curation for Self-Evolving Agents
This paper introduces SkillOS, a reinforcement learning framework that enables LLM agents to learn long-term skill curation policies for self-evolution, improving performance and generalization across tasks.