Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)

arXiv cs.LG 05/15/26, 04:00 AM Papers

reinforcement-learning tool-calling healthcare fhir llm-agents code-act multi-step-reasoning

Summary

This paper presents a reinforcement learning post-training pipeline for tool-calling LLM agents operating on FHIR healthcare data. With a smaller Qwen3-8B model, it reaches 77% answer correctness on FHIR-AgentBench, compared with 50% for o4-mini.

arXiv:2605.14126v1, announce type: new. Abstract: Fast Healthcare Interoperability Resources (FHIR) is the dominant standard for interoperable exchange of healthcare data. In FHIR, electronic health records form a directed graph of resources. Answering clinically meaningful questions over FHIR requires agents to perform multi-step reasoning, filtering, and aggregation across multiple resource types. Prior work shows that even tool-augmented LLM agents (retrieval, code execution, multi-turn planning) often select the wrong resources or violate traversal constraints. We study this problem in the context of FHIR-AgentBench, a benchmark for realistic question answering over real-world hospital data, and frame reasoning on FHIR as a sequential decision-making problem over a queryable structured graph. We implement a multi-turn CodeAct agent and post-train it with reinforcement learning using a custom harness and tools. An LLM judge provides execution-grounded rewards. Compared to prompt-based, closed-model baselines, RL post-training improves performance while enforcing data-integrity constraints. Empirically, our approach improves answer correctness from 50% (o4-mini) to 77% on FHIR-AgentBench using a smaller and cheaper Qwen3-8B model. We present an end-to-end post-training pipeline (environment building, harness construction, model training, and custom evaluation) that reliably improves multi-turn reasoning over structured clinical graphs.
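
To make the "directed graph of resources" framing concrete, here is a minimal Python sketch of a multi-step question (traverse, filter, aggregate) over a handful of simplified FHIR-style records. The field names `subject`, `encounter`, `code`, and `valueQuantity` follow the FHIR Observation and Encounter resources, but the data, the helper names (`build_reverse_index`, `referencing`, `mean_observation`), and the query itself are illustrative assumptions, not material from the paper or from FHIR-AgentBench.

```python
from collections import defaultdict

# Hypothetical, heavily simplified FHIR-style resources (not benchmark data).
RESOURCES = [
    {"resourceType": "Patient", "id": "p1"},
    {"resourceType": "Patient", "id": "p2"},
    {"resourceType": "Encounter", "id": "e1", "subject": {"reference": "Patient/p1"}},
    {"resourceType": "Encounter", "id": "e2", "subject": {"reference": "Patient/p1"}},
    {"resourceType": "Encounter", "id": "e3", "subject": {"reference": "Patient/p2"}},
    {"resourceType": "Observation", "id": "o1", "code": {"text": "Heart rate"},
     "encounter": {"reference": "Encounter/e1"}, "valueQuantity": {"value": 72}},
    {"resourceType": "Observation", "id": "o2", "code": {"text": "Heart rate"},
     "encounter": {"reference": "Encounter/e2"}, "valueQuantity": {"value": 88}},
    {"resourceType": "Observation", "id": "o3", "code": {"text": "Body temperature"},
     "encounter": {"reference": "Encounter/e1"}, "valueQuantity": {"value": 37.0}},
    {"resourceType": "Observation", "id": "o4", "code": {"text": "Heart rate"},
     "encounter": {"reference": "Encounter/e3"}, "valueQuantity": {"value": 100}},
]


def build_reverse_index(resources):
    """Map each 'Type/id' target to the (field, resource) pairs that point at it."""
    index = defaultdict(list)
    for res in resources:
        for field, value in res.items():
            if isinstance(value, dict) and "reference" in value:
                index[value["reference"]].append((field, res))
    return index


def referencing(index, target, resource_type, field):
    """Resources of one type that reference `target` through `field`."""
    return [r for f, r in index[target]
            if f == field and r["resourceType"] == resource_type]


def mean_observation(resources, patient_id, code_text):
    """Patient -> Encounters -> Observations, filter by code, then average."""
    index = build_reverse_index(resources)
    values = []
    for enc in referencing(index, f"Patient/{patient_id}", "Encounter", "subject"):
        for obs in referencing(index, f"Encounter/{enc['id']}", "Observation", "encounter"):
            if obs["code"]["text"] == code_text:
                values.append(obs["valueQuantity"]["value"])
    return sum(values) / len(values) if values else None


print(mean_observation(RESOURCES, "p1", "Heart rate"))  # mean of 72 and 88
```

Skipping the encounter hop, or following the wrong reference field, would pull in observations from another patient. That is the kind of resource-selection and traversal error the abstract attributes to prompt-based agents.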

Original Article

Cached at: 05/15/26, 06:27 AM

# Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)
Source: [https://arxiv.org/abs/2605.14126](https://arxiv.org/abs/2605.14126)
[View PDF](https://arxiv.org/pdf/2605.14126)

> Abstract: Fast Healthcare Interoperability Resources (FHIR) is the dominant standard for interoperable exchange of healthcare data. In FHIR, electronic health records form a directed graph of resources. Answering clinically meaningful questions over FHIR requires agents to perform multi-step reasoning, filtering, and aggregation across multiple resource types. Prior work shows that even tool-augmented LLM agents (retrieval, code execution, multi-turn planning) often select the wrong resources or violate traversal constraints. We study this problem in the context of FHIR-AgentBench, a benchmark for realistic question answering over real-world hospital data, and frame reasoning on FHIR as a sequential decision-making problem over a queryable structured graph. We implement a multi-turn CodeAct agent and post-train it with reinforcement learning using a custom harness and tools. An LLM judge provides execution-grounded rewards. Compared to prompt-based, closed-model baselines, RL post-training improves performance while enforcing data-integrity constraints. Empirically, our approach improves answer correctness from 50% (o4-mini) to 77% on FHIR-AgentBench using a smaller and cheaper Qwen3-8B model. We present an end-to-end post-training pipeline (environment building, harness construction, model training, and custom evaluation) that reliably improves multi-turn reasoning over structured clinical graphs.
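
The "sequential decision-making" framing with a judge-provided reward can be pictured as a multi-turn rollout: the agent emits code actions, a harness executes them, and a judge scores the final answer. The sketch below is a toy illustration under stated assumptions, not the paper's harness. `Episode`, `run_code`, `rollout`, and `scripted_policy` are hypothetical names, `stub_judge` is an exact-match stand-in for the LLM judge, and the RL update itself is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    question: str
    reference_answer: str
    turns: list = field(default_factory=list)  # (code, observation) pairs


def run_code(code, namespace):
    """Execute one code action; return its `result` variable or the error text."""
    try:
        exec(code, namespace)  # toy harness only: no sandboxing
        return str(namespace.get("result"))
    except Exception as exc:
        return f"error: {exc!r}"


def stub_judge(answer, reference):
    """Assumed stand-in for an LLM judge: 1.0 on a normalised exact match."""
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0


def rollout(policy, episode, tools, max_turns=4):
    """Run one episode and return the terminal reward."""
    namespace = dict(tools)  # tools are exposed as callables inside the code
    observation = episode.question
    for _ in range(max_turns):
        action = policy(observation, episode.turns)
        if action["type"] == "answer":
            return stub_judge(action["content"], episode.reference_answer)
        observation = run_code(action["content"], namespace)
        episode.turns.append((action["content"], observation))
    return 0.0  # no answer within the turn budget


def scripted_policy(observation, turns):
    """Hypothetical fixed policy: one code action, then answer with its output."""
    if not turns:
        return {"type": "code",
                "content": "vals = get_values('Heart rate'); result = sum(vals) / len(vals)"}
    return {"type": "answer", "content": turns[-1][1]}


tools = {"get_values": lambda code_text: [72, 88]}  # hypothetical tool
print(rollout(scripted_policy, Episode("Mean heart rate?", "80.0"), tools))
```

In an actual RL setup the scripted policy would be replaced by the model being trained, and the scalar returned by `rollout` would feed the policy update.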

## Submission history

From: Marius Knorr **[v1]** Wed, 13 May 2026 21:27:21 UTC (1,359 KB)

Similar Articles

World Feedback for Clinical Agents: Diagnosing RL in FHIR Environments

CacheRL: Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward

Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning

From Voting to Agent Collaboration: Answer-Type-Aware LLM Pipelines for BioASQ 14b

Beyond Next-Token Prediction: An RLVR Proof of Concept for Tool-Use Agents on Atlassian Workflows
