Improving Medical Communication using Rubric-Guided Counterfactual Recommendations
Summary
This paper introduces an LM-guided counterfactual recommendation pipeline for improving doctor-patient communication in text-based telemedicine. It identifies interpretable features like tone and actionability, and suggests minimal changes that increase positive patient feedback without altering medical content, achieving a mean 6.41% gain in predicted positive feedback.
# Improving Medical Communication using Rubric-Guided Counterfactual Recommendations

Source: [https://arxiv.org/html/2606.18889](https://arxiv.org/html/2606.18889)

Adrian Cosma§, Nicoleta-Nina Basoc†, Andrei Niculae†, Cosmin Dumitrache†, Emilian Radoi†

§IDSIA, Dalle Molle Institute for Artificial Intelligence, †National University of Science and Technology POLITEHNICA Bucharest

Correspondence: emilian.radoi@upb.ro

###### Abstract

Text-based telemedicine increasingly relies on lightweight patient feedback; however, such feedback primarily reflects perceived communication quality rather than medical accuracy. We introduce an LM-guided counterfactual recommendation pipeline that discovers and refines interpretable communication features such as tone, personalization, actionability and completeness in addressing patient concerns, without interfering with the medical content. These features are used together with patient-doctor interaction metadata to estimate positive feedback. At inference time, the system searches over low-cost ordinal feature changes and recommends minimal communication changes predicted to increase the probability of positive feedback, while independent auditor models test whether these gains generalize beyond the selection model. Across interactions, recommendations yield a mean +6.41% gain in predicted positive feedback probability under independent auditors, and are non-negative for 93.31% of recommendations. These results suggest that small, interpretable communication changes can capture most predicted gains while preserving the doctor’s control over medical reasoning and final wording.
## 1 Introduction

Figure 1: Overview of the proposed pipeline. Starting from patient–doctor QA pairs, an LM first optimizes a set of interpretable response-level features, which are then refined into feature-specific extraction prompts and used to annotate the dataset. The extracted semantic features, together with structured metadata, are used to train a feedback prediction model. At inference time, we search over counterfactual changes to the semantic features and recommend the smallest feature modifications that maximize the probability of positive feedback.

Telemedicine platforms increasingly mediate the first point of contact between patients and doctors. In 2024, there were reportedly over 116 million users of online consultations worldwide Statista ([2026](https://arxiv.org/html/2606.18889#bib.bib107)). Some of these interactions occur through written text, where patients submit questions, doctors respond asynchronously, and the exchange typically concludes with a lightweight feedback signal such as a positive or negative rating. This feedback is important, as it shapes doctors’ reputations, serves as a proxy for patient satisfaction, and incentivizes doctors to communicate more effectively with patients. However, patient feedback does not reflect the medical accuracy of responses, as patients are not well positioned to evaluate clinical correctness. Their judgments are mostly driven by communication qualities such as clarity, empathy, personalization and actionability.
Prior work has shown that doctor communication significantly influences patient experience, treatment adherence and perceptions of care (Martin et al., [2005](https://arxiv.org/html/2606.18889#bib.bib6); Geng et al., [2024](https://arxiv.org/html/2606.18889#bib.bib59)). This raises an important question for telemedicine platforms: how can doctors be supported in improving the patient-facing quality of their responses without automating medical advice or rewriting their response?

We address this problem as one of *counterfactual recommendation*. Given a patient question, a doctor response, and question-response metadata, our goal is to identify the smallest set of communication features of the response that, if improved, would most increase the probability of positive patient feedback. Unlike previous works Niculae et al. ([2025](https://arxiv.org/html/2606.18889#bib.bib87)); Li et al. ([2023](https://arxiv.org/html/2606.18889#bib.bib44)); Zhao et al. ([2025a](https://arxiv.org/html/2606.18889#bib.bib96), [b](https://arxiv.org/html/2606.18889#bib.bib97)), we do not use a *Language Model* (LM) to generate a new medical response. We recommend interpretable editing targets, such as increasing personalization, providing clearer explanations, or making recommendations more actionable. The doctor retains full control over the final response and may accept, modify or ignore the recommendations.

This setting imposes three requirements on any system for improving doctor responses:

- $\mathfrak{R}_1$ Explainability: Recommendations should be expressed in terms of interpretable communication features.
- $\mathfrak{R}_2$ Feedback Improvement: Recommended changes should increase the predicted probability of positive patient feedback.
- $\mathfrak{R}_3$ Minimal Editing Effort: Recommendations should achieve feedback improvement with minimal editing effort.
To satisfy these requirements, we propose an LM-guided counterfactual recommendation pipeline for text-based telemedicine responses that leverages *Automatic Prompt Optimization* (APO) techniques. This paper makes the following contributions:

1. We adapt and validate automated feature discovery Cosma et al. ([2026](https://arxiv.org/html/2606.18889#bib.bib100)) for text-based telemedicine feedback modeling, using LMs to discover, refine and extract response-level semantic features predictive of patient satisfaction ($\mathfrak{R}_1$, $\mathfrak{R}_2$). We evaluate the predictive value of these features by training an interpretable feedback estimator which achieves 71.5% ROC-AUC.
2. We propose a budget-constrained counterfactual search procedure over ordinal semantic features, generating low-cost recommendations by balancing predicted feedback gains against editing effort ($\mathfrak{R}_2$, $\mathfrak{R}_3$). We perform a quantitative study on the proposed recommendations, and find that they raise the predicted probability of positive feedback for 93.31% of responses, by +6.41% on average.

## 2 Related Work

LM-Driven Feature Discovery. While LMs have been used in the past for reliable feature *extraction* under a fixed schema He et al. ([2024](https://arxiv.org/html/2606.18889#bib.bib62)); Gilardi et al. ([2023](https://arxiv.org/html/2606.18889#bib.bib42)); Törnberg ([2023](https://arxiv.org/html/2606.18889#bib.bib47)), recent works have used LMs to *discover* what features to extract in the first place (Cosma et al., [2026](https://arxiv.org/html/2606.18889#bib.bib100); Balek et al., [2025](https://arxiv.org/html/2606.18889#bib.bib77); Zhou et al., [2024](https://arxiv.org/html/2606.18889#bib.bib75); Zhang et al., [2025](https://arxiv.org/html/2606.18889#bib.bib84)).
Beyond optimizing instructions for a single prediction, as in APO (Zhou et al., [2022](https://arxiv.org/html/2606.18889#bib.bib36); Pryzant et al., [2023](https://arxiv.org/html/2606.18889#bib.bib48); Yuksekgonul et al., [2025](https://arxiv.org/html/2606.18889#bib.bib94)), Cosma et al. ([2026](https://arxiv.org/html/2606.18889#bib.bib100)) frame feature discovery as a dataset-level prompt optimization problem, where the optimized prompt induces a shared feature schema scored by downstream classifier performance. We adapt and improve upon this framework by incorporating an unsupervised prompt refinement step, and apply it to telemedicine feedback. We treat the discovered features as the action space for counterfactual recommendations Jiang et al. ([2024](https://arxiv.org/html/2606.18889#bib.bib63)).

Communication Quality in Medicine. Clinician communication shapes patient adherence, health outcomes, and how patients judge their doctors (Martin et al., [2005](https://arxiv.org/html/2606.18889#bib.bib6); Street Jr et al., [2009](https://arxiv.org/html/2606.18889#bib.bib8); Stergiopoulos and Martimianakis, [2023](https://arxiv.org/html/2606.18889#bib.bib50)). On telemedicine platforms, the linguistic characteristics of doctor responses are associated with patient satisfaction (Geng et al., [2024](https://arxiv.org/html/2606.18889#bib.bib59)). Although LMs are increasingly used in healthcare (Anthropic Team, [2026](https://arxiv.org/html/2606.18889#bib.bib1)), patient-facing deployment remains difficult: users assisted by LMs perform no better than controls (Bean et al., [2025](https://arxiv.org/html/2606.18889#bib.bib78)), and advice believed to involve AI is trusted less (Reis et al., [2024](https://arxiv.org/html/2606.18889#bib.bib68); Hohenstein et al., [2023](https://arxiv.org/html/2606.18889#bib.bib43)). This motivates doctor-facing assistance that preserves the doctor as the author.
## 3 Method

Our pipeline has five stages, as shown in Fig. [1](https://arxiv.org/html/2606.18889#S1.F1). We first apply dataset-level automatic feature discovery (Cosma et al., [2026](https://arxiv.org/html/2606.18889#bib.bib100)) to optimize a set of response-level semantic features that are likely to be predictive of patient feedback. Each feature describes a mutable communication property of the doctor’s response. We then refine feature-specific extraction prompts using a grounded prompt refinement procedure inspired by APO methods such as MIPRO (Opsahl-Ong et al., [2024](https://arxiv.org/html/2606.18889#bib.bib67)). These extractors annotate each patient–doctor interaction with ordinal semantic feature values. The extracted semantic features, together with structured question-response metadata, are used to train a feedback estimator that predicts the probability of positive patient feedback. At inference time, we enumerate low-cost positive changes to the semantic representation of a response and select the counterfactual that best trades off predicted feedback improvement against editing effort. The output is a small set of feature-level recommendations describing which communication aspects should be improved.

Consider a patient who asks “I have had headaches almost every day for the past two weeks. Should I be worried?”, and a doctor who responds “It is probably a tension headache. I recommend a consultation with a neurologist”. The response is medically reasonable, but it is short, impersonal, and offers no concrete guidance. We use this interaction throughout this section to illustrate each stage of the pipeline; all numeric values associated with it are illustrative.
### 3.1 Problem Definition

Let $\mathcal{D}=\{(q_i,r_i,m_i,y_i)\}_{i=1}^{n}$ be a dataset of interactions, where $q_i$ is the patient question, $r_i$ is the doctor response, $m_i$ denotes non-textual metadata, and $y_i\in\{0,1\}$ indicates whether the interaction received positive patient feedback. The metadata may include variables such as response time, doctor activity history, and other information available at the time the response is written. We assume that the doctor response $r_i$ can be characterized by a set of interpretable semantic features $\mathcal{F}=\{f_1,\ldots,f_k\}$, where each feature describes a mutable aspect of communication style, such as empathy or actionability. Each feature $f_j$ takes values from an ordered finite set $R_j=\{r_{j,1},\ldots,r_{j,m_j}\}$, for example a 5-point Likert scale from "low" to "high". Given a question–response pair $(q_i,r_i)$, an `Extractor` LM module maps the interaction to an ordinal semantic representation $s_i=E(q_i,r_i;\mathcal{F})\in R_1\times\cdots\times R_k$. We train a policy model $\pi$ that estimates $\pi(m_i,s_i)=P(y_i=1\mid m_i,s_i)$, which is used to select recommendations. For evaluation, to reduce self-confirmation, we separately train auditor models $\alpha_1,\ldots,\alpha_L$ that are not used for recommendation selection.

### 3.2 Dataset-Level Feature Discovery via APO

We instantiate the dataset-level feature discovery framework of Cosma et al. ([2026](https://arxiv.org/html/2606.18889#bib.bib100)) in the telemedicine feedback setting. In this formulation, the prompt does not directly produce a prediction for each input.
Instead, it induces a global feature schema $\mathcal{F}_{\phi}=\{f_1,\ldots,f_k\}$, where $\phi$ denotes the instruction and example context given to the `FeatureProposer`. Unlike standard prompt optimization, where the prompt directly improves task predictions, the quality of $\phi$ is determined only indirectly: a good prompt is one that yields features whose extracted values improve downstream feedback prediction while remaining interpretable and actionable.

Given a subset of the training corpus, we optimize a `FeatureProposer` LM to generate candidate semantic features of doctor responses. Each feature consists of a name, an ordered value set, and an extraction instruction. We run the `FeatureProposer` under several prompt contexts (examples with feedback labels, without labels, and with written patient feedback when available), manually consolidate recurring features across runs, and remove redundant, clinically unsafe or non-actionable ones. The resulting features define the action space for counterfactual recommendations. We show the final feature set in Appendix [A](https://arxiv.org/html/2606.18889#A1); this differs from previous work Niculae et al. ([2025](https://arxiv.org/html/2606.18889#bib.bib87)) in that the selected features are not defined based on organizational or regulatory needs, which can be suboptimal for feedback prediction, but are optimized directly for predicting positive feedback.

### 3.3 Unsupervised Prompt Refinement

The initial feature definitions are often too compact to serve directly as reliable extraction prompts, and since we have no gold labels for the semantic features, we cannot run a supervised prompt optimizer over extraction accuracy. Unlike the original formulation of Cosma et al. ([2026](https://arxiv.org/html/2606.18889#bib.bib100)), where a single `Extractor` module handles all features, we instantiate a specialized `Extractor` module $E_j$ for each feature $f_j$.
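The feature definition used above (a name, an ordered value set, and an extraction instruction) and the one-extractor-per-feature design can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the authors' code: `OrdinalFeature`, `SCHEMA`, `extract_all` and the stub extractors are hypothetical names, the value sets and instructions are invented (only a few value labels are borrowed from the running example), and the lambdas stand in for LM calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class OrdinalFeature:
    """A discovered feature: name, ordered value set, extraction instruction."""
    name: str
    values: Tuple[str, ...]  # ordered from least to most desirable
    instruction: str

    def rank(self, value: str) -> int:
        """Return the 0-based ordinal rank; raises ValueError for unknown values."""
        return self.values.index(value)


# Hypothetical value sets and instructions, for illustration only.
SCHEMA: Dict[str, OrdinalFeature] = {
    f.name: f
    for f in (
        OrdinalFeature(
            "empathy_level",
            ("no_empathy", "low_empathy", "moderate_empathy", "high_empathy"),
            "Rate how clearly the response acknowledges the patient's feelings.",
        ),
        OrdinalFeature(
            "actionability_level",
            ("not_actionable", "weakly_actionable",
             "moderately_actionable", "highly_actionable"),
            "Rate how concrete the recommended next steps are.",
        ),
    )
}

# One extractor per feature: (question, response) -> one label of that feature.
Extractor = Callable[[str, str], str]


def extract_all(question: str, response: str,
                extractors: Dict[str, Extractor]) -> Dict[str, int]:
    """Apply every feature-specific extractor and return ordinal ranks by name."""
    ranks = {}
    for name, extractor in extractors.items():
        value = extractor(question, response)
        ranks[name] = SCHEMA[name].rank(value)  # also validates the label
    return ranks


# Stub extractors standing in for LM calls (assumption: a real E_j prompts an LM).
stub_extractors: Dict[str, Extractor] = {
    "empathy_level": lambda q, r: "no_empathy",
    "actionability_level": lambda q, r: "weakly_actionable",
}
```

Storing values as an ordered tuple keeps the rank lookup trivial and lets an out-of-schema label from an extractor fail loudly instead of silently entering the feature vector.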
We refine each prompt with a procedure inspired by the grounded instruction proposal stage of MIPRO Opsahl-Ong et al. ([2024](https://arxiv.org/html/2606.18889#bib.bib67)), omitting its iterative search stage. For each feature, an LM receives the feature definition, its ordinal value set, representative question and response examples, and a task description. It produces a refined extraction prompt that specifies the meaning of each ordinal category, gives decision criteria, and summarizes domain-specific patterns observed in the examples. We find that refinement improves the reliability of extracted features. Thus, each feature receives its own extractor: $E_j(q_i,r_i;\psi_j)\rightarrow\hat{s}_{i,j}$, where $\psi_j$ is the refined prompt for feature $f_j$. The full semantic representation is obtained by applying all feature-specific extractors: $\hat{s}_i=(E_1(q_i,r_i;\psi_1),\ldots,E_k(q_i,r_i;\psi_k))$.

For the example interaction, the extractors assign, among others, `empathy_level=no_empathy`, `actionability_level=weakly_actionable`, and `problems_addressed=partially_addressed`: the response is a generic referral without context and does not answer whether the patient should be worried.

### 3.4 Feedback Estimation

To train the policy and auditor models, we split the data temporally, holding out the most recent 20% of interactions as a fixed evaluation set; this choice reflects deployment, where a model trained on past interactions makes recommendations for future ones. The remaining data is divided into training subsets for the policy model and the auditor models, $\mathcal{D}=\mathcal{D}_{\text{policy}}\cup\mathcal{D}_{\text{auditor}}\cup\mathcal{D}_{\text{eval}}$.
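A temporal split of this kind can be sketched in plain Python. The text fixes only the 20% most-recent holdout; how the remaining interactions are divided between the policy and auditor subsets is not stated, so the chronological 50/50 division below (`policy_frac`) is an assumption, as are the function and parameter names.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def temporal_split(
    records: Sequence[T],
    timestamps: Sequence[float],
    eval_frac: float = 0.2,
    policy_frac: float = 0.5,  # assumption: not specified in the text
) -> Tuple[List[T], List[T], List[T]]:
    """Return (D_policy, D_auditor, D_eval); D_eval holds the most recent records."""
    if len(records) != len(timestamps):
        raise ValueError("records and timestamps must have the same length")
    # Indices sorted from oldest to newest interaction.
    order = sorted(range(len(records)), key=lambda i: timestamps[i])
    n_eval = round(eval_frac * len(order))
    cut = len(order) - n_eval
    past, recent = order[:cut], order[cut:]
    # Divide the older interactions between policy and auditor training sets.
    n_policy = round(policy_frac * len(past))
    d_policy = [records[i] for i in past[:n_policy]]
    d_auditor = [records[i] for i in past[n_policy:]]
    d_eval = [records[i] for i in recent]
    return d_policy, d_auditor, d_eval
```

Sorting by timestamp before cutting means that no interaction in the evaluation set precedes any training interaction, which is the property the deployment argument relies on.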
For the policy model $\pi$, we use a CatBoost classifier Prokhorenkova et al. ([2018](https://arxiv.org/html/2606.18889#bib.bib17)) trained on question-response metadata and the extracted semantic features. A tree-based model naturally supports mixed categorical, ordinal, and numerical features and provides feature importance estimates that are useful for further analysis.

The auditor models $\alpha_1,\ldots,\alpha_L$ are trained separately from the policy model. They differ from the policy model in training data, feature subsets and inductive bias. In our experiments, we use a *Logistic Regression* (LR) model and an *Explainable Boosting Machine* (EBM) Lou et al. ([2013](https://arxiv.org/html/2606.18889#bib.bib12)) model, and allow the auditors to use surface-level text features (such as LIWC Dudău and Sava ([2022](https://arxiv.org/html/2606.18889#bib.bib32))). The purpose of these models is to test whether recommendations chosen by the policy model remain beneficial under alternative satisfaction estimators. Additional details are presented in Appendix [A](https://arxiv.org/html/2606.18889#A1).

### 3.5 Searching for Counterfactual Recommendations

For each response, we search for small positive changes to the extracted semantic features: a counterfactual representation $s'_i$ changes one or more ordinal feature values, and features may only move towards values that correspond to a better communication form according to the feature rubric. For a feature $f_j$, let $\operatorname{rank}_j(s_{i,j})$ denote the ordinal rank of its current value. The cost of changing feature $j$ is the number of ordinal steps between the original and counterfactual value, $c_j(s_i,s'_i)=\left|\operatorname{rank}_j(s'_{i,j})-\operatorname{rank}_j(s_{i,j})\right|$.
The total intervention cost is $C(s_i,s'_i)=\sum_{j=1}^{k}c_j(s_i,s'_i)$. Given a budget $B$, we consider only counterfactuals whose cost is at most $B$: $\mathcal{X}_B(s_i)=\{s'_i : C(s_i,s'_i)\leq B\}$. Because the semantic features are ordinal and low-dimensional, we enumerate all valid changes within the budget rather than relying on gradient-based or heuristic search Verma et al. ([2024](https://arxiv.org/html/2606.18889#bib.bib73)), and compute the policy model’s predicted positive feedback probability $p(s'_i)=\pi(m_i,s'_i)$ for each candidate. We retain only candidates that improve over the original prediction: $\mathcal{I}_B(s_i)=\{s'_i\in\mathcal{X}_B(s_i) : p(s'_i)>p(s_i)\}$. Finally, we select the candidate closest to the ideal point of maximum predicted feedback and zero intervention cost:

$$s_i^{\star}=\operatorname*{arg\,min}_{s'_i\in\mathcal{I}_B(s_i)}\sqrt{\left(1-p(s'_i)\right)^{2}+\left(\frac{C(s_i,s'_i)}{B}\right)^{2}}\tag{1}$$

In practice, each feature can also receive a modification weight, since some features are easier to edit than others from doctors’ perspectives. The output is a set of feature-level recommendations, such as increasing the `actionability` of the response, presented as an interpretable editing target that the doctor may accept, ignore or adapt.

For the example interaction, the policy model assigns the original response a positive feedback probability of $p(s_i)=0.41$.
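The budget-constrained enumeration and the selection rule in Eq. (1) can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: features are represented by integer ranks, `toy_predict` is a hypothetical stand-in for the policy model $\pi$ with invented coefficients, the names `enumerate_candidates` and `recommend` are assumptions, and the optional per-feature modification weights are omitted.

```python
import itertools
import math
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, int]  # feature name -> ordinal rank (0 = lowest)


def enumerate_candidates(state: State, max_rank: Dict[str, int],
                         budget: int) -> List[Tuple[State, int]]:
    """All upward-only changes whose total ordinal cost is between 1 and budget."""
    names = sorted(state)
    # Each feature may move up by 0..room steps, never down.
    options = [range(min(budget, max_rank[n] - state[n]) + 1) for n in names]
    out = []
    for steps in itertools.product(*options):
        cost = sum(steps)
        if 1 <= cost <= budget:
            cand = {n: state[n] + d for n, d in zip(names, steps)}
            out.append((cand, cost))
    return out


def recommend(state: State, max_rank: Dict[str, int], budget: int,
              predict: Callable[[State], float]
              ) -> Optional[Tuple[State, int, float]]:
    """Pick the improving candidate closest to the ideal point (p = 1, cost = 0)."""
    base = predict(state)
    best, best_dist = None, math.inf
    for cand, cost in enumerate_candidates(state, max_rank, budget):
        p = predict(cand)
        if p <= base:  # keep only candidates that improve the prediction
            continue
        dist = math.hypot(1.0 - p, cost / budget)  # distance from Eq. (1)
        if dist < best_dist:
            best, best_dist = (cand, cost, p), dist
    return best


def toy_predict(s: State) -> float:
    """Hypothetical scorer; coefficients are invented for illustration."""
    return 0.41 + 0.05 * s["empathy_level"] + 0.03 * (s["actionability_level"] - 1)


state = {"empathy_level": 0, "actionability_level": 1}
max_rank = {"empathy_level": 3, "actionability_level": 3}
result = recommend(state, max_rank, budget=3, predict=toy_predict)
```

With this toy scorer every upward change improves the prediction, and the rule selects the single-step empathy change, because each additional ordinal step adds to the cost term $C/B$. The toy numbers are not those of the paper's running example.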
With a budget $B=3$, the selected counterfactual raises `empathy_level` from `no_empathy` to `moderate_empathy` (cost 2) and `actionability_level` from `weakly_actionable` to `moderately_actionable` (cost 1), increasing the predicted probability to $p(s_i^{\star})=0.58$. The doctor receives two editing targets: acknowledge the patient’s worry, and add at least one concrete next step, for example, what to monitor or when an in-person visit becomes necessary.

### 3.6 Auditing Recommendations

After selecting $s_i^{\star}$ with the policy model, we evaluate the same counterfactual under each auditor model and take the average change in probability, $\Delta(i)=\mathbb{E}_{\ell\in\{1,\ldots,L\}}\left[\alpha_{\ell}(m_i,s_i^{\star})-\alpha_{\ell}(m_i,s_i)\right]$. A recommendation is considered robust when it improves the policy prediction and yields non-negative changes across the auditor models. For the running example, the counterfactual selected by the policy model also yields a positive average delta under the auditors, so the recommendation would be considered robust enough to be shown to the doctor. Our use of independent auditor models is relevant for counterfactual recommendations, since interventions selected by one model can exploit that model’s particular errors (Kearns et al., [2018](https://arxiv.org/html/2606.18889#bib.bib20); Kim et al., [2019](https://arxiv.org/html/2606.18889#bib.bib23); Hébert-Johnson et al., [2018](https://arxiv.org/html/2606.18889#bib.bib19); Upadhyay et al., [2021](https://arxiv.org/html/2606.18889#bib.bib29); Dutta et al., [2022](https://arxiv.org/html/2606.18889#bib.bib33)).

## 4 Experiments and Results

#### Evaluating Feature Quality.
In Fig. [2](https://arxiv.org/html/2606.18889#S4.F2) we show Spearman correlations between extracted features and the positive feedback indicator; the extracted features are generally positively associated with positive feedback. Further, in Fig. [3](https://arxiv.org/html/2606.18889#S4.F3) we show agreement among the feature values extracted by the `Extractor` LMs. We use (R) to denote the refined `Extractor` prompt. There is moderate agreement between models; same-family extractor pairs agree more strongly than cross-family pairs, with the Gemma-4-31B and Gemma-4-31B(R) pair obtaining the highest off-diagonal agreement.

Table [1](https://arxiv.org/html/2606.18889#S4.T1) shows the performance of the policy model under various `Extractor` models; the discovered semantic features are predictive of patient satisfaction. The Qwen3.5-27B(R) extractor obtains the highest validation ROC-AUC of 0.717, suggesting that the refined semantic annotations yield a more effective feature representation in this setting. However, the highest rollout stability score is achieved by the Gemma-4 family. Refining the prompts improves both final downstream performance and rollout stability, at practically negligible cost. We show additional results and implementation details in Appendix [A](https://arxiv.org/html/2606.18889#A1). Overall, Gemma-4-31B(R) combines near-best downstream performance with the highest rollout stability in Table [1](https://arxiv.org/html/2606.18889#S4.T1), while remaining highly consistent with its unrefined counterpart. We therefore use Gemma-4-31B(R) as the preferred extractor in subsequent analyses.

Figure 2: Spearman correlations between extracted features and positive feedback. All features show positive associations with satisfaction.

Figure 3: Pairwise agreement between extractor models after majority voting over rollout-level feature values.

| Extractor | ROC-AUC | Avg. Rollout Stability |
| --- | --- | --- |
| Qwen3-80B | 0.714 | 0.957 |
| Qwen3.5-27B | 0.715 | 0.718 |
| Qwen3.5-27B(R) | 0.717 | 0.778 |
| Gemma-4-31B | 0.714 | 0.960 |
| Gemma-4-31B(R) | 0.715 | 0.974 |

Table 1: Validation ROC-AUC of the policy model and average rollout stability across the evaluated feature extraction models.

#### Evaluating Counterfactual Recommendations.

In Fig. [4](https://arxiv.org/html/2606.18889#S4.F4) we show that increasing the modification budget results in greater diversity among the changeable features. As the budget grows, modifications are spread across a larger set of features, indicating that achieving more substantial interventions requires adjustments to a broader range of attributes. Additionally, in Fig. [5](https://arxiv.org/html/2606.18889#S4.F5) we show that increasing the budget yields only marginal improvements in feature probability contribution. This suggests that relatively minor modifications are already sufficient to achieve most of the performance gains.

Figure 4: Fraction of recommendations that modify each semantic feature as the edit budget increases.

Figure 5: Average probability contribution of each modified feature to the predicted positive-feedback probability across budgets.

Figure 6: Distribution of average auditor improvement delta across edit budgets and extractor models, after applying the selected counterfactual recommendation. Recommendations selected by the policy model remain beneficial under independent auditor models.

| Auditor | Extractor | Improvement rate (%) | Improvement delta (%) |
| --- | --- | --- | --- |
| EBM | Qwen3-80B | 89.58 | +5.2 |
| EBM | Qwen3.5-27B | 97.18 | +6.3 |
| EBM | Qwen3.5-27B(R) | 97.54 | +6.3 |
| EBM | Gemma-4-31B | 94.90 | +6.3 |
| EBM | Gemma-4-31B(R) | **98.91** | **+6.6** |
| LR | Qwen3-80B | 89.54 | +5.3 |
| LR | Qwen3.5-27B | 86.25 | +6.4 |
| LR | Qwen3.5-27B(R) | 88.68 | +6.7 |
| LR | Gemma-4-31B | 94.71 | **+7.6** |
| LR | Gemma-4-31B(R) | **95.83** | +7.0 |

Table 2: Counterfactual recommendation performance across auditor models and feature extractors, averaged over edit budgets. Improvement rate is the percentage of responses for which the selected counterfactual yields a non-negative change under the auditor, and improvement delta is the mean change in predicted positive-feedback probability. The best value in each column within an auditor block is shown in bold.

In Fig. [6](https://arxiv.org/html/2606.18889#S4.F6) we show the distribution of auditor-evaluated deltas across budgets and `Extractor` modules. As expected, increasing the change budget results in larger increases in auditor probability estimates, providing further evidence that the proposed method reliably identifies counterfactual recommendations expected to improve user feedback. Additionally, Table [2](https://arxiv.org/html/2606.18889#S4.T2) shows that the recommendation policy obtains strong performance across all auditor configurations: the selected recommendations improved the policy estimate for 95.65% of responses, with a mean predicted-feedback increase of +7.35%. Under independent auditor models, the same recommendations produced a positive mean delta of +6.37%, and 93.31% of deltas were non-negative on average across auditors. We also compare our method with greedy recommendations (see Table [4](https://arxiv.org/html/2606.18889#A1.T4), Appendix [A](https://arxiv.org/html/2606.18889#A1)) and show that our method obtains better improvement rates.

## 5 Conclusions

Our results suggest that interpretable, low-effort interventions can achieve substantial improvements in predicted feedback, while preserving clinician ownership of medical communication. Patient feedback was associated with a small number of modifiable communication properties, including response tone, personalization and completeness in addressing patient concerns. Counterfactual modifications to these properties remained beneficial when evaluated by independent models that were not involved in selecting them. Most of the predicted improvement was obtained with low editing budgets.
This indicates that the practical value of the system lies in identifying the one or two communication deficiencies most worth the clinician’s attention. Across responses, the selected recommendations obtained an average improvement of +6.41% under the auditor models, which assigned a non-negative average change to 93.31% of recommendations. These findings support communication-level counterfactual guidance as a middle ground between passive feedback analytics and automated medical response generation, as the model identifies actionable communication improvements, while the doctor retains control over the medical reasoning and the final response.

## Limitations

Although the evaluation is conducted offline, the use of a temporal holdout and independent auditor models provides evidence that the identified recommendations are not artifacts of a single predictive model. However, the proposed approach requires prospective validation in real-world deployment, where doctors apply the recommendations and their impact on patient feedback can be measured.

## References

- Anthropic Team (2026) Anthropic economic index report: Economic primitives. [Link](https://www.anthropic.com/research/anthropic-economic-index-january-2026-report)
- V. Balek, L. Sỳkora, V. Sklenák, et al. (2025) LLM-based feature generation from text for interpretable machine learning. Machine Learning 114(11), pp. 1–30.
- A. Bean, R. E. Payne, G. Parsons, H. R. Kirk, J. Ciro, R. Mosquera-Gómez, S. Hincape, A. Ekanayaka, L. Tarassenko, L. Rocher, et al. (2025) Reliability of LLMs as medical assistants for the general public: a randomized preregistered study. Nature Medicine.
- A. Cosma, O. Szehr, D. Kletz, A. Antonucci, and O.
Pelletier \(2026\)Automatic Prompt Optimization for Dataset\-Level Feature Discovery\.External Links:[Link](https://arxiv.org/abs/2601.13922),2601\.13922Cited by:[item 1](https://arxiv.org/html/2606.18889#S1.I2.i1.p1.2),[§2](https://arxiv.org/html/2606.18889#S2.p1.1),[§3\.2](https://arxiv.org/html/2606.18889#S3.SS2.p1.3),[§3\.3](https://arxiv.org/html/2606.18889#S3.SS3.p1.2),[§3](https://arxiv.org/html/2606.18889#S3.p1.1)\. - D\. P\. Dudău and F\. A\. Sava \(2022\)The development and validation of the Romanian version of Linguistic Inquiry and Word Count 2015 \(Ro\-LIWC2015\)\.Current Psychology41\(6\),pp\. 3597–3614\.Cited by:[§3\.4](https://arxiv.org/html/2606.18889#S3.SS4.p3.1)\. - S\. Dutta, J\. Long, S\. Mishra, C\. Tilli, and D\. Magazzeni \(2022\)Robust counterfactual explanations for tree\-based ensembles\.InInternational Conference on Machine Learning,Cited by:[§3\.6](https://arxiv.org/html/2606.18889#S3.SS6.p2.1)\. - S\. Geng, Y\. He, L\. Duan, C\. Yang, X\. Wu, G\. Liang, and B\. Niu \(2024\)The association between linguistic characteristics of physicians’ communication and their economic returns: mixed method study\.Journal of Medical Internet Research26,pp\. e42850\.Cited by:[§1](https://arxiv.org/html/2606.18889#S1.p2.1),[§2](https://arxiv.org/html/2606.18889#S2.p2.1)\. - F\. Gilardi, M\. Alizadeh, and M\. Kubli \(2023\)ChatGPT outperforms crowd workers for text\-annotation tasks\.Proceedings of the National Academy of Sciences120\(30\),pp\. e2305016120\.Cited by:[§2](https://arxiv.org/html/2606.18889#S2.p1.1)\. - X\. He, Z\. Lin, Y\. Gong,et al\.\(2024\)AnnoLLM: Making large language models to be better crowdsourced annotators\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 6: Industry Track\),Cited by:[§2](https://arxiv.org/html/2606.18889#S2.p1.1)\. - Ú\. Hébert\-Johnson, M\. P\. Kim, O\. Reingold, and G\. N\. 
Rothblum \(2018\) Multicalibration: calibration for the computationally\-identifiable masses\. In International Conference on Machine Learning\. Cited by: [§3\.6](https://arxiv.org/html/2606.18889#S3.SS6.p2.1)\.
- J\. Hohenstein, R\. F\. Kizilcec, D\. DiFranzo, Z\. Aghajari, H\. Mieczkowski, K\. Levy, M\. Naaman, J\. Hancock, and M\. F\. Jung \(2023\) Artificial intelligence in communication impacts language and social relationships\. Scientific reports 13\(1\), pp\. 5487\. Cited by: [§2](https://arxiv.org/html/2606.18889#S2.p2.1)\.
- J\. Jiang, F\. Leofante, A\. Rago, and F\. Toni \(2024\) Robust counterfactual explanations in machine learning: a survey\. In Proceedings of the Thirty\-Third International Joint Conference on Artificial Intelligence, IJCAI ’24\. External Links: ISBN 978\-1\-956792\-04\-1, [Link](https://doi.org/10.24963/ijcai.2024/894), [Document](https://dx.doi.org/10.24963/ijcai.2024/894) Cited by: [§2](https://arxiv.org/html/2606.18889#S2.p1.1)\.
- M\. Kearns, S\. Neel, A\. Roth, and Z\. S\. Wu \(2018\) Preventing fairness gerrymandering: auditing and learning for subgroup fairness\. In International Conference on Machine Learning\. Cited by: [§3\.6](https://arxiv.org/html/2606.18889#S3.SS6.p2.1)\.
- M\. P\. Kim, A\. Ghorbani, and J\. Zou \(2019\) Multiaccuracy: black\-box post\-processing for fairness in classification\. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society\. Cited by: [§3\.6](https://arxiv.org/html/2606.18889#S3.SS6.p2.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, et al\. \(2023\) Efficient Memory Management for Large Language Model Serving with PagedAttention\. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles\. Cited by: [Appendix A](https://arxiv.org/html/2606.18889#A1.SS0.SSS0.Px2.p1.1)\.
- Y\. Li, Z\. Li, K\. Zhang, R\. Dan, S\.
Jiang, et al\. \(2023\) Chatdoctor: A medical chat model fine\-tuned on a large language model meta\-ai \(llama\) using medical domain knowledge\. Cureus 15\(6\)\. Cited by: [§1](https://arxiv.org/html/2606.18889#S1.p3.1)\.
- Y\. Lou, R\. Caruana, J\. Gehrke, and G\. Hooker \(2013\) Accurate intelligible models with pairwise interactions\. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp\. 623–631\. Cited by: [§3\.4](https://arxiv.org/html/2606.18889#S3.SS4.p3.1)\.
- S\. M\. Lundberg and S\. Lee \(2017\) A unified approach to interpreting model predictions\. CoRR abs/1705\.07874\. External Links: [Link](http://arxiv.org/abs/1705.07874), 1705\.07874 Cited by: [Figure 8](https://arxiv.org/html/2606.18889#A1.F8), [Appendix A](https://arxiv.org/html/2606.18889#A1.SS0.SSS0.Px4.p1.4)\.
- L\. R\. Martin, S\. L\. Williams, K\. B\. Haskard, and M\. R\. DiMatteo \(2005\) The challenge of patient adherence\. Therapeutics and clinical risk management 1\(3\), pp\. 189–199\. Cited by: [§1](https://arxiv.org/html/2606.18889#S1.p2.1), [§2](https://arxiv.org/html/2606.18889#S2.p2.1)\.
- A\. Niculae, A\. Cosma, C\. Dumitrache, and E\. Radoi \(2025\) Dr\. Copilot: A Multi\-Agent Prompt Optimized Assistant for Improving Patient\-Doctor Communication in Romanian\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, Suzhou \(China\), pp\. 1780–1792\. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-industry.125), ISBN 979\-8\-89176\-333\-3, [Link](https://aclanthology.org/2025.emnlp-industry.125/) Cited by: [§1](https://arxiv.org/html/2606.18889#S1.p3.1), [§3\.2](https://arxiv.org/html/2606.18889#S3.SS2.p2.1)\.
- H\. H\. Nigatu, A\. L\. Tonja, B\. Rosman, T\. Solorio, and M\. Choudhury \(2024\) The Zeno’s Paradox of ‘Low\-Resource’ Languages\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y\. Al\-Onaizan, M\. Bansal, and Y\.
Chen \(Eds\.\), Miami, Florida, USA, pp\. 17753–17774\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.983), [Link](https://aclanthology.org/2024.emnlp-main.983/) Cited by: [Appendix A](https://arxiv.org/html/2606.18889#A1.SS0.SSS0.Px1.p1.1)\.
- K\. Opsahl\-Ong, M\. J\. Ryan, J\. Purtell, D\. Broman, C\. Potts, M\. Zaharia, and O\. Khattab \(2024\) Optimizing instructions and demonstrations for multi\-stage language model programs\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 9340–9366\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.525), [Link](https://aclanthology.org/2024.emnlp-main.525/) Cited by: [§3\.3](https://arxiv.org/html/2606.18889#S3.SS3.p2.4), [§3](https://arxiv.org/html/2606.18889#S3.p1.1)\.
- L\. Prokhorenkova, G\. Gusev, A\. Vorobev, A\. V\. Dorogush, and A\. Gulin \(2018\) CatBoost: unbiased boosting with categorical features\. Advances in neural information processing systems 31\. Cited by: [§3\.4](https://arxiv.org/html/2606.18889#S3.SS4.p2.1)\.
- R\. Pryzant, D\. Iter, J\. Li, Y\. Lee, C\. Zhu, and M\. Zeng \(2023\) Automatic Prompt Optimization with “Gradient Descent” and Beam Search\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Singapore, pp\. 7957–7968\. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.494), [Link](https://aclanthology.org/2023.emnlp-main.494/) Cited by: [§2](https://arxiv.org/html/2606.18889#S2.p1.1)\.
- Qwen Team \(2026\) Qwen3\.5: towards native multimodal agents\. External Links: [Link](https://qwen.ai/blog?id=qwen3.5) Cited by: [Appendix A](https://arxiv.org/html/2606.18889#A1.SS0.SSS0.Px1.p1.1)\.
- M\. Reis, F\. Reis, and W\. Kunde \(2024\) Influence of believed AI involvement on the perception of digital medical advice\. Nature Medicine 30\(11\), pp\.
3098–3100\. Cited by: [§2](https://arxiv.org/html/2606.18889#S2.p2.1)\.
- Statista \(2026\) Online Doctor Consultations Worldwide 2017–2029\. Note: Accessed: 2026\-06\-12 External Links: [Link](https://www.statista.com/forecasts/1456718/online-doctor-consultations-worldwide-forecast/) Cited by: [§1](https://arxiv.org/html/2606.18889#S1.p1.1)\.
- E\. Stergiopoulos and M\. A\. T\. Martimianakis \(2023\) What makes a ‘good doctor’? A critical discourse analysis of perspectives from medical students with lived experience as patients\. Medical Humanities 49\(4\), pp\. 613–622\. Cited by: [§2](https://arxiv.org/html/2606.18889#S2.p2.1)\.
- R\. L\. Street Jr, G\. Makoul, N\. K\. Arora, and R\. M\. Epstein \(2009\) How does communication heal? Pathways linking clinician–patient communication to health outcomes\. Patient education and counseling 74\(3\), pp\. 295–301\. Cited by: [§2](https://arxiv.org/html/2606.18889#S2.p2.1)\.
- Q\. Team \(2025\) Qwen3 technical report\. External Links: [Link](https://arxiv.org/abs/2505.09388), 2505\.09388 Cited by: [Appendix A](https://arxiv.org/html/2606.18889#A1.SS0.SSS0.Px1.p1.1)\.
- P\. Törnberg \(2023\) ChatGPT\-4 Outperforms Experts and Crowd Workers in Annotating Political Twitter Messages with Zero\-Shot Learning\. External Links: [Link](https://arxiv.org/abs/2304.06588), 2304\.06588 Cited by: [§2](https://arxiv.org/html/2606.18889#S2.p1.1)\.
- S\. Upadhyay, S\. Joshi, and H\. Lakkaraju \(2021\) Towards robust and reliable algorithmic recourse\. In Advances in Neural Information Processing Systems\. Cited by: [§3\.6](https://arxiv.org/html/2606.18889#S3.SS6.p2.1)\.
- S\. Verma, V\. Boonsanong, M\. Hoang, K\. Hines, J\. Dickerson, and C\. Shah \(2024\) Counterfactual explanations and algorithmic recourses for machine learning: a review\. ACM Computing Surveys 56\(12\), pp\. 1–42\. Cited by: [§3\.5](https://arxiv.org/html/2606.18889#S3.SS5.p3.2)\.
- M\. Yuksekgonul, F\. Bianchi, J\.
Boen, et al\. \(2025\) Optimizing generative AI by backpropagating language model feedback\. Nature 639, pp\. 609–616\. Cited by: [§2](https://arxiv.org/html/2606.18889#S2.p1.1)\.
- G\. Zhang, L\. Niu, J\. Fang, et al\. \(2025\) Multi\-agent Architecture Search via Agentic Supernet\. In Forty\-second International Conference on Machine Learning\. External Links: [Link](https://openreview.net/forum?id=imcyVlzpXh) Cited by: [§2](https://arxiv.org/html/2606.18889#S2.p1.1)\.
- X\. Zhao, S\. Liu, S\. Yang, and C\. Miao \(2025a\) A smart multimodal healthcare copilot with powerful llm reasoning\. In Proceedings of the Thirty\-Fourth International Joint Conference on Artificial Intelligence, IJCAI ’25\. External Links: ISBN 978\-1\-956792\-06\-5, [Link](https://doi.org/10.24963/ijcai.2025/1278), [Document](https://dx.doi.org/10.24963/ijcai.2025/1278) Cited by: [§1](https://arxiv.org/html/2606.18889#S1.p3.1)\.
- X\. Zhao, S\. Liu, S\. Yang, and C\. Miao \(2025b\) MedRAG: Enhancing Retrieval\-augmented Generation with Knowledge Graph\-Elicited Reasoning for Healthcare Copilot\. In Proceedings of the ACM on Web Conference 2025, pp\. 4442–4457\. Cited by: [§1](https://arxiv.org/html/2606.18889#S1.p3.1)\.
- L\. Zhou, Y\. Farag, and A\. Vlachos \(2024\) An LLM Feature\-based Framework for Dialogue Constructiveness Assessment\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 5389–5409\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.308), [Link](https://aclanthology.org/2024.emnlp-main.308/) Cited by: [§2](https://arxiv.org/html/2606.18889#S2.p1.1)\.
- Y\. Zhou, A\. I\. Muresanu, Z\. Han, et al\. \(2022\) Large language models are human\-level prompt engineers\. In The eleventh international conference on learning representations\. Cited by: [§2](https://arxiv.org/html/2606.18889#S2.p1.1)\.

## Appendix A Implementation Details

#### Models\.
We used Qwen3\-80B Team \([2025](https://arxiv.org/html/2606.18889#bib.bib89)\), Qwen3\.5\-27B Qwen Team \([2026](https://arxiv.org/html/2606.18889#bib.bib105)\) and Gemma\-4\-31B \([hf\.co/google/gemma\-4\-31B\-it](https://arxiv.org/html/2606.18889v1/hf.co/google/gemma-4-31B-it), accessed 16 July 2026\) for semantic feature extraction\. These open\-weight models provide strong multilingual performance, which is important given that our data is in Romanian, a low\-resource language Nigatu et al\. \([2024](https://arxiv.org/html/2606.18889#bib.bib66)\)\. Feature discovery used GPT\-5\-mini as the FeatureProposer, and prompt refinement used GPT\-5\.4, both with temperature 1\.0 and a 16k output token limit\.

#### Semantic feature extraction\.

Extraction was run offline with vLLM 0\.19\.1 Kwon et al\. \([2023](https://arxiv.org/html/2606.18889#bib.bib51)\) on two NVIDIA DGX A100 GPUs, with tensor parallelism across the allocated GPUs, a maximum context length of 16k tokens, and at most 16 concurrent sequences\. Each feature is extracted with three rollouts at temperature 1\.0, with the final value determined by majority vote over the rollouts\.
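
The rollout aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `majority_vote` is hypothetical, and the tie rule (falling back to `unclear_or_mixed` when no label wins) is an assumption, since the text does not say how a three\-way disagreement is resolved\.

```python
from collections import Counter


def majority_vote(rollouts, tie_value="unclear_or_mixed"):
    """Return the label that most rollouts agree on.

    Assumption: when the two most frequent labels are tied, fall back to
    `tie_value`. The paper does not specify its tie-handling rule.
    """
    if not rollouts:
        raise ValueError("at least one rollout is required")
    ranked = Counter(rollouts).most_common()
    top_label, top_count = ranked[0]
    # A tie exists when the runner-up has as many votes as the leader.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return tie_value
    return top_label


# Three rollouts for one feature of one response (illustrative labels).
rollouts = ["moderate_empathy", "low_empathy", "moderate_empathy"]
print(majority_vote(rollouts))
```

With three rollouts, a tie can only occur when all three labels differ, so the fallback is rarely exercised in this setting\.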

| Feature Name | Values \(Prevalence\) |
| --- | --- |
| Actionability | unclear\_or\_mixed \(0\.6%\), not\_actionable \(1\.7%\), weakly\_actionable \(2\.7%\), moderately\_actionable \(15\.9%\), highly\_actionable \(79\.2%\) |
| Empathy | unclear\_or\_mixed \(0\.4%\), no\_empathy \(6\.0%\), low\_empathy \(58\.9%\), moderate\_empathy \(30\.6%\), high\_empathy \(4\.1%\) |
| Explanation | unclear\_or\_mixed \(1\.1%\), absent \(5\.9%\), limited \(16\.2%\), moderate \(26\.4%\), clear \(50\.3%\) |
| Personalization | unclear\_or\_mixed \(1\.7%\), generic \(0\.3%\), lightly\_personalized \(12\.3%\), moderately\_personalized \(26\.3%\), highly\_personalized \(59\.4%\) |
| Problems Addressed | unclear\_or\_mixed \(1\.1%\), not\_addressed \(0\.3%\), partially\_addressed \(6\.0%\), mostly\_addressed \(44\.6%\), fully\_addressed \(47\.9%\) |
| Response Tone | unclear\_or\_mixed \(0\.3%\), dismissive\_judgemental \(0\.4%\), rushed\_curt \(0\.8%\), neutral\_professional \(57\.4%\), supportive\_reassuring \(41\.1%\) |
| Only Recommends Visit | yes \(2\.4%\), no \(97\.6%\) |

Table 3: Extracted features and their corresponding values along with their prevalence\.

#### Prevalence distribution\.

The prevalence distribution, shown in Table [3](https://arxiv.org/html/2606.18889#A1.T3), suggests that most responses already exhibit strong communication qualities, including actionability, personalization and adequate coverage of patient concerns\. As a result, the recommendation task involves identifying incremental improvements rather than correcting poor responses\.

#### Feedback estimators\.

The policy model is tuned in three stages: SHAP\-based Lundberg and Lee \([2017](https://arxiv.org/html/2606.18889#bib.bib15)\) recursive feature selection keeping 10% of features \(at least 32\); a 3\-fold grid search over tree depth $\{2,4,6,8,10\}$, learning rate $\{0.01,0.03,0.05\}$, and L2 leaf regularization $\{0,2,5\}$; and final training for 2500 iterations with balanced class weights\.
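
To make the size of the tuning grid described above concrete, the sketch below enumerates it with the standard library only\. It is illustrative rather than the authors' code: the key names `depth`, `learning_rate` and `l2_leaf_reg` are assumed labels for the three hyperparameters, and `n_features_to_keep` encodes one reading of "10% of features \(at least 32\)", namely a 10% target with a floor of 32, capped at the number of available features\.

```python
from itertools import product

# Search space as stated in the text; the key names are assumptions.
GRID = {
    "depth": [2, 4, 6, 8, 10],
    "learning_rate": [0.01, 0.03, 0.05],
    "l2_leaf_reg": [0, 2, 5],
}
N_FOLDS = 3  # 3-fold cross-validation, as stated in the text


def enumerate_grid(grid):
    """Expand a dict of value lists into a list of configuration dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def n_features_to_keep(n_features, percent=10, minimum=32):
    """Assumed rule: keep `percent`% of features (rounded up), at least
    `minimum`, and never more than the number of features available."""
    target = (n_features * percent + 99) // 100  # integer ceiling
    return min(n_features, max(minimum, target))


configs = enumerate_grid(GRID)
print(len(configs), "configurations,", len(configs) * N_FOLDS, "cross-validated fits")
print(n_features_to_keep(1000), "features kept out of 1000")
```

Under these assumptions the grid contains 5 × 3 × 3 = 45 configurations, so the 3\-fold search requires 135 model fits before the final training run\.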
The logistic regression auditor uses imputation, one\-hot encoding, and feature scaling\. Counterfactual search is run with budgets $B \in \{1, \ldots, 5\}$\.

Figure 7: Spearman inter\-feature correlations\.

Figure 8: SHAP Lundberg and Lee \([2017](https://arxiv.org/html/2606.18889#bib.bib15)\) feature importance across Extractor modules for the policy model in predicting positive feedback\.

#### Semantic feature correlation\.

Fig\. [7](https://arxiv.org/html/2606.18889#A1.F7) shows that the extracted features are largely uncorrelated, with the exception of empathy and response\_tone, which exhibit a moderate positive correlation\. This suggests that each feature captures a distinct aspect of response quality and contributes unique information for predicting positive user feedback\.

| Auditor | Extractor | Counterfactual: improvement rate \(%\) | Counterfactual: improvement delta \(%\) | Greedy: improvement rate \(%\) | Greedy: improvement delta \(%\) |
| --- | --- | --- | --- | --- | --- |
| EBM | Qwen3\-80B | 89\.58 | \+5\.3 | 89\.49 | \+3\.4 |
| EBM | Qwen3\.5\-27B | 97\.19 | \+6\.3 | 94\.78 | \+4\.1 |
| EBM | Qwen3\.5\-27B\(R\) | 97\.54 | \+6\.3 | 98\.18 | \+4\.3 |
| EBM | Gemma\-4\-31B | 94\.91 | \+6\.4 | 99\.38 | \+3\.6 |
| EBM | Gemma\-4\-31B\(R\) | 98\.91 | \+6\.6 | 99\.23 | \+5\.0 |
| LR | Qwen3\-80B | 89\.55 | \+5\.3 | 88\.29 | \+2\.2 |
| LR | Qwen3\.5\-27B | 86\.26 | \+6\.5 | 71\.95 | \+1\.6 |
| LR | Qwen3\.5\-27B\(R\) | 88\.68 | \+6\.8 | 74\.94 | \+2\.0 |
| LR | Gemma\-4\-31B | 94\.72 | \+7\.6 | 46\.98 | \+1\.0 |
| LR | Gemma\-4\-31B\(R\) | 95\.84 | \+7\.0 | 88\.23 | \+2\.3 |
| Average | | 93\.18 | \+6\.41 | 85\.14 | \+2\.95 |

Table 4: Comparison of our policy\-guided counterfactual recommendations with a greedy baseline across auditor models and feature extractors, averaged over edit budgets\. Improvement rate is the percentage of responses with a non\-negative auditor\-evaluated change, while improvement delta is the mean change in predicted positive\-feedback probability\. Our method achieves higher average improvement rate and substantially larger gains than the greedy baseline\.

#### Semantic feature importance\.
The feature\-importance patterns, depicted in Fig\. [8](https://arxiv.org/html/2606.18889#A1.F8), show that across models, the most influential semantic features are response\_tone, personalization, and problems\_addressed\. This supports the hypothesis that patient satisfaction is sensitive to the way information is communicated\. Notably, these features correspond to aspects that a doctor can revise without delegating medical reasoning to an LM\.

#### Comparison with Greedy Recommendations

We compare our counterfactual recommendation algorithm with greedy recommendations\. For a given edit budget, the greedy baseline selects the semantic feature with the lowest ordinal score and increases its value by the amount specified in the budget\. No policy model is used to guide the selection\. Table [4](https://arxiv.org/html/2606.18889#A1.T4) reports the results after applying the auditors: greedy recommendations have a lower improvement delta and a lower improvement rate overall\.
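
The greedy baseline and the two reported metrics can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: ordinal levels are encoded as integers 0 to 3 \(an assumed encoding of the four ordered values per feature\), ties between equally low features are broken alphabetically, the increase is capped at the top level, and the function names are hypothetical\. The metric definitions follow the Table 4 caption\.

```python
def greedy_recommendation(scores, budget, max_level=3):
    """Raise the lowest-scoring feature by `budget` ordinal steps.

    `scores` maps feature name -> ordinal level (assumed 0..max_level).
    Ties are broken alphabetically and the result is capped at
    `max_level`; both choices are assumptions of this sketch.
    """
    if budget < 1:
        raise ValueError("budget must be a positive number of steps")
    feature = min(scores, key=lambda name: (scores[name], name))
    updated = dict(scores)
    updated[feature] = min(scores[feature] + budget, max_level)
    return feature, updated


def improvement_stats(before, after):
    """Improvement rate (% of responses with a non-negative change) and
    improvement delta (mean change) from auditor-predicted probabilities."""
    if not before or len(before) != len(after):
        raise ValueError("inputs must be non-empty and of equal length")
    deltas = [a - b for b, a in zip(before, after)]
    rate = 100.0 * sum(d >= 0 for d in deltas) / len(deltas)
    return rate, sum(deltas) / len(deltas)


# Illustrative values only; these are not data from the paper.
scores = {"empathy": 1, "response_tone": 2, "actionability": 3}
print(greedy_recommendation(scores, budget=1))
print(improvement_stats([0.5, 0.5, 0.25, 0.75], [0.75, 0.5, 0.125, 1.0]))
```

Because the greedy rule looks only at the current ordinal levels, it can spend its budget on a feature that the auditors consider unimportant, which is consistent with its lower improvement delta in Table 4\.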