CP-Agent: Context-Aware Multimodal Reasoning for Cellular Morphological Profiling under Chemical Perturbations

arXiv cs.AI 06/03/26, 04:00 AM Papers
Summary
CP-Agent is a multimodal large language model that interprets cellular morphological changes under chemical perturbations using context-aware alignment (CP-CLIP), enabling interpretable and scalable phenotypic screening for drug discovery.
arXiv:2606.03435v1 Announce Type: new Abstract: Cell Painting combines multiplexed fluorescent staining, high-content imaging, and quantitative analysis to generate high-dimensional phenotypic readouts to support diverse downstream tasks such as mechanism-of-action (MoA) inference, toxicity prediction, and construction of drug-disease atlases. However, existing workflows are slow, costly and difficult to interpret. Approaches for drug screening modeling predominantly focus on molecular representation learning, while neglecting actual experimental context (e.g., cell line, dosing schedule, etc.), limiting generalization and MoA resolution. We introduce CP-Agent, an agentic multimodal large language model (MLLM) capable of generating mechanism-relevant, human-interpretable rationales for cell morphological changes under drug perturbations. At its core, CP-Agent leverages a context-aware alignment module, CP-CLIP, that jointly embeds high-content images and experimental metadata to enable robust treatment and MoA discrimination (achieving a maximum F1-score of 0.896). By integrating CP-CLIP outputs with agentic tool usage and reasoning, CP-Agent compiles rationales into a structured report to guide experimental design and hypothesis refinement. These capabilities highlight CP-Agent's potential to accelerate drug discovery by enabling more interpretable, scalable, and context-aware phenotypic screening -- streamlining iterative cycles of hypothesis generation in drug discovery.
Cached at: 06/03/26, 09:43 AM
# CP-Agent: Context‑Aware Multimodal Reasoning for Cellular Morphological Profiling under Chemical Perturbations
Source: [https://arxiv.org/html/2606.03435](https://arxiv.org/html/2606.03435)
Yuxin Zhang (1,\*), Yiyao Li (2,\*), Ping Shu Ho (4), Simon See (4), Zhenqin Wu (2,†), Kevin Tsia (1,3,5,†)

1. Department of Electrical and Computer Engineering, The University of Hong Kong
2. School of Computing and Data Science, The University of Hong Kong
3. School of Biomedical Engineering, The University of Hong Kong
4. Nvidia AI Technology Center
5. Advanced Biomedical Instrumentation Centre

###### Abstract

Cell Painting combines multiplexed fluorescent staining, high‑content imaging, and quantitative analysis to generate high\-dimensional phenotypic readouts to support diverse downstream tasks such as mechanism\-of\-action \(MoA\) inference, toxicity prediction, and construction of drug–disease atlases\. However, existing workflows are slow, costly and difficult to interpret\. Approaches for drug screening modeling predominantly focus on molecular representation learning, while neglecting actual experimental context \(e\.g\., cell line, dosing schedule, etc\.\), limiting generalization and MoA resolution\. We introduce CP\-Agent, an agentic multimodal large language model \(MLLM\) capable of generating mechanism\-relevant, human\-interpretable rationales for cell morphological changes under drug perturbations\. At its core, CP\-Agent leverages a context\-aware alignment module, CP\-CLIP, that jointly embeds high\-content images and experimental metadata to enable robust treatment and MoA discrimination \(achieving a maximum F1\-score of 0\.896\)\. By integrating CP\-CLIP outputs with agentic tool usage and reasoning, CP‑Agent compiles rationales into a structured report to guide experimental design and hypothesis refinement\. These capabilities highlight CP\-Agent’s potential to accelerate drug discovery by enabling more interpretable, scalable, and context\-aware phenotypic screening – streamlining iterative cycles of hypothesis generation in drug discovery\.

\*Equal contribution. †Corresponding authors. Project page: https://github.com/letitia-zhang/CP-Agent

## 1 Introduction

High-content imaging with Cell Painting has become a workhorse for scalable phenotypic drug discovery. The technique integrates advanced microscopy, multiplexed fluorescent staining, and quantitative image analysis to establish high-dimensional morphological cell profiles that capture rich, multiscale cellular responses to chemical perturbations. These profiles have proven valuable in supporting mechanism-of-action (MoA) inference (Tian et al., [2023](https://arxiv.org/html/2606.03435#bib.bib18)), toxicity prediction (Ewald et al., [2025](https://arxiv.org/html/2606.03435#bib.bib19)), hit triage (Vincent et al., [2020](https://arxiv.org/html/2606.03435#bib.bib20)), and drug repurposing (Fredin Haslum et al., [2024](https://arxiv.org/html/2606.03435#bib.bib22)), while also enabling the construction of reference atlases and improved target deconvolution (Moffat et al., [2017](https://arxiv.org/html/2606.03435#bib.bib21)).

In Cell Painting workflows, cells are perturbed under diverse conditions, and the experimental context is not a nuisance to control but a signal to model. For instance, dose and time define trajectories, and cellular background modulates pathway readouts (Appendix [B.2](https://arxiv.org/html/2606.03435#A2.SS2)). The resulting profiles guide follow-up experiments and can advance phenotype-driven drug discovery. However, Cell Painting-based drug discovery remains limited by several challenges. (i) Complex intermediate dependencies: morphological responses are highly context-dependent. For example, concentration-dependent profiles show low correlations across dose levels (Pearson r = 0.21 to 0.26) (Trapotsi et al., [2022](https://arxiv.org/html/2606.03435#bib.bib51)), and MoA prediction is sensitive to cell-line context (Seal et al., [2024](https://arxiv.org/html/2606.03435#bib.bib52)). Ignoring this structure conflates biology with acquisition artifacts and wastes valuable metadata. (ii) Convergent morphologies: compounds with distinct mechanisms may induce convergent morphological readouts, which reduces MoA resolution and complicates the extraction of standardized, interpretable descriptors. (iii) Lack of semantic grounding: representing image embeddings as unstructured feature vectors restricts their capacity for semantic reasoning and downstream biological inference.

Recently, various AI methods have been applied to Cell Painting datasets, including generative approaches that synthesize images under perturbations (Navidi et al., [2024](https://arxiv.org/html/2606.03435#bib.bib23); Cross-Zamirski et al., [2023](https://arxiv.org/html/2606.03435#bib.bib25); Palma et al., [2025](https://arxiv.org/html/2606.03435#bib.bib26)) and multimodal frameworks that integrate chemical and genetic annotations with Cell Painting images (Sanchez-Fernandez et al., [2023](https://arxiv.org/html/2606.03435#bib.bib24); Fradkin et al., [2024](https://arxiv.org/html/2606.03435#bib.bib64); Lu et al., [2025](https://arxiv.org/html/2606.03435#bib.bib65)). For example, CLOOME first introduced a CLIP-style model to align Cell Painting images with molecular structures. MolPhenix and CellCLIP extend this direction by leveraging strong unimodal foundation models for the alignment. However, many existing models offer visual embeddings as black-box features, which lack semantic interpretability. Moreover, experimental context is often under-used: metadata is appended via late fusion or treated as unstructured text, yielding less informative representations and hindering iterative, closed-loop experimental design. Meanwhile, emerging multimodal large language models (MLLMs) offer reasoning capabilities and have been applied in diverse biological domains, such as genomics, biomedical imaging, and omics data analysis (Zhang et al., [2024a](https://arxiv.org/html/2606.03435#bib.bib29); Lin et al., [2025](https://arxiv.org/html/2606.03435#bib.bib30); Liu et al., [2024b](https://arxiv.org/html/2606.03435#bib.bib28); Hu et al., [2024b](https://arxiv.org/html/2606.03435#bib.bib31); Zhang et al., [2024b](https://arxiv.org/html/2606.03435#bib.bib32)). Yet their applications in drug screening remain underexplored.

In this work, we introduce CP\-Agent, a context\-aware, agentic MLLM framework for Cell Painting drug perturbation screening\. At its core is CP‑CLIP, a contrastive alignment module that jointly embeds Cell Painting images and structured experimental context, including drug compounds and other essential experimental conditions, enhancing the biological relevance of cell morphology\. The model is pretrained on 1\.9 million image\-context pairs, with a customized token injection strategy that embeds key fields for better alignment\. Comprehensive evaluations across curated classification tasks show that CP\-CLIP outperforms general\-purpose baselines\. Built on this perception layer, CP\-Agent integrates tool\-augmented reasoning and task\-adapted MLLMs grounded in phenotype descriptors and MoA ontologies to generate structured, interpretable outputs\. Together, this agentic system supports scalable and interoperable phenotypic analysis, enabling cross\-study generalization and providing actionable insights for assay prioritization and iteration, thereby accelerating hypothesis generation and improving decision\-making in phenotypic drug discovery\.

## 2 Method

### 2.1 Dataset

We employed three open-access Cell Painting datasets, totaling approximately 1.9 million image-context pairs: BBBC021 (Caie et al., [2010](https://arxiv.org/html/2606.03435#bib.bib15)), CPJUMP1 (Chandrasekaran et al., [2024](https://arxiv.org/html/2606.03435#bib.bib16)), and RxRx3 (Fay et al., [2023](https://arxiv.org/html/2606.03435#bib.bib17)), encompassing diverse compound-induced phenotypes. Each image-context pair comprises a microscopy image and its associated experimental context (e.g., cell lines, experimental treatment conditions). We curated compounds to ensure traceable MoA labels across datasets. For each collection, we matched SMILES representations of the perturbing chemical compounds to ChEMBL, retrieved their targets and MoAs, and retained only compounds with publicly resolvable MoA names. A summary of the curated multi-dataset setting is provided in Table [1](https://arxiv.org/html/2606.03435#S2.T1). More details about dataset backgrounds are provided in Appendix [C](https://arxiv.org/html/2606.03435#A3).

Table 1: Summary of datasets used in this study

| Dataset | Cell line | Channel | Compound | Concentration | Time | Image Pair |
| --- | --- | --- | --- | --- | --- | --- |
| BBBC021 | MCF-7 (p53 WT) | 3 | 34 | Variable 8-point half-log | 24 h | 144,411 |
| CPJUMP1 | U2OS, A549 | 5 | 62 | 5.0 µM | 24 h, 48 h | 562,687 |
| RxRx3 | HUVEC | 6 | 380 | Fixed 8-point half-log | ~20 h | 1,265,984 |

The training set comprises 1,846,436 image–text pairs, while the validation set contains 9,395 pairs. For zero-shot evaluation, we curated a held-out set of compounds spanning all three datasets, selected to assess generalization to unseen perturbations.

### 2.2 Molecular Drug Encoding

Several established approaches map compound perturbations to vector representations, enabling alignment with image embeddings and facilitating multimodal learning (Winter et al., [2019](https://arxiv.org/html/2606.03435#bib.bib8); Wu et al., [2025](https://arxiv.org/html/2606.03435#bib.bib9)). For instance, SMILES-based (e.g., ChemBERTa) and graph-based models learn molecular embeddings from structure, often using RDKit for preprocessing. Alternatively, one can compute continuous molecular descriptor embeddings (e.g., physicochemical and topological descriptors), formalized as a parameterized feature extractor

$$\phi_{\mathrm{desc}}(x;P)=\left[f_{1}(x;P_{1}),f_{2}(x;P_{2}),\ldots,f_{d}(x;P_{d})\right]\in\mathbb{R}^{d},$$

where $x$ is an input molecular representation (e.g., a SMILES string or a molecular graph) and each $f_{i}(x;P_{i})$ extracts a specific property, forming a $d$-dimensional real-valued feature vector. In contrast, binary fingerprint embeddings, which encode the presence or absence of substructures (e.g., Morgan/circular, MACCS, or path-based fingerprints) (Bento et al., [2020](https://arxiv.org/html/2606.03435#bib.bib10)),

$$\phi_{\mathrm{fp}}:\mathcal{M}\rightarrow\{0,1\}^{d}\text{ or }\mathbb{N}_{0}^{d},$$

yield binary or count-based encodings over the molecular space $\mathcal{M}$.

### 2.3 CP-CLIP: Preprocessing

To harmonize Cell Painting images across datasets with varying resolution and signal quality, we defined a channel-wise preprocessing step $\mathcal{P}:\mathbb{R}^{H_{0}\times W_{0}}\rightarrow\mathbb{R}^{H\times W}$, applied independently to each fluorescence channel. This includes Contrast Limited Adaptive Histogram Equalization (CLAHE), random Laplacian sharpening, and gamma correction, yielding enhanced images $\tilde{I}=\mathcal{P}(I)$. Enhanced single-channel images are then cropped into $512\times 512$ patches and stacked, yielding input tiles $x_{p}\in\mathbb{R}^{512\times 512\times C}$. For each perturbation tile $x_{p}$, a corresponding control tile $x_{c}\in\mathbb{R}^{512\times 512\times C}$ is independently sampled from a matching control set $\Omega(x_{p})$, whose tiles share all experimental context (e.g., plate, cell line, channel) with $x_{p}$ except for the perturbation compound. That is, $x_{c}\sim\mathcal{U}\left(\Omega(x_{p})\right)$. The final image-branch input is formed by concatenating the grayscale perturbation and control tiles along the channel dimension, $\hat{x}=\operatorname{concat}(x_{p},x_{c})\in\mathbb{R}^{512\times 512\times 2}$. This paired design encourages the model to learn the contrasts between treated and untreated states.
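The control-pairing step can be illustrated with a minimal sketch. It is not the authors' implementation: the `Tile` record, the `"DMSO"` control label, and the flattened two-pixel "patches" are hypothetical stand-ins chosen so that the example runs with the standard library only.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Tile:
    plate: str
    cell_line: str
    channel: str
    compound: str   # "DMSO" marks an untreated control (assumed label)
    pixels: tuple   # flattened grayscale values standing in for a 512x512 patch

def matching_controls(tile, pool, control_label="DMSO"):
    """Omega(x_p): control tiles sharing plate, cell line and channel with the tile."""
    key = (tile.plate, tile.cell_line, tile.channel)
    return [c for c in pool
            if c.compound == control_label
            and (c.plate, c.cell_line, c.channel) == key]

def paired_input(tile, pool, rng):
    """Sample x_c uniformly from Omega(x_p) and stack it with x_p channel-wise."""
    candidates = matching_controls(tile, pool)
    if not candidates:
        raise ValueError("no matching control tile for this perturbation")
    control = rng.choice(candidates)
    # Channel-wise concatenation: each pixel becomes a (perturbed, control) pair.
    return list(zip(tile.pixels, control.pixels))

pool = [
    Tile("P1", "U2OS", "DNA", "DMSO", (0.1, 0.2)),
    Tile("P2", "U2OS", "DNA", "DMSO", (0.9, 0.9)),  # different plate: excluded
    Tile("P1", "A549", "DNA", "DMSO", (0.5, 0.5)),  # different cell line: excluded
]
treated = Tile("P1", "U2OS", "DNA", "cmpd-A", (0.7, 0.3))
x_hat = paired_input(treated, pool, random.Random(0))
```

Because only one control in the toy pool matches the treated tile on plate, cell line, and channel, the sampled pair is deterministic here; with a larger pool, a different control would be drawn each time the tile is visited.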

Molecular descriptors are projected via a fixed-dimensional mapping $f_{\mathrm{desc}}:\mathcal{X}\rightarrow\mathbb{R}^{d}$, where each feature dimension corresponds to a predefined physicochemical or topological property (see Appendix [D](https://arxiv.org/html/2606.03435#A4)). Let $v=f_{\mathrm{desc}}(x)\in\mathbb{R}^{d}$ denote the raw descriptor vector for compound $x\in\mathcal{X}$. To ensure numerical stability and comparability across compounds, dimensions containing undefined values (e.g., NaNs or Infs) are removed, and z-score normalization is applied independently to each feature dimension: $\tilde{v}_{i}=\frac{v_{i}-\mu_{i}}{\sigma_{i}}$.
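As a rough illustration of this cleaning step, the sketch below drops descriptor dimensions that are non-finite for any compound and then z-scores the remaining ones. The function name, the use of the population standard deviation, and the handling of constant columns are assumptions; the text does not specify them.

```python
import math
from statistics import mean, pstdev

def clean_and_standardize(rows):
    """rows: equal-length raw descriptor vectors, one per compound."""
    n_dims = len(rows[0])
    # Keep only the dimensions that are finite (no NaN/Inf) for every compound.
    keep = [j for j in range(n_dims) if all(math.isfinite(r[j]) for r in rows)]
    columns = []
    for j in keep:
        col = [r[j] for r in rows]
        mu, sigma = mean(col), pstdev(col)
        # Assumption: constant columns are mapped to 0 to avoid division by zero.
        columns.append([(v - mu) / sigma if sigma > 0 else 0.0 for v in col])
    standardized = [list(vals) for vals in zip(*columns)]
    return keep, standardized

raw = [
    [1.0, float("nan"), 10.0],
    [3.0, 2.0,          10.0],
    [5.0, float("inf"), 10.0],
]
kept_dims, z = clean_and_standardize(raw)
```

In this toy input the second dimension is discarded because two compounds have undefined values, and the third is constant, so it carries no information after standardization.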

To account for the compound-specific dosing scheme, each molecule is represented by a normalized dosing pair $\left[\rho_{\max},s(C)\right]$, where $\rho_{\max}$ denotes the molecular mass-normalized maximum concentration (in mg/mL), and $s(C)$ is the log-scaled dose step index corresponding to a given concentration. Let $M\in\mathbb{R}_{>0}$ denote the molecular weight (in Da or g/mol), and $C_{\max}\in\mathbb{R}_{>0}$ the nominal maximum concentration (in $\mu$M). The maximum mass concentration is then given by:

$$\rho_{\max}[\mathrm{mg}/\mathrm{mL}]:=\frac{M[\mathrm{Da}]\cdot C_{\max}[\mu\mathrm{M}]}{10^{6}}\tag{1}$$

where the denominator $10^{6}$ reflects the conversion from $\mu$M and Da to mg/mL. For each titration point $C\in\left\{C_{1},\ldots,C_{8}\right\}$, a pseudo-step index is computed on a log scale to reflect dilution ratios:

$$s(C):=\frac{\log_{10}\left(C_{\max}\right)-\log_{10}(C)}{\Delta\log},\quad\Delta\log=0.5\tag{2}$$

where the denominator 0.5 corresponds to the $\log_{10}$ fold change between adjacent titration levels in a half-log ($10^{0.5}\approx 3.16$-fold) serial dilution protocol. A detailed derivation is provided in Appendix [E](https://arxiv.org/html/2606.03435#A5).
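Equations (1) and (2) can be checked with a short worked example. The sketch below assumes a hypothetical compound with a molecular weight of 400 Da and a top concentration of 10 µM; the function names are illustrative and not taken from the paper.

```python
import math

def rho_max_mg_per_ml(mol_weight_da, c_max_um):
    """Eq. (1): mass concentration at the top dose, in mg/mL."""
    return mol_weight_da * c_max_um / 1e6

def dose_step(c_um, c_max_um, delta_log=0.5):
    """Eq. (2): number of delta_log-sized log10 steps below the top concentration."""
    return (math.log10(c_max_um) - math.log10(c_um)) / delta_log

# Hypothetical compound: 400 Da, top dose 10 uM -> 0.004 mg/mL.
rho = rho_max_mg_per_ml(400.0, 10.0)

# An 8-point series in which each level is 10**0.5 (about 3.16x) below the previous one.
series_um = [10.0 * 10 ** (-0.5 * k) for k in range(8)]
steps = [dose_step(c, 10.0) for c in series_um]  # approximately 0, 1, ..., 7
```

The unit conversion follows from 1 µM = 10^-6 mol/L and 1 g/L = 1 mg/mL. With a step size of 0.5 in log10 units, the top dose maps to index 0 and each successive dilution level adds one step.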

For observation time, let $t\in\mathbb{R}_{\geq 0}$ denote time in days. Temporal normalization rescales $t$ into the unit interval via $\tilde{t}=\frac{t}{T_{\max}}$, with $T_{\max}=112$. The 112-day (16-week) window reflects the FDA’s stopping rule, adopted by Watkins et al. ([2022](https://arxiv.org/html/2606.03435#bib.bib50)) in their pharmacoeconomic analysis. These representations ensure that the input space remains consistent across compounds with varying dosing schemes and time points.

### 2.4 CP-CLIP: Context-Aware Token Projection

![Refer to caption](https://arxiv.org/html/2606.03435v1/x1.png) Figure 1: Illustration of CP-Agent (top) and CP-CLIP (bottom). CP-Agent connects perception, memory retrieval, and modular analysis into a unified pipeline for generating reports for Cell Painting experiments. CP-CLIP forms the backbone of CP-Agent’s perception module, providing joint embeddings of Cell Painting images and structured experimental context.

Our contrastive framework uses a structured text encoder tailored to the metadata obtained from drug screening experiments (Figure [1](https://arxiv.org/html/2606.03435#S2.F1), bottom). Each experiment is represented as a prompt-like sequence composed of cell culture, imaging, and drug compound perturbation conditions. Here, the “raw text” refers to structured experimental metadata such as cell line, culture medium, imaging parameters, compound identity, dosage, time, and other culture information where available. These contextual descriptions are first composed into a natural-language-style sentence and tokenized into input IDs using the standard GPT-2 tokenizer. To accommodate structured context and consistent representations of perturbing compounds, we introduced field-specific placeholder tokens (i.e., <CMPD\>, <CONC\>, <TIME\>) for compound descriptors $z_{\text{cmpd}}=\phi_{\mathrm{desc}}(x;P)\in\mathbb{R}^{d}$, normalized concentration $z_{\text{conc}}=\left[\rho_{\max},s(C)\right]\in\mathbb{R}^{2}$, and normalized time $z_{\text{time}}=\tilde{t}\in\mathbb{R}$. The special placeholder tokens are inserted directly into the text sequence and registered in the tokenizer’s vocabulary. During tokenization, they are recognized as atomic units, and their positions are preserved without being split or altered. Their embeddings are then dynamically computed via field-specific multilayer perceptron (MLP) trunks $f_{*}:\mathbb{R}^{d^{\prime}}\rightarrow\mathbb{R}^{D}$:

$$\begin{aligned}
e_{\text{cmpd}}&=f_{\text{cmpd}}\left(z_{\text{cmpd}}\right)\in\mathbb{R}^{D}\\
e_{\text{conc}}&=f_{\text{conc}}\left(z_{\text{conc}}\right)\in\mathbb{R}^{D}\\
e_{\text{time}}&=f_{\text{time}}\left(z_{\text{time}}\right)\in\mathbb{R}^{D}
\end{aligned}\tag{3}$$

where $f_{\text{cmpd}}$, $f_{\text{conc}}$, and $f_{\text{time}}$ are lightweight MLP trunks encoding compound identity, concentration, and time point, whose outputs are used in place of the placeholders. The resulting text input is a hybrid sequence:

$$X=\left[\mathrm{CLS},t_{1},t_{2},\ldots,\underbrace{e_{\text{cmpd}}}_{<\mathrm{CMPD}>},\ldots,\underbrace{e_{\text{conc}}}_{<\mathrm{CONC}>},\ldots,\underbrace{e_{\text{time}}}_{<\mathrm{TIME}>},\ldots\right]\tag{4}$$

This hybrid sequence, combining standard subword embeddings $t_{i}\in\mathbb{R}^{D}$ with structured embeddings $e_{*}\in\mathbb{R}^{D}$ from field-specific MLPs, is fed into the text Transformer to produce the final text representation. Implementation details are in Appendix [F](https://arxiv.org/html/2606.03435#A6). By replacing placeholder tokens with learned embeddings, the model fuses continuous metadata with discrete language tokens in a shared embedding space. The text encoder thus captures both experimental signals and linguistic coherence, enabling better semantic alignment.
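The placeholder substitution behind Eqs. (3) and (4) can be sketched in plain Python. Everything concrete here is an assumption made for illustration: the embedding width D = 2, the two-layer ReLU form of the MLP trunks, and all weights, tokens, and field values are hypothetical, and real subword embeddings would come from the text encoder.

```python
def mlp(z, w1, w2):
    """Tiny two-layer perceptron with ReLU; weights are nested lists (rows = outputs)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, z))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def build_hybrid_sequence(tokens, token_embedding, field_values, field_mlps):
    """Eq. (4): subword embeddings everywhere except at placeholder positions."""
    sequence = []
    for tok in tokens:
        if tok in field_mlps:                       # <CMPD>, <CONC>, <TIME>
            w1, w2 = field_mlps[tok]
            sequence.append(mlp(field_values[tok], w1, w2))
        else:                                       # ordinary token lookup
            sequence.append(token_embedding[tok])
    return sequence

identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = ["CLS", "U2OS", "<CMPD>", "<CONC>", "<TIME>"]
token_embedding = {"CLS": [1.0, 0.0], "U2OS": [0.0, 1.0]}
field_values = {
    "<CMPD>": [0.5, -1.0, 2.0],   # z_cmpd: standardized descriptors (d = 3 here)
    "<CONC>": [0.004, 2.0],       # z_conc: [rho_max, s(C)]
    "<TIME>": [1.0 / 112.0],      # z_time: one day, normalized by T_max = 112
}
field_mlps = {
    "<CMPD>": ([[1.0, 1.0, 1.0], [1.0, 0.0, 0.0]], identity),
    "<CONC>": (identity, [[1.0, 1.0], [1.0, -1.0]]),
    "<TIME>": ([[1.0], [2.0]], identity),
}
X = build_hybrid_sequence(tokens, token_embedding, field_values, field_mlps)
```

Each field keeps its own input width (3, 2, and 1 in this toy setup) but is projected to the same width D, which is what lets the continuous metadata sit in the same sequence as ordinary token embeddings.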

![Refer to caption](https://arxiv.org/html/2606.03435v1/x2.png) Figure 2: Automated cell-phenotype assessment pipeline of CP-Agent. Upon user query, CP-CLIP retrieves the relevant experimental context to guide cell segmentation and feature extraction. Downstream agents then rank morphological changes and generate interpretable, end-to-end phenotype reports.

### 2.5 CP-Agent workflow

CP-Agent adopts a modular, memory-augmented architecture that connects perception, tooling, and analysis into a single-pass pipeline (Figure [1](https://arxiv.org/html/2606.03435#S2.F1), top). Given user-provided Cell Painting images, a lightweight memory retriever powered by CP-CLIP fetches the most probable experimental context (i.e., cell line, fluorescence channels, imaging settings, chemical perturbations). Once the experimental context is retrieved, the pipeline proceeds to visual analysis. Rather than relying on vision backbones that produce holistic, biologically opaque embeddings, we extract handcrafted single-cell morphological features. These interpretable representations are processed by a modular, MLLM-driven agent architecture, where the MLLM serves as a policy layer that dynamically routes tasks to interchangeable tools and integrates their outputs. We frame this system as “agentic” in the sense of procedural autonomy (Xu et al., [2025](https://arxiv.org/html/2606.03435#bib.bib68)): unlike reinforcement learning-based planners, CP-Agent employs the MLLM as a cognitive controller within a structured workflow. It relies on the model’s learned reasoning capabilities, rather than fixed logical scripts, to dynamically prioritize morphological features, interpret statistical distribution shifts, and synthesize mechanism-level hypotheses based on retrieved experimental context.

We instantiate this concept on fluorescence Cell Painting data via a specialized CP\-Agent workflow \(Figure[2](https://arxiv.org/html/2606.03435#S2.F2)\), which comprises the following steps:

- CPContext Agent: Given paired Cell Painting images (control vs. perturbation) acquired under matched conditions, the CPContext Agent employs a pre-trained CP-CLIP retriever to obtain experimental context from a curated knowledge base. Simultaneously, it harmonizes metadata via controlled-vocabulary tagging and channel labeling to generate standardized descriptors. Retrieved context is routed both (A) as a context bundle to the FeatRank Agent and the ReportGen Agent, and (B) as metadata keywords to the CellFeat Agent.
- ChannelSeg Agent: Given Cell Painting images, the ChannelSeg Agent performs nuclei instance segmentation on DNA-stained channels and whole-cell segmentation on non-DNA channels (e.g., RNA, Actin, ER). It outputs channel-specific instance masks, which are passed to the CellFeat Agent.
- CellFeat Agent: Given Cell Painting images, corresponding masks, and harmonized metadata, the CellFeat Agent extracts per-cell morphological, intensity, texture, granularity, neighborhood, and occupancy features using a configured CellProfiler pipeline (Appendix [H](https://arxiv.org/html/2606.03435#A8)). Output is routed both (A) as extracted feature items to the FeatRank Agent for mechanism-aware selection, and (B) as channel-wise single-cell feature matrices to the StatSynth Agent for statistical evidence synthesis.
- FeatRank Agent: Given extracted feature items and experimental context, the FeatRank Agent scores and ranks features by their likelihood of being influenced by the perturbation. It generates confidence-weighted rationales to support prioritization. Output is routed as a prioritized feature list with explanations to the StatSynth Agent.
- StatSynth Agent: Given the prioritized feature list, full feature matrices, and experiment-level context, the StatSynth Agent computes per-feature statistical evidence between control and perturbation conditions based on the prioritized features. It summarizes distribution shifts, effect sizes, confidence intervals, and statistical significance. Outputs are routed as statistical summaries and interpretations to the ReportGen Agent for final report composition.
- ReportGen Agent: Given statistical summaries, prioritized features, visual exemplars, and experimental context, the ReportGen Agent composes an integrated interpretation of the perturbation’s biological impact. It identifies key morphological shifts and evaluates their consistency with expected cellular responses to infer plausible mechanisms. The resulting report summarizes these findings, provides follow-up recommendations and visualizations, and is delivered to the users for downstream access.
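The routing described in the list above forms a small directed acyclic graph, which can be made explicit with a short sketch. Only the routes stated explicitly in the list are encoded; the dictionary layout and the use of `graphlib` are illustrative choices, not a description of how CP-Agent schedules its agents.

```python
from graphlib import TopologicalSorter

# Each agent maps to the agents whose outputs it consumes,
# following the routes named in the list above.
consumes = {
    "CPContext": set(),
    "ChannelSeg": set(),
    "CellFeat": {"CPContext", "ChannelSeg"},   # metadata keywords + instance masks
    "FeatRank": {"CPContext", "CellFeat"},     # context bundle + feature items
    "StatSynth": {"CellFeat", "FeatRank"},     # feature matrices + prioritized list
    "ReportGen": {"CPContext", "StatSynth"},   # context bundle + statistical summaries
}

# A valid single-pass execution order: every agent runs after its inputs exist.
order = list(TopologicalSorter(consumes).static_order())
```

CPContext and ChannelSeg have no upstream dependencies, so they can run in either order (or in parallel); the remaining four agents form a chain that always ends with ReportGen.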

The agent tool stack integrates both classical and learning-based components. For segmentation, we fine-tuned VISTA-2D ([He et al.](https://arxiv.org/html/2606.03435#bib.bib36)) for 20 epochs using diverse augmentation strategies to mitigate optics-induced batch effects. The model generates channel-specific masks that enable biologically consistent segmentation across diverse imaging conditions. More details regarding dataset preparation and training of the segmentation model are provided in Appendix [J](https://arxiv.org/html/2606.03435#A10). The StatSynth Agent is tasked with reasoning over high-dimensional single-cell morphological data (typically 30–300 cells per image), which is impractical for direct LLM application due to length constraints and noise (Fang et al., [2024](https://arxiv.org/html/2606.03435#bib.bib37)). Instead, we curate agentic tools that (i) aggregate summary statistics for key features, and (ii) quantify distribution shifts between control and perturbed samples. These compact, interpretable summaries support reliable LLM-based reasoning. Detailed procedures for this step are provided in Appendix [L](https://arxiv.org/html/2606.03435#A12).
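To give a sense of what such compact summaries might look like, the sketch below reduces per-cell values of one feature to a few summary statistics and a standardized effect size. The choice of Cohen's d with a pooled standard deviation is an assumption made for illustration (the actual procedures are in Appendix L), and the feature values are made up.

```python
import math
from statistics import mean, median, stdev, variance

def summarize(values):
    """Compact per-condition summary that fits easily into an LLM prompt."""
    return {"n": len(values), "mean": mean(values),
            "median": median(values), "sd": stdev(values)}

def cohens_d(control, treated):
    """Standardized mean difference using the pooled sample variance."""
    n_c, n_t = len(control), len(treated)
    pooled = ((n_c - 1) * variance(control) + (n_t - 1) * variance(treated)) / (n_c + n_t - 2)
    return (mean(treated) - mean(control)) / math.sqrt(pooled)

# Hypothetical per-cell values of a single feature (e.g., a nucleus area measurement).
control_cells = [10.0, 12.0, 11.0, 13.0, 14.0]
treated_cells = [15.0, 17.0, 16.0, 18.0, 19.0]

report_row = {
    "control": summarize(control_cells),
    "treated": summarize(treated_cells),
    "effect_size_d": cohens_d(control_cells, treated_cells),
}
```

A handful of numbers per feature, instead of hundreds of per-cell rows, is the kind of input the downstream reasoning step can handle within length limits.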

## 3 Experiments and Results

Table 2: Model performance on classification tasks

The columns from Flindokalner to Macro-avg report perturbation-compound classification.

| Model | Cell line | Channel | Flindokalner | Racecadotril | AZM-475271 | Misoprostol | Trazodone | Orantinib | Rufinamide | Lumiracoxib | BIRB-796 | Methoxsalen | Macro-avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random Guessing | 0.25 | 0.143 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 |
| Grok-4 | 0.448 | 0.228 | 0.215 | 0.174 | 0.0 | 0.0 | 0.410 | 0.190 | 0.034 | 0.0 | 0.0 | 0.0 | 0.102 |
| GPT-5 | 0.377 | 0.439 | 0.059 | 0.168 | 0.0 | 0.0 | 0.353 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.074 |
| Claude-4-Sonnet | 0.450 | 0.198 | 0.0 | 0.00 | 0.0 | 0.057 | 0.0 | 0.0 | 0.0 | 0.0 | 0.211 | 0.0 | 0.027 |
| Gemini-2.5-Pro | 0.526 | 0.628 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.023 | 0.0 | 0.0 | 0.045 | 0.0 | 0.007 |
| CLOOME ViT-B/16 | - | - | 0.784 | 0.784 | 0.729 | 0.854 | 0.623 | 0.849 | 0.653 | 0.619 | 0.854 | 0.800 | 0.755 |
| CLIP ViT-B/16 | 1.000 | 0.955 | 0.776 | 0.680 | 0.661 | 0.216 | 0.629 | 0.447 | 0.500 | 0.600 | 0.575 | 0.642 | 0.657 |
| SigLIP-ViT-B/16 | 1.000 | 0.925 | 0.734 | 0.471 | 0.515 | 0.826 | 0.291 | 0.638 | 0.395 | 0.272 | 0.604 | 0.400 | 0.514 |
| CP-CLIP SigLIP-ViT-B/16 (descriptor) | 1.000 | 0.934 | 0.685 | 0.442 | 0.485 | 0.776 | 0.351 | 0.860 | 0.255 | 0.186 | 0.660 | 0.620 | 0.532 |
| CP-CLIP ViT-B/16 (fingerprint) | 1.000 | 0.991 | 0.839 | 0.862 | 0.891 | 0.875 | 0.913 | 0.914 | 0.894 | 0.840 | 0.971 | 0.875 | 0.887 |
| CP-CLIP ViT-B/16 (descriptor) | 1.000 | 0.882 | 0.907 | 0.869 | 0.857 | 0.942 | 0.848 | 0.940 | 0.884 | 0.854 | 0.932 | 0.922 | 0.896 |
| CP-CLIP ViT-L/16 (descriptor) | 1.000 | 0.849 | 0.928 | 0.880 | 0.896 | 0.846 | 0.843 | 0.929 | 0.911 | 0.819 | 0.915 | 0.941 | 0.891 |

Table 3: Unseen drugs similarity score

| Model | Regorafenib | Sacubitril | Buparlisib | Dexamethasone | Nimodipine | AZ258 | Nilotinib | MG-132 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CLIP ViT-B/16 | 0.207 ± 0.082 | 0.2058 ± 0.104 | 0.289 ± 0.046 | 0.3601 ± 0.049 | 0.377 ± 0.039 | 0.328 ± 0.069 | 0.174 ± 0.080 | 0.346 ± 0.072 | 0.286 |
| SigLIP ViT-B/16 | 0.038 ± 0.082 | 0.095 ± 0.099 | 0.129 ± 0.073 | 0.146 ± 0.091 | 0.183 ± 0.067 | 0.090 ± 0.186 | -0.055 ± 0.103 | 0.143 ± 0.101 | 0.096 |
| CP-CLIP SigLIP-ViT-B/16 (descriptor) | 0.378 ± 0.077 | 0.420 ± 0.193 | 0.323 ± 0.102 | 0.503 ± 0.130 | 0.515 ± 0.075 | **0.488 ± 0.115** | 0.303 ± 0.090 | 0.380 ± 0.114 | 0.414 |
| CP-CLIP ViT-B/16 (fingerprint) | 0.297 ± 0.093 | 0.222 ± 0.072 | 0.375 ± 0.053 | 0.468 ± 0.052 | 0.461 ± 0.046 | 0.429 ± 0.120 | 0.210 ± 0.109 | 0.420 ± 0.081 | 0.360 |
| CP-CLIP ViT-B/16 (descriptor) | 0.432 ± 0.098 | 0.412 ± 0.094 | 0.396 ± 0.043 | 0.503 ± 0.073 | 0.469 ± 0.032 | 0.468 ± 0.104 | **0.324 ± 0.085** | 0.448 ± 0.081 | 0.432 |
| CP-CLIP ViT-L/16 (descriptor) | **0.455 ± 0.115** | **0.445 ± 0.135** | **0.408 ± 0.053** | **0.530 ± 0.072** | **0.523 ± 0.032** | 0.448 ± 0.106 | 0.295 ± 0.089 | **0.448 ± 0.077** | **0.444** |

To assess the effectiveness of CP-Agent, we isolated and evaluated its core components before measuring end-to-end reporting quality: (a) CP-CLIP (context-aware retrieval and alignment): we evaluate its accuracy on in-distribution classification (seen-drug) and generalization (unseen-drug matching), with comparisons against MLLM baselines and CLIP variants; (b) Vision embedding structure: we evaluate whether CP-CLIP embeddings encode chemically grounded, dose- and MoA-dependent morphology; (c) Statistical synthesis and reporting: we evaluate whether compact summaries enable robust comparisons between control and perturbation in the generated report. Finally, we assessed the effectiveness of full CP-Agent reports via expert review.

### 3.1 Model Variants and MLLM Baselines

To contextualize the performance of our proposed model, we compared it against several leading MLLMs. Specifically, we included Grok-4 (xAI, [2025](https://arxiv.org/html/2606.03435#bib.bib49)), GPT-5 (OpenAI, [2025](https://arxiv.org/html/2606.03435#bib.bib46)), Claude-4-Sonnet (Anthropic, [2025](https://arxiv.org/html/2606.03435#bib.bib47)), and Gemini-2.5-Pro (Google DeepMind, [2025](https://arxiv.org/html/2606.03435#bib.bib48)), which have demonstrated strong performance across a range of general-purpose multimodal benchmarks. Following recent benchmarking protocols for MLLMs in biomedical and healthcare settings (Lozano et al., [2024](https://arxiv.org/html/2606.03435#bib.bib67); Burgess et al., [2025](https://arxiv.org/html/2606.03435#bib.bib66)), we adopt a zero-shot, two-stage prompting pipeline: first, the models were prompted to curate background knowledge relevant to Cell Painting experiments; then, they were asked to answer multiple-choice questions about experimental conditions given the curated background knowledge, paired control and perturbation images, and masked textual prompts. To further narrow the adaptation gap and make the comparison more conservative, we additionally evaluated a few-shot variant in which the MLLMs were provided with a small visual memory bank (two labeled exemplars per class) before answering the same tasks. Detailed prompt templates and the corresponding zero-shot and few-shot results are reported in Appendix [M](https://arxiv.org/html/2606.03435#A13).

Alongside these MLLMs, we benchmarked multiple variants of our contrastive learning framework, CP\-CLIP, which extends the CLIP architecture by integrating structured experimental context into training\. As a baseline, we used the original CLIP model based on the ViT\-B/16 vision backbone, retrained on natural language text aligned with Cell Painting images\. All CP\-CLIP variants enhance this setup by injecting serialized numerical metadata, as detailed in Section [2\.4](https://arxiv.org/html/2606.03435#S2.SS4)\. We evaluated CP\-CLIP variants that differ in compound encoding and loss function \(see Appendix [G](https://arxiv.org/html/2606.03435#A7)\), including: \(i\) a descriptor\-based model that uses continuous molecular descriptors, and \(ii\) a fingerprint\-based model that uses binary fingerprints\. We also tested a SigLIP variant that uses a sigmoid\-based pairwise contrastive objective \(Zhai et al\., [2023](https://arxiv.org/html/2606.03435#bib.bib57)\)\. To assess the impact of vision model capacity on performance, we tested a CP\-CLIP variant with a ViT\-L/16 vision backbone\.
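To make the difference between the two objectives concrete, the following is a minimal NumPy sketch of a softmax\-based CLIP loss and a sigmoid\-based pairwise loss in the style of SigLIP\. It is an illustration under stated assumptions, not the training code used for CP\-CLIP: the function names, the temperature of 0\.07, and the scale and bias values are illustrative choices, and the embeddings are assumed to be row\-aligned \(row i of the image batch matches row i of the text batch\)\.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_loss(img, txt, temperature=0.07):
    """Symmetric softmax (InfoNCE-style) loss: each image must pick its own
    text out of the batch, and each text its own image."""
    logits = l2_normalize(img) @ l2_normalize(txt).T / temperature
    image_to_text = -np.diag(log_softmax(logits, axis=1)).mean()
    text_to_image = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (image_to_text + text_to_image)

def siglip_loss(img, txt, scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss: every image-text pair is an independent binary
    decision (+1 on the diagonal, -1 elsewhere). Scale and bias are
    illustrative values, not tuned settings."""
    logits = scale * (l2_normalize(img) @ l2_normalize(txt).T) + bias
    labels = 2.0 * np.eye(len(img)) - 1.0
    # -log(sigmoid(z)) equals log(1 + exp(-z)), computed stably with logaddexp.
    return np.logaddexp(0.0, -labels * logits).sum() / len(img)

# Toy check: perfectly aligned pairs score a lower loss than shuffled pairs.
aligned = np.eye(4)
shuffled = np.roll(aligned, 1, axis=0)
print(clip_loss(aligned, aligned) < clip_loss(aligned, shuffled))
print(siglip_loss(aligned, aligned) < siglip_loss(aligned, shuffled))
```

The softmax loss normalizes over the whole batch, whereas the sigmoid loss scores each pair independently, which is the property that distinguishes the SigLIP variant\.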

![Refer to caption](https://arxiv.org/html/2606.03435v1/x3.png)Figure 3: CP\-CLIP captures pharmacologically meaningful morphology\. UMAP projections of CP\-CLIP image embeddings, colored by \(a\) compound identity and \(b\) mechanism of action \(MoA\)\. The clear clustering indicates that the learned representation encodes biologically relevant morphology\. \(c\) Concentration\-dependent morphological changes are captured using image embeddings extracted from samples treated with varying compound doses\.

### 3\.2 Task I: seen\-drug classification

To benchmark in\-distribution performance, we designed classification tasks across three categories: cell line, fluorescence channel and compound\. Classification is performed via retrieval\-based inference, ranking cosine similarity scores between image embeddings and a set of candidate textual prompts, following the standard CLIP paradigm\. The contextual metadata include both textual and numerical variables, which are encoded jointly as a natural language sequence, enabling prompt\-based querying without the need for task\-specific heads\. For example, for compound classification, 10 compounds were randomly sampled to form a balanced 10\-class setting\. Table [2](https://arxiv.org/html/2606.03435#S3.T2) summarizes the results\. Among all general\-purpose MLLMs, Gemini\-2\.5\-Pro achieved the best performance on the cell line and channel prediction tasks \(F1: 0\.526 and 0\.628\)\. However, on compound classification, performance dropped sharply: all models fell below the random baseline, except for Grok\-4, which slightly exceeded it\. Confusion matrices \(Appendix [M\.3](https://arxiv.org/html/2606.03435#A13.SS3)\) revealed near\-zero F1 scores, indicating systematic failure in identifying perturbing chemical compounds and limited generalization of current MLLMs\. In contrast, CP\-CLIP consistently outperformed both the baseline CLIP and all MLLMs across tasks\. Descriptor\-based models slightly outperformed fingerprint\-based ones on compound classification \(F1: 0\.891 vs\. 0\.887\), suggesting that continuous encodings provide richer chemical context\. Scaling the vision encoder from ViT\-B/16 to ViT\-L/16 yielded only a marginal gain \(F1: 0\.896 vs\. 0\.891\), suggesting that a lightweight backbone suffices when paired with strong chemical priors\.
Taken together, these MLLM results also constitute a “no\-CPContext” baseline, reinforcing the conclusion that without explicit perturbation\-aware grounding, current MLLMs fail to extract meaningful biological signals from Cell Painting images\. This emphasizes the essential role of CP\-CLIP as the perception module in CP\-Agent\.
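The retrieval\-based inference described above can be sketched as follows\. This is a minimal illustration, not the CP\-CLIP implementation: the prompt template is hypothetical \(the actual serialization is defined in Section 2\.4\), and the embeddings are random stand\-ins for the outputs of trained image and text encoders\.

```python
import numpy as np

def rank_prompts(image_emb, prompt_embs):
    """Return prompt indices sorted by descending cosine similarity to the
    image embedding, together with the raw similarity scores."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    prompt_embs = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    sims = prompt_embs @ image_emb
    return np.argsort(-sims), sims

# Hypothetical prompt template mixing textual and numerical metadata.
compounds = ["taxol", "anisomycin", "sorbinil"]
prompts = [f"cell line: MCF7; compound: {c}; concentration: 1.0 uM" for c in compounds]

# Stand-in embeddings; in practice these would come from the trained encoders.
rng = np.random.default_rng(0)
prompt_embs = rng.normal(size=(len(prompts), 64))
image_emb = prompt_embs[1] + 0.1 * rng.normal(size=64)  # image close to prompt 1

order, sims = rank_prompts(image_emb, prompt_embs)
predicted = compounds[order[0]]
print(predicted, prompts[order[0]])
```

Because the class set is defined only by the list of candidate prompts, the same function serves the cell line, channel, and compound tasks without a task\-specific classification head\.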

### 3\.3 Task II: unseen\-drug matching

To evaluate generalization, we performed zero\-shot prompt–image matching on held\-out compounds by computing cosine similarity between image and prompt embeddings \(Table [3](https://arxiv.org/html/2606.03435#S3.T3)\)\. The baseline CLIP model \(ViT\-B/16\) yielded low alignment on unseen drugs \(avg\. similarity = 0\.286\), while CP\-CLIP \(descriptor, ViT\-B/16\) achieved 0\.432, an absolute increase of 0\.146 \(about 51% relative\)\. Descriptor\-based models also outperformed fingerprint\-based ones \(0\.432 vs\. 0\.360\), suggesting that continuous encodings capture more relevant chemical context\. Scaling the vision encoder from ViT\-B/16 to ViT\-L/16 further improved performance to 0\.444, suggesting enhanced robustness to morphological variation\. To provide a comparative reference, we also evaluated similarity on seen drugs \(Appendix [I](https://arxiv.org/html/2606.03435#A9)\)\. Notably, similarity on unseen drugs remained reasonably close to that on seen drugs: the descriptor\-based ViT\-B/16 and ViT\-L/16 models achieved 0\.549/0\.432 and 0\.561/0\.444 on seen/unseen drugs, respectively \(a gap of 0\.117 in both cases\), suggesting that CP\-CLIP captures mechanism\-relevant biology rather than memorizing labels\. This zero\-shot capability supports practical applications such as MoA hypothesis generation, hit prioritization, and generalization to novel perturbation contexts\.
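As a quick check on how the reported similarities relate to one another, the short script below recomputes the gains and the seen/unseen gaps from the values quoted in this section\. The variable names are illustrative; only the numbers are taken from the text\.

```python
# Average cosine similarities quoted in Section 3.3.
baseline_unseen = 0.286                          # baseline CLIP, ViT-B/16
desc_b16 = {"seen": 0.549, "unseen": 0.432}      # CP-CLIP descriptor, ViT-B/16
desc_l16 = {"seen": 0.561, "unseen": 0.444}      # CP-CLIP descriptor, ViT-L/16

# Gain of CP-CLIP over the baseline on unseen drugs.
abs_gain = desc_b16["unseen"] - baseline_unseen
rel_gain = abs_gain / baseline_unseen

# Drop in similarity when moving from seen to unseen drugs.
gap_b16 = desc_b16["seen"] - desc_b16["unseen"]
gap_l16 = desc_l16["seen"] - desc_l16["unseen"]

print(f"absolute gain over baseline: {abs_gain:.3f}")   # 0.146
print(f"relative gain over baseline: {rel_gain:.1%}")   # 51.0%
print(f"seen-unseen gap, ViT-B/16:   {gap_b16:.3f}")    # 0.117
print(f"seen-unseen gap, ViT-L/16:   {gap_l16:.3f}")    # 0.117
```

Since these are cosine similarities rather than percentages, the gain is most naturally stated as a difference of 0\.146 in similarity, or roughly 51% relative to the baseline\.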

### 3\.4 Vision embedding analyses

Figure [3](https://arxiv.org/html/2606.03435#S3.F3)a\-b shows UMAP projections of embeddings from CP\-CLIP ViT\-B/16 \(descriptor\)\. The projections reveal clustering by MoA, indicating that the learned representation encodes pharmacologically meaningful morphology beyond compound identity\. Figure [3](https://arxiv.org/html/2606.03435#S3.F3)c shows concentration\-related patterns for four drugs selected from the BBBC021 and RxRx3 datasets\. CP\-CLIP embeddings exhibited clear dose–response trajectories, reflecting concentration\-dependent morphological change\. In particular, the sharp dose responses observed for Anisomycin and Bryostatin are consistent with previous reports \(Cranston et al\., [1982](https://arxiv.org/html/2606.03435#bib.bib40); Marshall et al\., [2002](https://arxiv.org/html/2606.03435#bib.bib41)\)\. In contrast, drugs with minimal morphological impact show flatter trends across dosage\. More examples and a detailed explanation of this schematic are provided in Appendix [K](https://arxiv.org/html/2606.03435#A11)\.
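One simple way to turn image embeddings into a dose–response trajectory is to measure how far the mean embedding at each dose drifts from the control mean\. The sketch below assumes this cosine\-distance summary purely for illustration; the procedure actually used for Figure 3c is described in Appendix K and may differ\. The data are synthetic, and the dose levels and effect sizes are hypothetical\.

```python
import numpy as np

def dose_trajectory(embeddings, doses, control_embeddings):
    """Map each dose to the cosine distance between the mean embedding of
    samples at that dose and the mean control embedding."""
    doses = np.asarray(doses)
    control = control_embeddings.mean(axis=0)
    control = control / np.linalg.norm(control)
    trajectory = {}
    for dose in sorted(set(doses.tolist())):
        mean_emb = embeddings[doses == dose].mean(axis=0)
        cosine = float(mean_emb @ control / np.linalg.norm(mean_emb))
        trajectory[dose] = 1.0 - cosine
    return trajectory

# Synthetic stand-in data (not from the paper): the treated population drifts
# away from the control direction as the dose increases.
rng = np.random.default_rng(0)
dim, n_per_group = 32, 20
control_dir, effect_dir = np.eye(dim)[0], np.eye(dim)[1]
control_embeddings = control_dir + 0.05 * rng.normal(size=(n_per_group, dim))

dose_levels = [0.1, 1.0, 10.0]    # hypothetical concentrations
effect_sizes = [0.1, 0.5, 2.0]    # hypothetical drift per dose
embeddings = np.vstack([
    control_dir + e * effect_dir + 0.05 * rng.normal(size=(n_per_group, dim))
    for e in effect_sizes
])
doses = np.repeat(dose_levels, n_per_group)

traj = dose_trajectory(embeddings, doses, control_embeddings)
for dose, distance in traj.items():
    print(f"dose {dose:>5}: distance from control = {distance:.3f}")
```

Under this summary, a compound with a sharp dose response produces a steeply rising curve, while a compound with minimal morphological impact stays near zero across doses\.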

### 3\.5 CP\-Agent reports

![Refer to caption](https://arxiv.org/html/2606.03435v1/x4.png)Figure 4: Summary reports generated from CP\-Agent\. The examples show CP\-Agent’s ability to recognize clear \(Taxol\), subtle \(Sorbinil\), and complex \(BGT226\) morphological responses, linking them to plausible biological mechanisms\.

We present three case studies from different datasets to demonstrate CP\-Agent generated reports \(Figure [4](https://arxiv.org/html/2606.03435#S3.F4)\): \(i\) Example 1 \(BBBC021, MCF7 \+ Taxol\): Taxol induces a clear cytoskeletal phenotype by stabilizing microtubules and arresting mitosis \(Kiwanuka et al\., [2022](https://arxiv.org/html/2606.03435#bib.bib44)\)\. CP\-Agent detected the localized changes in tubulin texture and correctly linked them to microtubule stabilization and mitotic arrest, demonstrating its ability to recognize canonical, visually prominent phenotypes\. \(ii\) Example 2 \(CPJUMP, A549 \+ Sorbinil\): Sorbinil is an aldose reductase inhibitor that produces a subtle and uncertain phenotype \(Zietek et al\., [2025](https://arxiv.org/html/2606.03435#bib.bib42)\)\. CP\-Agent detected modest shifts \(e\.g\., smoother RNA texture, reduced granularity\), and suggested potential stress granule suppression\. Meanwhile, it also flagged ambiguity and suggested further validation, illustrating its ability to reason under uncertainty\. \(iii\) Example 3 \(RxRx3, HUVEC \+ BGT226\): BGT226 is a PI3K/mTOR inhibitor, leading to a multi\-compartment phenotype affecting organelles, cell shape, and density \(Kampa\-Schittenhelm et al\., [2013](https://arxiv.org/html/2606.03435#bib.bib43)\)\. By integrating mitochondrial texture, cell area, and density changes, CP\-Agent inferred PI3K/mTOR inhibition, showcasing its capacity to synthesize complex morphological cues into mechanistic insights\. Together, these cases show that CP\-Agent adapts to diverse biological contexts, ranging from clear to ambiguous phenotypes, and generates biologically grounded summaries\.
Additional examples and reasoning details are provided in Appendix [O](https://arxiv.org/html/2606.03435#A15)\.

We conducted an expert survey to assess whether the LLM\-based CP\-Agent produces accurate and well\-reasoned screening reports\. Four LLMs \(those listed in Section [3\.1](https://arxiv.org/html/2606.03435#S3.SS1)\) each generated reports for ten control–perturbation image pairs\. Experts \(N = 11\), ranging from PhD students to professors in pharmacology or related fields, rated 40 reports \(10 pairs × 4 models\) on a 1–7 scale across ten criteria from Waqas et al\. \([2025](https://arxiv.org/html/2606.03435#bib.bib45)\), covering language quality and reasoning quality\. Full criteria definitions and examples are provided in Appendix [P](https://arxiv.org/html/2606.03435#A16)\. As shown in Figure [16](https://arxiv.org/html/2606.03435#A17.F16), most metrics received high scores across models\. CP\-Agent powered by GPT\-5 showed the strongest overall reasoning performance, followed closely by Gemini\-2\.5\-Pro\. To further assess the interpretability and consistency of the CP\-Agent framework, we conducted a systematic evaluation of the two LLM\-powered modules: FeatRank Agent and ReportGen Agent\. As shown in Table [15](https://arxiv.org/html/2606.03435#A18.T15), the selected morphological features remained highly consistent across runs, indicating robust and stable feature prioritization\. Table [16](https://arxiv.org/html/2606.03435#A18.T16) also revealed stable corpus\-level consistency of the reports generated by the ReportGen Agent\.
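Consistency of feature selection across repeated runs can be quantified in several ways\. The sketch below uses the mean pairwise Jaccard overlap between the feature sets chosen in each run\. This metric is an assumption made for illustration and is not necessarily the one reported in Table 15; the feature names are hypothetical\.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap of two feature collections (1.0 means identical sets)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def mean_pairwise_jaccard(runs):
    """Average Jaccard overlap over all unordered pairs of runs."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical top-3 features selected in three repeated runs on the same input.
runs = [
    ["tubulin_texture", "cell_area", "nuclear_intensity"],
    ["tubulin_texture", "cell_area", "nuclear_intensity"],
    ["tubulin_texture", "cell_area", "mito_granularity"],
]
score = mean_pairwise_jaccard(runs)
print(f"mean pairwise Jaccard: {score:.3f}")  # 0.667
```

A score of 1\.0 would mean every run selected exactly the same features; here one run swaps a single feature, which lowers two of the three pairwise overlaps to 0\.5\.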

## 4 Discussion and Conclusion

We present CP\-Agent, a context\-aware multimodal reasoning framework for interpretable analysis of Cell Painting drug responses\. Its core, CP\-CLIP, aligns imaging data with experimental context, enhanced by numerically grounded token injection\. This yields strong generalization and outperforms baselines on multiple classification tasks\. CP\-Agent separates and coordinates perception, retrieval, analysis, and reporting into specialized agents \(i\.e\., CPContext, ChannelSeg, CellFeat, FeatRank, StatSynth, ReportGen\)\. This enables an evidence\-first workflow in which CP\-Agent converts high\-dimensional morphological features, together with the experimental context, into compact, calibrated summaries that an MLLM synthesizes into interpretable narratives\. Hence, CP\-Agent supports end\-to\-end biological interpretability: users can trace predicted mechanisms back to the corresponding morphological features, from images to masks, features, statistics, and final explanations\. In histology tasks, many agent\-based pipelines can perform well without training by combining a well\-designed chain\-of\-thought with off\-the\-shelf MLLMs\. In contrast, our results show that zero\-shot prompting on Cell Painting datasets consistently underperforms and that biologically grounded supervision is essential for meaningful reasoning\. CP\-Agent also generalizes to various imaging modalities such as quantitative phase imaging \(QPI\), digital holographic microscopy, and brightfield time\-lapse imaging \(Lo et al\., [2024](https://arxiv.org/html/2606.03435#bib.bib53); Siu et al\., [2023](https://arxiv.org/html/2606.03435#bib.bib54); Zhang et al\., [2023](https://arxiv.org/html/2606.03435#bib.bib55); Lee et al\., [2025](https://arxiv.org/html/2606.03435#bib.bib56)\) and integrates flexibly with tools like ilastik, Fiji, and Icy\.
Overall, CP\-Agent establishes a new paradigm for combining MLLMs with mechanistically grounded analysis, offering a foundation for next\-generation AI systems in phenotypic drug discovery\. Looking forward, the modular agentic architecture of CP\-Agent could be flexibly extended to experimental planning \(e\.g\., dose strategy refinement\), multi\-omics fusion, and causal priors for counterfactual reasoning\.

#### Ethical Statement

This work does not involve human subjects, animal experiments, or personally identifiable data\. All experiments are conducted on publicly available Cell Painting datasets\.

#### Reproducibility Statement

All code, training scripts, and instructions necessary to reproduce our results are available at the anonymized repository: https://github\.com/letitia\-zhang/CP\-Agent

#### Acknowledgements

This work is supported by the Advanced Biomedical Instrumentation Center, the Research Grants Council \(grant no\. 17125121, 14125924, 17128225, RFS2021\-7S06\), and the Innovation and Technology Commission of the Hong Kong Special Administrative Region of China \(grant no\. ITS/318/22FP, ITS/408/23FP\)\.

## References

- Anthropic \(2025\)Claude 4 sonnet\.Anthropic Model Card\.Note:Available at:[https://www\.anthropic\.com](https://www.anthropic.com/)\[Accessed: September 2025\]Cited by:[§3\.1](https://arxiv.org/html/2606.03435#S3.SS1.p1.1)\.
- R\. Averly, F\. N\. Baker, I\. A\. Watson, and X\. Ning \(2025\)Liddia: language\-based intelligent drug discovery agent\.arXiv preprint arXiv:2502\.13959\.Cited by:[§B\.3](https://arxiv.org/html/2606.03435#A2.SS3.p1.1)\.
- N\. S\. Beesabathuni, N\. A\. B\. Adia, E\. Thilakaratne, R\. Gangaraju, and P\. S\. Shah \(2025\)Image\-based temporal profiling of autophagy\-related phenotypes\.Autophagy reports4\(1\),pp\. 2484835\.Cited by:[§B\.2](https://arxiv.org/html/2606.03435#A2.SS2.p1.1)\.
- A\. P\. Bento, A\. Hersey, E\. Félix, G\. Landrum, A\. Gaulton, F\. Atkinson, L\. J\. Bellis, M\. De Veij, and A\. R\. Leach \(2020\)An open source chemical structure curation pipeline using rdkit\.Journal of Cheminformatics12\(1\),pp\. 51\.Cited by:[§2\.2](https://arxiv.org/html/2606.03435#S2.SS2.p1.6)\.
- M\. Bray, S\. Singh, H\. Han, C\. T\. Davis, B\. Borgeson, C\. Hartland, M\. Kost\-Alimova, S\. M\. Gustafsdottir, C\. C\. Gibson, and A\. E\. Carpenter \(2016\)Cell painting, a high\-content image\-based assay for morphological profiling using multiplexed fluorescent dyes\.Nature protocols11\(9\),pp\. 1757–1774\.Cited by:[§B\.1](https://arxiv.org/html/2606.03435#A2.SS1.p1.1)\.
- J\. Burgess, J\. J\. Nirschl, L\. Bravo\-Sánchez, A\. Lozano, S\. R\. Gupte, J\. G\. Galaz\-Montoya, Y\. Zhang, Y\. Su, D\. Bhowmik, Z\. Coman,et al\.\(2025\)Microvqa: a multimodal reasoning benchmark for microscopy\-based scientific research\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 19552–19564\.Cited by:[§3\.1](https://arxiv.org/html/2606.03435#S3.SS1.p1.1)\.
- P\. D\. Caie, R\. E\. Walls, A\. Ingleston\-Orme, S\. Daya, T\. Houslay, R\. Eagle, M\. E\. Roberts, and N\. O\. Carragher \(2010\)High\-content phenotypic profiling of drug response signatures across distinct cancer cells\.Molecular cancer therapeutics9\(6\),pp\. 1913–1926\.Cited by:[Appendix C](https://arxiv.org/html/2606.03435#A3.p1.1),[§2\.1](https://arxiv.org/html/2606.03435#S2.SS1.p1.1)\.
- S\. N\. Chandrasekaran, B\. A\. Cimini, A\. Goodale, L\. Miller, M\. Kost\-Alimova, N\. Jamali, J\. G\. Doench, B\. Fritchman, A\. Skepner, M\. Melanson,et al\.\(2024\)Three million images and morphological profiles of cells treated with matched chemical and genetic perturbations\.Nature Methods21\(6\),pp\. 1114–1121\.Cited by:[Appendix C](https://arxiv.org/html/2606.03435#A3.p1.1),[§2\.1](https://arxiv.org/html/2606.03435#S2.SS1.p1.1)\.
- J\. Choy, Y\. Kan, S\. Cifelli, J\. Johnson, M\. Chen, L\. Shiao, H\. Zhou, S\. Previs, Y\. Lei, R\. Johnstone,et al\.\(2021\)High\-throughput screening to identify small molecules that selectively inhibit apol1 protein level in podocytes\.SLAS DISCOVERY: Advancing the Science of Drug Discovery26\(9\),pp\. 1225–1237\.Cited by:[§B\.2](https://arxiv.org/html/2606.03435#A2.SS2.p1.1)\.
- W\. Cranston, R\. Hellon, and Y\. Townsend \(1982\)Further observations on the suppression of fever in rabbits by intracerebral action of anisomycin\.\.The Journal of Physiology322\(1\),pp\. 441–445\.Cited by:[§3\.4](https://arxiv.org/html/2606.03435#S3.SS4.p1.1)\.
- J\. O\. Cross\-Zamirski, P\. Anand, G\. Williams, E\. Mouchet, Y\. Wang, and C\. Schönlieb \(2023\)Class\-guided image\-to\-image diffusion: cell painting from brightfield images with class labels\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 3800–3809\.Cited by:[§1](https://arxiv.org/html/2606.03435#S1.p3.1)\.
- J\. D\. Ewald, K\. L\. Titterton, A\. Bäuerle, A\. Beatson, D\. A\. Boiko, Á\. A\. Cabrera, J\. Cheah, B\. A\. Cimini, B\. Gorissen, T\. Jones,et al\.\(2025\)Cell painting for cytotoxicity and mode\-of\-action analysis in primary human hepatocytes\.bioRxiv\.Cited by:[§1](https://arxiv.org/html/2606.03435#S1.p1.1)\.
- X\. Fang, W\. Xu, F\. A\. Tan, J\. Zhang, Z\. Hu, Y\. Qi, S\. Nickleach, D\. Socolinsky, S\. Sengamedu, and C\. Faloutsos \(2024\)Large language models \(llms\) on tabular data: prediction, generation, and understanding–a survey\.arXiv preprint arXiv:2402\.17944\.Cited by:[§2\.5](https://arxiv.org/html/2606.03435#S2.SS5.p3.1)\.
- M\. M\. Fay, O\. Kraus, M\. Victors, L\. Arumugam, K\. Vuggumudi, J\. Urbanik, K\. Hansen, S\. Celik, N\. Cernek, G\. Jagannathan,et al\.\(2023\)Rxrx3: phenomics map of biology\.bioRxiv,pp\. 2023–02\.Cited by:[Appendix C](https://arxiv.org/html/2606.03435#A3.p1.1),[§2\.1](https://arxiv.org/html/2606.03435#S2.SS1.p1.1)\.
- P\. Fradkin, P\. Azadi Moghadam, K\. Suri, F\. Wenkel, A\. Bashashati, M\. Sypetkowski, and D\. Beaini \(2024\)How molecules impact cells: unlocking contrastive phenomolecular retrieval\.Advances in Neural Information Processing Systems37,pp\. 110667–110701\.Cited by:[§1](https://arxiv.org/html/2606.03435#S1.p3.1)\.
- J\. Fredin Haslum, C\. Lardeau, J\. Karlsson, R\. Turkki, K\. Leuchowius, K\. Smith, and E\. Müllers \(2024\)Cell painting\-based bioactivity prediction boosts high\-throughput screening hit\-rates and compound diversity\.Nature Communications15\(1\),pp\. 3470\.Cited by:[§1](https://arxiv.org/html/2606.03435#S1.p1.1)\.
- Google DeepMind \(2025\)Gemini 2\.5 pro\.Google DeepMind\.Note:Available at:[https://deepmind\.google](https://deepmind.google/)\[Accessed: September 2025\]Cited by:[§3\.1](https://arxiv.org/html/2606.03435#S3.SS1.p1.1)\.
- L\. Harkness, X\. Chen, M\. Gillard, P\. P\. Gray, and A\. M\. Davies \(2019\)Media composition modulates human embryonic stem cell morphology and may influence preferential lineage differentiation potential\.PLoS One14\(3\),pp\. e0213678\.Cited by:[§B\.2](https://arxiv.org/html/2606.03435#A2.SS2.p1.1)\.
- Y\. He, P\. Guo, Y\. Tang, A\. Myronenko, V\. Nath, Z\. Xu,et al\.\(2024\)Vista3d: a unified segmentation foundation model for 3d medical imaging\.Cited by:[§2\.5](https://arxiv.org/html/2606.03435#S2.SS5.p3.1)\.
- H\. Hu, X\. Wang, Y\. Zhang, Q\. Chen, and Q\. Guan \(2024a\)A comprehensive survey on contrastive learning\.Neurocomputing610,pp\. 128645\.Cited by:[§B\.4](https://arxiv.org/html/2606.03435#A2.SS4.p1.1)\.
- M\. Hu, J\. Qian, S\. Pan, Y\. Li, R\. L\. Qiu, and X\. Yang \(2024b\)Advancing medical imaging with language models: featuring a spotlight on chatgpt\.Physics in Medicine & Biology69\(10\),pp\. 10TR01\.Cited by:[§1](https://arxiv.org/html/2606.03435#S1.p3.1)\.
- D\. J\. Huggins, A\. R\. Venkitaraman, and D\. R\. Spring \(2011\)Rational methods for the selection of diverse screening compounds\.ACS chemical biology6\(3\),pp\. 208–217\.Cited by:[§B\.2](https://arxiv.org/html/2606.03435#A2.SS2.p1.1)\.
- K\. M\. Kampa\-Schittenhelm, M\. C\. Heinrich, F\. Akmut, K\. H\. Rasp, B\. Illing, H\. Döhner, K\. Döhner, and M\. M\. Schittenhelm \(2013\)Cell cycle\-dependent activity of the novel dual pi3k\-mtorc1/2 inhibitor nvp\-bgt226 in acute leukemia\.Molecular cancer12\(1\),pp\. 46\.Cited by:[§3\.5](https://arxiv.org/html/2606.03435#S3.SS5.p1.1)\.
- M\. Kiwanuka, G\. Higgins, S\. Ngcobo, J\. Nagawa, D\. M\. Lang, M\. H\. Zaman, N\. H\. Davies, and T\. Franz \(2022\)Effect of paclitaxel treatment on cellular mechanics and morphology of human oesophageal squamous cell carcinoma in 2d and 3d environments\.Integrative Biology14\(6\),pp\. 137–149\.Cited by:[§3\.5](https://arxiv.org/html/2606.03435#S3.SS5.p1.1)\.
- T\. Lee, E\. H\. Cheung, K\. C\. Lee, D\. M\. Siu, M\. C\. Lo, E\. Y\. Lam, R\. Goswami, S\. Girardo, K\. Kim, F\. Reichel,et al\.\(2025\)High\-throughput multimodal optofluidic biophysical imaging cytometry\.Lab on a Chip\.Cited by:[§4](https://arxiv.org/html/2606.03435#S4.p1.1)\.
- V\. Lejal, D\. Rouquié, and O\. Taboureau \(2025\)Cell morphology and gene expression: tracking changes and complementarity across time and cell lines\.Toxicology and Applied Pharmacology504,pp\. 117530\.External Links:ISSN 0041\-008X,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.taap.2025.117530),[Link](https://www.sciencedirect.com/science/article/pii/S0041008X25003060)Cited by:[§B\.2](https://arxiv.org/html/2606.03435#A2.SS2.p1.1)\.
- A\. Lin, J\. Ye, C\. Qi, L\. Zhu, W\. Mou, W\. Gan, D\. Zeng, B\. Tang, M\. Xiao, G\. Chu,et al\.\(2025\)Bridging artificial intelligence and biological sciences: a comprehensive review of large language models in bioinformatics\.Briefings in Bioinformatics26\(4\),pp\. bbaf357\.Cited by:[§1](https://arxiv.org/html/2606.03435#S1.p3.1)\.
- F\. Liu, O\. Mailhot, I\. S\. Glenn, S\. F\. Vigneron, V\. Bassim, X\. Xu, K\. Fonseca\-Valencia, M\. S\. Smith, D\. S\. Radchenko, J\. S\. Fraser,et al\.\(2025a\)The impact of library size and scale of testing on virtual screening\.Nature chemical biology,pp\. 1–7\.Cited by:[§B\.2](https://arxiv.org/html/2606.03435#A2.SS2.p1.1)\.
- H\. Liu, S\. Chen, Y\. Zhang, and H\. Wang \(2024a\)GenoTEX: an llm agent benchmark for automated gene expression data analysis\.arXiv preprint arXiv:2406\.15341\.Cited by:[§B\.3](https://arxiv.org/html/2606.03435#A2.SS3.p1.1)\.
- T\. Liu, Y\. Xiao, X\. Luo, H\. Xu, W\. J\. Zheng, and H\. Zhao \(2024b\)Geneverse: a collection of open\-source multimodal large language models for genomic and proteomic research\.arXiv preprint arXiv:2406\.15534\.Cited by:[§1](https://arxiv.org/html/2606.03435#S1.p3.1)\.
- W\. Liu, H\. Peng, W\. Li, Y\. Zhang, J\. Guan, and S\. Zhou \(2025b\)ScI2CL: effectively integrating single\-cell multi\-omics by intra\-and inter\-omics contrastive learning\.arXiv preprint arXiv:2508\.18304\.Cited by:[§B\.4](https://arxiv.org/html/2606.03435#A2.SS4.p1.1)\.
- M\. C\. Lo, D\. M\. Siu, K\. C\. Lee, J\. S\. Wong, M\. C\. Yeung, M\. K\. Hsin, J\. C\. Ho, and K\. K\. Tsia \(2024\)Information\-distilled generative label\-free morphological profiling encodes cellular heterogeneity\.Advanced science11\(29\),pp\. 2307591\.Cited by:[§4](https://arxiv.org/html/2606.03435#S4.p1.1)\.
- A\. Lozano, J\. Nirschl, J\. Burgess, S\. R\. Gupte, Y\. Zhang, A\. Unell, and S\. Yeung\-Levy \(2024\)μ\-Bench: a vision\-language benchmark for microscopy understanding\.arXiv preprint arXiv:2407\.01791\.Cited by:[§3\.1](https://arxiv.org/html/2606.03435#S3.SS1.p1.1)\.
- M\. Y\. Lu, B\. Chen, D\. F\. Williamson, R\. J\. Chen, M\. Zhao, A\. K\. Chow, K\. Ikemura, A\. Kim, D\. Pouli, A\. Patel,et al\.\(2024\)A multimodal generative ai copilot for human pathology\.Nature634\(8033\),pp\. 466–473\.Cited by:[§B\.3](https://arxiv.org/html/2606.03435#A2.SS3.p1.1)\.
- M\. Lu, E\. Weinberger, C\. Kim, and S\. Lee \(2025\)CellCLIP–learning perturbation effects in cell painting via text\-guided contrastive learning\.arXiv preprint arXiv:2506\.06290\.Cited by:[§1](https://arxiv.org/html/2606.03435#S1.p3.1)\.
- J\. L\. Marshall, N\. Bangalore, D\. El\-Ashry, Y\. Fuxman, M\. Johnson, B\. Norris, M\. Oberst, E\. Ness, S\. Wojtowicz\-Praga, P\. Bhargava,et al\.\(2002\)Phase i study of prolonged infusion bryostatin\-1 in patients\.Cancer biology & therapy1\(4\),pp\. 409–416\.Cited by:[§3\.4](https://arxiv.org/html/2606.03435#S3.SS4.p1.1)\.
- A\. Miyajima, F\. Nishimura, D\. Natsuhara, Y\. Kiba, S\. Okamoto, M\. Nagai, T\. Yamamuro, M\. Kitamura, and T\. Shibata \(2025\)Parallel dilution microfluidic device for enabling logarithmic concentration generation in molecular diagnostics\.Lab on a Chip25\(13\),pp\. 3242–3253\.Cited by:[§B\.2](https://arxiv.org/html/2606.03435#A2.SS2.p1.1)\.
- J\. G\. Moffat, F\. Vincent, J\. A\. Lee, J\. Eder, and M\. Prunotto \(2017\)Opportunities and challenges in phenotypic drug discovery: an industry perspective\.Nature reviews Drug discovery16\(8\),pp\. 531–543\.Cited by:[§1](https://arxiv.org/html/2606.03435#S1.p1.1)\.
- Z\. Navidi, J\. Ma, E\. A\. Miglietta, L\. Liu, A\. E\. Carpenter, B\. A\. Cimini, B\. Haibe\-Kains, and B\. Wang \(2024\)Morphodiff: cellular morphology painting with diffusion models\.bioRxiv\.Cited by:[§1](https://arxiv.org/html/2606.03435#S1.p3.1)\.
- F\. Odje, D\. Meijer, E\. Von Coburg, J\. J\. van der Hooft, S\. Dunst, M\. H\. Medema, and A\. Volkamer \(2024\)Unleashing the potential of cell painting assays for compound activities and hazards prediction\.Frontiers in toxicology6,pp\. 1401036\.Cited by:[§B\.1](https://arxiv.org/html/2606.03435#A2.SS1.p1.1)\.
- OpenAI \(2025\)GPT\-5\.OpenAI Blog\.Note:Available at:[https://openai\.com](https://openai.com/)\[Accessed: September 2025\]Cited by:[§3\.1](https://arxiv.org/html/2606.03435#S3.SS1.p1.1)\.
- A\. Palma, F\. J\. Theis, and M\. Lotfollahi \(2025\)Predicting cell morphological responses to perturbations using generative modeling\.Nature Communications16\(1\),pp\. 505\.Cited by:[§1](https://arxiv.org/html/2606.03435#S1.p3.1)\.
- A\. Sanchez\-Fernandez, E\. Rumetshofer, S\. Hochreiter, and G\. Klambauer \(2023\)CLOOME: contrastive learning unlocks bioimaging databases for queries with chemical structures\.Nature Communications14\(1\),pp\. 7339\.Cited by:[§1](https://arxiv.org/html/2606.03435#S1.p3.1)\.
- S\. Seal, M\. Trapotsi, O\. Spjuth, S\. Singh, J\. Carreras\-Puigvert, N\. Greene, A\. Bender, and A\. E\. Carpenter \(2024\)A decade in a systematic review: the evolution and impact of cell painting\.bioRxiv\.Cited by:[§1](https://arxiv.org/html/2606.03435#S1.p2.1)\.
- R\. Singh, S\. Sledzieski, B\. Bryson, L\. Cowen, and B\. Berger \(2023\)Contrastive learning in protein language space predicts interactions between drugs and protein targets\.Proceedings of the National Academy of Sciences120\(24\),pp\. e2220778120\.Cited by:[§B\.4](https://arxiv.org/html/2606.03435#A2.SS4.p1.1)\.
- D\. M\. Siu, K\. C\. Lee, B\. M\. Chung, J\. S\. Wong, G\. Zheng, and K\. K\. Tsia \(2023\)Optofluidic imaging meets deep learning: from merging to emerging\.Lab on a Chip23\(5\),pp\. 1011–1033\.Cited by:[§4](https://arxiv.org/html/2606.03435#S4.p1.1)\.
- H\. Su, W\. Long, and Y\. Zhang \(2025\)BioMaster: multi\-agent system for automated bioinformatics analysis workflow\.bioRxiv,pp\. 2025–01\.Cited by:[§B\.3](https://arxiv.org/html/2606.03435#A2.SS3.p2.1)\.
- G\. Tian, P\. J\. Harrison, A\. P\. Sreenivasan, J\. Carreras\-Puigvert, and O\. Spjuth \(2023\)Combining molecular and cell painting image data for mechanism of action prediction\.Artificial Intelligence in the Life Sciences3,pp\. 100060\.Cited by:[§1](https://arxiv.org/html/2606.03435#S1.p1.1)\.
- M\. Trapotsi, E\. Mouchet, G\. Williams, T\. Monteverde, K\. Juhani, R\. Turkki, F\. Miljkovic, A\. Martinsson, L\. Mervin, K\. R\. Pryde,et al\.\(2022\)Cell morphological profiling enables high\-throughput screening for proteolysis targeting chimera \(protac\) phenotypic signature\.ACS Chemical Biology17\(7\),pp\. 1733–1744\.Cited by:[§1](https://arxiv.org/html/2606.03435#S1.p2.1)\.
- F\. Vincent, P\. M\. Loria, A\. D\. Weston, C\. M\. Steppan, R\. Doyonnas, Y\. Wang, K\. L\. Rockwell, and M\. Peakman \(2020\)Hit triage and validation in phenotypic screening: considerations and strategies\.Cell Chemical Biology27\(11\),pp\. 1332–1346\.Cited by:[§1](https://arxiv.org/html/2606.03435#S1.p1.1)\.
- H\. Wang, Y\. He, P\. P\. Coelho, M\. Bucci, A\. Nazir, B\. Chen, L\. Trinh, S\. Zhang, K\. Huang, V\. Chandrasekar,et al\., J\. Leskovec, and A\. Regev \(2024\)Metric mirages in cell embeddings\.bioRxiv,pp\. 2024–04\.Cited by:[§B\.3](https://arxiv.org/html/2606.03435#A2.SS3.p1.1)\.
- A\. Waqas, A\. Khan, Z\. G\. Ozturk, D\. Saeed\-Vafa, W\. Chen, J\. Dhillon, A\. Bychkov, M\. M\. Bui, E\. Ullah, F\. Khalil,et al\.\(2025\)Reasoning beyond accuracy: expert evaluation of large language models in diagnostic pathology\.medRxiv\.Cited by:[§3\.5](https://arxiv.org/html/2606.03435#S3.SS5.p2.1)\.
- S\. Watkins, J\. C\. Toliver, N\. Kim, S\. Whitmire, and W\. T\. Garvey \(2022\)Economic outcomes of antiobesity medication use among adults in the united states: a retrospective cohort study\.Journal of managed care & specialty pharmacy28\(10\),pp\. 1066–1079\.Cited by:[§2\.3](https://arxiv.org/html/2606.03435#S2.SS3.p4.3)\.
- R\. Winter, F\. Montanari, F\. Noé, and D\. Clevert \(2019\)Learning continuous and data\-driven molecular descriptors by translating equivalent chemical representations\.Chemical science10\(6\),pp\. 1692–1701\.Cited by:[§2\.2](https://arxiv.org/html/2606.03435#S2.SS2.p1.6)\.
- T\. Wu, P\. Zhan, W\. Chen, M\. Lin, Q\. Qiu, Y\. Hu, J\. Song, and X\. Lin \(2025\)ChemBERTa embeddings and ensemble learning for prediction of density and melting point of deep eutectic solvents with hybrid features\.Computers & Chemical Engineering196,pp\. 109065\.Cited by:[§2\.2](https://arxiv.org/html/2606.03435#S2.SS2.p1.6)\.
- xAI \(2025\)Grok 4\.xAI Documentation\.Note:Available at:[https://x\.ai](https://x.ai/)\[Accessed: September 2025\]Cited by:[§3\.1](https://arxiv.org/html/2606.03435#S3.SS1.p1.1)\.
- G\. Xu, X\. Li, Y\. Chen, Y\. Duan, S\. Wu, A\. Yu, C\. Chiu, J\. Ni, N\. Tang, T\. J\. Li,et al\.\(2025\)A comprehensive survey of agentic ai in healthcare\.Cited by:[§2\.5](https://arxiv.org/html/2606.03435#S2.SS5.p1.1)\.
- Y\. Yang, A\. Jerger, S\. Feng, Z\. Wang, C\. Brasfield, M\. S\. Cheung, J\. Zucker, and Q\. Guan \(2024\)Improved enzyme functional annotation prediction using contrastive learning with structural inference\.Communications Biology7\(1\),pp\. 1690\.Cited by:[§B\.4](https://arxiv.org/html/2606.03435#A2.SS4.p1.1)\.
- L\. Yiyao, N\. Vakharia, W\. Liang, A\. T\. Mayer, R\. Luo, A\. E\. Trevino, and Z\. Wu \(2025\)OmicsNavigator: an llm\-driven multi\-agent system for autonomous zero\-shot biological analysis in spatial omics\.bioRxiv,pp\. 2025–07\.Cited by:[§B\.3](https://arxiv.org/html/2606.03435#A2.SS3.p2.1)\.
- X\. Zhai, B\. Mustafa, A\. Kolesnikov, and L\. Beyer \(2023\)Sigmoid loss for language image pre\-training\.InProceedings of the IEEE/CVF international conference on computer vision,pp\. 11975–11986\.Cited by:[§3\.1](https://arxiv.org/html/2606.03435#S3.SS1.p2.1)\.
- D\. Zhang, Y\. Yu, J\. Dong, C\. Li, D\. Su, C\. Chu, and D\. Yu \(2024a\)Mm\-llms: recent advances in multimodal large language models\.arXiv preprint arXiv:2401\.13601\.Cited by:[§1](https://arxiv.org/html/2606.03435#S1.p3.1)\.
- S\. Zhang, G\. Dai, T\. Huang, and J\. Chen \(2024b\)Multimodal large language models for bioimage analysis\.Nature Methods21\(8\),pp\. 1390–1393\.Cited by:[§1](https://arxiv.org/html/2606.03435#S1.p3.1)\.
- Z\. Zhang, K\. C\. Lee, D\. M\. Siu, M\. C\. Lo, Q\. T\. Lai, E\. Y\. Lam, and K\. K\. Tsia \(2023\)Morphological profiling by high\-throughput single\-cell biophysical fractometry\.Communications biology6\(1\),pp\. 449\.Cited by:[§4](https://arxiv.org/html/2606.03435#S4.p1.1)\.
- M\. A\. Zietek, A\. Lohith, D\. Terciano, B\. M\. Rabbitts, A\. Khadilkar, J\. B\. MacMillan, and R\. S\. Lokey \(2025\)Cell painting in activated cells illuminates phenotypic dark space and uncovers novel drug mechanisms of action\.bioRxiv\.Cited by:[§3\.5](https://arxiv.org/html/2606.03435#S3.SS5.p1.1)\.

## Appendix A Use of Large Language Models \(LLMs\)

We used large language models \(e\.g\., GPT\-4\) for non\-substantive assistance during manuscript preparation\. Specifically, LLMs were used to improve writing clarity, grammar, and phrasing, but not for generating scientific content or experimental design\. All technical contributions, experiments, and interpretations were conceived and conducted by the authors\.

The authors take full responsibility for the content of the manuscript, including any text generated or polished by the LLM\. We have ensured that the LLM\-generated text adheres to ethical guidelines and does not contribute to plagiarism or scientific misconduct\.

## Appendix B Preliminaries and Background

### B\.1 High\-content imaging

High\-content imaging \(HCI\) leverages automated microscopy and quantitative morphology to profile compound effects\. Cell Painting stains multiple cellular components and extracts hundreds of single\-cell features, producing high\-dimensional representations that enable cross\-perturbation comparisons, including compound clustering, target and pathway inference, and prediction of unannotated mechanisms \(Bray et al\., [2016](https://arxiv.org/html/2606.03435#bib.bib34); Odje et al\., [2024](https://arxiv.org/html/2606.03435#bib.bib35)\)\.

### B\.2 Multidimensional Experimental Design in Cell Painting Assays

Drug screening with Cell Painting involves diverse experimental factors that strongly shape cell morphology \(see Appendix [B\.1](https://arxiv.org/html/2606.03435#A2.SS1) for an overview of high\-content imaging\)\. Key sources of variability include the cell line \(Lejal et al\., [2025](https://arxiv.org/html/2606.03435#bib.bib1)\), culture medium \(Harkness et al\., [2019](https://arxiv.org/html/2606.03435#bib.bib2)\), incubation environment, and drug administration, each capable of inducing substantial morphological shifts\. Drug libraries typically contain hundreds of thousands of molecules \(Huggins et al\., [2011](https://arxiv.org/html/2606.03435#bib.bib3); Liu et al\., [2025a](https://arxiv.org/html/2606.03435#bib.bib4)\), with concentrations sampled using half\-logarithmic dilution series to capture dose–response characteristics across orders of magnitude \(Choy et al\., [2021](https://arxiv.org/html/2606.03435#bib.bib6); Miyajima et al\., [2025](https://arxiv.org/html/2606.03435#bib.bib5)\)\. Temporal staging further increases complexity, as different observation time points can capture different phases of the treatment response, revealing both immediate and progressive morphological changes \(Beesabathuni et al\., [2025](https://arxiv.org/html/2606.03435#bib.bib7); Lejal et al\., [2025](https://arxiv.org/html/2606.03435#bib.bib1)\)\. The interplay of these experimental variables defines a high\-dimensional space in which combinations of conditions yield diverse morphological phenotypes\.

### B\.3 MLLM Agents for Bioinformatics

Large language models \(LLMs\) are demonstrating growing potential across diverse domains of bioinformatics, with applications ranging from gene expression analysis \(Liu et al\., [2024a](https://arxiv.org/html/2606.03435#bib.bib58)\) and drug discovery \(Averly et al\., [2025](https://arxiv.org/html/2606.03435#bib.bib59)\) to pathology image interpretation \(Lu et al\., [2024](https://arxiv.org/html/2606.03435#bib.bib60)\), spatial transcriptomics \(Wang et al\., [2024](https://arxiv.org/html/2606.03435#bib.bib61)\), and gene perturbation studies\. Because datasets in these fields are often high\-dimensional, recent efforts have increasingly turned to multimodal large language models \(MLLMs\), which integrate visual features from images with prior textual knowledge\. Leveraging logical inference strategies such as deduction, induction, abduction, and analogy, MLLMs can support existing pipelines and facilitate novel scientific insights\.

More recently, an emerging paradigm has focused on deploying MLLMs as autonomous or semi\-autonomous agents to execute complex bioinformatics workflows \(Yiyao et al\., [2025](https://arxiv.org/html/2606.03435#bib.bib62); Su et al\., [2025](https://arxiv.org/html/2606.03435#bib.bib63)\)\. Such agents integrate heterogeneous tools and interact through natural language, enabling biological data analysis guided by human instructions\. While early studies highlight the promise of MLLM\-driven agents in augmenting traditional pipelines, their scope has largely been limited to direct perception and recognition tasks\. They remain insufficient for deeper understanding of complex biological processes and for generating novel hypotheses\. To address this gap, we introduce CP\-Agent, a multi\-agent system that extends beyond the visual capacities of current state\-of\-the\-art MLLMs to capture subtle pharmacological features, provide interpretable analysis, and facilitate hypothesis generation in pharmacological research\.

### B\.4 Contrastive Learning

Contrastive learning is a self\-supervised paradigm that learns representations by pulling semantically related pairs closer and pushing unrelated pairs apart in a shared embedding space \(Hu et al\., [2024a](https://arxiv.org/html/2606.03435#bib.bib11)\)\. In biology, contrastive learning has underpinned several applications, such as single\-cell multi\-omics integration \(scRNA\-seq and scATAC\-seq\) \(Liu et al\., [2025b](https://arxiv.org/html/2606.03435#bib.bib12)\), protein function prediction for classifying enzyme activities \(Yang et al\., [2024](https://arxiv.org/html/2606.03435#bib.bib13)\), and drug\-target interaction prediction through protein\-compound embedding \(Singh et al\., [2023](https://arxiv.org/html/2606.03435#bib.bib14)\)\. CLIP exemplifies the dual\-encoder contrastive paradigm for multi\-modal learning: it trains an image encoder and a text encoder so that matched image–text pairs have high cosine similarity while mismatched pairs are pushed apart\. By scaling to large image–text datasets, CLIP can produce transferable embeddings that generalize across tasks\.

## Appendix C Dataset backgrounds

BBBC021 profiles MCF‑7 cells treated with 38 reference drugs covering 12 mechanisms of action, imaged across up to eight half‑log doses and three channels \(DNA, β‑tubulin, actin\) \(Caie et al\., [2010](https://arxiv.org/html/2606.03435#bib.bib15)\)\. CPJUMP1 includes 301 small molecules \(46 controls\) perturbed in U2OS and A549 cells, imaged in five channels \(DNA; mitochondria; actin/Golgi/plasma membrane; nucleoli and cytoplasmic RNA; endoplasmic reticulum\) \(Chandrasekaran et al\., [2024](https://arxiv.org/html/2606.03435#bib.bib16)\)\. RxRx3 assays HUVECs with 1,674 bioactive compounds across eight concentrations and six fluorescence channels to capture dose–response phenotypes \(Fay et al\., [2023](https://arxiv.org/html/2606.03435#bib.bib17)\)\.

## Appendix D Detailed RDKit2D Feature Overview

Table 4: Categorized RDKit2D Descriptors Used in This Study \(174 descriptors\)

| Feature Category | Descriptors |
| --- | --- |
| Topological and Complexity Descriptors | BalabanJ, BertzCT, Chi0, Chi0n, Chi0v, Chi1, Chi1n, Chi1v, Chi2n, Chi2v, Chi3n, Chi3v, Chi4n, Chi4v, Ipc, Kappa1, Kappa2, Kappa3 |
| Basic Physicochemical Properties | MolWt, ExactMolWt, HeavyAtomMolWt, MolLogP, MolMR, LabuteASA, TPSA |
| Atom and Bond Counts | HeavyAtomCount, NumValenceElectrons, NumRotatableBonds, NumHAcceptors, NumHDonors, NHOHCount, NOCount, NumHeteroatoms, FractionCSP3 |
| Ring Structure Descriptors | RingCount, NumAromaticRings, NumSaturatedRings, NumAliphaticRings, NumAromaticCarbocycles, NumAromaticHeterocycles, NumSaturatedCarbocycles, NumSaturatedHeterocycles, NumAliphaticCarbocycles, NumAliphaticHeterocycles |
| Electrotopological State \(EState\) Descriptors | MaxEStateIndex, MinEStateIndex, MaxAbsEStateIndex, MinAbsEStateIndex |
| VSA \(Van der Waals Surface Area\) Descriptors | EState\_VSA1–11, PEOE\_VSA1–14, SMR\_VSA1–10, SlogP\_VSA1–12, VSA\_EState1–10 |
| Fingerprint Density Descriptors | FpDensityMorgan1, FpDensityMorgan2, FpDensityMorgan3 |
| Fragment\-Based Functional Group Descriptors | fr\_Al\_COO, fr\_Al\_OH, fr\_Al\_OH\_noTert, fr\_ArN, fr\_Ar\_COO, fr\_Ar\_N, fr\_Ar\_NH, fr\_Ar\_OH, fr\_COO, fr\_COO2, fr\_C\_O, fr\_C\_O\_noCOO, fr\_HOCCN, fr\_Imine, fr\_NH0, fr\_NH1, fr\_NH2, fr\_Ndealkylation1, fr\_Ndealkylation2, fr\_Nhpyrrole, fr\_SH, fr\_aldehyde, fr\_alkyl\_carbamate, fr\_alkyl\_halide, fr\_allylic\_oxid, fr\_amide, fr\_amidine, fr\_aniline, fr\_aryl\_methyl, fr\_azo, fr\_benzene, fr\_bicyclic, fr\_dihydropyridine, fr\_epoxide, fr\_ester, fr\_ether, fr\_furan, fr\_halogen, fr\_hdrzine, fr\_imidazole, fr\_imide, fr\_ketone, fr\_ketone\_Topliss, fr\_lactone, fr\_methoxy, fr\_morpholine, fr\_nitrile, fr\_nitro, fr\_nitro\_arom, fr\_nitro\_arom\_nonortho, fr\_para\_hydroxylation, fr\_phenol, fr\_phenol\_noOrthoHbond, fr\_phos\_acid, fr\_phos\_ester, fr\_piperdine, fr\_piperzine, fr\_priamide, fr\_pyridine, fr\_sulfide, fr\_sulfonamd, fr\_sulfone, fr\_thiazole, fr\_thiophene, fr\_unbrch\_alkane, fr\_urea |
| Drug\-Likeness Score | qed |

## Appendix E Log\-Dose Indexing for Serial Dilution

To represent compound concentrations on a consistent and model\-friendly scale, we transform raw concentrations into log\-scaled step values\. This transformation is based on the assumption that concentrations follow a serial dilution protocol in logarithmic space\.

Let $C_{\max} \in \mathbb{R}_{>0}$ denote the nominal maximum concentration for a compound, and let $C \in \mathbb{R}_{>0}$ be any intermediate concentration point\. In a standard protocol with logarithmic dilution spacing, each dose is reduced by a fixed factor per step\. This can be expressed as:

$$C_{k} = C_{\max} \cdot 10^{-k \cdot \Delta\log}, \quad k = 0, 1, 2, \ldots \tag{5}$$

where $\Delta\log > 0$ is the logarithmic step size \(in base 10\)\. For example, $\Delta\log = 0.5$ corresponds to a 3\.16\-fold dilution between adjacent doses, since $10^{-0.5} \approx 0.3162$\.

To recover the step index $s(C)$ corresponding to any concentration $C$, we invert the above relation:

$$\begin{aligned} C &= C_{\max} \cdot 10^{-s(C) \cdot \Delta\log} \\ \Rightarrow \log_{10}(C) &= \log_{10}\left(C_{\max}\right) - s(C) \cdot \Delta\log \\ \Rightarrow s(C) &= \frac{\log_{10}\left(C_{\max}\right) - \log_{10}(C)}{\Delta\log} \end{aligned} \tag{6}$$

Thus, the log\-scaled step transformation is defined as:

$$s(C) := \frac{\log_{10}\left(C_{\max}\right) - \log_{10}(C)}{\Delta\log}, \quad \Delta\log = 0.5 \tag{7}$$

This representation maps concentrations to a normalized step index in log space, which is more suitable for modeling, especially in contexts where concentration\-response relationships are approximately log\-linear\.
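As a minimal sketch of the transformation above, the following Python snippet generates a half-log dilution series and maps each concentration back to its step index. The function names `dilution_series` and `dose_step_index` and the example value of `c_max` are illustrative assumptions, not part of the original pipeline.

```python
import math

def dilution_series(c_max, n_steps, delta_log=0.5):
    """Return C_k = C_max * 10**(-k * delta_log) for k = 0..n_steps-1 (Eq. 5)."""
    return [c_max * 10 ** (-k * delta_log) for k in range(n_steps)]

def dose_step_index(conc, c_max, delta_log=0.5):
    """Return s(C) = (log10(C_max) - log10(C)) / delta_log (Eq. 7)."""
    if conc <= 0 or c_max <= 0:
        raise ValueError("concentrations must be positive")
    return (math.log10(c_max) - math.log10(conc)) / delta_log

# Hypothetical example: a four-point half-log series starting at 10 (arbitrary units).
series = dilution_series(c_max=10.0, n_steps=4)
steps = [dose_step_index(c, c_max=10.0) for c in series]
```

Under these assumptions, `steps` recovers the integer indices 0, 1, 2, 3 up to floating-point error, and a concentration that falls between two nominal doses maps to a fractional step.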

## Appendix F Context\-Aware Token Projection Modules

Algorithm 1: CP\-CLIP: Context\-Aware Token Projection Modules

1. function EncodeImage\($x_{\text{img}}$\)
2. $f_{\text{img}} \leftarrow V(x_{\text{img}})$
3. return $\text{normalize}(f_{\text{img}})$
4. end function
5. function EncodeText\($x_{\text{txt}}, c, t, e$\)
6. $X \leftarrow \text{TokenEmbedding}(x_{\text{txt}})$
7. if \<CONC\> in $x_{\text{txt}}$ then
8. $X[\texttt{<CONC>}] \leftarrow \text{conc\_mlp}(c)$ ▷ $c \in \mathbb{R}^{2}$, $\text{conc\_mlp}: \mathbb{R}^{2} \rightarrow \mathbb{R}^{d_{h}} \rightarrow \mathbb{R}^{d}$
9. end if
10. if \<TIME\> in $x_{\text{txt}}$ then
11. $X[\texttt{<TIME>}] \leftarrow \text{time\_mlp}(t)$ ▷ $t \in \mathbb{R}^{1}$, $\text{time\_mlp}: \mathbb{R}^{1} \rightarrow \mathbb{R}^{d_{h}} \rightarrow \mathbb{R}^{d}$
12. end if
13. if \<CMPD\> in $x_{\text{txt}}$ then
14. $X[\texttt{<CMPD>}] \leftarrow \text{compound\_mlp}(e)$ ▷ $e \in \mathbb{R}^{d_{\text{cmp}}}$, $\text{compound\_mlp}: \mathbb{R}^{d_{\text{cmp}}} \rightarrow \mathbb{R}^{d_{h}} \rightarrow \mathbb{R}^{d}$
15. end if
16. $X \leftarrow X + \text{PosEmb}(X)$
17. $f_{\text{txt}} \leftarrow T(X)$
18. return $\text{normalize}(f_{\text{txt}})$
19. end function
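The placeholder-replacement step of Algorithm 1 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names (`make_mlp`, `project_context_tokens`, `l2_normalize`), the ReLU activation, the random weights, and the toy dimensions are all hypothetical, and the positional embedding and text transformer $T$ are omitted.

```python
import math
import random

def make_mlp(d_in, d_hidden, d_out, rng):
    """Hypothetical two-layer MLP (d_in -> d_hidden -> d_out) with random weights."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_hidden)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_out)]

    def forward(x):
        # ReLU hidden layer; the activation is an assumption, the paper does not state it.
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
        return [sum(w * hi for w, hi in zip(row, h)) for row in w2]

    return forward

def project_context_tokens(tokens, embeddings, context, mlps):
    """Overwrite the embedding of each placeholder token with its MLP projection."""
    out = [list(vec) for vec in embeddings]
    for pos, tok in enumerate(tokens):
        if tok in mlps:
            out[pos] = mlps[tok](context[tok])
    return out

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

rng = random.Random(0)
d = 8  # toy embedding width
mlps = {
    "<CONC>": make_mlp(2, 16, d, rng),   # c in R^2
    "<TIME>": make_mlp(1, 16, d, rng),   # t in R^1
    "<CMPD>": make_mlp(5, 16, d, rng),   # e in R^{d_cmp}, here d_cmp = 5
}
tokens = ["cell", "<CMPD>", "at", "<CONC>", "for", "<TIME>"]
embeddings = [[rng.gauss(0, 1) for _ in range(d)] for _ in tokens]
context = {"<CONC>": [1.0, 0.5], "<TIME>": [24.0], "<CMPD>": [0.1, 0.2, 0.3, 0.4, 0.5]}
projected = project_context_tokens(tokens, embeddings, context, mlps)
```

Ordinary word tokens keep their looked-up embeddings, while each placeholder position carries a continuous projection of the numeric metadata, so the downstream text encoder sees a sequence of uniform width $d$.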

## Appendix G Training losses

We train the alignment with a symmetric CLIP\-style contrastive objective\. Specifically, we employ the InfoNCE loss, which encourages matched image\-text pairs to have high similarity while contrasting them against all other mismatched pairs in the batch:

$$\mathcal{L}_{\text{InfoNCE}} = \frac{1}{2N} \sum_{k=1}^{N} \left[ \ell_{\mathrm{CE}}\left(S_{i \rightarrow t}^{(k,:)}, y_{k}\right) + \ell_{\mathrm{CE}}\left(S_{t \rightarrow i}^{(k,:)}, y_{k}\right) \right] \tag{8}$$
Here, $F_{i} = [f_{i}^{(1)}, \ldots, f_{i}^{(N)}]^{\top} \in \mathbb{R}^{N \times d}$ and $F_{t} = [f_{t}^{(1)}, \ldots, f_{t}^{(N)}]^{\top} \in \mathbb{R}^{N \times d}$ are the batches of normalized image and text embeddings\. The similarity matrices are computed as $S_{i \rightarrow t} = s \cdot F_{i} F_{t}^{\top} \in \mathbb{R}^{N \times N}$\. The ground\-truth labels $y_{k} \in \{0, 1, \ldots, N-1\}$ indicate the correct matching pair for each sample in the batch\. $\ell_{\text{CE}}(\cdot, \cdot)$ denotes the standard cross\-entropy between the similarity scores and the target labels\.

In our experiments, we additionally compare the InfoNCE loss with an alternative loss recently proposed in SigLIP, which simplifies the contrastive objective by operating directly on joint embeddings in a shared representation space:

$$\mathcal{L}_{\text{SigLIP}} = \frac{1}{N} \sum_{k=1}^{N} \sum_{j=1}^{N} -\log \sigma\left(y_{kj} \cdot s \cdot \left\langle f_{i}^{(k)}, f_{t}^{(j)} \right\rangle\right) \tag{9}$$
Here, $s$ is a learnable temperature parameter\. To isolate the effect of the loss function from the model architecture, we apply both loss types within our CP\-CLIP framework for a fair comparison\.
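To make the two objectives concrete, here is a small dependency-free Python sketch of Eq. 8 and Eq. 9. It assumes that the embeddings are already L2-normalized, that pair $k$ in the batch matches pair $k$, that $S_{t \rightarrow i}$ is the transpose of $S_{i \rightarrow t}$, and that $y_{kj} = +1$ for matched pairs and $-1$ otherwise; the last two are assumptions not spelled out in the text. The original SigLIP formulation also includes a learnable bias, which is omitted here because Eq. 9 does not show one. The function names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against an integer target index."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def info_nce(img, txt, scale):
    """Symmetric InfoNCE (Eq. 8); pair k in the batch is the positive for row k."""
    n = len(img)
    s_it = [[scale * dot(fi, ft) for ft in txt] for fi in img]
    s_ti = [list(col) for col in zip(*s_it)]  # transpose: text-to-image logits
    total = sum(cross_entropy(s_it[k], k) + cross_entropy(s_ti[k], k) for k in range(n))
    return total / (2 * n)

def siglip(img, txt, scale):
    """Pairwise sigmoid loss (Eq. 9) with y_kj = +1 on the diagonal, -1 elsewhere."""
    n = len(img)
    total = 0.0
    for k in range(n):
        for j in range(n):
            y = 1.0 if k == j else -1.0
            z = y * scale * dot(img[k], txt[j])
            # -log(sigmoid(z)) computed as a numerically stable softplus(-z)
            total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / n

# Toy batch: two orthogonal unit vectors, perfectly aligned vs. swapped captions.
img = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(img, aligned, scale=10.0)
loss_swapped = info_nce(img, swapped, scale=10.0)
```

In this toy batch both losses are lower for the aligned captions than for the swapped ones, which is the behaviour the objectives are designed to reward.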

## Appendix H CellProfiler pipeline

For all DNA channels, we extracted per\-cell features using the workflow described in Table [5](https://arxiv.org/html/2606.03435#A8.T5)\. This pipeline is specifically optimized for nuclear segmentation and feature extraction, using modules that measure grayscale features such as shape, texture, and granularity\. These features are particularly suitable for DNA stains\. For all non\-DNA channels \(such as Actin, Tubulin, etc\.\), we applied a consistent pipeline template described in Table [6](https://arxiv.org/html/2606.03435#A8.T6)\. This workflow is tailored to cytoplasmic or filamentous structures, which differ from nuclei in spatial organization and image characteristics\.

Some feature modules differ between the two workflows, particularly in how certain parameters are configured\. For example, texture features were computed at different spatial scales: for DNA, we used smaller scales \(e\.g\., 3, 5, 7\) to capture fine\-grained nuclear texture, while for non\-DNA channels, larger scales \(e\.g\., 5, 10, 15\) were used to capture broader cytoskeletal patterns\. Similarly, granularity features and shape descriptors such as Zernike moments were customized to reflect the typical size and morphology of structures in each channel\. These differences in pipeline configuration ensure that the measurements are biologically meaningful and adapted to the unique characteristics of each fluorescence channel\.

Table 5: CellProfiler pipeline modules and measured features for DNA channel\.

| Module | Key Settings / Notes | Measured Features |
| --- | --- | --- |
| 1\. Images | Load images; filter by: isimage, exclude folders with regex | — |
| 2\. Metadata | Extract metadata from filename and folder using regex patterns | Plate, Well, Site, ChannelNumber, Date |
| 3\. NamesAndTypes | Assign names: DNA \(grayscale\), nuclei\_mask \(objects\); match rules: file contains "DNA", file contains "nuclei" | Image names: DNA, mask; Object names: nuclei, Nucleus |
| 4\. Groups | Grouping disabled | — |
| 5\. MeasureImageAreaOccupied | Measure area of nuclei objects | AreaOccupied\_nuclei |
| 6\. MeasureObjectNeighbors | Measure neighbors of nuclei within 10 pixels | Neighbors\_10px\_Count, Neighbors\_10px\_PercentTouching |
| 7\. MeasureObjectNeighbors | Measure neighbors of nuclei within 50 pixels | Neighbors\_50px\_Count, Neighbors\_50px\_PercentTouching |
| 8\. MeasureObjectSizeShape | Measure nuclei; include Zernike moments and advanced features | Shape: Area, Perimeter, Solidity, FormFactor, etc\.; Zernike: Zernike\_0\_0 to Zernike\_9\_9 |
| 9\. MeasureTexture | Texture of DNA in nuclei; scales: 3, 5, 7; levels: 256; mode: both image and object | Texture features per scale: Contrast, Entropy, Correlation, etc\. |
| 10\. MeasureGranularity | Granularity of DNA in nuclei; radius = 8, spectrum range = 4 | Granularity\_1\-\-4\_DNA\_in\_nuclei |
| 11\. ExportToSpreadsheet | Export all features with metadata; output file: DATA\.csv with prefix Expt\_ | All per\-object and per\-image features above, including per\-image mean/median/std |

Table 6: CellProfiler pipeline modules and measured features for Actin channel\.

| Module | Key Settings / Notes | Measured Features |
| --- | --- | --- |
| 1\. Images | Load images; filter by: isimage, exclude folders with regex | — |
| 2\. Metadata | Extract metadata from filename and folder using regex patterns | Plate, Well, Site, ChannelNumber, Date |
| 3\. NamesAndTypes | Assign names: Actin \(grayscale\), cell\_mask \(objects\); match rules: file contains "Actin", file contains "cell" | Image names: DNA, mask; Object names: nuclei, Nucleus |
| 4\. Groups | Grouping disabled | — |
| 5\. MeasureImageAreaOccupied | Measure area of cell objects | AreaOccupied\_Cell |
| 6\. MeasureObjectNeighbors | Measure neighbors of cell within 10 pixels | Neighbors\_10px\_Count, Neighbors\_10px\_PercentTouching |
| 7\. MeasureObjectNeighbors | Measure neighbors of cell within 50 pixels | Neighbors\_50px\_Count, Neighbors\_50px\_PercentTouching |
| 8\. MeasureObjectSizeShape | Measure cell; include Zernike moments and advanced features | Shape: Area, Perimeter, Solidity, FormFactor, MaxFeretDiameter, EquivalentDiameter, etc\.; Zernike: Zernike\_0\_0 to Zernike\_9\_9 |
| 9\. MeasureTexture | Texture of Actin in cell; scales: 3, 5, 7; levels: 256 | Texture features per scale: Contrast, Correlation, Entropy, SumEntropy, DifferenceEntropy, InfoMeas1, InfoMeas2 |
| 10\. MeasureGranularity | Granularity of Actin in cell; radius = 8, spectrum range = 4 | Granularity\_1\-\-4\_Actin\_in\_cell |
| 11\. ExportToSpreadsheet | Export all features with metadata; output file: DATA\.csv | All per\-object and per\-image features above, including per\-image mean/median/std |

## Appendix I Similarity Performance on Seen Drug Compounds

Table 7: Similarity Performance on Seen Drug Compounds

| Model | Flindokalner | Racecadotril | AZM475271 | Misoprostol | Trazodone | Orantinib | Rufinamide | lumiracoxib | BIRB\-796 | Methoxsalen |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CLIP ViT\-B/16 | $0.486 \pm 0.049$ | $0.528 \pm 0.009$ | $0.496 \pm 0.032$ | $0.437 \pm 0.051$ | $0.499 \pm 0.036$ | $0.427 \pm 0.044$ | $0.500 \pm 0.030$ | $0.433 \pm 0.042$ | $0.422 \pm 0.041$ | $0.440 \pm 0.036$ |
| SigLIP ViT\-B/16 | $0.308 \pm 0.088$ | $0.323 \pm 0.075$ | $0.209 \pm 0.080$ | $0.329 \pm 0.077$ | $0.214 \pm 0.074$ | $0.322 \pm 0.083$ | $0.222 \pm 0.068$ | $0.211 \pm 0.063$ | $0.2407 \pm 0.086$ | $0.314 \pm 0.073$ |
| CP\-CLIP SigLIP\-ViT\-B/16 \(descriptor\) | $0.538 \pm 0.066$ | $0.539 \pm 0.057$ | $0.456 \pm 0.040$ | $0.531 \pm 0.052$ | $0.448 \pm 0.039$ | $0.545 \pm 0.046$ | $0.452 \pm 0.042$ | $0.448 \pm 0.040$ | $0.479 \pm 0.059$ | $0.525 \pm 0.051$ |
| CP\-CLIP ViT\-B/16 \(fingerprint\) | $0.592 \pm 0.050$ | $0.598 \pm 0.036$ | $0.510 \pm 0.045$ | $0.599 \pm 0.043$ | $0.510 \pm 0.042$ | $0.602 \pm 0.040$ | $0.510 \pm 0.036$ | $\mathbf{0.499 \pm 0.049}$ | $0.516 \pm 0.036$ | $0.581 \pm 0.051$ |
| CP\-CLIP ViT\-B/16 \(descriptor\) | $0.590 \pm 0.052$ | $0.594 \pm 0.037$ | $0.510 \pm 0.047$ | $0.595 \pm 0.047$ | $0.504 \pm 0.046$ | $0.596 \pm 0.042$ | $\mathbf{0.511 \pm 0.044}$ | $0.497 \pm 0.049$ | $\mathbf{0.525 \pm 0.031}$ | $0.573 \pm 0.057$ |
| CP\-CLIP ViT\-L/16 \(descriptor\) | $\mathbf{0.608 \pm 0.057}$ | $\mathbf{0.620 \pm 0.043}$ | $\mathbf{0.511 \pm 0.060}$ | $\mathbf{0.626 \pm 0.039}$ | $\mathbf{0.503 \pm 0.053}$ | $\mathbf{0.626 \pm 0.043}$ | $0.509 \pm 0.057$ | $0.496 \pm 0.060$ | $0.513 \pm 0.050$ | $\mathbf{0.599 \pm 0.064}$ |

Table 8: Seen drugs similarity averaged score

| CLIP ViT\-B/16 | SigLIP ViT\-B/16 | CP\-CLIP SigLIP\-ViT\-B/16 \(descriptor\) | CP\-CLIP ViT\-B/16 \(fingerprint\) | CP\-CLIP ViT\-B/16 \(descriptor\) | CP\-CLIP ViT\-L/16 \(descriptor\) |
| --- | --- | --- | --- | --- | --- |
| 0\.467 | 0\.269 | 0\.496 | 0\.552 | 0\.549 | 0\.561 |

## Appendix J VISTA\-2D fine\-tune

The original VISTA\-2D model does not consistently achieve accurate segmentation across all fluorescent channels, especially when applied to diverse Cell Painting datasets\. To address this limitation, we fine\-tuned the segmentation model on Cell Painting data\. The figures below illustrate representative instance segmentation results across different channels and datasets \(CPJUMP1, RxRx3, and BBBC021, respectively\), demonstrating improved mask quality and channel\-specific accuracy\. Three standard instance segmentation metrics are used to evaluate the fine\-tuned model’s instance mask quality on 500 test samples, with improvements shown in Table [9](https://arxiv.org/html/2606.03435#A10.T9):

- Intersection over Union \(IoU\): The IoU evaluates the overlap between a predicted instance $P$ and a ground\-truth instance label $T$:

$$\operatorname{IoU}(P, T) = \frac{|P \cap T|}{|P \cup T|} \tag{10}$$

where $|P \cap T|$ is the number of pixels in the intersection of $P$ and $T$, and $|P \cup T|$ is the number of pixels in their union\.

- Aggregated Jaccard Index \(AJI\): The AJI generalizes IoU to an entire image containing multiple instances\. It is the ratio of the total number of overlapping pixels between matched ground\-truth and prediction pairs to the total number of pixels in their union plus the pixels in all unmatched predicted instances, and can be formulated as:

$$\mathrm{AJI}(P, T) = \frac{\sum_{i=1}^{n} \left|T_{i} \cap P_{\pi(i)}\right|}{\sum_{i=1}^{n} \left|T_{i} \cup P_{\pi(i)}\right| + \sum_{j \in U} \left|P_{j}\right|} \tag{11}$$

where $\pi(i)$ is the index mapping that assigns to each ground\-truth instance $T_{i}$ its matched predicted instance, and $U$ is the set of unmatched predicted instances\.

- Panoptic Quality \(PQ\): PQ is a metric that jointly evaluates segmentation quality and recognition quality in instance segmentation\. It reflects both how accurately the matched segments overlap \(IoU\) and how well all instances are detected \(accounting for false positives and false negatives\)\. PQ rewards correct segmentations while penalizing missing or spurious predictions\. PQ can be formulated as:

$$\mathrm{PQ}(P, T) = \underbrace{\frac{1}{|\mathcal{M}|} \sum_{(p,t) \in \mathcal{M}} \operatorname{IoU}(p, t)}_{\text{Segmentation Quality (SQ)}} \times \underbrace{\frac{|\mathcal{M}|}{|\mathcal{M}| + \frac{1}{2}\left|\mathcal{P}_{\text{unmatched}}\right| + \frac{1}{2}\left|\mathcal{T}_{\text{unmatched}}\right|}}_{\text{Detection Quality (DQ)}} \tag{12}$$

where $\mathcal{M}$ is the set of matched prediction and ground\-truth pairs, $\mathcal{P}_{\text{unmatched}}$ is the set of unmatched predicted instances \(False Positives\), and $\mathcal{T}_{\text{unmatched}}$ is the set of unmatched ground\-truth instances \(False Negatives\)\.
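The IoU and PQ definitions above can be illustrated with a short Python sketch in which each instance is a set of pixel coordinates. The matching rule (a prediction matches a ground-truth instance when their IoU exceeds 0.5) is the common convention for PQ and is an assumption here, since the text does not state the threshold used; the function names are illustrative, and AJI is left out for brevity.

```python
def iou(pred, truth):
    """IoU between two instances given as sets of (row, col) pixel coordinates."""
    union = len(pred | truth)
    return len(pred & truth) / union if union else 0.0

def panoptic_quality(preds, truths, threshold=0.5):
    """PQ = SQ * DQ (Eq. 12); a pair is matched when IoU > threshold (assumed rule)."""
    matched_ious, used = [], set()
    for t in truths:
        for idx, p in enumerate(preds):
            if idx not in used and iou(p, t) > threshold:
                matched_ious.append(iou(p, t))
                used.add(idx)
                break
    tp = len(matched_ious)
    fp = len(preds) - tp    # unmatched predictions
    fn = len(truths) - tp   # unmatched ground-truth instances
    if tp == 0:
        return 0.0
    sq = sum(matched_ious) / tp
    dq = tp / (tp + 0.5 * fp + 0.5 * fn)
    return sq * dq

# Toy image: one well-segmented cell, one missed cell, one spurious prediction.
truths = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6)}]
preds = [{(0, 0), (0, 1), (1, 0)}, {(9, 9)}]
pq = panoptic_quality(preds, truths)
```

In this toy case the single matched pair has IoU 0.75, and one false positive plus one false negative give DQ = 0.5, so PQ = 0.375.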

Table 9: Instance Mask Evaluation Metrics

| VISTA\-2D | IoU | AJI | PQ |
| --- | --- | --- | --- |
| before fine tune | 0\.272 | 0\.290 | 0\.151 |
| after fine tune | 0\.824 | 0\.791 | 0\.682 |

![Refer to caption](https://arxiv.org/html/2606.03435v1/x5.png)

Figure 5: Segmentation performance comparison on CP\-JUMP dataset across different imaging channels\.

![Refer to caption](https://arxiv.org/html/2606.03435v1/x6.png)

Figure 6: Segmentation performance comparison on RXRX3 dataset across different imaging channels\.

![Refer to caption](https://arxiv.org/html/2606.03435v1/x7.png)

Figure 7: Segmentation performance comparison on BBBC021 dataset across different imaging channels\.

## Appendix K Dose response examples

To further illustrate the diversity of dose–response behaviors captured by CP\-CLIP embeddings, Figure [8](https://arxiv.org/html/2606.03435#A11.F8) shows additional examples from two datasets, BBBC021 and RxRx3, since only these two datasets include experiments designed around a dose series\. For each compound, we compute the cosine distance between image embeddings at different concentration levels, focusing on perturbation effects within individual imaging channels\.

The x\-axis denotes concentration step pairs relative to the first experimental dose\. Because different datasets use either fixed or variable half\-log concentration series, we normalize the comparisons by indexing each dose level \(e\.g\., 1 for the lowest concentration, 8 for the highest\)\. A label such as "1–2" indicates the cosine distance between embeddings at concentration step 1 and step 2\. For example, if the lowest concentration is 0\.0001 µM and a half\-log step is used, then step 1 is 0\.0001 µM, step 2 is approximately 0\.000316 µM, and step 8 is approximately 0\.316 µM\. The cosine distance is computed between embeddings $z_{i}$ and $z_{j}$ at two different doses $i$ and $j$, where

$$d_{ij} = 1 - \frac{\mathbf{z}_{i} \cdot \mathbf{z}_{j}}{\left\|\mathbf{z}_{i}\right\| \left\|\mathbf{z}_{j}\right\|} \tag{13}$$

The y\-axis reflects this cosine distance, providing a quantitative measure of morphological difference between two concentrations\. A rising trend along the x\-axis indicates increasing morphological divergence from the baseline as concentration increases, a hallmark of a dose\-dependent phenotype\. Sharp trajectories are observed for drugs such as Alsterpaullone, Camptothecin, Cisplatin, Emetine, Mitoxantrone, Acetophenazine, Buclizine, and Thiothixene, which is consistent with their known mechanisms\. In contrast, compounds such as Eszopiclone and Methsuximide produce more stable embeddings across doses, suggesting limited morphological response\. These visualizations provide additional support for the claim that CP\-CLIP embeddings can sensitively capture dose\-dependent morphological variation across diverse chemical perturbations\.
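A minimal Python sketch of how such a dose-response curve could be assembled from per-dose embeddings is given below. The function names and the toy two-dimensional embeddings are hypothetical; real CP-CLIP embeddings would be far higher-dimensional.

```python
import math

def cosine_distance(z_i, z_j):
    """d_ij = 1 - (z_i . z_j) / (||z_i|| ||z_j||), as in Eq. 13."""
    dot = sum(a * b for a, b in zip(z_i, z_j))
    norm = math.sqrt(sum(a * a for a in z_i)) * math.sqrt(sum(b * b for b in z_j))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / norm

def dose_response_curve(embeddings):
    """Distance from the step-1 embedding to each later step, keyed '1-k'."""
    base = embeddings[0]
    return {f"1-{k}": cosine_distance(base, z) for k, z in enumerate(embeddings[1:], start=2)}

# Toy embeddings for three dose steps, drifting away from the lowest dose.
curve = dose_response_curve([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
```

Here the distance grows from step pair "1-2" to "1-3", the rising pattern the text associates with a dose-dependent phenotype.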

![Refer to caption](https://arxiv.org/html/2606.03435v1/x8.png)Figure 8:Dose–response consistency across compounds in BBBC021 and RxRx3 datasets, measured by cosine distance between CP\-CLIP embeddings at different concentration step pairs\.
## Appendix L Statistical Evidence Synthesizer Equations

Table 10: Summary of statistical parameters for image

| Parameter Name | Expression | Variable Description |
| --- | --- | --- |
| n\_control | $\lvert a \rvert$ | $a$: Number of cells from the control group |
| n\_perturb | $\lvert b \rvert$ | $b$: Number of cells from the perturbation group |

Table 11: Summary of statistical parameters for each feature metric and their definitions

| Parameter Name | Expression | Variable Description |
| --- | --- | --- |
| median\_control | $\mathrm{median}(a)$ | $\mathrm{median}$: Median of $a$ |
| median\_perturb | $\mathrm{median}(b)$ | $\mathrm{median}$: Median of $b$ |
| mad\_control | $\mathrm{median}(\lvert a - \mathrm{median}(a) \rvert)$ | MAD: Median absolute deviation of $a$ |
| mad\_perturb | $\mathrm{median}(\lvert b - \mathrm{median}(b) \rvert)$ | MAD: Median absolute deviation of $b$ |
| p10\_control | $Q_{a}(0.10)$ | $Q_{a}(p)$: $p$\-th quantile of control group $a$ |
| p25\_control | $Q_{a}(0.25)$ | Same as above |
| p50\_control | $Q_{a}(0.50)$ | Same as above |
| p75\_control | $Q_{a}(0.75)$ | Same as above |
| p90\_control | $Q_{a}(0.90)$ | Same as above |
| p10\_perturb | $Q_{b}(0.10)$ | $Q_{b}(p)$: $p$\-th quantile of perturbation group $b$ |
| p25\_perturb | $Q_{b}(0.25)$ | Same as above |
| p50\_perturb | $Q_{b}(0.50)$ | Same as above |
| p75\_perturb | $Q_{b}(0.75)$ | Same as above |
| p90\_perturb | $Q_{b}(0.90)$ | Same as above |
| delta\_median | $\mathrm{median}(b) - \mathrm{median}(a)$ | Difference in medians between groups |
| bootstrap\_ci\_lower | $\mathrm{CI}_{\mathrm{low}}$ | Lower bound of bootstrap confidence interval |
| bootstrap\_ci\_upper | $\mathrm{CI}_{\mathrm{up}}$ | Upper bound of bootstrap confidence interval |
| cliffs\_delta | $d$ | $d$: Cliff’s delta effect size |
| p\_value | $p$ | $p$: Statistical significance from hypothesis test |

The lower and upper bounds of the bootstrap confidence interval, denoted as $\mathrm{CI}_{\mathrm{low}}$ and $\mathrm{CI}_{\mathrm{up}}$, estimate the confidence interval of the median difference between the control and perturbed samples using the bootstrap resampling method\. Specifically, 1000 rounds of bootstrap sampling are performed\. The bounds are computed as:

$$\mathrm{CI}_{\mathrm{low}} = \text{Percentile}_{2.5}\left(\left\{\delta_{i}^{*}\right\}\right) \tag{14}$$

$$\mathrm{CI}_{\mathrm{up}} = \text{Percentile}_{97.5}\left(\left\{\delta_{i}^{*}\right\}\right) \tag{15}$$
Here, $\delta_{i}^{*}$ denotes the median difference obtained in the $i$\-th round of bootstrap resampling, and the collection $\left\{\delta_{i}^{*}\right\}$ represents the set of median differences obtained from $N$ rounds of bootstrap resampling \($N = 1000$ in our setting\)\.
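The percentile bootstrap described above can be sketched in a few lines of Python. This is an illustration under assumptions: resampling is done with replacement within each group, the median difference follows the sign convention of `delta_median` in Table 11 (perturbation minus control), and percentiles use linear interpolation; the function name, seed, and toy data are hypothetical.

```python
import random
import statistics

def bootstrap_median_diff_ci(control, perturb, n_rounds=1000, seed=0):
    """95% percentile bootstrap CI for median(perturb) - median(control)."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_rounds):
        # Resample each group with replacement, keeping its original size.
        a_star = rng.choices(control, k=len(control))
        b_star = rng.choices(perturb, k=len(perturb))
        deltas.append(statistics.median(b_star) - statistics.median(a_star))
    deltas.sort()

    def percentile(q):
        # Linear interpolation between the two nearest order statistics.
        pos = q / 100.0 * (len(deltas) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(deltas) - 1)
        frac = pos - lo
        return deltas[lo] * (1.0 - frac) + deltas[hi] * frac

    return percentile(2.5), percentile(97.5)

# Toy feature values for a control and a perturbed well.
control = [float(v) for v in range(0, 50)]
perturb = [float(v) for v in range(10, 60)]
ci_low, ci_up = bootstrap_median_diff_ci(control, perturb)
```

An interval that excludes zero suggests that the median shift of the feature is unlikely to be a resampling artifact.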

Cliff’s delta is a nonparametric effect size that quantifies the magnitude of difference between two distributions\. It is computed as:

$$d = \frac{1}{|a||b|} \sum_{i=1}^{n_{x}} \sum_{j=1}^{n_{y}} \left[\mathbb{I}\left(x_{i} > y_{j}\right) - \mathbb{I}\left(x_{i} < y_{j}\right)\right] \tag{16}$$
where $x_{i}$ denotes the $i$\-th sample from the control group and $y_{j}$ denotes the $j$\-th sample from the perturbation \(or treatment\) group, with $n_{x} = |a|$ and $n_{y} = |b|$\. The indicator function $\mathbb{I}(\cdot)$ returns 1 if the condition inside the brackets is true, and 0 otherwise\. Cliff’s delta quantifies the degree of difference between the two groups\. Its value ranges from $-1$ to $1$, where $d = 0$ indicates no difference, $d = 1$ indicates that every control value exceeds every perturbation value, and $d = -1$ indicates the reverse\.
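For illustration, Eq. 16 translates directly into the following Python function; the name `cliffs_delta` and the toy inputs are assumptions, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def cliffs_delta(control, perturb):
    """Cliff's delta (Eq. 16): fraction of pairs with x > y minus fraction with x < y."""
    greater = sum(1 for x in control for y in perturb if x > y)
    less = sum(1 for x in control for y in perturb if x < y)
    return (greater - less) / (len(control) * len(perturb))

d_high = cliffs_delta([5.0, 6.0], [1.0, 2.0])            # control always larger
d_low = cliffs_delta([1.0, 2.0], [5.0, 6.0])             # control always smaller
d_same = cliffs_delta([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # identical samples
```

With this sign convention a feature that increases under perturbation yields a negative delta, which is worth keeping in mind when reading the value alongside `delta_median`.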

The $p$-value corresponds to the result of a two-sided Mann–Whitney U test. It helps assess whether the observed difference could be explained by random variation, under the assumption that the null hypothesis is true. The $p$-value is computed as:

$$p = 2\cdot\left(1 - \Phi\left(\left|\frac{U - \frac{n_{a}n_{b}}{2}}{\sqrt{\frac{n_{a}n_{b}\left(n_{a} + n_{b} + 1\right)}{12}}}\right|\right)\right) \tag{17}$$

where $U$ is the Mann–Whitney U statistic, and $n_{a}$, $n_{b}$ are the sample sizes of the two groups being compared. The term $\Phi(\cdot)$ denotes the cumulative distribution function (CDF) of the standard normal distribution. The numerator measures the deviation of the observed $U$ value from its expected value under the null hypothesis, and the denominator is the standard deviation of $U$ under the null hypothesis. This standardization transforms the $U$ statistic into a z-score, which is then used to compute the two-tailed $p$-value. A small $p$-value indicates that the observed difference in distributions is unlikely to have occurred by chance.
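The normal approximation in Equation (17) can be sketched with the standard library alone. The version below counts ties as one half when forming $U$, which is an assumption, and applies neither a tie correction nor a continuity correction because Equation (17) contains none; the function name is illustrative.

```python
import math


def mann_whitney_p(a, b):
    """Two-sided p-value from the normal approximation; returns (U, z, p)."""
    n_a, n_b = len(a), len(b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both groups must be non-empty")
    # U for group a: number of pairs with a_i > b_j, ties counted as 0.5.
    u = sum(1.0 if ai > bj else 0.5 if ai == bj else 0.0
            for ai in a for bj in b)
    mean_u = n_a * n_b / 2.0
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z = (u - mean_u) / sd_u
    # Standard normal CDF expressed through the error function.
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return u, z, 2.0 * (1.0 - phi)
```

For small samples or heavily tied data, an exact or tie-corrected test would be a more careful choice than this approximation.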

## Appendix M MLLMs Baseline Details

### M.1 Methods

To evaluate the reasoning capability of current mainstream MLLMs on the Cell Painting dataset, we test four API\-accessible models: Grok\-4, GPT\-5, Claude\-4\-Sonnet, and Gemini\-2\.5\-Pro\. The experimental workflow consists of two stages\. First, each MLLM performs background knowledge curation as a single preliminary task\. The curated information is then used as context for zero\-shot VQA across three tasks: the cell line task, the channel task, and the perturbation compound task\. During background knowledge curation, the decoding parameters are set to temperature = 0\.7 and top\-p = 0\.95, whereas for VQA they are set to temperature = 1 and top\-p = 1 to ensure response stability\. All MLLMs are prompted with the same structured instructions specifying the evaluation criteria\. In the VQA stage, the models receive both control and perturbation images together with masked textual descriptions\. Their task is to select the correct answer from multiple\-choice options that include the ground\-truth label and to provide both a confidence estimate and a concise rationale\. An example prompt is shown below\.

In addition to the zero\-shot setting described above, we further evaluate a few\-shot variant of the same protocol to make the comparison with CP\-Agent more conservative\. For each of the three tasks \(cell line, channel, and perturbation compound\), we construct a small visual memory bank consisting of two labeled exemplar image pairs per class \(control \+ perturbation\)\. These exemplars are selected from the training split and are fixed across all MLLMs to ensure comparability\. In the few\-shot condition, the VQA prompt is augmented with these exemplars: before answering a query, the model is shown the memory bank with the corresponding class labels and is instructed to use these examples as visual references when reasoning about the new control–perturbation pair\.

The overall prompting structure, background knowledge curation stage, and decoding parameters remain identical to the zero-shot setup. The only difference is the inclusion of the exemplar memory bank in the VQA stage. As reported in Table [12](https://arxiv.org/html/2606.03435#A13.T12), few-shot prompting yields modest improvements on the cell line and channel tasks, indicating that current MLLMs can benefit from limited visual grounding. However, performance on the perturbation compound task remains very low, and the models do not exhibit reliable compound-level discrimination despite changes in the prediction distribution. This suggests that the subtle morphological signatures induced by chemical perturbations are difficult for general-purpose MLLMs to acquire from a small number of Cell Painting exemplars.

### M.2 Prompts

![[Uncaptioned image]](https://arxiv.org/html/2606.03435v1/x9.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03435v1/x10.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03435v1/x11.png)
### M.3 Detailed Results

![Figure 9](https://arxiv.org/html/2606.03435v1/x12.png)
Figure 9: Confusion matrix on cell line task.

![Figure 10](https://arxiv.org/html/2606.03435v1/x13.png)
Figure 10: Confusion matrix on image channel task.

![Figure 11](https://arxiv.org/html/2606.03435v1/x14.png)
Figure 11: Confusion matrix on perturbation compound task.
### M.4 Few-shot Results

Table 12: Few-shot model performance on classification tasks. The columns from Flindokalner through Macro-avg report the perturbation compound task.

| Model | Cell line | Channel | Flindokalner | Racecadotril | AZM-475271 | Misoprostol | Trazodone | Orantinib | Rufinamide | Lumiracoxib | BIRB-796 | Methoxsalen | Macro-avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Grok-4 | 0.515 (+0.067) | 0.260 (+0.032) | 0.224 | 0.184 | 0.0 | 0.0 | 0.390 | 0.176 | 0.034 | 0.000 | 0.0 | 0.000 | 0.101 |
| GPT-5 | 0.440 (+0.063) | 0.510 (+0.071) | 0.0 | 0.0 | 0.066 | 0.0 | 0.115 | 0.000 | 0.079 | 0.000 | 0.088 | 0.000 | 0.035 |
| Claude-4-Sonnet | 0.520 (+0.070) | 0.225 (+0.027) | 0.000 | 0.000 | 0.000 | 0.055 | 0.000 | 0.000 | 0.000 | 0.000 | 0.210 | 0.000 | 0.026 |
| Gemini-2.5-Pro | 0.600 (+0.074) | 0.730 (+0.102) | 0.0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.0 | 0.000 | 0.160 | 0.074 | 0.000 | 0.023 |

## Appendix N CP-Agent Prompts

The prompts guide the CP-Agent through a multi-step reasoning process to interpret morphological effects of perturbations in Cell Painting data. Figure [12](https://arxiv.org/html/2606.03435#A14.F12) introduces two tasks: (1) a background curation step, where the agent synthesizes prior biological knowledge about a compound’s mechanism of action (MoA) and predicts which CellProfiler feature classes are likely to be affected in a specific imaging channel, and (2) a feature ranking task, where individual features are prioritized based on their relevance to the predicted morphological response. Figure [13](https://arxiv.org/html/2606.03435#A14.F13) guides the CP-Agent to evaluate whether observed morphological changes under a perturbation are consistent with the proposed MoA. Using prior biological knowledge and quantitative feature summaries, the agent assesses each feature’s directional change, links it to the expected mechanism, and assigns confidence scores. The agent then provides an overall judgment of mechanism plausibility, highlighting supporting or conflicting evidence. All prompts enforce structured JSON outputs to ensure compatibility with automated downstream analysis and promote reproducibility.

![Figure 12](https://arxiv.org/html/2606.03435v1/x15.png)
Figure 12: Prompt templates for background curation and feature ranking.

![Figure 13](https://arxiv.org/html/2606.03435v1/x16.png)
Figure 13: Prompt template for evaluating mechanism-feature consistency in Cell Painting data.
## Appendix O Additional case studies

### O.1 Additional case 1: Taxol in MCF7

![[Uncaptioned image]](https://arxiv.org/html/2606.03435v1/x17.png)
### O.2 Additional case 2: Vincristine in MCF7

![[Uncaptioned image]](https://arxiv.org/html/2606.03435v1/x18.png)
### O.3 Additional case 3: Sorbinil in A549

![[Uncaptioned image]](https://arxiv.org/html/2606.03435v1/x19.png)
### O.4 Additional case 4: BGT226 in HUVEC

![[Uncaptioned image]](https://arxiv.org/html/2606.03435v1/x20.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03435v1/x21.png)
### O.5 Additional case 5: AZ841 in MCF7

![[Uncaptioned image]](https://arxiv.org/html/2606.03435v1/x22.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03435v1/x23.png)

## Appendix P Reasoning evaluation criteria

To facilitate consistent and high-quality responses, we shared the following rubric and example list with the participating experts as initial guidance. This framework outlines key criteria for evaluating the Language Quality and Reasoning Quality of model-generated explanations in biological tasks. The rubric emphasizes five core aspects of language quality (accuracy, relevance, coherence, depth, and conciseness) and five reasoning quality metrics (pattern recognition, stepwise reasoning, biological deduction, hypothesis formation, and mechanistic insight). Each criterion is paired with both positive and negative examples to help clarify expectations and common pitfalls.

### P.1 Language Quality Criteria

![Figure 14](https://arxiv.org/html/2606.03435v1/x24.png)
Figure 14: Language quality criteria for evaluating CP-Agent generated Cell Painting reports.
### P.2 Reasoning Quality Criteria

![Figure 15](https://arxiv.org/html/2606.03435v1/x25.png)
Figure 15: Reasoning quality criteria for evaluating CP-Agent generated Cell Painting reports.

## Appendix Q Expert Ratings of CP-Agent generated Reports Across Language and Reasoning Criteria

### Q.1 Human Expert Assessment of Perturbation Report Quality Across LLMs

Figure [16](https://arxiv.org/html/2606.03435#A17.F16) summarizes expert evaluations across ten rubric criteria, split into five language quality dimensions (Figure [16](https://arxiv.org/html/2606.03435#A17.F16)a) and five reasoning quality dimensions (Figure [16](https://arxiv.org/html/2606.03435#A17.F16)b). On average, all four LLMs received high ratings (mostly above 5.0 on a 7-point scale), indicating strong performance in generating biologically grounded screening reports. Among the models, GPT-5 achieved the highest scores on most reasoning metrics, including pattern recognition, algorithmic reasoning, and mechanistic insight, while also maintaining strong language quality. Gemini-2.5-Pro followed closely, particularly excelling in relevance and coherence. Claude-Sonnet-4 underperformed slightly in mechanistic insight and inductive reasoning, indicating somewhat weaker performance in higher-order biological inference. Grok-4 showed relatively balanced language quality but lagged slightly in depth and coherence compared to the top-performing models. The bar chart (Figure [16](https://arxiv.org/html/2606.03435#A17.F16)c) further illustrates per-metric mean scores, reinforcing the finding that reasoning dimensions pose a greater challenge than surface-level language quality, especially in tasks requiring mechanistic interpretation and hypothesis generation.

![Figure 16](https://arxiv.org/html/2606.03435v1/x26.png)
Figure 16: Expert evaluation of LLM-generated screening reports across language and reasoning dimensions. (a–b) Mean expert ratings (on a 7-point scale) for language and reasoning quality, based on ten rubric-based evaluation criteria. (c) Bar chart summarizing per-metric mean scores across the four evaluated models: Claude-Sonnet-4 (blue), Gemini-2.5-Pro (orange), GPT-5 (green), and Grok-4 (red).
### Q.2 Expert Evaluation Consistency Analysis

To assess the reliability and consistency of the expert evaluations used in our study, we conducted a comprehensive inter-rater agreement analysis. The results are detailed in Table [13](https://arxiv.org/html/2606.03435#A17.T13) and summarized in Table [14](https://arxiv.org/html/2606.03435#A17.T14). We employed several standard metrics to quantify agreement among expert annotators: Kendall’s W, standard deviation, exact agreement percentage, ±1 rating agreement, and mean absolute difference (MAD). Kendall’s W values ranged from 0.33 to 0.37 across models, indicating fair agreement among annotators. At the pairwise level, 73.3% of all rating pairs fell within ±1 point on the 0–7 scale, and the MAD was 1.04, suggesting a high degree of rating consistency. Given the limited number of annotators, we used bootstrap resampling to compute 95% confidence intervals for all key reliability metrics, ensuring robust statistical estimation.
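The agreement metrics named above can be sketched as follows. This is an illustration under stated assumptions, not the authors’ implementation: ratings are stored as one list per rater, the ±1 agreement includes exact matches, and Kendall’s W uses average ranks for tied scores without a tie correction. All function names are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(ratings):
    """ratings[r][i] is the score rater r gave to item i."""
    diffs = [abs(x - y)
             for item_scores in zip(*ratings)
             for x, y in combinations(item_scores, 2)]
    n = len(diffs)
    return {
        "exact_pct": 100.0 * sum(d == 0 for d in diffs) / n,
        "near_pct": 100.0 * sum(d <= 1 for d in diffs) / n,  # within +/-1 point
        "mad": sum(diffs) / n,  # mean absolute difference
    }


def average_ranks(scores):
    """Ranks starting at 1; tied scores share the average of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def kendalls_w(ratings):
    """Kendall's W for m raters and n items (no tie correction)."""
    m, n = len(ratings), len(ratings[0])
    rank_rows = [average_ranks(row) for row in ratings]
    totals = [sum(col) for col in zip(*rank_rows)]
    mean_total = m * (n + 1) / 2.0
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12.0 * s / (m ** 2 * (n ** 3 - n))
```

On a coarse rating scale ties are frequent, so a tie-corrected W would generally be somewhat higher than the uncorrected value computed here.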

Although the expert ratings did not reveal statistically significant differences between models, we observed consistent trends across raters—particularly that GPT\-5 achieved marginally higher average scores across most evaluation dimensions\. However, these differences were relatively small, implying that all evaluated LLMs were able to generate informative, relevant, and biologically meaningful reports when operating under our CP\-Agent generation pipeline\.

These findings collectively support the internal consistency and reliability of the expert evaluation process, reinforcing confidence in the qualitative assessments reported in the main text\.

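Table 13: Inter-rater Evaluation for Expert Evaluation

| Model | Metric | Kendall_W | Kendall_W_L | Kendall_W_U | SD | SD_L | SD_U | Exact% | Exact_L | Exact_U | Near% | Near_L | Near_U | MAD | MAD_L | MAD_U |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude-Sonnet-4 | Accuracy | 0.303 | 0.148 | 0.495 | 1.01 | 0.89 | 1.14 | 24.1 | 20.2 | 28.4 | 68.7 | 62.6 | 74.7 | 1.16 | 1.02 | 1.31 |
| Claude-Sonnet-4 | Relevance | 0.349 | 0.194 | 0.548 | 0.95 | 0.85 | 1.04 | 26.0 | 22.4 | 30.4 | 69.4 | 62.1 | 77.1 | 1.11 | 0.96 | 1.24 |
| Claude-Sonnet-4 | Coherence | 0.330 | 0.151 | 0.523 | 0.87 | 0.78 | 0.94 | 30.4 | 25.8 | 36.5 | 75.9 | 69.1 | 82.9 | 0.97 | 0.83 | 1.09 |
| Claude-Sonnet-4 | Depth | 0.291 | 0.122 | 0.484 | 0.91 | 0.85 | 0.97 | 27.2 | 23.5 | 31.3 | 74.5 | 70.8 | 77.6 | 1.03 | 0.95 | 1.13 |
| Claude-Sonnet-4 | Conciseness | 0.244 | 0.106 | 0.434 | 1.05 | 0.90 | 1.23 | 24.8 | 22.0 | 26.9 | 70.4 | 65.0 | 75.6 | 1.17 | 1.03 | 1.34 |
| Claude-Sonnet-4 | Pattern Recognition | 0.312 | 0.138 | 0.520 | 0.89 | 0.77 | 1.04 | 29.9 | 26.2 | 33.2 | 77.1 | 71.2 | 83.3 | 0.99 | 0.86 | 1.14 |
| Claude-Sonnet-4 | Algorithmic Reasoning | 0.368 | 0.153 | 0.650 | 0.88 | 0.82 | 0.94 | 28.1 | 24.9 | 32.0 | 75.9 | 70.9 | 81.1 | 1.00 | 0.90 | 1.09 |
| Claude-Sonnet-4 | Deductive | 0.402 | 0.200 | 0.628 | 0.90 | 0.83 | 0.99 | 29.8 | 26.4 | 33.5 | 74.0 | 68.4 | 78.7 | 1.01 | 0.91 | 1.12 |
| Claude-Sonnet-4 | Inductive | 0.367 | 0.181 | 0.619 | 0.92 | 0.81 | 1.02 | 27.6 | 22.2 | 34.4 | 73.7 | 66.6 | 81.5 | 1.04 | 0.88 | 1.20 |
| Claude-Sonnet-4 | Mechanistic Insights | 0.342 | 0.152 | 0.563 | 1.03 | 0.93 | 1.13 | 23.7 | 21.1 | 26.0 | 66.9 | 60.7 | 72.7 | 1.18 | 1.07 | 1.30 |
| Gemini-2.5-Pro | Accuracy | 0.370 | 0.171 | 0.565 | 0.94 | 0.86 | 1.01 | 25.3 | 22.2 | 28.4 | 72.2 | 66.3 | 77.5 | 1.08 | 0.97 | 1.18 |
| Gemini-2.5-Pro | Relevance | 0.409 | 0.191 | 0.664 | 0.94 | 0.87 | 1.00 | 25.0 | 22.4 | 27.7 | 72.3 | 68.2 | 76.5 | 1.08 | 1.00 | 1.16 |
| Gemini-2.5-Pro | Coherence | 0.373 | 0.181 | 0.548 | 0.84 | 0.76 | 0.90 | 29.8 | 24.9 | 35.1 | 79.2 | 74.0 | 84.5 | 0.94 | 0.83 | 1.05 |
| Gemini-2.5-Pro | Depth | 0.300 | 0.119 | 0.518 | 0.85 | 0.80 | 0.90 | 30.4 | 27.1 | 33.8 | 77.0 | 73.4 | 80.9 | 0.95 | 0.88 | 1.03 |
| Gemini-2.5-Pro | Conciseness | 0.302 | 0.151 | 0.472 | 0.89 | 0.83 | 0.94 | 27.0 | 24.7 | 29.5 | 75.2 | 72.0 | 78.2 | 1.02 | 0.95 | 1.08 |
| Gemini-2.5-Pro | Pattern Recognition | 0.359 | 0.205 | 0.526 | 0.84 | 0.74 | 0.91 | 30.5 | 26.1 | 35.6 | 78.3 | 73.0 | 84.8 | 0.94 | 0.82 | 1.05 |
| Gemini-2.5-Pro | Algorithmic Reasoning | 0.381 | 0.179 | 0.597 | 0.81 | 0.73 | 0.90 | 31.9 | 27.9 | 35.6 | 80.5 | 74.4 | 86.8 | 0.90 | 0.79 | 1.02 |
| Gemini-2.5-Pro | Deductive | 0.412 | 0.207 | 0.682 | 0.89 | 0.79 | 1.00 | 31.6 | 26.6 | 36.4 | 75.4 | 68.8 | 81.1 | 0.99 | 0.86 | 1.13 |
| Gemini-2.5-Pro | Inductive | 0.395 | 0.174 | 0.651 | 0.93 | 0.87 | 0.99 | 29.6 | 24.9 | 35.2 | 68.7 | 63.6 | 74.1 | 1.06 | 0.96 | 1.14 |
| Gemini-2.5-Pro | Mechanistic Insights | 0.304 | 0.168 | 0.499 | 0.93 | 0.84 | 1.03 | 28.3 | 23.3 | 33.4 | 71.9 | 65.2 | 78.7 | 1.06 | 0.93 | 1.19 |
| GPT-5 | Accuracy | 0.337 | 0.175 | 0.504 | 0.95 | 0.86 | 1.04 | 27.2 | 22.4 | 31.8 | 72.1 | 66.5 | 77.3 | 1.08 | 0.96 | 1.21 |
| GPT-5 | Relevance | 0.354 | 0.198 | 0.533 | 0.90 | 0.84 | 0.98 | 27.7 | 24.7 | 31.1 | 70.4 | 65.3 | 75.4 | 1.05 | 0.96 | 1.15 |
| GPT-5 | Coherence | 0.357 | 0.165 | 0.614 | 0.89 | 0.83 | 0.95 | 30.8 | 25.6 | 37.5 | 73.7 | 69.6 | 77.3 | 0.98 | 0.89 | 1.07 |
| GPT-5 | Depth | 0.307 | 0.139 | 0.526 | 0.92 | 0.83 | 1.00 | 29.6 | 25.3 | 34.0 | 71.3 | 63.8 | 78.0 | 1.04 | 0.92 | 1.15 |
| GPT-5 | Conciseness | 0.303 | 0.140 | 0.469 | 0.97 | 0.83 | 1.16 | 28.2 | 24.0 | 32.0 | 71.5 | 63.5 | 77.6 | 1.08 | 0.94 | 1.28 |
| GPT-5 | Pattern Recognition | 0.360 | 0.207 | 0.540 | 0.81 | 0.72 | 0.89 | 32.7 | 28.0 | 38.5 | 79.6 | 74.4 | 84.7 | 0.91 | 0.78 | 1.02 |
| GPT-5 | Algorithmic Reasoning | 0.422 | 0.245 | 0.609 | 0.86 | 0.73 | 0.99 | 30.6 | 24.5 | 37.2 | 78.3 | 69.3 | 86.7 | 0.96 | 0.79 | 1.15 |
| GPT-5 | Deductive | 0.494 | 0.306 | 0.669 | 0.92 | 0.87 | 0.98 | 25.8 | 23.5 | 28.2 | 72.1 | 68.2 | 75.7 | 1.07 | 1.00 | 1.15 |
| GPT-5 | Inductive | 0.424 | 0.254 | 0.609 | 0.95 | 0.91 | 0.99 | 24.5 | 22.2 | 27.1 | 68.7 | 66.5 | 70.7 | 1.12 | 1.07 | 1.17 |
| GPT-5 | Mechanistic Insights | 0.358 | 0.204 | 0.521 | 0.96 | 0.88 | 1.04 | 28.2 | 23.1 | 34.2 | 67.1 | 62.4 | 72.5 | 1.11 | 0.98 | 1.22 |
| Grok-4 | Accuracy | 0.278 | 0.120 | 0.493 | 0.93 | 0.81 | 1.06 | 27.8 | 23.7 | 32.2 | 73.3 | 66.4 | 80.2 | 1.06 | 0.91 | 1.21 |
| Grok-4 | Relevance | 0.403 | 0.214 | 0.633 | 0.92 | 0.84 | 0.98 | 25.7 | 22.9 | 29.1 | 73.4 | 68.5 | 79.0 | 1.05 | 0.94 | 1.15 |
| Grok-4 | Coherence | 0.367 | 0.199 | 0.586 | 0.81 | 0.68 | 0.94 | 35.6 | 31.2 | 39.7 | 81.1 | 73.8 | 88.7 | 0.87 | 0.73 | 1.03 |
| Grok-4 | Depth | 0.300 | 0.136 | 0.503 | 0.97 | 0.87 | 1.07 | 25.8 | 23.5 | 28.1 | 70.7 | 65.0 | 75.6 | 1.10 | 1.00 | 1.23 |
| Grok-4 | Conciseness | 0.265 | 0.104 | 0.454 | 1.04 | 0.91 | 1.16 | 22.9 | 19.3 | 26.7 | 67.1 | 60.9 | 73.8 | 1.20 | 1.04 | 1.34 |
| Grok-4 | Pattern Recognition | 0.359 | 0.171 | 0.582 | 0.90 | 0.81 | 1.00 | 29.5 | 25.3 | 33.3 | 73.7 | 67.1 | 78.8 | 1.01 | 0.90 | 1.16 |
| Grok-4 | Algorithmic Reasoning | 0.399 | 0.236 | 0.593 | 0.85 | 0.69 | 0.99 | 33.7 | 26.2 | 41.5 | 79.7 | 71.4 | 88.6 | 0.92 | 0.73 | 1.11 |
| Grok-4 | Deductive | 0.498 | 0.279 | 0.700 | 0.89 | 0.81 | 0.98 | 28.8 | 24.4 | 32.5 | 76.3 | 71.6 | 79.8 | 1.00 | 0.91 | 1.11 |
| Grok-4 | Inductive | 0.483 | 0.270 | 0.677 | 0.98 | 0.91 | 1.05 | 25.4 | 22.5 | 28.4 | 69.0 | 64.4 | 73.5 | 1.13 | 1.04 | 1.23 |
| Grok-4 | Mechanistic Insights | 0.323 | 0.149 | 0.519 | 0.99 | 0.86 | 1.11 | 29.1 | 26.0 | 32.4 | 66.6 | 58.9 | 74.7 | 1.12 | 0.97 | 1.27 |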

Table 14: Summary of Inter-rater Evaluations Across LLMs

| Model | Kendall_W | Score_Avg | Score_Std | Near_Agreement % |
|---|---|---|---|---|
| Claude-Sonnet-4 | 0.33 | 5.45 | 0.94 | 72.65 |
| Gemini-2.5-Pro | 0.36 | 5.53 | 0.89 | 75.07 |
| GPT-5 | 0.37 | 5.59 | 0.91 | 72.48 |
| Grok-4 | 0.37 | 5.51 | 0.93 | 73.09 |

## Appendix R Robustness Evaluation: FeatRank Agent and ReportGen Agent

The reproducibility of CP-Agent’s reasoning outputs is primarily influenced by the temperature parameter of the large language models (LLMs), because the perturbation condition is inferred by CP-CLIP, which already makes the conclusion of the report deterministic. In our pipeline, the initial pretrained CLIP model is followed by two LLM modules: the FeatRank Agent, which ranks CellProfiler-extracted morphology features, and the ReportGen Agent, which generates natural language reports from the ranked features and contextual information. To make the feature ranking step as deterministic as possible, we set the temperature of the FeatRank Agent to 0. To systematically evaluate reproducibility, we designed five experiments in which the ReportGen Agent’s temperature ranged from 0.0 to 1.0 while the FeatRank Agent’s temperature was fixed at 0.0. In each setting, we repeated the pipeline 30 times on the same input samples and analyzed the consistency of the selected features and generated reports.
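One way to summarize the repeated FeatRank runs is to record how many features each run selects and which features recur most often. The sketch below is illustrative only: the function name is hypothetical, the input is assumed to be one list of feature names per run, and the use of the sample standard deviation is an assumption.

```python
from collections import Counter
import statistics


def summarize_runs(runs, top_k=5):
    """Summarize repeated runs, each given as a list of selected feature names."""
    counts = [len(run) for run in runs]
    # Count each feature at most once per run.
    freq = Counter(name for run in runs for name in set(run))
    # Break ties alphabetically so the result is deterministic.
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return {
        "features_number_avg": statistics.mean(counts),
        "features_number_std": statistics.stdev(counts) if len(counts) > 1 else 0.0,
        "top_stable_features": [name for name, _ in ranked[:top_k]],
    }
```

Applied to the 30 repetitions of one temperature setting, this yields the kind of per-setting summary reported for the FeatRank Agent.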

### System Prompt

You are a scientific evaluator specialized in assessing corpora of mechanism assessment reports derived from Cell Painting assays\.

Your task is to evaluate the internal consistency of a set of 30 mechanism assessment reports\. Each report analyzes how observed morphological features align with a hypothesized mechanism of action for a specific chemical perturbation\.

You are not evaluating the scientific accuracy or biological correctness of any individual report\. Your focus is on how mutually consistent the reports are with each other in terms of:

- Scientific focus and biological reasoning style
- Use of technical terminology and feature names
- Presentation structure and rhetorical flow

You will be provided with the full set of reports or representative excerpts\. Based on this, you will assign scores for each consistency dimension and briefly justify your evaluation\.

Return only the JSON structure specified in the user prompt\. Do not include any commentary or additional explanation\.

### User Prompt

INPUTS:

- A corpus of 30 Cell Painting-based mechanism assessment reports
- Each report typically includes: Mechanism verdict, feature-based evidence summary, a mechanistic linkage explanation and caveats or alternative hypotheses.

TASK: Evaluate the corpus for internal consistency across the following three dimensions\. You are not judging scientific accuracy\. Focus on mutual alignment in scientific reasoning, terminology, and structure\.

#### 1\. Thematic Consistency

Definition: Do all reports demonstrate same scientific purpose and consistent style of biological reasoning?

What to look for:

- Reports clearly assess whether observed phenotypes support a hypothesized MoA
- Use of biological logic at cellular or subcellular level
- Use of structured reasoning patterns: “If MoA X is true, then we expect Y phenotype, which we observe.”; “This phenotype is consistent with known outcomes of X (e.g., ER stress)”
- Avoidance of vague or off-topic content
- Consistency in depth of biological interpretation

Scoring Guide:

- 10: Reports are clear, coherent, and mechanistic
- 8–9: Mostly consistent with minor variation in depth
- 6–7: Some reports are descriptive only
- 4–5: Inconsistent styles or lack MoA focus
- 1–3: Highly divergent in purpose or reasoning

#### 2\. Terminological Consistency

Definition: Are technical terms, feature names, and mechanistic labels used consistently?

What to look for:

- Uniform use of Cell Painting features (e.g., Texture_Entropy)
- Consistent naming of biological phenomena
- Standardized use of MoA terms (e.g., “oxidative stress”)
- Avoidance of ambiguous or informal language
- Quantitative descriptors (e.g., “+0.3 increase”) preferred over vague phrases

Scoring Guide:

- 10: Terminology precise and consistent
- 8–9: Minor synonym use
- 6–7: Mixed naming for same terms
- 4–5: Frequent inconsistencies
- 1–3: Terms are vague or informal

#### 3\. Structural Consistency

Definition: Do reports follow a similar structure and rhetorical flow?

What to look for:

- Inclusion of key components: Mechanism verdict, evidence summary, mechanistic linkage, caveats or alternatives.
- Similar order, depth, and sentence structure
- Avoidance of missing or reordered sections

Scoring Guide:

- 10: Fully consistent structure
- 8–9: Minor variation in order or depth
- 6–7: Incomplete or reordered sections
- 4–5: Frequent structural differences
- 1–3: Highly inconsistent organization

#### Scoring Instructions

- Assign a score between 1–10 for each dimension
- Provide a concise justification (<50 words)
- Return only the JSON in the following format

JSON Output Format

```
{
  "corpus_evaluation": {
    "Thematic Consistency": {
      "score": <1-10>,
      "justification": "<brief explanation>"
    },
    "Terminological Consistency": {
      "score": <1-10>,
      "justification": "<brief explanation>"
    },
    "Structural Consistency": {
      "score": <1-10>,
      "justification": "<brief explanation>"
    }
  }
}
```

### Corpus Evaluation

Table 15: FeatRank Agent’s Repeatability

| Temp 1 | Features Number Avg | Features Number std | Top 5 Stable Features |
|---|---|---|---|
| 0.0 | 18.37 | 2.37 | AreaShape_Area, AreaShape_Eccentricity, Texture_Contrast_5_02_256, Granularity_2, Granularity_3 |
| 0.0 | 18.20 | 2.26 | AreaShape_Area, AreaShape_Eccentricity, Texture_Contrast_5_02_256, Granularity_2, Granularity_3 |
| 0.0 | 18.27 | 2.31 | AreaShape_Area, AreaShape_Eccentricity, Texture_Contrast_5_02_256, Granularity_2, Granularity_3 |
| 0.0 | 18.87 | 1.96 | AreaShape_Area, AreaShape_Eccentricity, Texture_Contrast_5_02_256, Granularity_2, Granularity_3 |
| 0.0 | 17.17 | 2.28 | AreaShape_Area, AreaShape_Eccentricity, Texture_Contrast_5_02_256, Granularity_2, Granularity_3 |

Table 16: Corpus score comparison across different temperature settings.

| Temp1/Temp2 | Thematic Consistency | Terminological Consistency | Structural Consistency | Averaged Corpus Score |
|---|---|---|---|---|
| 0.0/0.0 | 8.00 | 7.00 | 9.00 | 8.00 |
| 0.0/0.1 | 8.00 | 7.00 | 8.00 | 7.67 |
| 0.0/0.2 | 8.00 | 7.00 | 8.00 | 7.67 |
| 0.0/0.5 | 8.00 | 7.00 | 8.00 | 7.67 |
| 0.0/1.0 | 8.00 | 7.00 | 8.00 | 7.67 |
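The values in Table 16 are consistent with the Averaged Corpus Score being the arithmetic mean of the three dimension scores (for example, (8 + 7 + 8) / 3 ≈ 7.67). The sketch below parses an evaluator response in the JSON format requested by the user prompt and computes that mean; the function name and the range check are illustrative assumptions, not part of the described pipeline.

```python
import json

DIMENSIONS = (
    "Thematic Consistency",
    "Terminological Consistency",
    "Structural Consistency",
)


def averaged_corpus_score(raw_json):
    """Parse the evaluator's JSON reply and average the three dimension scores."""
    block = json.loads(raw_json)["corpus_evaluation"]
    scores = []
    for dim in DIMENSIONS:
        score = block[dim]["score"]
        # The prompt asks for scores between 1 and 10; reject anything else.
        if not isinstance(score, (int, float)) or not 1 <= score <= 10:
            raise ValueError(f"invalid score for {dim!r}: {score!r}")
        scores.append(score)
    return round(sum(scores) / len(scores), 2)
```

Validating the reply before averaging guards against malformed evaluator output, which matters when the same prompt is run across many temperature settings.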

## Appendix S Counterfactual Prompt Experiments: Dosage and Time

To address the concern that the high retrieval performance may stem from metadata correlations (e.g., compound identity being indirectly inferred from time or dosage) rather than genuine multimodal alignment, we conducted controlled ablation experiments to isolate how much each textual component contributes to model performance. Specifically, we masked individual fields in the text prompts (compound name + MOA, concentration, or time) and measured the resulting changes in retrieval accuracy.
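The absolute and relative changes reported in Table 17 follow from simple arithmetic on the original and masked scores. The helper below is an illustrative sketch (the function name is not from the source); for the MRR rows the last digit of the relative change may differ slightly from the table, presumably because the table was computed from unrounded values.

```python
def masking_deltas(original, masked):
    """Return (absolute change, relative change in %) after masking a field."""
    absolute = masked - original
    relative = 100.0 * absolute / original
    return round(absolute, 4), round(relative, 2)


# Text-to-image R@1 when masking compound name + MOA (Experiment 1).
print(masking_deltas(98.70, 3.50))
# Text-to-image R@1 when masking time only (Experiment 3).
print(masking_deltas(98.70, 93.00))
```

Reporting both quantities is useful because a small absolute change in MRR can still be a large relative change, while R@K values are already on a percentage scale.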

As shown in Table [17](https://arxiv.org/html/2606.03435#A19.T17), masking the compound name and MOA (Experiment 1) results in a catastrophic performance drop (e.g., text-to-image R@1 drops from 98.70 to 3.50; MRR drops by nearly 90%), indicating that the model relies heavily on compound-specific textual information. We mask the compound name and its mechanism of action (MOA) jointly, rather than separately, because these two fields are semantically correlated and often co-informative. This design choice is consistent with our setup in the drug classification experiment (Table [2](https://arxiv.org/html/2606.03435#S3.T2)) and ensures a fair and aligned evaluation across experiments. In contrast, masking concentration produces a moderate drop (text-to-image R@1 falls to 57.00), and masking time produces only a small one (text-to-image R@1 falls to 93.00). This pattern suggests that the model encodes and leverages compound identity meaningfully, while concentration and time contribute less to retrieval. To further test CP-CLIP’s robustness to misleading contextual cues, we designed counterfactual prompt experiments in two settings:

- Clean Prompt: The target classification field was masked, but all other contextual metadata (e.g., time, channel) remained correct.
- Disturbed Prompt: The target field was masked, and unrelated metadata fields were shuffled randomly, keeping only the compound identity correct, since we have demonstrated that compound identity has a strong influence on retrieval performance.
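The two settings can be sketched as operations on a metadata record. Everything below is hypothetical: the field names, the `[MASK]` token, the example channel names, and the choice to disturb a field by drawing a value from a pool of other records are assumptions made for illustration, since the source does not specify the prompt template or the shuffling procedure.

```python
import random

MASK = "[MASK]"  # hypothetical mask token


def clean_prompt(record, target):
    """Mask only the target field; all other metadata stays correct."""
    fields = dict(record)
    fields[target] = MASK
    return fields


def disturbed_prompt(record, target, pool, rng):
    """Mask the target field and replace every other field except the
    compound with a value drawn at random from `pool`."""
    fields = clean_prompt(record, target)
    for name in fields:
        if name in (target, "compound"):
            continue
        fields[name] = rng.choice([other[name] for other in pool])
    return fields


def render(fields):
    """Flatten the fields into a single text prompt (format is illustrative)."""
    return ", ".join(f"{key}: {value}" for key, value in fields.items())


record = {"compound": "neratinib", "cell_line": "A549", "channel": "DNA",
          "concentration": "1.0 uM", "time": "24h"}
pool = [record, {"compound": "primidone", "cell_line": "MCF7", "channel": "ER",
                 "concentration": "0.1 uM", "time": "48h"}]
print(render(clean_prompt(record, "time")))
print(render(disturbed_prompt(record, "time", pool, random.Random(0))))
```

Because the replacement values are sampled, a disturbed field can occasionally coincide with the correct value; a stricter variant would resample until it differs.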

Table 17: Retrieval performance before and after masking different textual components.

| Experiment | Direction | Metric | Original | Masked | δ Absolute | δ Relative (%) |
|---|---|---|---|---|---|---|
| Experiment 1: Mask Compound Name + MOA | Text-to-Image | R@1 | 98.70 | 3.50 | -95.20 | -96.45 |
| Experiment 1: Mask Compound Name + MOA | Text-to-Image | R@5 | 100.00 | 16.80 | -83.20 | -83.20 |
| Experiment 1: Mask Compound Name + MOA | Text-to-Image | R@10 | 100.00 | 29.60 | -70.40 | -70.40 |
| Experiment 1: Mask Compound Name + MOA | Text-to-Image | MRR | 0.9935 | 0.1178 | -0.8757 | -88.15 |
| Experiment 1: Mask Compound Name + MOA | Image-to-Text | R@1 | 98.70 | 3.30 | -95.40 | -96.66 |
| Experiment 1: Mask Compound Name + MOA | Image-to-Text | R@5 | 100.00 | 16.10 | -83.90 | -83.90 |
| Experiment 1: Mask Compound Name + MOA | Image-to-Text | R@10 | 100.00 | 28.70 | -71.30 | -71.30 |
| Experiment 1: Mask Compound Name + MOA | Image-to-Text | MRR | 0.9935 | 0.1102 | -0.8833 | -88.90 |
| Experiment 2: Mask Concentration Only | Text-to-Image | R@1 | 98.70 | 57.00 | -41.70 | -42.25 |
| Experiment 2: Mask Concentration Only | Text-to-Image | R@5 | 100.00 | 84.00 | -16.00 | -16.00 |
| Experiment 2: Mask Concentration Only | Text-to-Image | R@10 | 100.00 | 91.40 | -8.60 | -8.60 |
| Experiment 2: Mask Concentration Only | Text-to-Image | MRR | 0.9935 | 0.6829 | -0.3106 | -31.26 |
| Experiment 2: Mask Concentration Only | Image-to-Text | R@1 | 98.70 | 55.20 | -43.50 | -44.07 |
| Experiment 2: Mask Concentration Only | Image-to-Text | R@5 | 100.00 | 79.20 | -20.80 | -20.80 |
| Experiment 2: Mask Concentration Only | Image-to-Text | R@10 | 100.00 | 87.40 | -12.60 | -12.60 |
| Experiment 2: Mask Concentration Only | Image-to-Text | MRR | 0.9935 | 0.6617 | -0.3318 | -33.39 |
| Experiment 3: Mask Time Only | Text-to-Image | R@1 | 98.70 | 93.00 | -5.70 | -5.78 |
| Experiment 3: Mask Time Only | Text-to-Image | R@5 | 100.00 | 99.50 | -0.50 | -0.50 |
| Experiment 3: Mask Time Only | Text-to-Image | R@10 | 100.00 | 100.00 | 0.00 | 0.00 |
| Experiment 3: Mask Time Only | Text-to-Image | MRR | 0.9935 | 0.9590 | -0.0345 | -3.47 |
| Experiment 3: Mask Time Only | Image-to-Text | R@1 | 98.70 | 94.70 | -4.00 | -4.05 |
| Experiment 3: Mask Time Only | Image-to-Text | R@5 | 100.00 | 100.00 | 0.00 | 0.00 |
| Experiment 3: Mask Time Only | Image-to-Text | R@10 | 100.00 | 100.00 | 0.00 | 0.00 |
| Experiment 3: Mask Time Only | Image-to-Text | MRR | 0.9935 | 0.9726 | -0.0209 | -2.11 |

Table 18: Concentration Classification under Masked and Counterfactual Prompts

| Concentration | Firocoxib | Opicapone | Cinoxacin | Neratinib | Hydroflumethiazide | Acetaminophen | Primidone |
|---|---|---|---|---|---|---|---|
| 0.00316 µM CLIP (disturb) | 0.4866 | 0.3341 | 0.3889 | 0.2965 | 0.1634 | 0.2298 | 0.2174 |
| 0.00316 µM CP-CLIP (disturb) | 0.4663 | 0.5052 | 0.4890 | 0.4887 | 0.2350 | 0.3155 | 0.3866 |
| 0.00316 µM CLIP | 0.7371 | 0.5203 | 0.6491 | 0.5209 | 0.4780 | 0.4237 | 0.4167 |
| 0.00316 µM CP-CLIP | 0.6866 | 0.7307 | 0.6211 | 0.6233 | 0.2819 | 0.3909 | 0.5349 |
| 0.01 µM CLIP (disturb) | 0.5588 | 0.2708 | 0.1757 | 0.3731 | 0.2147 | 0.2161 | 0.2612 |
| 0.01 µM CP-CLIP (disturb) | 0.3212 | 0.4409 | 0.3694 | 0.4677 | 0.4906 | 0.4598 | 0.2321 |
| 0.01 µM CLIP | 0.7887 | 0.5391 | 0.3758 | 0.6267 | 0.4383 | 0.3473 | 0.5000 |
| 0.01 µM CP-CLIP | 0.4356 | 0.6343 | 0.4913 | 0.5521 | 0.5783 | 0.5561 | 0.2635 |
| 0.0316 µM CLIP (disturb) | 0.5714 | 0.1921 | 0.3178 | 0.2890 | 0.2017 | 0.2873 | 0.1670 |
| 0.0316 µM CP-CLIP (disturb) | 0.4335 | 0.4714 | 0.3928 | 0.7949 | 0.3725 | 0.5151 | 0.4015 |
| 0.0316 µM CLIP | 0.7437 | 0.4749 | 0.5839 | 0.5330 | 0.5033 | 0.5387 | 0.4628 |
| 0.0316 µM CP-CLIP | 0.5754 | 0.6048 | 0.6374 | 0.9205 | 0.4771 | 0.6364 | 0.4773 |
| 0.1 µM CLIP (disturb) | 0.2992 | 0.2652 | 0.3369 | 0.2866 | 0.2486 | 0.2232 | 0.1938 |
| 0.1 µM CP-CLIP (disturb) | 0.4660 | 0.2935 | 0.4015 | 0.3035 | 0.4051 | 0.3000 | 0.4148 |
| 0.1 µM CLIP | 0.4775 | 0.3554 | 0.3690 | 0.5203 | 0.2819 | 0.4823 | 0.2661 |
| 0.1 µM CP-CLIP | 0.6402 | 0.4459 | 0.5442 | 0.4974 | 0.5014 | 0.3832 | 0.4971 |
| 0.316 µM CLIP (disturb) | 0.3343 | 0.2301 | 0.2776 | 0.2131 | 0.2505 | 0.1994 | 0.2961 |
| 0.316 µM CP-CLIP (disturb) | 0.5079 | 0.3333 | 0.4027 | 0.4595 | 0.2848 | 0.4962 | 0.2642 |
| 0.316 µM CLIP | 0.5401 | 0.4179 | 0.6288 | 0.4591 | 0.6056 | 0.4077 | 0.5475 |
| 0.316 µM CP-CLIP | 0.6272 | 0.4686 | 0.5068 | 0.6438 | 0.4399 | 0.5477 | 0.3217 |
| 1.0 µM CLIP (disturb) | 0.4430 | 0.2748 | 0.2505 | 0.5455 | 0.1803 | 0.3082 | 0.3073 |
| 1.0 µM CP-CLIP (disturb) | 0.6364 | 0.4411 | 0.4103 | 0.4792 | 0.4459 | 0.3974 | 0.3702 |
| 1.0 µM CLIP | 0.6611 | 0.5773 | 0.6211 | 0.8856 | 0.4909 | 0.5766 | 0.4473 |
| 1.0 µM CP-CLIP | 0.8137 | 0.6469 | 0.6500 | 0.6217 | 0.5133 | 0.4469 | 0.5673 |
| 3.162 µM CLIP (disturb) | 0.3219 | 0.2933 | 0.2815 | 0.2119 | 0.2016 | 0.2635 | 0.2114 |
| 3.162 µM CP-CLIP (disturb) | 0.5483 | 0.3324 | 0.2900 | 0.4479 | 0.2600 | 0.1577 | 0.3991 |
| 3.162 µM CLIP | 0.5528 | 0.5902 | 0.5459 | 0.6442 | 0.5826 | 0.5459 | 0.3255 |
| 3.162 µM CP-CLIP | 0.7599 | 0.5243 | 0.4577 | 0.5812 | 0.3433 | 0.2415 | 0.4454 |
| 10.0 µM CLIP (disturb) | 0.3871 | 0.4191 | 0.2997 | 0.1502 | 0.1696 | 0.1813 | 0.2152 |
| 10.0 µM CP-CLIP (disturb) | 0.6340 | 0.3924 | 0.6029 | 0.5010 | 0.3401 | 0.3562 | 0.3499 |
| 10.0 µM CLIP | 0.7079 | 0.7125 | 0.5729 | 0.6616 | 0.3300 | 0.3685 | 0.5673 |
| 10.0 µM CP-CLIP | 0.7895 | 0.6469 | 0.6500 | 0.5874 | 0.5133 | 0.4469 | 0.3886 |

Table 19: Time Comparison (clean vs disturb prompt): CLIP vs CP-CLIP

| Time | Ixabepilone | Methoxsalen | Sulfinpyrazone | Triamterene | Miconazole | Ceritinib | Acetohexamide |
|---|---|---|---|---|---|---|---|
| 24h CLIP (disturb) | 0.7513 | 0.5903 | 0.6473 | 0.7212 | 0.6882 | 0.6319 | 0.7650 |
| 24h CP-CLIP (disturb) | 0.9600 | 0.9556 | 0.9101 | 0.8844 | 0.9309 | 0.9323 | 0.9309 |
| 24h CLIP | 0.9950 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9950 | 1.0000 |
| 24h CP-CLIP | 0.9980 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| 48h CLIP (disturb) | 0.7586 | 0.6829 | 0.6218 | 0.6979 | 0.6322 | 0.6927 | 0.7650 |
| 48h CP-CLIP (disturb) | 0.9600 | 0.9592 | 0.9238 | 0.8856 | 0.9340 | 0.9375 | 0.9387 |
| 48h CLIP | 0.9950 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9950 | 1.0000 |
| 48h CP-CLIP | 0.9980 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |

Table 20: Summary of Concentration Classification Performance: CP-CLIP vs CLIP under clean vs. disturbed prompts.

| Compound | CLIP Clean | CLIP Disturbed | CLIP Δ | CP-CLIP Clean | CP-CLIP Disturbed | CP-CLIP Δ |
|---|---|---|---|---|---|---|
| firocoxib | 0.5637 | 0.4163 | -0.1474 | 0.6025 | 0.5038 | -0.0987 |
| opicapone | 0.4244 | 0.2831 | -0.1413 | 0.4631 | 0.3956 | -0.0675 |
| cinoxacin | 0.3962 | 0.2875 | -0.1087 | 0.4500 | 0.4231 | -0.0269 |
| neratinib | 0.5131 | 0.2875 | -0.2256 | 0.5269 | 0.4950 | -0.0319 |
| hydroflumethiazide | 0.3669 | 0.2025 | -0.1644 | 0.3350 | 0.3650 | +0.0300 |
| acetaminophen | 0.3531 | 0.2381 | -0.1150 | 0.3762 | 0.3844 | +0.0082 |
| primidone | 0.3550 | 0.2350 | -0.1200 | 0.3588 | 0.3556 | -0.0032 |

CP-CLIP ↑ over CLIP (Clean): +7.83% Accuracy, +7.72% F1-Score

CP-CLIP ↑ over CLIP (Disturbed): +54.05% Accuracy, +50.55% F1-Score

Table 21: Summary of Time Classification Performance: CP-CLIP vs CLIP under clean vs. disturbed prompts.

| Compound | CLIP Clean | CLIP Disturbed | CLIP Δ | CP-CLIP Clean | CP-CLIP Disturbed | CP-CLIP Δ |
|---|---|---|---|---|---|---|
| ixabepilone | 0.9950 | 0.7550 | -0.2400 | 0.9975 | 0.9600 | -0.0375 |
| methoxsalen | 1.0000 | 0.6425 | -0.3575 | 1.0000 | 0.9575 | -0.0425 |
| sulfinpyrazone | 1.0000 | 0.6350 | -0.3650 | 1.0000 | 0.9175 | -0.0825 |
| triamterene | 1.0000 | 0.7100 | -0.2900 | 1.0000 | 0.8850 | -0.1150 |
| miconazole | 1.0000 | 0.6625 | -0.3375 | 1.0000 | 0.9325 | -0.0675 |
| ceritinib | 0.9950 | 0.6650 | -0.3300 | 1.0000 | 0.9350 | -0.0650 |
| acetohexamide | 1.0000 | 0.7650 | -0.2350 | 1.0000 | 0.9350 | -0.0650 |

CP-CLIP ↑ over CLIP (Clean): +0.08% Accuracy, +0.08% F1-Score

CP-CLIP ↑ over CLIP (Disturbed): +34.91% Accuracy, +35.23% F1-Score

This setup isolates the model’s reliance on different types of metadata and tests whether it can still make accurate predictions under misleading or noisy context. For the concentration classification task (Table [20](https://arxiv.org/html/2606.03435#A19.T20)), CP-CLIP achieves a +7.72% F1-score improvement over CLIP under clean prompts, and a +50.55% F1-score gain under disturbed prompts. Similarly, for the time classification task (Table [21](https://arxiv.org/html/2606.03435#A19.T21)), CP-CLIP maintains stable performance with only marginal degradation under disturbed prompts, achieving a +35.23% F1-score improvement over CLIP. The detailed scores for each category are available in Table [18](https://arxiv.org/html/2606.03435#A19.T18) and Table [19](https://arxiv.org/html/2606.03435#A19.T19). In contrast, CLIP exhibits substantial performance drops in disturbed scenarios, suggesting that it is more susceptible to spurious correlations in the metadata.

These results confirm that CP-CLIP is not learning shortcuts based on metadata correlations but is instead capturing robust multimodal associations between visual morphology and semantic input prompts. The counterfactual and contextual masking experiments effectively disentangle the various sources of information, demonstrating that CP-CLIP can generalize to various classification tasks even under adversarial or misleading contextual conditions.

## Appendix T Retrieval Performance Benchmarks

In addition to the classification accuracy metrics presented in Table [2](https://arxiv.org/html/2606.03435#S3.T2) and Table [3](https://arxiv.org/html/2606.03435#S3.T3), which are used for benchmarking in-distribution performance and out-of-distribution performance across various conditions (e.g., cell line, channel, and compound), we further report retrieval-based metrics to evaluate the effectiveness of contrastive learning models.

Specifically, we use Recall@K and Mean Reciprocal Rank \(MRR\) on a held\-out validation set to assess both text\-to\-image \(T→I\) and image\-to\-text \(I→T\) retrieval performance\. These metrics provide a complementary perspective on model alignment quality between visual and textual modalities, particularly in settings where ranking\-based retrieval is desirable\.
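As a minimal sketch of how Recall@K and MRR can be computed from a query-by-candidate similarity matrix, consider the example below. The one-to-one pairing (query i matches candidate i), the toy similarity values, the optimistic handling of ties, and the helper names are assumptions for illustration, not the paper's evaluation code.

```python
from typing import List, Sequence


def rank_of_match(scores: Sequence[float], target: int) -> int:
    """1-based rank of the true match when candidates are sorted by
    descending similarity. Ties are resolved in favor of the target."""
    target_score = scores[target]
    return 1 + sum(1 for s in scores if s > target_score)


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Percentage of queries whose true match appears in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def retrieval_ranks(sim: Sequence[Sequence[float]]) -> List[int]:
    """Rank of the diagonal entry in each row (query i <-> candidate i)."""
    return [rank_of_match(row, i) for i, row in enumerate(sim)]


# Toy similarity matrix: rows are text queries, columns are images.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]

# Text-to-image uses the rows; image-to-text uses the transposed matrix.
t2i_ranks = retrieval_ranks(sim)
i2t_ranks = retrieval_ranks([list(col) for col in zip(*sim)])

if __name__ == "__main__":
    for name, ranks in (("T->I", t2i_ranks), ("I->T", i2t_ranks)):
        print(name, recall_at_k(ranks, 1), mean_reciprocal_rank(ranks))
```

In a real evaluation, one context prompt may correspond to many images, in which case the rank of the best-scoring correct candidate (or another convention) would need to replace the diagonal assumption used here.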

Table [23](https://arxiv.org/html/2606.03435#A20.T23) summarizes the retrieval performance of several contrastive models, including CLIP, SigLIP, and variants of our proposed CP-CLIP, under both text-to-image and image-to-text settings. To provide a more complete picture, we also include the retrieval performance of CLOOME, shown in Table [22](https://arxiv.org/html/2606.03435#A20.T22). While CLOOME is not a general-purpose contrastive model, we report its performance on molecule-image and image-molecule retrieval tasks for completeness. Notably, CLOOME's retrieval is limited to molecular inputs and is not applicable to cell-line or other biological metadata queries.

Table 22: Performance of CLOOME on Molecule-Image and Image-Molecule Retrieval Tasks.

| Model | Direction | R@1 | R@5 | R@10 | R@20 | R@50 | MRR |
|---|---|---|---|---|---|---|---|
| CLOOME | Molecule-Image | 95.58 | 99.94 | 99.94 | 99.94 | 99.94 | 0.9754 |
| CLOOME | Image-Molecule | 59.86 | 59.98 | 60.52 | 65.59 | 71.89 | 0.6063 |

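Table 23: Context-to-Image (T-I) and Image-to-Context (I-T) retrieval performance using Recall@K.

Context-Image:

| Model | R@1 | R@5 | R@10 | R@20 | R@50 | MRR |
|---|---|---|---|---|---|---|
| CLIP ViT-B/16 | 66.8 | 90.80 | 95.40 | 97.89 | 99.38 | 0.7719 |
| SigLIP-ViT-B/16 | 43.85 | 67.38 | 77.44 | 85.75 | 93.19 | 0.5457 |
| SigLIP-ViT-B/16 (D) | 25.93 | 52.21 | 61.37 | 71.45 | 85.38 | 0.3842 |
| CP-CLIP ViT-B/16 (fp) | 72.97 | 93.96 | 97.47 | 99.10 | 99.86 | 0.8213 |
| CP-CLIP ViT-B/16 (D) | 77.09 | 94.69 | 97.87 | 99.21 | 99.74 | 0.8479 |
| CP-CLIP ViT-L/16 (D) | 73.83 | 92.93 | 96.44 | 98.49 | 99.61 | 0.8215 |

Image-Context:

| Model | R@1 | R@5 | R@10 | R@20 | R@50 | MRR |
|---|---|---|---|---|---|---|
| CLIP ViT-B/16 | 58.55 | 80.67 | 86.54 | 91.71 | 96.85 | 0.6820 |
| SigLIP-ViT-B/16 | 38.65 | 56.15 | 63.86 | 71.56 | 81.92 | 0.4700 |
| SigLIP-ViT-B/16 (D) | 20.71 | 39.87 | 48.83 | 59.65 | 74.33 | 0.3015 |
| CP-CLIP ViT-B/16 (fp) | 64.20 | 86.55 | 91.99 | 95.73 | 98.75 | 0.7385 |
| CP-CLIP ViT-B/16 (D) | 68.92 | 87.77 | 92.14 | 95.56 | 98.55 | 0.7716 |
| CP-CLIP ViT-L/16 (D) | 64.77 | 85.02 | 90.13 | 93.83 | 97.52 | 0.7351 |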