3D Masked Autoencoders are Robust Learners of Volumetric and Multimodal Cellular Representations for Microscopy


Summary

This paper presents 3D masked autoencoders for volumetric microscopy data, demonstrating that 3D modeling outperforms 2D max-projection and slice-based variants on downstream single-cell tasks, with cross-modal alignment to a protein language model further improving performance.

arXiv:2606.23964v1

# 3D Masked Autoencoders are Robust Learners of Volumetric and Multimodal Cellular Representations for Microscopy
Source: [https://arxiv.org/html/2606.23964](https://arxiv.org/html/2606.23964)
1. Institute of AI for Health & Helmholtz AI, Computational Health Center, Helmholtz Munich – German Research Center for Environmental Health, Neuherberg, Germany
2. Department of Medicine III, Ludwig-Maximilian-University Hospital, Munich, Germany
3. Department of Physics, Ludwig-Maximilian-University, Munich, Germany
4. German Cancer Consortium (DKTK), partner site Munich, Germany
5. Munich Center for Machine Learning (MCML), Munich, Germany

###### Abstract

Self-supervised learning in fluorescence microscopy often relies on 2D projections, despite the inherently three-dimensional nature of cells. We present a systematic comparison of 2D and 3D masked autoencoders (MAE-2D vs. MAE-3D) on volumetric microscopy data. Under matched architectures and training protocols, MAE-3D consistently outperforms 2D max-projection and slice-based variants on downstream single-cell tasks. We further align visual representations with a pretrained protein language model (ESM2) and show that cross-modal supervision yields larger gains for volumetric models. Channel cross-attention and frequency-domain regularization are critical for leveraging 3D spatial context. On a protein–protein interaction task, MAE-3D achieves a ROC–AUC of 0.865, outperforming prior methods by up to +0.025. For protein localization, our best 3D model attains state-of-the-art AUC$_{\text{micro}}$ (0.952) and F1$_{\text{micro}}$ (0.742), improving over previous approaches by +0.003 and +0.010 absolute, respectively. Overall, these results demonstrate the advantages of native 3D modeling and multimodal alignment for representation learning in single-cell microscopy.

Corresponding authors: {amirhossein.kardoost, tingying.peng, carsten.marr}@helmholtz-munich.de

## 1 Introduction

Cells constitute the fundamental building blocks of tissues and organs. Their function is tightly linked to subcellular structure and spatial organization. Understanding cellular organization remains a central challenge in biology. Despite extensive studies \[[12](https://arxiv.org/html/2606.23964#bib.bib9),[14](https://arxiv.org/html/2606.23964#bib.bib17),[6](https://arxiv.org/html/2606.23964#bib.bib6),[8](https://arxiv.org/html/2606.23964#bib.bib5),[1](https://arxiv.org/html/2606.23964#bib.bib19)\], deciphering subcellular architecture and protein localization remains complex, particularly in high-dimensional imaging data. Fluorescence microscopy \[[5](https://arxiv.org/html/2606.23964#bib.bib2),[2](https://arxiv.org/html/2606.23964#bib.bib1),[20](https://arxiv.org/html/2606.23964#bib.bib3)\] enables visualization of intracellular structures by tagging proteins and organelles with fluorescent markers. Large-scale resources such as JUMP \[[2](https://arxiv.org/html/2606.23964#bib.bib1)\], OpenCell \[[5](https://arxiv.org/html/2606.23964#bib.bib2)\], WTC-11 \[[20](https://arxiv.org/html/2606.23964#bib.bib3)\], and the Human Protein Atlas (HPA) \[[16](https://arxiv.org/html/2606.23964#bib.bib4)\] provide multi-channel imaging data capturing rich subcellular organization. Notably, OpenCell and WTC-11 consist of volumetric $z$-stacks. However, many representation learning approaches such as Subcell \[[8](https://arxiv.org/html/2606.23964#bib.bib5)\] and DINO4Cell \[[6](https://arxiv.org/html/2606.23964#bib.bib6)\] operate on 2D projections of these volumes, discarding depth-resolved structural information. We investigate the role of volumetric modeling for the learning of cellular representations. On OpenCell \[[5](https://arxiv.org/html/2606.23964#bib.bib2)\], we systematically compare 2D and 3D masked autoencoder (MAE) \[[22](https://arxiv.org/html/2606.23964#bib.bib7),[12](https://arxiv.org/html/2606.23964#bib.bib9)\] models.
We demonstrate that preserving full 3D structure yields more informative representations and consistently improves downstream performance compared to 2D max-projection and even slice-based inputs. Beyond purely visual modeling, we explore multimodal integration by incorporating protein sequence information via a pretrained protein language model (PLM) such as ESM2 \[[13](https://arxiv.org/html/2606.23964#bib.bib8)\]. By aligning image features with protein embeddings, we infuse the representation space with biologically grounded structural priors. We demonstrate that sequence-level supervision enhances representation quality, particularly when coupled with volumetric modeling.

Our contributions are threefold: (1) We demonstrate that 3D MAE models outperform 2D counterparts across two downstream tasks. (2) We show that channel cross-attention and frequency-domain (FFT) regularization further enhance volumetric representation learning. (3) We establish that integrating protein language models into the visual framework improves representation quality and downstream performance, highlighting the benefit of multimodal alignment for cellular imaging. Code is available at [https://github.com/marrlab/mae3d-opencell](https://github.com/marrlab/mae3d-opencell).

## 2 Related Work

Fluorescence imaging of proteins, combined with DNA or membrane reference markers, enables single-cell analysis of protein localization and function \[[5](https://arxiv.org/html/2606.23964#bib.bib2),[16](https://arxiv.org/html/2606.23964#bib.bib4),[11](https://arxiv.org/html/2606.23964#bib.bib10)\]. OpenCell \[[5](https://arxiv.org/html/2606.23964#bib.bib2)\] contains 1,310 endogenously tagged human proteins and 29,922 experimentally measured protein–protein interactions, acquired as high-resolution 3D $z$-stacks. Its protein diversity and volumetric imaging make it well suited for studying 3D representation learning and integration of sequence-level embeddings that encode structural information. The WTC-11 dataset \[[20](https://arxiv.org/html/2606.23964#bib.bib3)\] contains 3D fluorescence images of 25 endogenously tagged proteins (cellular structures), captured with DNA and membrane reference channels, and supports tasks such as protein localization and cell cycle stage classification. Compared to OpenCell, WTC-11 exhibits substantially lower protein diversity (25 vs. 1,310 proteins), focusing instead on detailed structural characterization across single cells. Cytoself \[[11](https://arxiv.org/html/2606.23964#bib.bib10)\] trains a vector-quantized variational autoencoder \[[18](https://arxiv.org/html/2606.23964#bib.bib23)\] on 2D max-projection images with protein identity conditioning, demonstrating that the learned representations cluster according to subcellular localization. DINO4Cell \[[6](https://arxiv.org/html/2606.23964#bib.bib6)\] applies self-supervised DINO training \[[3](https://arxiv.org/html/2606.23964#bib.bib12)\] to 2D max-projection images from WTC-11 and evaluates the learned representations on downstream tasks such as protein localization and cell-cycle prediction.
Subcell \[[8](https://arxiv.org/html/2606.23964#bib.bib5)\] learns representations from HPA \[[16](https://arxiv.org/html/2606.23964#bib.bib4)\] images using masked reconstruction and similarity-based objectives, and shows that image-derived features complement protein sequence and structure embeddings (e.g., ESM2 \[[13](https://arxiv.org/html/2606.23964#bib.bib8)\]), which are integrated via a second-stage multimodal model. Existing image-based representation learning methods largely operate on 2D images or 2D projections of volumetric data. While 2D architectures can be extended to 3D via slice-wise processing \[[21](https://arxiv.org/html/2606.23964#bib.bib11)\], we find that native 3D representation learning consistently outperforms both max-projection and slice-based approaches. Our model builds on SelfMedMAE \[[22](https://arxiv.org/html/2606.23964#bib.bib7)\], a 3D masked autoencoder originally developed for medical imaging, which we adapt to multi-channel fluorescence microscopy and further enhance with channel cross-attention and 3D frequency-domain regularization. Inspired by Subcell \[[8](https://arxiv.org/html/2606.23964#bib.bib5)\], we further improve representation learning through alignment with protein sequence embeddings derived from ESM2 \[[13](https://arxiv.org/html/2606.23964#bib.bib8)\]. In contrast to Subcell, which learns a joint image–sequence representation in a second stage, we retain a purely image-based encoder and incorporate ESM2 during training via two mechanisms: (i) conditioning the decoder with sequence tokens for masked reconstruction, and (ii) aligning image and sequence embeddings with a symmetric InfoNCE \[[4](https://arxiv.org/html/2606.23964#bib.bib18)\] objective.

![Refer to caption](https://arxiv.org/html/2606.23964v1/images/PLM.png)

Figure 1: The MAE-3D⋆ model integrates protein representation by applying a contrastive $\mathcal{L}_{\mathrm{CLIP}}$ loss between projected image tokens from masked 3D OpenCell volumes and the corresponding ESM2 embedding \[[13](https://arxiv.org/html/2606.23964#bib.bib8)\]. The ESM2 token is additionally fed to the decoder to guide reconstruction.

## 3 Methodology

The proposed model is based on the Masked Autoencoder (MAE) framework in both 2D and 3D \[[9](https://arxiv.org/html/2606.23964#bib.bib13),[22](https://arxiv.org/html/2606.23964#bib.bib7)\]. The input is a $z$-stack volume $\mathbf{I} \in \mathbb{R}^{C \times Z \times X \times Y}$, where $C$, $Z$, $X$, and $Y$ denote the number of channels, depth, width, and height, respectively. For the 2D variant (MAE-2D$_{\text{base}}$), the volume is collapsed via a maximum-intensity projection along the z-axis, giving $\mathbf{I}_{2D} \in \mathbb{R}^{C \times X \times Y}$. The $\mathbf{I}_{2D}$ image is split into non-overlapping $p_x \times p_y$ patches. Patch embedding is performed using a convolutional layer followed by 2D sinusoidal positional embeddings. A fraction $m$ of patches is randomly masked, and an encoder processes the visible patches. The latent representations are combined with mask tokens and passed to a decoder for reconstruction. For the 3D variant (MAE-3D$_{\text{base}}$), the depth dimension is retained, and the full $\mathbf{I}_{3D}$ volume is divided into non-overlapping patches of size $p_z \times p_x \times p_y$. Patch embedding is implemented via a 3D convolutional layer followed by 3D sinusoidal positional embeddings. Masking and encoding follow the same procedure as in the 2D setting. The model is trained via mean squared error (MSE) loss over the masked patches. In both models, all channels are processed jointly.
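
The patching and masking steps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `patchify_3d` and `random_mask` are hypothetical, the learned convolutional patch embedding is replaced by a plain reshape, and the toy volume is much smaller than a real crop. The last lines work out the token count implied by the crop size and patch sizes reported in this paper.

```python
import numpy as np

def patchify_3d(volume, pz, px, py):
    """Split a (C, Z, X, Y) volume into non-overlapping patches.

    Returns an array of shape (num_patches, C * pz * px * py); all channels
    of one spatial location end up in the same patch (joint processing).
    """
    c, z, x, y = volume.shape
    assert z % pz == 0 and x % px == 0 and y % py == 0, "dims must divide evenly"
    gz, gx, gy = z // pz, x // px, y // py
    v = volume.reshape(c, gz, pz, gx, px, gy, py)
    # Patch-grid axes first, patch contents (channel and voxels) last.
    v = v.transpose(1, 3, 5, 0, 2, 4, 6)
    return v.reshape(gz * gx * gy, c * pz * px * py)

def random_mask(num_patches, mask_ratio, rng):
    """Return sorted index arrays (visible, masked) for one sample."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    return np.sort(perm[num_masked:]), np.sort(perm[:num_masked])

rng = np.random.default_rng(0)
volume = rng.random((2, 20, 16, 16)).astype(np.float32)  # toy (C, Z, X, Y) volume
patches = patchify_3d(volume, pz=10, px=8, py=8)
visible, masked = random_mask(len(patches), mask_ratio=0.75, rng=rng)

# Token count for a 100 x 176 x 176 crop with 10 x 8 x 8 patches.
tokens_3d = (100 // 10) * (176 // 8) * (176 // 8)        # 10 * 22 * 22 = 4840
visible_3d = tokens_3d - int(round(tokens_3d * 0.75))    # 1210 visible at 75% masking
```

Under these settings the encoder of the 3D variant would see 1,210 of 4,840 tokens per volume, whereas a max-projected 2D input yields 22 × 22 = 484 patches, of which 121 remain visible.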

### 3.1 Channel Cross-Attention (CCA)

The base models are extended to a dual-stream encoder–decoder to explicitly model inter-channel interactions \[[1](https://arxiv.org/html/2606.23964#bib.bib19)\]. Each channel is processed as a separate token stream with cross-attention between channels. A shared random masking pattern is applied identically across channels to prevent trivial reconstruction from unmasked positions in the complementary channel. Accordingly, patch embedding is performed independently for each channel. Within each channel, standard multi-head self-attention is applied over all visible tokens, capturing global spatial context. In addition, position-wise cross-attention \[[19](https://arxiv.org/html/2606.23964#bib.bib20)\] is introduced: a token at spatial index $i$ in channel $c_i$ queries only the token at the same spatial index in channel $c_j$ ($i \neq j$). Since there is exactly one key per query position, a softmax over that single key always yields a weight of 1 and carries no information. Therefore, softmax is replaced with a sigmoid gating mechanism in the attention computation \[[19](https://arxiv.org/html/2606.23964#bib.bib20)\]. The decoder mirrors the encoder architecture. Models with channel cross-attention are denoted as CCA.
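
A single-head NumPy sketch of position-wise cross-attention with a sigmoid gate is given below. It is an illustration under assumptions rather than the released model: the projection matrices `w_q`, `w_k`, `w_v`, the scaled dot-product score, and the function name are all hypothetical choices, and multi-head splitting, residual connections, and normalization are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def positionwise_cross_attention(tokens_a, tokens_b, w_q, w_k, w_v):
    """Gate the value of channel b by its agreement with channel a.

    tokens_a, tokens_b: (N, D) visible tokens of two channels that share
    the same masking pattern, so row i refers to the same spatial patch.
    """
    q = tokens_a @ w_q
    k = tokens_b @ w_k
    v = tokens_b @ w_v
    d = q.shape[-1]
    # One score per position: token i only sees token i of the other channel.
    scores = np.sum(q * k, axis=-1, keepdims=True) / np.sqrt(d)
    # A softmax over a single key is always 1, so a sigmoid gate is used.
    gate = sigmoid(scores)
    return gate * v

rng = np.random.default_rng(0)
n_tokens, dim = 5, 8
protein_tokens = rng.normal(size=(n_tokens, dim))
nucleus_tokens = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
update = positionwise_cross_attention(protein_tokens, nucleus_tokens, w_q, w_k, w_v)
```

Unlike a softmax weight, the gate can take any value in (0, 1), so the model can suppress or pass the complementary channel at each position independently.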

### 3.2 FFT Loss

An FFT-based \[[10](https://arxiv.org/html/2606.23964#bib.bib14)\] frequency loss is applied to the fully reconstructed output. The loss is computed per channel $c$ on the reconstructed 2D (MAE-2D) or 3D (MAE-3D) image to preserve fine subcellular structures. While this frequency-domain loss was introduced for 2D reconstruction in \[[12](https://arxiv.org/html/2606.23964#bib.bib9)\], we extend it to the 3D setting. The frequency-domain loss is defined as

$$\mathcal{L}_{\mathrm{FFT}} = \frac{1}{2} \sum_{c} L_{1}\!\left(\log\!\left(1+\bigl\lvert\mathcal{F}(\hat{I}^{c})\bigr\rvert\right), \log\!\left(1+\bigl\lvert\mathcal{F}(I^{c})\bigr\rvert\right)\right) \tag{1}$$

where $I^{c}$ is the original and $\hat{I}^{c}$ is the reconstructed image at channel $c$, $\mathcal{F}$ denotes the $N$-dimensional discrete Fourier transform with orthonormal normalization, $|\cdot|$ is the magnitude spectrum, and $\log(1+\cdot)$ compresses the dynamic range. The $L_{1}$ distance is preferred over $L_{2}$ for robustness to frequency-domain outliers, and only the magnitude spectrum is used. For MAE-2D, the transform is applied over the two spatial axes, whereas for MAE-3D it is computed over all three volumetric axes. The FFT loss is combined with the MSE loss using a weighting factor $w_{\mathrm{FFT}}$, which is set to zero during warm-up and linearly increased to its target value during a ramp-up phase, stabilizing early training and introducing the frequency constraint once reconstructions become structurally meaningful.
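
Eq. (1) can be sketched in a few lines of NumPy. The sketch assumes that the $L_1$ term is a mean absolute difference over frequency bins (the reduction is not specified above) and uses the hypothetical function name `fft_loss`; the same function handles the 2D and 3D cases because `np.fft.fftn` transforms over all axes of the per-channel array.

```python
import numpy as np

def fft_loss(recon, target):
    """Log-magnitude spectrum L1 loss, summed over channels and halved.

    recon, target: arrays of shape (C, X, Y) for 2D or (C, Z, X, Y) for 3D.
    """
    total = 0.0
    for c in range(recon.shape[0]):
        # Orthonormal N-dimensional DFT over all spatial axes of this channel.
        mag_hat = np.abs(np.fft.fftn(recon[c], norm="ortho"))
        mag = np.abs(np.fft.fftn(target[c], norm="ortho"))
        # log(1 + |F|) compresses the dynamic range; mean |.| is the assumed L1.
        total += np.mean(np.abs(np.log1p(mag_hat) - np.log1p(mag)))
    return 0.5 * total

rng = np.random.default_rng(0)
target = rng.random((2, 8, 16, 16))                    # toy two-channel volume
recon = target + 0.1 * rng.normal(size=target.shape)   # noisy "reconstruction"
loss_3d = fft_loss(recon, target)
loss_2d = fft_loss(recon.max(axis=1), target.max(axis=1))  # on max projections
```

Because only magnitudes enter the loss, two images that differ by a circular shift receive zero penalty from this term; the MSE term remains responsible for spatial placement.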

### 3.3 Multimodal Alignment with ESM2

To model alignment between protein sequence information and cellular morphology, protein embeddings are injected into the MAE decoder, and image and sequence embeddings are aligned with a contrastive objective \[[4](https://arxiv.org/html/2606.23964#bib.bib18)\]. The encoder architecture remains unchanged. The pretrained ESM2 \[[13](https://arxiv.org/html/2606.23964#bib.bib8)\] protein language model is kept frozen during MAE training and introduced only after stabilization of the reconstruction loss. The ESM2 embedding is projected to the decoder dimension through a learned linear layer, producing a single protein token. This token is inserted into each channel-specific decoder sequence alongside the encoder's visible and mask tokens, using a zero positional embedding because it carries no spatial information (see Figure [1](https://arxiv.org/html/2606.23964#S2.F1)). After decoding, the protein token is discarded prior to image reconstruction. This design allows masked patches to attend to protein context through the decoder's self-attention mechanism. For cross-modal alignment, image embeddings from the encoder are projected to the ESM2 embedding dimension and a symmetric InfoNCE \[[4](https://arxiv.org/html/2606.23964#bib.bib18)\] objective with cosine similarity is applied:

$$\mathcal{L}_{\mathrm{CLIP}} = \frac{1}{2}\Big(\mathrm{CE}(\tau\mathbf{E}_{I}\mathbf{E}_{P}^{\top}, \mathrm{diag}) + \mathrm{CE}(\tau\mathbf{E}_{P}\mathbf{E}_{I}^{\top}, \mathrm{diag})\Big) \tag{2}$$

where $\mathbf{E}_{I}, \mathbf{E}_{P}$ denote the normalized image and protein projections, $\tau$ is a learnable temperature parameter, and the target corresponds to matching diagonal pairs within the batch. The cross-entropy (CE) terms enforce bidirectional alignment between image and protein modalities. The final loss becomes

$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + w_{\mathrm{FFT}}\,\mathcal{L}_{\mathrm{FFT}} + w_{\mathrm{CLIP}}\,\mathcal{L}_{\mathrm{CLIP}}. \tag{3}$$

Protein–image alignment is computed only on visible tokens, making the contrastive objective challenging and promoting robust representations. Models incorporating CCA, FFT, and ESM2 are denoted as MAE-2D⋆ and MAE-3D⋆.
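
A compact NumPy sketch of Eqs. (2) and (3) follows. The names `clip_loss`, `total_loss`, and `scale` are hypothetical. The sketch follows Eq. (2) literally by multiplying the cosine-similarity matrix by a scalar; how the reported temperature value maps onto that multiplier (for example as a divisor, as is common in CLIP-style code) is not stated in the text, so `scale` is left as a free argument.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax with the usual max-shift for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def clip_loss(img, prot, scale):
    """Symmetric InfoNCE over a batch of paired (image, protein) projections.

    img, prot: (B, D) arrays; row i of each belongs to the same protein.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    prot = prot / np.linalg.norm(prot, axis=1, keepdims=True)
    logits = scale * img @ prot.T              # scaled cosine similarities
    diag = np.arange(len(img))                 # matching pairs lie on the diagonal
    loss_i2p = -log_softmax(logits)[diag, diag].mean()
    loss_p2i = -log_softmax(logits.T)[diag, diag].mean()
    return 0.5 * (loss_i2p + loss_p2i)

def total_loss(l_mse, l_fft, l_clip, w_fft, w_clip):
    """Weighted sum of the three terms, as in Eq. (3)."""
    return l_mse + w_fft * l_fft + w_clip * l_clip

batch = 4
aligned = np.eye(batch)                        # perfectly separable toy pairs
loss_aligned = clip_loss(aligned, aligned, scale=50.0)
loss_uninformative = clip_loss(aligned, aligned, scale=0.0)
```

With a zero multiplier every pair looks equally likely and the loss equals $\log B$ for batch size $B$; with well-separated pairs and a large multiplier it approaches zero.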

Implementation Details. MAE-2D and MAE-3D use a ViT \[[7](https://arxiv.org/html/2606.23964#bib.bib21)\] backbone with encoder/decoder dimensions 384/192, comprising 6 encoder and 4 decoder layers with 6 attention heads each. Patch sizes are $p_x = p_y = 8$ in-plane and $p_z = 10$ along the z-axis for MAE-3D. Models are trained for 10 epochs with 4-epoch linear warm-up; the FFT loss weight is set to $w_{\mathrm{FFT}} = 0.1$ and ramped up over 2 epochs, with gradient clipping at 0.5. For protein alignment, a learning rate of $1 \times 10^{-5}$ and a projection dimension of 1280 (matching ESM2 \[[13](https://arxiv.org/html/2606.23964#bib.bib8)\]) are used. Training runs for 5 epochs with $w_{\mathrm{CLIP}} = 1.0$ and temperature $\tau = 0.07$, ramped up over 1 epoch; the decoder is frozen during the first warm-up epoch. All experiments are conducted on a single NVIDIA A100 80GB GPU.
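
The warm-up and ramp-up of the FFT weight can be written as a small schedule function. This is a sketch under two assumptions that the text does not confirm: that the zero-weight phase coincides with the 4-epoch warm-up, and that the weight is updated once per epoch rather than per step. The function name `ramped_weight` is hypothetical.

```python
def ramped_weight(epoch, warmup_epochs, ramp_epochs, target):
    """Loss weight that is 0 during warm-up, then rises linearly to `target`.

    `ramp_epochs` must be positive.
    """
    if epoch < warmup_epochs:
        return 0.0
    progress = (epoch - warmup_epochs) / ramp_epochs
    return target * min(1.0, progress)

# Assumed mapping of the reported settings: 10 epochs, 4 warm-up, 2 ramp-up.
schedule = [ramped_weight(e, warmup_epochs=4, ramp_epochs=2, target=0.1)
            for e in range(10)]
```

Under these assumptions the weight stays at zero through epoch 4, reaches half of its target at epoch 5, and is held at 0.1 from epoch 6 onward.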

## 4 Experiments and Results

The proposed method is evaluated on the OpenCell dataset \[[5](https://arxiv.org/html/2606.23964#bib.bib2)\]. It is a large fluorescence microscopy dataset of human cells with endogenously tagged proteins. Each sample is a two-channel 3D $z$-stack (protein and nucleus). The dataset comprises 6,301 volumes from 1,310 proteins across 17 subcellular localization categories, with multiple cells per field of view. Single-cell crops are obtained using Cellpose \[[15](https://arxiv.org/html/2606.23964#bib.bib24)\] on 2D maximum-intensity projections and extracted from the corresponding 3D volumes, yielding 70,313 cells. All crops are standardized to $100 \times 176 \times 176$ pixels at $0.2\,\mu\text{m}$ isotropic resolution, where 100 denotes the number of z-slices. The learned representations are evaluated on two downstream tasks. Protein localization: Linear probing is performed over 17 compartments using frozen representations. Protein–protein interaction: Binary classification of interacting protein pairs. A two-layer MLP head (512 $\rightarrow$ 128) is trained on frozen embeddings. All downstream models are trained for 100 epochs with a learning rate of $1 \times 10^{-4}$. Protein sequences are retrieved from UniProt \[[17](https://arxiv.org/html/2606.23964#bib.bib22)\] using Ensembl IDs provided on OpenCell. Sequence embeddings are obtained from the pretrained ESM2-650M \[[13](https://arxiv.org/html/2606.23964#bib.bib8)\] model by mean pooling the final-layer token representations. All experiments use five-fold cross-validation with protein-level splits to prevent data leakage.
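
Protein-level splitting means that all cells of a given protein fall into the same fold, so no protein seen in training appears in the held-out set. The sketch below uses only the Python standard library; the function name `protein_level_folds`, the round-robin assignment, and the placeholder labels `P0` to `P6` are illustrative assumptions, not the authors' split procedure.

```python
import random
from collections import defaultdict

def protein_level_folds(cell_proteins, n_folds=5, seed=0):
    """Assign each cell index to a fold so that no protein spans two folds.

    cell_proteins: list where entry i is the protein label of cell i.
    Returns a list of n_folds lists of cell indices.
    """
    proteins = sorted(set(cell_proteins))
    random.Random(seed).shuffle(proteins)
    # Round-robin over shuffled proteins keeps fold sizes roughly balanced.
    fold_of_protein = {p: i % n_folds for i, p in enumerate(proteins)}
    folds = defaultdict(list)
    for cell_idx, protein in enumerate(cell_proteins):
        folds[fold_of_protein[protein]].append(cell_idx)
    return [folds[f] for f in range(n_folds)]

# Toy example with placeholder protein labels (not real OpenCell entries).
cell_proteins = ["P0", "P0", "P1", "P2", "P2", "P3", "P4", "P5", "P5", "P6"]
folds = protein_level_folds(cell_proteins, n_folds=5, seed=0)
```

A grouped splitter such as scikit-learn's `GroupKFold`, with the protein label passed as the group, would serve the same purpose.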

### 4.1 MAE-2D$_{\text{base}}$ and MAE-3D$_{\text{base}}$ Comparison

![Refer to caption](https://arxiv.org/html/2606.23964v1/images/maskratio_2d_vs_3d.png)

Figure 2: On the protein localization task, MAE-3D$_{\text{base}}$ outperforms MAE-2D$_{\text{base}}$ across all mask ratios and evaluation metrics on the OpenCell dataset.

To compare base models and select an appropriate mask ratio for MAE pretraining, experiments are conducted on a single fold (Fold-1). Mask ratios of 70%–90% are evaluated for protein localization. As shown in Fig. [2](https://arxiv.org/html/2606.23964#S4.F2), MAE-3D$_{\text{base}}$ consistently outperforms MAE-2D$_{\text{base}}$ across all mask ratios, likely due to its access to full volumetric context. A mask ratio of 75% achieves the best performance for MAE-3D$_{\text{base}}$ and is adopted for all subsequent experiments. For a stricter comparison, both models are restricted to slices 45–55 per volume with 75% masking. The 2D model is trained on individual slices per volume, requiring separate processing of each slice, which increases training time by approximately 10$\times$ for slices 45–55 and makes slice-based training substantially less computationally efficient than MAE-3D. Slice embeddings are averaged at inference to obtain a volume-level representation. Despite this, MAE-3D$_{\text{base}}$ still outperforms MAE-2D$_{\text{base}}$ (Table [1](https://arxiv.org/html/2606.23964#S4.T1)). Furthermore, both models are evaluated on the protein–protein interaction task. MAE-3D$_{\text{base}}$ achieves substantially higher ROC-AUC compared to MAE-2D$_{\text{base}}$. Across proteins, the mean ROC-AUC improves from approximately 0.78 (2D) to 0.84 (3D). These results further confirm the benefit of modeling full volumetric context.

Table 1: MAE-3D$_{\text{base}}$ still outperforms MAE-2D$_{\text{base}}$ in protein localization when both models are restricted to slices 45–55.

Table 2: On the protein localization task, MAE-3D⋆ outperforms MAE-2D⋆.

### 4.2 Ablation Study

Table [2](https://arxiv.org/html/2606.23964#S4.T2) assesses the impact of channel cross-attention (CCA), FFT loss, and ESM2 alignment on MAE-2D and MAE-3D for protein localization. Overall, MAE-3D⋆ achieves the best performance, with substantially larger gains from ESM2 than the 2D variant, indicating that sequence information better complements preserved 3D morphological context. On the protein–protein interaction (PPI) task, MAE-3D⋆ outperforms MAE-2D⋆, achieving a ROC–AUC of 0.86 ± 0.03 compared to 0.84 ± 0.04 for the 2D variant. The same trend holds without ESM2 (0.86 ± 0.05 vs. 0.83 ± 0.04), indicating that volumetric modeling provides consistent gains for interaction prediction.

![Refer to caption](https://arxiv.org/html/2606.23964v1/images/attention.png)

Figure 3: Integrating ESM2 leads to more precise attention over protein-specific regions, particularly across relevant $z$-stack slices in 3D. Attention is visualized for MAE-2D⋆, MAE-3D⋆ w/o ESM2, and MAE-3D⋆. Per-slice Pearson correlation ($r$) between attention and protein intensity is shown for each slice, where higher values indicate stronger spatial co-localization.

Figure [3](https://arxiv.org/html/2606.23964#S4.F3) shows attention maps for an example protein, ACTB, which is localized to the actin cytoskeleton and cytoplasm. With ESM2 \[[13](https://arxiv.org/html/2606.23964#bib.bib8)\], attention concentrates on protein-relevant structures, whereas without it, focus spreads to non-specific regions, including the nucleus. The 3D model distributes attention consistently across relevant z-slices, reflecting volumetric awareness, while the 2D model concentrates on limited regions in the maximum-intensity projection, lacking depth specificity.
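
The per-slice Pearson correlation reported in the figure can be computed as in the following NumPy sketch. It assumes that the attention map has already been resampled to the same voxel grid as the protein channel, which is not described above, and the function name `per_slice_pearson` is hypothetical.

```python
import numpy as np

def per_slice_pearson(attention, intensity):
    """Pearson r between attention and protein intensity for every z-slice.

    attention, intensity: arrays of shape (Z, X, Y) on the same grid.
    Slices where either map is constant get NaN (correlation undefined).
    """
    r = np.full(attention.shape[0], np.nan)
    for z in range(attention.shape[0]):
        a = attention[z].ravel() - attention[z].mean()
        b = intensity[z].ravel() - intensity[z].mean()
        denom = np.sqrt((a * a).sum() * (b * b).sum())
        if denom > 0:
            r[z] = (a * b).sum() / denom
    return r

rng = np.random.default_rng(0)
intensity = rng.random((3, 8, 8))                       # toy protein channel
attention = np.stack([intensity[0],                     # perfectly co-localized
                      -intensity[1],                    # perfectly anti-correlated
                      np.ones((8, 8))])                 # flat attention
r = per_slice_pearson(attention, intensity)
```

Values near 1 indicate that attention follows the protein signal within a slice, while a flat attention map gives an undefined correlation, which the sketch reports as NaN.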

### 4.3 Comparison with State-of-the-Art Methods

On the OpenCell protein–protein interaction task, MAE-3D⋆ achieves the highest ROC-AUC (0.86 ± 0.03), outperforming Subcell \[[8](https://arxiv.org/html/2606.23964#bib.bib5)\] (0.85 ± 0.04) and DINO4Cell \[[6](https://arxiv.org/html/2606.23964#bib.bib6)\] (0.84 ± 0.05). Performance without ESM2 (0.86 ± 0.05) is comparable with MAE-3D⋆, indicating that sequence-level alignment does not provide additional benefit for protein–protein interaction prediction. For a fair comparison, Subcell and DINO4Cell embeddings are extracted for the same five-fold splits, followed by linear probing, identical to our evaluation protocol. Our best-performing model (MAE-3D⋆) achieves the strongest results despite being trained on the comparatively small OpenCell dataset ($\sim$6k volumes), while Subcell and DINO4Cell are pretrained on large-scale HPA \[[16](https://arxiv.org/html/2606.23964#bib.bib4)\] microscopy datasets containing an order of magnitude more images. These findings underscore the effectiveness of volumetric modeling and cross-modal alignment, particularly in data-limited settings. As shown in Table [3](https://arxiv.org/html/2606.23964#S4.T3), MAE-3D⋆ achieves performance competitive with state-of-the-art methods on the protein localization task.

Table 3: Integrating ESM2 yields consistent gains for MAE-3D⋆, resulting in performance competitive with state-of-the-art methods and achieving the best micro-level scores.

## 5 Conclusion

We systematically compared 2D and 3D MAE-based models for representation learning in volumetric fluorescence microscopy. Across two tasks, native 3D models consistently outperformed 2D variants, demonstrating that preserving full volumetric context yields more discriminative representations. Integrating protein language embeddings (ESM2) further improved performance, with substantially larger gains for 3D models, highlighting a strong synergy between volumetric morphology and protein-level semantics. Overall, our findings emphasize the importance of native 3D modeling and multimodal alignment for robust foundation models in cellular imaging. Future work will explore learning a shared representation space across 2D and 3D datasets to better leverage protein sequence information, as well as extending the framework to subcellular segmentation with a dedicated decoder.


Preprint notice. This is the submitted/pre-peer-review version of the manuscript, with minor editorial corrections. The final authenticated version will be published in the MICCAI 2026 proceedings by Springer Nature. A link to the version of record will be added once it becomes available.

#### 5.0.1 Acknowledgements

C.M. acknowledges support from the European Research Council (ERC; Grant Nos. 866411, 101113551, and 101213822), the High-tech Agenda Bayern, and the Deutsche Forschungsgemeinschaft (DFG, TRR359, Project No. 491676693). A.K. acknowledges computing time on the JUWELS Booster supercomputer operated by the Jülich Supercomputing Centre. L.G. acknowledges support from the Munich School for Data Science (MUDS).

## References

- \[1\]N\. Bourriez, I\. Bendidi, E\. Cohen, G\. Watkinson, M\. Sanchez, G\. Bollot, and A\. Genovesio\(2024\)ChAda\-vit: channel adaptive attention for joint representation learning of heterogeneous microscopy images\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),pp\. 11556–11565\.External Links:[Document](https://dx.doi.org/10.1109/CVPR52733.2024.01098)Cited by:[§1](https://arxiv.org/html/2606.23964#S1.p1.1),[§3\.1](https://arxiv.org/html/2606.23964#S3.SS1.p1.4)\.
- \[2\]M\. Bray, S\. Singh, H\. Han, C\. Davis, B\. Borgeson, C\. Hartland, M\. Kost\-Alimova, S\. M\. Gustafsdottir, C\. C\. Gibson, A\. E\. Carpenter,et al\.\(2023\)The jump cell painting dataset: morphological impact of 136,000 chemical and genetic perturbations\.Nature619,pp\. 151–158\.External Links:[Document](https://dx.doi.org/10.1038/s41586-023-06119-4)Cited by:[§1](https://arxiv.org/html/2606.23964#S1.p1.1)\.
- \[3\]M\. Caron, H\. Touvron, I\. Misra, H\. Jégou, J\. Mairal, P\. Bojanowski, and A\. Joulin\(2021\)Emerging properties in self\-supervised vision transformers\.InProceedings of the 38th International Conference on Machine Learning \(ICML\),pp\. 2186–2205\.External Links:[Link](http://proceedings.mlr.press/v139/caron21a.html)Cited by:[§2](https://arxiv.org/html/2606.23964#S2.p1.1)\.
- \[4\]T\. Chen, S\. Kornblith, M\. Norouzi, and G\. Hinton\(2020\)A simple framework for contrastive learning of visual representations\.InICML,pp\. 1597–1607\.Cited by:[§2](https://arxiv.org/html/2606.23964#S2.p1.1),[§3\.3](https://arxiv.org/html/2606.23964#S3.SS3.p1.5)\.
- \[5\]N\. H\. Cho, K\. C\. Cheveralls, A\. Brunner, K\. Kim, A\. C\. Michaelis, P\. Raghavan, H\. Kobayashi, L\. Savy, J\. Y\. Li, H\. Canaj,et al\.\(2021\)OpenCell: proteome\-scale endogenous tagging enables the cartography of human cellular organization\.Nature595,pp\. 285–290\.External Links:[Document](https://dx.doi.org/10.1038/s41586-021-03969-3)Cited by:[§1](https://arxiv.org/html/2606.23964#S1.p1.1),[§2](https://arxiv.org/html/2606.23964#S2.p1.1),[§4](https://arxiv.org/html/2606.23964#S4.p1.5)\.
- \[6\]M\. Doron, T\. Moutakanni, Z\. S\. Chen, N\. Moshkov, M\. Caron, H\. Touvron, W\. Pernice, and J\. C\. Caicedo\(2023\)Unbiased single\-cell morphology with self\-supervised vision transformers\.bioRxiv\.External Links:[Document](https://dx.doi.org/10.1101/2023.06.16.545359),[Link](https://www.biorxiv.org/content/10.1101/2023.06.16.545359v1)Cited by:[§1](https://arxiv.org/html/2606.23964#S1.p1.1),[§2](https://arxiv.org/html/2606.23964#S2.p1.1),[§4\.3](https://arxiv.org/html/2606.23964#S4.SS3.p1.5)\.
- \[7\]A\. Dosovitskiy, L\. Beyer, A\. Kolesnikov, D\. Weissenborn, X\. Zhai, T\. Unterthiner, M\. Dehghani, M\. Minderer, G\. Heigold, S\. Gelly, J\. Uszkoreit, and N\. Houlsby\(2021\)An image is worth 16x16 words: transformers for image recognition at scale\.InInternational Conference on Learning Representations \(ICLR\),External Links:[Link](https://arxiv.org/abs/2010.11929)Cited by:[§3\.3](https://arxiv.org/html/2606.23964#S3.SS3.p2.6)\.
- \[8\]A\. Gupta, Z\. Wefers, K\. Kahnert, J\. N\. Hansen, M\. K\. Misra, W\. Leineweber, A\. Cesnik, D\. Lu, U\. Axelsson, F\. Ballllosera Navarro, R\. B\. Altman, T\. Karaletsos, and E\. Lundberg\(2024\)SubCell: vision foundation models for microscopy capture single\-cell biology\.bioRxiv\.External Links:[Document](https://dx.doi.org/10.1101/2024.12.06.627299),[Link](https://www.biorxiv.org/content/10.1101/2024.12.06.627299v1)Cited by:[§1](https://arxiv.org/html/2606.23964#S1.p1.1),[§2](https://arxiv.org/html/2606.23964#S2.p1.1),[§4\.3](https://arxiv.org/html/2606.23964#S4.SS3.p1.5)\.
- \[9\]K\. He, X\. Chen, S\. Xie, Y\. Li, P\. Dollár, and R\. Girshick\(2022\)Masked autoencoders are scalable vision learners\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),pp\. 16000–16009\.External Links:[Document](https://dx.doi.org/10.1109/CVPR52688.2022.01553)Cited by:[§3](https://arxiv.org/html/2606.23964#S3.p1.14)\.
- \[10\]L\. Jiang, B\. Dai, W\. Wu, and C\. C\. Loy\(2021\)Focal frequency loss for image reconstruction and synthesis\.InICCV,Cited by:[§3\.2](https://arxiv.org/html/2606.23964#S3.SS2.p1.1)\.
- \[11\]H\. Kobayashi, K\. C\. Cheveralls, M\. D\. Leonetti, L\. A\. Royer,et al\.\(2022\)Self\-supervised deep learning encodes high\-resolution features of protein subcellular localization\.Nature Methods19\(8\),pp\. 995–1003\.External Links:[Document](https://dx.doi.org/10.1038/s41592-022-01541-z)Cited by:[§2](https://arxiv.org/html/2606.23964#S2.p1.1)\.
- \[12\] O\. Kraus, K\. Kenyon\-Dean, S\. Saberian, M\. Fallah, P\. McLean, J\. Leung, V\. Sharma, A\. Khan, J\. Balakrishnan, S\. Celik, D\. Beaini, M\. Sypetkowski, C\. V\. Cheng, K\. Morse, M\. Makes, B\. Mabey, and B\. Earnshaw \(2024\) Masked autoencoders for microscopy are scalable learners of cellular biology\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\), pp\. 11757–11768\. External Links: [Document](https://dx.doi.org/10.1109/CVPR57552.2024.01157) Cited by: [§1](https://arxiv.org/html/2606.23964#S1.p1.1), [§3\.2](https://arxiv.org/html/2606.23964#S3.SS2.p1.1)\.
- \[13\] Z\. Lin, H\. Akin, R\. Rao, B\. Hie, Z\. Zhu, W\. Lu, N\. Smetanin, R\. Verkuil, O\. Kabeli, Y\. Shmueli, A\. dos Santos Costa, M\. Fazel\-Zarandi, T\. Sercu, S\. Candido, and A\. Rives \(2023\) Evolutionary\-scale prediction of atomic\-level protein structure with a language model\. Science 379\(6637\), pp\. 1123–1130\. Note: Includes ESM\-2 protein language model and associated ESMFold predictions\. External Links: [Document](https://dx.doi.org/10.1126/science.ade2574) Cited by: [§1](https://arxiv.org/html/2606.23964#S1.p1.1), [Figure 1](https://arxiv.org/html/2606.23964#S2.F1), [§2](https://arxiv.org/html/2606.23964#S2.p1.1), [§3\.3](https://arxiv.org/html/2606.23964#S3.SS3.p1.5), [§3\.3](https://arxiv.org/html/2606.23964#S3.SS3.p2.6), [§4\.2](https://arxiv.org/html/2606.23964#S4.SS2.p2.1), [§4](https://arxiv.org/html/2606.23964#S4.p1.5)\.
- \[14\] T\. Moutakanni, C\. Couprie, S\. Yi, M\. Doron, Z\. S\. Chen, N\. Moshkov, et al\. \(2025\) Cell\-DINO: self\-supervised image\-based embeddings for cell fluorescent microscopy\. PLOS Computational Biology 21\(12\), pp\. e1013828\. External Links: [Document](https://dx.doi.org/10.1371/journal.pcbi.1013828) Cited by: [§1](https://arxiv.org/html/2606.23964#S1.p1.1)\.
- \[15\] C\. Stringer, T\. Wang, M\. Michaelos, and M\. Pachitariu \(2021\) Cellpose: a generalist algorithm for cellular segmentation\. Nature Methods 18\(1\), pp\. 100–106\. External Links: [Document](https://dx.doi.org/10.1038/s41592-020-01018-x) Cited by: [§4](https://arxiv.org/html/2606.23964#S4.p1.5)\.
- \[16\] M\. Uhlén, L\. Fagerberg, B\. M\. Hallström, C\. Lindskog, P\. Oksvold, A\. Mardinoglu, Å\. Sivertsson, C\. Kampf, E\. Sjöstedt, A\. Asplund, et al\. \(2015\) Tissue\-based map of the human proteome\. Science 347\(6220\), pp\. 1260419\. External Links: [Document](https://dx.doi.org/10.1126/science.1260419) Cited by: [§1](https://arxiv.org/html/2606.23964#S1.p1.1), [§2](https://arxiv.org/html/2606.23964#S2.p1.1), [§4\.3](https://arxiv.org/html/2606.23964#S4.SS3.p1.5)\.
- \[17\] UniProt Consortium \(2023\) UniProt: the universal protein knowledgebase in 2023\. Nucleic Acids Research 51\(D1\), pp\. D523–D531\. External Links: [Document](https://dx.doi.org/10.1093/nar/gkac1052) Cited by: [§4](https://arxiv.org/html/2606.23964#S4.p1.5)\.
- \[18\] A\. van den Oord, O\. Vinyals, and K\. Kavukcuoglu \(2017\) Neural discrete representation learning\. In Advances in Neural Information Processing Systems \(NeurIPS\), Vol\. 30\. Cited by: [§2](https://arxiv.org/html/2606.23964#S2.p1.1)\.
- \[19\] A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\) Attention is all you need\. In Advances in Neural Information Processing Systems \(NeurIPS\), Vol\. 30\. Cited by: [§3\.1](https://arxiv.org/html/2606.23964#S3.SS1.p1.4)\.
- \[20\] M\. P\. Viana, J\. Chen, T\. A\. Knijnenburg, R\. Vasan, C\. Yan, J\. E\. Arakaki, M\. Bailey, B\. Berry, A\. Borensztejn, E\. M\. Brown, et al\. \(2023\) Integrated intracellular organization and its variations in human iPS cells\. Nature 613\(7943\), pp\. 345–354\. External Links: [Document](https://dx.doi.org/10.1038/s41586-022-05563-7) Cited by: [§1](https://arxiv.org/html/2606.23964#S1.p1.1), [§2](https://arxiv.org/html/2606.23964#S2.p1.1)\.
- \[21\] F\. Y\. Zhou, Z\. Marin, C\. Yapp, Q\. Zhou, B\. A\. Nanes, S\. Daetwyler, A\. R\. Jamieson, M\. T\. Islam, E\. Jenkins, G\. M\. Gihana, J\. Lin, H\. M\. Borges, B\. Chang, A\. Weems, S\. J\. Morrison, P\. K\. Sorger, R\. Fiolka, and K\. M\. Dean \(2025\) Universal consensus 3D segmentation of cells from 2D segmented stacks\. Nature Methods 22\(11\), pp\. 2386–2399\. External Links: [Document](https://dx.doi.org/10.1038/s41592-025-02887-w) Cited by: [§2](https://arxiv.org/html/2606.23964#S2.p1.1)\.
- \[22\] L\. Zhou, H\. Liu, J\. Bae, J\. He, D\. Samaras, and P\. Prasanna \(2023\) Self pre\-training with masked autoencoders for medical image classification and segmentation\. In Proceedings of the 2023 IEEE 20th International Symposium on Biomedical Imaging \(ISBI\), pp\. 1–5\. External Links: [Document](https://dx.doi.org/10.1109/ISBI53787.2023.10230477), [Link](https://ieeexplore.ieee.org/document/10230477) Cited by: [§1](https://arxiv.org/html/2606.23964#S1.p1.1), [§2](https://arxiv.org/html/2606.23964#S2.p1.1), [§3](https://arxiv.org/html/2606.23964#S3.p1.14)\.
