The Galaxy's Guide to the Tokenizer: A Benchmark for Scientific Foundation Models
Summary
This paper compares four tokenization methods (Affine, AIM, JetFormer, VQ-VAE) for astronomical images within a unified transformer framework, using 640,000 galaxy images to evaluate reconstruction quality, physical property prediction, and morphological preservation. It finds that no single method excels across all tasks, highlighting trade-offs in representation learning.
Source: https://huggingface.co/papers/2606.25610
Abstract
Four tokenization methods for astronomical images show distinct strengths in reconstruction quality, physical property prediction, and morphological preservation, with no single approach excelling across all tasks.
Tokenization is central to adapting scientific data for transformer-based foundation models, yet its impact on learned representations remains poorly understood. We compare four tokenization strategies, Affine, AIM, JetFormer, and VQ-VAE, within a unified transformer framework for astronomical imaging. Using 640,000 galaxy images from the DESI Legacy Survey and a shared AstroPT backbone, we evaluate each method on reconstruction fidelity and prediction of physical properties. Our results reveal trade-offs across approaches. The flow-based JetFormer achieves higher reconstruction quality, while VQ-VAE yields strong probe performance for galaxy physical properties. Affine and AIM better preserve localized morphological information. We find that reconstruction and representation quality are decoupled, and no single method consistently performs best across the tasks considered here. By grounding our evaluation in independently measured physical quantities, we hope this study serves to highlight the potential of scientific data as a basis for constructing interpretable benchmarks for foundation models.
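To make the idea of image tokenization concrete, here is a minimal sketch of one plausible reading of an affine tokenizer: cut the image into non-overlapping patches, flatten each patch, and map it to an embedding with y = Wx + b. This is an assumption for illustration only; the abstract does not specify how the paper's Affine method is built. The function names `patchify` and `affine_embed`, the patch size, and the hand-picked weights are all hypothetical (a real tokenizer would learn W and b during training).

```python
from typing import List

Image = List[List[float]]
Tokens = List[List[float]]


def patchify(image: Image, patch: int) -> Tokens:
    """Split an image into non-overlapping patch x patch tiles, flattened row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[top + dy][left + dx]
                           for dy in range(patch)
                           for dx in range(patch)])
    return tokens


def affine_embed(tokens: Tokens, weight: List[List[float]], bias: List[float]) -> Tokens:
    """Apply y = W x + b to every flattened patch; weight has one row per output dimension."""
    return [[sum(w * x for w, x in zip(row, token)) + b
             for row, b in zip(weight, bias)]
            for token in tokens]


# Toy 4x4 single-channel "image" with pixel values 0..15 (illustrative, not survey data).
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)

# Hand-picked weights (assumption): row 0 averages the patch,
# row 1 is top-left pixel minus bottom-right pixel.
weight = [[0.25, 0.25, 0.25, 0.25],
          [1.0, 0.0, 0.0, -1.0]]
bias = [0.0, 0.5]
embeddings = affine_embed(tokens, weight, bias)
```

A 4x4 image with 2x2 patches yields four tokens, each mapped to a two-dimensional embedding. The other methods named in the abstract replace this embedding step with different machinery (for example, a learned discrete codebook in VQ-VAE), which this sketch does not attempt to reproduce.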
Similar Articles
Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers
This paper introduces a two-stage token selection framework for visual geometry transformers that reduces computational costs by restricting key/value tokens during global attention, achieving over 85% acceleration on scenes with 500 images while maintaining baseline performance.
From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion
This paper introduces a multimodal image fusion method that uses a 1D token interface from a pretrained image tokenizer to enhance global appearance coherence while preserving local details through selective token editing (STE). Experiments on four benchmarks show state-of-the-art performance in both global coherence and local fidelity.
Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study
Benchmarks seven foundation models on Ukrainian legal text, finding tokenizer fertility varies 1.6×, few-shot prompting degrades performance, and cost-performance analysis shows NVIDIA Nemotron Super 3 outperforms larger models.
Identifiable Token Correspondence for World Models
This paper introduces Identifiable Token Correspondence, a method that models token correspondence across time frames to improve temporal consistency in transformer-based world models for visual reinforcement learning, achieving state-of-the-art results on multiple benchmarks.
HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers
HYDRA-X presents a unified multimodal model that integrates image and video tokenization within a single Vision Transformer, achieving strong performance across understanding and generation tasks.