Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints
Summary
Stargazer introduces a scalable benchmark environment with 120 astrophysics tasks to evaluate AI agents on physics-grounded model-fitting of radial-velocity data, revealing gaps between statistical optimization and physical constraint adherence.
Cached at: 04/23/26, 03:35 AM
Source: https://huggingface.co/papers/2604.15664
Abstract
Stargazer is a scalable environment for evaluating AI agents on dynamic physics-grounded model-fitting tasks using radial-velocity time series data, revealing gaps between statistical fitting and physical constraint adherence.
The rise of autonomous AI agents suggests that dynamic benchmark environments with built-in feedback on scientifically grounded tasks are needed to evaluate the capabilities of these agents in research work. We introduce Stargazer, a scalable environment for evaluating AI agents on dynamic, iterative physics-grounded model-fitting tasks using inference on radial-velocity (RV) time series data. Stargazer comprises 120 tasks across three difficulty tiers, including 20 real archival cases, covering diverse scenarios ranging from high-SNR single-planet systems to complex multi-planetary configurations requiring involved low-SNR analysis. Our evaluation of eight frontier agents reveals a gap between numerical optimization and adherence to physical constraints: although agents often achieve a good statistical fit, they frequently fail to recover correct physical system parameters, a limitation that persists even when agents are equipped with vanilla skills. Furthermore, increasing test-time compute yields only marginal gains, with excessive token usage often reflecting recursive failure loops rather than meaningful exploration. Stargazer presents an opportunity to train, evaluate, scaffold, and scale strategies on a model-fitting problem of practical research relevance today. Our methodology to design a simulation-driven environment for AI agents presumably generalizes to many other model-fitting problems across scientific domains. Source code and the project website are available at https://github.com/Gudmorning2025/Stargazer and https://gudmorning2025.github.io/Stargazer, respectively.
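The gap between a good statistical fit and correct physical parameters can be illustrated with a toy example. The sketch below is not taken from the Stargazer code base: the circular-orbit model, the function names, and the 5% recovery tolerance are all assumptions made for illustration. It uses one well-known effect, sampling aliases under a strict one-observation-per-day cadence, which is only one of several ways such a gap can arise and is not necessarily the failure mode the paper reports.

```python
import math

def rv_circular(t, period, semi_amp, phase, offset=0.0):
    """Radial velocity (m/s) induced by one planet on a circular orbit.
    Hypothetical toy model; real RV fits usually use full Keplerian orbits."""
    return semi_amp * math.sin(2.0 * math.pi * t / period + phase) + offset

def chi_square(times, rvs, sigma, params):
    """Sum of squared residuals, each divided by the assumed uncertainty sigma."""
    return sum(((v - rv_circular(t, **params)) / sigma) ** 2
               for t, v in zip(times, rvs))

def period_recovered(fit_period, true_period, rel_tol=0.05):
    """Assumed physical check: fitted period within rel_tol of the true one."""
    return abs(fit_period - true_period) / true_period <= rel_tol

# Synthetic, noiseless data sampled exactly once per day (illustrative values).
true = {"period": 3.7, "semi_amp": 12.0, "phase": 0.4}
times = [float(day) for day in range(40)]
rvs = [rv_circular(t, **true) for t in times]

# Daily alias: adding one cycle per day to the orbital frequency leaves the
# model unchanged at integer times, so the fit is just as good.
alias = dict(true, period=1.0 / (1.0 + 1.0 / true["period"]))

for name, params in (("true", true), ("alias", alias)):
    chi2 = chi_square(times, rvs, 1.0, params)
    ok = period_recovered(params["period"], true["period"])
    print(f"{name}: period={params['period']:.4f} d, "
          f"chi2={chi2:.3g}, period recovered={ok}")
```

Both parameter sets should report a chi-square of essentially zero, yet only the first passes the period check. A score based on goodness of fit alone cannot tell them apart, which is why a benchmark of this kind would need to compare recovered parameters against ground truth as well.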
Datasets citing this paper: liuxinge/Stargazer
Similar Articles
Benchmarking AI Agents for Addressing Scientific Challenges Across Scales
Introduces SciAgentArena, a benchmark of ~200 tasks for evaluating AI agents in real scientific research. Finds that agents are effective for well-specified data-analysis workflows but struggle with novel insights and open-ended exploration.
VESTA: Visual Exploration with Statistical Tool Agents
This paper introduces VESTA, a framework that equips vision-language models with dynamically growing toolkits for data exploration and statistical model refinement, outperforming prior agent-based methods on complex scientific modeling tasks. The authors also present Dawn, a benchmark for distribution fitting and time series modeling, including real-world astronomy challenges.
AstroMind: A High-Fidelity Benchmark for Spacecraft Behavior Reasoning Based on Large Language Models
AstroMind is a physics-grounded benchmark for evaluating large language models on spacecraft behavior reasoning tasks, including intent inference, maneuver parameter estimation, and threat assessment, using high-fidelity astrodynamics simulations and realistic sensor noise.
Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark
This paper introduces PhySciBench, a benchmark of 200 expert-curated questions for physical sciences, and DelveAgent, a multi-agent framework that improves accuracy and reduces inference costs compared to baselines like Gemini Deep Research.
GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models
The paper introduces GeoNatureAgent Benchmark, the first benchmark for evaluating LLM agents on environmental geospatial analysis tasks via structured tool calls. It evaluates seven models on 93 tasks across 18 categories and finds that Claude Sonnet 4 achieves the highest accuracy at 60.8%, while open-weight models like DeepSeek V3.2 offer strong cost-performance tradeoffs.