Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon

Hugging Face Daily Papers 05/10/26, 12:00 AM Papers

Summary

Metal-Sci introduces a 10-task benchmark for optimizing scientific computing kernels on Apple Silicon, paired with an evolutionary search framework driven by large language models. The study evaluates models like Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.5, demonstrating significant speedups while using out-of-distribution testing to catch silent performance regressions.

We present Metal-Sci, a 10-task benchmark of scientific Apple Silicon Metal compute kernels spanning six optimization regimes (stencils, all-pairs in n-body problems, multi-field Boltzmann, neighbor-list molecular dynamics, multi-kernel PDE, FFT). Each task ships a CPU reference, a roofline-anchored fitness function, and a held-out generalization size. We pair the benchmark with a lightweight harness for automatic kernel search that runtime-compiles each candidate, scores it against the roofline across multiple sizes, and feeds structured compile and per-size correctness diagnostics back to a frozen LLM driving a (1+1) evolutionary loop. We report matched single-model sweeps of Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.5 on M1 Pro: in-distribution self-speedups span 1.00× to 10.7×. Beyond raw speedup, our central methodological claim is structural: the held-out gate scoring function Φ_T (evaluated once at end-of-run on a configuration the agent never sees during search) functions as a cheap mechanical oversight primitive on this automatic search loop, catching, e.g., an Opus template <uint D> HMC win that returns wrong samples at unseen dimensions, and a GPT FFT3D best that wins in-distribution at 2.95× speedup but collapses to 0.23× on a 256^3 held-out cube, a silent regression that the in-distribution score alone cannot see. Code at https://github.com/vicgalle/metal-sci-kernels
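
The search loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' harness: the `Measurement` record, the geometric-mean aggregation, the zero score for any incorrect size, and the strict-improvement acceptance rule are all assumptions, and `propose` stands in for the frozen LLM. The peak figures used in any call are placeholders, not measured M1 Pro values.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple, TypeVar

C = TypeVar("C")  # a candidate kernel (e.g. its source text); hypothetical type

@dataclass(frozen=True)
class Measurement:
    """One timed run of a candidate at one problem size (hypothetical record)."""
    flops: float        # floating-point operations performed
    bytes_moved: float  # bytes read plus written to device memory
    seconds: float      # measured run time
    correct: bool       # output matches the CPU reference within tolerance

def roofline_fraction(m: Measurement, peak_flops: float, peak_bw: float) -> float:
    """Achieved FLOP/s divided by the roofline ceiling at this arithmetic intensity."""
    intensity = m.flops / m.bytes_moved             # FLOP per byte
    ceiling = min(peak_flops, peak_bw * intensity)  # classic roofline bound
    return (m.flops / m.seconds) / ceiling

def fitness(ms: Sequence[Measurement], peak_flops: float, peak_bw: float) -> float:
    """Assumed aggregate: geometric mean over sizes, 0.0 if any size is incorrect."""
    if not ms or not all(m.correct for m in ms):
        return 0.0
    logs = [math.log(roofline_fraction(m, peak_flops, peak_bw)) for m in ms]
    return math.exp(sum(logs) / len(logs))

def one_plus_one(parent: C,
                 propose: Callable[[C, float], C],
                 evaluate: Callable[[C], float],
                 steps: int) -> Tuple[C, float]:
    """(1+1) loop: one parent, one child per step, keep the child only if it scores higher."""
    best, best_score = parent, evaluate(parent)
    for _ in range(steps):
        child = propose(best, best_score)  # in the paper's setting, an LLM call
        score = evaluate(child)
        if score > best_score:             # strict improvement (assumed rule)
            best, best_score = child, score
    return best, best_score
```

In a real harness, `evaluate` would compile the candidate, time it at each in-distribution size, and call `fitness` on the resulting measurements; the held-out size would stay out of this loop entirely.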

Original Article

Cached at: 05/12/26, 10:53 AM

Source: https://huggingface.co/papers/2605.09708

Abstract

A benchmark for optimizing scientific computing kernels on Apple Silicon is paired with an automated search framework that uses a large language model in an evolutionary loop to improve kernel performance, while out-of-distribution evaluation exposes hidden regressions.
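
To make the out-of-distribution check concrete, here is a small sketch of an end-of-run gate. It is an assumption-laden illustration rather than the paper's Φ_T: the `GateResult` type, the `held_out_gate` function, and the 1.0× minimum-speedup threshold are all hypothetical. Only the FFT3D speedups (2.95× in-distribution, 0.23× held-out) come from the abstract.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GateResult:
    passed: bool
    reason: str

def held_out_gate(held_out_correct: bool,
                  held_out_speedup: float,
                  min_speedup: float = 1.0) -> GateResult:
    """Run once after search ends, on a size the search never saw.

    min_speedup=1.0 (no slower than the baseline) is an assumed threshold.
    """
    if not held_out_correct:
        return GateResult(False, "incorrect output on held-out configuration")
    if held_out_speedup < min_speedup:
        return GateResult(
            False,
            f"held-out speedup {held_out_speedup:.2f}x is below {min_speedup:.2f}x",
        )
    return GateResult(True, "passed")

# FFT3D case from the abstract: 2.95x in-distribution, 0.23x on the held-out cube.
# Correctness on the held-out cube is assumed True here for illustration.
in_distribution_speedup = 2.95
fft3d = held_out_gate(held_out_correct=True, held_out_speedup=0.23)
print(in_distribution_speedup, fft3d)
```

The in-distribution score plays no part in the decision, which is the point: a candidate that looks like a clear win during search can still be rejected on speed or on correctness at the unseen size.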
