Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon
Summary
Metal-Sci introduces a 10-task benchmark for optimizing scientific computing kernels on Apple Silicon, paired with an evolutionary search framework driven by large language models. The study evaluates Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.5 on an M1 Pro, reporting in-distribution self-speedups from 1.00× to 10.7×. A held-out problem size, evaluated once at the end of each run, catches silent regressions that the in-distribution score misses.
Cached at: 05/12/26, 10:53 AM
Source: https://huggingface.co/papers/2605.09708
Abstract
A benchmark for optimizing scientific computing kernels on Apple Silicon is paired with an automated search framework that uses a large language model in an evolutionary loop to optimize performance while identifying hidden regressions through out-of-distribution evaluation.
We present Metal-Sci, a 10-task benchmark of scientific Apple Silicon Metal compute kernels spanning six optimization regimes (stencils, all-pairs in n-body problems, multi-field Boltzmann, neighbor-list molecular dynamics, multi-kernel PDE, FFT). Each task ships a CPU reference, a roofline-anchored fitness function, and a held-out generalization size. We pair the benchmark with a lightweight harness for automatic kernel search that runtime-compiles each candidate, scores it against the roofline across multiple sizes, and feeds structured compile and per-size correctness diagnostics back to a frozen LLM driving a (1+1) evolutionary loop. We report matched single-model sweeps of Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.5 on M1 Pro: in-distribution self-speedups span 1.00× to 10.7×. Beyond raw speedup, our central methodological claim is structural: the held-out gate scoring function Φ_T (evaluated once at end-of-run on a configuration the agent never sees during search) functions as a cheap mechanical oversight primitive on this automatic search loop, catching e.g. an Opus template <uint D> HMC win that returns wrong samples at unseen dimensions, and a GPT FFT3D best that wins in-distribution at 2.95× speedup but collapses to 0.23× on a 256^3 held-out cube, a silent regression that the in-distribution score alone cannot see. Code at https://github.com/vicgalle/metal-sci-kernels
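The interaction between a (1+1) evolutionary loop and a held-out gate can be illustrated with a small Python sketch. Everything here is an assumption made for illustration and is not taken from the Metal-Sci code: the candidate is a single hypothetical tile-size integer rather than a Metal kernel, simulated_seconds is a made-up cost model standing in for real GPU timings, the score is a geometric-mean self-speedup rather than the paper's roofline-anchored fitness, and a deterministic doubling rule replaces the LLM proposal step.

```python
from statistics import geometric_mean
from typing import Callable, Tuple

IN_DIST_SIZES = (64, 128)  # sizes visible during search (assumed values)
HELD_OUT_SIZE = 256        # size evaluated once, after search (assumed value)
SEED_TILE = 1              # unoptimized starting candidate (hypothetical)


def simulated_seconds(tile: int, n: int) -> float:
    """Toy cost model: larger tiles help until a made-up budget is exceeded."""
    work = float(n ** 3)
    penalty = 100.0 if tile * n > 4096 else 1.0  # hypothetical thrashing cliff
    return work * penalty / tile


def speedup(tile: int, n: int) -> float:
    """Self-speedup of a candidate relative to the seed at problem size n."""
    return simulated_seconds(SEED_TILE, n) / simulated_seconds(tile, n)


def in_dist_score(tile: int) -> float:
    """Geometric-mean speedup over the sizes the search is allowed to see."""
    return geometric_mean([speedup(tile, n) for n in IN_DIST_SIZES])


def one_plus_one(seed: int,
                 propose: Callable[[int], int],
                 score: Callable[[int], float],
                 steps: int) -> Tuple[int, float]:
    """(1+1) loop: one parent, one child per step, keep the child only if better."""
    parent, parent_score = seed, score(seed)
    for _ in range(steps):
        child = propose(parent)
        child_score = score(child)
        if child_score > parent_score:
            parent, parent_score = child, child_score
    return parent, parent_score


def held_out_gate(tile: int, threshold: float = 1.0) -> Tuple[bool, float]:
    """Evaluate the final candidate once on the size never seen during search."""
    s = speedup(tile, HELD_OUT_SIZE)
    return s >= threshold, s


# Doubling the tile stands in for the LLM's proposal step.
best, score = one_plus_one(SEED_TILE, lambda t: t * 2, in_dist_score, steps=8)
passed, held = held_out_gate(best)
print(f"best tile={best}, in-distribution speedup={score:.2f}x")
print(f"held-out speedup={held:.2f}x, gate passed={passed}")
```

Under this toy model the search settles on a tile of 32 with an in-distribution speedup of about 32×, yet the same candidate runs at 0.32× on the held-out size, so the gate fails. This is the same shape of silent regression the abstract describes for the FFT3D result, reproduced here only with invented numbers.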
Similar Articles
I ported EXL3 to run well on Apple Silicon - PonyExl3
Ported the EXL3 LLM codec to run on Apple Silicon via Metal, achieving high prefill and generation speeds on M5 Max (e.g., ~600 tok/s prefill, 17-80 tok/s gen on various models).
I built mlx-Chronos — a community benchmark leaderboard for local LLM engines on Apple Silicon (oMLX, Rapid-MLX, mlx-lm, Ollama)
A CS student built mlx-Chronos, an open-source CLI tool that standardizes benchmarking of MLX inference engines on Apple Silicon by measuring TTFT, throughput, memory usage, and thermal state, with a community leaderboard for sharing results.
I fitted the new δ-mem research for apple silicon using mlx and openclaw integration! My findings
The author implements the δ-mem research paper on Apple Silicon using MLX and OpenClaw, showing memory and attention improvements in local AI agent tests, though with mixed results compared to CUDA benchmarks.
I benchmarked 21 local LLMs on a MacBook Air M5 for code quality AND speed
A developer benchmarked 21 local LLMs on MacBook Air M5 using HumanEval+ and found Qwen 3.6 35B-A3B (MoE) leads at 89.6% with 16.9 tok/s, while Qwen 2.5 Coder 7B offers the best RAM-to-performance ratio at 84.2% in 4.5 GB. Notably, Gemma 4 models significantly underperformed expectations (31.1% for 31B), possibly due to Q4_K_M quantization effects.
Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s
The author details the process of optimizing custom matrix multiplication kernels in Swift to train a Large Language Model on Apple Silicon, aiming to outperform C implementations by leveraging CPU, SIMD, AMX, and GPU capabilities.