Rational Sparse Autoencoder

arXiv cs.LG Papers

Summary

Introduces Rational Sparse Autoencoder (RSAE), which replaces fixed encoder activations with trainable rational functions, improving reconstruction and sparsity trade-offs on residual-stream activations of open-weight language models across multiple baseline families.

arXiv:2606.14990v1 Announce Type: new Abstract: Sparse autoencoders (SAEs) are standard tools for mechanistic interpretability, but current SAE families are constrained by fixed encoder nonlinearities such as ReLU, JumpReLU, and TopK. This hard-codes a particular sparsity mechanism into the model and can distort the reconstruction-versus-sparsity trade-off. We introduce the Rational Sparse Autoencoder (RSAE), which replaces the fixed encoder activation with a trainable rational function. Rational activations are flexible enough to uniformly approximate the activation primitives used by existing SAE families on compact domains (for TopK, the thresholded gate obtained after a separating top-k threshold is supplied), while also providing a richer function class for adapting to the observed pre-activation geometry. We realise this idea through a two-stage pipeline: an initialisation procedure that copies the pre-trained baseline SAE weights, plugs in rational coefficients obtained by the relaxed Remez exchange on synthetic data, and calibrates the scale parameters along with the rational coefficients; followed by a fine-tuning step under the standard sparsity-regularised reconstruction objective. Empirically, on residual-stream activations of three open-weight language models and across all three baseline activation families, the RSAE strictly improves on it after the fine-tuning step, both on reconstruction-side metrics and on downstream-behaviour metrics, without sacrificing feature-level interpretability under sparse probing. These gains are consistent across host language models, across baseline activation families, and across the full range of baseline sparsity we tested, while the upgrade itself adds only a handful of scalar parameters per autoencoder and runs in minutes on a single consumer GPU.

Cached at: 06/16/26, 11:36 AM

# Rational Sparse Autoencoder
Source: [https://arxiv.org/html/2606.14990](https://arxiv.org/html/2606.14990)
Naiyu Yin, Department of Mathematics, Lehigh University, Bethlehem, PA 18015, nay224@lehigh.edu; Yue Yu, Department of Mathematics, Lehigh University, Bethlehem, PA 18015, yuy214@lehigh.edu

###### Abstract

Sparse autoencoders (SAEs) are standard tools for mechanistic interpretability, but current SAE families are constrained by fixed encoder nonlinearities such as ReLU, JumpReLU, and TopK. This hard-codes a particular sparsity mechanism into the model and can distort the reconstruction-versus-sparsity trade-off. We introduce the *Rational Sparse Autoencoder* (RSAE), which replaces the fixed encoder activation with a trainable rational function. Rational activations are flexible enough to uniformly approximate the activation primitives used by existing SAE families on compact domains (for TopK, the thresholded gate obtained after a separating top-$k$ threshold is supplied), while also providing a richer function class for adapting to the observed pre-activation geometry. We realise this idea through a two-stage pipeline: an initialisation procedure that copies the pre-trained baseline SAE weights, plugs in rational coefficients obtained by the relaxed Remez exchange on synthetic data, and calibrates the scale parameters along with the rational coefficients; followed by a fine-tuning step under the standard sparsity-regularised reconstruction objective. Empirically, on residual-stream activations of three open-weight language models and across all three baseline activation families, the RSAE *strictly improves* on the baseline after the fine-tuning step, both on reconstruction-side metrics (MSE, $\ell_0$, alive-feature fraction) and on downstream-behaviour metrics (cross-entropy degradation, loss recovered), without sacrificing feature-level interpretability under sparse probing. These gains are consistent across host language models, across baseline activation families, and across the full range of baseline sparsity we tested, while the upgrade itself adds only a handful of scalar parameters per autoencoder and runs in minutes on a single consumer GPU.

## 1 Introduction

Sparse autoencoders (SAEs) have become a central tool for mechanistic interpretability of large language models, decomposing the internal activations of a transformer into a sparse linear combination of an overcomplete dictionary of monosemantic feature directions (Bricken et al., [2023](https://arxiv.org/html/2606.14990#bib.bib21); Huben et al., [2024](https://arxiv.org/html/2606.14990#bib.bib38); Gao et al., [2025](https://arxiv.org/html/2606.14990#bib.bib10); Rajamanoharan et al., [2024a](https://arxiv.org/html/2606.14990#bib.bib3), [b](https://arxiv.org/html/2606.14990#bib.bib17)). Despite rapid progress on training procedures and evaluation suites, the widely used released baselines considered here pair the same affine encoder with one of three non-smooth activation primitives: $\mathrm{ReLU}$ regularised by an $\ell_1$ penalty (Bricken et al., [2023](https://arxiv.org/html/2606.14990#bib.bib21); Bloom, [2024](https://arxiv.org/html/2606.14990#bib.bib20)), $\mathrm{TopK}$ (Gao et al., [2025](https://arxiv.org/html/2606.14990#bib.bib10)), or $\mathrm{JumpReLU}$ (Rajamanoharan et al., [2024b](https://arxiv.org/html/2606.14990#bib.bib17)). Each of these primitives carries well-documented pathologies. The $\ell_1$-regularised $\mathrm{ReLU}$ SAE suffers from magnitude shrinkage of active features and a persistent population of dead latents (Taggart, [2024](https://arxiv.org/html/2606.14990#bib.bib19); Rajamanoharan et al., [2024a](https://arxiv.org/html/2606.14990#bib.bib3); Gao et al., [2025](https://arxiv.org/html/2606.14990#bib.bib10)). $\mathrm{TopK}$ replaces the soft penalty with a hard cardinality constraint that breaks gradient flow through inactive features and relies on auxiliary revival losses to mitigate dead features. $\mathrm{JumpReLU}$ inserts a learnable per-feature threshold but requires a continuous-relaxation surrogate for back-propagation through its indicator gate.

In this work, we consider the shallow encoder architecture used by SAEs: one affine pre-activation layer followed by a sparse activation block. We show that trainable rational activations can efficiently represent the ReLU, JumpReLU, and supplied-threshold TopK gates used by current SAE families. For discontinuous gates, the approximation holds on compact domains separated by a margin $\delta$ from the jump; a direct rational gate of polylogarithmic size in $1/\varepsilon$ and the inverse margin suffices, and the same scalar gate also has a constant-width deep rational realization. Conversely, there are $\mathcal{O}(1)$-parameter rational target maps for which any scalar-output single-layer SAE encoder with piecewise-affine ReLU/JumpReLU/supplied-threshold TopK gates needs $\Omega(\varepsilon^{-1/2})$ activated coordinates to reach accuracy $\varepsilon$. While our theoretical analysis in this work focuses on the shallow SAE setting, we point out that a similar efficiency advantage holds for deep networks as well. Specifically, in the deep setting, constant-width rational networks achieve a depth upper bound of $\mathcal{O}(\log\log(1/\varepsilon)+\log\log(1/\delta))$, whereas piecewise-affine networks obey a parameter lower bound of $\Omega(\log(1/\varepsilon))$. This separation suggests that replacing fixed SAE gates with trainable rational activations can improve reconstruction fidelity at matched sparsity; the deep-layer results are included as a complementary extension beyond the SAE encoder architecture.

We therefore propose the *Rational Sparse Autoencoder* (RSAE), an SAE whose encoder activation is a learnable rational function applied element-wise to the affine pre-activation, with learnable input/output scales $(C_{\mathrm{in}}, C_{\mathrm{out}})$ that map the per-feature pre-activation distribution into a bounded interval. We then propose a two-step RSAE training algorithm. During the initialisation procedure, we copy the pre-trained baseline SAE weights verbatim, plug in rational coefficients obtained by the relaxed Remez exchange (Chen et al., [2018](https://arxiv.org/html/2606.14990#bib.bib30)) on synthetic data, and calibrate the scale parameters and the coefficients to the baseline's pre-activation distribution. During the fine-tuning procedure, we unfreeze all parameters and minimise the standard $\ell_1$-regularised reconstruction objective. Empirically, the rational function is expressive enough to approximate every baseline activation at low degree on synthetic data. At the SAE level, we evaluate the RSAE on residual-stream activations of three open-weight language models spanning a range of model sizes and against all three baseline activation families, supporting our central claim: the RSAE achieves better fidelity at comparable sparsity and strictly improves the baseline across reconstruction-side metrics (MSE, $\ell_0$, alive-feature fraction) and downstream-behaviour metrics (cross-entropy degradation, loss recovered), uniformly across host language models and baseline activation families. These gains are consistent across the full range of baseline sparsity we tested and do not come at the cost of feature-level interpretability under sparse probing. All of this is achieved by adding only a handful of scalar parameters per autoencoder and running for minutes on a single consumer GPU.

Contributions. We introduce *RSAE*, a new sparse autoencoder built on a trainable activation. Our model is grounded in approximation theory tailored to the SAE encoder: trainable scalar rational activations can emulate the fixed ReLU, JumpReLU, and supplied-threshold TopK gates used in shallow SAE encoders with polylogarithmic size, while the converse lower bound shows that scalar-output single-layer piecewise-affine encoders may require $\Omega(\varepsilon^{-1/2})$ activated coordinates for some rational targets. To implement this upgrading strategy, we propose a two-step RSAE training algorithm: an initialisation procedure that copies the pre-trained baseline SAE weights, followed by a fine-tuning procedure that unfreezes all parameters under the standard $\ell_1$-regularised reconstruction objective. We empirically verify that the RSAE achieves better fidelity at comparable sparsity and improves the baseline across both reconstruction-side metrics (MSE, $\ell_0$, alive-feature fraction) and downstream-behaviour metrics (cross-entropy degradation, loss recovered), uniformly across host language models, baseline activation families, and baseline sparsity levels, while preserving feature-level interpretability under sparse probing and adding only negligible parameter and runtime overhead.

## 2 Preliminaries and Related Work

Sparse Autoencoders (SAEs) decompose a language model's internal activations $\bm{x}\in\mathbb{R}^{d_{\mathrm{in}}}$ into a sparse linear combination of an overcomplete dictionary of $d_{\mathrm{sae}}\gg d_{\mathrm{in}}$ feature directions, with sparse code $\bm{z}\in\mathbb{R}^{d_{\mathrm{sae}}}$. They follow a skeleton with a pair of encoder and decoder functions $(f,g)$ defined by:

$$\text{Encoder: } \bm{z}=f(\bm{x}):=\phi\bigl(\bm{W}_{\text{enc}}\,(\bm{x}-\bm{b}_{\text{dec}})+\bm{b}_{\text{enc}}\bigr),\quad \text{Decoder: } \hat{\bm{x}}=g(\bm{z}):=\bm{W}_{\text{dec}}\,\bm{z}+\bm{b}_{\text{dec}}. \tag{1}$$

We write $\bm{h}\coloneqq\bm{W}_{\text{enc}}(\bm{x}-\bm{b}_{\text{dec}})+\bm{b}_{\text{enc}}$ for the pre-activation, so that $\bm{z}=\phi(\bm{h})$. Here, the columns of $\bm{W}_{\text{dec}}$, each of unit $\ell_2$-norm, are the decoder dictionary directions used to reconstruct $\bm{x}$ from the sparse code $\bm{z}$. The weights in the encoder and decoder functions are optimized using a loss function of the form:

$$\mathcal{L}(\bm{W})=\mathbb{E}_{\bm{x}\sim\mathcal{D}}\Bigl[\,\bigl\|\bm{x}-\hat{\bm{x}}(\bm{x};\,\bm{W})\bigr\|_{2}^{2}\;+\;\lambda\,S\bigl(\bm{z}(\bm{x};\,\bm{W})\bigr)\,\Bigr],\quad \bm{W}:=\{\bm{W}_{\text{enc}},\bm{W}_{\text{dec}},\bm{b}_{\text{enc}},\bm{b}_{\text{dec}}\}, \tag{2}$$

where $S$ is a function that penalizes non-sparse decompositions, weighted by a tunable sparsity coefficient $\lambda$.
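
As a concrete reading of equations (1) and (2), the following minimal NumPy sketch runs one forward pass and evaluates the loss with $S(\bm{z})=\|\bm{z}\|_1$. The dimensions, random weights, and the value of `lam` are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, phi):
    """Equation (1): affine pre-activation, activation phi, linear decoder."""
    h = (x - b_dec) @ W_enc.T + b_enc   # pre-activation, shape (batch, d_sae)
    z = phi(h)                          # sparse code, shape (batch, d_sae)
    x_hat = z @ W_dec.T + b_dec         # reconstruction, shape (batch, d_in)
    return h, z, x_hat

def sae_loss(x, x_hat, z, lam):
    """Equation (2) with S(z) = ||z||_1, averaged over the batch."""
    recon = np.sum((x - x_hat) ** 2, axis=1)
    sparsity = np.sum(np.abs(z), axis=1)
    return float(np.mean(recon + lam * sparsity))

# Illustrative sizes and random weights (assumptions, not the paper's setup).
rng = np.random.default_rng(0)
d_in, d_sae, batch = 8, 32, 16
W_enc = rng.normal(size=(d_sae, d_in)) / np.sqrt(d_in)
W_dec = rng.normal(size=(d_in, d_sae))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)  # unit l2-norm columns
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_in)
x = rng.normal(size=(batch, d_in))

relu = lambda h: np.maximum(h, 0.0)
h, z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec, relu)
loss = sae_loss(x, x_hat, z, lam=1e-3)
```

Passing `phi` as an argument mirrors the point that the SAE families discussed here share the linear backbone and differ only in the activation.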

There are two objectives in the SAE encoder: sparsity, meaning that only a few elements of the dictionary are necessary, and faithfulness, meaning that the reconstructed $\hat{\bm{x}}$ is close to the original $\bm{x}$. To achieve a good balance between these two objectives, three major SAE families were proposed, which differ in the encoder activation $\phi$ and the sparsity mechanism $S$ imposed on $\bm{z}$. The ReLU SAE (Bricken et al., [2023](https://arxiv.org/html/2606.14990#bib.bib21); Bloom, [2024](https://arxiv.org/html/2606.14990#bib.bib20)) sets $\phi=\mathrm{ReLU}$ and imposes sparsity through an explicit $\ell_1$ penalty $S(\bm{z}):=\|\bm{z}\|_1$. In the original ReLU SAE, the soft $\ell_1$ penalty leads to a magnitude shrinkage of active features and causes loss of reconstruction fidelity (Taggart, [2024](https://arxiv.org/html/2606.14990#bib.bib19); Rajamanoharan et al., [2024a](https://arxiv.org/html/2606.14990#bib.bib3); Gao et al., [2025](https://arxiv.org/html/2606.14990#bib.bib10)). The TopK SAE (Gao et al., [2025](https://arxiv.org/html/2606.14990#bib.bib10)) then proposes to replace the soft penalty with a hard top-$k$ selection $\bm{z}=\mathrm{TopK}_k(\bm{h})$ that yields exactly $\ell_0=k$, and the JumpReLU SAE (Rajamanoharan et al., [2024b](https://arxiv.org/html/2606.14990#bib.bib17)) keeps the $\ell_1$-style soft sparsity but inserts a learnable per-feature threshold $\theta_j>0$ in the activation function $\phi$ by setting $\phi(\bm{h})=\bm{h}\odot H(\bm{h}-\bm{\theta})$, where $H$ is the Heaviside function satisfying $H(z)=0$ if $z\leq 0$ and $H(z)=1$ otherwise.
Orthogonal to the development of encoder activations, *Matryoshka SAEs* (Bussmann et al., [2025](https://arxiv.org/html/2606.14990#bib.bib40)) reorganise the decoder into nested-prefix dictionaries, and *data-free SAEs* (Laptev et al., [2025](https://arxiv.org/html/2606.14990#bib.bib46)) fit dictionaries directly from model weights without streaming activations.
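
The three baseline activations described above can be written in a few lines of NumPy. This is a minimal sketch of the definitions only; the function names and example values are ours, and released implementations may differ in details such as tie handling.

```python
import numpy as np

def relu(h):
    """ReLU SAE activation: max(h, 0) element-wise."""
    return np.maximum(h, 0.0)

def topk(h, k):
    """TopK SAE activation: keep the k largest entries of each row, zero the rest."""
    idx = np.argpartition(h, -k, axis=-1)[..., -k:]  # indices of the k largest
    mask = np.zeros_like(h, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return np.where(mask, h, 0.0)

def jumprelu(h, theta):
    """JumpReLU SAE activation: h * H(h - theta), H(z) = 1 for z > 0, else 0."""
    return np.where(h > theta, h, 0.0)

# Illustrative pre-activations and thresholds (not from the paper).
h = np.array([[0.9, -0.3, 0.2, 0.05, 0.6]])
theta = np.full(5, 0.1)

z_relu = relu(h)             # zeroes only the negative entry
z_topk = topk(h, 2)          # exactly two active features per row
z_jump = jumprelu(h, theta)  # also zeroes entries at or below the threshold
```

The example makes the difference in sparsity mechanism visible: ReLU leaves the small positive entry `0.05` active, JumpReLU removes it through the threshold, and TopK fixes the number of active entries regardless of magnitude.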

While related variants such as the *Gated SAE* (Rajamanoharan et al., [2024a](https://arxiv.org/html/2606.14990#bib.bib3)), ProLU (Taggart, [2024](https://arxiv.org/html/2606.14990#bib.bib19)), *BatchTopK SAE* (Bussmann et al., [2024](https://arxiv.org/html/2606.14990#bib.bib4)), and end-to-end SAE training (Braun et al., [2024](https://arxiv.org/html/2606.14990#bib.bib42)) modify thresholds, gates, batch-level sparsity, or the training objective, our theoretical and empirical comparisons focus on the widely used released baselines considered here (ReLU, JumpReLU, and TopK SAEs), whose encoder nonlinearities are fixed functional forms with sparsity controlled by a penalty coefficient, learned threshold, or cardinality budget rather than by a trainable rational activation. Through a trainable activation architecture supported by approximation theory, our RSAE provides a drop-in modification for pre-trained SAEs (the teacher models): while sustaining a similar level of sparsity, it strictly improves model fidelity.

Evaluation benchmarks and pretrained baselines. *SAEBench* (Karvonen et al., [2025](https://arxiv.org/html/2606.14990#bib.bib22)) provides matched pretrained ReLU, JumpReLU, and TopK SAEs across model sizes and a unified evaluation suite (covering reconstruction, sparsity, downstream performance, and interpretability metrics); we use the pre-trained ReLU, JumpReLU, and TopK SAEs for our Pythia-160m and Gemma-2-2B baselines, and reuse its sparse-probing harness in §[5](https://arxiv.org/html/2606.14990#S5). For GPT-2 small we additionally use Bloom's *gpt2-small-res-jb* release (Bloom, [2024](https://arxiv.org/html/2606.14990#bib.bib20)) as the ReLU baseline and OpenAI's v5 release (Gao et al., [2025](https://arxiv.org/html/2606.14990#bib.bib10)) as the TopK baseline.

Rational Neural Networks were built on the key theoretical advantages of rational functions in approximating non-smooth functions (Newman, [1979](https://arxiv.org/html/2606.14990#bib.bib27); Telgarsky, [2017](https://arxiv.org/html/2606.14990#bib.bib29); Beckermann and Townsend, [2017](https://arxiv.org/html/2606.14990#bib.bib35); Chen et al., [2018](https://arxiv.org/html/2606.14990#bib.bib30)). In Boullé et al. ([2020](https://arxiv.org/html/2606.14990#bib.bib1)), a rational function is employed as a learnable replacement for $\mathrm{ReLU}$ or $\tanh$ in feed-forward networks for image classification tasks. Superior performance was also demonstrated in operator learning and PDE surrogates, where the spectral density of rational approximation accelerates convergence on smooth target operators (Trimmel et al., [2022](https://arxiv.org/html/2606.14990#bib.bib33)). In rational neural networks, the standard activation function in a feed-forward layer is replaced with a trainable rational function $\frac{P(t)}{Q(t)}$. A naive learnable denominator $Q(t)$ can produce poles during training when $Q$ approaches zero; to mitigate this, Molina et al. ([2020](https://arxiv.org/html/2606.14990#bib.bib9)) introduced the Padé Activation Unit (PAU) by setting $Q(t):=1+\sum_{j=1}^{q}b_{j}\,t^{j}$. Dunefsky et al. ([2024](https://arxiv.org/html/2606.14990#bib.bib6)) subsequently proposed the *safe-Padé* parameterisation, which sets $Q(t)=1+\bigl|\sum_{j=1}^{q}b_{j}\,t^{j}\bigr|$ and guarantees pole-free, Lipschitz rational activations.
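
The difference between the two denominator parameterisations quoted above can be checked numerically. The sketch below evaluates both forms on $[-1,1]$ for one hand-picked coefficient vector `b` (an illustrative assumption, not a value from any cited paper): the normalised form fixes $Q(0)=1$ but can still change sign, whereas the absolute-value form is bounded below by 1.

```python
import numpy as np
from numpy.polynomial import polynomial as npoly

def poly_tail(t, b):
    """sum_{j>=1} b_j t^j, with b = (b_1, ..., b_q)."""
    coeffs = np.concatenate(([0.0], np.asarray(b, dtype=float)))
    return npoly.polyval(t, coeffs)

def normalised_denominator(t, b):
    """Q(t) = 1 + sum_j b_j t^j: equals 1 at t = 0 but may vanish elsewhere."""
    return 1.0 + poly_tail(t, b)

def safe_denominator(t, b):
    """Q(t) = 1 + |sum_j b_j t^j|: always at least 1, hence pole-free."""
    return 1.0 + np.abs(poly_tail(t, b))

t = np.linspace(-1.0, 1.0, 201)
b = [-1.5, 0.2]   # hypothetical coefficients chosen so the plain form crosses zero

q_plain = normalised_denominator(t, b)
q_safe = safe_denominator(t, b)
has_sign_change = q_plain.min() < 0.0 < q_plain.max()
```

With these coefficients `has_sign_change` is true, so $P/Q$ would have a pole inside the interval under the plain form, while `q_safe` never drops below 1.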

While prior works have demonstrated the theoretical advantages of rational activation functions against continuous activation functions such as $\mathrm{ReLU}$, $\mathrm{GeLU}$, or $\tanh$ (Boullé et al., [2020](https://arxiv.org/html/2606.14990#bib.bib1); Molina et al., [2020](https://arxiv.org/html/2606.14990#bib.bib9); Delfosse et al., [2021](https://arxiv.org/html/2606.14990#bib.bib23); Trimmel et al., [2022](https://arxiv.org/html/2606.14990#bib.bib33); Tang and Townsend, [2026](https://arxiv.org/html/2606.14990#bib.bib34)), little discussion has focused on discontinuous SAE gates such as $\mathrm{JumpReLU}$ and $\mathrm{TopK}$. Moreover, prior rational-network theory is usually stated for deeper feed-forward architectures, whereas the SAE encoder in ([1](https://arxiv.org/html/2606.14990#S2.E1)) is shallow: a single affine pre-activation layer followed by coordinatewise gates and a linear decoder. This mismatch motivates a theory whose main separation is stated for the single-layer encoder, with deep rational realizations kept as an additional comparison. In this work we show that trainable rational activations give a more efficient approximation class for the fixed gates used in shallow SAE encoders, and then leverage their approximation power and smooth gradients to obtain SAEs with lower reconstruction error and fewer dead features at matched sparsity.

## 3 Rational Sparse Autoencoder

Herein, we propose the *Rational Sparse Autoencoder* (RSAE), an overcomplete sparse autoencoder whose encoder activation is a trainable rational function. We retain the standard SAE skeleton of ([1](https://arxiv.org/html/2606.14990#S2.E1)) and modify *only* the encoder activation $\phi(\cdot)$. Let $\bm{x}\in\mathbb{R}^{d_{\mathrm{in}}}$ and write $\bm{h}=\bm{W}_{\text{enc}}(\bm{x}-\bm{b}_{\text{dec}})+\bm{b}_{\text{enc}}\in\mathbb{R}^{d_{\mathrm{sae}}}$ for the pre-activation. The RSAE activation is applied element-wise to the pre-activation $\bm{h}$ as

$$\phi(\bm{h})\;=\;C_{\mathrm{out}}\cdot r_{(\bm{a},\bm{b})}\!\Bigl(\frac{\bm{h}}{C_{\mathrm{in}}}\Bigr),\qquad r_{(\bm{a},\bm{b})}(t)\;=\;\frac{P(t)}{Q(t)}\;=\;\frac{\sum_{i=0}^{p}a_{i}\,t^{i}}{\sum_{j=0}^{q}b_{j}\,t^{j}},\quad t\in[-1,1]. \tag{3}$$

Because $r_{(\bm{a},\bm{b})}(\cdot)$ approximates discontinuous activation functions to arbitrarily small uniform error only on (margin-separated subsets of) a compact interval such as $[-1,1]$, we introduce learnable scaling parameters $C_{\mathrm{in}},C_{\mathrm{out}}>0$: $C_{\mathrm{in}}$ maps the pre-activation $\bm{h}$ into the rational's design interval, and $C_{\mathrm{out}}$ maps the rational's output back to the feature magnitude expected by the decoder.
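
Equation (3) translates directly into code. The sketch below is a minimal NumPy version that uses the plain quotient $P/Q$ exactly as written in (3); the coefficient values and scales in the example are arbitrary illustrations, not fitted values from the paper, and a practical implementation would also need to keep $Q$ away from zero.

```python
import numpy as np
from numpy.polynomial import polynomial as npoly

def rational(t, a, b):
    """r(t) = P(t) / Q(t) with a = (a_0..a_p), b = (b_0..b_q), lowest degree first."""
    return npoly.polyval(t, a) / npoly.polyval(t, b)

def rsae_activation(h, a, b, c_in, c_out):
    """Equation (3): phi(h) = C_out * r(h / C_in), applied element-wise."""
    return c_out * rational(np.asarray(h, dtype=float) / c_in, a, b)

h = np.linspace(-3.0, 3.0, 7)   # pre-activations spanning [-3, 3]
c_in, c_out = 3.0, 2.0          # illustrative scales: h / c_in lies in [-1, 1]

# Sanity check: P(t) = t, Q(t) = 1 reduces phi to a pure rescaling of h.
phi_linear = rsae_activation(h, [0.0, 1.0], [1.0], c_in, c_out)

# Arbitrary type-(2, 2) example: r(t) = (0.5 t + 0.5 t^2) / (1 + 0.25 t^2).
a = [0.0, 0.5, 0.5]
b = [1.0, 0.0, 0.25]
phi_example = rsae_activation(h, a, b, c_in, c_out)
```

Only `a`, `b`, `c_in`, and `c_out` are new relative to the baseline SAE, which is consistent with the statement that the upgrade adds a handful of scalar parameters.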

We now analyze why rational activations are a natural upgrade for the SAE encoder in ([1](https://arxiv.org/html/2606.14990#S2.E1)). The main setting is shallow: an affine pre-activation layer followed by a sparse activation block. We prove that ReLU, JumpReLU, and the supplied-threshold TopK gate can each be replaced by trainable rational gates of polylogarithmic size in the target accuracy, with a margin parameter for discontinuous gates. The TopK statement concerns the thresholded gate equivalent to TopK conditional on a supplied threshold, not the full order-statistic operator that computes the $k$-th threshold. We also state the corresponding constant-width deep rational realizations, connecting the SAE result to the rational-network literature (Boullé et al., [2020](https://arxiv.org/html/2606.14990#bib.bib1); Tang and Townsend, [2026](https://arxiv.org/html/2606.14990#bib.bib34)). For the converse direction, we exhibit an $\mathcal{O}(1)$-parameter rational target map that cannot be approximated efficiently by scalar-output piecewise-affine encoders in the same shallow form: any such single-layer ReLU/JumpReLU/supplied-threshold TopK encoder needs $\Omega(\varepsilon^{-1/2})$ activated coordinates. Together, these results explain why replacing the fixed activation in a pretrained SAE by a trainable rational activation can increase the encoder's approximation power without changing the linear backbone.

Our theoretical results are based on the family of Zolotarev sign functions, which are geometrically convergent rational approximants of $\mathrm{sign}(x)$ on the gap-separated set $E_{\delta}:=[-1,-\delta]\cup[\delta,1]$ for any $\delta\in(0,1)$. We state the quantitative bounds below and defer all proofs to the Appendix:

###### Lemma 1 (Rational approximation of $\mathrm{sign}$).

For every $\delta\in(0,1)$ and $n\geq 1$ there is a type-$(2n+1,2n)$ rational $s_{n,\delta}$ such that

$$\sup_{x\in E_{\delta}}\big|\mathrm{sign}(x)-s_{n,\delta}(x)\big|\leq 4\exp\!\big(-\pi^{2}n/\log(4/\delta)\big).$$

Consequently, for every $0<\varepsilon<1$, there is a rational function of size $\mathcal{O}\big(\log(1/\varepsilon)\log(1/\delta)\big)$ that approximates $\mathrm{sign}$ on $E_{\delta}$ to uniform error $\varepsilon$. For deep-layer networks, there is a constant-width rational network of depth $\mathcal{O}\big(\log\log(1/\varepsilon)+\log\log(1/\delta)\big)$ that approximates $\mathrm{sign}$ on $E_{\delta}$ to uniform error $\varepsilon$.
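
Two quick numerical illustrations may help here. The first function inverts the bound of Lemma 1 to find the smallest $n$ that guarantees a given error. The second composes a low-degree rational map repeatedly to approximate $\mathrm{sign}$ on $E_\delta$; note that it uses the classical Newton iteration $x\mapsto(x^{2}+1)/(2x)$ as a stand-in, not the Zolotarev construction of the lemma, and its depth grows like $\log(1/\delta)+\log\log(1/\varepsilon)$ rather than attaining the lemma's $\log\log(1/\delta)$ dependence.

```python
import math
import numpy as np

def lemma1_bound(n, delta):
    """Right-hand side of Lemma 1: 4 * exp(-pi^2 * n / log(4 / delta))."""
    return 4.0 * math.exp(-math.pi ** 2 * n / math.log(4.0 / delta))

def lemma1_degree(eps, delta):
    """Smallest n >= 1 for which the Lemma 1 bound is at most eps."""
    n = math.log(4.0 / eps) * math.log(4.0 / delta) / math.pi ** 2
    return max(1, math.ceil(n))

def newton_sign(x, depth):
    """Compose x -> (x^2 + 1) / (2x) `depth` times (illustrative stand-in only)."""
    y = np.asarray(x, dtype=float)
    for _ in range(depth):
        y = (y * y + 1.0) / (2.0 * y)
    return y

# Degree needed for eps = 1e-6 at margin delta = 0.01.
n_needed = lemma1_degree(1e-6, 0.01)

# Uniform error of the composed map on E_delta for delta = 0.1.
delta = 0.1
grid = np.concatenate([np.linspace(-1.0, -delta, 500), np.linspace(delta, 1.0, 500)])
err = float(np.max(np.abs(newton_sign(grid, depth=6) - np.sign(grid))))
```

For $\varepsilon=10^{-6}$ and $\delta=0.01$ the bound gives $n=10$, which shows how mild the product $\log(1/\varepsilon)\log(1/\delta)$ is in practice.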

This lemma enables two complementary rational implementations of the activation gates below: a direct rational gate of the stated size for the shallow SAE encoder, and a constant-width deep rational realization as considered in previous rational-network approximation results for continuous activation functions such as ReLU (Boullé et al., [2020](https://arxiv.org/html/2606.14990#bib.bib1)) and GeLU (Tang and Townsend, [2026](https://arxiv.org/html/2606.14990#bib.bib34)). To analyse JumpReLU and TopK, we write the three gates in a common sign form:

$$\text{(ReLU)}\qquad \bm{z}_{\mathrm{R}}(\bm{h})=\mathrm{ReLU}(\bm{h})=\bm{h}\odot H(\bm{h})=\bm{h}\odot\frac{\mathrm{sign}(\bm{h})+1}{2}, \tag{4}$$

$$\text{(JumpReLU)}\qquad \bm{z}_{\mathrm{J}}(\bm{h})=\mathrm{JumpReLU}(\bm{h})=\bm{h}\odot H(\bm{h}-\bm{\theta})=\bm{h}\odot\frac{\mathrm{sign}(\bm{h}-\bm{\theta})+1}{2}, \tag{5}$$

$$\text{(supplied-threshold TopK gate)}\qquad \bm{z}_{\mathrm{T}}(\bm{h};\tau_{k})=\mathrm{TopK}(\bm{h};\tau_{k})=\bm{h}\odot\frac{\mathrm{sign}(\bm{h}-\tau_{k})+1}{2}, \tag{6}$$

where $H$ is the Heaviside function and $\bm{\theta}\in(\mathbb{R}^{+})^{d_{\mathrm{sae}}}$ is the per-feature threshold. For TopK, let $h_{(1)}\geq\cdots\geq h_{(d_{\mathrm{sae}})}$ denote the sorted pre-activations, with $1\leq k<d_{\mathrm{sae}}$. We use $\tau_{k}$ for a supplied separating threshold satisfying $h_{(k+1)}<\tau_{k}<h_{(k)}$; under margin $\delta$, this means $h_{(k+1)}+\delta\leq\tau_{k}\leq h_{(k)}-\delta$. Thus $\tau_{k}$ is not the literal $k$-th largest entry, which would lie on the discontinuity. Accordingly, $\bm{z}_{\mathrm{T}}(\bm{h};\tau_{k})$ is the thresholded gate equivalent to TopK conditional on the supplied threshold; our rational approximation result does not include the order-statistic computation that obtains $\tau_{k}$ from $\bm{h}$.
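
The sign form shared by equations (4)-(6) can be checked on a small example. In the sketch below, `sign_gate` implements $\bm{h}\odot(\mathrm{sign}(\bm{h}-\tau)+1)/2$, and the supplied threshold is taken as the midpoint of $h_{(k)}$ and $h_{(k+1)}$; the helper names and the example vector are ours.

```python
import numpy as np

def sign_gate(h, tau):
    """h * (sign(h - tau) + 1) / 2, the common form of equations (4)-(6).

    Exactly at h == tau, np.sign returns 0 and the gate gives h / 2, so the
    identity with the Heaviside form is only used away from the jump.
    """
    return h * (np.sign(h - tau) + 1.0) / 2.0

def hard_topk(h, k):
    """Reference TopK for a 1-D vector: keep the k largest entries."""
    out = np.zeros_like(h)
    idx = np.argsort(h)[-k:]
    out[idx] = h[idx]
    return out

h = np.array([0.8, -0.2, 0.5, 0.1, 0.3])
k = 2

# Supplied separating threshold: midpoint between h_(k) and h_(k+1).
h_sorted = np.sort(h)[::-1]
tau_k = 0.5 * (h_sorted[k - 1] + h_sorted[k])

z_gate = sign_gate(h, tau_k)   # thresholded gate with supplied tau_k
z_topk = hard_topk(h, k)       # full TopK, which needs the sort
z_relu = sign_gate(h, 0.0)     # equation (4) as the special case tau = 0
```

Here `z_gate` and `z_topk` coincide because $\tau_k$ separates the $k$-th and $(k+1)$-st order statistics; the sort used to obtain `tau_k` is exactly the order-statistic step that the approximation results leave outside the rational gate.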

For the ReLU activation, we use the approximation theorem of Boullé et al. ([2020](https://arxiv.org/html/2606.14990#bib.bib1)):

###### Theorem 2 (Rational approximation of ReLU; Boullé et al. ([2020](https://arxiv.org/html/2606.14990#bib.bib1))).

For every $0<\varepsilon<1$, there exists a scalar rational function $R_{\varepsilon}:[-1,1]\to[-1,1]$ of size

$$\mathcal{O}\!\Big(\log^{2}(1/\varepsilon)\Big),$$

such that

$$\sup_{x\in[-1,1]}\big|R_{\varepsilon}(x)-\mathrm{ReLU}(x)\big|\;\leq\;\varepsilon.$$

Consequently, the ReLU activation block can be replaced in either of two implementations. First, $R_{\varepsilon}$ can be applied coordinatewise as a trainable rational activation, with scalar size $\mathcal{O}\!\big(\log^{2}(1/\varepsilon)\big)$. Second, the same scalar map can be realized by a constant-width deep rational network of internal depth

$$M_{R}\;=\;\mathcal{O}\!\Big(\log\log(1/\varepsilon)\Big).$$

Under either implementation, the resulting activation block $\mathcal{R}_{R}:[-1,1]^{d_{\mathrm{sae}}}\to\mathbb{R}^{d_{\mathrm{sae}}}$ satisfies

$$\sup_{\bm{h}\in[-1,1]^{d_{\mathrm{sae}}}}\big\lVert\mathcal{R}_{R}(\bm{h})-\bm{z}_{\mathrm{R}}(\bm{h})\big\rVert_{\infty}\;\leq\;\varepsilon.$$

JumpReLU is discontinuous at $h_{i}=\theta_{i}$, so uniform approximation is only meaningful on a domain bounded away from the jump. We therefore fix a *margin* $\delta>0$ and define $\Omega_{\delta}:=\{\bm{h}\in[-1,1]^{d_{\mathrm{sae}}}:|h_{i}-\theta_{i}|\geq\delta,\;\forall i\}$. (Footnote 1: This is the standard domain restriction needed for uniform approximation of a discontinuous threshold map: without excluding a $\delta$-neighbourhood of the jump, no continuous or rational approximant can achieve arbitrarily small uniform error. In applications, $\delta$ should therefore be interpreted as a lower bound on the threshold margin of the pre-activations under consideration.) We then have the approximation results on $\Omega_{\delta}$:

###### Theorem 3 (Rational approximation of JumpReLU).

For every $0<\varepsilon<1$, the JumpReLU activation block on $\Omega_{\delta}$ can be replaced in either of two implementations. First, each coordinate map can be implemented directly as a trainable scalar rational activation of size

$$\mathcal{O}\!\Big(\log(1/\varepsilon)\log(1/\delta)\Big),$$

with constants depending only on the fixed threshold scale. Second, each coordinate map can be realized by a constant-width deep rational network of internal depth

$$M_{J}\;=\;\mathcal{O}\!\Big(\log\log(1/\varepsilon)+\log\log(1/\delta)\Big)$$

and per-coordinate size $\mathcal{O}(M_{J})$. Under either implementation, the resulting activation block $\mathcal{R}_{J}:[-1,1]^{d_{\mathrm{sae}}}\to\mathbb{R}^{d_{\mathrm{sae}}}$ satisfies

$$\sup_{\bm{h}\in\Omega_{\delta}}\big\lVert\mathcal{R}_{J}(\bm{h})-\bm{z}_{\mathrm{J}}(\bm{h})\big\rVert_{\infty}\;\leq\;\varepsilon.$$
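
To see where the margin and the factor $\bm{h}$ enter, here is a short sketch of how Lemma 1 can yield the direct rational gate. It is our own reconstruction under the assumption $\theta_{i}\in(0,1]$ (so that $|h_{i}-\theta_{i}|\leq 2$ on $[-1,1]$), with the rescaling by $1/2$ chosen for illustration; the authors' proof in the Appendix may proceed differently.

$$
\begin{aligned}
u_{i} &:= \tfrac{1}{2}(h_{i}-\theta_{i})\in E_{\delta/2}\quad\text{for }\bm{h}\in\Omega_{\delta},\qquad \mathrm{sign}(u_{i})=\mathrm{sign}(h_{i}-\theta_{i}),\\
\mathcal{R}_{J,i}(h_{i}) &:= h_{i}\cdot\frac{s_{n,\delta/2}(u_{i})+1}{2},\\
\bigl|\mathcal{R}_{J,i}(h_{i})-z_{\mathrm{J},i}(h_{i})\bigr| &= \frac{|h_{i}|}{2}\,\bigl|s_{n,\delta/2}(u_{i})-\mathrm{sign}(u_{i})\bigr|\;\leq\;2\exp\!\Bigl(-\frac{\pi^{2}n}{\log(8/\delta)}\Bigr).
\end{aligned}
$$

The last step uses $|h_{i}|\leq 1$ and Lemma 1 with margin $\delta/2$. Choosing $n\geq\log(2/\varepsilon)\log(8/\delta)/\pi^{2}$ makes the right-hand side at most $\varepsilon$, which is consistent with the $\mathcal{O}(\log(1/\varepsilon)\log(1/\delta))$ size stated in the theorem.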

For TopK,fix1≤k<dsae1\\leq k<d\_\{\\mathrm\{sae\}\}andconsider a simplified setting in which the pretrained teacher supplies the scalar thresholdτk\\tau\_\{k\}that determines the active support\.Hereτk\\tau\_\{k\}is a separating threshold between thekk\-th and\(k\+1\)\(k\+1\)\-st order statistics, not thekk\-th activation itself; for example, one may takeτk=\(h\(k\)\+h\(k\+1\)\)/2\\tau\_\{k\}=\(h\_\{\(k\)\}\+h\_\{\(k\+1\)\}\)/2whenh\(k\)−h\(k\+1\)≥2​δh\_\{\(k\)\}\-h\_\{\(k\+1\)\}\\geq 2\\delta\.The result below therefore approximates the thresholded gate equivalent to TopK conditional on a supplied threshold, not the full TopK operator itself\. Similar to the JumpReLU case, we require a margin\-separated domainwith sorted coordinatesh\(1\)≥⋯≥h\(dsae\)h\_\{\(1\)\}\\geq\\cdots\\geq h\_\{\(d\_\{\\mathrm\{sae\}\}\)\},

ΩδT:=\{\(𝒉,τk\)∈\[−1,1\]dsae×\[−1,1\]:h\(k\)−τk≥δ,τk−h\(k\+1\)≥δ\},\{\\color\[rgb\]\{0,0,0\}\\Omega^\{\\mathrm\{T\}\}\_\{\\delta\}:=\\big\\\{\(\\bm\{h\},\\tau\_\{k\}\)\\in\[\-1,1\]^\{d\_\{\\mathrm\{sae\}\}\}\\times\[\-1,1\]:h\_\{\(k\)\}\-\\tau\_\{k\}\\geq\\delta,\\;\\tau\_\{k\}\-h\_\{\(k\+1\)\}\\geq\\delta\\big\\\},\}and obtain:

###### Theorem 4 (Rational approximation of supplied-threshold TopK gate).

Suppose the scalar threshold $\tau_k\in[-1,1]$ is supplied together with each pre-activation vector and satisfies the separating margin condition above. For every $0<\varepsilon<1$, the supplied-threshold TopK gate on $\Omega^{\mathrm{T}}_\delta$ can be replaced in either of two implementations. First, each coordinate map can be implemented directly as a trainable scalar rational activation of size

$$\mathcal{O}\big(\log(1/\varepsilon)\,\log(1/\delta)\big).$$

Second, each coordinate map can be realized by a constant-width deep rational network of internal depth

$$M_T \;=\; \mathcal{O}\big(\log\log(1/\varepsilon)+\log\log(1/\delta)\big).$$

Under either implementation, the resulting network $\mathcal{R}_T:[-1,1]^{d_{\mathrm{sae}}+1}\to\mathbb{R}^{d_{\mathrm{sae}}}$ satisfies

$$\sup_{(\bm{h},\tau_k)\in\Omega^{\mathrm{T}}_\delta}\big\lVert\mathcal{R}_T(\bm{h},\tau_k)-\bm{z}_{\mathrm{T}}(\bm{h};\tau_k)\big\rVert_\infty \;\leq\; \varepsilon.$$
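The distinction between the full TopK operator and the supplied-threshold gate can be made concrete with a small NumPy sketch. It assumes that the gate passes a coordinate through unchanged when it exceeds the threshold, and that the threshold is the midpoint between the $k$-th and $(k+1)$-st order statistics; the function names and the example vector are illustrative and do not come from the paper.

```python
import numpy as np

def topk_activation(h, k):
    """Keep the k largest entries of h and zero out the rest."""
    out = np.zeros_like(h)
    idx = np.argsort(h)[-k:]            # indices of the k largest entries
    out[idx] = h[idx]
    return out

def separating_threshold(h, k):
    """Midpoint between the k-th and (k+1)-st largest entries of h."""
    s = np.sort(h)[::-1]                # order statistics, descending
    return 0.5 * (s[k - 1] + s[k])

def thresholded_gate(h, tau):
    """Supplied-threshold gate (assumed form): pass h_i when h_i > tau."""
    return np.where(h > tau, h, 0.0)

h = np.array([0.9, -0.3, 0.4, 0.1, -0.8])   # illustrative pre-activations
k = 2
tau = separating_threshold(h, k)            # midpoint of 0.4 and 0.1
gate_out = thresholded_gate(h, tau)
topk_out = topk_activation(h, k)
```

Once the threshold is supplied, the gate acts coordinate-wise, which is the property that allows a scalar rational activation to stand in for it; the sorting needed to find the threshold stays outside the approximated map.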

We now turn to the converse question of approximating rational functions:

###### Theorem 5 (Lower bound for ReLU/JumpReLU/TopK networks).

Fix $\eta\in(0,1/2)$ and define the rational target

$$\mathcal{R}^{\star}_{\eta}(x) := \frac{\eta^{2}}{x^{2}+\eta^{2}},\qquad x\in[-1,1].$$

This target satisfies $\mathcal{R}^{\star}_{\eta}:[-1,1]\to[0,1]$ and can be realized with $\mathcal{O}(1)$ rational parameters. Then any scalar map $\mathcal{S}:[-1,1]\to[0,1]$ realized by a ReLU/JumpReLU/supplied-threshold TopK network and satisfying

$$\lVert\mathcal{S}-\mathcal{R}^{\star}_{\eta}\rVert_{L^{\infty}([-1,1])}\leq\varepsilon$$

must satisfy $P=\Omega(\log(1/\varepsilon))$, where $P$ is the number of trainable parameters. If $\mathcal{S}$ is realized by the scalar-output version of the single-layer encoder architecture in ([1](https://arxiv.org/html/2606.14990#S2.E1)) with $N$ activated coordinates, then it must satisfy $N=\Omega(\varepsilon^{-1/2})$.

Together, Lemma [1](https://arxiv.org/html/2606.14990#Thmtheorem1) and Theorems [2](https://arxiv.org/html/2606.14990#Thmtheorem2)–[5](https://arxiv.org/html/2606.14990#Thmtheorem5) show an expressive asymmetry in the SAE encoder setting: trainable rational activations give compact approximations to the fixed gates used by current SAEs, whereas scalar-output single-layer piecewise-affine encoders can require polynomially many activated coordinates for simple rational targets. This supports the expectation that RSAEs can provide better reconstruction fidelity at matched sparsity.
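The $\varepsilon^{-1/2}$ rate can be illustrated numerically. The sketch below is not the lower-bound argument; it only measures how one particular piecewise-affine approximant, linear interpolation on uniformly spaced knots, behaves on the target $\eta^{2}/(x^{2}+\eta^{2})$. The value $\eta=0.25$, the knot counts, and the function names are illustrative assumptions.

```python
import numpy as np

def lorentzian(x, eta):
    """Rational target eta^2 / (x^2 + eta^2) from Theorem 5."""
    return eta**2 / (x**2 + eta**2)

def pwl_sup_error(num_knots, eta, num_eval=20001):
    """Sup-norm error of piecewise-linear interpolation on uniform knots."""
    knots = np.linspace(-1.0, 1.0, num_knots)
    x = np.linspace(-1.0, 1.0, num_eval)
    approx = np.interp(x, knots, lorentzian(knots, eta))
    return float(np.max(np.abs(approx - lorentzian(x, eta))))

eta = 0.25                                   # illustrative value in (0, 1/2)
errors = {n: pwl_sup_error(n, eta) for n in (33, 65, 129)}
ratio = errors[65] / errors[129]             # roughly 4 if error ~ N^(-2)
```

Doubling the number of knots reduces the error by roughly a factor of four in this setting, so reaching accuracy $\varepsilon$ with this construction needs on the order of $\varepsilon^{-1/2}$ linear pieces, which matches the rate stated in the theorem.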

## 4 Practical Algorithm

Our analysis indicates that, given a teacher model and (for TopK) a supplied threshold, a rational activation can approximate the teacher's gate compactly. The RSAE is then constructed in two steps: (i) an *initialization procedure* that produces high-quality rational coefficients $(\bm{a},\bm{b})$ and learnable scales $(C_{\mathrm{in}},C_{\mathrm{out}})$ by first fitting a rational function on a bounded interval of synthetic data and then adapting the rational activation to the teacher SAE’s pre-activation distribution; and (ii) a *fine-tuning procedure* that jointly optimises all parameters, including the encoder and decoder weights, under the standard $\ell_{1}$-regularised reconstruction objective.

Step 1: RSAE Initialization Procedure. Let $\phi^{\text{teacher}}\in\{\mathrm{ReLU},\,\mathrm{JumpReLU}_{\theta},\,\mathrm{TopK}_{k}\}$ denote any of the activation primitives used by the baseline SAE families considered here, and let $\{(t_{\ell},y_{\ell})\}_{\ell=1}^{N}$ be a uniform dense grid on $[-1,1]$ with $y_{\ell}=\phi^{\text{teacher}}(t_{\ell})$. We first fit a rational function on this bounded interval to obtain coefficients that approximate the teacher activation to high accuracy. To this end, we employ the *relaxed Remez exchange* of Chen et al. ([2018](https://arxiv.org/html/2606.14990#bib.bib30)), an iterative procedure that alternates a linearised coefficient solve with a node-exchange step until the residual equioscillates, to solve the min–max objective

$$(\bm{a}^{\ast},\bm{b}^{\ast}) \;=\; \arg\min_{\bm{a},\bm{b}}\;\max_{t\in[-1,1]}\;\bigl|\,r_{(\bm{a},\bm{b})}(t)-\phi^{\text{teacher}}(t)\,\bigr|. \tag{7}$$

The full algorithmic details, including the linearised system ([10](https://arxiv.org/html/2606.14990#A2.E10)) solved at each outer iteration, are deferred to Appendix [B.2](https://arxiv.org/html/2606.14990#A2.SS2). Remez returns the standard-Padé coefficients used directly in ([3](https://arxiv.org/html/2606.14990#S3.E3)); this synthetic fitting is performed once per teacher activation and the resulting coefficients $(\bm{a}^{\ast},\bm{b}^{\ast})$ can be tabulated for reuse.
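To make the grid version of the min–max objective concrete, the sketch below evaluates it for a single fixed candidate rather than optimising it. The candidate is not produced by the relaxed Remez exchange: as a stand-in it uses Newman's classical explicit rational approximant of $|x|$, combined with the identity $\mathrm{ReLU}(x)=(x+|x|)/2$. The node count `n = 9` and all function names are illustrative assumptions.

```python
import numpy as np

def newman_abs(x, n):
    """Newman's rational approximant of |x| on [-1, 1] with n nodes."""
    xi = np.exp(-1.0 / np.sqrt(n))
    nodes = xi ** np.arange(n)                              # xi^0 .. xi^(n-1)
    p_pos = np.prod(x[:, None] + nodes[None, :], axis=1)    # p(x)
    p_neg = np.prod(-x[:, None] + nodes[None, :], axis=1)   # p(-x)
    return x * (p_pos - p_neg) / (p_pos + p_neg)

def rational_relu(x, n):
    """ReLU(x) = (x + |x|) / 2 with |x| replaced by Newman's rational."""
    return 0.5 * (x + newman_abs(x, n))

# Uniform dense grid (t_l, y_l) with the ReLU teacher, as described above.
t = np.linspace(-1.0, 1.0, 4001)
y = np.maximum(t, 0.0)

# Grid version of the inner max in (7) for this one fixed candidate.
n = 9
sup_error = float(np.max(np.abs(rational_relu(t, n) - y)))
```

A Remez-type procedure would search over the coefficients to drive this quantity down; the sketch only shows what is being measured on the grid.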

Given a pre-trained baseline SAE with weights $\{\widetilde{\bm{W}_{\text{enc}}},\widetilde{\bm{b}_{\text{enc}}},\widetilde{\bm{W}_{\text{dec}}},\widetilde{\bm{b}_{\text{dec}}}\}$ and the rational coefficients $(\bm{a}^{\ast},\bm{b}^{\ast})$, we then adapt the rational activation to the teacher’s pre-activation distribution by minimising

$$(\widetilde{\bm{a}},\widetilde{\bm{b}},\widetilde{C}_{\mathrm{in}},\widetilde{C}_{\mathrm{out}}) \;=\; \arg\min_{\bm{a},\bm{b},C_{\mathrm{in}},C_{\mathrm{out}}}\;\bigl\|\phi(\bm{h};\,\bm{a},\bm{b},C_{\mathrm{in}},C_{\mathrm{out}})-\phi^{\mathrm{teacher}}(\bm{h})\bigr\|_{2}^{2}, \tag{8}$$

where $\phi(\bm{h};\,\bm{a},\bm{b},C_{\mathrm{in}},C_{\mathrm{out}})$ is computed via ([3](https://arxiv.org/html/2606.14990#S3.E3)) and $\bm{h}=\widetilde{\bm{W}_{\text{enc}}}\,(\bm{x}-\widetilde{\bm{b}_{\text{dec}}})+\widetilde{\bm{b}_{\text{enc}}}$ is the teacher’s pre-activation. Combined with the inherited encoder and decoder weights, the learned coefficients and scales allow the RSAE to approximately reproduce the teacher’s output, up to the Step 1 approximation error.
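The calibration loss in (8) can be sketched as follows. Equation (3) is not reproduced in this section, so the form used here is an assumption: the input is divided by $C_{\mathrm{in}}$, passed through a standard-Padé rational with denominator $1+\sum_j b_j t^j$, and the output is multiplied by $C_{\mathrm{out}}$. The function names and the toy inputs are hypothetical.

```python
import numpy as np

def pade(t, a, b):
    """Standard-Pade rational P(t) / Q(t), with Q(t) = 1 + sum_j b_j t^j."""
    num = sum(a_i * t**i for i, a_i in enumerate(a))
    den = 1.0 + sum(b_j * t**(j + 1) for j, b_j in enumerate(b))
    return num / den

def rational_activation(h, a, b, c_in, c_out):
    """ASSUMED form of Eq. (3): rescale input, apply rational, rescale output."""
    return c_out * pade(h / c_in, a, b)

def calibration_loss(h, a, b, c_in, c_out, teacher):
    """Squared L2 gap between rational and teacher activations, as in (8)."""
    diff = rational_activation(h, a, b, c_in, c_out) - teacher(h)
    return float(np.sum(diff**2))

relu = lambda h: np.maximum(h, 0.0)
h = np.array([-1.0, 2.0])                    # toy teacher pre-activations
# With a = [0, 1], b = [] the rational is the identity, so phi(h) = h here.
loss = calibration_loss(h, a=[0.0, 1.0], b=[], c_in=2.0, c_out=2.0, teacher=relu)
```

In the adaptation step this loss would be minimised over the coefficients and both scales with a gradient-based optimiser, starting from the Remez coefficients; the sketch only evaluates it.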

Step 2: RSAE Fine-Tuning Procedure. We initialise the RSAE encoder and decoder with the teacher weights $\{\widetilde{\bm{W}_{\text{enc}}},\widetilde{\bm{b}_{\text{enc}}},\widetilde{\bm{W}_{\text{dec}}},\widetilde{\bm{b}_{\text{dec}}}\}$, and the rational activation with the learned coefficients $(\widetilde{\bm{a}},\widetilde{\bm{b}})$ and scales $(\widetilde{C}_{\mathrm{in}},\widetilde{C}_{\mathrm{out}})$. We then unfreeze all parameters $\Theta\coloneqq\{\bm{W}_{\text{enc}},\,\bm{b}_{\text{enc}},\,\bm{W}_{\text{dec}},\,\bm{b}_{\text{dec}},\,\bm{a},\,\bm{b},\,C_{\mathrm{in}},\,C_{\mathrm{out}}\}$ and minimise the $\ell_{1}$-regularised objective

$$\min_{\Theta}\;\;\mathbb{E}_{\bm{x}\sim\mathcal{D}}\Bigl[\,\bigl\|\bm{x}-\hat{\bm{x}}(\bm{x};\,\Theta)\bigr\|_{2}^{2}\;+\;\lambda\,\bigl\|\bm{z}(\bm{x};\,\Theta)\bigr\|_{1}\,\Bigr]. \tag{9}$$
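A minimal NumPy sketch of the objective in (9) on a batch is given below. It uses the pre-activation $\bm{h}=\bm{W}_{\text{enc}}(\bm{x}-\bm{b}_{\text{dec}})+\bm{b}_{\text{enc}}$ stated above; the linear decoder $\hat{\bm{x}}=\bm{W}_{\text{dec}}\bm{z}+\bm{b}_{\text{dec}}$ is an assumption, since the encoder architecture in (1) is not reproduced here. The function name and the toy weights are hypothetical, and a ReLU stands in for the rational activation.

```python
import numpy as np

def rsae_loss(x, w_enc, b_enc, w_dec, b_dec, activation, lam):
    """Batch estimate of the l1-regularised reconstruction objective (9)."""
    h = (x - b_dec) @ w_enc.T + b_enc        # pre-activations, (batch, d_sae)
    z = activation(h)                        # latent code
    x_hat = z @ w_dec.T + b_dec              # ASSUMED linear decoder
    recon = np.sum((x - x_hat) ** 2, axis=1)
    sparsity = np.sum(np.abs(z), axis=1)
    return float(np.mean(recon + lam * sparsity))

# Toy check with identity weights and zero biases.
x = np.array([[1.0, -1.0]])
eye, zero = np.eye(2), np.zeros(2)
loss = rsae_loss(x, eye, zero, eye, zero, lambda h: np.maximum(h, 0.0), lam=0.5)
```

In the toy case the code is $[1, 0]$, the reconstruction error is $1$ and the $\ell_1$ term is $1$, so the loss equals $1+\lambda$.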

## 5 Empirical Results

### 5.1 Rational Function Approximation Performance on Synthetic Data

Setup. We evaluate the three rational coefficient fitting procedures of §[4](https://arxiv.org/html/2606.14990#S4) on the activation primitives used by the baseline SAE families considered here. Each fitting procedure is run on a uniform, dense grid of $N=4001$ points over $[-1,1]$ with target activation functions $\mathrm{ReLU}$ and $\mathrm{JumpReLU}$. We report the mean-squared error (MSE) of the fitted rational functions. In particular, we run the relaxed Remez exchange in both the standard-Padé form (identical to the original formulation) and the safe-Padé form. We additionally fit the safe-Padé coefficients directly under the $L^{2}$ and smoothed-$L^{\infty}$ surrogates as baselines. We deliberately omit $\mathrm{TopK}$ from the synthetic study because, conditioned on a given input batch, $\mathrm{TopK}$ is pointwise equivalent to a supplied-threshold JumpReLU-type gate with a sample-dependent separating threshold $\tau_{k}$ between the $k$-th and $(k+1)$-st largest pre-activations, e.g. their midpoint when the TopK gap is positive. Therefore any rational that approximates the $\mathrm{JumpReLU}$ family uniformly over the threshold also approximates $\mathrm{TopK}$ on the corresponding batch.

![Refer to caption](https://arxiv.org/html/2606.14990v1/x1.png)

Figure 1: Rational approximation of SAE activation primitives (ReLU and JumpReLU) on $[-1,1]$. Best-MSE rational fits of $\mathrm{ReLU}$ (panels (a) and (b)) and $\mathrm{JumpReLU}$ with $\theta=0.1$ (panels (c) and (d)) under three procedures: the relaxed Remez exchange (red for standard-Padé and purple for safe-Padé), the $L^{2}$ fit (blue), and the smoothed $L^{\infty}$ fit (green). Panels (b) and (d) zoom into the kink and the jump, respectively. Each curve uses the optimal $(p,q)$ for its procedure. Across panels (a) to (d), the Remez fit is visually indistinguishable from the teacher activations, consistent with the universal-approximation claim.

Approximation precision. Figure [1](https://arxiv.org/html/2606.14990#S5.F1) shows that the Remez procedure fits both $\mathrm{ReLU}$ and $\mathrm{JumpReLU}$ to high precision on $[-1,1]$: even in the immediate neighbourhood of the kink (Figure [1](https://arxiv.org/html/2606.14990#S5.F1)(b)) and the jump (Figure [1](https://arxiv.org/html/2606.14990#S5.F1)(d)), the fits are visually indistinguishable from the teacher. Remez reaches a low MSE of $3.8\times 10^{-7}$ on $\mathrm{ReLU}$ at type $(15,14)$ and $2.4\times 10^{-6}$ on $\mathrm{JumpReLU}$ with $\theta=0.1$ at type $(19,18)$, outperforming the $L^{2}$ and $L^{\infty}$ approaches. While Figure [1](https://arxiv.org/html/2606.14990#S5.F1)(c) shows the fit for JumpReLU with $\theta=0.1$, we additionally perform ablation studies on JumpReLU with larger discontinuities $\theta\in\{0.2,\ldots,0.5\}$ (Table [7](https://arxiv.org/html/2606.14990#A2.T7), Appendix [B.3](https://arxiv.org/html/2606.14990#A2.SS3)); fitting performance remains consistent across discontinuities.

Choice of degrees. To choose a suitable degree for our algorithm, we perform an ablation over $(p,q)$ for all four procedures; the results are reported in Figure [3](https://arxiv.org/html/2606.14990#A2.F3) (Appendix [B.3](https://arxiv.org/html/2606.14990#A2.SS3)), which plots MSE against the numerator degree $p$. In general, Remez first exhibits the expected near-exponential decay; numerical conditioning of the linearised system ([10](https://arxiv.org/html/2606.14990#A2.E10)) then dominates, and the curve flattens or oscillates. Empirically, a single low-degree rational ($(3,2)$ for ReLU, $(9,8)$ for JumpReLU and TopK) is expressive enough to reproduce every activation primitive used by current SAE baselines to within numerical precision.

### 5.2 Rational SAE Performance

Models, SAEs, and Evaluation Metrics. We evaluate the RSAE on residual-stream activations from three open-weight language models of various sizes: GPT-2 small, Pythia-160m-deduped, and Gemma-2-2B. Teacher SAEs are taken from publicly released checkpoints: for GPT-2 small, the gpt2-small-res-jb release (ReLU) of Bloom ([2024](https://arxiv.org/html/2606.14990#bib.bib20)) and the OpenAI-v5 gpt2-small-resid-post-v5-32k release (TopK) of Gao et al. ([2025](https://arxiv.org/html/2606.14990#bib.bib10)); for Pythia-160m and Gemma-2-2B, the SAEBench checkpoints (Karvonen et al., [2025](https://arxiv.org/html/2606.14990#bib.bib22)) for ReLU, JumpReLU, and TopK. Following the standard protocol, we evaluate our RSAE against SAE baselines in terms of five metrics: (1) reconstruction MSE $\|\bm{x}-\hat{\bm{x}}\|_{F}^{2}$, (2) $\ell_{0}$ at the $|z|>10^{-6}$ threshold, (3) the fraction of alive latents, (4) the cross-entropy degradation when the SAE intercepts the residual stream, $\Delta\mathrm{CE}=\mathrm{CE}_{\hat{\bm{x}}}-\mathrm{CE}_{\bm{x}}$ (lower is better), and (5) the loss-recovered fraction $\mathrm{LR}=(\mathrm{CE}_{\text{zero}}-\mathrm{CE}_{\hat{\bm{x}}})/(\mathrm{CE}_{\text{zero}}-\mathrm{CE}_{\bm{x}})$ (higher is better). Please refer to Appendix [B.1](https://arxiv.org/html/2606.14990#A2.SS1) for more implementation details.
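The sparsity and downstream metrics listed above reduce to a few lines of NumPy. In this sketch, `z` is a hypothetical (samples × latents) matrix of codes, and a latent is counted as alive if it fires on at least one sample of the evaluation batch; that definition of "alive" is an assumption, as are the function names and the toy cross-entropy values.

```python
import numpy as np

def mean_l0(z, threshold=1e-6):
    """Average number of latents with |z| above the threshold per sample."""
    return float(np.mean(np.sum(np.abs(z) > threshold, axis=1)))

def alive_fraction(z, threshold=1e-6):
    """Fraction of latents firing on at least one sample (ASSUMED definition)."""
    return float(np.mean(np.any(np.abs(z) > threshold, axis=0)))

def delta_ce(ce_hat, ce_clean):
    """Cross-entropy degradation CE_xhat - CE_x (lower is better)."""
    return ce_hat - ce_clean

def loss_recovered(ce_zero, ce_hat, ce_clean):
    """(CE_zero - CE_xhat) / (CE_zero - CE_x) (higher is better)."""
    return (ce_zero - ce_hat) / (ce_zero - ce_clean)

# Toy codes: the entry 1e-7 falls below the 1e-6 threshold and is not counted.
z = np.array([[0.0, 1e-7, 2.0],
              [0.0, 0.0, 3.0]])
l0 = mean_l0(z)
alive = alive_fraction(z)
lr = loss_recovered(ce_zero=5.0, ce_hat=3.0, ce_clean=2.5)
```

A loss-recovered value of 1 means the spliced-in reconstruction costs nothing relative to the clean run, and 0 means it is no better than zero-ablating the residual stream.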

The experiments are organised around two claims: (C1) the rational activation, under the proposed initialization procedure in Algorithm [1](https://arxiv.org/html/2606.14990#alg1), approximately reproduces the baseline SAE teachers’ behaviour at initialisation; (C2) after a joint fine-tune, the RSAE improves on the teacher across the great majority of reconstruction- and downstream-behaviour metrics, uniformly across host language models and teacher activation families.

Table 1: Main results: teacher SAE, RSAE after initialization (RSAE init), and RSAE after fine-tuning on residual-stream activations of GPT-2 small, Pythia-160m, and Gemma-2-2B.

Table 2: Downstream behavior: cross-entropy degradation and loss recovered when the SAE intercepts the residual stream of GPT-2 small, Pythia-160m, and Gemma-2-2B.

(C1) Approximate reproduction at initialisation. Table [1](https://arxiv.org/html/2606.14990#S5.T1) establishes (C1): for every (model, teacher) pair, the *RSAE init* row closely tracks the teacher row, especially on reconstruction MSE, while small $\ell_{0}$ and alive-feature differences remain in some cases. This is the expected outcome of the RSAE initialization procedure: the rational activation approximately reproduces the teacher’s map from pre-activation $\bm{h}$ to activation $\bm{z}$, yielding similar SAE evaluations before fine-tuning.

(C2) Strict improvement after fine-tuning. After 2K Adam steps, the *RSAE* row beats the teacher across the great majority of cells in Tables [1](https://arxiv.org/html/2606.14990#S5.T1) and [2](https://arxiv.org/html/2606.14990#S5.T2): 22/24 reconstruction-axis cells in Table [1](https://arxiv.org/html/2606.14990#S5.T1) strictly improve over the teacher (the only exceptions are a 1.6-token regression in $\ell_{0}$ on ReLU/GPT-2 small and an alive tie at $99.9\%$ on TopK/Gemma-2-2B), and 13/16 downstream-axis cells in Table [2](https://arxiv.org/html/2606.14990#S5.T2) strictly improve (the only exceptions are a marginal regression on ReLU/Pythia-160m and a tied $\Delta\mathrm{CE}$ of $0.093$ on TopK/Gemma-2-2B). The wins hold uniformly across the three host language models, all three teacher activation families, and both reconstruction- and downstream-axis metrics, so the improvement is not an artefact of any single architecture, host model, or evaluation axis. Together, the (C1) approximate match at initialisation and the (C2) wins after fine-tuning are consistent with the shallow-encoder rational-vs-piecewise expressivity asymmetry of §[3](https://arxiv.org/html/2606.14990#S3): the rational activation can approximate every fixed-form teacher within a single low-degree family, and is then free to deviate from any of them in whatever direction lowers the regularised reconstruction loss on the host model’s actual activation distribution.

Table 3: Wall-clock runtime of the RSAE pipeline per model, averaged across baseline SAEs, measured on a single NVIDIA RTX 5090 (32 GB). The Init Procedure depends only on the host language model and is therefore identical across baselines (std $=0$); for the Finetune Procedure and Total we report mean $\pm$ std across baselines. The detailed per-(model, baseline) breakdown is given in Table [6](https://arxiv.org/html/2606.14990#A2.T6).

Runtime and scalability. Table [3](https://arxiv.org/html/2606.14990#S5.T3) reports the wall-clock cost of the RSAE training procedures and shows two properties of the method:

(i) Cheap approximate teacher reproduction. The Remez fit is a one-shot off-line procedure depending only on the target activation, so it can be tabulated once and reused across (model, teacher) pairs. The remaining model-specific 500-step adaptation to the teacher pre-activation spaces completes in 25 to 60 s, including Gemma-2-2B. Combined with verbatim weight inheritance, this lets RSAE approximately reproduce $\mathrm{ReLU}$, $\mathrm{JumpReLU}$, and $\mathrm{TopK}$ teachers under one framework.

(ii) Lightweight fine-tune overhead. Fine-tuning adds at most $(p+1)+q+2$ scalar parameters per RSAE, negligible relative to the $\mathcal{O}(d_{\mathrm{in}}d_{\mathrm{sae}})$ encoder/decoder parameters. The corresponding 2K-step fine-tune costs 80 to 295 s on the small models and $\sim$5 min on Gemma-2-2B, so upgrading a released teacher costs minutes, not hours, on a single RTX 5090.
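The parameter overhead can be checked with simple arithmetic. The sketch below counts the $(p+1)+q+2$ activation parameters for the degree pair $(9,8)$ mentioned in §5.1 and compares them with the weights and biases of a single-layer SAE. The dictionary size `d_sae = 32768` is an illustrative assumption and is not taken from the paper's tables; `d_in = 768` is the residual width of GPT-2 small.

```python
def rational_param_count(p, q):
    """(p + 1) numerator coefficients, q denominator coefficients, 2 scales."""
    return (p + 1) + q + 2

def encoder_decoder_param_count(d_in, d_sae):
    """Encoder and decoder weight matrices plus both bias vectors."""
    return 2 * d_in * d_sae + d_sae + d_in

extra = rational_param_count(9, 8)                       # degrees from §5.1
base = encoder_decoder_param_count(d_in=768, d_sae=32768)  # d_sae is assumed
overhead_fraction = extra / base
```

Under these assumptions the rational activation adds 20 scalars to a model with tens of millions of weights, which is the sense in which the overhead is negligible.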

Ablation studies on sparsity. We perform ablations along two complementary sparsity axes: the RSAE’s own $\ell_{1}$ coefficient $\lambda$ (Figure [2](https://arxiv.org/html/2606.14990#S5.F2)) and the teacher SAE’s training sparsity (Table [4](https://arxiv.org/html/2606.14990#S5.T4)). On the algorithm-side axis, sweeping $\lambda$ traces an RSAE Pareto curve that enters the strict-domination “sweet zone” against every teacher in Figure [2](https://arxiv.org/html/2606.14990#S5.F2), indicating that a small per-(model, teacher) tuning of $\lambda$ is sufficient to obtain better MSE and a higher alive-feature fraction at lower $\ell_{0}$ than the teacher. On the teacher-side axis, we re-run the pipeline against *all six* SAEBench ReLU trainers on Pythia-160m, which span a wide teacher sparsity range $\ell_{0}\in[51,683]$; the RSAE wins on reconstruction, alive, and $\mathrm{LR}$ on every trainer (6/6) and on $\ell_{0}$ on half of the trainers (3/6). The performance gain is therefore not specific to a particular teacher sparsity but is consistent across the full range we tested.
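The strict-domination criterion behind the "sweet zone" can be written as a small predicate: lower MSE, lower $\ell_0$, and a higher alive fraction than the teacher, all at once. The class name, function name, and the numbers in the example are hypothetical and are not values from the paper's tables.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SaePoint:
    mse: float      # reconstruction error (lower is better)
    l0: float       # average number of active latents (lower is better)
    alive: float    # fraction of alive latents (higher is better)

def strictly_dominates(candidate, teacher):
    """True if the candidate is strictly better on all three axes."""
    return (candidate.mse < teacher.mse
            and candidate.l0 < teacher.l0
            and candidate.alive > teacher.alive)

# Hypothetical operating points from a lambda sweep against one teacher.
teacher = SaePoint(mse=0.50, l0=80.0, alive=0.95)
sweep = [SaePoint(0.45, 90.0, 0.97),   # better MSE, but denser
         SaePoint(0.48, 75.0, 0.96)]   # better on all three axes
in_sweet_zone = [strictly_dominates(p, teacher) for p in sweep]
```

A sweep "enters the sweet zone" when at least one of its points satisfies the predicate.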

Table 4: Consistency of RSAE Pareto-domination across teacher sparsity. We run our pipeline against *all six* SAEBench ReLU trainers on Pythia-160m, which differ only in their training $\ell_{1}$ penalty and span teacher $\ell_{0}\in[51,683]$. “T” denotes the teacher SAE and “R” the RSAE. Recon, alive, and $\mathrm{LR}$ are won on *every* trainer (6/6); $\ell_{0}$ on 3/6.

Table 5: Sparse-probing interpretability on Pythia-160m ReLU teacher (trainer 3) and our RSAE distilled from it, on the full SAEBench panel of 8 binary-classification tasks. We report full-dictionary probe accuracy per dataset (all SAE features, no $k$-sparsity constraint); higher is better. Bold marks the better SAE per row; green marks rows where the RSAE improves over the teacher.

### 5.3 Interpretability via Sparse Probing

Table [5](https://arxiv.org/html/2606.14990#S5.T5) reports full-dictionary linear-probe accuracy on the eight-task SAEBench panel for the Pythia-160m ReLU teacher (trainer 3) and the RSAE initialized from it. The RSAE strictly improves on the teacher in 4/8 tasks (bias_in_bios_set1, bias_in_bios_set3, amazon_reviews, ag_news) and ties on 2/8 (amazon_sentiment, europarl), losing on the remaining two by margins of at most 0.14 percentage points (bias_in_bios_set2: $-0.08$; github-code: $-0.14$). Across the panel, the largest change in either direction is 0.28 percentage points (in the RSAE’s favour, on bias_in_bios_set3); all eight task-level deltas lie within a $\pm 0.3$ percentage-point band that is comparable to the seed-to-seed noise of the probe. The panel mean shifts slightly in the RSAE’s favour (93.71 vs. 93.69, $\Delta=+0.02$). Taken together, the table shows that initializing an RSAE from its teacher SAE does *not* degrade feature-level interpretability in the sparse-probing sense: the RSAE matches the teacher’s full-dictionary probe accuracy to within 0.3 percentage points on every task, and the substantial reconstruction-side and downstream-CE gains documented in Tables [1](https://arxiv.org/html/2606.14990#S5.T1) and [2](https://arxiv.org/html/2606.14990#S5.T2) are therefore obtained without trading off the dictionary’s ability to expose human-aligned, task-relevant features.

![Refer to caption](https://arxiv.org/html/2606.14990v1/x2.png)

Figure 2: Pareto fronts on Pythia-160m for all three baseline activation families. Subfigures (a), (b), (c) plot MSE vs. $\ell_{0}$, and subfigures (d), (e), (f) plot MSE vs. alive. Subfigures (a) and (d) use ReLU as the teacher, (b) and (e) use JumpReLU, and (c) and (f) use TopK. The black star is the teacher SAE; the red curve traces the RSAE Pareto front under a $\lambda$ sweep. The green sweet-zone marks the strict-Pareto-domination region. The RSAE curve enters the sweet zone for every architecture at every sparsity level.

## 6 Conclusion

We introduced the *Rational Sparse Autoencoder* (RSAE), an SAE whose encoder activation is a learnable rational function supported by approximation theory tailored to the shallow SAE encoder: trainable rational activations compactly approximate the fixed gates used by current SAE families, while scalar-output single-layer piecewise-affine encoders can require many more activated coordinates for some rational targets. The corresponding deep-network statements provide a complementary extension beyond the SAE architecture. When implemented as an upgrading strategy from existing pretrained SAEs across three host language models and three baseline activation families (ReLU, JumpReLU, TopK), our RSAE achieves better fidelity at comparable sparsity and strictly improves the baseline across both reconstruction-side metrics (MSE, $\ell_{0}$, alive-feature fraction) and downstream-behaviour metrics (cross-entropy degradation, loss recovered), and these gains hold uniformly across the full range of baseline sparsity we tested without sacrificing feature-level interpretability under sparse probing. Because the upgrade adds only a handful of scalar parameters per autoencoder and runs in minutes on a single consumer GPU, any released ReLU, JumpReLU, or TopK SAE can in principle be replaced by its RSAE counterpart at negligible cost.

Limitation. Our evaluation covers three open-weight LLMs and three activation families, so the gains on other models, including frontier-scale ones, remain unexplored. In addition, combining the rational activation with orthogonal architectural variants such as Gated, BatchTopK, and Matryoshka SAEs is an interesting direction that we have not studied.

Broader Impact\.As a drop\-in upgrade to any released SAE, the RSAE can strengthen interpretability\-based safety auditing or lower the cost of misuse\-relevant feature steering, depending on how it is deployed\.

## Acknowledgements

N\.Y\. and Y\.Y\. acknowledge support by the Defense Advanced Research Projects Agency \(DARPA\) under award HR00112590032\. This research was, in part, funded by the U\.S\. Government by an agreement with Cornell University\.

## References

- B. Beckermann and A. Townsend (2017) On the singular values of matrices with displacement structure. SIAM Journal on Matrix Analysis and Applications 38(4), pp. 1227–1248.
- J. Bloom (2024) Open source sparse autoencoders for all residual stream layers of GPT2-small. Note: [https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream](https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream). LessWrong / HuggingFace release: jbloom/GPT2-Small-SAEs-Reformatted.
- N. Boullé, Y. Nakatsukasa, and A. Townsend (2020) Rational neural networks. Advances in Neural Information Processing Systems 33, pp. 14243–14253.
- D. Braun, J. Taylor, N. Goldowsky-Dill, and L. Sharkey (2024) Identifying functionally important features with end-to-end sparse dictionary learning. Advances in Neural Information Processing Systems 37, pp. 107286–107325.
- T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. L. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah (2023) Towards monosemanticity: decomposing language models with dictionary learning. Transformer Circuits Thread. External Links: [Link](https://transformer-circuits.pub/2023/monosemantic-features).
- B. Bussmann, P. Leask, and N. Nanda (2024) BatchTopK sparse autoencoders. arXiv preprint arXiv:2412.06410.
- B. Bussmann, N. Nabeshima, A. Karvonen, and N. Nanda (2025) Learning multi-level features with Matryoshka sparse autoencoders. arXiv preprint arXiv:2503.17547.
- Z. Chen, F. Chen, R. Lai, X. Zhang, and C. Lu (2018) Rational neural networks for approximating jump discontinuities of graph convolution operator. arXiv preprint arXiv:1808.10073.
- Q. Delfosse, P. Schramowski, M. Mundt, A. Molina, and K. Kersting (2021) Adaptive rational activations to boost deep reinforcement learning. arXiv preprint arXiv:2102.09407. Note: Introduces the safe-Padé parameterisation. External Links: [Link](https://arxiv.org/abs/2102.09407).
- J. Dunefsky, P. Chlenski, and N. Nanda (2024) Transcoders find interpretable LLM feature circuits. Advances in Neural Information Processing Systems 37, pp. 24375–24410.
- L. Gao, T. Dupre la Tour, H. Tillman, G. Goh, R. Troll, A. Radford, I. Sutskever, J. Leike, and J. Wu (2025) Scaling and evaluating sparse autoencoders. In International Conference on Learning Representations, Vol. 2025, pp. 26721–26754.
- R. Huben, H. Cunningham, L. Smith, A. Ewart, and L. Sharkey (2024) Sparse autoencoders find highly interpretable features in language models. In International Conference on Learning Representations, Vol. 2024, pp. 7827–7845.
- A. Karvonen, C. Rager, J. Lin, C. Tigges, J. I. Bloom, D. Chanin, Y. Lau, E. Farrell, C. S. Mcdougall, K. Ayonrinde, et al. (2025) SAEBench: a comprehensive benchmark for sparse autoencoders in language model interpretability. In International Conference on Machine Learning, pp. 29223–29264.
- D. Laptev, N. Balagansky, Y. Aksenov, and D. Gavrilov (2025) Analyze feature flow to enhance interpretation and steering in language models. In International Conference on Machine Learning, pp. 32593–32616.
- A. Molina, P. Schramowski, and K. Kersting (2020) Padé activation units: end-to-end learning of flexible activation functions in deep networks. In International Conference on Learning Representations (ICLR). Note: arXiv:1907.06732.
- D. J. Newman (1979) Approximation with rational functions. American Mathematical Soc.
- S. Rajamanoharan, A. Conmy, L. Smith, T. Lieberum, V. Varma, J. Kramar, R. Shah, and N. Nanda (2024a) Improving sparse decomposition of language model activations with gated sparse autoencoders. Advances in Neural Information Processing Systems 37, pp. 775–818.
- S. Rajamanoharan, T. Lieberum, N. Sonnerat, A. Conmy, V. Varma, J. Kramár, and N. Nanda (2024b) Jumping ahead: improving reconstruction fidelity with JumpReLU sparse autoencoders. arXiv preprint arXiv:2407.14435.
- G. M. Taggart (2024) ProLU: a nonlinearity for sparse autoencoders. Note: [https://www.lesswrong.com/posts/HEpufTdakGTTKgoYF/prolu-a-pareto-improvement-for-all-sparse-autoencoders](https://www.lesswrong.com/posts/HEpufTdakGTTKgoYF/prolu-a-pareto-improvement-for-all-sparse-autoencoders). LessWrong post.
- M. Tang and A. Townsend (2026) Rational neural networks have expressivity advantages. arXiv preprint arXiv:2602.12390.
- M. Telgarsky (2016) Benefits of depth in neural networks. In Conference on Learning Theory, pp. 1517–1539.
- M. Telgarsky (2017) Neural networks and rational functions. In International Conference on Machine Learning, pp. 3387–3393.
- L. N. Trefethen (2019) Approximation theory and approximation practice, extended edition. SIAM.
- M. Trimmel, M. Zanfir, R. Hartley, and C. Sminchisescu (2022) ERA: enhanced rational activations. In European Conference on Computer Vision, pp. 722–738.

## Appendix A Detailed Proofs

###### Lemma 1 (Zolotarev; rational approximation of $\mathrm{sign}$).

For every $\delta\in(0,1)$ and $n\geq 1$ there is a type-$(2n+1,2n)$ rational $s_{n,\delta}$ such that $\sup_{x\in E_{\delta}}\big|\mathrm{sign}(x)-s_{n,\delta}(x)\big|\leq 4\exp\!\big(-\pi^{2}n/\log(4/\delta)\big)$.

Consequently, for every $0<\varepsilon<1$, there is a rational function of size $\mathcal{O}\big(\log(1/\varepsilon)\log(1/\delta)\big)$ that approximates $\mathrm{sign}$ on $E_{\delta}$ to uniform error $\varepsilon$. For deep-layer networks, there is a constant-width rational network of depth $\mathcal{O}\big(\log\log(1/\varepsilon)+\log\log(1/\delta)\big)$ that approximates $\mathrm{sign}$ on $E_{\delta}$ to uniform error $\varepsilon$.

###### Proof\.

The first part follows immediately from the classical Zolotarev bound in [Beckermann and Townsend, [2017](https://arxiv.org/html/2606.14990#bib.bib35)]. Namely, for any $n\geq 1$, the best type-$(2n+1,2n)$ rational satisfies

$$\min_{s_{n,\delta}}\sup_{x\in E_{\delta}}\big|\mathrm{sign}(x)-s_{n,\delta}(x)\big|\leq 4\exp\!\big(-\pi^{2}n/\log(4/\delta)\big).$$

To guarantee uniform error at most $\varepsilon\in(0,1)$ on $E_{\delta}=[-1,-\delta]\cup[\delta,1]$, it suffices to choose $n=n_{\varepsilon,\delta}$ so that

$$4\exp\!\big(-\pi^{2}n_{\varepsilon,\delta}/\log(4/\delta)\big)\leq\varepsilon.$$

Taking logarithms gives

$$\frac{\pi^{2}n_{\varepsilon,\delta}}{\log(4/\delta)}\geq\log(4/\varepsilon),$$

or equivalently

$$n_{\varepsilon,\delta}\geq\frac{\log(4/\delta)\,\log(4/\varepsilon)}{\pi^{2}}.$$

Thus one admissible choice is

$$n_{\varepsilon,\delta}:=\Big\lceil\frac{\log(4/\delta)\,\log(4/\varepsilon)}{\pi^{2}}\Big\rceil.$$

The rational $s_{n_{\varepsilon,\delta},\delta}$ has numerator and denominator degree $\mathcal{O}(n_{\varepsilon,\delta})$, and hence size

$$\mathcal{O}(n_{\varepsilon,\delta})=\mathcal{O}\!\big(\log(1/\varepsilon)\log(1/\delta)\big).$$

This gives the direct scalar rational approximation.

For the deep-layer implementation, standard realizability results for rational networks show that a scalar rational with numerator and denominator degree at most $N$ can be implemented by a constant-width rational network with depth $\mathcal{O}(\log N)$; see, for instance, the construction used by Boullé et al. [[2020](https://arxiv.org/html/2606.14990#bib.bib1)]. Applying this to $s_{n_{\varepsilon,\delta},\delta}$ gives depth

$$\mathcal{O}\!\big(\log n_{\varepsilon,\delta}\big)=\mathcal{O}\!\big(\log\log(1/\varepsilon)+\log\log(1/\delta)\big).$$

∎
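The degree selection in this proof is easy to check numerically. The following is a minimal sketch in Python; the function names `zolotarev_degree` and `zolotarev_bound` are illustrative and do not come from the paper. It evaluates the admissible choice $n_{\varepsilon,\delta}=\lceil\log(4/\delta)\log(4/\varepsilon)/\pi^{2}\rceil$ and the right-hand side of the Lemma 1 bound, so one can confirm that the bound drops below $\varepsilon$ at the chosen degree.

```python
import math


def zolotarev_degree(eps: float, delta: float) -> int:
    """Admissible degree parameter from the proof:
    n = ceil(log(4/delta) * log(4/eps) / pi^2)."""
    if not (0.0 < eps < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("eps and delta must lie in (0, 1)")
    return math.ceil(math.log(4.0 / delta) * math.log(4.0 / eps) / math.pi**2)


def zolotarev_bound(n: int, delta: float) -> float:
    """Right-hand side of the Lemma 1 bound: 4 * exp(-pi^2 * n / log(4/delta))."""
    return 4.0 * math.exp(-(math.pi**2) * n / math.log(4.0 / delta))


if __name__ == "__main__":
    for eps, delta in [(1e-2, 1e-1), (1e-4, 1e-2), (1e-8, 1e-3)]:
        n = zolotarev_degree(eps, delta)
        print(f"eps={eps:g} delta={delta:g} -> n={n}, bound={zolotarev_bound(n, delta):.3e}")
```

The degree grows like the product of the two logarithms, which is the $\mathcal{O}(\log(1/\varepsilon)\log(1/\delta))$ size stated in the lemma.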

###### Theorem 2 (Rational approximation of ReLU).

For every $0<\varepsilon<1$, there exists a scalar rational function $R_{\varepsilon}:[-1,1]\to[-1,1]$ of size

$$\mathcal{O}\!\Big(\log^{2}(1/\varepsilon)\Big),$$

such that

$$\sup_{x\in[-1,1]}\big|R_{\varepsilon}(x)-\mathrm{ReLU}(x)\big|\;\leq\;\varepsilon.$$

Consequently, the ReLU activation block can be replaced in either of two implementations. First, $R_{\varepsilon}$ can be applied coordinatewise as a trainable rational activation, with scalar size $\mathcal{O}\!\big(\log^{2}(1/\varepsilon)\big)$. Second, the same scalar map can be realized by a constant-width deep rational network of internal depth

$$M_{R}\;=\;\mathcal{O}\!\Big(\log\log(1/\varepsilon)\Big).$$

Under either implementation, the resulting activation block $\mathcal{R}_{R}:[-1,1]^{d_{\mathrm{sae}}}\to\mathbb{R}^{d_{\mathrm{sae}}}$ satisfies

$$\sup_{\bm{h}\in[-1,1]^{d_{\mathrm{sae}}}}\big\lVert\mathcal{R}_{R}(\bm{h})-\bm{z}_{\mathrm{R}}(\bm{h})\big\rVert_{\infty}\;\leq\;\varepsilon.$$

###### Proof\.

Let $\rho(t):=\mathrm{ReLU}(t)=\max\{t,0\}$. The scalar construction is the one used in Lemma 1 of Boullé et al. [[2020](https://arxiv.org/html/2606.14990#bib.bib1)]; we recall the argument to make clear how it follows from the Zolotarev sign approximation in Lemma [1](https://arxiv.org/html/2606.14990#Thmtheorem1). For an integer $m\geq 1$, take the Zolotarev sign function $s_{m}$ of type $(3^{m},3^{m}-1)$. By the composition property of Zolotarev sign functions, $s_{m}$ can be written as a composition of $m$ rational maps of type $(3,2)$, so it is represented by a constant-width rational network with internal depth $m$.

As in the proof of Lemma 1 in Boullé et al. [[2020](https://arxiv.org/html/2606.14990#bib.bib1)], choose the gap parameter in the Zolotarev construction optimally. Then the product $t\,s_{m}(t)$ approximates $|t|$ on $[-1,1]$ with root-exponential accuracy: there is a universal constant $c>0$ such that

$$\sup_{t\in[-1,1]}\big||t|-t\,s_{m}(t)\big|\leq 4\exp\!\big(-c\,3^{m/2}\big).$$

Using the identity

$$\rho(t)=\frac{|t|+t}{2},$$

define

$$r_{m}(t):=\frac{t\,s_{m}(t)+t}{2}.$$

It follows that

$$\sup_{t\in[-1,1]}|r_{m}(t)-\rho(t)|\leq 2\exp\!\big(-c\,3^{m/2}\big).$$

Choose $m$ so that this right-hand side is at most $\eta:=\varepsilon/2$, for which it suffices to take

$$m=\mathcal{O}\!\big(\log\log(1/\varepsilon)\big).$$

To keep the scalar approximant inside $[-1,1]$, set

$$R_{\varepsilon}(t):=\frac{r_{m}(t)}{1+\eta}.$$

Since $0\leq\rho(t)\leq 1$ on $[-1,1]$ and $|r_{m}(t)-\rho(t)|\leq\eta$, we have $-\eta\leq r_{m}(t)\leq 1+\eta$, hence $R_{\varepsilon}([-1,1])\subset[-1,1]$. Moreover,

$$|R_{\varepsilon}(t)-\rho(t)|\leq\frac{|r_{m}(t)-\rho(t)|+\eta|\rho(t)|}{1+\eta}\leq 2\eta=\varepsilon.$$

This proves the scalar approximation guarantee. For the direct trainable rational-activation implementation, note that $s_{m}$ has degree $\mathcal{O}(3^{m})$, and therefore $R_{\varepsilon}$ has numerator and denominator degree $\mathcal{O}(3^{m})$. With the above choice of $m$,

$$3^{m}=\mathcal{O}\!\big(\log^{2}(1/\varepsilon)\big).$$

Thus $R_{\varepsilon}$ is representable by $\mathcal{O}(\log^{2}(1/\varepsilon))$ coefficients, which is the stated size bound for the scalar rational function.

The same scalar map also admits a constant-width deep rational-network implementation of internal depth $m=\mathcal{O}(\log\log(1/\varepsilon))$, since $s_{m}$ is a composition of $m$ type-$(3,2)$ rational maps and the final affine rescaling does not change the asymptotic depth.

Now define the vector-valued activation block coordinatewise by

$$\mathcal{R}_{R}(\bm{h})_{i}:=R_{\varepsilon}(h_{i}),\qquad i=1,\ldots,d_{\mathrm{sae}}.$$

Since $\bm{z}_{\mathrm{R}}(\bm{h})_{i}=\rho(h_{i})$, the scalar uniform bound gives

$$\big\lVert\mathcal{R}_{R}(\bm{h})-\bm{z}_{\mathrm{R}}(\bm{h})\big\rVert_{\infty}=\max_{1\leq i\leq d_{\mathrm{sae}}}|R_{\varepsilon}(h_{i})-\rho(h_{i})|\leq\varepsilon$$

for every $\bm{h}\in[-1,1]^{d_{\mathrm{sae}}}$. ∎
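The structure of this construction (compose $m$ type-$(3,2)$ rational maps into a sign approximant, form $r_{m}(t)=(t\,s_{m}(t)+t)/2$, then rescale by $1/(1+\eta)$) can be illustrated with a short NumPy sketch. One assumption is worth stating loudly: the code does not implement the Zolotarev functions. It substitutes the Halley iteration for sign, $x\mapsto x(3+x^{2})/(1+3x^{2})$, which is also a type-$(3,2)$ rational map but converges more slowly near zero, so it does not reproduce the root-exponential rate of the theorem. All function names are illustrative.

```python
import numpy as np


def halley_sign_step(x):
    """One type-(3,2) rational map x -> x(3 + x^2) / (1 + 3x^2).
    Halley iteration for sign; a stand-in, NOT the Zolotarev function."""
    return x * (3.0 + x * x) / (1.0 + 3.0 * x * x)


def composed_sign(t, m):
    """Compose the type-(3,2) map m times (a depth-m, constant-width rational network)."""
    s = np.asarray(t, dtype=float)
    for _ in range(m):
        s = halley_sign_step(s)
    return s


def rational_relu(t, m):
    """r_m(t) = (t * s_m(t) + t) / 2, using ReLU(t) = (|t| + t) / 2."""
    t = np.asarray(t, dtype=float)
    return 0.5 * (t * composed_sign(t, m) + t)


def sup_error(m, grid):
    """Largest deviation from ReLU on the given grid."""
    return float(np.max(np.abs(rational_relu(grid, m) - np.maximum(grid, 0.0))))


def rescaled_relu(t, m, eta):
    """R(t) = r_m(t) / (1 + eta), the rescaling that keeps the output in [-1, 1]."""
    return rational_relu(t, m) / (1.0 + eta)


if __name__ == "__main__":
    grid = np.linspace(-1.0, 1.0, 20001)
    for m in (2, 4, 6):
        print(f"m={m}: sup error on grid = {sup_error(m, grid):.3e}")
```

With `eta` set to the measured sup error of `rational_relu`, the rescaled map stays inside $[-1,1]$ and its error is at most $2\eta$, mirroring the last two displayed inequalities of the proof.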

###### Theorem 3 (Rational approximation of JumpReLU).

Fix positive thresholds $\bm{\theta}\in(\mathbb{R}^{+})^{d_{\mathrm{sae}}}$ and a margin $\delta\in(0,1)$, and define

$$\Omega_{\delta}:=\{\bm{h}\in[-1,1]^{d_{\mathrm{sae}}}:|h_{i}-\theta_{i}|\geq\delta\text{ for every }i\}.$$

For every $0<\varepsilon<1$, there is a rational activation block $\mathcal{R}_{J}:[-1,1]^{d_{\mathrm{sae}}}\to\mathbb{R}^{d_{\mathrm{sae}}}$ whose scalar coordinate maps admit either of two implementations. First, each coordinate map can be implemented directly as a trainable scalar rational activation of size

$$\mathcal{O}\!\Big(\log(1/\varepsilon)\log(1/\delta)\Big),$$

with constants depending only on the fixed threshold scale. Second, each coordinate map can be realized by a constant-width deep rational network of internal depth

$$M_{J}\;=\;\mathcal{O}\!\Big(\log\log(1/\varepsilon)+\log\log(1/\delta)\Big)$$

and per-coordinate size $\mathcal{O}(M_{J})$. Under either implementation, $\mathcal{R}_{J}$ satisfies

$$\sup_{\bm{h}\in\Omega_{\delta}}\big\lVert\mathcal{R}_{J}(\bm{h})-\bm{z}_{\mathrm{J}}(\bm{h})\big\rVert_{\infty}\;\leq\;\varepsilon.$$

###### Proof\.

For each coordinate,

$$z_{\mathrm{J},i}(\bm{h})=h_{i}\,H(h_{i}-\theta_{i})=h_{i}\cdot\frac{1+\mathrm{sign}(h_{i}-\theta_{i})}{2}.$$

Thus the problem reduces to approximating the scalar gate $H(h_{i}-\theta_{i})$ on the margin-separated set $|h_{i}-\theta_{i}|\geq\delta$ and then multiplying by $h_{i}$.

Let

$$C_{\theta}:=1+\|\bm{\theta}\|_{\infty}.$$

Since $|h_{i}|\leq 1$, we have $|h_{i}-\theta_{i}|\leq C_{\theta}$ for every coordinate. Apply Lemma [1](https://arxiv.org/html/2606.14990#Thmtheorem1) to the rescaled variable $u=(h_{i}-\theta_{i})/C_{\theta}$, which ranges over $[-1,-\delta/C_{\theta}]\cup[\delta/C_{\theta},1]$ on $\Omega_{\delta}$. This gives a scalar rational function $s$ such that

$$\sup_{|t|\in[\delta,C_{\theta}]}\Big|\mathrm{sign}(t)-s\Big(\frac{t}{C_{\theta}}\Big)\Big|\leq\varepsilon.$$

Define

$$\widetilde{H}(t):=\frac{1+s(t/C_{\theta})}{2}.$$

Then, for all $|t|\in[\delta,C_{\theta}]$,

$$|\widetilde{H}(t)-H(t)|=\frac{1}{2}\Big|s\Big(\frac{t}{C_{\theta}}\Big)-\mathrm{sign}(t)\Big|\leq\frac{\varepsilon}{2}.$$

By Lemma [1](https://arxiv.org/html/2606.14990#Thmtheorem1), this scalar gate is implemented by a constant-width rational network of depth

$$\mathcal{O}\!\big(\log\log(1/\varepsilon)+\log\log(C_{\theta}/\delta)\big)=\mathcal{O}\!\big(\log\log(1/\varepsilon)+\log\log(1/\delta)\big),$$

where the second equality uses that $C_{\theta}$ is an architectural constant.

For the direct trainable rational-activation implementation, keep the same scalar rational gate instead of factorizing it into a deep constant-width composition. If $n_{\varepsilon,\delta}$ denotes the degree selected in Lemma [1](https://arxiv.org/html/2606.14990#Thmtheorem1) with margin $\delta/C_{\theta}$, then

$$n_{\varepsilon,\delta}=\mathcal{O}\!\Big(\log(C_{\theta}/\delta)\log(1/\varepsilon)\Big)=\mathcal{O}\!\Big(\log(1/\delta)\log(1/\varepsilon)\Big),$$

again treating $C_{\theta}$ as fixed. The rational gate $t\mapsto\widetilde{H}(t)$ therefore has numerator and denominator degree $\mathcal{O}(n_{\varepsilon,\delta})$.

We now define the rational approximation

$$\widetilde{z}_{i}(\bm{h}):=h_{i}\,\widetilde{H}(h_{i}-\theta_{i}).$$

The product is realised exactly by the standard multiplication identity

$$xy=\frac{1}{4}\big[(x+y)^{2}-(x-y)^{2}\big],$$

which is a polynomial of degree $2$ and therefore belongs to the rational class with constant additional depth. On $\Omega_{\delta}$ we have

$$|\widetilde{z}_{i}(\bm{h})-z_{\mathrm{J},i}(\bm{h})|=|h_{i}|\,|\widetilde{H}(h_{i}-\theta_{i})-H(h_{i}-\theta_{i})|\leq\frac{\varepsilon}{2}\leq\varepsilon.$$

So each coordinate is approximated uniformly to error at most $\varepsilon$.

Moreover, the scalar map $h_{i}\mapsto\widetilde{z}_{i}(\bm{h})$ is itself a univariate rational function of degree $\mathcal{O}(n_{\varepsilon,\delta})$: multiplying by $h_{i}$ only increases the numerator degree by one.

For the direct scalar rational implementation, the numerator and denominator degrees are therefore $\mathcal{O}(n_{\varepsilon,\delta})$, so each coordinate map has size $\mathcal{O}\!\big(\log(1/\varepsilon)\log(1/\delta)\big)$. Define $\mathcal{R}_{J}(\bm{h})_{i}:=\widetilde{z}_{i}(\bm{h})$ and apply the maps coordinatewise. The rational coefficients in the gate $\widetilde{H}$ are shared across coordinates; the coordinate dependence enters only through the affine shift $h_{i}\mapsto h_{i}-\theta_{i}$ and the final multiplication by $h_{i}$. Equivalently, since $\theta_{i}>0$, one may use a shared prototype gate $r_{1/2}$ at threshold $1/2$ and apply it as $r_{1/2}(h_{i}/(2\theta_{i}))$, because

$$\frac{h_{i}}{2\theta_{i}}-\frac{1}{2}=\frac{h_{i}-\theta_{i}}{2\theta_{i}}.$$

The fixed threshold scale only changes constants in the margin, so the same asymptotic size bound applies. The scalar error bound gives

$$\sup_{\bm{h}\in\Omega_{\delta}}\|\mathcal{R}_{J}(\bm{h})-\bm{z}_{\mathrm{J}}(\bm{h})\|_{\infty}\leq\varepsilon,$$

which is the direct scalar-rational realization claimed in the theorem.

For the deep implementation, realize the same shared gate, either $\widetilde{H}$ or the equivalent prototype $r_{1/2}$, by the constant-width rational network above and use $d_{\mathrm{sae}}$ copies in parallel. The coordinate-dependent affine shifts or scalings and the final multiplication by $h_{i}$ add only constant depth, so the vector-valued realization has width proportional to the number of coordinates and depth $\mathcal{O}(\log\log(1/\varepsilon)+\log\log(1/\delta))$. ∎
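The role of the margin $\delta$ can be seen in a small NumPy experiment. The sketch below builds the gate $\widetilde{H}(t)=(1+s(t/C_{\theta}))/2$, multiplies by $h_{i}$ through the identity $xy=\tfrac14[(x+y)^{2}-(x-y)^{2}]$, and compares the error on the margin-separated set with the error on the whole interval. As an assumption for illustration only, the sign approximant $s$ is a six-fold composition of the Halley map $x\mapsto x(3+x^{2})/(1+3x^{2})$ rather than the Zolotarev rational from Lemma 1, and the threshold $\theta=0.3$ and margin $\delta=0.05$ are arbitrary example values.

```python
import numpy as np


def sign_standin(u, m=6):
    """m-fold composition of a type-(3,2) rational map (Halley iteration).
    Stand-in for the Zolotarev sign approximant; chosen for illustration only."""
    s = np.asarray(u, dtype=float)
    for _ in range(m):
        s = s * (3.0 + s * s) / (1.0 + 3.0 * s * s)
    return s


def jumprelu(h, theta):
    """Exact JumpReLU: h * H(h - theta)."""
    return h * (h > theta)


def product_via_squares(x, y):
    """xy = ((x + y)^2 - (x - y)^2) / 4, the multiplication identity from the proof."""
    return 0.25 * ((x + y) ** 2 - (x - y) ** 2)


def rational_jumprelu(h, theta, c_theta, m=6):
    """h * H~(h - theta), with H~(t) = (1 + s(t / C_theta)) / 2."""
    gate = 0.5 * (1.0 + sign_standin((h - theta) / c_theta, m))
    return product_via_squares(h, gate)


def errors(theta=0.3, delta=0.05, m=6, n_grid=20001):
    """Return (sup error on |h - theta| >= delta, sup error on all of [-1, 1])."""
    h = np.linspace(-1.0, 1.0, n_grid)
    c_theta = 1.0 + abs(theta)  # C_theta = 1 + ||theta||_inf for a single coordinate
    err = np.abs(rational_jumprelu(h, theta, c_theta, m) - jumprelu(h, theta))
    on_margin_set = np.abs(h - theta) >= delta
    return float(err[on_margin_set].max()), float(err.max())


if __name__ == "__main__":
    err_margin, err_full = errors()
    print(f"sup error with margin: {err_margin:.3e}; without margin: {err_full:.3e}")
```

Any continuous approximant has to cross the jump at $\theta_{i}$, so the error near the threshold cannot be made uniformly small; this is why the theorem is stated on $\Omega_{\delta}$ rather than on the full cube.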

###### Theorem 4 (Rational approximation of supplied-threshold TopK gate).

Fix $1\leq k<d_{\mathrm{sae}}$ and a margin $\delta\in(0,1)$, and define

$$\Omega^{\mathrm{T}}_{\delta}:=\{(\bm{h},\tau_{k})\in[-1,1]^{d_{\mathrm{sae}}}\times[-1,1]:h_{(k)}-\tau_{k}\geq\delta,\;\tau_{k}-h_{(k+1)}\geq\delta\},$$

where $h_{(1)}\geq\cdots\geq h_{(d_{\mathrm{sae}})}$ are the sorted pre-activations and $\tau_{k}$ is a supplied separator between the $k$-th and $(k+1)$-st order statistics, not the $k$-th activation itself. Suppose the scalar threshold $\tau_{k}\in[-1,1]$ is supplied together with each pre-activation vector. For every $0<\varepsilon<1$, there is a rational network $\mathcal{R}_{T}:[-1,1]^{d_{\mathrm{sae}}+1}\to\mathbb{R}^{d_{\mathrm{sae}}}$ whose scalar coordinate maps admit either of two implementations. First, each coordinate map can be implemented directly as a trainable scalar rational activation of size

$$\mathcal{O}\!\Big(\log(1/\varepsilon)\log(1/\delta)\Big).$$

Second, each coordinate map can be realized by a constant-width deep rational network of internal depth

$$M_{T}\;=\;\mathcal{O}\!\Big(\log\log(1/\varepsilon)+\log\log(1/\delta)\Big).$$

Under either implementation, $\mathcal{R}_{T}$ satisfies

$$\sup_{(\bm{h},\tau_{k})\in\Omega^{\mathrm{T}}_{\delta}}\big\lVert\mathcal{R}_{T}(\bm{h},\tau_{k})-\bm{z}_{\mathrm{T}}(\bm{h};\tau_{k})\big\rVert_{\infty}\;\leq\;\varepsilon.$$

###### Proof\.

Set $t_{i}:=h_{i}-\tau_{k}$. For each coordinate,

$$z_{\mathrm{T},i}(\bm{h};\tau_{k})=h_{i}\,H(h_{i}-\tau_{k})=h_{i}\cdot\frac{1+\mathrm{sign}(h_{i}-\tau_{k})}{2}.$$

On $\Omega^{\mathrm{T}}_{\delta}$, the top-$k$ coordinates satisfy $h_{i}-\tau_{k}\geq\delta$ and the remaining coordinates satisfy $h_{i}-\tau_{k}\leq-\delta$. Therefore $|t_{i}|\in[\delta,2]$ for every coordinate, because $h_{i},\tau_{k}\in[-1,1]$. Hence the JumpReLU proof applies with the learned threshold $\theta_{i}$ replaced by the supplied threshold $\tau_{k}$ and with the fixed scale $2$. Applying Lemma [1](https://arxiv.org/html/2606.14990#Thmtheorem1) to $u=t_{i}/2$ gives a shared scalar rational function $s$ such that

$$\sup_{|t|\in[\delta,2]}\Big|\mathrm{sign}(t)-s\Big(\frac{t}{2}\Big)\Big|\leq\varepsilon.$$

Define

$$G(u):=\frac{1+s(u)}{2},\qquad\widetilde{H}(t):=\frac{1+s(t/2)}{2},\qquad\widetilde{z}_{i}(\bm{h};\tau_{k}):=h_{i}\,\widetilde{H}(h_{i}-\tau_{k}).$$

Then, for every $(\bm{h},\tau_{k})\in\Omega^{\mathrm{T}}_{\delta}$,

$$|\widetilde{z}_{i}(\bm{h};\tau_{k})-z_{\mathrm{T},i}(\bm{h};\tau_{k})|\leq|h_{i}|\,|\widetilde{H}(h_{i}-\tau_{k})-H(h_{i}-\tau_{k})|\leq\frac{\varepsilon}{2}\leq\varepsilon.$$

For the direct trainable rational-activation implementation, keep the shared gate $G$ unfactored and apply it to the affine scalar $(h_{i}-\tau_{k})/2$. By Lemma [1](https://arxiv.org/html/2606.14990#Thmtheorem1), the numerator and denominator degrees are $\mathcal{O}(\log(1/\varepsilon)\log(1/\delta))$; multiplying by $h_{i}$ increases only the numerator degree by one. Thus each coordinate map has size $\mathcal{O}(\log(1/\varepsilon)\log(1/\delta))$, with the rational coefficients shared across coordinates.

For the deep implementation, realize the same shared gate $G$ by a constant-width rational network of internal depth $\mathcal{O}(\log\log(1/\varepsilon)+\log\log(1/\delta))$. The affine difference and fixed scaling $(h_{i}-\tau_{k})/2$, together with the final multiplication by $h_{i}$, add only constant depth. Applying this construction coordinatewise gives the stated block $\mathcal{R}_{T}$ and the uniform $\ell_{\infty}$ error bound. ∎
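The reduction used here, that TopK coincides with a thresholded gate once a separating threshold $\tau_{k}$ is supplied, can be checked directly. The NumPy sketch below compares a plain top-$k$ selection with $h_{i}\,H(h_{i}-\tau_{k})$. The helper names are illustrative, and the midpoint between the $k$-th and $(k+1)$-st order statistics is just one valid separator chosen for the example.

```python
import numpy as np


def topk_activation(h, k):
    """Keep the k largest entries of h and zero out the rest."""
    z = np.zeros_like(h)
    idx = np.argpartition(h, -k)[-k:]  # indices of the k largest entries
    z[idx] = h[idx]
    return z


def separator(h, k):
    """Midpoint between the k-th and (k+1)-st largest entries: one valid tau_k."""
    srt = np.sort(h)[::-1]
    return 0.5 * (srt[k - 1] + srt[k])


def thresholded_gate(h, tau):
    """Supplied-threshold form z_i = h_i * H(h_i - tau)."""
    return h * (h > tau)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = rng.uniform(-1.0, 1.0, size=32)
    k = 5
    tau = separator(h, k)
    print(np.allclose(topk_activation(h, k), thresholded_gate(h, tau)))
```

The two agree whenever the $k$-th and $(k+1)$-st largest pre-activations are distinct; the margin $\delta$ in the theorem quantifies how far apart they must be for the rational gate to be accurate.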

###### Theorem 5 (Lower bound for ReLU/JumpReLU/TopK networks).

There exists a scalar rational target map $\mathcal{R}^{\star}_{\eta}:[-1,1]\to[0,1]$, realizable with $\mathcal{O}(1)$ rational parameters, such that the following hold.

First, any scalar map $\mathcal{S}:[-1,1]\to[0,1]$ realized by a scalar-output ReLU/JumpReLU/supplied-threshold TopK network and satisfying

$$\lVert\mathcal{S}-\mathcal{R}^{\star}_{\eta}\rVert_{L^{\infty}([-1,1])}\leq\varepsilon$$

must satisfy $P=\Omega(\log(1/\varepsilon))$, where $P$ is the number of trainable parameters.

Second, any scalar map $\mathcal{S}:[-1,1]\to[0,1]$ realized by the scalar-output version of the single-layer encoder architecture in ([1](https://arxiv.org/html/2606.14990#S2.E1)), using ReLU, JumpReLU, or a supplied-threshold TopK gate with $N$ activated coordinates, and satisfying

$$\lVert\mathcal{S}-\mathcal{R}^{\star}_{\eta}\rVert_{L^{\infty}([-1,1])}\leq\varepsilon$$

must satisfy $N=\Omega(\varepsilon^{-1/2})$.

###### Proof\.

Consider the rational function

$$\mathcal{R}^{\star}_{\eta}(x):=\frac{\eta^{2}}{x^{2}+\eta^{2}},\qquad x\in[-1,1],$$

with $\eta\in(0,1/2)$ fixed, and the interval $I_{\eta}=[-\eta/2,\eta/2]$. A direct calculation gives

$$(\mathcal{R}^{\star}_{\eta})^{\prime\prime}(x)=\frac{2\eta^{2}\big(3x^{2}-\eta^{2}\big)}{(x^{2}+\eta^{2})^{3}}.$$

For $|x|\leq\eta/2$, we have $3x^{2}-\eta^{2}\leq-\eta^{2}/4$ and $x^{2}+\eta^{2}\leq 5\eta^{2}/4$, hence

$$(\mathcal{R}^{\star}_{\eta})^{\prime\prime}(x)\leq-\frac{\eta^{4}/2}{(5\eta^{2}/4)^{3}}=-\frac{32}{125\eta^{2}}=:-c_{\eta}.$$

Thus $\mathcal{R}^{\star}_{\eta}$ is uniformly concave on $I_{\eta}$.

Let $g:[-1,1]\to[0,1]$ be piecewise affine with $R$ affine pieces and assume $\|g-\mathcal{R}^{\star}_{\eta}\|_{L^{\infty}([-1,1])}\leq\varepsilon$. Since $|I_{\eta}|=\eta$, one affine piece of $g$ contains a subinterval $J=[a,b]\subset I_{\eta}$ of length

$$\ell:=|J|\geq\frac{\eta}{R}.$$

Let $s_{J}$ be the secant line of $\mathcal{R}^{\star}_{\eta}$ on $J$, and let $m=(a+b)/2$ be the midpoint of $J$. Because $(\mathcal{R}^{\star}_{\eta})^{\prime\prime}\leq-c_{\eta}$ on $J$, Taylor's theorem at the midpoint gives

$$\mathcal{R}^{\star}_{\eta}(m)-s_{J}(m)\geq\frac{c_{\eta}\ell^{2}}{8}.$$

Now write the affine restriction of $g$ on $J$ as

$$g(x)=s_{J}(x)+q(x),\qquad x\in J,$$

where $q$ is affine. Since $q$ is affine,

$$q(m)=\frac{q(a)+q(b)}{2}.$$

Also, because $s_{J}(a)=\mathcal{R}^{\star}_{\eta}(a)$ and $s_{J}(b)=\mathcal{R}^{\star}_{\eta}(b)$, the uniform error bound at the endpoints implies

$$|q(a)|\leq\varepsilon,\qquad|q(b)|\leq\varepsilon,$$

and therefore $|q(m)|\leq\varepsilon$. Evaluating the error at the midpoint, we obtain

$$\varepsilon\geq\big|\mathcal{R}^{\star}_{\eta}(m)-g(m)\big|=\big|\mathcal{R}^{\star}_{\eta}(m)-s_{J}(m)-q(m)\big|\geq\frac{c_{\eta}\ell^{2}}{8}-\varepsilon.$$

Hence

$$\varepsilon\geq\frac{c_{\eta}\ell^{2}}{16}\geq\frac{c_{\eta}\eta^{2}}{16R^{2}},$$

which yields

$$R\geq c^{\prime}_{\eta}\,\varepsilon^{-1/2}$$

for a constant $c^{\prime}_{\eta}>0$ depending only on $\eta$.

This already gives the single-layer encoder lower bound. Indeed, a scalar-output version of the SAE encoder in ([1](https://arxiv.org/html/2606.14990#S2.E1)) has the form

$$g(x)=a_{0}+\sum_{j=1}^{N}a_{j}\,\psi_{j}(w_{j}x+b_{j}),$$

where each $\psi_{j}$ is a ReLU, JumpReLU, or supplied-threshold TopK gate. Each supplied-threshold TopK summand is interpreted with its supplied threshold fixed along this scalar restriction, so every summand is affine except at one breakpoint at most. Hence the union of all breakpoints has size at most $N$ and the scalar map has at most $N+1$ affine pieces. Combining this with the piece-count lower bound above gives

$$c^{\prime}_{\eta}\,\varepsilon^{-1/2}\leq R\leq N+1,$$

and hence $N=\Omega(\varepsilon^{-1/2})$.

For the arbitrary-depth statement, if $\mathcal{S}$ is a scalar-valued ReLU network with depth $L$ and $M$ hidden units, then the network has at most $(M/L)^{L}$ breakpoints, and the same conclusion holds for JumpReLU and supplied-threshold TopK networks (see, e.g., Telgarsky [[2016](https://arxiv.org/html/2606.14990#bib.bib32)]). Therefore, in either case,

$$c^{\prime}_{\eta}\,\varepsilon^{-1/2}\leq R\leq(M/L)^{L}.$$

Taking logarithms of both sides of $c^{\prime}_{\eta}\,\varepsilon^{-1/2}\leq(M/L)^{L}$ gives

$$\log(c^{\prime}_{\eta})+\tfrac{1}{2}\log(1/\varepsilon)\;\leq\;L\log(M/L).$$

Since each layer has at least $2$ neurons we have $M\geq 2L$, so $M/L\geq 2>1$. Using $\log x<x$ for all $x>1$ yields $\log(M/L)<M/L$, and therefore

$$L\log(M/L)\;<\;L\cdot\frac{M}{L}\;=\;M.$$

Combining the two inequalities,

$$M\;>\;\log(c^{\prime}_{\eta})+\tfrac{1}{2}\log(1/\varepsilon),$$

so $M=\Omega(\log(1/\varepsilon))$. Since each hidden unit contributes at least one bias parameter, the total trainable parameter count satisfies $P\geq M$, and therefore $P=\Omega(\log(1/\varepsilon))$. ∎
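The two analytic ingredients of this proof, the curvature bound $(\mathcal{R}^{\star}_{\eta})''\leq-c_{\eta}$ on $I_{\eta}$ and the midpoint secant gap of at least $c_{\eta}\ell^{2}/8$, lend themselves to a quick numerical sanity check. The NumPy sketch below evaluates both, together with the implied lower bound $\varepsilon\geq c_{\eta}\eta^{2}/(16R^{2})$ for a piecewise-affine approximant with $R$ pieces. The function names and the example value $\eta=0.25$ are illustrative choices, not part of the paper.

```python
import numpy as np


def target(x, eta):
    """R*_eta(x) = eta^2 / (x^2 + eta^2)."""
    return eta**2 / (x**2 + eta**2)


def second_derivative(x, eta):
    """Closed form of (R*_eta)''(x) from the proof."""
    return 2.0 * eta**2 * (3.0 * x**2 - eta**2) / (x**2 + eta**2) ** 3


def curvature_constant(eta):
    """c_eta = 32 / (125 eta^2)."""
    return 32.0 / (125.0 * eta**2)


def midpoint_gap(a, b, eta):
    """Target minus its secant line, evaluated at the midpoint of [a, b]."""
    m = 0.5 * (a + b)
    secant_mid = 0.5 * (target(a, eta) + target(b, eta))
    return target(m, eta) - secant_mid


def error_lower_bound(num_pieces, eta):
    """eps >= c_eta * eta^2 / (16 R^2) for any R-piece piecewise-affine approximant."""
    return curvature_constant(eta) * eta**2 / (16.0 * num_pieces**2)


if __name__ == "__main__":
    eta = 0.25
    xs = np.linspace(-eta / 2, eta / 2, 1001)
    print("max second derivative on I_eta:", second_derivative(xs, eta).max())
    print("-c_eta:", -curvature_constant(eta))
    for pieces in (4, 16, 64):
        print(pieces, error_lower_bound(pieces, eta))
```

Since $c_{\eta}\eta^{2}=32/125$, the lower bound on $\varepsilon$ decays like $R^{-2}$ regardless of $\eta$, which is the source of the $\Omega(\varepsilon^{-1/2})$ piece count.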

## Appendix B Detailed Empirical Results

### B.1 Implementation Details and Experimental Setup

Algorithm 1 Rational Sparse Autoencoder (RSAE) construction.

1: Input: pre-trained baseline SAE $\{\widetilde{\bm{W}_{\text{enc}}},\widetilde{\bm{b}_{\text{enc}}},\widetilde{\bm{W}_{\text{dec}}},\widetilde{\bm{b}_{\text{dec}}}\}$ with activation $\phi^{\text{teacher}}\in\{\mathrm{ReLU},\,\mathrm{JumpReLU}_{\theta},\,\mathrm{TopK}_{k}\}$; rational type $(p,q)$; calibration buffer $\{\bm{x}_{n}\}_{n=1}^{N_{\text{cal}}}$; dense grid size $N$; sparsity coefficient $\lambda$.

2: Step 1: RSAE Initialization Procedure.

3: Build dense grid $\{(t_{\ell},y_{\ell})\}_{\ell=1}^{N}\subset[-1,1]$ with $y_{\ell}=\phi^{\text{teacher}}(t_{\ell})$.

4: Fit rational coefficients $(\bm{a}^{\ast},\bm{b}^{\ast})$ to the teacher activation by the relaxed Remez exchange targeting ([7](https://arxiv.org/html/2606.14990#S4.E7)) (Appendix [B.2](https://arxiv.org/html/2606.14990#A2.SS2)); fall back to the $L^{2}$ or smoothed-$L^{\infty}$ surrogate if Remez is numerically unstable. (Optionally distill onto the safe-Padé family by least squares on $\{t_{\ell}\}_{\ell=1}^{N}$.)

5: Compute teacher pre-activations $\bm{h}_{n}=\widetilde{\bm{W}_{\text{enc}}}\,(\bm{x}_{n}-\widetilde{\bm{b}_{\text{dec}}})+\widetilde{\bm{b}_{\text{enc}}}$ for $n=1,\ldots,N_{\text{cal}}$.

6: Adapt the rational activation to the teacher by minimising ([8](https://arxiv.org/html/2606.14990#S4.E8)) over $(\bm{a},\bm{b},C_{\mathrm{in}},C_{\mathrm{out}})$ with $(\bm{a},\bm{b})$ initialised from $(\bm{a}^{\ast},\bm{b}^{\ast})$, yielding $(\widetilde{\bm{a}},\widetilde{\bm{b}},\widetilde{C}_{\mathrm{in}},\widetilde{C}_{\mathrm{out}})$.

7: Step 2: RSAE Fine-Tuning Procedure.

8: Inherit teacher weights $\{\bm{W}_{\text{enc}},\bm{b}_{\text{enc}},\bm{W}_{\text{dec}},\bm{b}_{\text{dec}}\}\leftarrow\{\widetilde{\bm{W}_{\text{enc}}},\widetilde{\bm{b}_{\text{enc}}},\widetilde{\bm{W}_{\text{dec}}},\widetilde{\bm{b}_{\text{dec}}}\}$ and set $(\bm{a},\bm{b},C_{\mathrm{in}},C_{\mathrm{out}})\leftarrow(\widetilde{\bm{a}},\widetilde{\bm{b}},\widetilde{C}_{\mathrm{in}},\widetilde{C}_{\mathrm{out}})$.

9: Collect $\Theta\leftarrow\{\bm{W}_{\text{enc}},\bm{b}_{\text{enc}},\bm{W}_{\text{dec}},\bm{b}_{\text{dec}},\bm{a},\bm{b},C_{\mathrm{in}},C_{\mathrm{out}}\}$.

10: repeat $\triangleright$ Joint Adam fine-tuning of ([9](https://arxiv.org/html/2606.14990#S4.E9)).

11: Sample mini-batch $\bm{x}\sim\mathcal{D}$; compute $\bm{h}=\bm{W}_{\text{enc}}\,(\bm{x}-\bm{b}_{\text{dec}})+\bm{b}_{\text{enc}}$, $\bm{z}=\phi(\bm{h})$ via ([3](https://arxiv.org/html/2606.14990#S3.E3)), and $\hat{\bm{x}}=\bm{W}_{\text{dec}}\,\bm{z}+\bm{b}_{\text{dec}}$.

12: Take an Adam step on $\Theta$ minimising the integrand of ([9](https://arxiv.org/html/2606.14990#S4.E9)), with the unit-norm-row constraint on $\bm{W}_{\text{dec}}$ enforced by gradient projection and renormalisation.

13: until convergence.

14: Output: RSAE with parameters $\Theta$.
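The constrained update in step 12 can be sketched in a few lines of NumPy. The sketch rests on two assumptions that Algorithm 1 does not spell out: the dictionary directions are stored as the rows of the decoder array (shape `(d_sae, d_in)`), and a plain gradient step stands in for the Adam step. The gradient component parallel to each unit-norm row is removed before the step, and rows are renormalised afterwards; all function names are illustrative.

```python
import numpy as np


def project_out_parallel(grad, w_dec):
    """Remove from each row of grad its component along the matching
    (unit-norm) row of w_dec."""
    parallel = np.sum(grad * w_dec, axis=1, keepdims=True)
    return grad - parallel * w_dec


def renormalise_rows(w_dec, tiny=1e-12):
    """Rescale every row to unit Euclidean norm."""
    norms = np.linalg.norm(w_dec, axis=1, keepdims=True)
    return w_dec / np.maximum(norms, tiny)


def constrained_step(w_dec, grad, lr=1e-2):
    """One plain gradient step (a stand-in for Adam) that keeps rows unit-norm."""
    g = project_out_parallel(grad, w_dec)
    return renormalise_rows(w_dec - lr * g)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = renormalise_rows(rng.normal(size=(8, 4)))  # toy decoder: 8 features, d_in = 4
    grad = rng.normal(size=(8, 4))                 # placeholder gradient
    w_new = constrained_step(w, grad)
    print(np.linalg.norm(w_new, axis=1))
```

Projecting first means the step moves each direction tangentially to the unit sphere, so the renormalisation afterwards only corrects a second-order drift in the norm.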

#### Models and hook points\.

We evaluate on three open-weight language models. For GPT-2 small we hook the residual stream at layer $6$ (`blocks.6.hook_resid_post`, $d_{\mathrm{in}}=768$); for Pythia-160m-deduped at layer $8$ (`blocks.8.hook_resid_post`, $d_{\mathrm{in}}=768$); and for Gemma-2-2B at layer $12$ (`blocks.12.hook_resid_post`, $d_{\mathrm{in}}=2304$). All language-model forward passes are run in bf16.

#### Teacher SAE checkpoints.

Teacher SAEs are pre-trained, publicly released checkpoints. For GPT-2 small the ReLU teacher is `gpt2-small-res-jb` [Bloom, [2024](https://arxiv.org/html/2606.14990#bib.bib20)] ($d_{\mathrm{sae}}=24{,}576$) and the TopK teacher is the OpenAI-v5 `gpt2-small-resid-post-v5-32k` [Gao et al., [2025](https://arxiv.org/html/2606.14990#bib.bib10)] ($d_{\mathrm{sae}}=32{,}768$, with layer-norm preprocessing of the residual stream). For Pythia-160m the ReLU teacher is `adamkarvonen/saebench_pythia-160m-deduped_width-2pow14_date-0108` ($d_{\mathrm{sae}}=16{,}384$); JumpReLU and TopK teachers are taken from the matching SAEBench [Karvonen et al., [2025](https://arxiv.org/html/2606.14990#bib.bib22)] releases at the same width. For Gemma-2-2B all three teacher families are taken from the corresponding SAEBench releases at $d_{\mathrm{sae}}=4{,}096$.

#### RSAE activation and parameter init.

Every RSAE uses the standard-Padé rational activation ([3](https://arxiv.org/html/2606.14990#S3.E3)) with degree $(p,q)=(3,2)$ for ReLU teachers and $(9,8)$ for JumpReLU and TopK teachers (chosen from the synthetic ablation of §[5.1](https://arxiv.org/html/2606.14990#S5.SS1); see Figure [3](https://arxiv.org/html/2606.14990#A2.F3)). Coefficients $(\bm{a},\bm{b})$ are initialised from the converged relaxed Remez fit of the teacher activation (§[4](https://arxiv.org/html/2606.14990#S4)); per-feature scales $\log C_{\mathrm{in}}$ and $\log C_{\mathrm{out}}$ are initialised to zero and learned. Encoder/decoder weights and biases are inherited verbatim from the teacher checkpoint at $t=0$ so that the RSAE approximately matches the teacher before fine-tuning. To preserve the standard SAE non-negativity convention $\bm{z}\geq\bm{0}$, we apply a $\max(0,\phi(\bm{h}))$ clamp on the rational output element-wise.
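As a rough illustration of the clamped rational activation, the sketch below evaluates a rational function per feature and applies the element-wise $\max(0,\cdot)$ clamp. Two points are assumptions of this sketch, since equation (3) is not reproduced in this appendix: the scales are placed as $\phi(h)=C_{\mathrm{out}}\,r(h/C_{\mathrm{in}})$, and the denominator uses the pole-free form $1+\sum_j |b_j|\,|t|^j$ that Appendix B.3 attributes to (3). The function name `rational_activation` is hypothetical.

```python
import numpy as np

def rational_activation(h, a, b, log_c_in, log_c_out):
    """Clamped rational activation (sketch; scale placement is assumed).

    h: (batch, d_sae) pre-activations
    a: numerator coefficients a_0..a_p, b: denominator coefficients b_1..b_q
    log_c_in, log_c_out: per-feature log-scales of shape (d_sae,)
    """
    t = h / np.exp(log_c_in)               # assumed input scaling
    abs_t = np.abs(t)
    num = np.zeros_like(t)
    for a_i in a[::-1]:                    # Horner evaluation of P(t)
        num = num * t + a_i
    acc = np.zeros_like(t)
    for b_j in b[::-1]:                    # Horner for sum_j |b_j| |t|^j
        acc = (acc + abs(b_j)) * abs_t
    den = 1.0 + acc                        # >= 1, so there are no poles
    phi = np.exp(log_c_out) * num / den    # assumed output scaling
    return np.maximum(phi, 0.0)            # non-negativity clamp

# With P(t) = t, Q(t) = 1 and zero log-scales the clamp reduces to ReLU.
h = np.array([[-1.0, 0.0, 2.0]])
z = rational_activation(h, np.array([0.0, 1.0]), np.array([]),
                        np.zeros(3), np.zeros(3))
```

Storing the scales in log space, as the text describes, keeps $C_{\mathrm{in}}$ and $C_{\mathrm{out}}$ positive without an explicit constraint, and initialising them to zero corresponds to unit scales.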

#### Activation buffer and calibration.

Activations are streamed from a Pile / OpenWebText mixture using TransformerLens hooks; we collect a calibration buffer of $\sim$2M tokens that is used both to fit the activation distillation step ([8](https://arxiv.org/html/2606.14990#S4.E8)) and to provide statistics for the per-feature pre-activation scale used in ([3](https://arxiv.org/html/2606.14990#S3.E3)). A separate held-out evaluation buffer of $\sim$10K tokens (disjoint from the calibration buffer) is used for all reported metrics.

#### Init procedure (500 steps).

The empirical activation distillation step fits $(\bm{a},\bm{b},C_{\mathrm{in}},C_{\mathrm{out}})$ to the teacher activation $\phi^{\text{teacher}}$ on calibration-buffer pre-activations, with all other RSAE parameters frozen. We use Adam with learning rate $10^{-3}$, batch size $1{,}024$ tokens, and run for $500$ steps.

#### Fine-tune procedure (2,000 steps).

The full RSAE is then fine-tuned jointly on the $\ell_1$-regularised reconstruction objective ([9](https://arxiv.org/html/2606.14990#S4.E9)) for $2{,}000$ Adam steps with learning rate $5\times 10^{-4}$ and a cosine decay to $0$, batch size $4{,}096$ tokens, and gradient clipping at norm $1.0$. SAE training itself runs in fp32. The sparsity coefficient $\lambda$ is swept on a small grid per (model, teacher) pair to trace the Pareto front of Figure [2](https://arxiv.org/html/2606.14990#S5.F2); the values reported in Tables [1](https://arxiv.org/html/2606.14990#S5.T1)–[2](https://arxiv.org/html/2606.14990#S5.T2) use the $\lambda$ that maximises the number of strict per-axis wins over the teacher.
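The schedule and clipping described above can be sketched in a few lines of plain Python. The sketch assumes no warm-up, a schedule indexed by optimiser step, and clipping by the joint norm of all gradients; none of these details are spelled out in the text, and the helper names `cosine_lr` and `clip_by_global_norm` are hypothetical.

```python
import math

def cosine_lr(step, total_steps=2000, base_lr=5e-4):
    """Learning rate at `step` under cosine decay from base_lr to 0."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of flat gradient lists so their joint L2 norm <= max_norm."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / total) if total > 0 else 1.0
    return [[g * scale for g in vec] for vec in grads]

# Usage: the rate starts at 5e-4 and reaches 0 at step 2000; a gradient of
# norm 5 is rescaled to norm 1, while smaller gradients pass through unchanged.
lrs = [cosine_lr(s) for s in (0, 1000, 2000)]
clipped = clip_by_global_norm([[3.0, 4.0]])
```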

#### Hardware.

All RSAE training reported in Table [3](https://arxiv.org/html/2606.14990#S5.T3) is run on a single NVIDIA RTX 5090 (32 GB). End-to-end wall-clock per (model, baseline) is dominated by the fine-tuning procedure.

#### Detailed wall-clock runtime (Table [6](https://arxiv.org/html/2606.14990#A2.T6)).

Table [6](https://arxiv.org/html/2606.14990#A2.T6) reports the per-(model, baseline) wall-clock breakdown that underlies the mean $\pm$ std summary of Table [3](https://arxiv.org/html/2606.14990#S5.T3) in the main text. Init- and Finetune-procedure timings are dominated by the language-model forward pass through the host network at each training step.

Table 6: Detailed wall-clock runtime of the RSAE pipeline per (model, baseline SAE), measured on a single NVIDIA RTX 5090 (32 GB).

#### Evaluation protocol.

Reconstruction MSE, $\ell_0$ (at the $|z|>10^{-6}$ threshold), and the alive-feature fraction (a latent is “alive” if it fires on at least one held-out token) are evaluated on the $\sim$10K-token held-out buffer. Cross-entropy degradation $\Delta\mathrm{CE}$ and the loss-recovered fraction $\mathrm{LR}$ are computed over $128$ Pile / OpenWebText sequences of length $128$ tokens, by routing the residual stream through $\hat{\bm{x}}$ at the SAE’s host layer and reading the language model’s next-token cross-entropy at every position; $\mathrm{CE}_{\text{zero}}$ is the cross-entropy obtained when the residual stream at the host layer is replaced by the zero vector.
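A minimal sketch of the sparsity-side metrics and the loss-recovered fraction follows. The thresholded $\ell_0$ and the alive-feature rule are taken from the paragraph above; the formula $\mathrm{LR}=(\mathrm{CE}_{\text{zero}}-\mathrm{CE}_{\text{SAE}})/(\mathrm{CE}_{\text{zero}}-\mathrm{CE}_{\text{clean}})$ is the commonly used definition and is an assumption here, since this appendix does not state it. Function names are hypothetical.

```python
import numpy as np

def sparsity_metrics(z, threshold=1e-6):
    """Return (mean L0 per token, alive-feature fraction) for latents z.

    z: (tokens, d_sae) array of SAE latents on the held-out buffer.
    """
    active = np.abs(z) > threshold
    l0 = active.sum(axis=1).mean()              # average active latents per token
    alive_fraction = active.any(axis=0).mean()  # latents firing at least once
    return float(l0), float(alive_fraction)

def loss_recovered(ce_clean, ce_sae, ce_zero):
    """Assumed definition: 1.0 matches the clean model, 0.0 matches zero-ablation."""
    return (ce_zero - ce_sae) / (ce_zero - ce_clean)

# Usage on a toy latent matrix with two tokens and three latents.
z = np.array([[0.0, 1.0, 0.0],
              [0.0, 2.0, 3.0]])
l0, alive = sparsity_metrics(z)
```

In the toy matrix the first latent never fires, so it counts as dead, and the two tokens activate one and two latents respectively.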

#### Sparse-probing setup (Table [5](https://arxiv.org/html/2606.14990#S5.T5)).

We use the SAEBench probing suite of $8$ binary-classification tasks. Per dataset we train an $\ell_2$-regularised logistic regression on the full SAE feature vector (no $k$-sparsity constraint) using SAEBench’s default split, and report mean test accuracy across the dataset’s evaluation seeds.

### B.2 Relaxed Remez Exchange Details

This appendix expands on the relaxed Remez exchange used in Step 1 of §[4](https://arxiv.org/html/2606.14990#S4) to solve the min–max objective ([7](https://arxiv.org/html/2606.14990#S4.E7)). Writing $r_{(\bm{a},\bm{b})}(t)=P(t)/Q(t)$ in the standard-Padé form $P(t)=\sum_{i=0}^{p}a_i\,t^i$, $Q(t)=1+\sum_{j=1}^{q}b_j\,t^j$, and setting $K=p+q+1$, the $r$-th outer iteration carries an amplitude estimate $E_r$ together with $K+2$ alternation nodes $\{t_d^{(r)}\}_{d=0}^{K+1}\subset[-1,1]$. Replacing the unknown amplitude on the right-hand side of the Chebyshev equioscillation condition with $E_r$ yields the $K+2$ *linear* equations

$$
P\bigl(t_d^{(r)}\bigr)\,-\,\bigl[\,y_d^{(r)}-(-1)^d\,E_r\,\bigr]\,\bigl(\,Q\bigl(t_d^{(r)}\bigr)-1\,\bigr)\,-\,(-1)^d\,E_{r+1}\;=\;y_d^{(r)},\qquad d=0,\,\ldots,\,K+1, \tag{10}
$$

with $y_d^{(r)}=\phi^{\text{teacher}}\bigl(t_d^{(r)}\bigr)$, which we solve in the least-squares sense for $(\bm{a},\bm{b},E_{r+1})$; the nodes are then relocated to the $K+2$ largest local extrema of $|r_{(\bm{a},\bm{b})}(t_\ell)-y_\ell|$ on the dense grid $\{t_\ell\}_{\ell=1}^{N}$. The relaxation, using $E_r$ in place of the unknown $E_{r+1}$ on the right-hand side, reduces each inner step to a single linear solve and makes the algorithm robust on discontinuous targets such as $\mathrm{JumpReLU}_\theta$. At convergence ($|E_{r+1}-E_r|<\varepsilon$) the residual equioscillates and $r_{(\bm{a},\bm{b})}$ attains the best $L^\infty$ rational approximation of type $(p,q)$ on $[-1,1]$ by Chebyshev’s characterisation [Trefethen, [2019](https://arxiv.org/html/2606.14990#bib.bib36)]. Because Remez fits the standard-Padé denominator $Q$ rather than the safe-Padé form of ([3](https://arxiv.org/html/2606.14990#S3.E3)), we then distill the converged Remez rational onto the safe-Padé family by least squares on $\{t_\ell\}_{\ell=1}^{N}$.
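To make the "single linear solve" concrete, here is a hedged NumPy sketch of one linearised step. The rows are re-derived from the condition $r(t_d)-y_d=(-1)^d E$ by multiplying through by $Q(t_d)$ and substituting the previous amplitude $E_r$ inside the product; this sign convention is the sketch's own assumption and is not a transcription of (10). Node relocation and the outer loop are omitted, and the function names are hypothetical.

```python
import numpy as np

def relaxed_remez_solve(t, y, E_r, p, q):
    """One linearised least-squares solve for (a, b, E_next).

    t: the K + 2 alternation nodes, y: target values at those nodes,
    E_r: previous amplitude estimate, (p, q): rational type.
    """
    s = (-1.0) ** np.arange(t.size)        # alternating signs (-1)^d
    relaxed = y + s * E_r                  # value r is asked to take at t_d
    powers = np.vander(t, max(p, q) + 1, increasing=True)
    A_a = powers[:, : p + 1]               # columns for a_0..a_p
    A_b = -relaxed[:, None] * powers[:, 1 : q + 1]   # columns for b_1..b_q
    A_e = -s[:, None]                      # column for E_next
    A = np.hstack([A_a, A_b, A_e])
    sol, *_ = np.linalg.lstsq(A, y, rcond=None)
    return sol[: p + 1], sol[p + 1 : p + 1 + q], sol[-1]

def eval_rational(t, a, b):
    """Evaluate P(t) / Q(t) with Q(t) = 1 + sum_j b_j t^j (standard-Padé)."""
    P = np.polyval(a[::-1], t)
    Q = 1.0 + np.polyval(np.concatenate([b[::-1], [0.0]]), t)
    return P / Q

# Sanity check on a target that is itself of type (1, 1): with E_r = 0 the
# system is consistent, so the solve should return the target's coefficients.
p, q = 1, 1
t = np.linspace(-1.0, 1.0, p + q + 3)      # K + 2 nodes with K = p + q + 1
y = (1.0 + 2.0 * t) / (1.0 + 0.5 * t)
a, b, E_next = relaxed_remez_solve(t, y, E_r=0.0, p=p, q=q)
```

The system has $K+2$ equations and $K+1$ unknowns, which is why it is solved in the least-squares sense rather than exactly.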

### B.3 Rational Fitting on Synthetic Data

This appendix records the full numerical comparison of the four rational-fitting procedures of §[4](https://arxiv.org/html/2606.14990#S4) across the activation primitives used by current SAE families. We sweep superdiagonal types $(p,q)\in\{(3,2),(5,4),\ldots,(19,18)\}$ (extended to $(29,28)$ for the safe-Padé baselines), evaluate every fit on a uniform dense grid of $N=4001$ points over $[-1,1]$, and report the best $L^2$ mean-squared error attained by each fitter. The procedures compared are: (i) standard-Padé Remez, the relaxed exchange of Chen et al. [[2018](https://arxiv.org/html/2606.14990#bib.bib30)] that targets the $L^\infty$ minimax objective ([7](https://arxiv.org/html/2606.14990#S4.E7)) in the family $Q(t)=1+\sum_j\phi_j t^j$ with signed $\phi_j$; (ii) safe-Padé Remez via warm-start (Route A), in which the converged standard-Padé Remez coefficients are distilled onto the pole-free family $Q(t)=1+\sum_j|b_j||t|^j$ of ([3](https://arxiv.org/html/2606.14990#S3.E3)) by least-squares fitting on $\{t_\ell\}_{\ell=1}^{N}$; (iii) $L^2$ safe-Padé fit, minimising $\frac{1}{N}\sum_\ell(r_{(\bm{a},\bm{b})}(t_\ell)-y_\ell)^2$ over the safe-Padé parameters via Adam with cosine learning-rate decay; and (iv) $L^\infty$ safe-Padé fit, minimising the smoothed-supremum log-sum-exp surrogate of $\max_\ell|r_{(\bm{a},\bm{b})}(t_\ell)-y_\ell|$ via Adam. We refer to the safe-Padé form distilled from the standard-Padé Remez fit as Route A.
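The two objectives in (iii) and (iv) can be written down directly on the dense grid. The sketch below evaluates the pole-free family $Q(t)=1+\sum_j|b_j||t|^j$, the grid MSE, and a log-sum-exp surrogate of the supremum error. The sharpness parameter `beta` and all function names are hypothetical; the text does not specify how the surrogate is scaled.

```python
import numpy as np

def safe_pade(t, a, b):
    """P(t) / (1 + sum_j |b_j| |t|^j); the denominator is >= 1, so no poles."""
    num = np.polyval(a[::-1], t)           # np.polyval wants highest degree first
    den_coeffs = np.concatenate([np.abs(b[::-1]), [0.0]])
    den = 1.0 + np.polyval(den_coeffs, np.abs(t))
    return num / den

def l2_mse(r, y):
    """Mean-squared error on the grid, as in objective (iii)."""
    return np.mean((r - y) ** 2)

def smooth_sup(r, y, beta=100.0):
    """Log-sum-exp surrogate of max|r - y|; larger beta hugs the max more tightly."""
    e = beta * np.abs(r - y)
    m = e.max()                            # subtract the max for numerical stability
    return (m + np.log(np.sum(np.exp(e - m)))) / beta

# Usage on the N = 4001 grid with a ReLU target and arbitrary coefficients.
t = np.linspace(-1.0, 1.0, 4001)
y = np.maximum(t, 0.0)
r = safe_pade(t, [0.0, 0.5, 0.5], [0.1])
mse, sup_surrogate = l2_mse(r, y), smooth_sup(r, y)
```

The surrogate upper-bounds the true maximum and exceeds it by at most $\log(N)/\beta$, so it trades a small bias for a gradient that reaches every grid point.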

Table 7: Best-MSE rational fits across procedures. Each cell reports the (best degree, $L^2$ MSE on a uniform $N=4001$ grid over $[-1,1]$). Standard-Padé Remez minimises $L^\infty$ but is reported here under the same $L^2$ MSE for fair comparison; safe-Padé Remez (Route A) is the standard-Padé Remez fit distilled onto the safe-Padé family of ([3](https://arxiv.org/html/2606.14990#S3.E3)). “–” marks targets we did not run with the Route-A pipeline. **Bold** marks the best fitter per target.

![Refer to caption](https://arxiv.org/html/2606.14990v1/x3.png)

Figure 3: $L^2$ MSE versus rational degree on $[-1,1]$. Each curve traces the best $L^2$ MSE attained by one fitter as the type $(p,q)$ is swept across $\{(3,2),(5,4),\ldots,(19,18)\}$. (a) On $\mathrm{ReLU}$, standard-Padé Remez (red) decays near-exponentially with degree until type $(15,14)$, beyond which numerical conditioning of the linearised exchange dominates and the curve flattens. The pole-free safe-Padé $L^2$ (blue) and $L^\infty$ (green) fits saturate earlier, near $\mathrm{MSE}\approx 4\times 10^{-6}$, due to the strictly smaller capacity of the safe family. (b) On $\mathrm{JumpReLU}_{0.1}$, the same pattern holds and the Route-A safe-Padé Remez fit (purple) tracks the standard-Padé Remez curve at a fixed $\sim 7\times$ offset, the safe-Padé family ceiling.

#### Standard-Padé Remez dominates on smooth and small-jump targets.

On four of the six rows of Table [7](https://arxiv.org/html/2606.14990#A2.T7) ($\mathrm{ReLU}$ and $\mathrm{JumpReLU}_\theta$ for $\theta\in\{0.1,0.2,0.4\}$), the relaxed Remez exchange in the standard-Padé family attains the best MSE by a margin of roughly $2\text{–}30\times$ over the next-best independent fitter (and $1.6\text{–}5.8\times$ over Route-A safe-Padé Remez, which is itself distilled from this Remez solution). This is the expected behaviour of a Chebyshev best-rational solution: by Trefethen [[2019](https://arxiv.org/html/2606.14990#bib.bib36)]’s characterisation, the equioscillation property of Remez’s converged residual is necessary and sufficient for $L^\infty$ optimality, and on smooth or small-jump targets the resulting $L^\infty$ error closely tracks the $L^2$ MSE because the residual oscillation amplitude itself controls both norms. In particular, on $\mathrm{ReLU}$ at type $(15,14)$ Remez reaches MSE $3.8\times 10^{-7}$, well below the noise floor of any downstream SAE reconstruction objective, and on $\mathrm{JumpReLU}_{0.1}$ at type $(19,18)$ MSE $2.4\times 10^{-6}$, with sup-error tracking the information-theoretic half-jump floor $\sup|r-f|\geq\theta/2$.
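The half-jump floor follows from a short triangle-inequality argument. The sketch below assumes the usual definition $\mathrm{JumpReLU}_\theta(t)=t\cdot\mathbf{1}[t>\theta]$ with $0<\theta<1$, so that the jump lies inside $[-1,1]$, and an approximant $r$ that is continuous at $t=\theta$ (true for any rational function without a pole there).

```latex
% Assumptions: f(t) = JumpReLU_\theta(t) = t \cdot \mathbf{1}[t > \theta], 0 < \theta < 1,
% and r continuous at t = \theta.
\begin{aligned}
f(\theta^-) &= 0, \qquad f(\theta^+) = \theta, \\
\theta \;=\; \bigl| f(\theta^+) - f(\theta^-) \bigr|
  &\le \bigl| f(\theta^+) - r(\theta) \bigr| + \bigl| r(\theta) - f(\theta^-) \bigr|
  \;\le\; 2 \sup_{t \in [-1,1]} \bigl| r(t) - f(t) \bigr|, \\
\sup_{t \in [-1,1]} \bigl| r(t) - f(t) \bigr| &\ge \theta / 2 .
\end{aligned}
```

Each of the two terms in the middle line is a one-sided limit of $|r(t)-f(t)|$ as $t\to\theta$, so neither can exceed the supremum; no choice of degree can therefore push the worst-case error below half the jump.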

#### Safe-Padé Remez (Route A) recovers most of the standard-Padé quality while staying pole-free.

The safe-Padé family $Q(t)=1+\sum_j|b_j||t|^j$ is a strict subset of the standard-Padé family, so it cannot exceed standard-Padé Remez in approximation power; the relevant question is how much accuracy the pole-safety constraint costs. Route A warm-starts a safe-Padé fit from the converged standard-Padé Remez coefficients via $a_i\leftarrow\psi_i$, $b_j\leftarrow|\phi_j|$ and refines under $L^2$ on the same dense grid. The result on the two targets we ran end-to-end (Table [7](https://arxiv.org/html/2606.14990#A2.T7)) is a $2.5\times$ MSE gap to standard-Padé Remez on $\mathrm{ReLU}$ ($9.6\times 10^{-7}$ vs. $3.8\times 10^{-7}$) and a $\sim 6\times$ gap on $\mathrm{JumpReLU}_{0.1}$ ($1.4\times 10^{-5}$ vs. $2.4\times 10^{-6}$). Both are still $4\text{–}50\times$ tighter than the direct $L^2$ and $L^\infty$ safe-Padé fits within the same family, indicating that the warm-start from the Remez minimax solution materially helps the optimisation.

#### $L^2$ safe-Padé wins at large jump sizes due to Remez’s numerical instability.

At $\theta=0.3$ and $\theta=0.5$, standard-Padé Remez underperforms the $L^2$ safe-Padé fit even at lower degree (Remez’s best is $(5,4)$ vs. $L^2$’s best at $(29,28)$). This is not a deficiency of the Remez objective itself but of the linearised exchange: when the target’s discontinuity is large compared to its derivative scale, the linear system ([10](https://arxiv.org/html/2606.14990#A2.E10)) becomes severely ill-conditioned at high $(p,q)$, causing the exchange to oscillate or diverge. The $L^2$ safe-Padé fit, by contrast, is solved by Adam on a smooth non-convex landscape and remains numerically stable across all degrees we tested. The MSE gap is small in absolute terms (compare $9.8\times 10^{-4}$ vs. $8.9\times 10^{-4}$ at $\theta=0.3$) and reflects the fact that at large $\theta$ both methods are dominated by the half-jump floor anyway.

#### $L^\infty$ safe-Padé is consistently weakest in $L^2$ MSE, by design.

Across every row, the smoothed-supremum fit attains the worst $L^2$ MSE among the four procedures. This is expected: the log-sum-exp surrogate concentrates training pressure on the largest-error points, sacrificing $L^2$ MSE for tighter $L^\infty$ control. We include $L^\infty$ safe-Padé in the table for completeness, since it remains the right choice in any application that cares about worst-case error rather than average error.
