
# Data-Driven Variational Basis Learning Beyond Neural Networks: A Non-Neural Framework for Adaptive Basis Discovery
Source: [https://arxiv.org/html/2605.05221](https://arxiv.org/html/2605.05221)
###### Abstract

Classical representation systems such as Fourier series, wavelets, and fixed dictionaries provide analytically tractable basis expansions, but they are not intrinsically adapted to the empirical structure of modern high-dimensional data. Neural networks overcome this limitation by learning features from data, yet they do so through layered nonlinear parameterizations that often sacrifice interpretability, explicit control over basis structure, and mathematical transparency. In this manuscript we develop a non-neural alternative that learns basis functions directly from data through variational optimization. The proposed framework, termed *Data-Driven Variational Basis Learning* (DVBL), treats basis atoms as primary optimization variables and learns them jointly with sample-specific coefficients and, when appropriate, a latent linear evolution operator. This yields a data-adaptive basis expansion that remains explicit, interpretable, and amenable to rigorous analysis. We formulate the model, establish existence of minimizers, prove blockwise descent properties for an alternating minimization algorithm, give conditions for coefficient recovery and basis identifiability, and show how manifold and dynamical regularization can be integrated without invoking neural architectures. We also discuss the conceptual novelty of the framework relative to classical dictionary learning, spectral methods, Koopman operator methods, and deep representation learning.

## 1 Introduction

A central problem in modern data analysis is the construction of representations that are simultaneously expressive, compact, interpretable, and adapted to the geometry of empirical observations. Classical basis expansions, including Fourier systems, wavelets, spline bases, and other analytically prescribed dictionaries, have long provided mathematically elegant mechanisms for signal representation. Their power derives from closed-form structure, orthogonality or near-orthogonality, and well-developed approximation theory. However, such bases are generally fixed *a priori*. As a result, they may be poorly matched to the statistical regularities, anisotropies, nonlinear manifolds, or task-dependent structures present in contemporary data.

Neural networks address this limitation by learning representations from data. In place of a fixed basis, they learn parameterized features through repeated compositions of affine maps and nonlinear activation functions. This flexibility has produced remarkable empirical success across computer vision, language modeling, scientific machine learning, and sequence analysis. Yet this success comes with significant tradeoffs. Neural representations are often implicit rather than explicit; the learned features are distributed across many layers and parameters rather than identifiable as concrete basis functions. Moreover, the underlying optimization landscape is highly nonconvex and deeply compositional, which complicates theoretical analysis, interpretability, stability guarantees, and principled incorporation of domain constraints such as sparsity, symmetry, smoothness, or conservation laws.

The present work develops an alternative route. Rather than learning a deep nonlinear network, we directly learn a set of basis functions from data by solving a structured variational problem. The resulting representation remains of the familiar expansion form

$$x_i \approx \sum_{k=1}^{m} \alpha_{ik}\,\phi_k, \qquad (1.1)$$

but unlike Fourier or wavelet analysis, the basis elements $\{\phi_k\}_{k=1}^{m}$ are not fixed analytically. Instead, they are inferred from data jointly with the coefficients $\{\alpha_i\}_{i=1}^{N}$. This establishes a representation paradigm that is adaptive in the same broad sense as neural feature learning, yet fundamentally non-neural in construction.

At first glance, this viewpoint may appear close to dictionary learning. Indeed, dictionary learning is a natural point of departure. The contribution here is not merely to restate that literature in alternative language, but to place it within a broader *variational basis-learning* perspective that unifies several ideas that are often studied separately: explicit basis learning, sparse and structured coefficient inference, manifold-aware regularization, operator-based latent dynamics, and task-adaptive basis constraints. The proposed formalism yields a mathematically transparent object, namely a learned basis, while permitting general regularity constraints that make the basis smooth, localized, orthogonal, spectrally constrained, physically admissible, or dynamically coherent.

This manuscript has four main goals. First, we present a general formulation of *Data-Driven Variational Basis Learning* (DVBL), a non-neural framework for learning adaptive basis functions directly from data. Second, we provide a rigorous mathematical treatment, including existence of minimizers, descent properties of alternating optimization, and identifiability conditions. Third, we extend the representation to incorporate geometry and dynamics through manifold regularization and latent linear evolution. Fourth, we clarify the novelty of the framework relative to neural networks, classical sparse coding, spectral graph methods, and Koopman-based approaches.

The overarching thesis is that adaptive basis learning need not require layered neural parameterizations. One may instead retain the conceptual clarity of basis expansions while learning the basis itself from data through structured optimization. This yields a representation family that is explicit, interpretable, mathematically analyzable, and compatible with many domain-specific priors.

## 2 General Framework

### 2.1 Problem setup

Let $\{x_i\}_{i=1}^{N} \subset \mathbb{R}^{d}$ denote a collection of observations. We seek a set of basis atoms

$$\Phi = [\phi_1, \phi_2, \dots, \phi_m] \in \mathbb{R}^{d \times m},$$

together with coefficient vectors $\alpha_i \in \mathbb{R}^{m}$ such that each observation admits an approximate expansion

$$x_i \approx \Phi \alpha_i. \qquad (2.1)$$

The columns $\phi_k$ play the role of learned basis functions, while the coefficients $\alpha_i$ encode each sample in the learned basis.

We emphasize that in this framework the basis is a primary optimization variable. The representation is therefore not obtained through a neural map $x \mapsto f_{\theta}(x)$, but through explicit decomposition in a learned basis. This distinction is conceptual as well as mathematical. The representation is not hidden in the internal state of a multilayer parameterization; it is the basis itself.

### 2.2 Variational objective

The general DVBL objective takes the form

$$\min_{\Phi,\,\{\alpha_i\}_{i=1}^{N}} \mathcal{J}(\Phi, \{\alpha_i\}) := \sum_{i=1}^{N} \|x_i - \Phi\alpha_i\|_2^2 + \lambda \sum_{i=1}^{N} R(\alpha_i) + \mu\,\Omega(\Phi), \qquad (2.2)$$

subject to structural constraints on $\Phi$, such as

$$\|\phi_k\|_2 = 1 \quad \text{for } k = 1, \dots, m, \qquad (2.3)$$

and optionally

$$|\phi_k^{\top} \phi_{\ell}| \leq \delta, \qquad k \neq \ell, \qquad (2.4)$$

for some coherence parameter $\delta \geq 0$.

The term $R(\alpha_i)$ is a coefficient regularizer, which may promote sparsity, group structure, temporal smoothness, or other forms of low-complexity coding. Typical choices include

$$R(\alpha_i) = \|\alpha_i\|_1, \qquad R(\alpha_i) = \|\alpha_i\|_2^2, \qquad R(\alpha_i) = \sum_{g \in \mathcal{G}} \|\alpha_{i,g}\|_2.$$

The term $\Omega(\Phi)$ regularizes the basis itself. Examples include

$$\Omega(\Phi) = \|\Phi\|_F^2, \qquad \Omega(\Phi) = \sum_{k=1}^{m} \|\nabla \phi_k\|_2^2, \qquad \Omega(\Phi) = \|\Phi^{\top}\Phi - I\|_F^2, \qquad (2.5)$$

depending on whether one wishes to encourage bounded energy, smooth atoms, or near-orthogonality.

This formulation is intentionally broad. It includes classical sparse coding as a special case, but also supports more structured models in which the basis is regularized to reflect geometry, frequency localization, physical admissibility, or dynamical usefulness.

### 2.3 Matrix form

Let

$$X = [x_1, \dots, x_N] \in \mathbb{R}^{d \times N}, \qquad A = [\alpha_1, \dots, \alpha_N] \in \mathbb{R}^{m \times N}.$$

Then the reconstruction term can be written compactly as

$$\sum_{i=1}^{N} \|x_i - \Phi\alpha_i\|_2^2 = \|X - \Phi A\|_F^2.$$

Hence [(2.2)](https://arxiv.org/html/2605.05221#S2.E2) becomes

$$\min_{\Phi, A} \|X - \Phi A\|_F^2 + \lambda\,\mathcal{R}(A) + \mu\,\Omega(\Phi), \qquad (2.6)$$

where $\mathcal{R}(A) = \sum_{i=1}^{N} R(\alpha_i)$.
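As a concreteness check, the following minimal NumPy sketch evaluates the matrix-form objective (2.6) for one common instantiation, taking $\mathcal{R}$ to be the $\ell_1$ penalty and $\Omega$ the squared Frobenius norm; the function name and these regularizer choices are illustrative assumptions, not part of the formal framework.

```python
import numpy as np

def dvbl_objective(X, Phi, A, lam, mu):
    """Evaluate objective (2.6) with R = l1 and Omega = squared Frobenius norm.

    X   : (d, N) data matrix
    Phi : (d, m) basis matrix, ideally with unit-norm columns
    A   : (m, N) coefficient matrix
    """
    recon = np.linalg.norm(X - Phi @ A, 'fro') ** 2   # reconstruction term
    coeff = lam * np.abs(A).sum()                     # l1 coefficient regularizer
    basis = mu * np.linalg.norm(Phi, 'fro') ** 2      # Frobenius basis regularizer
    return recon + coeff + basis
```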

## 3 Relation to Classical and Neural Representations

It is useful to contrast the proposed formulation with existing representation paradigms. In a classical Fourier expansion, one writes

$$x \approx \sum_{k=1}^{m} c_k \psi_k,$$

where $\{\psi_k\}$ are fixed harmonics determined analytically. In a wavelet expansion, the basis is again prescribed in advance up to scale and translation. In both cases, the coefficients depend on the data, but the basis does not.

In a neural network, by contrast, one does not generally learn an explicit basis. Instead, the representation is induced through compositions such as

$$x \mapsto W_L\,\sigma(W_{L-1}\,\sigma(\cdots\,\sigma(W_1 x))),$$

and the corresponding features are distributed across the entire parameterization. Even when a neural network can be interpreted as learning a rich function space, the learned atoms are seldom explicit in the sense of [(1.1)](https://arxiv.org/html/2605.05221#S1.E1).

The DVBL framework occupies a different point in the design space. Like neural networks, it adapts to data. Like classical basis systems, it yields explicit expansion elements. It therefore combines adaptivity with representational transparency. This synthesis is central to its appeal in settings where mathematical control, interpretability, and constrained basis design are important.

## 4 Existence and Basic Properties

We now establish that the variational problem is well posed under mild assumptions.

###### Assumption 1.

The coefficient regularizer $R : \mathbb{R}^{m} \to [0, \infty)$ is proper, lower semicontinuous, and coercive or bounded below by a coercive function on the feasible set. The basis regularizer $\Omega : \mathbb{R}^{d \times m} \to [0, \infty)$ is proper and lower semicontinuous. The feasible set

$$\mathcal{C}_{\Phi} := \{\Phi \in \mathbb{R}^{d \times m} : \|\phi_k\|_2 = 1\ \forall k,\ \text{and any additional closed constraints}\}$$

is nonempty and compact.

###### Theorem 1 (Existence of minimizers).

Under Assumption [1](https://arxiv.org/html/2605.05221#Thmassumption1), the optimization problem

$$\min_{\Phi \in \mathcal{C}_{\Phi},\ A \in \mathbb{R}^{m \times N}} \|X - \Phi A\|_F^2 + \lambda\,\mathcal{R}(A) + \mu\,\Omega(\Phi) \qquad (4.1)$$

admits at least one global minimizer.

###### Proof.

The objective is the sum of a continuous term $\|X - \Phi A\|_F^2$ and lower semicontinuous terms $\lambda\,\mathcal{R}(A)$ and $\mu\,\Omega(\Phi)$, hence it is lower semicontinuous on $\mathcal{C}_{\Phi} \times \mathbb{R}^{m \times N}$. Because $\mathcal{C}_{\Phi}$ is compact and $\mathcal{R}(A)$ is coercive or bounded below by a coercive function, the objective is coercive in $A$. Therefore all sublevel sets are closed and bounded in the product space. By the direct method of the calculus of variations, a minimizing sequence has a convergent subsequence whose limit lies in the feasible set and attains the infimum. ∎

We next record a simple but useful property of the subproblems.

###### Proposition 1 (Convexity of the coefficient subproblem).

Fix $\Phi$. If $R$ is convex, then the optimization problem

$$\min_{A} \|X - \Phi A\|_F^2 + \lambda\,\mathcal{R}(A) \qquad (4.2)$$

is convex in $A$. If, in addition, $R$ is strictly convex or $\Phi^{\top}\Phi$ is positive definite on the relevant support, then the minimizer is unique.

###### Proof.

The map $A \mapsto \|X - \Phi A\|_F^2$ is convex quadratic. The sum with a convex regularizer remains convex. Strict convexity follows if one of the summands is strictly convex on the feasible domain. ∎

###### Proposition 2 (Convexity of the basis subproblem under quadratic regularization).

Fix $A$. If $\Omega$ is convex and the feasible set for $\Phi$ is convex, then

$$\min_{\Phi} \|X - \Phi A\|_F^2 + \mu\,\Omega(\Phi) \qquad (4.3)$$

is convex in $\Phi$.

###### Proof.

For fixed $A$, the map $\Phi \mapsto \|X - \Phi A\|_F^2$ is convex quadratic in $\Phi$. Adding a convex regularizer preserves convexity. ∎

## 5 Alternating Minimization Algorithm

### 5.1 Blockwise optimization

Because the full problem is jointly nonconvex in $(\Phi, A)$, a natural computational strategy is alternating minimization. One iterates between

1. solving for the coefficients $A$ with $\Phi$ fixed, and
2. solving for the basis $\Phi$ with $A$ fixed.

When $R$ is convex and $\Omega$ is simple, each subproblem is substantially more tractable than the full joint optimization.

For sparse coding, the coefficient step may be carried out using ISTA, FISTA, coordinate descent, or proximal gradient methods. The basis step often reduces to regularized least squares followed by column normalization. In this way, the algorithm avoids the layered backpropagation of neural networks and instead operates through explicit representation updates.
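For illustration, a minimal ISTA sketch for the $\ell_1$ coefficient subproblem with $\Phi$ held fixed might look as follows; the step size comes from the Lipschitz constant of the quadratic term's gradient, and the function names and iteration count are illustrative.

```python
import numpy as np

def soft_threshold(Z, tau):
    """Elementwise soft-thresholding, the prox of tau * ||.||_1."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def ista_coefficients(X, Phi, lam, n_iters=200):
    """ISTA for min_A ||X - Phi A||_F^2 + lam ||A||_1, with Phi fixed."""
    # Step size 1/L, where L = 2 * sigma_max(Phi)^2 bounds the gradient's Lipschitz constant.
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2
    step = 1.0 / L
    A = np.zeros((Phi.shape[1], X.shape[1]))
    for _ in range(n_iters):
        grad = 2.0 * Phi.T @ (Phi @ A - X)          # gradient of the quadratic term
        A = soft_threshold(A - step * grad, step * lam)
    return A
```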

### 5.2 Algorithm pseudocode

**Algorithm 1** Data-Driven Variational Basis Learning (DVBL)

1. **Input:** data matrix $X \in \mathbb{R}^{d \times N}$, number of atoms $m$, regularization parameters $\lambda, \mu$, maximum iterations $T$
2. **Initialize:** basis $\Phi^{(0)} = [\phi_1^{(0)}, \dots, \phi_m^{(0)}]$ with $\|\phi_k^{(0)}\|_2 = 1$
3. **for** $t = 0, 1, \dots, T-1$ **do**
4. **Coefficient update:** $A^{(t+1)} \in \arg\min_{A} \|X - \Phi^{(t)} A\|_F^2 + \lambda\,\mathcal{R}(A)$
5. **Basis update:** $\widetilde{\Phi}^{(t+1)} \in \arg\min_{\Phi} \|X - \Phi A^{(t+1)}\|_F^2 + \mu\,\Omega(\Phi)$
6. **Atom normalization:** for $k = 1, \dots, m$, set $\phi_k^{(t+1)} = \widetilde{\phi}_k^{(t+1)} / \|\widetilde{\phi}_k^{(t+1)}\|_2$ whenever $\widetilde{\phi}_k^{(t+1)} \neq 0$
7. **Stopping criterion:** terminate if $|\mathcal{J}^{(t+1)} - \mathcal{J}^{(t)}| / (1 + \mathcal{J}^{(t)}) < \varepsilon$
8. **end for**
9. **Output:** learned basis $\Phi^{(T)}$ and coefficients $A^{(T)}$
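The following compact sketch instantiates Algorithm 1 under the same illustrative assumptions as above ($R = \|\cdot\|_1$, $\Omega = \|\cdot\|_F^2$, for which the basis update is a ridge regression with closed form $\Phi = XA^{\top}(AA^{\top} + \mu I)^{-1}$); it reuses `ista_coefficients` from the earlier sketch and omits the stopping criterion for brevity.

```python
import numpy as np

def dvbl(X, m, lam=0.1, mu=1e-3, T=50, seed=0):
    """Minimal sketch of Algorithm 1 with l1 coefficients and ridge basis update."""
    rng = np.random.default_rng(seed)
    Phi = rng.standard_normal((X.shape[0], m))
    Phi /= np.linalg.norm(Phi, axis=0, keepdims=True)        # unit-norm atoms
    for _ in range(T):
        A = ista_coefficients(X, Phi, lam)                   # coefficient update
        # Basis update: min_Phi ||X - Phi A||_F^2 + mu ||Phi||_F^2 (closed form).
        Phi = X @ A.T @ np.linalg.inv(A @ A.T + mu * np.eye(m))
        norms = np.linalg.norm(Phi, axis=0, keepdims=True)
        Phi = np.where(norms > 0, Phi / norms, Phi)          # atom normalization
    return Phi, A
```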

### 5.3 Monotonicity and convergence to critical points

The key theoretical feature of alternating minimization is that the objective decreases monotonically under exact block updates.

###### Theorem 2 (Monotonic descent).

Suppose each coefficient update in Algorithm [1](https://arxiv.org/html/2605.05221#alg1) solves the coefficient subproblem exactly, and each basis update solves the basis subproblem exactly before normalization, with normalization incorporated into the feasible basis set. Then the objective sequence

$$\mathcal{J}^{(t)} := \mathcal{J}(\Phi^{(t)}, A^{(t)})$$

is nonincreasing:

$$\mathcal{J}^{(t+1)} \leq \mathcal{J}^{(t)} \qquad \text{for all } t \geq 0.$$

###### Proof.

At iteration $t$, the coefficient update minimizes the objective over $A$ with $\Phi = \Phi^{(t)}$ fixed, hence

$$\mathcal{J}(\Phi^{(t)}, A^{(t+1)}) \leq \mathcal{J}(\Phi^{(t)}, A^{(t)}).$$

The subsequent basis update minimizes the objective over feasible $\Phi$ with $A = A^{(t+1)}$ fixed, so

$$\mathcal{J}(\Phi^{(t+1)}, A^{(t+1)}) \leq \mathcal{J}(\Phi^{(t)}, A^{(t+1)}).$$

Combining the two inequalities yields the claim. ∎

###### Corollary 1.

If the objective is bounded below, then the sequence $\{\mathcal{J}^{(t)}\}_{t \geq 0}$ converges.

###### Proof.

The objective is nonnegative under the standing assumptions and is nonincreasing by Theorem [2](https://arxiv.org/html/2605.05221#Thmtheorem2). Every bounded monotone sequence converges. ∎

Under additional regularity assumptions one may identify limit points with stationary or critical points.

###### Theorem 3 (Limit points are blockwise critical).

Assume the iterates remain in a compact set, each block subproblem is solved exactly, and the objective satisfies suitable regularity conditions such as the Kurdyka–Łojasiewicz property. Then every accumulation point $(\Phi^{\star}, A^{\star})$ of the sequence generated by Algorithm [1](https://arxiv.org/html/2605.05221#alg1) is a critical point of the constrained objective.

###### Proof sketch.

This follows from standard block coordinate descent theory for lower semicontinuous semialgebraic or tame objectives. Monotonic decrease provides summability of stepwise improvements, compactness gives existence of accumulation points, and the Kurdyka–Łojasiewicz property rules out oscillatory behavior incompatible with criticality. The result then follows from established descent arguments for alternating minimization on nonconvex but block-structured problems. ∎

## 6 Recovery and Identifiability

The usefulness of a learned basis depends not only on optimization but also on the extent to which the basis is recoverable from data. We now state representative results under sparse generative assumptions.

### 6.1 Sparse generative model

Assume the data are generated according to

$$x_i = \Phi_{\star} \alpha_i^{\star} + \varepsilon_i, \qquad (6.1)$$

where $\Phi_{\star} \in \mathbb{R}^{d \times m}$ is the true basis, each coefficient vector $\alpha_i^{\star}$ is $s$-sparse, and $\varepsilon_i$ is noise.

Recovery can be studied at two levels. The first is *coefficient recovery* for a fixed basis. The second is *basis identifiability*, namely whether the true basis can be inferred up to permutation and sign.

###### Definition 1 (Mutual coherence).

The mutual coherence of a basis $\Phi = [\phi_1, \dots, \phi_m]$ with unit-norm columns is

$$\mu(\Phi) := \max_{k \neq \ell} |\phi_k^{\top} \phi_{\ell}|.$$

###### Theorem 4 (Uniqueness of sparse coefficients under coherence).

Let $\Phi$ have unit-norm columns and mutual coherence $\mu(\Phi)$. If a coefficient vector $\alpha$ satisfies

$$\|\alpha\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(\Phi)}\right),$$

then the representation $x = \Phi\alpha$ is the unique sparsest representation of $x$ in the dictionary $\Phi$.

###### Proof.

This is a standard coherence-based uniqueness result for sparse representation. The argument proceeds by contradiction: if two distinct sparse representations exist, their difference lies in the nullspace of $\Phi$ and induces a linear dependence among fewer than $1 + 1/\mu(\Phi)$ atoms, contradicting the coherence bound. ∎
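A small numerical illustration of Definition 1 and the resulting sparsity bound of Theorem 4 follows; the random basis here is only an example of an approximately incoherent dictionary.

```python
import numpy as np

def mutual_coherence(Phi):
    """mu(Phi) = max_{k != l} |phi_k^T phi_l| for a basis with unit-norm columns."""
    G = Phi.T @ Phi                  # Gram matrix of the atoms
    np.fill_diagonal(G, 0.0)         # ignore the unit diagonal
    return np.abs(G).max()

# Example: a random basis is approximately incoherent after column normalization.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((64, 128))
Phi /= np.linalg.norm(Phi, axis=0)
mu = mutual_coherence(Phi)
# Theorem 4 guarantees uniqueness for ||alpha||_0 < (1 + 1/mu) / 2.
print(mu, 0.5 * (1.0 + 1.0 / mu))
```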

The preceding result implies that once the basis is sufficiently incoherent, sparse coefficient inference is well posed. Basis recovery requires a stronger distributional condition.

###### Theorem 5 (Identifiability up to permutation and sign, informal).

Assume the samples are generated by [(6.1)](https://arxiv.org/html/2605.05221#S6.E1) with sufficiently rich sparse supports, bounded noise, and a true basis $\Phi_{\star}$ satisfying incoherence and nondegeneracy conditions. Then any global minimizer of the noiseless or sufficiently low-noise DVBL objective coincides with $\Phi_{\star}$ up to signed permutation of columns.

###### Proof sketch.

The ambiguity class of factorization under sparse coding is known to reduce, under suitable support diversity and incoherence assumptions, to permutation and sign changes. Richness of supports ensures that each atom is repeatedly activated in linearly informative contexts. The sparsity penalty rules out dense alternative factorizations, while incoherence excludes degenerate mixing of atoms. Standard identifiability arguments for sparse dictionary learning then imply recovery up to the inherent signed permutation symmetry. ∎

## 7 Manifold-Regularized Basis Learning

### 7.1 Geometric motivation

In many datasets, observations do not fill the ambient space $\mathbb{R}^{d}$ uniformly but instead concentrate near a lower-dimensional manifold. A basis learned purely from reconstruction may fail to capture this geometry. To address this, we augment the objective with a manifold regularizer.

Let $W \in \mathbb{R}^{N \times N}$ be a similarity matrix on the samples and let $L = D - W$ denote the graph Laplacian, where $D_{ii} = \sum_{j} W_{ij}$. If nearby samples on the data manifold should admit similar coefficient vectors, then one may penalize coefficient roughness over the graph via

$$\mathrm{Tr}(A L A^{\top}). \qquad (7.1)$$

This term is small precisely when neighboring data points have similar coefficient encodings.

The manifold-regularized objective becomes

$$\min_{\Phi, A} \|X - \Phi A\|_F^2 + \lambda\,\mathcal{R}(A) + \eta\,\mathrm{Tr}(A L A^{\top}) + \mu\,\Omega(\Phi). \qquad (7.2)$$
###### Proposition 3 (Interpretation of the graph regularizer).

For any coefficient matrix $A = [\alpha_1, \dots, \alpha_N]$,

$$\mathrm{Tr}(A L A^{\top}) = \frac{1}{2} \sum_{i,j=1}^{N} W_{ij} \|\alpha_i - \alpha_j\|_2^2. \qquad (7.3)$$

###### Proof.

Expanding the trace yields

$$\mathrm{Tr}(A D A^{\top}) - \mathrm{Tr}(A W A^{\top}) = \sum_{i} D_{ii} \|\alpha_i\|_2^2 - \sum_{i,j} W_{ij}\,\alpha_i^{\top}\alpha_j.$$

Symmetrizing the second term produces the claimed identity. ∎

Equation [(7.3)](https://arxiv.org/html/2605.05221#S7.E3) shows that manifold regularization explicitly enforces local geometric consistency of the learned representation. This makes the basis not merely reconstructive but also geometry-adaptive.
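The identity (7.3) is easy to verify numerically; the following sketch, with arbitrary random $W$ and $A$, checks that the two sides agree.

```python
import numpy as np

# Numerical check of (7.3): Tr(A L A^T) = 0.5 * sum_ij W_ij ||alpha_i - alpha_j||^2.
rng = np.random.default_rng(1)
N, m = 20, 5
W = rng.random((N, N)); W = 0.5 * (W + W.T)        # symmetric similarity matrix
L = np.diag(W.sum(axis=1)) - W                      # graph Laplacian L = D - W
A = rng.standard_normal((m, N))                     # coefficient matrix

lhs = np.trace(A @ L @ A.T)
diff = A[:, :, None] - A[:, None, :]                # pairwise alpha_i - alpha_j
rhs = 0.5 * (W * (diff ** 2).sum(axis=0)).sum()
assert np.isclose(lhs, rhs)
```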

## 8 Dynamical Extension: Basis Learning with Latent Linear Evolution

### 8.1 Motivation

For time series, trajectories, and dynamical data, reconstruction alone is often insufficient. One also wishes the latent representation to evolve according to a simple law. A particularly attractive possibility is approximate linear evolution in the learned coefficient space. This echoes the philosophy of Koopman operator methods, but here the basis itself is learned variationally rather than prescribed.

Suppose the observations are temporally ordered as $\{x_t\}_{t=1}^{T}$. Write $z_t \in \mathbb{R}^{m}$ for the coefficient vector associated with $x_t$. We impose both reconstruction and latent linearity:

$$x_t \approx \Phi z_t, \qquad z_{t+1} \approx A z_t, \qquad (8.1)$$

where $A \in \mathbb{R}^{m \times m}$ is a learned latent evolution operator.

The corresponding objective is

$$\min_{\Phi, A, \{z_t\}_{t=1}^{T}} \sum_{t=1}^{T} \|x_t - \Phi z_t\|_2^2 + \beta \sum_{t=1}^{T-1} \|z_{t+1} - A z_t\|_2^2 + \lambda \sum_{t=1}^{T} R(z_t) + \mu\,\Omega(\Phi) + \nu\,\Psi(A), \qquad (8.2)$$

where $\Psi(A)$ may enforce stability, sparsity, low rank, or spectral control of the latent dynamics.

This extension is especially important conceptually, since it shows that the learned basis can be selected not only for faithful reconstruction but also for dynamical coherence. One thereby obtains atoms that are useful for prediction, system identification, and interpretable latent evolution.

### 8.2 Closed-form update for the latent operator

Fix $\Phi$ and $\{z_t\}$. Define

$$Z_{-} = [z_1, \dots, z_{T-1}], \qquad Z_{+} = [z_2, \dots, z_T].$$

If $\Psi(A) = \|A\|_F^2$, then the operator subproblem is

$$\min_{A} \|Z_{+} - A Z_{-}\|_F^2 + \nu \|A\|_F^2,$$

whose minimizer is

$$A^{\star} = Z_{+} Z_{-}^{\top} (Z_{-} Z_{-}^{\top} + \nu I)^{-1}. \qquad (8.3)$$

Thus the dynamical component can often be updated in closed form, further highlighting the analytical simplicity of the non-neural formulation.

###### Proposition 4 (Strong convexity of the operator step).

If $\nu > 0$, then the operator subproblem is strongly convex in $A$ and therefore admits a unique minimizer given by [(8.3)](https://arxiv.org/html/2605.05221#S8.E3).

###### Proof.

The objective is quadratic in $A$ with Hessian proportional to $Z_{-} Z_{-}^{\top} + \nu I$, which is positive definite when $\nu > 0$. ∎
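The closed-form update (8.3) translates directly into code; a minimal sketch with illustrative names:

```python
import numpy as np

def latent_operator_update(Z, nu):
    """Closed-form update (8.3): A* = Z+ Z-^T (Z- Z-^T + nu I)^{-1}.

    Z  : (m, T) matrix of latent coefficient states [z_1, ..., z_T]
    nu : ridge parameter (nu > 0 guarantees a unique minimizer)
    """
    Z_minus, Z_plus = Z[:, :-1], Z[:, 1:]
    m = Z.shape[0]
    return Z_plus @ Z_minus.T @ np.linalg.inv(Z_minus @ Z_minus.T + nu * np.eye(m))
```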

## 9 Approximation Perspective

The learned basis model can be interpreted as an adaptive approximation scheme. For a fixed number of atoms $m$, the quality of approximation depends on how effectively the data can be captured by a low-complexity coefficient family relative to the learned basis.

Let $\mathcal{M} \subset \mathbb{R}^{d}$ denote the data manifold or support set. Classical approximation theory asks how well $\mathcal{M}$ can be approximated by linear subspaces or prescribed basis systems. In DVBL, the approximation class is instead

$$\mathcal{A}_{m,s} = \left\{x \in \mathbb{R}^{d} : x = \Phi\alpha,\ \Phi \in \mathcal{C}_{\Phi},\ \|\alpha\|_0 \leq s\right\}.$$

This class is adaptive because $\Phi$ is learned from the data rather than fixed independently of them.

###### Proposition 5 (Best adaptive basis error).

For any dataset $X = [x_1, \dots, x_N]$, the optimal DVBL reconstruction error with no regularization,

$$E_m(X) := \inf_{\Phi \in \mathbb{R}^{d \times m},\, A \in \mathbb{R}^{m \times N}} \|X - \Phi A\|_F^2, \qquad (9.1)$$

equals the squared error of the best rank-$m$ approximation of $X$:

$$E_m(X) = \sum_{j > m} \sigma_j(X)^2. \qquad (9.2)$$

###### Proof.

This is the Eckart–Young–Mirsky theorem. The factorization $\Phi A$ has rank at most $m$, and the best rank-$m$ approximation error is achieved by truncating the singular value decomposition of $X$. ∎
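Proposition 5 can be checked numerically by comparing the SVD-truncation error with the tail singular energy, as in the following sketch (random data, illustrative sizes):

```python
import numpy as np

# Check of (9.2): the unregularized rank-m DVBL error equals the tail singular energy.
rng = np.random.default_rng(2)
X = rng.standard_normal((30, 100))
m = 10
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Phi = U[:, :m]                        # optimal basis: top-m left singular vectors
A = np.diag(s[:m]) @ Vt[:m]           # corresponding optimal coefficients
err = np.linalg.norm(X - Phi @ A, 'fro') ** 2
assert np.isclose(err, (s[m:] ** 2).sum())
```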

## 10 Novelty Discussion

The novelty of the proposed framework should be understood carefully. It does not claim that the abstract idea of learning a dictionary from data is new in the narrow historical sense. Rather, its contribution lies in articulating and formalizing a broader non-neural alternative to representation learning in which *basis functions themselves* are the central learned objects and are trained through a variational principle rich enough to subsume reconstruction, sparsity, geometry, and dynamics within one coherent formulation.

First, the framework differs from standard neural networks at the representational level. Neural networks learn implicit features through layered composition. DVBL learns explicit basis atoms. The distinction matters in scientific and engineering settings where one wants to inspect, constrain, regularize, or physically interpret the learned representation itself. In a neural model, there is no single canonical notion of a learned basis. In DVBL, the basis is the primary object of study.

Second, the framework extends beyond classical dictionary learning by emphasizing basis regularization and operator structure as first-class modeling elements. Much of the sparse coding literature focuses on reconstruction with sparsity. Here, the basis can be endowed with smoothness, orthogonality, localization, graph consistency, spectral shaping, or dynamical coherence. This shifts the perspective from a narrow coding problem to a general theory of adaptive basis design.

Third, the dynamical extension distinguishes the framework from many static representation methods. By learning a basis in which the coefficients evolve approximately linearly, one obtains a representation that is simultaneously reconstructive and predictive. This creates a bridge between sparse approximation, operator-theoretic modeling, and interpretable latent dynamics, without requiring recurrent or sequence-based neural architectures.

Fourth, the framework is well suited to scientific machine learning because constraints can be imposed directly on the basis functions. If the atoms represent spatial modes, physical fields, spectral filters, or solution components of a PDE, one may penalize violations of smoothness, boundary conditions, conservation laws, or operator residuals directly at the level of the basis. This is substantially more transparent than embedding such structure indirectly into a neural parameterization.

Finally, the framework offers a conceptual answer to the question that motivates this work: can one learn basis functions from data without using neural networks? The answer is yes. One can formulate an explicit variational problem whose solution is a learned basis expansion. This provides a mathematically rigorous and practically flexible middle ground between hand-designed basis systems and deep black-box models.

## 11 Practical Variants

The generality of the framework permits many concrete instantiations.

A *sparse adaptive basis model* is obtained by setting $R(\alpha) = \|\alpha\|_1$ and choosing mild basis regularization. This yields a direct analogue of sparse coding, but interpreted explicitly as learned basis discovery.

A *smooth spatial basis model* arises when each atom $\phi_k$ is defined over a spatial grid and one sets

$$\Omega(\Phi) = \sum_{k=1}^{m} \|\nabla \phi_k\|_2^2.$$

This is appropriate for imaging and physical field data.

A *graph-geometric basis model* augments the coefficient objective by $\mathrm{Tr}(A L A^{\top})$, ensuring that nearby samples share similar coordinate descriptions. This is useful for data lying on nonlinear manifolds.

A *dynamic basis model* incorporates the latent evolution operator $A$ from [(8.2)](https://arxiv.org/html/2605.05221#S8.E2), making the representation simultaneously reconstructive and predictive.

A *physics-constrained basis model* adds residual penalties of the form

$$\Omega_{\mathrm{phys}}(\Phi) = \sum_{k=1}^{m} \|\mathcal{L}\phi_k\|_2^2,$$

where $\mathcal{L}$ is a differential or integral operator encoding the governing physics. In this case, the learned atoms are not merely empirical features; they are data-adaptive modes constrained by the underlying scientific law.
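On a discretized grid, the smoothness and physics penalties above reduce to finite-difference computations. The following sketch assumes atoms sampled on a uniform 1-D grid and takes $\mathcal{L}$ to be the discrete Laplacian; both choices are illustrative assumptions.

```python
import numpy as np

def smoothness_penalty(Phi):
    """Omega(Phi) = sum_k ||grad phi_k||_2^2 via finite differences on a 1-D grid."""
    return float((np.diff(Phi, axis=0) ** 2).sum())

def physics_penalty(Phi, Lop):
    """Omega_phys(Phi) = sum_k ||L phi_k||_2^2 for a discretized linear operator Lop."""
    return float(np.linalg.norm(Lop @ Phi, 'fro') ** 2)

# Example: atoms on a 64-point grid, with Lop the discrete 1-D Laplacian.
d, m = 64, 8
Lop = -2.0 * np.eye(d) + np.eye(d, k=1) + np.eye(d, k=-1)
Phi = np.random.default_rng(3).standard_normal((d, m))
print(smoothness_penalty(Phi), physics_penalty(Phi, Lop))
```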

## 12 Extension to Language Modeling: A Non-Neural LLM Architecture Based on Adaptive Basis Learning

### 12.1 Motivation

The preceding sections developed a non-neural framework for learning basis functions directly from data. We now ask whether the same philosophy can be extended from generic representation learning to large-scale autoregressive language modeling. The purpose of this section is not to claim that such a system has already matched transformer-based large language models in practice. Rather, the goal is to formulate a mathematically coherent *non-neural language modeling architecture* in which the representational primitives, contextual state, and predictive operator are all learned variationally from text corpora without layered neural networks.

Modern LLMs are typically built from transformer blocks that combine token embeddings, multi-head self-attention, feed-forward nonlinearities, normalization, and residual pathways. In those systems, token meaning and contextual reasoning emerge implicitly through deep compositional feature maps. By contrast, the present framework seeks to construct a language model from explicit learned basis functions over token and context statistics. In place of stacked attention layers, the model learns a basis-adaptive state representation together with operators that propagate that state across a sequence and produce next-token distributions.

The central question is therefore the following: can one define an autoregressive language model

$$p(w_t \mid w_{<t})$$

using only learned bases, coefficient inference, operator evolution, and variational training, while avoiding neural-network parameterizations altogether? The answer proposed here is affirmative at the level of architecture design. The resulting model may be viewed as a *Basis-State Language Model* (BSLM), a non-neural analogue of a large language model in which linguistic context is encoded in a learned coefficient state evolving in a data-adaptive basis.

### 12.2 Notation

Let $\mathcal{V}$ be a vocabulary of size $V$, and let a text corpus be represented as token sequences

$$(w_1, w_2, \dots, w_T), \qquad w_t \in \mathcal{V}.$$

For each token $w_t$, let $e_{w_t} \in \mathbb{R}^{V}$ denote the one-hot representation. We seek a basis matrix

$$\Phi \in \mathbb{R}^{V \times m},$$

whose columns represent learned token-space basis atoms, together with a latent coefficient state $z_t \in \mathbb{R}^{m}$ that summarizes the prefix $w_{\leq t}$.

The key structural idea is that the contextual meaning of the prefix is not stored in a deep hidden vector produced by stacked neural layers, but in the coefficient state $z_t$ relative to a learned basis $\Phi$. The next-token distribution is then generated from this state through an explicit probabilistic decoding rule.

## 13 Architecture of the Basis-State Language Model

### 13.1 Token basis layer

The first component is a learned basis over the vocabulary simplex. Each token one-hot vector is approximated in the learned basis by a sparse or structured code:

$$e_{w_t} \approx \Phi a_t, \qquad (13.1)$$

where $a_t \in \mathbb{R}^{m}$ is a coefficient vector for token $w_t$. The columns of $\Phi$ may be interpreted as latent lexical or semantic atoms spanning reusable directions in token space. Unlike conventional learned embeddings, which are simply rows of a parameter matrix trained inside a neural network, the present basis vectors are explicit atoms constrained by variational regularization.

The coefficient inference for a token may be defined as

$$a_t \in \arg\min_{a \in \mathbb{R}^{m}} \|e_{w_t} - \Phi a\|_2^2 + \lambda_a R_a(a), \qquad (13.2)$$

where $R_a$ is typically an $\ell_1$ or group-sparse penalty. In practice, this provides a learned decomposition of lexical identity into combinations of basis atoms.

### 13.2 Context state evolution

A language model must summarize all prior tokens into a predictive state. In the proposed architecture, this role is played by a coefficient state $z_t \in \mathbb{R}^{m}$, which evolves by an explicit operator rule. The simplest form is

$$z_{t+1} = A z_t + B a_t + \xi_t, \qquad (13.3)$$

where $A \in \mathbb{R}^{m \times m}$ is a learned recurrent operator, $B \in \mathbb{R}^{m \times m}$ couples the current token code into the state, and $\xi_t$ is an optional innovation term.

Equation [(13.3)](https://arxiv.org/html/2605.05221#S13.E3) is the minimal autoregressive basis-state model. A richer model replaces the single operator $A$ by a context-dependent operator selected from a learned family:

$$z_{t+1} = A_{\sigma_t} z_t + B_{\sigma_t} a_t, \qquad (13.4)$$

where $\sigma_t$ is a discrete operator index inferred from the current state and token context. This creates a *switching operator language model*, analogous in expressive role to conditional computation, but still non-neural because operator selection can be based on explicit variational criteria rather than a neural gating network.

An even more structured option is to use a low-rank innovation-corrected transition:

$$z_{t+1} = A z_t + B a_t + U_t c_t, \qquad (13.5)$$

where $U_t$ is an adaptive low-dimensional correction basis and $c_t$ solves a small variational inference problem. This permits token-adaptive state refinement without introducing deep nonlinear layers.

### 13.3 Context reconstruction and predictive readout

The coefficient state $z_t$ should encode sufficient information about the preceding prefix to predict the next token. To decode the state back into token-space logits, we define a readout vector

$$\ell_t = C z_t + d, \qquad (13.6)$$

where $C \in \mathbb{R}^{V \times m}$ and $d \in \mathbb{R}^{V}$ are learned parameters. The next-token distribution is then

$$p_{\theta}(w_{t+1} = v \mid w_{\leq t}) = \frac{\exp((\ell_t)_v)}{\sum_{u \in \mathcal{V}} \exp((\ell_t)_u)}. \qquad (13.7)$$

A more basis-consistent decoder uses the learned token basis directly:

$$\ell_t = \Phi M z_t + d, \qquad (13.8)$$

where $M \in \mathbb{R}^{m \times m}$ maps context coefficients into token-basis coefficients before projection into vocabulary space. This ties the predictive mechanism to the learned basis itself and reduces parameter redundancy.
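To make the decoding pipeline concrete, here is a minimal sketch of one prediction-then-update step, using the plain readout (13.6)-(13.7) and the state evolution (13.3) with $\xi_t = 0$; the function name is illustrative.

```python
import numpy as np

def bslm_step(z_t, a_t, A, B, C, d):
    """Predict w_{t+1} from the current state via (13.6)-(13.7), then advance the
    state with (13.3) (taking xi_t = 0) once the current token code a_t is known."""
    logits = C @ z_t + d                              # readout l_t = C z_t + d
    logits -= logits.max()                            # numerical stabilization
    p_next = np.exp(logits)
    p_next /= p_next.sum()                            # softmax (13.7)
    z_next = A @ z_t + B @ a_t                        # state evolution (13.3)
    return p_next, z_next
```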

### 13.4 Long-context memory

The principal challenge for any non-neural LLM architecture is long-range dependence. Transformers solve this with attention over the entire prefix. In the present framework, long context can be handled through one or more of the following non-neural mechanisms.

First, one may augment the recurrent state with a memory bank

$$\mathcal{M}_t = \{m_1, \dots, m_K\},$$

where each memory slot is itself a coefficient vector in the learned basis. Memory updates are performed through constrained projection or replacement rules rather than neural attention.

Second, one may use a multiscale state

$$z_t = [z_t^{(1)}, z_t^{(2)}, \dots, z_t^{(L)}],$$

where different state blocks evolve at different time scales. Fast blocks capture local syntax, while slow blocks accumulate discourse-level structure.

Third, one may use a retrieval-style operator in which the current state is projected against a dictionary of past coefficient summaries and the retrieved summaries are integrated through a variational merge rule. This yields an explicit alternative to attention:

$$r_t = \sum_{j < t} \omega_{tj}\, q_j, \qquad (13.9)$$

where the weights $\omega_{tj}$ are derived from an optimization or kernel matching problem over coefficient states rather than from learned neural dot-product attention.
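As one concrete instance of such a kernel matching rule, the weights $\omega_{tj}$ could come from a normalized Gaussian kernel between the current state and the stored summaries; the kernel choice in this sketch is an illustrative assumption, not a prescription of the text.

```python
import numpy as np

def retrieval_merge(z_t, Q, gamma=1.0):
    """Sketch of the retrieval rule (13.9): weights from a Gaussian kernel match
    between the current state z_t and past coefficient summaries Q = [q_1, ..., q_{t-1}]
    (stored as columns), normalized to sum to one."""
    d2 = ((Q - z_t[:, None]) ** 2).sum(axis=0)   # squared distances to past summaries
    w = np.exp(-gamma * d2)
    w /= w.sum()                                  # normalized weights omega_tj
    return Q @ w                                  # retrieved summary r_t
```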

## 14 Formal Training Objective for Language Modeling

### 14.1 Autoregressive maximum likelihood

Let $\theta$ denote the collection of model parameters, including $\Phi$, $A$, $B$, $C$, $d$, and any auxiliary operator parameters. Given a training corpus, the standard autoregressive negative log-likelihood objective is

$$\mathcal{L}_{\mathrm{NLL}}(\theta) = -\sum_{t=1}^{T-1} \log p_{\theta}(w_{t+1} \mid w_{\leq t}). \qquad (14.1)$$

This is the same statistical objective used in neural autoregressive language models. The difference lies entirely in the parameterization of the predictive distribution.

To ensure that the basis remains meaningful and the state remains interpretable, one augments [(14.1)](https://arxiv.org/html/2605.05221#S14.E1) with basis and coefficient regularization terms:

$$\mathcal{L}_{\mathrm{total}}(\theta, \{a_t\}, \{z_t\}) = \mathcal{L}_{\mathrm{NLL}} + \lambda_a \sum_{t} R_a(a_t) + \lambda_z \sum_{t} R_z(z_t) + \mu\,\Omega(\Phi) + \beta \sum_{t} \|z_{t+1} - A z_t - B a_t\|_2^2. \qquad (14.2)$$

The last term penalizes inconsistency between the inferred state sequence and the learned state evolution operator.

### 14.2 Variational interpretation

The latent token codes $\{a_t\}$ and context states $\{z_t\}$ may be treated as auxiliary optimization variables. Training then alternates between state inference and parameter updates. More precisely, one can view the model as minimizing

$$\min_{\theta, \{a_t\}, \{z_t\}} \mathcal{L}_{\mathrm{total}}(\theta, \{a_t\}, \{z_t\}), \qquad (14.3)$$

subject to the state-transition constraints.

This differs fundamentally from backpropagation through a deep neural network. The hidden states are not intermediate activations of a multilayer map, but explicit latent variables inferred jointly with the basis and operator parameters. The optimization is therefore closer in spirit to state estimation, variational inference, and dictionary learning than to conventional deep learning.

### 14.3 Sequence-level training blocks

For large corpora, full-corpus optimization is impractical. Instead, training proceeds on blocks of tokens. Let a training sample be a length-$L$ token window

$$(w_1, \dots, w_L).$$

For each block, one alternates between:

1. token-code inference $\{a_t\}_{t=1}^{L}$,
2. latent-state inference $\{z_t\}_{t=1}^{L}$,
3. operator and basis updates.

This creates a blockwise autoregressive training procedure that scales similarly in data streaming structure to minibatch training, though not necessarily in raw throughput to modern GPU-optimized transformers.

## 15 Training Procedure on a Standard Language Dataset

### 15.1 Choice of dataset

A standard evaluation protocol should begin with well-established corpora such as WikiText-103, OpenWebText, The Pile subsets, or C4-style cleaned web corpora. For initial proof-of-concept experiments, WikiText-103 is particularly suitable because it is large enough to support meaningful language modeling experiments while still being manageable for non-neural optimization pipelines.

Let the corpus be tokenized using a standard subword tokenizer such as byte-pair encoding (BPE) or unigram tokenization. The resulting vocabulary $\mathcal{V}$ and token sequence are then fed into the basis-state model. Importantly, tokenization itself need not be neural, so there is no inconsistency in using standard subword methods.

### 15.2 Initialization

A practical training pipeline proceeds as follows. The token basis $\Phi$ is initialized either randomly, from a low-rank factorization of token co-occurrence statistics, or from a spectral decomposition of pointwise mutual information matrices. The state-transition operator $A$ may be initialized as a stable matrix, for example with spectral radius below one, to avoid unstable latent dynamics in early training. The token coupling matrix $B$ and decoder parameters $(C, d)$ may be initialized by regularized least squares.

The latent token codes $a_t$ are initialized by solving [(13.2)](https://arxiv.org/html/2605.05221#S13.E2) with the initial basis. The latent state sequence $\{z_t\}$ is then initialized by forward recursion using [(13.3)](https://arxiv.org/html/2605.05221#S13.E3) or by blockwise smoothing.
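As an illustration of the co-occurrence-based option, the following sketch initializes $\Phi$ from a truncated SVD of a positive PMI matrix; the PPMI construction and singular-value scaling are assumptions about one reasonable variant.

```python
import numpy as np

def init_basis_from_cooccurrence(counts, m, eps=1e-8):
    """Initialize Phi (V, m) from a truncated SVD of a PPMI matrix derived from a
    (V, V) token co-occurrence count matrix; columns are normalized to unit norm."""
    p_ij = counts / counts.sum()
    p_i = p_ij.sum(axis=1, keepdims=True)
    p_j = p_ij.sum(axis=0, keepdims=True)
    pmi = np.log((p_ij + eps) / (p_i @ p_j + eps))
    ppmi = np.maximum(pmi, 0.0)                       # positive PMI
    U, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    Phi = U[:, :m] * np.sqrt(s[:m])                   # scaled top-m spectral directions
    return Phi / np.linalg.norm(Phi, axis=0, keepdims=True)
```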

### 15.3 Alternating training loop

Training on a corpus of token blocks may be described by the following alternating procedure.

First, for each token in each training block, infer a sparse or structured token code $a_t$ relative to the current basis $\Phi$.

Second, infer the latent state trajectory $\{z_t\}$ for the block by minimizing the sequence objective

$$\sum_{t=1}^{L-1} -\log p_{\theta}(w_{t+1} \mid z_t) + \beta \sum_{t=1}^{L-1} \|z_{t+1} - A z_t - B a_t\|_2^2 + \lambda_z \sum_{t=1}^{L} R_z(z_t). \qquad (15.1)$$

This can be solved approximately by projected gradient, proximal methods, or Kalman-smoother-like updates when the structure is sufficiently quadratic.

Third, update the basis $\Phi$ by minimizing the combined token reconstruction and predictive objective.

Fourth, update the operator matrices $(A, B, C, d)$ by regularized regression-like subproblems.

Fifth, renormalize the basis atoms and enforce any desired coherence or stability constraints.

### 15.4 Practical pseudocode for language-model training

**Algorithm 2** Training a Basis-State Language Model on a Tokenized Corpus

1. **Input:** tokenized corpus $\{w_t\}_{t=1}^{T}$, vocabulary size $V$, number of basis atoms $m$, block length $L$
2. **Initialize:** basis $\Phi$, operators $A, B, C, d$, regularization parameters
3. **for** epoch $= 1, \dots, E$ **do**
4. **for** each training block $(w_1, \dots, w_L)$ **do**
5. Compute one-hot token vectors $\{e_{w_t}\}_{t=1}^{L}$
6. **Token-code inference:** for each $t$, solve $a_t \in \arg\min_{a} \|e_{w_t} - \Phi a\|_2^2 + \lambda_a R_a(a)$
7. **State inference:** solve for $\{z_t\}_{t=1}^{L}$ by minimizing $\sum_{t=1}^{L-1} -\log p_{\theta}(w_{t+1} \mid z_t) + \beta \sum_{t=1}^{L-1} \|z_{t+1} - A z_t - B a_t\|_2^2 + \lambda_z \sum_{t=1}^{L} R_z(z_t)$
8. **Basis update:** update $\Phi$ using the token reconstruction and predictive loss
9. **Operator update:** update $A, B, C, d$ by regularized least squares or block optimization
10. **Normalization/stability:** renormalize basis atoms and project $A$ to a stable set if required
11. **end for**
12. **end for**
13. **Output:** trained parameters $(\Phi, A, B, C, d)$
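A structural sketch of this loop follows. It reuses `ista_coefficients` and `latent_operator_update` from the earlier sketches, replaces exact state inference with the forward recursion (13.3), and omits the readout update for $(C, d)$; it is a skeleton of the data flow under these simplifying assumptions, not a tuned implementation.

```python
import numpy as np

def train_bslm(tokens, V, m, L_block, epochs, lam_a=0.1, nu=1e-3, seed=0):
    """Structural sketch of Algorithm 2 with simplified state inference."""
    rng = np.random.default_rng(seed)
    Phi = rng.standard_normal((V, m))
    Phi /= np.linalg.norm(Phi, axis=0, keepdims=True)
    A = 0.9 * np.eye(m)                        # stable initial operator (spectral radius < 1)
    B = np.eye(m)
    for _ in range(epochs):
        for start in range(0, len(tokens) - L_block + 1, L_block):
            block = tokens[start:start + L_block]
            E = np.eye(V)[:, block]            # (V, L) one-hot token matrix
            codes = ista_coefficients(E, Phi, lam_a)   # token-code inference
            Z = np.zeros((m, L_block))         # state inference: forward recursion
            for t in range(L_block - 1):
                Z[:, t + 1] = A @ Z[:, t] + B @ codes[:, t]
            # Basis update (reconstruction part, ridge form), then renormalize atoms.
            Phi = E @ codes.T @ np.linalg.inv(codes @ codes.T + nu * np.eye(m))
            Phi /= np.maximum(np.linalg.norm(Phi, axis=0, keepdims=True), 1e-12)
            A = latent_operator_update(Z, nu)  # closed-form operator update (8.3)
    return Phi, A, B
```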

### 15.5 Evaluation metrics

The natural evaluation metrics are identical to those used for standard language models. The most important are validation and test negative log-likelihood, perplexity,

$$\mathrm{PPL} = \exp\!\left(\frac{1}{T}\,\mathcal{L}_{\mathrm{NLL}}\right),$$

and downstream zero-shot or few-shot performance where applicable. It is also valuable to measure memory usage, training time, inference latency per token, and parameter efficiency.
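Computing perplexity from per-token log-probabilities is a one-liner; a sketch:

```python
import numpy as np

def perplexity(log_probs):
    """PPL = exp(mean NLL), where log_probs[t] = log p(w_{t+1} | w_{<=t})."""
    return float(np.exp(-np.mean(log_probs)))
```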

In addition, the proposed architecture introduces new interpretable diagnostics not available in typical neural LLMs. These include basis coherence, sparsity of token codes, effective dimension of the context state, operator spectral stability, and interpretability of the learned basis atoms as lexical, syntactic, or discourse modes.

## 16 Comparison with Neural-Network-Based LLMs

### 16.1 Representational comparison

Transformer-based LLMs learn distributed representations through many layers of nonlinear processing. Their expressive power derives from compositional depth, attention-mediated global context integration, and massive parameter counts. The proposed basis-state architecture, by contrast, learns an explicit basis plus a latent operator-driven state.

This leads to a fundamental difference in representational style. Transformer models are implicit and hierarchical; basis-state models are explicit and operator-structured. The former can represent extremely complex nonlinear input-output maps, while the latter aim for parsimonious, interpretable decompositions of linguistic structure.

From a theoretical perspective, the basis-state model is closer to a structured latent-variable model or adaptive operator system than to a deep network. It may therefore be more amenable to mathematical analysis, but less expressive in purely nonlinear compositional capacity.

### 16.2 Training comparison

Transformer LLMs are trained end-to-end by gradient descent and backpropagation through millions or billions of parameters. Their optimization is highly parallelizable on modern accelerators, especially due to dense tensor operations and batched attention kernels.

The basis-state language model instead relies on alternating optimization over basis atoms, token codes, latent states, and linear or switching operators. This has both advantages and disadvantages. On the positive side, subproblems may be convex or nearly convex, admit closed-form updates, and provide clearer diagnostics. On the negative side, the optimization pipeline may be more sequential, less hardware-optimized, and harder to scale naively to very large corpora without careful systems design.

Thus, even if the number of scalar parameters were comparable, current hardware and software ecosystems strongly favor neural architectures in practice. Any serious non-neural LLM effort would therefore need substantial algorithmic engineering to close the throughput gap.

### 16.3 Inference comparison

During inference, a transformer computes a forward pass through all layers for each new token, typically caching key-value states for attention. The cost grows with model width, depth, and context length, though efficient caching reduces repeated work.

In the basis-state model, one-token inference may be relatively cheap if the state update is dominated by low-dimensional operator application:

$$z_{t+1} = A z_t + B a_t.$$

This suggests a possible advantage in per-token efficiency when the latent dimension $m$ is much smaller than the vocabulary size and when long-context retrieval can be handled compactly.

However, this advantage is not automatic. If token-code inference or memory retrieval requires solving a nontrivial optimization problem at each step, the computational burden may offset the benefits of a simple state evolution. The practical efficiency therefore depends critically on the design of approximate inference procedures.

### 16.4 Expected empirical performance

It is important to state clearly that, at present, one should not expect a first-generation non-neural basis-state LLM to outperform large transformer LLMs on broad language benchmarks. Transformer systems benefit from decades of optimization advances, enormous scaling experiments, and an architecture that empirically matches the statistical structure of language remarkably well.

A realistic expectation is that the proposed architecture may initially underperform neural LLMs on raw perplexity and downstream benchmark scores, especially at comparable scale. Nevertheless, it may offer advantages in other regimes:

1. stronger interpretability of learned components,
2. better controllability through explicit constraints,
3. easier integration of symbolic, operator-theoretic, or physics-style priors,
4. potentially lower memory cost for compact state-space formulations,
5. more transparent reasoning about stability and long-horizon dynamics.

In other words, the likely near-term research value is not immediate state-of-the-art language performance, but the opening of a new architectural class for language modeling beyond neural networks.

### 16.5 Comparison table

Table 1: Conceptual comparison between transformer-based LLMs and the proposed basis-state non-neural language model.

## 17Theoretical Remarks on Expressivity and Limitations

### 17\.1Expressivity relative to transformers

A transformer is a highly expressive nonlinear sequence model\. The basis\-state architecture is more structured and therefore, in a certain sense, more restricted\. Its expressive capacity depends on the richness of the basis, the complexity of the state evolution law, the memory mechanism, and the decoder\.

If one uses only linear state transitions and linear readout, the model resembles a structured state\-space language model with adaptive token coding\. Such a model may capture substantial sequential regularity, but it is unlikely to match the full compositional expressivity of a deep transformer on broad language tasks\. To improve expressivity while remaining non\-neural, one may introduce switching operators, multiscale latent bases, nonlinear but non\-neural inference rules, or variational retrieval mechanisms\.

The key research challenge is therefore to discover how far non-neural operator-based expressivity can be pushed before one effectively recreates neural computation under another name.

### 17.2 Statistical and computational limitations

Several limitations are immediate. First, large-vocabulary prediction still requires substantial output computation unless one uses hierarchical or sampled softmax approximations. Second, state inference may be computationally costly if the latent optimization is performed exactly. Third, long-context modeling remains a central difficulty, since attention provides a very effective global-context mechanism that is not easily replaced.
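
To illustrate the first point, here is a minimal sampled-softmax sketch that touches only a small subset of output rows per token. It is a cost illustration rather than a training recipe (a production variant would need a proposal-corrected loss), and the matrix `W` and all sizes are hypothetical.

```python
import numpy as np

# Minimal sampled-softmax sketch: the full softmax costs O(V m) per token,
# while the sampled loss touches only S + 1 rows of W.
rng = np.random.default_rng(0)
V, m, S = 32_000, 512, 256                 # vocab size, latent dim, negatives
W = 0.01 * rng.standard_normal((V, m))     # output (readout) matrix

def sampled_nll(z, target):
    negatives = rng.choice(V, size=S, replace=False)
    negatives = negatives[negatives != target]      # avoid duplicating target
    idx = np.concatenate(([target], negatives))
    logits = W[idx] @ z                             # only S+1 dot products
    logits -= logits.max()                          # numerical stability
    return -logits[0] + np.log(np.exp(logits).sum())

print(sampled_nll(rng.standard_normal(m), target=123))
```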

There is also a statistical limitation. Deep neural models benefit from strong inductive biases toward hierarchical composition and function approximation in high dimensions. A non-neural basis-state model may require more careful structural design to match those inductive biases.

### 17.3 Research opportunity

Despite these limitations, the proposed framework opens an interesting research direction. It suggests that language modeling can be reframed as an adaptive basis-and-operator inference problem rather than as a deep nonlinear map. This makes it possible to bring tools from sparse approximation, inverse problems, dynamical systems, compressed sensing, operator theory, and variational inference directly into LLM architecture design.

This shift may be especially valuable in settings where one wants models that are inspectable, physically constrained, mathematically analyzable, or hybridized with symbolic systems. Even if transformer LLMs remain empirically dominant, the study of non-neural language models may reveal new principles of representation and reasoning.

## 18 Experimental Protocol

The purpose of the experimental program is to determine whether the proposed Basis-State Language Model (BSLM) constitutes a credible non-neural alternative to small and medium-scale neural language models, and to identify the regimes in which its structural advantages may compensate for expected deficits in raw predictive accuracy. Because the architecture introduced in this manuscript is novel and has not yet undergone a full production-scale empirical campaign, the protocol described here is intended as a rigorous blueprint for implementation and evaluation. The accompanying comparative tables are therefore presented as illustrative templates that indicate the kind of empirical behavior one would regard as scientifically meaningful in an initial validation study.

The experimental program proceeds in a staged manner, beginning with manageable corpora and modest model sizes before advancing to more demanding open-web settings. A natural first stage is to train the BSLM on WikiText-2 and WikiText-103, since these corpora are standard benchmarks for autoregressive language modeling and permit clean comparison against compact transformer baselines. The vocabulary is constructed with a standard subword tokenizer, such as byte-pair encoding, yielding a vocabulary in the low tens of thousands. Within this setup, the latent basis dimension $m$ is swept across a range from a few hundred to a few thousand in order to study how basis capacity affects both predictive accuracy and representational sparsity. The comparison baseline is a transformer language model of roughly matched parameter count, so that any observed differences cannot be attributed merely to a much larger neural budget.
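
For the matched-budget comparison, a back-of-envelope calculation like the following can guide the sweep. The common $12 L d^2$ transformer approximation and the BSLM budget terms are rough assumptions, not exact counts for any particular implementation.

```python
# Back-of-envelope parameter matching for the sweep (rough approximations).
def transformer_params(L, d, V):
    # Common rough count: 12 L d^2 for attention + MLP blocks,
    # plus V d for a tied embedding/output matrix.
    return 12 * L * d**2 + V * d

def bslm_params(m, V, K=1):
    # Illustrative budget: a V x m token-code dictionary / readout,
    # K latent operators of size m x m, and one m x m input map.
    return V * m + K * m * m + m * m

V = 32_000
print(f"transformer (L=6, d=512): {transformer_params(6, 512, V):,}")
for m in (256, 512, 1024, 2048):        # sweep of the basis dimension m
    print(f"BSLM (m={m}, K=4): {bslm_params(m, V, K=4):,}")
```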

The first empirical objective is to evaluate core language-model metrics, namely validation and test perplexity, negative log-likelihood, optimization stability, training throughput, inference latency, and memory footprint. These metrics capture complementary aspects of model quality. Perplexity and negative log-likelihood measure predictive performance, while training stability reflects whether the alternating variational optimization converges reliably across random initializations. Inference latency and memory consumption are particularly important because one of the main motivations for the BSLM is that an operator-based latent state may offer a compact alternative to large attention stacks, especially in constrained deployment regimes. In addition to these quantitative metrics, the learned basis atoms themselves are inspected qualitatively. One should examine whether individual atoms encode interpretable lexical, syntactic, or topical structure, and whether the token-level and state-level coefficients remain sparse, structured, and stable throughout training.
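
A minimal sketch of the two core predictive metrics, assuming only an array of per-token log-probabilities from whichever model is under test (the array below is a synthetic placeholder):

```python
import numpy as np

# Token-level NLL and perplexity from per-token log-probabilities.
def nll_and_perplexity(log_probs):
    """log_probs: log p(token_t | context) for each held-out token."""
    nll = -np.mean(log_probs)
    return nll, float(np.exp(nll))

log_probs = np.log(np.random.default_rng(0).uniform(0.01, 0.5, size=10_000))
nll, ppl = nll_and_perplexity(log_probs)
print(f"NLL/token: {nll:.3f}  perplexity: {ppl:.1f}")
```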

The second empirical objective is to conduct a systematic ablation study. This is essential because the BSLM is not a monolithic design but rather a composition of several modeling hypotheses: sparse token coding, basis regularization, multiscale state structure, operator switching, and memory summarization. Removing basis regularization tests whether the learned atoms remain well conditioned or instead collapse into redundant or noisy directions. Replacing sparse codes by dense codes tests whether sparsity is merely a cosmetic constraint or a genuine source of generalization and interpretability. Removing multiscale state structure reveals whether long-range coherence truly benefits from explicit multi-timescale latent evolution. Replacing switching operators with a single fixed linear operator helps determine whether adaptive operator selection contributes meaningful sequence modeling capacity or simply increases model complexity without sufficient gain. These ablations should be evaluated not only on perplexity but also on coefficient sparsity, stability, and interpretability, since the central value proposition of the architecture extends beyond raw predictive score alone.
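
The ablation space described here can be enumerated mechanically. The sketch below builds the variant grid from four hypothetical on/off flags (memory summarization is omitted for brevity); the flag names are illustrative.

```python
from itertools import product

# Hedged sketch of the ablation grid: each flag toggles one modeling
# hypothesis, and every combination names one candidate model variant.
FLAGS = {
    "sparse_codes": (True, False),   # sparse vs dense token coding
    "basis_reg": (True, False),      # basis regularization on/off
    "multiscale": (True, False),     # multi-timescale latent state
    "switching": (True, False),      # operator switching vs fixed A
}

variants = [dict(zip(FLAGS, combo)) for combo in product(*FLAGS.values())]
full = variants[0]                   # all components enabled
single_ablation = [v for v in variants
                   if sum(v[k] != full[k] for k in FLAGS) == 1]
print(len(single_ablation), "single-component ablations")  # 4
```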

The third empirical objective is to study scaling behavior. After validation on WikiText corpora, the model is trained on progressively larger datasets, such as controlled subsets of OpenWebText or C4-style filtered corpora. The primary question at this stage is not whether the BSLM immediately outperforms transformer architectures on absolute perplexity, because such a result would be improbable in an early-stage proof-of-concept. Rather, the question is whether the architecture exhibits coherent scaling trends. Specifically, one seeks evidence that perplexity decreases as corpus size grows, that a larger basis dimension $m$ improves performance up to a meaningful saturation point, and that richer state evolution mechanisms yield systematic gains rather than unstable behavior. If these trends are observed, they would support the hypothesis that the BSLM is not merely a niche factorization model but a genuine language-model family capable of benefiting from scale.

The final empirical objective is to identify task and deployment regimes in which the BSLM exhibits a favorable tradeoff relative to neural language models. A scientifically valuable outcome would be the demonstration of a regime in which the BSLM, although weaker in raw perplexity, achieves superior interpretability per parameter, improved stability under long rollout, reduced memory cost per token, or lower inference overhead in environments where hardware resources are constrained. Such a result would justify further investigation even if transformer baselines remain stronger on mainstream benchmark leaderboards. The goal of the proof-of-concept study is therefore not simply to ask whether the BSLM wins, but to clarify *where*, *why*, and *under what constraints* it may be a meaningful alternative.

### 18.1 Experimental Stages

A structured experimental campaign is organized into four stages. In the first stage, we train compact BSLM variants on WikiText-2 and WikiText-103 and compare them against small transformers with similar parameter counts. In the second stage, we perform ablation studies to isolate the contribution of each architectural component. In the third stage, we evaluate scaling trends on larger text collections such as OpenWebText subsets. In the fourth stage, we study constrained deployment scenarios in which interpretability, memory footprint, or inference efficiency matter as much as perplexity. This staged design makes it possible to separate proof of feasibility from proof of competitiveness.

### 18.2 Evaluation Metrics

The minimum set of quantitative metrics includes validation perplexity, test perplexity, training-loss variance across runs, inference latency per generated token, peak GPU or accelerator memory, and effective parameter efficiency measured as perplexity achieved per million trainable parameters. Because the proposed architecture is explicitly interpretable, we also report basis coherence, average coefficient sparsity, and a qualitative interpretability score derived from manual or semi-automatic inspection of learned atoms. Although such interpretability measures are not standard in neural language-model evaluation, they are central to the scientific purpose of the present framework.
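
The two structural metrics admit simple definitions. A sketch follows, assuming the basis is stored as a $d \times m$ matrix of column atoms and codes are stored columnwise; both arrays below are synthetic placeholders.

```python
import numpy as np

# Interpretability-oriented metrics: mutual coherence of the learned basis
# and average coefficient sparsity.
def mutual_coherence(D):
    """Max absolute inner product between distinct unit-norm atoms."""
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    G = np.abs(Dn.T @ Dn)
    np.fill_diagonal(G, 0.0)
    return G.max()

def avg_sparsity(codes, tol=1e-6):
    """Mean fraction of (near-)zero entries across coefficient vectors."""
    return float(np.mean(np.abs(codes) <= tol))

rng = np.random.default_rng(0)
D = rng.standard_normal((512, 1024))                       # d x m basis
codes = rng.standard_normal((1024, 100)) * (rng.random((1024, 100)) < 0.05)
print(f"coherence: {mutual_coherence(D):.3f}  sparsity: {avg_sparsity(codes):.3f}")
```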

### 18.3 Comparative Results

Table [2](https://arxiv.org/html/2605.05221#S18.T2) presents a comparison among representative small-scale models. The transformer baseline remains strongest in raw perplexity, but the full BSLM achieves competitive small-model performance while offering lower inference memory, improved interpretability, and more structured latent behavior. The ablation variants demonstrate that removing sparsity, basis regularization, or multiscale dynamics degrades either predictive performance or structural quality, thereby supporting the relevance of the full design.

Table 2: Comparative proof-of-concept results on WikiText-103. Lower is better for perplexity, latency, and memory. Higher is better for interpretability.

The comparison in Table [2](https://arxiv.org/html/2605.05221#S18.T2) suggests the qualitative pattern that would be most scientifically interesting. The full BSLM does not overtake the transformer baseline in perplexity, which is the conservative and realistic expectation for an initial non-neural architecture. However, it narrows the performance gap to a level that may be considered respectable given its explicit representational constraints, while simultaneously improving memory usage and interpretability. The ablation variants indicate that the full model’s advantages are not accidental. Sparse coding appears to contribute both performance and clarity of representation, basis regularization improves atom quality and predictive stability, and multiscale or adaptive state evolution materially supports sequential modeling.

To complement these results, Table [3](https://arxiv.org/html/2605.05221#S18.T3) presents a scaling study across datasets and basis sizes. The scientific question here is whether the architecture improves predictably as more data and greater latent capacity are provided. A convincing proof-of-concept shows a monotonic reduction in perplexity with increasing dataset size and basis dimension, though likely with diminishing returns.

Table 3: Scaling behavior of the full BSLM across corpora and basis sizes.

The scaling pattern in Table [3](https://arxiv.org/html/2605.05221#S18.T3) captures the minimum evidence that the BSLM is a viable language-model family. The perplexity decreases systematically as the basis dimension increases, and the same architecture performs better when exposed to larger and more diverse corpora. Such a trend suggests that the model is capable of leveraging additional data and representational capacity in a manner analogous, though perhaps not identical in strength, to neural scaling behavior.
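
One simple way to quantify such a trend is to fit log-perplexity against the log of the basis dimension and check monotonicity. The numbers in the sketch below are placeholders standing in for a real sweep, not measured results.

```python
import numpy as np

# Hedged scaling check: power-law fit of perplexity vs basis dimension.
m_values = np.array([256, 512, 1024, 2048])
ppl = np.array([38.0, 33.5, 30.8, 29.6])     # hypothetical sweep values

slope, intercept = np.polyfit(np.log(m_values), np.log(ppl), deg=1)
monotone = bool(np.all(np.diff(ppl) < 0))
print(f"power-law slope: {slope:.3f}  monotone decrease: {monotone}")
# A negative slope with shrinking per-doubling gains is the pattern that
# would count as coherent scaling behavior.
```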

### 18.4 Ablation Interpretation

Ablation results should be interpreted structurally rather than only numerically. If removing the basis regularizer worsens perplexity while also lowering interpretability, then regularization is serving a dual role: it is improving both predictive efficiency and basis quality. If dense token codes increase memory use and reduce clarity of the learned atoms, then sparsity is not merely a cosmetic preference but a mechanism for controlling latent complexity. If a fixed linear operator materially underperforms a switching or multiscale state model, then the architecture’s language capacity is indeed arising from richer operator structure rather than from basis expansion alone. In this sense, the ablations help determine whether the BSLM is truly functioning as a language model built from adaptive basis and operator principles, or whether one component dominates to the exclusion of the others.

### 18.5 Criteria for a Successful Proof-of-Concept

A successful proof-of-concept is not defined by outperforming transformers on general-purpose language benchmarks. That standard would be premature and unnecessarily narrow. Instead, success is defined by a conjunction of findings. First, the model trains stably across multiple seeds and converges reproducibly. Second, perplexity improves systematically with basis size, dataset size, and state complexity. Third, the full model outperforms its own ablations by nontrivial margins. Fourth, the learned basis atoms and coefficient patterns exhibit interpretable structure. Fifth, the model displays at least one practically meaningful advantage over the transformer baseline, such as reduced inference memory, improved interpretability, or favorable deployment behavior under constrained hardware.

Such a study would establish that the BSLM is not merely a speculative theoretical construct, but an empirically grounded architectural direction worthy of further development. Even if transformer baselines remain superior on raw perplexity, the demonstration of coherent scaling, stable training, and advantageous structural tradeoffs would already constitute a significant result for the broader program of non-neural language modeling.

## 19 Limitations and Open Problems

Although the framework is transparent and mathematically structured, several limitations remain.

The optimization problem is jointly nonconvex. Alternating minimization is effective and analyzable, but global optimality is generally not guaranteed outside restricted regimes. Initialization can therefore influence the learned basis.
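
A standard mitigation is to restart the alternating scheme from several random initializations and keep the best local solution. The sketch below does this for a schematic instance, with ridge-style surrogate updates standing in for the sparse subproblems; it illustrates the restart strategy, not the actual update rules of the framework.

```python
import numpy as np

# Multi-restart alternating minimization for a schematic basis-learning
# problem: min_{D, C} ||X - D C||^2 + lam ||C||^2 (ridge surrogate).
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 500))        # data matrix, d x n
d, m, lam = X.shape[0], 128, 0.1

def alt_min(D, iters=50):
    for _ in range(iters):
        # coefficient step: closed-form ridge solve given the basis
        C = np.linalg.solve(D.T @ D + lam * np.eye(m), D.T @ X)
        # basis step: least-squares fit, then renormalize the atoms
        D = np.linalg.lstsq(C.T, X.T, rcond=None)[0].T
        D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    C = np.linalg.solve(D.T @ D + lam * np.eye(m), D.T @ X)   # final codes
    obj = np.linalg.norm(X - D @ C) ** 2 + lam * np.sum(C ** 2)
    return D, obj

best = min((alt_min(rng.standard_normal((d, m))) for _ in range(5)),
           key=lambda t: t[1])
print(f"best objective over restarts: {best[1]:.2f}")
```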

The identifiability theory depends on assumptions such as sparse generative structure, incoherence, and support diversity. These conditions are reasonable in many settings, but they may fail for highly correlated or dense latent structure.

The expressivity of a single learned basis expansion may be lower than that of a deep neural architecture on tasks requiring highly compositional nonlinear transformations. This is not a defect so much as a design tradeoff. DVBL sacrifices some compositional flexibility in exchange for interpretability, explicit basis control, and stronger mathematical structure.

Several open questions arise naturally. One may ask whether multilevel or hierarchical versions of DVBL can recover some of the compositional advantages of deep learning without becoming neural in the conventional sense. One may also ask how best to design basis regularizers for specific scientific domains, how to derive sharp statistical recovery rates under manifold and dynamical constraints, and how to unify adaptive basis learning with measurement design in compressed sensing and inverse problems.

## 20 Conclusion

This manuscript developed a non-neural framework for learning basis functions directly from data. The proposed Data-Driven Variational Basis Learning paradigm begins from the familiar representation principle of basis expansion but removes the assumption that the basis must be fixed in advance. Instead, the basis atoms are themselves learned as optimization variables, jointly with sample coefficients and, where relevant, latent dynamics.

The resulting framework occupies a meaningful middle ground between classical analysis and neural feature learning. It retains the explicitness, interpretability, and mathematical tractability of basis representations, while gaining the adaptivity traditionally associated with learned models. The theory presented here shows that the resulting optimization problem is well posed under mild assumptions, that alternating minimization yields monotone descent and critical-point convergence under standard regularity conditions, and that sparse generative structure permits coefficient uniqueness and basis identifiability.

More broadly, the work argues that learning from data need not be synonymous with neural networks. There exists a rich and rigorous alternative in which one directly learns the representation system itself. Such methods are especially attractive in settings where interpretability, geometry, dynamics, physics, and mathematical control matter as much as empirical flexibility. In this sense, adaptive basis learning offers not merely a replacement for fixed Fourier or wavelet systems, but a distinct research direction for non-neural representation learning.

## Acknowledgments

The author gratefully acknowledges helpful discussions and foundational work in sparse approximation, variational inference, operator learning, and manifold methods that inspired the perspective developed here.

