LLMs Without Deep Neural Networks: New Architecture, Benefits and Case Study

arXiv cs.LG 06/01/26, 04:00 AM Papers
llm deep-neural-network rbf-network kan alternative-architecture explainable-ai retrieval-augmented
Summary
This paper presents an alternative architecture for LLMs using Radial Basis Function (RBF) networks that eliminates deep neural networks and finds the global optimum in closed form, requiring no iterative training. It also reviews other non-DNN methods like KANs and k-NN retrieval, with a case study demonstrating increased explainability and faster training.
arXiv:2605.30385v1 Announce Type: new Abstract: The purpose of this article is to provide validation to my deep neural network alternative in the context of LLMs. Very recently, there has been a significant interest by Chinese researchers in a model called RBF network, as a substitute to standard DNNs, with increased explainability and higher accuracy. It turns out that my new model, discovered independently, is based on the exact same machinery. But with a major twist: it does not need DNN as it finds the global optimum of the loss function in closed form, in one iteration, thus eliminating the tedious training step. Here I provide a high-level overview of my technology, with case study and comparison to similar methods.
# 1 Building an LLM with alternatives to deep neural networks
Source: [https://arxiv.org/html/2605.30385](https://arxiv.org/html/2605.30385)
LLMs Without Deep Neural Networks

New Architecture, Benefits & Case Study

Vincent Granville, Ph\.D\. \| CAIO \| vincent@BondingAI\.io

[BondingAI\.io](https://bondingai.io/), version 1\.0, May 2026

The purpose of this article is to provide validation to my deep neural network alternative in the context of LLMs\. Very recently, there has been a significant interest by Chinese researchers in a model called RBF network, as a substitute to standard DNNs, with increased explainability and higher accuracy\. It turns out that my new model, discovered independently, is based on the exact same machinery\. But with a major twist: it does not need DNN as it finds the global optimum of the loss function in closed form, in one iteration, thus eliminating the tedious training step\. Here I provide a high\-level overview of my technology, with case study and comparison to similar methods\.

Several approaches have been tried to bypass hard\-to\-train deep neural networks and replace black\-box parameters with explainable AI and replicability\. Radial basis function \(RBF\) networks are the most recent one to be tested in LLM contexts\. It is also the one I pioneered\. Some of these methods, such as RBF, rely on explainable DNNs\. My version of RBF is the only one not using any DNNs, as far as I know\.

- •Statistical $n$\-gram and Index\-Based Generation: Instead of using billions of floating\-point numbers to predict text probability, you can model language directly by analyzing a large corpus’s frequency and context\.
  - –How it works: You build an exact index of every token and its preceding context in your training data\. By tallying exactly how often a word follows a specific sequence, the system computes the statistical likelihood of the next token\.
  - –Tools & Concepts: This is the foundational concept behind information\-retrieval \(IR\) systems, Markov chains, and probabilistic suffix trees\.
- •Kolmogorov\-Arnold Networks \(KANs\): KANs serve as a recent mathematical alternative to Multi\-Layer Perceptrons\.
  - –How it works: While traditional DNNs apply fixed activation functions at the nodes and adjust scalar synapse weights numerically during training, KANs place learnable functions on the network’s edges rather than the nodes\. This allows the model to represent highly complex, multi\-variable mathematical relationships with smaller architectures\.
  - –Implementation: You can define KAN structures in Python using the pykan library \(GitHub: KindXiaoming/pykan\)\.
- •$k$\-Nearest Neighbors \($k$\-NN\) / Exact Match Retrievers: Rather than encoding world knowledge into the model’s internal matrix weights, you can build a system that acts via dynamic lookup\.
  - –How it works: During inference, the system searches your pre\-indexed training dataset for the chunks of text that most closely match the current context, borrowing from them directly to predict the next word\.
  - –Tools & Concepts: To mimic this architecture, you can pair a lightweight parser with high\-speed vector databases like Milvus or Qdrant, or use an orchestrator like LangChain for retrieval\.
- •Radial Basis Function \(RBF\) Networks: This is a new approach pioneered very recently by several Chinese researchers, and independently by me\. My implementation is the only one not using deep neural networks, allowing for very fast training without epochs or gradient descent\. I discuss it in the remainder of this paper\. For details, see chapter 6 in \[[2](https://arxiv.org/html/2605.30385#bib.bib16)\]\.
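To make the first item in the list concrete, here is a minimal Python sketch of index\-based next\-token prediction\. It is an illustration under stated assumptions, not the code from the paper: the function names `build_index` and `next_token_probs`, the toy corpus, and the fixed context size are all hypothetical\.

```python
from collections import Counter, defaultdict

def build_index(tokens, context_size=2):
    """Map each context (tuple of preceding tokens) to a Counter of next tokens."""
    index = defaultdict(Counter)
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])
        index[context][tokens[i]] += 1
    return index

def next_token_probs(index, context):
    """Relative frequencies of the tokens that followed `context` in the corpus."""
    counts = index.get(tuple(context))
    if not counts:
        return {}  # unseen context; a fuller system might back off to a shorter one
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

# Hypothetical toy corpus, whitespace-tokenized
corpus = "the gpu runs the model and the gpu runs the kernel".split()
index = build_index(corpus, context_size=2)
probs = next_token_probs(index, ["runs", "the"])
print(probs)
```

The index is a plain nested hash \(context to token counts\), so contexts never seen in the corpus simply return no candidates rather than a smoothed guess\.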

Both KAN and RBF networks leverage the Universal Approximation Theorem for neural networks\. That is, the predictor can approximate any continuous function in any dimension\. For KAN, it is a consequence of the Kolmogorov\-Arnold representation theorem\. For RBF, it is linked to the universal approximation theorem for Gaussian mixtures\. The basis in RBF models can be a multivariate Gaussian function, leading to a Gaussian mixture model\. It can even be radial, that is, spherical rather than ellipsoidal, hence the name RBF\. The basis is called a kernel in other contexts, and the terms kernel method and RBF can be used interchangeably\.

It is argued that KAN can overfit\. In one experiment, KANs fitted the pure random data in the features to the labels provided, with an extremely high accuracy, see [here](https://medium.com/@rubenszimbres/kolmogorov-arnold-networks-a-critique-2b37fea2112e)\. This is also true with my implementation of RBF networks\. However, in my case, the model benefits from benign overfitting and performs very well outside the training set, even when corrupting the input data with significant noise\. In other words, it acts as a deblurring or high\-pass filter even on fairly chaotic data\.

For a recent reference on KAN, see \[[3](https://arxiv.org/html/2605.30385#bib.bib4)\]\. For RBF networks, see \[[5](https://arxiv.org/html/2605.30385#bib.bib3)\] discussing GenLoRA and \[[4](https://arxiv.org/html/2605.30385#bib.bib1)\] focusing on Gaussian basis\. See also \[[1](https://arxiv.org/html/2605.30385#bib.bib2)\] focusing on LLM security with radial basis\. Finally, my recent book about modern AI and LLMs \[[2](https://arxiv.org/html/2605.30385#bib.bib16)\] features explainable DNN models as well as RBF networks without DNN\. For a non\-technical discussion on the topic, see “Building LLMs without neural networks”, [here](https://zoea.co.uk/news/news-250127.html)\.

### 1\.1 Connection between RBF networks and standard LLMs

While Radial Basis Function \(RBF\) networks and Large Language Models \(LLMs\) belong to different architectural families, their concepts are converging to improve model efficiency\. Recent AI research introduces frameworks like GenLoRA, which utilize lightweight RBFs to replace large, explicitly stored basis vectors in standard low\-rank adaptation methods, achieving superior accuracy on smaller parameter budgets\.

Unlike LLM transformers—which build hierarchical, attention\-based representations through stacked layers—RBF networks are traditional shallow feedforward networks\. However, their ability to perform efficient non\-linear function approximation makes them highly relevant when applied alongside LLMs:

- •Generative Low\-Rank Adapters \(GenLoRA\): Instead of explicitly storing bulky basis vectors in matrices like standard LoRA, methods such as GenLoRA use a set of RBF\-based non\-linear generators to synthesize the necessary basis vectors from a small, shared latent space\. This dramatically improves parameter efficiency during LLM fine\-tuning\.
- •Concept Representation & Interpretability: RBF networks are being embedded in low\-dimensional spaces for concept visualization\. Because they represent non\-linear decision boundaries efficiently, they serve as powerful probes to decode and control the internal representations of black\-box LLMs\.
- •Feature Extraction & Embedding: The hidden layers of RBF networks compute the Euclidean distance of an input from pre\-determined center vectors, often using a Gaussian kernel\. While not used as the core architecture of an LLM, RBF components can sit on top of LLM embeddings \(e\.g\., as part of a classifier head\) to map dense, high\-dimensional textual features to specific, non\-linear classification outputs\.

### 1\.2 Combining RBF networks with standard LLMs

Table [1](https://arxiv.org/html/2605.30385#S1.T1) shows the benefits of combining standard LLMs based on transformers, with the RBF system\.

Table 1: Comparing standard LLM transformers with RBF network

## 2 Fast, high\-accuracy RBF network without training

My model is a standard exact RBF interpolator\. I call it interpolator rather than network as it does not involve DNNs\. The machinery is pretty heavy, similarly to DNNs, with many shared features\. However, it uses 10,000 times fewer embeddings because I focus on corporate corpuses\. That is, SLMs trained on a corpus and the English language \(a tiny fraction of the whole Internet\) to answer specialized business questions, instead of generic LLMs that can write code, solve math problems, and answer any question in any language\. The focus is on concise yet exhaustive and structured answers, with a relevancy score attached to each item in the response\.

For numerical data, I use radial Gaussian mixtures as seen in formulas \([1](https://arxiv.org/html/2605.30385#S2.E1)\) and \([3](https://arxiv.org/html/2605.30385#S2.E3)\)\. See also formula \([4](https://arxiv.org/html/2605.30385#S2.E4)\) for adaptation to text data\. In the Python code, text strings are kept “as is”, and not even turned into numerical embeddings, with vector databases replaced by nested hashes\. Also, the choice of the kernel $K$ is not important\. Instead, I put emphasis on fast computations using pre\-tabulated values with the minimum precision needed via quantization\.

The two main differences with standard models are as follows:

- •The weights $\omega_k(x)$ depend on $x$ and are normalized: the terms $\omega_k(x)\exp[-\tau K(x,\beta_k)]$ add up to 1, see formula \([2](https://arxiv.org/html/2605.30385#S2.E2)\)\. This contrasts with most other implementations that do not require normalization and instead use DNNs to find the best weights \(called parameters\)\.
- •I focus on the singular case when $\tau\rightarrow\infty$ in \([1](https://arxiv.org/html/2605.30385#S2.E1)\)\. It results in exact predictions on the training data, no matter the weights and other parameters or hyperparameters\. This is a consequence of theorem [2\.1](https://arxiv.org/html/2605.30385#S2.Thmtheorem1) proved in chapter 6 in \[[2](https://arxiv.org/html/2605.30385#bib.bib16)\]\. To work well, it requires careful attention to the numerical analysis aspects\. Then, no training is needed\.

There are several other differentiators\. I use multi\-tokens instead of embeddings, each consisting of a sequence of stemmed words\. This makes it easier to match business acronyms in the prompt to text in the corpus\. Also, there are different types of multi\-tokens: regular and contextual\. The latter consist, for instance, of text elements found in titles, tags, categories, or bigger fonts in PDF documents\. The user can specify tags and negative keywords when prompting\.

### 2\.1 Model description and formulation

Predictive models are typically denoted as $Y=f(X)$ where the response $Y$ is a column vector with $n$ observations, and the input data $X$ is a table with $n$ rows and $m$ columns\. The columns are called the features, and $m$ is the dimension\. Here I use the notation $\beta$ to represent $X$, with $\beta_k$ corresponding to the $k$\-th row and $\beta_{kj}$ being the value in cell $(k,j)$ in the table\. The $\beta_k$’s are called the nodes\. More specifically, $\beta=\varphi(X)$ where $\varphi$ is an invertible transform \(or a chain of invertible transforms\) used as normalizer to dramatically improve the performance\. The model is as follows:

$$f_{\text{pred}}(x)=\sum_{k=1}^{n}\omega_{k}(x)\,f(\beta_{k})\,\exp\Big[-\tau\,K(x,\beta_{k})\Big], \tag{1}$$

where $f_{\text{pred}}(x)$ is the predicted value of $f(x)$ for a function $f$ known only at $n$ locations $\beta_1,\dots,\beta_n$ in a space of dimension $m$\. For all $x$, the weights $\omega_k(x)$ must satisfy:

$$\sum_{k=1}^{n}\omega_{k}(x)\exp\Big[-\tau\,K(x,\beta_{k})\Big]=1. \tag{2}$$

Thus the weights depend on $x$ and implicitly on the nodes in a highly non\-linear way\. This is our first strong departure from many kernel models\. Another major difference is that the number of non\-zero weights can be as large as $n$, compared to other methods where most weights $\omega_k(x)$ are zero unless $x$ and $\beta_k$ are close enough\. The function $K$ is called the kernel\. It must be positive, symmetric, and equal to zero only if both arguments are identical\. A typical example in our framework is

$$K(x,\beta_{k})=\Big[\frac{1}{m}(x-\beta_{k})(x-\beta_{k})^{T}\Big]^{\gamma}=\Bigg[\frac{1}{m}\,\big\|x-\beta_{k}\big\|^{2}\Bigg]^{\gamma} \tag{3}$$

assuming $x,\beta_k$ are row vectors with $m$ components \($m$ as large as 1000\), and $T$ is the transposition operator\. So, the vector product in \([3](https://arxiv.org/html/2605.30385#S2.E3)\) is a dot product\. I mostly used $\gamma=\frac{1}{2}$, with $\gamma=1$ on occasions\. The multiplication by $1/m$ in \([3](https://arxiv.org/html/2605.30385#S2.E3)\) proves particularly useful when $m$ is large, acting as a normalizer\. A slight generalization consists of using

$$K(x,\beta_{k})=\sum_{j=1}^{m}\theta_{j}\,\delta(x_{j},\beta_{kj}) \tag{4}$$

where the $\theta_j$’s are positive and add up to 1\. In case of numerical values, $\delta(x_j,\beta_{kj})=(x_j-\beta_{kj})^2$\. In the context of LLMs, the meaning is as follows:

- •Both $x$ and $\beta$ are text; $x$ comes from a prompt, while $\beta$ comes from a corpus\.
- •$x_j,\beta_{kj}$ may be small text strings, and $\delta(x_j,\beta_{kj})$ is some association measure between $x_j$ and $\beta_{kj}$, such as the enhanced PMI in the xLLM model\.
- •$f(x)=P(x_m\,|\,x_1,\dots,x_{m-1})$ is the probability to observe $x_m$ at the end of a text string $x$, given that the previous text elements in the string are $x_{m-1},x_{m-2},\dots,x_1$ in that order\. Thus, it deals with next token prediction\.
- •The parameters $\theta_j$ weight each element of $x$ based on its position $j$ in the string $x$\. Typically, the $\theta_j$’s are decaying weights to give more importance to word elements with a location closer to $x_m$ in the string $x$\.
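The text version of kernel \(4\) can be sketched in a few lines of Python\. This is only an illustration: the 0/1 mismatch function stands in for the association measure $\delta$ \(the paper mentions an enhanced PMI, which is not reproduced here\), and the geometric decay, the function names and the sample multi\-tokens are assumptions\.

```python
def position_weights(m, decay=0.5):
    """theta_j proportional to decay**(m - j): later positions weigh more; sums to 1."""
    raw = [decay ** (m - j) for j in range(1, m + 1)]
    total = sum(raw)
    return [r / total for r in raw]

def mismatch(a, b):
    """Stand-in for delta (an assumption): 0 if the strings are equal, 1 otherwise."""
    return 0.0 if a == b else 1.0

def text_kernel(x, beta_k, theta, delta=mismatch):
    """K(x, beta_k) = sum_j theta_j * delta(x_j, beta_kj), as in formula (4)."""
    if not (len(x) == len(beta_k) == len(theta)):
        raise ValueError("x, beta_k and theta must have the same length")
    return sum(t * delta(a, b) for t, a, b in zip(theta, x, beta_k))

theta = position_weights(4)
x = ("deep", "learn", "gpu", "train")          # hypothetical stemmed multi-token
late = ("deep", "learn", "gpu", "infer")       # differs in the last position
early = ("fast", "learn", "gpu", "train")      # differs in the first position
print(text_kernel(x, x, theta), text_kernel(x, late, theta), text_kernel(x, early, theta))
```

With decaying weights, a mismatch near the end of the string moves the kernel further from zero than a mismatch at the start, which is the position sensitivity described in the last bullet\.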

Model \([3](https://arxiv.org/html/2605.30385#S2.E3)\) is a particular case of \([4](https://arxiv.org/html/2605.30385#S2.E4)\) with all $\theta_j$’s being equal\. Ignoring the $\varphi$\-transform for now, $\beta$ is called the training set, whether actual training is needed or not\. The first important result, applying both to text and numerical data, is as follows:

###### Theorem 2\.1

If $x$ is an observation in the training set and $f(x)\neq 0$, then formula \([1](https://arxiv.org/html/2605.30385#S2.E1)\) estimates $f(x)$ exactly when $\tau\rightarrow\infty$, whether based on kernel \([3](https://arxiv.org/html/2605.30385#S2.E3)\) or \([4](https://arxiv.org/html/2605.30385#S2.E4)\), both for text or numerical data\. No training is needed\.

Note that the theorem remains true even if multiple nodes $\beta_k$ are duplicates, as long as they share the same value $f(\beta_k)$\. Thus, the model can fit any data, even the most chaotic, even if the data is white noise\. The convergence is pointwise, not uniform\. Thus, to make correct predictions outside the training set, you may need a large training set if $f$ is not smooth\.
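The following NumPy sketch illustrates the exactness property on the training set\. It relies on one simple choice of weights that satisfies constraint \(2\), namely $\omega_k(x)=1/\sum_j\exp[-\tau K(x,\beta_j)]$ for all $k$; this choice is an assumption made for illustration and is not necessarily the weighting used in the paper\. The function names and the synthetic data are hypothetical as well\.

```python
import numpy as np

def kernel(x, nodes, gamma=0.5):
    """Formula (3): K(x, beta_k) = [(1/m) * ||x - beta_k||^2]^gamma for every node."""
    m = nodes.shape[1]
    sq_dist = np.sum((nodes - x) ** 2, axis=1)
    return (sq_dist / m) ** gamma

def predict(x, nodes, values, tau=500.0, gamma=0.5):
    """Formula (1) with the assumed weights omega_k(x) = 1 / sum_j exp(-tau K(x, beta_j))."""
    k = kernel(x, nodes, gamma)
    # Shift by the smallest exponent: the common factor cancels in the ratio,
    # and it avoids a 0/0 underflow when tau is large.
    e = np.exp(-tau * (k - k.min()))
    return float(np.dot(e, values) / e.sum())

rng = np.random.default_rng(0)
n, m = 40, 10
nodes = rng.uniform(size=(n, m))     # beta_1, ..., beta_n
values = rng.normal(size=n)          # f(beta_k): pure noise on purpose

train_pred = np.array([predict(b, nodes, values) for b in nodes])
max_train_error = float(np.max(np.abs(train_pred - values)))
print(max_train_error)
```

With $\tau=500$ standing in for $\tau\rightarrow\infty$, the term attached to the matching node dominates the sum, so the prediction at each training node reproduces its stored value up to a negligible error, even though the values are random\.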

### 2\.2 Benign overfitting, other features and benefits

My model, just like most DNNs, benefits from benign overfitting: the ability to get very good predictions outside the training set despite being 100% correct \(thus very overfit\) on the training data\. For DNNs, why it works remains a mystery\. However, in my case, the explanation is straightforward\. The model originates from exact spatial interpolation in 2 dimensions, also known as kriging\. If you look at figure [1](https://arxiv.org/html/2605.30385#S2.F1), the dots represent training set locations where the temperature is perfectly matched\. Outside those locations, you still have good predictions, actually better than from standard models that tend to oversmooth: the model produces irregular contour lines instead of smooth elliptic curves, to better represent local variations\. The bottom right corner is the city of Chicago with its heat dome\.

![Refer to caption](https://arxiv.org/html/2605.30385v1/vgeo2.png)

Figure 1: Chicago temperatures with exact predictions on training set points \(the dots\)

My model generalizes the 2D case to 10,000\+ dimensions, without suffering from the curse of dimensionality\. The reason is that even though the training set occupies a tiny portion of the space in high dimensions, this is also true for the validation set, and both domains significantly overlap, because both deal with the same narrow type of data: what is in your specialized corpus for the most part\. In addition, it is easy to understand why the model works, since it consists of locations and scale parameters, thus based on explainable AI instead of black\-box parameters, with connection to nearest neighbor techniques\. Corpuses where the same terms have different meanings depending on the business department benefit from splitting into sub\-LLMs, also known as mixtures of experts\.

Other features of my model:

- •Deterministic AI\. Standard LLMs rely on DNNs and probabilistic models, for instance Gaussian latent variables, Markov chains, autoregressive models, and stochastic gradient descent\. None of this is present in my model, leading to replicability\.
- •Temperature fine\-tuning\. To enrich the response to a prompt, LLMs allow you to choose a temperature value to generate different answers when you run the same prompt twice\. This feature is also available in my model, yet with full replicability as long as you choose the same seed\. Indeed, I use my own NPG random generator, faster and better, see [here](https://mltblog.com/npg)\. It gives you full control and replicability, unlike black\-box PRNGs which may change over time without your knowledge\. This happened in NumPy: the newer `default_rng` interface uses PCG64 by default instead of the Mersenne Twister used by the legacy interface, leading to different results even if using the same seed\.
- •Attention mechanism\. As in all transformer\-based LLMs, my model incorporates an attention mechanism: the order of the words in a sentence does matter\. See formula \([4](https://arxiv.org/html/2605.30385#S2.E4)\) and the associated description\.
- •Three\-way training\. Predictions are always correct on the training set, thus you cannot use the standard two\-way method \(training \+ validation sets\) to optimize my model\. Instead, you need an intermediate set between training and validation\. I call it the optimization set\. You use it to fine\-tune the hyperparameters\.
- •Adaptive parameters\. The hyperparameters $\tau$ and $\gamma$ in formulas \([1](https://arxiv.org/html/2605.30385#S2.E1)\) and \([3](https://arxiv.org/html/2605.30385#S2.E3)\) can depend on $k$ instead of being static\. In this enhanced model \(the adaptive version\), they play the role of parameters in standard DNNs\. They are optimized using a DNN on the optimization set described above, as the predictions are still 100% correct on the training set\. A good analogy is thermonuclear bombs \(DNNs in this context\) that need a standard nuclear detonation \(my RBF device\) as a pre\-processing step to trigger the massive explosion\. Alternatively, for the second step, you can use another RBF on top of the first one, instead of DNNs\.
- •Weights distillation\. The number $n$ of training set embeddings in formula \([1](https://arxiv.org/html/2605.30385#S2.E1)\) can be very large\. Yet most terms in the sum contribute very little\. You can speed up computations by dropping terms \(or $\beta_k$’s\) where $x$ and $\beta_k$ are almost never close neighbors regardless of the input $x$ coming from the prompt, by analyzing data over time to optimize for speed via self\-learning\. I call it auto\-distillation\. Another idea is to do searches in dimensions much lower than $m$ to reduce the number of terms on the fly, depending on $x$\. By contrast, LLM products on the market are very expensive, as tons of negligible weights that could be ignored cost you the same as high\-value weights\.
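The three\-way method from the list above can be sketched as follows\. This is a minimal illustration, not the author's code: it reuses a simple normalized\-weight version of formula \(1\) \(an assumed choice of weights satisfying \(2\)\), a hypothetical smooth target, and an arbitrary grid of $\tau$ values\. The optimization set selects $\tau$, and the validation set is only used once at the end\.

```python
import numpy as np

def predict(x, nodes, values, tau, gamma=0.5):
    """Formula (1) with assumed weights omega_k(x) = 1 / sum_j exp(-tau K(x, beta_j))."""
    k = (np.sum((nodes - x) ** 2, axis=1) / nodes.shape[1]) ** gamma
    e = np.exp(-tau * (k - k.min()))   # shifted for numerical stability
    return float(np.dot(e, values) / e.sum())

def mse(X, y, nodes, values, tau):
    """Mean squared error of the interpolator on the points (X, y)."""
    pred = np.array([predict(x, nodes, values, tau) for x in X])
    return float(np.mean((pred - y) ** 2))

rng = np.random.default_rng(1)
X = rng.uniform(size=(600, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] * X[:, 2]   # hypothetical smooth target

# Three-way split: training nodes, optimization set, validation set
X_train, y_train = X[:400], y[:400]
X_opt, y_opt = X[400:500], y[400:500]
X_val, y_val = X[500:], y[500:]

# The training error cannot guide the choice of tau, so the optimization set does.
tau_grid = [5.0, 20.0, 50.0, 100.0, 200.0, 500.0]
opt_scores = {tau: mse(X_opt, y_opt, X_train, y_train, tau) for tau in tau_grid}
best_tau = min(opt_scores, key=opt_scores.get)
val_mse = mse(X_val, y_val, X_train, y_train, best_tau)
print(best_tau, val_mse)
```

A grid search is used here only for brevity; the adaptive version described above would instead fit node\-dependent values of $\tau$ and $\gamma$ on the same optimization set\.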

Even though $n$ is large, typically in the high six or low seven digits, it is still far smaller than in standard LLMs where the lowest value that works is in the billions, due to the nature of DNNs\. In short, the gain in efficiency is of the order 10,000\. The main reason is that we are dealing with SLMs instead of LLMs\. Also, I use long tokens consisting of stemmed words augmented with stemmed acronyms, further lowering $n$ while increasing relevancy and exhaustivity in the response, with a focus on concise structured output as opposed to long paragraphs and high\-quality grammar\. The end goal is summarization, not essay writing\.

### 2\.3 From billions to fewer than a million parameters

As a side note, the number $V_N$ of unique tokens found in a corpus follows Heaps’ law: $V_N=\lambda N^{\nu}$, where $N$ is the total number of tokens, a linear function of the corpus size\. In practice, $0.4<\nu<0.6$ and $10<\lambda<100$\. If instead of using the whole Internet to train your model, you carefully select sources covering 0\.01%, you reduce $V_N$ by a factor 100 without impact on quality\. Also, the vocabulary of an average human consists of fewer than 30k tokens\. Thus using a lot more is pointless, as most will almost never be shown in an answer, and when they are, they won’t be understood\.
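The factor 100 follows directly from the power law, assuming that the same $\lambda$ and $\nu$ apply to the selected subset and that the token count scales with the fraction $\rho$ of the sources retained \(both are simplifying assumptions\):

$$\frac{V_{\rho N}}{V_{N}}=\frac{\lambda(\rho N)^{\nu}}{\lambda N^{\nu}}=\rho^{\nu},\qquad \rho=10^{-4},\ \nu=0.5\ \Rightarrow\ \rho^{\nu}=10^{-2}.$$

Over the stated range $0.4<\nu<0.6$, the reduction factor $\rho^{-\nu}$ lies between $10^{1.6}\approx 40$ and $10^{2.4}\approx 251$, so 100 corresponds to the midpoint $\nu=0.5$\.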

Table 2: $m$\-gram counts, NVIDIA corpus

The same probabilistic laws apply to unique token combinations consisting of $m$ tokens \(embeddings, in short\)\. While the number quickly explodes as a function of $m$ when $m$ is small, it also quickly tapers off when $m$ reaches some rather low critical threshold, depending on corpus size\. This is illustrated in table [2](https://arxiv.org/html/2605.30385#S2.T2), where $m$\-gram stands for multi\-token with $m$ words\.

The reason is as follows: the immense majority of multi\-tokens of \(say\) size $m=10$ are absent in the corpus, and the few that show up have very low frequency \(typically fewer than 2 occurrences\), making them useless for predictions, and best ignored\. There will definitely be very long multi\-tokens \($m$ large\) that occur a few times, and it’s best to keep them\. All this to explain why my model needs far fewer tokens, and why it makes sense to use hashes rather than vectors to store them, to efficiently handle variable\-length embeddings and sparsity\.

## 3 Case study: 96% correct prediction rate

In one tough test involving numerical synthetic data, I scrambled a training set with $n=10^4$ observations and $m=10^3$ dimensions by adding significant noise\. I then dropped 70% of the observations in the training set and computed $f_{\text{pred}}$ with \([1](https://arxiv.org/html/2605.30385#S2.E1)\)\. The R\-squared on the full training data with blurry response dropped from 100% to 50%, mostly due to the noise\. Yet, outside the training set, on input data with known noise\-free response, it was 97%\. Note that since the data is synthetic, $f_{\text{real}}$ is known\. In this case, $f_{\text{real}}$ consisted of 100 irregular splines \(far from a smooth function\) and I used $\tau=500$ as an approximation to the desired $\tau=\infty$\. In short, my system was able to recover the original signal\. It also shows that in specific situations, it is possible to get a good outcome out of bad data, even in high dimensions\.

This is one of many examples that seem to defeat all the known laws of statistics, used in my stress\-test and summarized in table 6\.1 in \[[2](https://arxiv.org/html/2605.30385#bib.bib16)\]\. The fully replicable Python code is also available in that book\. It involves normalization and recalibration techniques, central to DNNs\. The case I now discuss deals with text rather than numerical data\. It is less challenging despite coming from a real\-world corpus \(NVIDIA\) instead of AI\-generated data\.

### 3\.1 NVIDIA case study

The mechanics for text data processing are described in detail in section 6\.2\.3 in \[[2](https://arxiv.org/html/2605.30385#bib.bib16)\]\. Here I provide a quick overview and the main results\. The ability to provide different answers to the same prompt comes from the fact that $f_{\text{pred}}$ returns more than just the predicted value, but also other multi\-tokens $\beta_k$ that significantly contribute to the sum in \([1](https://arxiv.org/html/2605.30385#S2.E1)\)\. Also, the correct prediction rate is significantly boosted outside the training set by a mechanism to avoid $f(x)=0$ and the corresponding issue in theorem [2\.1](https://arxiv.org/html/2605.30385#S2.Thmtheorem1)\.

![Refer to caption](https://arxiv.org/html/2605.30385v1/interx.png)

Figure 2: Goodness of fit when predicting the next token, with correct prediction \(green dots\) on red diagonal

![Refer to caption](https://arxiv.org/html/2605.30385v1/interz.png)

Figure 3: Goodness of fit when looking for related multi\-tokens, perfect fit for green dots on red diagonal

I tested my model for two different tasks: predicting the next token, and finding relevant multi\-tokens associated to the one you are interested in \(denoted as $x$\)\. The latter is fundamental to suggest related prompts to a user querying the system\. Performance outside the training set is shown respectively in figures [2](https://arxiv.org/html/2605.30385#S3.F2) and [3](https://arxiv.org/html/2605.30385#S3.F3)\. Again, I use the terms multi\-token and embeddings interchangeably\. However, my “single” tokens consist of whole stemmed words instead of classic, small tokens\.

Here $m=4$\. The number of multi\-tokens of size $m$ in the training set is $n=15,000$, out of the 74,717 total for the corpus as reported in table [2](https://arxiv.org/html/2605.30385#S2.T2)\. To predict the next token based on the 3 previous ones, I test all the 5,804 single tokens in the corpus to find which one maximizes $f_{\text{pred}}$\. I also run a number of queries to see how many unique multi\-tokens $\beta_k$ from the training set get triggered in \([1](https://arxiv.org/html/2605.30385#S2.E1)\) to answer them\. See figure [4](https://arxiv.org/html/2605.30385#S3.F4) with the number of queries on the X\-axis\. To serve 1,600 queries, you need about 9,000 $\beta_k$’s, that is, about 60% of the training set size\. That number initially increases linearly with the number of queries, but then tapers off\.

An even stronger indicator of the sparsity of the system is shown in figure [5](https://arxiv.org/html/2605.30385#S3.F5): each query $x$ in a sample of 9,000 triggers fewer than 20 $\beta_k$’s out of 15,000 terms in \([1](https://arxiv.org/html/2605.30385#S2.E1)\)\. Of course, which ones are triggered depends on $x$\. But it shows that there is room for improvement to potentially significantly reduce the number of terms in \([1](https://arxiv.org/html/2605.30385#S2.E1)\) by only keeping the active ones, either depending on the prompt \(big saving $>99.5\%$ but difficult\), or globally \(easy but smaller saving $<40\%$\)\.

![Refer to caption](https://arxiv.org/html/2605.30385v1/interz2.png)

Figure 4: Cumulative number of nodes triggered \(Y\-axis\) out of 15,000, based on cumulative queries on the X\-axis

![Refer to caption](https://arxiv.org/html/2605.30385v1/interz3.png)

Figure 5: Nodes triggered out of 15,000 to answer a query \(X\-axis\), distribution based on 9,000 queries

Table [3](https://arxiv.org/html/2605.30385#S3.T3) shows examples of related multi\-tokens associated to an original multi\-token $x$ with $m=4$ stemmed words, along with metrics \($f(x)$ and grade\) assessing the quality of the match, based on the corpus\. Table [4](https://arxiv.org/html/2605.30385#S3.T4) shows the prediction $\text{Next}_{\text{pred}}$ based on the 3 previous tokens, and compares it with the real one $\text{Next}_{\text{real}}$ that was deleted from the multi\-token in question\. I focused on the few cases outside the training set where the prediction is not correct\. The suggested next token, while not correct, is actually very relevant, enriching the response\.

Table 3: Sample multi\-tokens with their top relatives and quality metrics

Table 4: Examples of failed prediction for the next token

### 3\.2 Next token prediction: computational complexity

The computational complexity of predicting the next token using a Deep Neural Network \(DNN\), typically an autoregressive transformer, is $\mathcal{O}(L\times d^{2})$ for each token, where $L$ is the number of layers and $d$ is the hidden dimension\. Total inference cost scales quadratically with sequence length due to attention mechanisms\.

The computational complexity after the training phase, that is, during inference \(running the model to predict the next token\), is $\mathcal{O}(L\cdot d\cdot(T^{2}+V))$ for a standard transformer model\. When utilizing optimization techniques like key\-value \(KV\) caching, the per\-token computational complexity reduces to $\mathcal{O}(L\cdot d^{2}+L\cdot T\cdot d)$\. I now break down the post\-training computations with typical values for small models \(SLMs\) to assess actual performance\. The numbers come from Google AI\.

- •$V$ is the vocabulary size, that is, the number of unique tokens\. In my small case study with stemmed tokens, $V=5804$, see table [2](https://arxiv.org/html/2605.30385#S2.T2)\. Typical values range from 32,000 to 50,000\.
- •$T$ is the context length and ranges from 4,096 to 8,192 tokens\. In my case study, $T=m=4$\. My model can handle far larger values \(when it makes sense\) and has been tested on numerical data with $m=1000$\.
- •$L$ is the number of layers and ranges from 24 to 32\.
- •$d$ is the hidden dimension and ranges from 2048 to 3072\.

The value for $T$ is very small in my model compared to standard transformer\-based SLMs because the goal is different: identifying related concepts, as opposed to producing full, long English sentences\. That aside, the computational complexity of my model, without optimization, is $\mathcal{O}(n\cdot T\cdot V)$, where $n=15,000$ is the number of unique multi\-tokens of size $T=4$ in the training set\. For the tasks performed here, my system is as efficient as standard DNNs while entirely skipping the costly training phase\. Optimization techniques consist in finding how to skip most of the irrelevant terms \(out of $n$\) when computing \([1](https://arxiv.org/html/2605.30385#S2.E1)\) for a specific $x$, and using pre\-tabulated values in a mechanism similar to KV cache\.
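As a rough sanity check of the comparison, the two complexity expressions can be evaluated with the values quoted above\. This is order\-of\-magnitude arithmetic only: the constants hidden by the $\mathcal{O}$ notation are ignored, the function names are hypothetical, and the low end of each SLM range is used as an assumption\.

```python
def transformer_ops_per_token(L, d, T):
    """O(L*d^2 + L*T*d): per-token cost with KV caching, constants ignored."""
    return L * d * d + L * T * d

def rbf_ops_per_prediction(n, T, V):
    """O(n*T*V): unoptimized cost of scoring all V candidate tokens against n nodes."""
    return n * T * V

# Low end of the SLM ranges listed above
slm_low = transformer_ops_per_token(L=24, d=2048, T=4096)
# Case-study values for the RBF interpolator
rbf = rbf_ops_per_prediction(n=15_000, T=4, V=5_804)

print(f"SLM (low end): {slm_low:.2e}")
print(f"RBF model    : {rbf:.2e}")
print(f"ratio        : {rbf / slm_low:.2f}")
```

Both counts land around $3\times 10^{8}$ elementary operations per prediction, which is consistent with the statement that the unoptimized system is in the same range as a small transformer at inference time\.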

### 3\.3Earlier DNN\-free model with exact predictions on training set

The first model tested for numerical data involved predicting $f(x)$ by first approximating $x$ with a convex linear combination of the nodes $\beta_{1},\dots,\beta_{n}$ in the training set, solving the constrained quadratic optimization problem

$$\omega^{*}(x)=\arg\min_{\omega}\,\Big\|x-\sum_{k=1}^{n}\omega_{k}\beta_{k}\Big\|^{2} \tag{5}$$

where $\omega=(\omega_{1},\dots,\omega_{n})$ and the nonnegative weights $\omega_{k}$ add up to 1. If $x=\beta_{i}$ is one of the nodes, then $\omega^{*}_{k}(x)$ is zero for all $k$ except $k=i$, for which $\omega^{*}_{k}(x)=1$. Then, instead of ([1](https://arxiv.org/html/2605.30385#S2.E1)), I predict $f(x)$ with

$$f_{\text{pred}}(x)=\sum_{k=1}^{n}\omega^{*}_{k}(x)f(\beta_{k}) \tag{6}$$

Again, it leads to exact predictions $f_{\text{pred}}(x)=f(x)$ in any dimension when $x$ is in the training set, combined with the desirable feature known as benign overfitting. However, the optimum is reached when all the weights $\omega^{*}_{k}(x)$ are zero except for the nodes $k$ where $\beta_{k}$ is one of the $m+1$ vertices of the polyhedron encompassing $x$ in the $m$-dimensional Delaunay triangulation of the training set. I did not pursue this approach since I did not see how to adapt it to LLMs. The other drawback is computational complexity, as the optimization problem must be solved for each new vector $x$. In LLMs, $x$ would be some embedding coming from a prompt.
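The two-step procedure in (5) and (6) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the solver used in the original experiments: it assumes a generic projected gradient descent with a sort-based projection onto the probability simplex. The function names, the toy nodes and the values $f(\beta_{k})$ are all hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_k >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, tau = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - 1.0) / j
        if u_j - candidate > 0:
            tau = candidate  # keep the threshold of the last valid index
    return [max(v_k - tau, 0.0) for v_k in v]


def convex_weights(x, nodes, iterations=2000):
    """Approximate omega*(x) in (5) by projected gradient descent."""
    n, dim = len(nodes), len(x)
    # Step size 1/Lip, using the bound Lip <= 2 * sum_k ||beta_k||^2.
    lipschitz = 2.0 * sum(b_j * b_j for beta in nodes for b_j in beta)
    step = 1.0 / max(lipschitz, 1e-12)
    w = [1.0 / n] * n  # start from uniform weights
    for _ in range(iterations):
        # Residual r = x - sum_k w_k * beta_k.
        r = [x[j] - sum(w[k] * nodes[k][j] for k in range(n)) for j in range(dim)]
        # Gradient of ||r||^2 with respect to w_k is -2 * <beta_k, r>.
        grad = [-2.0 * sum(nodes[k][j] * r[j] for j in range(dim)) for k in range(n)]
        w = project_to_simplex([w[k] - step * grad[k] for k in range(n)])
    return w


def predict(x, nodes, values):
    """Equation (6): weighted average of f(beta_k) with weights omega*(x)."""
    w = convex_weights(x, nodes)
    return sum(w_k * f_k for w_k, f_k in zip(w, values))


# Hypothetical toy data: m = 2, with m + 1 = 3 affinely independent nodes.
nodes = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
values = [1.0, 3.0, 5.0]  # made-up f(beta_k)

weights_inside = convex_weights((0.25, 0.25), nodes)
prediction_at_node = predict(nodes[1], nodes, values)
print(weights_inside, prediction_at_node)
```

The toy data uses exactly $m+1$ affinely independent nodes so that the minimizer of (5) is unique. With more nodes, several weight vectors can attain the same minimum, and this sketch makes no attempt to select a particular one. The loop also has to be rerun for every new $x$, which is the computational drawback mentioned above.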

## 4 Conclusions

My alternative to DNNs for LLM architecture may have been perceived as an isolated, one-off model untested by others 12 months ago. With Chinese researchers now actively working on the exact same model, it is becoming a topic of significant interest. They call it “RBF networks” while I used the term “kernel method” in the past. Both terms are correct and widely known in contexts other than LLMs. The difference reflects the research field you are coming from, but both point to the exact same equations. However, my approach is unique in the sense that it does not use DNNs to compute the weights. Instead, I obtain them in one shot without training, with 100% correct prediction on the training set, without bad overfitting, in high dimensions.

I introduce auto-distillation and pre-tabulated values as mechanisms to speed up computations. I also discuss why it works with 10,000 fewer embeddings. In the original book where my method was first published [[2](https://arxiv.org/html/2605.30385#bib.bib16)], I also discuss distillation-resistant invisible watermarking techniques to protect your model. Last but not least, I feature a case study with a 96% correct prediction rate for the next token, and discuss replicability, explainability and deterministic AI attached to the model, with the ability to allow for controlled randomness in the response if desired. Due to perfect predictions on the training set, I explain how to perform three-way training to fine-tune the hyperparameters. The 96% correct prediction rate outside the training set is far above the 30 to 55% achieved by standard transformer-based models, while avoiding costly training and without increased compute time post-training. This high performance is due to specialization to the specific corpus, in contrast to generic predictors.

The next steps include working with a larger corpus, and performing tasks beyond predicting the next token, suggesting related queries, or finding synonyms. The methodology is also well suited for image classification and problems with numerical data (time series and so on).

## References

- [1] (2025) Concept control for LLM safety using radial basis function representations. ACM Digital Library, pp. 220–232. Note: Proc. Australasian AI Conf. 2025 [Link](https://dl.acm.org/doi/10.1007/978-981-95-4969-6_17)
- [2] V. Granville (2026) No-blackbox, secure, efficient AI and XLLM solutions. MLT. Note: [Link](https://mltechniques.com/product/no-blackbox-secure-efficient-ai-and-llm-solutions/)
- [3] T. Ji (2025) A comprehensive survey on Kolmogorov Arnold networks (KAN). Preprint, pp. 1–16. Note: arXiv:2407.11075v7 [Link](https://arxiv.org/html/2407.11075v7)
- [4] W. Lu and H. Zhang (2025) Rethinking Nonlinearity: Trainable Gaussian Mixture Models for Modern Neural Architectures. Preprint, pp. 1–12. Note: arXiv:2510.06660v1 [Link](https://arxiv.org/html/2510.06660v1)
- [5] Y. Ouyang et al. (2026) Nonlinearity as rank: generative low-rank adapter with radial basis functions. Preprint, pp. 1–31. Note: arXiv:2602.05709 [Link](https://arxiv.org/abs/2602.05709)