TorchKM: A GPU-Oriented Library for Kernel Learning and Model Selection

arXiv cs.LG Tools

Summary

TorchKM is an open-source GPU-accelerated library for kernel machines (SVMs, kernel logistic regression, etc.) with a scikit-learn-style API. It accelerates training and model selection by reusing matrix operations, offering substantial speedups over standard baselines.

arXiv:2606.06742v1 Announce Type: new Abstract: TorchKM is an open-source library for kernel machines, including support vector machines, kernel logistic regression, and kernel quantile regression, with GPU acceleration. The library features a scikit-learn-style API and is designed to exploit GPU-friendly linear algebra, accelerating the full training and model-selection pipeline through intelligent reuse of matrix operations. Benchmarks show competitive predictive performance together with substantial speedups over standard baselines. Code and documentation are available at https://github.com/YikaiZhang95/torchkm, and the package can be easily installed via PyPI.


# TorchKM: A GPU‑Oriented Library for Kernel Learning and Model Selection
Source: [https://arxiv.org/html/2606.06742](https://arxiv.org/html/2606.06742)
Yikai Zhang (yikai-zhang@uiowa.edu), Department of Statistics and Actuarial Science, University of Iowa, Iowa City, IA 52242, USA

Jie Ding (dingj@umn.edu), School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA

Boxiang Wang (boxiang-wang@uiowa.edu), Department of Statistics and Actuarial Science, University of Iowa, Iowa City, IA 52242, USA

###### Abstract

TorchKM is an open-source library for kernel machines, including support vector machines, kernel logistic regression, and kernel quantile regression, with GPU acceleration. The library features a scikit-learn-style API and is designed to exploit GPU-friendly linear algebra, accelerating the full training and model-selection pipeline through intelligent reuse of matrix operations. Benchmarks show competitive predictive performance together with substantial speedups over standard baselines. Code and documentation are available at [https://github.com/YikaiZhang95/torchkm](https://github.com/YikaiZhang95/torchkm), and the package can be easily installed via PyPI.

Keywords: algorithm–hardware co-design, cross-validation, GPU acceleration, kernel machines, model selection, support vector machines

## 1 Introduction

Kernel machines, including support vector machines (SVMs; Cortes and Vapnik, [1995](https://arxiv.org/html/2606.06742#bib.bib3)), have been a fundamental class of machine learning algorithms. They are recognized for their strong predictive performance, a convex optimization framework with unique global solutions, and rigorous theoretical foundations. Kernel machines remain highly competitive in structured-data applications and continue to define the state of the art in many fields, as evidenced by recent comprehensive evaluations in Cervantes et al. ([2020](https://arxiv.org/html/2606.06742#bib.bib1)).

In modern practice, however, kernel machines are often relegated to the “back seat” of the machine learning toolbox, and the primary barrier is computational cost. The performance of an SVM, for instance, is sensitive to hyperparameter choices, but identifying a good configuration can be prohibitively expensive. Classical implementations, such as LIBSVM (Chang and Lin, [2011](https://arxiv.org/html/2606.06742#bib.bib2)), scikit-learn (Pedregosa et al., [2011](https://arxiv.org/html/2606.06742#bib.bib11)), and kernlab (Karatzoglou et al., [2004](https://arxiv.org/html/2606.06742#bib.bib6)), follow a standard paradigm in which training and model selection are treated as separate stages. Although fitting a model at each tuning parameter can be made efficient through specialized algorithms, such as sequential minimal optimization (Platt, [1999](https://arxiv.org/html/2606.06742#bib.bib12)), tuning is still typically carried out in an outer loop through repeated refitting over a parameter grid, often at substantial computational cost. As a result, kernel methods are often under-tuned in practice and may not achieve their full predictive potential.

One promising way to reduce computational cost is to exploit advances in modern computing hardware, particularly graphics processing unit (GPU) acceleration (Nickolls et al., [2008](https://arxiv.org/html/2606.06742#bib.bib10)). An important breakthrough in this direction was ThunderSVM (Wen et al., [2018b](https://arxiv.org/html/2606.06742#bib.bib19); Jiang et al., [2021](https://arxiv.org/html/2606.06742#bib.bib5)), which substantially accelerated kernel SVM training on GPUs. Despite this important progress, end-to-end kernel-machine workflows can further benefit from acceleration beyond a single model fit, since repeated matrix operations across cross-validation folds and tuning parameters often dominate the overall computational cost. Reducing this tuning overhead is therefore essential for fully exploiting GPU acceleration in kernel-machine computation.

In this paper, we present TorchKM, an open-source, GPU-oriented library for efficient end-to-end kernel learning with integrated model selection. TorchKM supports a broad range of methods for both regression and classification, including quantile regression, logistic regression, SVM, and distance-weighted discrimination (DWD). It provides a user-friendly, scikit-learn-style interface. Rather than directly porting an existing CPU solver to the GPU, TorchKM is built through algorithm–hardware co-design: its computational algorithms are designed to exploit GPU-friendly linear algebra. At its core are two key algorithmic ideas, an exact cross-validation formula and a spectral algorithm, which together speed up the full training and tuning pipeline and still produce exact solutions rather than approximations.

Table 1: Comparison of representative kernel-learning libraries. ThunderSVM is known as a GPU-accelerated SVM library; TorchKM focuses on an integrated model-selection pipeline.

The key differences between TorchKM and the two representative libraries, scikit-learn and ThunderSVM, are summarized in Table [1](https://arxiv.org/html/2606.06742#S1.T1). Notably, TorchKM features an integrated pipeline for joint training and tuning, which is complementary to ThunderSVM’s emphasis on fast GPU-accelerated SVM fitting. Section 2 describes the core features of TorchKM, including GPU acceleration, support for methods beyond SVM, probability calibration, and scalable approximation for large data sets. Section 3 then outlines the exact cross-validation and spectral algorithm that make this integration possible. Algorithmic details and further numerical results are provided in the appendix.

## 2 Package Overview and Workflow

TorchKM adopts a scikit-learn-style interface, so users already familiar with scikit-learn should find it easy to use. Taking the SVM as an example, the basic workflow is similar to that of `sklearn.svm.SVC`: users instantiate `TorchKMSVC`, call `fit` to train the SVM model, and then use `predict` for prediction.

The key difference from the standard `sklearn.svm.SVC` workflow is that `TorchKMSVC` accepts a sequence of tuning parameters through `Cs`, although a single value is also permitted. In the standard scikit-learn workflow, model selection is typically handled by an outer loop over candidate values through functions such as `GridSearchCV` or `RandomizedSearchCV`, with a separate fit for each value. In TorchKM, by contrast, training and tuning are integrated into the algorithm itself rather than carried out through repeated external refits.

The basic usage of `TorchKMSVC` is demonstrated in the following code snippet.

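    import numpy as np
    from sklearn.datasets import make_circles
    from torchkm.estimators import TorchKMSVC

    # Toy two-class data: 1,200 points on two noisy concentric circles.
    X, y = make_circles(1200, factor=0.4, noise=0.08, random_state=0)

    # Grid of 12 candidate C values, from 1e3 down to 1e-3.
    Cs = np.logspace(3, -3, num=12)

    clf = TorchKMSVC(kernel="rbf", Cs=Cs, device="cuda", probability=True)
    clf.fit(X, y)            # train and tune over all Cs in one call
    clf.predict(X)           # class predictions
    clf.predict_proba(X)     # Platt-scaled probability estimates

    clf.fit(X, y, low_rank=True)   # refit with the Nystrom approximation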

Setting `device="cuda"` enables GPU computation, while `device="cpu"` uses the same workflow on a CPU when a GPU is not available. For SVMs, probability estimates are available through Platt scaling via `predict_proba`. By default, `low_rank=False`, in which case TorchKM gives exact SVM solutions; for larger data sets, users can set `low_rank=True` to enable the Nyström method and obtain approximate solutions.

Although we use SVMs for illustration, the same interface extends to other methods, including `TorchKMDWD` for DWD (Marron et al., [2007](https://arxiv.org/html/2606.06742#bib.bib8); Wang and Zou, [2016](https://arxiv.org/html/2606.06742#bib.bib14), [2018](https://arxiv.org/html/2606.06742#bib.bib15)), `TorchKMLogit` for kernel logistic regression, and `TorchKMKQR` for kernel quantile regression (Koenker and Hallock, [2001](https://arxiv.org/html/2606.06742#bib.bib7); Tang et al., [2026](https://arxiv.org/html/2606.06742#bib.bib13)). This shared API provides a consistent workflow across different kernel learning models.

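    # TorchKMDWD and TorchKMLogit are assumed to be importable from
    # torchkm.estimators, like TorchKMSVC above; Cs is the grid defined earlier.
    clf_DWD = TorchKMDWD(kernel="rbf", Cs=Cs, cv=5, device="cuda")
    clf_Logit = TorchKMLogit(kernel="rbf", Cs=Cs, cv=5, device="cuda")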

TorchKM is released under the MIT license and distributed through GitHub and PyPI. The package is tested with pytest, and the repository includes detailed installation instructions, tutorials, API documentation, benchmarking notes, and a contribution guide.

## 3 Core Computational Algorithms and Benchmark Performance

In kernel learning, the full training-and-tuning pipeline is computationally expensive because both the training split and the tuning parameter change the linear system to be solved. In a naive $K$-fold search over $L$ candidate values, this leads to roughly $KL$ kernel solves, which require $O(KLn^{3})$ operations. Because of these repeated cubic-cost operations, simply running the same algorithm on a GPU may yield only modest improvement (see the example in Figure 1 of the appendix). The central idea behind TorchKM is therefore to redesign the algorithm so that the cubic-cost burden is shifted into matrix-vector operations, which makes fuller use of the GPU’s parallel architecture to accelerate the whole training-and-tuning pipeline.

To achieve this, TorchKM employs two techniques, both of which retain exact solutions. First, an exact cross-validation formula represents each fold through a modified response vector while keeping the kernel matrix unchanged. Second, a spectral algorithm computes a single eigendecomposition of the kernel matrix and performs all subsequent updates for different tuning parameters via matrix-vector multiplications. Together, these strategies replace the aforementioned repeated cubic-cost operations with a single $O(n^{3})$ decomposition followed by only $O(n^{2})$ operations. Full details are provided in the appendix.
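As a rough back-of-the-envelope illustration of the complexity argument above, the sketch below compares the two operation counts for the setting used in the appendix simulation ($n = 10{,}000$, 5 folds, 50 tuning values). It ignores constant factors and memory traffic, and `iters_per_fit` is a hypothetical iteration count chosen only for illustration; the paper does not report it.

```python
def naive_cost(n, n_folds, n_lambdas):
    """Rough operation count with one O(n^3) solve per (fold, lambda) pair."""
    return n_folds * n_lambdas * n**3


def reuse_cost(n, n_folds, n_lambdas, iters_per_fit):
    """Rough operation count for one O(n^3) eigendecomposition plus O(n^2) updates.

    iters_per_fit is a hypothetical average number of matrix-vector
    iterations per (fold, lambda) pair, not a figure from the paper.
    """
    return n**3 + n_folds * n_lambdas * iters_per_fit * n**2


n, n_folds, n_lambdas = 10_000, 5, 50      # appendix simulation setting
naive = naive_cost(n, n_folds, n_lambdas)
reused = reuse_cost(n, n_folds, n_lambdas, iters_per_fit=100)  # illustrative value
ratio = naive / reused                      # how many times fewer operations
```

Under these assumptions the reused-decomposition count is dominated by the $O(n^{2})$ iterations rather than by the single decomposition, so the actual gain depends on how many iterations each fit needs.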

Table 2: Comparison of scikit-learn, ThunderSVM, and TorchKM, averaged over 50 independent runs, with standard errors in parentheses and best values in bold.

As shown in Table [2](https://arxiv.org/html/2606.06742#S3.T2), TorchKM attains the lowest objective values and the shortest run times. All methods were evaluated using the same train/test splits and cross-validation folds. All objective values were evaluated at the same tuning parameter under the same objective, equation ([1](https://arxiv.org/html/2606.06742#A1.E1)) in the appendix. Time includes the full train-and-tune pipeline. In the largest setting with $n = 20{,}000$ and $p = 1{,}000$, scikit-learn did not complete within the 8-hour limit, while TorchKM completed the full training-and-tuning task in 129.3 seconds. ThunderSVM improves over the CPU-based workflow in run time, and TorchKM improves further through integrated training and tuning.

## 4 Conclusion

TorchKM is an open-source library built on algorithm–hardware co-design, implementing exact cross-validation and spectral algorithms that naturally suit GPU computation. TorchKM is easy to use through a scikit-learn-style interface for SVM, DWD, logistic regression, and quantile regression. By reducing the computational overhead of tuning, TorchKM empowers kernel learning and supports its continued success in modern structured-data applications.

## Appendix

## Appendix A Core Algorithm

In this appendix, we introduce the core algorithm behind our library TorchKM.

### A.1 Kernel SVM

In this section, we use the kernel SVM as an illustration. Given the training data $\{(y_i, \mathbf{x}_i)\}_{i=1}^{n}$, the kernel SVM can be formulated as

$$(\hat{\boldsymbol{\alpha}}, \hat{\beta}_0) = \operatorname*{argmin}_{\boldsymbol{\alpha} \in \mathbb{R}^{n},\, \beta_0 \in \mathbb{R}} \left[ \frac{1}{n} \sum_{i=1}^{n} \left(1 - y_i(\mathbf{K}_i^{\top} \boldsymbol{\alpha} + \beta_0)\right)_{+} + \lambda \boldsymbol{\alpha}^{\top} \mathbf{K} \boldsymbol{\alpha} \right], \tag{1}$$

where $(1-u)_{+} = \max\{1-u, 0\}$ is the non-differentiable SVM hinge loss and $\lambda > 0$ is a tuning parameter.

To solve the optimization problem ([1](https://arxiv.org/html/2606.06742#A1.E1)), TorchKM implements the finite smoothing algorithm (Wang and Zou, [2022](https://arxiv.org/html/2606.06742#bib.bib17)), which replaces the hinge-loss problem with a sequence of smooth optimization problems based on a $\delta$-smoothed hinge loss,

$$L_{\delta}(u) = \begin{cases} 1 - u & u \leq 1 - \delta, \\ \frac{1}{4\delta}\,[u - (1+\delta)]^{2} & 1 - \delta < u < 1 + \delta, \\ 0 & u \geq 1 + \delta, \end{cases}$$

and obtains the exact SVM solution.
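To make the piecewise definition above concrete, here is a minimal NumPy sketch of $L_{\delta}$ and its derivative, together with the ordinary hinge loss for comparison. The function names are illustrative and are not taken from the TorchKM code base.

```python
import numpy as np


def smoothed_hinge(u, delta):
    """delta-smoothed hinge loss L_delta(u), evaluated elementwise."""
    u = np.asarray(u, dtype=float)
    quad = (u - (1.0 + delta)) ** 2 / (4.0 * delta)
    return np.where(u <= 1.0 - delta, 1.0 - u,
                    np.where(u < 1.0 + delta, quad, 0.0))


def smoothed_hinge_grad(u, delta):
    """Derivative L_delta'(u): -1 on the linear piece, 0 on the flat piece."""
    u = np.asarray(u, dtype=float)
    lin = (u - (1.0 + delta)) / (2.0 * delta)
    return np.where(u <= 1.0 - delta, -1.0,
                    np.where(u < 1.0 + delta, lin, 0.0))


def hinge(u):
    """Ordinary (non-smooth) hinge loss (1 - u)_+."""
    return np.maximum(1.0 - np.asarray(u, dtype=float), 0.0)
```

The two pieces meet continuously at $u = 1 \pm \delta$ (value $\delta$ and slope $-1$ on the left, value and slope $0$ on the right), and the smoothed loss never exceeds the hinge loss by more than $\delta/4$, so shrinking $\delta$ tightens the approximation.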

The smoothed problem is then solved using proximal gradient descent. The update formula is given by

$$\binom{\beta_0^{(k+1)}}{\boldsymbol{\alpha}^{(k+1)}} - \binom{\beta_0^{(k)}}{\boldsymbol{\alpha}^{(k)}} = -\mathbf{H}_{\lambda}^{-1}(\mathbf{K}) \binom{\mathbf{1}^{\top} \boldsymbol{z}^{(k)}}{\mathbf{K} \boldsymbol{z}^{(k)} + 2\lambda \mathbf{K} \boldsymbol{\alpha}^{(k)}},$$

where $\mathbf{H}_{\lambda}(\mathbf{K}) = 2\lambda \mathbf{K} + \frac{1}{n\kappa} \mathbf{K}\mathbf{K}$ and $\boldsymbol{z}^{(k)} = (z_1, z_2, \cdots, z_n)^{\top}$ with each $z_i = y_i L_{\delta}'[y_i(\beta_0^{(k)} + \mathbf{K}_i^{\top} \boldsymbol{\alpha}^{(k)})]/n$.

From the above update formula, we see that the main computational bottleneck in training kernel SVMs is computing $\mathbf{H}_{\lambda}^{-1}(\mathbf{K})$, which has $\mathcal{O}(n^{3})$ complexity and remains costly with GPU acceleration. Additionally, cross-validation for model selection and computing solutions along the $\lambda$ path require repeated model fitting.

To illustrate the benefit of algorithm–hardware co-design, we conducted a simulation study comparing the full SVM training-and-tuning pipeline under three implementations: a standard proximal-gradient solver on CPU, the same solver on GPU, and TorchKM. We considered simulated data sets generated from a mixture-of-Gaussians model in Hastie et al. ([2009](https://arxiv.org/html/2606.06742#bib.bib4)). Let $\mu_{+}, \mu_{-} \in \mathbb{R}^{p}$ have entries equal to $\mu$ on disjoint halves of their coordinates and zero elsewhere. For each replicate, we sample centers $\mu_k^{+} \sim \mathcal{N}(\mu_{+}, I)$ and $\mu_k^{-} \sim \mathcal{N}(\mu_{-}, I)$ for $k = 1, \ldots, 10$. Positive examples are then drawn from the equal-weight mixture $\tfrac{1}{10} \sum_{k=1}^{10} \mathcal{N}(\mu_k^{+}, \sigma I)$, and negative examples from the analogous mixture on the $\mu_k^{-}$ side. We set $\mu = 2$, $\sigma = 3$, and consider $n \in \{1000, 10000\}$ with $p = 10$. For each data set, we computed the solution path over 50 regularization parameters and performed model selection using 5-fold cross-validation. As shown in Figure [1](https://arxiv.org/html/2606.06742#A1.F1), GPU acceleration of the standard solver reduces run time, but the repeated computations across cross-validation folds and tuning parameters remain a major bottleneck. In contrast, TorchKM achieves substantially larger speedups by reusing matrix computations across the training-and-tuning pipeline. When $n = 10{,}000$, TorchKM is more than two orders of magnitude faster than the standard CPU implementation and nearly two orders of magnitude faster than the standard GPU implementation. These results show that simply offloading computation to the GPU is not sufficient; substantial acceleration requires algorithms designed to exploit GPU-friendly linear algebra. Hardware and software details are provided in Section B.1.
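The data-generating design described above can be sketched as follows. This is a minimal reading of the text rather than the authors' script: it assumes balanced classes, assigns the first half of the coordinates to $\mu_{+}$ and the second half to $\mu_{-}$, and interprets $\sigma I$ as a covariance matrix (so the per-coordinate standard deviation is $\sqrt{\sigma}$).

```python
import numpy as np


def simulate_mixture(n, p=10, mu=2.0, sigma=3.0, n_centers=10, seed=0):
    """Draw n labelled points from the two-class mixture-of-Gaussians design."""
    rng = np.random.default_rng(seed)
    half = p // 2
    mu_pos = np.zeros(p)
    mu_neg = np.zeros(p)
    mu_pos[:half] = mu          # first half of the coordinates (assumption)
    mu_neg[half:] = mu          # remaining, disjoint coordinates

    # Ten centers per class, drawn around the class mean with identity covariance.
    centers_pos = rng.normal(mu_pos, 1.0, size=(n_centers, p))
    centers_neg = rng.normal(mu_neg, 1.0, size=(n_centers, p))

    y = np.where(np.arange(n) % 2 == 0, 1, -1)     # balanced labels (assumption)
    which = rng.integers(n_centers, size=n)         # equal-weight mixture component
    centers = np.where((y == 1)[:, None], centers_pos[which], centers_neg[which])
    # sigma * I is read as a covariance, so the per-coordinate std is sqrt(sigma).
    X = centers + np.sqrt(sigma) * rng.standard_normal((n, p))
    return X, y
```

Calling `simulate_mixture(1000)` or `simulate_mixture(10000)` gives data sets of the two sizes considered in the study, with labels coded as $\pm 1$.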

![Refer to caption](https://arxiv.org/html/2606.06742v1/runtime1000.png)\(a\)
![Refer to caption](https://arxiv.org/html/2606.06742v1/runtime10000.png)\(b\)

Figure 1: Run time for standard SVM training and tuning under CPU and GPU computation on simulated data sets. The standard SVM was fitted by proximal gradient descent, and model selection was performed by 5-fold cross-validation over 50 regularization parameter values. Panel (a) shows results for a data set of size $n = 1{,}000$, and panel (b) shows results for a data set of size $n = 10{,}000$. Lower run time indicates better computational performance. Because model fitting is repeated across cross-validation folds and along the regularization path, GPU offloading alone provides only limited speedup.

To address these challenges, we introduce two strategies in the following sections: an efficient exact cross-validation method and a spectral algorithm that avoids repeated fitting when computing the solution path. Although the discussion above focuses on kernel SVM, we use it as a representative example. For kernel methods with smooth losses, the smoothing step is unnecessary, so the corresponding derivations are more direct.

### A.2 Exact Cross-Validation Formula

Traditional cross-validation methods require the model to be retrained separately for each fold. This repeated fitting is computationally expensive and can even become infeasible when the data set is large or when leave-one-out cross-validation is required. To make cross-validation practical for large-scale kernel classification, Wang and Zou ([2022](https://arxiv.org/html/2606.06742#bib.bib17)) introduced a strategy that avoids repeatedly retraining models. Given a loss function $L(\cdot)$, let $[v]$ denote the left-out fold with $m$ observations, and define $\mathbf{K}_{[v]} \in \mathbb{R}^{(n-m) \times (n-m)}$ as the corresponding kernel matrix with the $i$-th rows and columns removed for $i \in [v]$. For each tuning parameter $\lambda$, the cross-validation solution obtained by removing the set $[v]$ from the training set is given by

$$(\hat{\boldsymbol{\alpha}}^{[-v]}, \hat{\beta}_0^{[-v]}) = \operatorname*{argmin}_{\boldsymbol{\alpha} \in \mathbb{R}^{n-m},\, \beta_0 \in \mathbb{R}} \left[ \frac{1}{n-m} \sum_{j \notin [v]} L\left(y_j(\mathbf{K}_{[v],j}^{\top} \boldsymbol{\alpha} + \beta_0)\right) + \lambda \boldsymbol{\alpha}^{\top} \mathbf{K}_{[v]} \boldsymbol{\alpha} \right].$$

Instead of refitting a separate model for each fold, the optimization is reformulated so that the same kernel matrix can be reused, with only minor modifications to the response vector. In particular, define $\tilde{\boldsymbol{y}}^{[v]}$ by letting $\tilde{y}_j^{[v]} = 0$ for $j \in [v]$ and $\tilde{y}_j^{[v]} = y_j$ for all $j \notin [v]$. Then, by Lemma 2 of Wang and Zou ([2022](https://arxiv.org/html/2606.06742#bib.bib17)), it holds that

$$(\tilde{\boldsymbol{\alpha}}^{[-v]}, \hat{\beta}_0^{[-v]}) = \operatorname*{argmin}_{\boldsymbol{\alpha} \in \mathbb{R}^{n},\, \beta_0 \in \mathbb{R}} \left[ \frac{1}{n-m} \sum_{j=1}^{n} L\left(\tilde{y}_j^{[v]}(\mathbf{K}_j^{\top} \boldsymbol{\alpha} + \beta_0)\right) + \lambda \boldsymbol{\alpha}^{\top} \mathbf{K} \boldsymbol{\alpha} \right],$$

and $\hat{\boldsymbol{\alpha}}^{[-v]}$ can be obtained from $\tilde{\boldsymbol{\alpha}}^{[-v]}$ by removing the entries corresponding to the held-out observations. Consequently, the exact cross-validation formula allows us to work with the same kernel matrix and a slightly modified response $\tilde{y}_j^{[v]}$. We compute and store the inverse of $\mathbf{H}_{\lambda}(\mathbf{K})$ only once, avoiding the need to invert the matrix $V$ times.
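The mechanics of the response-masking trick can be sketched in a few lines of NumPy. The sketch below does not prove the lemma; it only shows that, once the held-out responses are set to zero, those observations contribute exactly zero to the gradient vector $\boldsymbol{z}$ while the full kernel matrix is left untouched. Function names are illustrative, and the $1/n$ scaling follows the definition of $z_i$ used in the update formula.

```python
import numpy as np


def masked_response(y, fold_idx):
    """Return y-tilde: a copy of y with the held-out entries set to zero."""
    y_tilde = np.asarray(y, dtype=float).copy()
    y_tilde[fold_idx] = 0.0
    return y_tilde


def smoothed_hinge_grad(u, delta):
    """Derivative of the delta-smoothed hinge loss, elementwise."""
    u = np.asarray(u, dtype=float)
    lin = (u - (1.0 + delta)) / (2.0 * delta)
    return np.where(u <= 1.0 - delta, -1.0,
                    np.where(u < 1.0 + delta, lin, 0.0))


def fold_gradient_vector(K, y_tilde, alpha, beta0, delta):
    """z_j = y~_j * L_delta'(y~_j * (K_j^T alpha + beta0)) / n for all j."""
    n = K.shape[0]
    margins = y_tilde * (K @ alpha + beta0)
    return y_tilde * smoothed_hinge_grad(margins, delta) / n
```

Because only `y_tilde` changes from fold to fold, any quantity computed from `K` alone (such as its eigendecomposition) can be shared by every fold.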

Algorithm 1: Exact leave-one-out cross-validation for kernel SVM in TorchKM (cvksvm)

Input: $\boldsymbol{y}$, $\mathbf{K}$, and the $\lambda$ sequence.

1. Initialize $\delta$. The smooth function is $L_{\delta}$. Let $\kappa = 2\delta$.
2. Initialize $\tilde{\boldsymbol{\alpha}}^{[-i]} = \mathbf{0}$ for $i = 1, \dots, n$.
3. Compute the eigendecomposition $\mathbf{K} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^{\top}$.
4. for each tuning parameter $\lambda_{\ell}$, $\ell = 1, 2, \ldots, L$, do
5. repeat
6. Compute $\boldsymbol{\Pi}_{\lambda_{\ell}} = 2\lambda_{\ell} \boldsymbol{\Lambda} + 1/(n\kappa)\, \boldsymbol{\Lambda} \boldsymbol{\Lambda}$.
7. Compute $\mathbf{v} = \mathbf{U} \boldsymbol{\Lambda} \boldsymbol{\Pi}_{\lambda_{\ell}}^{-1} \mathbf{U}^{\top} \mathbf{1}$ and $g = 1/\bigl(n \mathbf{1}^{\top} \mathbf{U} \boldsymbol{\Lambda} \boldsymbol{\Pi}_{\lambda_{\ell}}^{-1} \boldsymbol{\Lambda} \mathbf{U}^{\top} \mathbf{1}\bigr)$.
8. for $i = 1, \dots, n$ do
9. Let $\tilde{y}_j^{[-i]} = y_j$ if $j \neq i$, and $\tilde{y}_i^{[-i]} = 0$.
10. Set $(\bar{\boldsymbol{\alpha}}, \bar{\beta}_0) \leftarrow (\tilde{\boldsymbol{\alpha}}^{[-i]}, \tilde{\beta}_0^{[-i]})$ and $(\boldsymbol{\alpha}', \beta_0') \leftarrow (\tilde{\boldsymbol{\alpha}}^{[-i]}, \tilde{\beta}_0^{[-i]})$.
11. Let $r = 1$.
12. repeat
13. Compute $r' = \frac{1 + \sqrt{1 + 4r^{2}}}{2}$.
14. Update $(\bar{\boldsymbol{\alpha}}, \bar{\beta}_0) \leftarrow (\tilde{\boldsymbol{\alpha}}^{[-i]}, \tilde{\beta}_0^{[-i]}) + \frac{r-1}{r'} \left\{ (\tilde{\boldsymbol{\alpha}}^{[-i]}, \tilde{\beta}_0^{[-i]}) - (\boldsymbol{\alpha}', \beta_0') \right\}$.
15. Let $\bar{\boldsymbol{z}} = (\bar{z}_1, \ldots, \bar{z}_n)^{\top}$, with $\bar{z}_j = \tilde{y}_j^{[-i]} L_{\delta}'[\tilde{y}_j^{[-i]}(\mathbf{K}_j^{\top} \bar{\boldsymbol{\alpha}} + \bar{\beta}_0)]/n$.
16. From right to left, avoiding matrix multiplications, compute $\Delta\boldsymbol{\alpha} = g\bigl(-(\mathbf{1}^{\top} \bar{\boldsymbol{z}})\mathbf{v} + \mathbf{v}\mathbf{v}^{\top} \mathbf{K}(\bar{\boldsymbol{z}} + 2\lambda_{\ell} \bar{\boldsymbol{\alpha}})\bigr) + \mathbf{U} \boldsymbol{\Lambda} \boldsymbol{\Pi}_{\lambda_{\ell}}^{-1} \mathbf{U}^{\top} (\bar{\boldsymbol{z}} + 2\lambda_{\ell} \bar{\boldsymbol{\alpha}})$.
17. Compute $\Delta\beta_0 = g(\mathbf{1}^{\top} \bar{\boldsymbol{z}}) - g \mathbf{v}^{\top} \mathbf{K}(\bar{\boldsymbol{z}} + 2\lambda_{\ell} \bar{\boldsymbol{\alpha}})$.
18. Update $(\boldsymbol{\alpha}', \beta_0') \leftarrow (\tilde{\boldsymbol{\alpha}}^{[-i]}, \tilde{\beta}_0^{[-i]})$.
19. Update $(\tilde{\boldsymbol{\alpha}}^{[-i]}, \tilde{\beta}_0^{[-i]}) \leftarrow (\bar{\boldsymbol{\alpha}} - \Delta\boldsymbol{\alpha}, \bar{\beta}_0 - \Delta\beta_0)$.
20. Update $r \leftarrow r'$.
21. until the convergence condition is met.
22. end for
23. Shrink $\delta = \eta\delta$, where $\eta = 0.125$. Update $L_{\delta}$ and $\kappa = 2\delta$.
24. until the KKT conditions of all SVM models are satisfied.
25. end for

### A.3 Spectral Algorithm

Even with the exact cross-validation formula, we still need to solve the optimization problem along the solution path for each tuning parameter $\lambda$. Recall that the kernel SVM training updates take the following form:

$$\binom{\beta_0^{(k+1)}}{\boldsymbol{\alpha}^{(k+1)}} - \binom{\beta_0^{(k)}}{\boldsymbol{\alpha}^{(k)}} = -\mathbf{H}_{\lambda}^{-1}(\mathbf{K})\, \mathbf{d}^{(k)}, \quad \text{where } \mathbf{d}^{(k)} = \binom{\mathbf{1}^{\top} \boldsymbol{z}^{(k)}}{\mathbf{K} \boldsymbol{z}^{(k)} + 2\lambda \mathbf{K} \boldsymbol{\alpha}^{(k)}}$$

and $\mathbf{H}_{\lambda}(\mathbf{K}) = 2\lambda \mathbf{K} + \frac{1}{n\kappa} \mathbf{K}\mathbf{K}$. Each step would require a new inversion of $\mathbf{H}_{\lambda}(\mathbf{K})$ for every $\lambda$, whose complexity is $O(n^{3})$.

To address this, TorchKM applies a spectral algorithm: it first computes the eigendecomposition $\mathbf{K} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^{\top}$. For each $\lambda$, define $\boldsymbol{\Pi}_{\lambda} = 2\lambda \boldsymbol{\Lambda} + \frac{1}{n\kappa} \boldsymbol{\Lambda} \boldsymbol{\Lambda}$. Then $\mathbf{H}_{\lambda}^{-1}$ can be written as

$$\mathbf{H}_{\lambda}^{-1}(\mathbf{K}) = g \begin{pmatrix} 1 \\ -\mathbf{v} \end{pmatrix} \begin{pmatrix} 1 & -\mathbf{v}^{\top} \end{pmatrix} + \begin{pmatrix} 0 & \mathbf{0}^{\top} \\ \mathbf{0} & \mathbf{U} \boldsymbol{\Pi}_{\lambda}^{-1} \mathbf{U}^{\top} \end{pmatrix},$$

where $\mathbf{v} = \mathbf{U} \boldsymbol{\Lambda} \boldsymbol{\Pi}_{\lambda}^{-1} \mathbf{U}^{\top} \mathbf{1}$ and $g = 1/\bigl(n \mathbf{1}^{\top} \mathbf{U} \boldsymbol{\Lambda} \boldsymbol{\Pi}_{\lambda}^{-1} \boldsymbol{\Lambda} \mathbf{U}^{\top} \mathbf{1}\bigr)$. At the $k$th iteration, we compute $\mathbf{H}_{\lambda}^{-1}(\mathbf{K}) \mathbf{d}^{(k)}$ from right to left with only $O(n^{2})$ complexity.

Consequently, after a single $O(n^{3})$ eigendecomposition of $\mathbf{K}$, the subsequent computations along the $\lambda$-path reduce to $O(n^{2})$ matrix–vector multiplications. This structure is well suited to GPU parallelism and is a key reason for the computational efficiency of TorchKM.
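The core of the spectral idea can be checked numerically with a short NumPy sketch. It covers only the $n \times n$ block $\mathbf{U} \boldsymbol{\Pi}_{\lambda}^{-1} \mathbf{U}^{\top}$, which equals the inverse of $2\lambda \mathbf{K} + \frac{1}{n\kappa} \mathbf{K}\mathbf{K}$ when $\mathbf{K}$ is positive definite; the intercept row and column and the rank-one correction involving $g$ and $\mathbf{v}$ are deliberately left out. The helper names and the RBF test kernel are assumptions made for illustration, not TorchKM code.

```python
import numpy as np


def rbf_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)


def spectral_solves(K, lambdas, b, kappa):
    """Solve (2*lam*K + K@K/(n*kappa)) x = b for every lam with one eigh call."""
    n = K.shape[0]
    evals, U = np.linalg.eigh(K)           # single O(n^3) step
    Utb = U.T @ b                          # O(n^2), shared by all lambdas
    out = []
    for lam in lambdas:
        pi = 2.0 * lam * evals + evals**2 / (n * kappa)   # diagonal of Pi_lambda
        out.append(U @ (Utb / pi))         # O(n^2) per lambda
    return out


def direct_solve(K, lam, b, kappa):
    """Reference: form the n x n matrix and call a dense O(n^3) solver."""
    n = K.shape[0]
    return np.linalg.solve(2.0 * lam * K + K @ K / (n * kappa), b)
```

Comparing `spectral_solves` against `direct_solve` on a small kernel matrix gives matching results up to floating-point error, while the per-$\lambda$ work in the spectral version involves only matrix–vector products.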

### A.4 The TorchKM Algorithm

With the exact cross-validation formula and the spectral algorithm, we implement the TorchKM algorithm as detailed in Algorithm 1. For illustration, we present the leave-one-out cross-validation version for kernel SVM over a sequence of tuning parameters $\lambda_1, \lambda_2, \ldots, \lambda_L$. To further accelerate computation along the regularization path, we use warm starts: the solution obtained at $\lambda_{\ell}$ is used as the initial value for solving the problem at $\lambda_{\ell+1}$. We also incorporate Nesterov’s acceleration (Nesterov, [1983](https://arxiv.org/html/2606.06742#bib.bib9)) to speed up convergence.

Algorithm 1 presents the leave-one-out case for notational simplicity. For $V$-fold cross-validation, the held-out index $i$ is replaced by a held-out fold, and the same spectral quantities can be used to compute the fold-wise validation loss values.
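The Nesterov acceleration used in Algorithm 1 is driven by the scalar recursion in steps 13 and 20 and the extrapolation weight $(r-1)/r'$ in step 14. The following small sketch, with an illustrative function name, lists those weights for the first few inner iterations.

```python
import math


def momentum_weights(n_steps):
    """Extrapolation weights (r - 1) / r' from steps 13, 14 and 20 of Algorithm 1."""
    r = 1.0
    weights = []
    for _ in range(n_steps):
        r_next = (1.0 + math.sqrt(1.0 + 4.0 * r * r)) / 2.0   # step 13
        weights.append((r - 1.0) / r_next)                    # weight used in step 14
        r = r_next                                            # step 20
    return weights
```

The first weight is zero, so the first inner iteration is a plain update; the weights then increase towards one, which means later iterations extrapolate more aggressively along the direction of the previous change.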

### A.5 Nyström Kernel Approximation

So far, we have described how TorchKM provides exact solutions in the full-kernel setting. For large-scale problems, TorchKM also supports the Nyström approximation. Working directly with a full $n \times n$ kernel matrix quickly becomes infeasible when $n$ is large. The Nyström method addresses this issue by constructing a compressed representation of the kernel using only a subset of the training points. TorchKM adopts a customized Nyström kernel approximation framework that avoids recomputing the approximation for every regularization parameter and every cross-validation fold, as is typically required in conventional Nyström methods. TorchKM can perform a single unified Nyström approximation of $\mathbf{K}$ outside the $\lambda$-path and cross-validation loops, making model training efficient.
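For readers unfamiliar with the method, the sketch below shows the generic textbook Nyström construction, $\mathbf{K} \approx \mathbf{K}_{nm} \mathbf{K}_{mm}^{+} \mathbf{K}_{mn}$, written as a low-rank factor $\mathbf{Z}$ with $\mathbf{K} \approx \mathbf{Z}\mathbf{Z}^{\top}$. It is not TorchKM's customized framework, whose details are not given here; it only illustrates why a factor computed once from a set of landmark points can be reused across tuning parameters and folds. The function names and the RBF kernel are assumptions.

```python
import numpy as np


def rbf_cross_kernel(A, B, gamma):
    """RBF kernel between the rows of A and the rows of B."""
    d2 = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))


def nystrom_factor(X, landmark_idx, gamma):
    """Return Z (n x m) with K approximately equal to Z @ Z.T."""
    K_nm = rbf_cross_kernel(X, X[landmark_idx], gamma)
    K_mm = rbf_cross_kernel(X[landmark_idx], X[landmark_idx], gamma)
    evals, V = np.linalg.eigh(K_mm)
    keep = evals > 1e-10 * evals.max()       # drop numerically null directions
    return K_nm @ (V[:, keep] / np.sqrt(evals[keep]))
```

With $m$ landmarks the factor costs $O(nm^{2})$ to build and $O(nm)$ to store, compared with $O(n^{2})$ storage for the full kernel; when every training point is used as a landmark, the construction reproduces $\mathbf{K}$ up to rounding.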

### A.6 Multiclass Classification

Unlike ThunderSVM, which handles multiclass classification (Wen et al., [2018a](https://arxiv.org/html/2606.06742#bib.bib18)), the current version of TorchKM focuses only on binary classification, and multiclass problems need to be handled externally through the standard one-vs-rest or one-vs-one approaches. Native multiclass support is a natural direction for future development; for example, multicategory kernel DWD provides a potential route for those kernel machines (Wang and Zou, [2019](https://arxiv.org/html/2606.06742#bib.bib16)).
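As an illustration of what handling multiclass problems externally can look like, here is a minimal one-vs-rest wrapper around an arbitrary binary learner. The `centroid_scorer` below is a deliberately simple stand-in, not a kernel machine and not part of TorchKM; in practice it would be replaced by a function that fits a binary TorchKM model and returns its decision scores, whose accessor name is not specified here.

```python
import numpy as np


def one_vs_rest_fit(X, y, fit_binary):
    """Fit one binary scorer per class; fit_binary(X, y_pm1) returns a score function."""
    classes = np.unique(y)
    scorers = [fit_binary(X, np.where(y == c, 1, -1)) for c in classes]
    return classes, scorers


def one_vs_rest_predict(X, classes, scorers):
    """Predict the class whose binary scorer is most confident."""
    scores = np.column_stack([score(X) for score in scorers])
    return classes[np.argmax(scores, axis=1)]


def centroid_scorer(X, y_pm1):
    """Stand-in binary learner (NOT a kernel machine): compares centroid distances."""
    mu_pos = X[y_pm1 == 1].mean(axis=0)
    mu_neg = X[y_pm1 == -1].mean(axis=0)
    return lambda Z: (np.sum((Z - mu_neg) ** 2, axis=1)
                      - np.sum((Z - mu_pos) ** 2, axis=1))
```

One-vs-rest needs one binary fit per class, whereas one-vs-one needs one fit per pair of classes; either scheme can reuse the same binary training routine unchanged.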

![Refer to caption](https://arxiv.org/html/2606.06742v1/platt.png)

Figure 2: Reliability curve for Platt-calibrated probabilities on the independent test set. The plotted curve compares observed frequencies with predicted probabilities, and the dashed 45° line denotes perfect calibration. Curves closer to the diagonal indicate better calibration. The expected calibration error (ECE) is 0.044 and the Brier score is 0.130; smaller values indicate better calibration, suggesting good overall agreement between predicted probabilities and observed frequencies.

### A.7 Platt Scaling Plot

TorchKM also provides calibrated probability estimates in addition to class predictions. Platt scaling is a probability calibration method that transforms a model’s raw decision scores into estimated class probabilities by fitting a sigmoid function. Its main purpose is to improve the interpretability and reliability of the classifier. In this section, we used the same binary classification setting as in the previous section. The training set consisted of $n = 10000$ observations, and the test set consisted of 1000 observations, each with $p = 10$ predictors. The RBF-kernel SVM was fitted over a sequence of 50 regularization parameter values. The Platt calibration plot was then used to evaluate how well the calibrated probabilities align with the observed outcome frequencies. We evaluated the calibration performance on an independent test set using a reliability curve, the expected calibration error (ECE), and the Brier score. Figure [2](https://arxiv.org/html/2606.06742#A1.F2) shows the reliability curve for Platt-calibrated probabilities on the independent test set. The dashed line denotes perfect calibration. The curve remains close to the identity line overall, with an ECE of 0.044 and a Brier score of 0.130, which indicates good overall calibration.
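The two summary metrics quoted above can be computed as in the following sketch. The Brier score is the standard mean squared error between predicted probabilities and 0/1 outcomes. For the ECE, this sketch assumes the reliability-curve variant with 10 equal-width probability bins; the binning scheme behind the reported value is not stated in the text, so the bin count is an assumption.

```python
import numpy as np


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))


def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Weighted mean gap between observed frequency and mean predicted probability."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    # Equal-width bins on [0, 1]; p = 1.0 is placed in the last bin.
    bin_ids = np.minimum((p_pred * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(y_true[in_bin].mean() - p_pred[in_bin].mean())
            ece += in_bin.mean() * gap      # weight by the share of points in the bin
    return float(ece)
```

Both functions take 0/1 labels and predicted probabilities for the positive class, for example the second column of a `predict_proba` output in the scikit-learn convention.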

## Appendix B Additional Experiment Results

### B\.1 Experimental Setup

#### Hardware and software environment\.

All timing experiments reported in Tables [2](https://arxiv.org/html/2606.06742#S3.T2)–[4](https://arxiv.org/html/2606.06742#A2.T4) were carried out on a single workstation equipped with one NVIDIA L40S GPU with 48 GB of memory, an AMD EPYC 9334 \(32\-Core\) CPU, and 768 GB of system RAM, running Ubuntu 22\.04 with CUDA 12\.1\. TorchKM was executed under Python 3\.11 with PyTorch 2\.4\.1 \(CUDA build\)\. The scikit\-learn baseline used scikit\-learn 1\.1\.3 together with NumPy 1\.25\.2 and SciPy 1\.9\.3 on the CPU\. ThunderSVM was built from the official release 0\.3\.4 against CUDA 12\.1 and run on the same GPU as TorchKM\.

All run times are wall\-clock times measured end\-to\-end for the full training\-and\-tuning pipeline, averaged over 50 runs in Table [2](https://arxiv.org/html/2606.06742#S3.T2) and 10 runs in Tables [3](https://arxiv.org/html/2606.06742#A2.T3)–[4](https://arxiv.org/html/2606.06742#A2.T4)\.

Table 3: Classification accuracy \(Acc\) and run time \(Time, in seconds\) for TorchKM and ThunderSVM on the a7a, a8a, and w7a benchmark classification data sets\. Numbers in parentheses give the sample size of each data set\. Both TorchKM and ThunderSVM use the full\-kernel SVM solver without a Nyström approximation\. Higher accuracy and lower run time indicate better performance\. Across the three data sets, TorchKM attains better or matched accuracy and shorter run time\.

#### Tuning\-parameter grid and cross\-validation\.

For all libraries, we used a grid of $50$ candidate regularization values, spaced log\-uniformly over $[-3,3]$ on the $\log_{10}$ scale, corresponding to $C\in[10^{-3},10^{3}]$ under the scikit\-learn/LIBSVM parameterization\. Equivalently, one may use the $\lambda$\-parameterization by converting each value of $C$ via

$$C=\frac{1}{2n\lambda}.$$

For TorchKM, model selection in Tables [2](https://arxiv.org/html/2606.06742#S3.T2)–[4](https://arxiv.org/html/2606.06742#A2.T4) was performed using $10$\-fold cross\-validation\. The competing libraries were also tuned using $10$\-fold cross\-validation via GridSearchCV or cross\_val\_score\.
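The grid and the conversion between the two parameterizations can be sketched in a few lines of Python\. This is only an illustration of the formula above: the helper names `c_grid`, `c_to_lambda`, and `lambda_to_c` are hypothetical, the base\-10 exponent range is inferred from the stated interval for $C$, and the sample size `n_train` is an arbitrary example value\.

```python
import math


def c_grid(n_values=50, log10_min=-3.0, log10_max=3.0):
    """Return n_values points equally spaced in log10(C)."""
    step = (log10_max - log10_min) / (n_values - 1)
    return [10.0 ** (log10_min + i * step) for i in range(n_values)]


def c_to_lambda(c, n):
    """Invert C = 1 / (2 * n * lambda) to get lambda from C."""
    return 1.0 / (2.0 * n * c)


def lambda_to_c(lam, n):
    """Apply C = 1 / (2 * n * lambda)."""
    return 1.0 / (2.0 * n * lam)


n_train = 10000  # example sample size, chosen only for illustration
grid_c = c_grid()
grid_lambda = [c_to_lambda(c, n_train) for c in grid_c]
print(grid_c[0], grid_c[-1])
print(grid_lambda[0], grid_lambda[-1])
```

Because $\lambda$ is inversely proportional to $C$, an increasing grid in $C$ corresponds to a decreasing grid in $\lambda$, and the $\lambda$ values depend on the training sample size $n$\.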

#### Benchmark data\.

Each data set was used in its standard LIBSVM\-format release without further preprocessing beyond feature scaling to $[-1,1]$ where provided\.

### B\.2 Benchmark Results

In this section, we demonstrate the performance of TorchKM in terms of classification accuracy and run time on three benchmark data sets: a7a, a8a, and w7a, which contain 16,100, 22,696, and 24,692 samples, respectively\. Table [3](https://arxiv.org/html/2606.06742#A2.T3) shows that TorchKM attains better or matched predictive accuracy on all three data sets while reducing run time\. This illustrates the advantage of TorchKM as an efficient and effective kernel learning approach\.

For the larger\-scale evaluation, we used five benchmark data sets: a9a, w8a, ijcnn, covtype, and MNIST8m \(4 vs 6\), which contain 32,561, 49,749, 49,990, 581,012, and 1,270,178 samples, respectively\. These data sets were selected to extend the comparison to larger classification problems and to examine the scalability of the solvers\. We evaluated both the predictive performance and the tuning efficiency of TorchKM with the Nyström approximation and of the Nyström approximation implemented in scikit\-learn\.

Table [4](https://arxiv.org/html/2606.06742#A2.T4) presents the comparison of these solvers on the larger benchmark data sets in terms of accuracy and run time\. The improvement of TorchKM becomes even more pronounced on MNIST8m \(4 vs 6\), where it obtains the higher accuracy of 0\.997 in just 64\.647 seconds, compared with 0\.996 in 5189\.03 seconds for scikit\-learn with Nyström, a roughly 80\-fold reduction in run time\. These results indicate that TorchKM delivers higher accuracy on all five data sets and scales far more efficiently on large\-scale data sets\.

Table 4: Classification accuracy \(Acc\) and run time \(Time, in seconds\) for cvknyssvm in TorchKM and the Nyström approximation in scikit\-learn on the a9a, w8a, ijcnn, covtype, and MNIST8m \(4 vs 6\) benchmark classification data sets\. Numbers in parentheses give the sample size of each data set\. Higher accuracy and lower run time indicate better performance\. Across all five data sets, TorchKM attains higher accuracy and shorter run time, with the largest gain appearing on MNIST8m \(4 vs 6\)\.

## References

- Cervantes et al\. \(2020\) Jair Cervantes, Farid Garcia\-Lamont, Lisbeth Rodríguez\-Mazahua, and Asdrubal Lopez\. A comprehensive survey on support vector machine classification: Applications, challenges and trends\. *Neurocomputing*, 408:189–215, 2020\.
- Chang and Lin \(2011\) Chih\-Chung Chang and Chih\-Jen Lin\. LIBSVM: A library for support vector machines\. *ACM Transactions on Intelligent Systems and Technology \(TIST\)*, 2\(3\):1–27, 2011\.
- Cortes and Vapnik \(1995\) Corinna Cortes and Vladimir Vapnik\. Support\-vector networks\. *Machine Learning*, 20\(3\):273–297, 1995\.
- Hastie et al\. \(2009\) Trevor Hastie, Robert Tibshirani, and Jerome Friedman\. *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*\. Springer, New York, 2nd edition, 2009\.
- Jiang et al\. \(2021\) Jiantong Jiang, Zeyi Wen, Zeke Wang, Bingsheng He, and Jian Chen\. Parallel and distributed structured SVM training\. *IEEE Transactions on Parallel and Distributed Systems*, 33\(5\):1084–1096, 2021\.
- Karatzoglou et al\. \(2004\) Alexandros Karatzoglou, Alexandros Smola, Kurt Hornik, and Achim Zeileis\. kernlab – an S4 package for kernel methods in R\. *Journal of Statistical Software*, 11:1–20, 2004\.
- Koenker and Hallock \(2001\) Roger Koenker and Kevin F Hallock\. Quantile regression\. *Journal of Economic Perspectives*, 15\(4\):143–156, 2001\.
- Marron et al\. \(2007\) James Stephen Marron, Michael J Todd, and Jeongyoun Ahn\. Distance\-weighted discrimination\. *Journal of the American Statistical Association*, 102\(480\):1267–1271, 2007\.
- Nesterov \(1983\) Yurii Nesterov\. A method for solving the convex programming problem with convergence rate $O(1/k^{2})$\. 269\(3\):543–547, 1983\.
- Nickolls et al\. \(2008\) John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron\. Scalable parallel programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for? *Queue*, 6\(2\):40–53, 2008\.
- Pedregosa et al\. \(2011\) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al\. Scikit\-learn: Machine learning in Python\. *Journal of Machine Learning Research*, 12:2825–2830, 2011\.
- Platt \(1999\) John Platt\. Fast training of support vector machines using sequential minimal optimization\. *Advances in Kernel Methods: Support Vector Learning*, pages 185–208, 1999\.
- Tang et al\. \(2026\) Qian Tang, Yuwen Gu, and Boxiang Wang\. fastkqr: A fast algorithm for kernel quantile regression\. *Journal of Computational and Graphical Statistics*, 35\(1\):395–405, 2026\.
- Wang and Zou \(2016\) Boxiang Wang and Hui Zou\. Sparse distance weighted discrimination\. *Journal of Computational and Graphical Statistics*, 25\(3\):826–838, 2016\.
- Wang and Zou \(2018\) Boxiang Wang and Hui Zou\. Another look at distance\-weighted discrimination\. *Journal of the Royal Statistical Society Series B: Statistical Methodology*, 80\(1\):177–198, 2018\.
- Wang and Zou \(2019\) Boxiang Wang and Hui Zou\. A multicategory kernel distance weighted discrimination method for multiclass classification\. *Technometrics*, 61\(3\):396–408, 2019\.
- Wang and Zou \(2022\) Boxiang Wang and Hui Zou\. Fast and exact leave\-one\-out analysis of large\-margin classifiers\. *Technometrics*, 64\(3\):291–298, 2022\.
- Wen et al\. \(2018a\) Zeyi Wen, Jiashuai Shi, Bingsheng He, Jian Chen, and Yawen Chen\. Efficient multi\-class probabilistic SVMs on GPUs\. *IEEE Transactions on Knowledge and Data Engineering*, 31\(9\):1693–1706, 2018a\.
- Wen et al\. \(2018b\) Zeyi Wen, Jiashuai Shi, Qinbin Li, Bingsheng He, and Jian Chen\. ThunderSVM: A fast SVM library on GPUs and CPUs\. *Journal of Machine Learning Research*, 19\(21\):1–5, 2018b\.
