A polynomial autoencoder beats PCA on transformer embeddings

Hacker News Top 05/05/26, 11:31 AM Tools

Summary

This article introduces a polynomial autoencoder that improves upon PCA for compressing transformer embeddings by using a quadratic decoder to capture nonlinear variance. Benchmarks on BEIR show it significantly outperforms standard PCA and Matryoshka embeddings in retrieval quality while maintaining high compression ratios.

No content available

Original Article

View Cached Full Text

Cached at: 05/08/26, 08:27 AM

# Polynomial autoencoder Source: [https://ivanpleshkov.dev/blog/polynomial-autoencoder/](https://ivanpleshkov.dev/blog/polynomial-autoencoder/) The most direct way to compress an embedding \(other than quantization\) is to fit PCA on the corpus and keep the top\-d eigenvectors\. It works, but PCA is a linear projection, and neural\-network embeddings on the sphere are structurally nonlinear — the well\-known**cone effect**in transformers\. Some of the variance lives in a nonlinear tail that a linear decoder can’t reach\. This post is about a closed\-form way to add a**quadratic**decoder on top of PCA, to capture part of that nonlinear tail\. The encoder stays as plain PCA\. The decoder is a degree\-2 polynomial lift plus Ridge OLS \(ordinary linear regression with L2 regularization\), also closed\-form\. No SGD, no epochs, no hyperparameter search\. One`np\.linalg\.solve`over corpus statistics\. The construction itself isn’t mine\. “PCA encoder \+ quadratic decoder \+ least\-squares fit” appears in the dynamical\-systems literature under the name**quadratic manifold**\(see[Jain 2017](https://arxiv.org/abs/1610.09902),[Geelen\-Willcox 2022](https://arxiv.org/abs/2205.02304)/[2023](https://arxiv.org/abs/2306.13748),[Schwerdtner\-Peherstorfer 2024\+](https://arxiv.org/abs/2403.06732)— more in[§9](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#9-where-the-method-came-from-and-where-its-already-used)\)\. I only stumbled onto these papers after running my experiments and believing the construction was new\. The polynomial lift doesn’t come up much in modern ML conversations, and this post is a note about a useful trick from an adjacent discipline carrying over to retrieval\. The concrete result\. BEIR/FiQA,`mxbai\-embed\-large\-v1`\(1024d\), per\-vector budget 512 bytes\. Metric is**NDCG@10**\(Normalized Discounted Cumulative Gain over the top\-10, the standard retrieval ranking\-quality measure; range \[0, 1\], higher is better\): methodNDCG@10Δ vs raw 1024draw 1024d \(4096 bytes\)0\.4525—PCA top\-2560\.4168\-3\.58 p\.p\.**poly\-AE 256d****0\.4441****\-0\.85 p\.p\.**matryoshka top\-2560\.4039\-4\.86 p\.p\.PCA already gives 4× per\-vector memory compression at \-3\.58 p\.p\. NDCG\. A quadratic decoder on top of PCA pulls another**\+2\.73 p\.p\.**— closing almost the entire gap to raw, at the same byte budget\. Matryoshka is in the table as another familiar baseline \(here it drops more than PCA — a known side observation, not the central claim of this post; see[§3](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#3-the-headline-table--four-models)footnote and[§4](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#4-where-the-quadratic-decoder-actually-helps)\)\. Measured on four models \(nomic\-v1\.5, mxbai\-large, bge\-base, e5\-base\)\. Poly\-AE over PCA: \+1 to \+4\.4 p\.p\. at d=128, \+0\.03 to \+2\.7 p\.p\. at d=256\. Full table in[§3](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#3-the-headline-table--four-models)\. Full implementation — ~150 lines of numpy, MIT, repository[github\.com/IvanPleshkov/poly\-autoencoder](https://github.com/IvanPleshkov/poly-autoencoder)\. The BEIR eval script is`beir\_eval\.py`in the same repo\. Reproduces in 30–40 minutes on an M\-series MacBook \(10–15 min to encode the corpus, ~15 min for the Ridge solve at d=256\)\. **Contents:** - [§1 what we’re comparing](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#1-what-were-comparing) - [§2 setup](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#2-experimental-setup) - [§3 headline table](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#3-the-headline-table--four-models) - [§4 where the quadratic decoder helps](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#4-where-the-quadratic-decoder-actually-helps) - [§5 why linear projection loses](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#5-why-a-linear-projection-loses-information) - [§6 method](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#6-a-polynomial-decoder-via-a-linear-lift) - [§7 small\-corpus caveat](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#7-the-small-corpus-measurement-and-in-sample-magic) - [§8 residual compression](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#8-compression--what-to-do-with-the-residual) - [§9 where the method came from](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#9-where-the-method-came-from-and-where-its-already-used) - [§10 limits](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#10-limits-of-the-method) - [§11 what to try next](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#11-what-to-try-next) ## 1\. What we’re comparing Four lines in every measurement: - **raw**— the full embedding, no compression\. Quality ceiling and the most bytes per vector\. The “expensive” baseline that gives a fair ceiling\. - **matryoshka**—`embedding\[:d\]`plus L2 normalization\. On models trained with a matryoshka loss \(nomic, mxbai in our sample\), this is a valid matryoshka vector\. On models without matryoshka training \(bge\-base, e5\-base\) it’s a test of what happens when you naively slice a non\-MRL model — the scenario users of the bge family, e5, and custom\-fine\-tuned embeddings actually fall into\. - **PCA**— top\-d eigenvectors of the corpus covariance\. Vectors live in d\-dimensional PCA coordinates\. - **poly\-AE**— our method\. Encode with PCA into`p ∈ ℝ^d`, decode with a quadratic polynomial back to a full D\-dimensional`V̂`, retrieve on`V̂`\. At a fixed`d`, all four methods store`2d`bytes per vector \(fp16 coordinates\)\. ## 2\. Experimental setup BEIR is the standard set of retrieval datasets \(SciFact, FiQA, NFCorpus, TREC\-COVID, etc\.\)\. The metric is NDCG@10\. Corpus and queries are encoded with the chosen model, top\-10 are retrieved by cosine similarity, and NDCG is computed against the labeled qrels\. PCA and poly\-AE are fit**transductively**: statistics are computed on the corpus we want to compress\. Queries never participate in the fit — they hit a fixed encoder/decoder at inference time\. This matches a production deployment: an index operator computes PCA \+ Ridge once on their data and then serves queries\. For the main runs we use**FiQA**— 57K documents, 648 queries, 1706 qrels\. ## 3\. The headline table — four models NDCG@10 on FiQA at budgets of 256 fp16 \(512 bytes/vector\) and 128 fp16 \(256 bytes\): ModelDdrawmatryoshka†PCA**poly\-AE**poly over matryoshkapoly vs rawnomic\-embed\-text\-v1\.57681280\.37460\.31900\.3273**0\.3380**\+1\.90 p\.p\.\-3\.65 p\.p\.nomic\-embed\-text\-v1\.57682560\.37460\.35080\.3670**0\.3673**\+1\.65 p\.p\.\-0\.73 p\.p\.mxbai\-embed\-large\-v110241280\.45250\.35030\.3689**0\.4129**\+6\.26 p\.p\.\-3\.97 p\.p\.mxbai\-embed\-large\-v110242560\.45250\.40390\.4168**0\.4441**\+4\.02 p\.p\.\-0\.85 p\.p\.bge\-base\-en\-v1\.5\*7681280\.40620\.29140\.3266**0\.3654**\+7\.40 p\.p\.\-4\.09 p\.p\.bge\-base\-en\-v1\.5\*7682560\.40620\.35740\.3688**0\.3958**\+3\.84 p\.p\.\-1\.05 p\.p\.e5\-base\-v2\*7681280\.39870\.24980\.3065**0\.3317**\+8\.18 p\.p\.\-6\.70 p\.p\.e5\-base\-v2\*7682560\.39870\.33330\.3618**0\.3852**\+5\.19 p\.p\.\-1\.35 p\.p\.† The “matryoshka” column is`embedding\[:d\]`plus L2 normalization\. On nomic and mxbai it’s a valid matryoshka vector\. On models marked with`\*`\(bge\-base, e5\-base\) the model wasn’t trained for matryoshka, and the slice here is a test of what happens when you naively slice a non\-MRL model\. This is the scenario that users of the bge family, e5, and custom\-fine\-tuned embeddings actually fall into — and we measure it here honestly\. What this shows: 1. **Poly\-AE is consistently ahead of PCA**across all four models\. Lift: \+1 to \+4\.4 p\.p\. NDCG at d=128, \+0\.03 to \+2\.7 p\.p\. at d=256\. Where the quadratic decoder actually helps and at what`d`— discussed in[§4](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#4-where-the-quadratic-decoder-actually-helps)\. 2. **At`d=256`, poly\-AE loses 0\.7–1\.4 p\.p\. NDCG vs raw 768/1024**on all four models\. 4× per\-vector memory compression for less than 1\.5 p\.p\. lost — the main number of the post\. 3. **On non\-matryoshka\-trained models, the matryoshka column drops more than PCA**— up to \-15 p\.p\. NDCG at d=128\. This is a side observation: the post compares PCA and poly\-AE, not PCA vs matryoshka\. If the matryoshka numbers in the table look surprising, there’s a short pointer in[§4](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#4-where-the-quadratic-decoder-actually-helps)\. ## 4\. Where the quadratic decoder actually helps PCA is the linear baseline\. The quadratic decoder adds the nonlinear piece that the linear one can’t reach \(mechanics in[§5](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#5-why-a-linear-projection-loses-information)\)\. How much does that actually help on retrieval, and at what`d`? Poly\-AE lift over PCA, by model and`d`: Modelpoly over PCA, d=128poly over PCA, d=256nomic\-v1\.5\+1\.07 p\.p\.\+0\.03 p\.p\.mxbai\-large\+4\.40 p\.p\.\+2\.73 p\.p\.bge\-base\+3\.88 p\.p\.\+2\.70 p\.p\.e5\-base\-v2\+2\.52 p\.p\.\+2\.34 p\.p\.The picture: 1. **At d=128 \(8× compression\) poly is consistently 1–4 p\.p\. ahead of PCA\.**This is the regime where the linear decoder starts dropping noticeable variance into the nonlinear tail, and the quadratic correction pulls it back\. Sweet spot for the method\. 2. **At d=256 \(4× compression\) the gap is uneven\.**On mxbai/bge/e5 — a stable \+2\.3–2\.7 p\.p\. On nomic — close to zero \(\+0\.03\)\. Likely reason: nomic was carefully trained with multi\-slice contrastive loss, its latent is more isotropic, and at d=256 the linear projection already takes most of what’s there\. On non\-MRL models the nonlinear tail is bigger → poly helps more\. 3. **More anisotropy → bigger lift\.**The stronger the cone effect, the more variance lives in the nonlinear tail that PCA can’t reach but poly can\. That’s the geometry[§5](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#5-why-a-linear-projection-loses-information)unpacks\. ### Side: where matryoshka sits in the table In[§3](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#3-the-headline-table--four-models)you can see that on non\-MRL models \(bge, e5\) the matryoshka column drops more than PCA — i\.e\. on a random non\-MRL\-trained model, naive slicing works worse than a corpus\-side linear projection\. This is a known result; the “MRL vs PCA on retrieval” question has been discussed independently of this post — see[Matryoshka\-Adaptor 2024](https://arxiv.org/abs/2407.20243),[SMEC “Rethinking MRL” 2025](https://arxiv.org/abs/2510.12474),[CoRECT 2025](https://arxiv.org/abs/2510.19340), and the YouTube video literally titled[«Is PCA enough?»](https://www.youtube.com/watch?v=lklw59jQRKE)\. This post compares PCA and poly\-AE; matryoshka is in the[§3](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#3-the-headline-table--four-models)table as a third reference point\. ### Where poly\-AE doesn’t apply A corpus\-side PCA fit is a required part of the method\. That means poly\-AE**doesn’t work**when the corpus isn’t available: - **multi\-tenant SaaS**: one model, thousands of clients with different corpora — fitting PCA per client is operational pain; - **streaming indices**: statistics drift over time, PCA needs periodic refits; - **edge inference**: phone, browser, embedded — you don’t want to ship a per\-client PCA matrix alongside the model\. In those settings you want an MRL\-trained model and`embedding\[:d\]`, and poly\-AE isn’t an alternative — it also needs corpus statistics\. ### Practical takeaway In the operator\-fit setting \(fixed corpus, the operator fits compression once\), you have two working modes: - **d=256**gives 4× compression at \-0\.7 to \-1\.4 p\.p\. NDCG vs raw\. Poly over PCA: from \+0\.03 \(nomic\) to \+2\.7 p\.p\. \(mxbai/bge/e5\)\. On an MRL\-trained model like nomic the gap to PCA is minimal; on non\-MRL models poly is clearly ahead\. - **d=128**gives 8× compression\. Poly over PCA: \+1 to \+4\.4 p\.p\. on any model\. Sweet spot for the method\. ## 5\. Why a linear projection loses information PCA is the best possible**linear**projection of data into a d\-dimensional subspace\. But “best linear” doesn’t mean “good enough”: if the data has nonlinear structure, a linear decoder can’t reach it, period\. Transformer embeddings have such structure, well\-studied — the**cone effect**\. The point cloud is concentrated inside a narrow cone on the unit sphere and is heavily nonlinearly structured inside that cone\. Left: isotropic data\. Right: one example of what anisotropic data can look like\.drag to rotate PCA only catches projections along orthogonal eigenvectors — i\.e\. the linear ellipsoid that the cloud doesn’t actually fit in\. Whatever lives in the curvature of the manifold is structurally invisible to a linear decoder\. The stronger the anisotropy \(the narrower the cone\), the more variance sits in this nonlinear tail that PCA structurally can’t reach\. So what we want is clear\. A decoder that can deal in**quadratic combinations of coordinates**— i\.e\. a decoder that captures local curvature of the manifold\. Then we’d recover some of the information that PCA loses\. ## 6\. A polynomial decoder via a linear lift [§5](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#5-why-a-linear-projection-loses-information)made it clear that we need a nonlinear decoder\. The straightforward route is to train a neural network\. But then we lose the closed\-form pipeline: SGD, learning rate, batch size, convergence, early stopping\. We’d like a nonlinear decoder solvable by a single formula\. Standard regression trick: the**polynomial lift**\. Take vector`p`and lift it into all monomials up to degree 2 — bias, linear terms, squares, and pairwise products\. On a 2D example: ``` lift([p₁, p₂]) = [1, p₁, p₂, p₁², p₁·p₂, p₂²] ↑ ↑ ↑ ↑ ↑ ↑ bias linear quadratic ``` Any linear combination of these six = a quadratic function of the original`p₁, p₂`\. So**a linear regression on the lift = a quadratic regression in the original space**\. No nonlinear optimization needed, plain Ridge OLS in closed form does it\. Apply this to our case\. The encoder is top\-d PCA \(`p = \(V − V̄\) @ Q`,`p ∈ ℝ^d`\)\. Lift`p`into all monomials up to degree 2: ``` def polynomial_lift(p, degree=2): features = [1.0] # bias features.extend(p) # degree 1 for i in range(len(p)): for j in range(i, len(p)): features.append(p[i] * p[j]) # degree 2 return np.array(features) ``` The lift dimension is`M = \(d\+1\)\(d\+2\)/2`\. For`d=128`that’s`M = 8385`, for`d=256`it’s`M = 33153`\. The decoder is a linear regression from the M\-dimensional lift back to the D\-dimensional original\. Despite being “just a linear regression”, the resulting map`p → V̂`is quadratic — exactly what we needed\. Training the decoder is one`np\.linalg\.solve`: ``` def fit_decoder(P, V, lam=1e-3): L = polynomial_lift(P, degree=2) # (N, M) G = L.T @ L + lam * np.trace(L.T @ L) / L.shape[1] * np.eye(L.shape[1]) W = np.linalg.solve(G, L.T @ V) return W ``` The polynomial autoencoder is assembled: ![Polynomial autoencoder schema: V → PCA → p → poly lift → L → Ridge OLS → V̂; V − V̂ → V_resid](https://ivanpleshkov.dev/blog/polynomial-autoencoder/images/ae_schema_en_light.svg) - **encoder**— closed\-form linear \(PCA\), no training, - **decoder**— closed\-form quadratic \(Ridge OLS on the polynomial lift\), trained with one`np\.linalg\.solve`, - **reconstruction**—`V̂ = polynomial\_lift\(p\) @ W`, - **residual**—`V\_resid = V − V̂`\(useful in[§8](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#8-compression--what-to-do-with-the-residual)for quantization\)\. No backprop, learning rate, batch size, epochs\. On 100K vectors at d=100, a couple of minutes on CPU\. At d=256, around 15 minutes \(Ridge solves an`M³`system, cubic in`M`\)\. ### One technical note on normalization Raw`p`after PCA has huge anisotropy across its coordinates \(the first axis ~50× the variance of the last\) — bad for Ridge: large axes dominate the solution, small ones get under\-weighted\. Fix is per\-axis std normalization: ``` p_normed = p / np.sqrt(eigvals) p = p_normed * (0.9 / np.linalg.norm(p_normed, axis=1).max()) ``` The global`0\.9 / max\-norm`factor gives`‖p‖ ≤ 0\.9`— numerical stability under squaring\. Doesn’t affect NDCG, but cleaner numerically\. ## 7\. The small\-corpus measurement and in\-sample magic A methodological note\. On the first sanity check on SciFact \(5183 docs, 300 queries\), poly\-AE at`d=256`showed NDCG@10**0\.6980**vs raw**0\.7032**— losing only 0\.5 p\.p\. The reported gain over PCA there was \+1\.36 p\.p\. On FiQA \(57638 docs\), the same experiment gave: - poly vs raw: \-0\.73 p\.p\. \(stable, like SciFact\); - **poly over PCA: \+0\.03 p\.p\.**\(down from \+1\.36 on SciFact\)\. What happened\. On SciFact at`d=256`Ridge solves a regression with**M=33153 features on N=5183 vectors**— severely overdetermined: the in\-sample`V̂\_corpus`matches`V\_corpus`almost exactly, so retrieval on`V̂\_corpus`is equivalent to retrieval on the full 768\-d`V\_corpus`\. The SciFact poly\-AE measurement inflated its gap with PCA because it**memorized the corpus**in the Ridge weights, not because it generalized better\. What’s important: the poly\-vs\-**raw**drop is stable \(\-0\.5 to \-0\.85 p\.p\.\) across both corpora\. But for poly\-vs\-PCA, on small corpora \(`N ≤ 10K`at`d=256`\) the gap is partly mythical\. A safe rule of thumb:`N`should be at least 5–10× larger than`M = \(d\+1\)\(d\+2\)/2`\. For`d=256`that’s`N ≳ 200K`\. For`d=128`it’s`N ≳ 50K`\. In the headline table \([§3](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#3-the-headline-table--four-models)\), FiQA’s`N = 57K`is enough for`d=128`and borderline for`d=256`\. The bge numbers on FiQA at`d=256`should be read with that caveat\. ## 8\. Compression — what to do with the residual The autoencoder doesn’t compress anything by itself, strictly speaking\. It repackages a vector into a pair`\(p, V\_resid\)`— latent code plus residual\. Same information, different shape\. To get actual compression, the residual has to be quantized\. This is where poly\-AE’s anisotropy\-removing property matters\. On DBpedia\-OpenAI at`d=100`,`cond\(V\_resid\)`drops from 72 to 3\.4 — almost isotropic residual\. Linear PCA at the same`d`gives`cond=4\.21`— noticeably less isotropic\. Google Research recently published[**TurboQuant**](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/)— a quantizer with a fixed random rotation that**on isotropic distributions gets close to the theoretical compression limit**\(Shannon rate\-distortion for normals\)\. The composition is natural: the autoencoder strips out the anisotropy, TurboQuant efficiently quantizes the now\-isotropic residual\. ``` V ──► encoder (PCA) ──► p │ decoder (poly+Ridge) ──► V̂ │ V − V̂ ──► V_resid ──► TurboQuant ──► compact code ``` For Qdrant 1\.18 I shipped a TurboQuant modification with anisotropy compensation\. It achieves roughly the same recall as the “poly\-AE \+ residual quantization” hybrid, but without needing to store the poly\-AE latent coordinates — saving the p\-side of the per\-vector byte budget\. The tricks I layered on top of standard TurboQuant will be in a separate post\. ## 9\. Where the method came from and where it’s already used I’ve been familiar with the polynomial lift since my time at[ACD/Labs](https://www.acdlabs.com/)— in cheminformatics it’s a standard tool for building features from molecular structure \(as input for downstream regression\)\. Currently I work at Qdrant\. When I was implementing late\-interaction retrieval support there \([ColBERT\-style multi\-vectors with maxsim, v1\.10](https://github.com/qdrant/qdrant/releases/tag/v1.10.0)\), I remembered the trick — and that’s how the “PCA \+ quadratic decoder” idea came together\. I thought no one had tried it for neural embeddings yet\. When the draft went to[Sava Kalbachou](https://www.linkedin.com/in/sava-kalbachou/)for review, he pointed out specific papers where this exact construction has long been used — in numerical modeling of physical systems, under the name**quadratic manifold**: - **Jain, Tiso, Rixen, Rutzmoser 2017**—[«A quadratic manifold for model order reduction of nonlinear structural dynamics»](https://arxiv.org/abs/1610.09902)\. Introduces the concept: PCA top\-d plus a quadratic correction\. The quadratic part is derived from the physics of the system, not from a data fit\. The earliest paper\. - **Geelen, Wright, Willcox 2022**—[«Operator inference for non\-intrusive model reduction with quadratic manifolds»](https://arxiv.org/abs/2205.02304)\.**Exactly our construction**: PCA encoder, decoder on the Kronecker product of the latent code, coefficients fit via regularized least squares\.[Open\-source code](https://github.com/geelenr/quad_manifold)\. - **Geelen, Balzano, Willcox 2023**—[«Learning latent representations in high\-dimensional state spaces using polynomial manifold constructions»](https://arxiv.org/abs/2306.13748)\. Explicitly named “polynomial manifold”, explicitly fit by regression\. - **Schwerdtner, Peherstorfer 2024–2025**— a series of follow\-ups exploring training and application variants:[greedy direction selection](https://arxiv.org/abs/2403.06732),[sparse regression on the manifold](https://doi.org/10.1137/24M1717270),[streaming online learning](https://royalsocietypublishing.org/doi/10.1098/rspa.2024.0670), and[«Operator Inference Aware Quadratic Manifolds with Isotropic Reduced Coordinates»](https://arxiv.org/abs/2507.20463)— the last one curiously echoes the residual\-isotropy story we use for TurboQuant in[§8](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#8-compression--what-to-do-with-the-residual)\. In those papers the construction is applied to solution\-trajectory snapshots of physical\-system simulations\. Embeddings are a different data geometry, and it’s not obvious in advance that quadratic manifold should also work there; the numbers in[§3](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#3-the-headline-table--four-models)show that it does\. ### Why this combination doesn’t seem to be in mainstream ML These are just speculations\. The communities probably don’t overlap much: quadratic\-manifold work appears in engineering journals and rarely surfaces in ML research\. Also, in ML’s collective memory polynomial methods are mostly associated with the kernel trick \(SVM, kernel PCA\) — a different construction, and the explicit polynomial lift as a stand\-alone technique rarely comes up after that\. And an “autoencoder” is usually reflexively understood as a neural network in modern ML; closed\-form variants may simply fall outside the usual field of attention\. So this is an applied\-ML post: a technique developed in an adjacent field turns out to be useful for embeddings as well\. The contribution of the post is the port and the empirical check\. ## 10\. Limits of the method A few honest caveats: - **Cubic Ridge solve\.**At`d=256`,`M = 33K`, and`np\.linalg\.solve`in closed form costs`O\(M³\)`≈ 5–15 minutes on CPU\. At`d=384`that’s already`M = 74K`and`M³`is unmanageable\. The method is comfortable up to`d ≈ 200–256`\. For larger`d`you’d need randomized methods \(random feature approximation\), or move to a neural decoder\. - **Transductive fit\.**PCA \+ Ridge are computed on the corpus we intend to compress\. If the corpus drifts over time, decompression degrades\. Fine for static indices, problematic for streaming\. - **Small N\.**When`N < 5M`\(i\.e\.`N < 200K`at`d=256`\), Ridge starts memorizing the corpus, in\-sample retrieval improves due to overfitting rather than due to structure\. The signal: a poly\-vs\-PCA gap that’s too large probably means overfit\. - **Degree 3 doesn’t work yet\.**A degree\-3 lift gives`M = O\(d³\)`— for`d=100`that’s 175K features\. Fits in memory, but the Ridge solve is impractical and overfitting climbs\. ## 11\. What to try next Obvious branches: - **Bigger corpora, heavier models\.**All measurements in the post are on FiQA with models in the 305–560M\-param range\. Worth running on MS MARCO \(8\.8M passages\) with e5\-mistral\-7b\-class models\. The cubic Ridge becomes the bottleneck — you’d need random\-feature approximation\. - **Lift degree\.**At what`d/N`does degree 3 start to actually help? Are there embeddings with significant 3rd\-order tensor structure that we’re losing now? - **The regime where you can’t compute matryoshka\.**Multi\-tenant SaaS, edge inference — there poly\-AE without PCA won’t work either \(it needs corpus stats\)\. But a**hybrid**: matryoshka in the model \+ polynomial decoder on the client — should give the best of both worlds\. ## Code and reproduction Loaders for BEIR \(SciFact, FiQA, NFCorpus, TREC\-COVID\), encoders \(nomic\-v1\.5, mxbai\-large, bge\-base, e5\-base\), PCA\-init with per\-axis std, polynomial lift, Ridge OLS, retrieval with NDCG@10/Recall@10 — all in[github\.com/IvanPleshkov/poly\-autoencoder](https://github.com/IvanPleshkov/poly-autoencoder)\. Minimal run: ``` git clone https://github.com/IvanPleshkov/poly-autoencoder.git cd poly-autoencoder python -m venv .venv && source .venv/bin/activate pip install -r requirements.txt pytrec_eval einops python beir_eval.py --model nomic-embed-text-v1.5 --dataset fiqa --d 128,256 ``` On an M\-series MacBook the first run downloads the model and dataset, encodes the corpus \(5–15 min depending on the model\), and over the next 5–20 minutes prints a table with NDCG@10 for all four methods\.

A polynomial autoencoder beats PCA on transformer embeddings

Similar Articles

At the Edge of Understanding: Sparse Autoencoders Trace The Limits of Transformer Generalization

Transformers Learn the Mestre-Nagao Heuristic

DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

3D Masked Autoencoders are Robust Learners of Volumetric and Multimodal Cellular Representations for Microscopy

Submit Feedback

Similar Articles

At the Edge of Understanding: Sparse Autoencoders Trace The Limits of Transformer Generalization

Transformers Learn the Mestre-Nagao Heuristic

DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

3D Masked Autoencoders are Robust Learners of Volumetric and Multimodal Cellular Representations for Microscopy