# Polynomial autoencoder
Source: [https://ivanpleshkov.dev/blog/polynomial-autoencoder/](https://ivanpleshkov.dev/blog/polynomial-autoencoder/)
The most direct way to compress an embedding \(other than quantization\) is to fit PCA on the corpus and keep the top\-d eigenvectors\. It works, but PCA is a linear projection, and neural\-network embeddings on the sphere are structurally nonlinear: this is the well\-known **cone effect** in transformers\. Some of the variance lives in a nonlinear tail that a linear decoder can’t reach\.
This post is about a closed\-form way to add a **quadratic** decoder on top of PCA, to capture part of that nonlinear tail\. The encoder stays as plain PCA\. The decoder is a degree\-2 polynomial lift plus Ridge OLS \(ordinary linear regression with L2 regularization\), also closed\-form\. No SGD, no epochs, no hyperparameter search\. One `np.linalg.solve` over corpus statistics\.
The construction itself isn’t mine\. “PCA encoder \+ quadratic decoder \+ least\-squares fit” appears in the dynamical\-systems literature under the name **quadratic manifold** \(see [Jain 2017](https://arxiv.org/abs/1610.09902), [Geelen\-Willcox 2022](https://arxiv.org/abs/2205.02304) / [2023](https://arxiv.org/abs/2306.13748), [Schwerdtner\-Peherstorfer 2024\+](https://arxiv.org/abs/2403.06732); more in [§9](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#9-where-the-method-came-from-and-where-its-already-used)\)\. I only stumbled onto these papers after running my experiments and believing the construction was new\. The polynomial lift doesn’t come up much in modern ML conversations, and this post is a note about a useful trick from an adjacent discipline carrying over to retrieval\.
The concrete result\. BEIR/FiQA, `mxbai-embed-large-v1` \(1024d\), per\-vector budget 512 bytes\. Metric is **NDCG@10** \(Normalized Discounted Cumulative Gain over the top\-10, the standard retrieval ranking\-quality measure; range \[0, 1\], higher is better\):

| method | NDCG@10 | Δ vs raw 1024d |
| --- | --- | --- |
| raw 1024d \(4096 bytes\) | 0\.4525 | n/a |
| PCA top\-256 | 0\.4168 | \-3\.58 p\.p\. |
| **poly\-AE 256d** | **0\.4441** | **\-0\.85 p\.p\.** |
| matryoshka top\-256 | 0\.4039 | \-4\.86 p\.p\. |

PCA already gives 4× per\-vector memory compression at \-3\.58 p\.p\. NDCG\. A quadratic decoder on top of PCA recovers another **\+2\.73 p\.p\.**, closing almost the entire gap to raw at the same byte budget\. Matryoshka is in the table as another familiar baseline \(here it drops more than PCA, a known side observation and not the central claim of this post; see the [§3](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#3-the-headline-table--four-models) footnote and [§4](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#4-where-the-quadratic-decoder-actually-helps)\)\.
Measured on four models \(nomic\-v1\.5, mxbai\-large, bge\-base, e5\-base\)\. Poly\-AE over PCA: \+1 to \+4\.4 p\.p\. at d=128, \+0\.03 to \+2\.7 p\.p\. at d=256\. Full table in [§3](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#3-the-headline-table--four-models)\.
Full implementation: ~150 lines of numpy, MIT\-licensed, in the repository [github\.com/IvanPleshkov/poly\-autoencoder](https://github.com/IvanPleshkov/poly-autoencoder)\. The BEIR eval script is `beir_eval.py` in the same repo\. Reproduces in 30–40 minutes on an M\-series MacBook \(10–15 min to encode the corpus, ~15 min for the Ridge solve at d=256\)\.
**Contents:**
- [§1 what we’re comparing](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#1-what-were-comparing)
- [§2 setup](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#2-experimental-setup)
- [§3 headline table](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#3-the-headline-table--four-models)
- [§4 where the quadratic decoder helps](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#4-where-the-quadratic-decoder-actually-helps)
- [§5 why linear projection loses](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#5-why-a-linear-projection-loses-information)
- [§6 method](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#6-a-polynomial-decoder-via-a-linear-lift)
- [§7 small\-corpus caveat](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#7-the-small-corpus-measurement-and-in-sample-magic)
- [§8 residual compression](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#8-compression--what-to-do-with-the-residual)
- [§9 where the method came from](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#9-where-the-method-came-from-and-where-its-already-used)
- [§10 limits](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#10-limits-of-the-method)
- [§11 what to try next](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#11-what-to-try-next)
## 1\. What we’re comparing
Four lines in every measurement:
- **raw**: the full embedding, no compression\. Quality ceiling and the most bytes per vector\. The “expensive” baseline that gives a fair ceiling\.
- **matryoshka**: `embedding[:d]` plus L2 normalization\. On models trained with a matryoshka loss \(nomic, mxbai in our sample\), this is a valid matryoshka vector\. On models without matryoshka training \(bge\-base, e5\-base\) it’s a test of what happens when you naively slice a non\-MRL model: the scenario that users of the bge family, e5, and custom\-fine\-tuned embeddings actually fall into\.
- **PCA**: top\-d eigenvectors of the corpus covariance\. Vectors live in d\-dimensional PCA coordinates\.
- **poly\-AE**: our method\. Encode with PCA into `p ∈ ℝ^d`, decode with a quadratic polynomial back to a full D\-dimensional `V̂`, retrieve on `V̂`\.
At a fixed `d`, the three compressed methods \(matryoshka, PCA, poly\-AE\) store `2d` bytes per vector \(fp16 coordinates\); raw keeps the full D\-dimensional embedding\.
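To make the baselines concrete, here is a minimal numpy sketch\. It assumes embeddings arrive as an `(N, D)` float array; the helper names \(`matryoshka_slice`, `fit_pca`, `pca_encode`\) and the random toy data are illustrative and not taken from the repository\.

```python
import numpy as np

def matryoshka_slice(V, d):
    """Keep the first d coordinates and re-normalize each row to unit length."""
    S = V[:, :d]
    return S / np.linalg.norm(S, axis=1, keepdims=True)

def fit_pca(V, d):
    """Return the corpus mean, top-d eigenvectors Q (D, d) and their eigenvalues."""
    mean = V.mean(axis=0)
    cov = np.cov(V - mean, rowvar=False)          # (D, D) corpus covariance
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]         # indices of the d largest
    return mean, eigvecs[:, order], eigvals[order]

def pca_encode(V, mean, Q):
    """p = (V - mean) @ Q, the d-dimensional PCA coordinates."""
    return (V - mean) @ Q

rng = np.random.default_rng(0)
V = rng.normal(size=(1000, 64))
V /= np.linalg.norm(V, axis=1, keepdims=True)     # unit-norm toy "embeddings"

d = 16
mean, Q, eigvals = fit_pca(V, d)
p = pca_encode(V, mean, Q)
m = matryoshka_slice(V, d)
bytes_per_vector = p.astype(np.float16).nbytes // len(p)   # 2 * d bytes in fp16
```

Both compressed representations have `d` coordinates per vector, so they land on the same `2d`\-byte budget once cast to fp16\.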
## 2\. Experimental setup
BEIR is the standard set of retrieval datasets \(SciFact, FiQA, NFCorpus, TREC\-COVID, etc\.\)\. The metric is NDCG@10\. Corpus and queries are encoded with the chosen model, top\-10 are retrieved by cosine similarity, and NDCG is computed against the labeled qrels\.
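For readers who want the metric spelled out, below is a small sketch of cosine top\-k retrieval and NDCG@k for a single query\. It assumes linear gain with a log2 discount and a `qrels` dict mapping document id to graded relevance; the function names are hypothetical, and the repository itself relies on `pytrec_eval` rather than this code\.

```python
import numpy as np

def cosine_top_k(query, corpus, k=10):
    """Indices of the k corpus rows most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    C = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = C @ q
    return np.argsort(-scores)[:k].tolist()

def ndcg_at_k(ranked_ids, qrels, k=10):
    """NDCG@k for one query; qrels maps doc id -> relevance (missing means 0)."""
    gains = np.array([qrels.get(doc, 0) for doc in ranked_ids[:k]], dtype=float)
    dcg = float(gains @ (1.0 / np.log2(np.arange(2, len(gains) + 2))))
    ideal = np.sort(np.array(list(qrels.values()), dtype=float))[::-1][:k]
    idcg = float(ideal @ (1.0 / np.log2(np.arange(2, len(ideal) + 2))))
    return dcg / idcg if idcg > 0 else 0.0

corpus = np.eye(3)                       # three toy documents
query = np.array([0.1, 0.9, 0.2])
ranking = cosine_top_k(query, corpus, k=2)
score = ndcg_at_k(ranking, {1: 1})       # document 1 is the only relevant one
```

The dataset\-level number is the mean of this per\-query score over all queries\.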
PCA and poly\-AE are fit **transductively**: statistics are computed on the corpus we want to compress\. Queries never participate in the fit; they hit a fixed encoder/decoder at inference time\. This matches a production deployment: an index operator computes PCA \+ Ridge once on their data and then serves queries\.
For the main runs we use **FiQA**: 57K documents, 648 queries, 1706 qrels\.
## 3\. The headline table — four models
NDCG@10 on FiQA at budgets of 256 fp16 \(512 bytes/vector\) and 128 fp16 \(256 bytes\):

| Model | D | d | raw | matryoshka† | PCA | **poly\-AE** | poly over matryoshka | poly vs raw |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nomic\-embed\-text\-v1\.5 | 768 | 128 | 0\.3746 | 0\.3190 | 0\.3273 | **0\.3380** | \+1\.90 p\.p\. | \-3\.65 p\.p\. |
| nomic\-embed\-text\-v1\.5 | 768 | 256 | 0\.3746 | 0\.3508 | 0\.3670 | **0\.3673** | \+1\.65 p\.p\. | \-0\.73 p\.p\. |
| mxbai\-embed\-large\-v1 | 1024 | 128 | 0\.4525 | 0\.3503 | 0\.3689 | **0\.4129** | \+6\.26 p\.p\. | \-3\.97 p\.p\. |
| mxbai\-embed\-large\-v1 | 1024 | 256 | 0\.4525 | 0\.4039 | 0\.4168 | **0\.4441** | \+4\.02 p\.p\. | \-0\.85 p\.p\. |
| bge\-base\-en\-v1\.5\* | 768 | 128 | 0\.4062 | 0\.2914 | 0\.3266 | **0\.3654** | \+7\.40 p\.p\. | \-4\.09 p\.p\. |
| bge\-base\-en\-v1\.5\* | 768 | 256 | 0\.4062 | 0\.3574 | 0\.3688 | **0\.3958** | \+3\.84 p\.p\. | \-1\.05 p\.p\. |
| e5\-base\-v2\* | 768 | 128 | 0\.3987 | 0\.2498 | 0\.3065 | **0\.3317** | \+8\.18 p\.p\. | \-6\.70 p\.p\. |
| e5\-base\-v2\* | 768 | 256 | 0\.3987 | 0\.3333 | 0\.3618 | **0\.3852** | \+5\.19 p\.p\. | \-1\.35 p\.p\. |

† The “matryoshka” column is `embedding[:d]` plus L2 normalization\. On nomic and mxbai it’s a valid matryoshka vector\. On models marked with `*` \(bge\-base, e5\-base\) the model wasn’t trained for matryoshka, and the slice here is a test of what happens when you naively slice a non\-MRL model\. This is the scenario that users of the bge family, e5, and custom\-fine\-tuned embeddings actually fall into, and we measure it here honestly\.
What this shows:
1. **Poly\-AE is consistently ahead of PCA** across all four models\. Lift: \+1 to \+4\.4 p\.p\. NDCG at d=128, \+0\.03 to \+2\.7 p\.p\. at d=256\. Where the quadratic decoder actually helps, and at what `d`, is discussed in [§4](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#4-where-the-quadratic-decoder-actually-helps)\.
2. **At `d=256`, poly\-AE loses 0\.7–1\.4 p\.p\. NDCG vs raw 768/1024** on all four models\. 4× per\-vector memory compression for less than 1\.5 p\.p\. lost: that is the main number of the post\.
3. **On non\-matryoshka\-trained models, the matryoshka column drops more than PCA**: up to \-15 p\.p\. NDCG vs raw at d=128 \(e5\-base\-v2: 0\.2498 vs 0\.3987\)\. This is a side observation: the post compares PCA and poly\-AE, not PCA vs matryoshka\. If the matryoshka numbers in the table look surprising, there’s a short pointer in [§4](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#4-where-the-quadratic-decoder-actually-helps)\.
## 4\. Where the quadratic decoder actually helps
PCA is the linear baseline\. The quadratic decoder adds the nonlinear piece that the linear one can’t reach \(mechanics in [§5](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#5-why-a-linear-projection-loses-information)\)\. How much does that actually help on retrieval, and at what `d`?
Poly\-AE lift over PCA, by model and `d`:

| Model | poly over PCA, d=128 | poly over PCA, d=256 |
| --- | --- | --- |
| nomic\-v1\.5 | \+1\.07 p\.p\. | \+0\.03 p\.p\. |
| mxbai\-large | \+4\.40 p\.p\. | \+2\.73 p\.p\. |
| bge\-base | \+3\.88 p\.p\. | \+2\.70 p\.p\. |
| e5\-base\-v2 | \+2\.52 p\.p\. | \+2\.34 p\.p\. |

The picture:
1. **At d=128 \(8× compression\) poly is consistently 1–4 p\.p\. ahead of PCA\.** This is the regime where the linear decoder starts dropping noticeable variance into the nonlinear tail, and the quadratic correction pulls it back\. Sweet spot for the method\.
2. **At d=256 \(4× compression\) the gap is uneven\.** On mxbai/bge/e5: a stable \+2\.3–2\.7 p\.p\. On nomic: close to zero \(\+0\.03\)\. Likely reason: nomic was carefully trained with multi\-slice contrastive loss, its latent is more isotropic, and at d=256 the linear projection already takes most of what’s there\. On non\-MRL models the nonlinear tail is bigger, so poly helps more\.
3. **More anisotropy → bigger lift\.** The stronger the cone effect, the more variance lives in the nonlinear tail that PCA can’t reach but poly can\. That’s the geometry [§5](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#5-why-a-linear-projection-loses-information) unpacks\.
### Side: where matryoshka sits in the table
In [§3](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#3-the-headline-table--four-models) you can see that on non\-MRL models \(bge, e5\) the matryoshka column drops more than PCA, i\.e\. on a random non\-MRL\-trained model, naive slicing works worse than a corpus\-side linear projection\. This is a known result; the “MRL vs PCA on retrieval” question has been discussed independently of this post: see [Matryoshka\-Adaptor 2024](https://arxiv.org/abs/2407.20243), [SMEC “Rethinking MRL” 2025](https://arxiv.org/abs/2510.12474), [CoRECT 2025](https://arxiv.org/abs/2510.19340), and the YouTube video literally titled [«Is PCA enough?»](https://www.youtube.com/watch?v=lklw59jQRKE)\. This post compares PCA and poly\-AE; matryoshka is in the [§3](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#3-the-headline-table--four-models) table as a third reference point\.
### Where poly\-AE doesn’t apply
A corpus\-side PCA fit is a required part of the method\. That means poly\-AE **doesn’t work** when the corpus isn’t available:
- **multi\-tenant SaaS**: one model, thousands of clients with different corpora — fitting PCA per client is operational pain;
- **streaming indices**: statistics drift over time, PCA needs periodic refits;
- **edge inference**: phone, browser, embedded — you don’t want to ship a per\-client PCA matrix alongside the model\.
In those settings you want an MRL\-trained model and `embedding[:d]`, and poly\-AE isn’t an alternative: it also needs corpus statistics\.
### Practical takeaway
In the operator\-fit setting \(fixed corpus, the operator fits compression once\), you have two working modes:
- **d=256** gives 4× compression at \-0\.7 to \-1\.4 p\.p\. NDCG vs raw\. Poly over PCA: from \+0\.03 \(nomic\) to \+2\.7 p\.p\. \(mxbai/bge/e5\)\. On an MRL\-trained model like nomic the gap to PCA is minimal; on non\-MRL models poly is clearly ahead\.
- **d=128** gives 8× compression\. Poly over PCA: \+1 to \+4\.4 p\.p\. on any model\. Sweet spot for the method\.
## 5\. Why a linear projection loses information
PCA is the best possible **linear** projection of data into a d\-dimensional subspace\. But “best linear” doesn’t mean “good enough”: if the data has nonlinear structure, a linear decoder can’t reach it, period\.
Transformer embeddings have such structure, and it is well studied: the **cone effect**\. The point cloud is concentrated inside a narrow cone on the unit sphere and is heavily nonlinearly structured inside that cone\.
Figure \(interactive 3D plot in the original post\): left, isotropic data; right, one example of what anisotropic data can look like\.
PCA only catches projections along orthogonal eigenvectors, i\.e\. the linear ellipsoid that the cloud doesn’t actually fit in\. Whatever lives in the curvature of the manifold is structurally invisible to a linear decoder\. The stronger the anisotropy \(the narrower the cone\), the more variance sits in this nonlinear tail that PCA structurally can’t reach\.
So what we want is clear\. A decoder that can deal in **quadratic combinations of coordinates**, i\.e\. a decoder that captures local curvature of the manifold\. Then we’d recover some of the information that PCA loses\.
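A toy illustration of the argument above, not a measurement on real embeddings: points on a parabola in 2\-D, compressed to one PCA coordinate\. The data, variable names, and the use of `np.linalg.lstsq` are assumptions made for this sketch\.

```python
import numpy as np

t = np.linspace(-1.0, 1.0, 201)
X = np.column_stack([t, t**2])                 # points on a parabola in 2-D

mean = X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X - mean, rowvar=False))
q = eigvecs[:, -1]                             # top principal direction
p = (X - mean) @ q                             # 1-D PCA code

X_lin = mean + np.outer(p, q)                  # linear (PCA) decoder
lift = np.column_stack([np.ones_like(p), p, p**2])
W, *_ = np.linalg.lstsq(lift, X, rcond=None)   # quadratic decoder via least squares
X_quad = lift @ W

mse_lin = np.mean(np.sum((X - X_lin) ** 2, axis=1))
mse_quad = np.mean(np.sum((X - X_quad) ** 2, axis=1))
print(f"linear decoder MSE:    {mse_lin:.4f}")
print(f"quadratic decoder MSE: {mse_quad:.2e}")
```

The 1\-D code is identical in both cases; only the decoder changes\. The linear decoder flattens the curve onto a line, while the quadratic one can follow the bend\.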
## 6\. A polynomial decoder via a linear lift
[§5](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#5-why-a-linear-projection-loses-information) made it clear that we need a nonlinear decoder\. The straightforward route is to train a neural network\. But then we lose the closed\-form pipeline: SGD, learning rate, batch size, convergence, early stopping\. We’d like a nonlinear decoder solvable by a single formula\.
Standard regression trick: the **polynomial lift**\. Take vector `p` and lift it into all monomials up to degree 2: bias, linear terms, squares, and pairwise products\. On a 2D example:
```
lift([p₁, p₂]) = [1,    p₁,   p₂,   p₁²,  p₁·p₂,  p₂²]
                  ↑     ↑     ↑     ↑     ↑       ↑
                  bias  linear      quadratic
```
Any linear combination of these six is a quadratic function of the original `p₁, p₂`\. So **a linear regression on the lift = a quadratic regression in the original space**\. No nonlinear optimization needed; plain Ridge OLS in closed form does it\.
Apply this to our case\. The encoder is top\-d PCA \(`p = (V − V̄) @ Q`, `p ∈ ℝ^d`\)\. Lift `p` into all monomials up to degree 2:
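A quick way to convince yourself of this equivalence: generate values of a known quadratic function and recover its coefficients with ordinary least squares on the lifted features\. The helper `lift2`, the coefficients, and the random points are made up for this sketch\.

```python
import numpy as np

def lift2(P):
    """Degree-2 lift of 2-D points: [1, p1, p2, p1^2, p1*p2, p2^2]."""
    p1, p2 = P[:, 0], P[:, 1]
    return np.column_stack([np.ones(len(P)), p1, p2, p1**2, p1 * p2, p2**2])

rng = np.random.default_rng(0)
P = rng.uniform(-1, 1, size=(200, 2))
true_coef = np.array([3.0, 2.0, -1.0, 0.5, -4.0, 1.0])
y = lift2(P) @ true_coef                               # a quadratic function of (p1, p2)

coef, *_ = np.linalg.lstsq(lift2(P), y, rcond=None)    # plain linear least squares
```

The solver only ever sees a linear problem in six unknowns, yet the fitted map from `(p1, p2)` to `y` is quadratic\.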
```
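import numpy as np

def polynomial_lift(p, degree=2):
    """Lift p of shape (d,) or (N, d) into all monomials up to degree 2."""
    if degree != 2:
        raise NotImplementedError("only the degree-2 lift is implemented")
    p = np.atleast_2d(p)                             # (N, d); a single vector becomes (1, d)
    i, j = np.triu_indices(p.shape[1])               # index pairs with i <= j
    bias = np.ones((p.shape[0], 1))                  # degree 0
    return np.hstack([bias, p, p[:, i] * p[:, j]])   # (N, M), M = (d+1)(d+2)/2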
```
The lift dimension is `M = (d+1)(d+2)/2`\. For `d=128` that’s `M = 8385`, for `d=256` it’s `M = 33153`\. The decoder is a linear regression from the M\-dimensional lift back to the D\-dimensional original\. Despite being “just a linear regression”, the resulting map `p → V̂` is quadratic, which is exactly what we needed\.
Training the decoder is one `np.linalg.solve`:
```
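def fit_decoder(P, V, lam=1e-3):
    """Ridge OLS from the lift of P (N, d) to the targets V (N, D); returns W (M, D)."""
    L = polynomial_lift(P, degree=2)              # (N, M)
    G = L.T @ L                                   # (M, M) Gram matrix
    M = G.shape[0]
    G = G + lam * np.trace(G) / M * np.eye(M)     # ridge term scaled by the mean eigenvalue
    W = np.linalg.solve(G, L.T @ V)               # (M, D)
    return W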
```
The polynomial autoencoder is assembled:

- **encoder**: closed\-form linear \(PCA\), no training,
- **decoder**: closed\-form quadratic \(Ridge OLS on the polynomial lift\), trained with one `np.linalg.solve`,
- **reconstruction**: `V̂ = polynomial_lift(p) @ W`,
- **residual**: `V_resid = V − V̂` \(useful in [§8](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#8-compression--what-to-do-with-the-residual) for quantization\)\.
No backprop, learning rate, batch size, or epochs\. On 100K vectors at d=100, the fit takes a couple of minutes on CPU\. At d=256, around 15 minutes \(Ridge solves an `M × M` linear system, and the cost of that solve is cubic in `M`\)\.
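The pieces can be put together on synthetic data as a sanity check\. This is a sketch under stated assumptions, not the repository’s code: the toy corpus is built so that its weak directions are exact quadratic functions of its strong ones, the helper `fit_ridge` is a hypothetical name, and only per\-axis std normalization is applied to the latent code\.

```python
import numpy as np

def polynomial_lift(P):
    """Degree-2 lift of an (N, d) batch: bias, linear terms, products p_i * p_j (i <= j)."""
    i, j = np.triu_indices(P.shape[1])
    return np.hstack([np.ones((len(P), 1)), P, P[:, i] * P[:, j]])

def fit_ridge(L, V, lam=1e-3):
    """Closed-form Ridge OLS with a trace-scaled regularizer."""
    G = L.T @ L
    M = G.shape[0]
    return np.linalg.solve(G + lam * np.trace(G) / M * np.eye(M), L.T @ V)

# Toy corpus: 3 strong linear directions plus 6 weaker coordinates that are
# exact quadratic functions of them, hidden behind a random rotation.
rng = np.random.default_rng(0)
z = rng.normal(size=(2000, 3))
i, j = np.triu_indices(3)
curved = 0.5 * z[:, i] * z[:, j]
R, _ = np.linalg.qr(rng.normal(size=(9, 9)))          # random orthogonal matrix
V = np.hstack([3.0 * z, curved]) @ R                  # (N, D) with D = 9

# Encoder: top-d PCA followed by per-axis std normalization.
d = 3
mean = V.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(V - mean, rowvar=False))
Q, top = eigvecs[:, -d:], eigvals[-d:]
p = (V - mean) @ Q / np.sqrt(top)

# Decoders: linear PCA reconstruction vs. quadratic Ridge decoder.
V_pca = mean + (p * np.sqrt(top)) @ Q.T
W = fit_ridge(polynomial_lift(p), V)
V_poly = polynomial_lift(p) @ W
V_resid = V - V_poly                                   # input to residual quantization

err_pca = np.mean(np.sum((V - V_pca) ** 2, axis=1))
err_poly = np.mean(np.sum((V - V_poly) ** 2, axis=1))
print(f"PCA reconstruction MSE:     {err_pca:.4f}")
print(f"poly-AE reconstruction MSE: {err_poly:.2e}")
```

On this construction the quadratic decoder recovers nearly everything the linear one drops, because the dropped part is quadratic by design\. Real embeddings are not this clean, which is why the retrieval numbers in the tables matter more than a toy MSE\.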
### One technical note on normalization
Raw `p` after PCA has huge anisotropy across its coordinates \(the first axis has ~50× the variance of the last\), which is bad for Ridge: large axes dominate the solution, small ones get under\-weighted\. The fix is per\-axis std normalization:
```
p_normed = p / np.sqrt(eigvals)
p = p_normed * (0.9 / np.linalg.norm(p_normed, axis=1).max())
```
The global `0.9 / max-norm` factor gives `‖p‖ ≤ 0.9`, for numerical stability under squaring\. It doesn’t affect NDCG, but it is cleaner numerically\.
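Since queries hit a fixed encoder at inference time, one natural reading is that the scaling constants are computed once on the corpus and then reused as\-is\. The sketch below follows that reading; the split into `fit_latent_scaling` and `apply_latent_scaling` is an assumption of this example, not something the post or the repository specifies\.

```python
import numpy as np

def fit_latent_scaling(p_corpus, eigvals):
    """Return (per_axis_std, global_scale) computed from corpus latents only."""
    std = np.sqrt(eigvals)
    max_norm = np.linalg.norm(p_corpus / std, axis=1).max()
    return std, 0.9 / max_norm

def apply_latent_scaling(p, std, scale):
    """Apply a previously fitted scaling to corpus or query latents."""
    return p / std * scale

rng = np.random.default_rng(0)
eigvals = np.array([50.0, 10.0, 1.0])                   # toy PCA spectrum
p_corpus = rng.normal(size=(500, 3)) * np.sqrt(eigvals)
p_query = rng.normal(size=(5, 3)) * np.sqrt(eigvals)

std, scale = fit_latent_scaling(p_corpus, eigvals)
p_corpus_n = apply_latent_scaling(p_corpus, std, scale)
p_query_n = apply_latent_scaling(p_query, std, scale)   # query norms may exceed 0.9
```

The `‖p‖ ≤ 0.9` bound is guaranteed only for the vectors the scale was fitted on; an unusual query can land slightly outside it\.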
## 7\. The small\-corpus measurement and in\-sample magic
A methodological note\. On the first sanity check on SciFact \(5183 docs, 300 queries\), poly\-AE at `d=256` showed NDCG@10 **0\.6980** vs raw **0\.7032**, losing only 0\.5 p\.p\. The reported gain over PCA there was \+1\.36 p\.p\.
On FiQA \(57638 docs\), the same experiment gave:
- poly vs raw: \-0\.73 p\.p\. \(stable, like SciFact\);
- **poly over PCA: \+0\.03 p\.p\.** \(down from \+1\.36 on SciFact\)\.
What happened\. On SciFact at `d=256` Ridge solves a regression with **M=33153 features on N=5183 vectors**, which is severely underdetermined \(more unknowns than samples\): the in\-sample `V̂_corpus` matches `V_corpus` almost exactly, so retrieval on `V̂_corpus` is equivalent to retrieval on the full 768\-d `V_corpus`\. The SciFact poly\-AE measurement inflated its gap with PCA because it **memorized the corpus** in the Ridge weights, not because it generalized better\.
What’s important: the poly\-vs\-**raw** drop is stable \(\-0\.5 to \-0\.85 p\.p\.\) across both corpora\. But for poly\-vs\-PCA, on small corpora \(`N ≤ 10K` at `d=256`\) the gap is partly mythical\. A safe rule of thumb: `N` should be at least 5–10× larger than `M = (d+1)(d+2)/2`\. For `d=256` that’s `N ≳ 200K`\. For `d=128` it’s `N ≳ 50K`\.
In the headline table \([§3](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#3-the-headline-table--four-models)\), FiQA’s `N = 57K` is enough for `d=128` \(about 7× `M`\) and borderline at best for `d=256` \(under 2× `M`\)\. The bge numbers on FiQA at `d=256` should be read with that caveat\.
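The rule of thumb is plain arithmetic, so it is easy to check for any corpus before fitting\. The function names here are hypothetical; the corpus sizes are the ones quoted in this section\.

```python
def lift_dim(d):
    """Size of the degree-2 lift: M = (d + 1)(d + 2) / 2."""
    return (d + 1) * (d + 2) // 2

def samples_per_feature(n_docs, d):
    """How many corpus vectors there are per lifted feature (N / M)."""
    return n_docs / lift_dim(d)

def min_corpus_size(d, factor=5):
    """Smallest N that satisfies N >= factor * M."""
    return factor * lift_dim(d)

for name, n_docs in [("SciFact", 5183), ("FiQA", 57638)]:
    for d in (128, 256):
        print(f"{name} d={d}: N/M = {samples_per_feature(n_docs, d):.2f}")
```

A ratio below 1 means more unknowns than samples, which is the SciFact `d=256` situation described above\.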
## 8\. Compression — what to do with the residual
The autoencoder doesn’t compress anything by itself, strictly speaking\. It repackages a vector into a pair `(p, V_resid)`: latent code plus residual\. Same information, different shape\. To get actual compression, the residual has to be quantized\.
This is where poly\-AE’s anisotropy\-removing property matters\. On DBpedia\-OpenAI at `d=100`, `cond(V_resid)` drops from 72 to 3\.4, an almost isotropic residual\. Linear PCA at the same `d` gives `cond=4.21`, noticeably less isotropic\.
Google Research recently published [**TurboQuant**](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/), a quantizer with a fixed random rotation that **on isotropic distributions gets close to the theoretical compression limit** \(Shannon rate\-distortion for normals\)\. The composition is natural: the autoencoder strips out the anisotropy, TurboQuant efficiently quantizes the now\-isotropic residual\.
```
V ──► encoder (PCA) ──► p
                        │
               decoder (poly+Ridge) ──► V̂
                                        │
                                    V − V̂ ──► V_resid ──► TurboQuant ──► compact code
```
For Qdrant 1\.18 I shipped a TurboQuant modification with anisotropy compensation\. It achieves roughly the same recall as the “poly\-AE \+ residual quantization” hybrid, but without needing to store the poly\-AE latent coordinates — saving the p\-side of the per\-vector byte budget\. The tricks I layered on top of standard TurboQuant will be in a separate post\.
## 9\. Where the method came from and where it’s already used
I’ve been familiar with the polynomial lift since my time at [ACD/Labs](https://www.acdlabs.com/); in cheminformatics it’s a standard tool for building features from molecular structure \(as input for downstream regression\)\. Currently I work at Qdrant\. When I was implementing late\-interaction retrieval support there \([ColBERT\-style multi\-vectors with maxsim, v1\.10](https://github.com/qdrant/qdrant/releases/tag/v1.10.0)\), I remembered the trick, and that’s how the “PCA \+ quadratic decoder” idea came together\. I thought no one had tried it for neural embeddings yet\.
When the draft went to [Sava Kalbachou](https://www.linkedin.com/in/sava-kalbachou/) for review, he pointed out specific papers where this exact construction has long been used in numerical modeling of physical systems, under the name **quadratic manifold**:
- **Jain, Tiso, Rixen, Rutzmoser 2017**: [«A quadratic manifold for model order reduction of nonlinear structural dynamics»](https://arxiv.org/abs/1610.09902)\. Introduces the concept: PCA top\-d plus a quadratic correction\. The quadratic part is derived from the physics of the system, not from a data fit\. The earliest paper\.
- **Geelen, Wright, Willcox 2022**: [«Operator inference for non\-intrusive model reduction with quadratic manifolds»](https://arxiv.org/abs/2205.02304)\. **Exactly our construction**: PCA encoder, decoder on the Kronecker product of the latent code, coefficients fit via regularized least squares\. [Open\-source code](https://github.com/geelenr/quad_manifold)\.
- **Geelen, Balzano, Willcox 2023**: [«Learning latent representations in high\-dimensional state spaces using polynomial manifold constructions»](https://arxiv.org/abs/2306.13748)\. Explicitly named “polynomial manifold”, explicitly fit by regression\.
- **Schwerdtner, Peherstorfer 2024–2025**: a series of follow\-ups exploring training and application variants: [greedy direction selection](https://arxiv.org/abs/2403.06732), [sparse regression on the manifold](https://doi.org/10.1137/24M1717270), [streaming online learning](https://royalsocietypublishing.org/doi/10.1098/rspa.2024.0670), and [«Operator Inference Aware Quadratic Manifolds with Isotropic Reduced Coordinates»](https://arxiv.org/abs/2507.20463)\. The last one curiously echoes the residual\-isotropy story we use for TurboQuant in [§8](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#8-compression--what-to-do-with-the-residual)\.
In those papers the construction is applied to solution\-trajectory snapshots of physical\-system simulations\. Embeddings are a different data geometry, and it’s not obvious in advance that quadratic manifold should also work there; the numbers in [§3](https://ivanpleshkov.dev/blog/polynomial-autoencoder/#3-the-headline-table--four-models) show that it does\.
### Why this combination doesn’t seem to be in mainstream ML
These are just speculations\. The communities probably don’t overlap much: quadratic\-manifold work appears in engineering journals and rarely surfaces in ML research\.
Also, in ML’s collective memory polynomial methods are mostly associated with the kernel trick \(SVM, kernel PCA\) — a different construction, and the explicit polynomial lift as a stand\-alone technique rarely comes up after that\.
And an “autoencoder” is usually reflexively understood as a neural network in modern ML; closed\-form variants may simply fall outside the usual field of attention\.
So this is an applied\-ML post: a technique developed in an adjacent field turns out to be useful for embeddings as well\. The contribution of the post is the port and the empirical check\.
## 10\. Limits of the method
A few honest caveats:
- **Cubic Ridge solve\.** At `d=256`, `M = 33K`, and the closed\-form `np.linalg.solve` costs `O(M³)`, about 5–15 minutes on CPU\. At `d=384` that’s already `M = 74K` and `M³` is unmanageable\. The method is comfortable up to `d ≈ 200–256`\. For larger `d` you’d need randomized methods \(random feature approximation\), or move to a neural decoder\.
- **Transductive fit\.** PCA \+ Ridge are computed on the corpus we intend to compress\. If the corpus drifts over time, decompression degrades\. Fine for static indices, problematic for streaming\.
- **Small N\.** When `N < 5·M` \(i\.e\. roughly `N < 200K` at `d=256`\), Ridge starts memorizing the corpus, and in\-sample retrieval improves because of overfitting rather than structure\. The signal: a poly\-vs\-PCA gap that’s too large probably means overfit\.
- **Degree 3 doesn’t work yet\.** A degree\-3 lift gives `M = O(d³)`; for `d=100` that’s about 177K features\. Fits in memory, but the Ridge solve is impractical and overfitting climbs\.
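A back\-of\-envelope estimate makes the cubic wall visible\. The sketch assumes a dense float64 Gram matrix and the leading\-order `2M³/3` flop count of an LU\-based solve; actual wall\-clock time depends on the BLAS build and hardware, so treat the numbers as rough orders of magnitude\.

```python
from math import comb

def lift_dim(d, degree=2):
    """Number of monomials of total degree <= degree in d variables."""
    return comb(d + degree, degree)

def gram_gigabytes(m, bytes_per_float=8):
    """Memory for a dense (m, m) Gram matrix, in GB (1e9 bytes)."""
    return m * m * bytes_per_float / 1e9

def solve_flops(m):
    """Leading-order flop count of an LU-based dense solve, about 2 * m^3 / 3."""
    return 2 * m ** 3 / 3

for d in (128, 256, 384):
    m = lift_dim(d)
    print(f"d={d}: M={m}, Gram ~ {gram_gigabytes(m):.1f} GB, solve ~ {solve_flops(m):.1e} flops")

print("degree-3 lift at d=100:", lift_dim(100, degree=3), "features")
```

Going from `d=256` to `d=384` multiplies the flop count by more than 10, which is the practical reason the method is comfortable only up to `d ≈ 256`\.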
## 11\. What to try next
Obvious branches:
- **Bigger corpora, heavier models\.** All measurements in the post are on FiQA with models in the 305–560M\-param range\. Worth running on MS MARCO \(8\.8M passages\) with e5\-mistral\-7b\-class models\. The cubic Ridge becomes the bottleneck, so you’d need random\-feature approximation\.
- **Lift degree\.** At what `d/N` does degree 3 start to actually help? Are there embeddings with significant 3rd\-order tensor structure that we’re losing now?
- **The regime where you can’t compute matryoshka\.** Multi\-tenant SaaS, edge inference: there poly\-AE without PCA won’t work either \(it needs corpus stats\)\. But a **hybrid** \(matryoshka in the model \+ polynomial decoder on the client\) should give the best of both worlds\.
## Code and reproduction
Loaders for BEIR \(SciFact, FiQA, NFCorpus, TREC\-COVID\), encoders \(nomic\-v1\.5, mxbai\-large, bge\-base, e5\-base\), PCA\-init with per\-axis std, polynomial lift, Ridge OLS, retrieval with NDCG@10/Recall@10: all in [github\.com/IvanPleshkov/poly\-autoencoder](https://github.com/IvanPleshkov/poly-autoencoder)\.
Minimal run:
```
git clone https://github.com/IvanPleshkov/poly-autoencoder.git
cd poly-autoencoder
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt pytrec_eval einops
python beir_eval.py --model nomic-embed-text-v1.5 --dataset fiqa --d 128,256
```
On an M\-series MacBook the first run downloads the model and dataset, encodes the corpus \(5–15 min depending on the model\), and over the next 5–20 minutes prints a table with NDCG@10 for all four methods\.