@raphaelsrty: Computing max similarity (scoring step of colbert, colpali) on gpus can be optimized and this is what @tonywu_71 did. I…

X AI KOLs Following 06/10/26, 02:50 PM Tools

multi-vector-models colbert colpali maximum-similarity triton-kernels pylate gpu-optimization

Summary

Tony Wu released late-interaction-kernels (LIK): fused Triton kernels for MaxSim, the scoring step behind ColBERT and ColPali, integrated into PyLate and colpali-engine, offering memory efficiency and performance gains.

Original Article

Cached at: 06/10/26, 03:54 PM

Computing max similarity (scoring step of colbert, colpali) on gpus can be optimized and this is what @tonywu_71 did.

It’s available in PyLate, it will accelerate both training and inference of multi-vector models

pip install "pylate[lik]"

so cool, from @tonywu_71 and @Aurelien_L_

Tony Wu (@tonywu_71): Very excited to release late-interaction-kernels (LIK): fused Triton kernels for MaxSim, the scoring step behind ColBERT, ColPali & LateOn. 🚀

Numerically equivalent to PyTorch at a fraction of the memory, with day-0 support in PyLate & colpali-engine. (1/N 🧵)
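For readers new to MaxSim, here is a minimal sketch in plain Python of what the scoring step computes: for each query token vector, take the largest dot product against any document token vector, then sum over query tokens. It is not LIK's Triton code. The function names (`dot`, `maxsim_naive`, `maxsim_streaming`) and the `tile` parameter are hypothetical and chosen only for illustration. The streaming variant shows one common way a fused kernel can save memory, by keeping a running maximum per query token instead of materializing the full query-by-document similarity matrix. The post does not say whether LIK works exactly this way.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_naive(query, doc):
    """Reference MaxSim: build the full similarity matrix, then reduce.

    Memory grows with len(query) * len(doc).
    """
    sim = [[dot(q, d) for d in doc] for q in query]
    return sum(max(row) for row in sim)


def maxsim_streaming(query, doc, tile=2):
    """Same score, computed one tile of document tokens at a time.

    Only one running maximum per query token is kept (assumed strategy,
    for illustration), so the full similarity matrix is never stored.
    """
    running_max = [float("-inf")] * len(query)
    for start in range(0, len(doc), tile):
        block = doc[start:start + tile]
        for i, q in enumerate(query):
            block_best = max(dot(q, d) for d in block)
            if block_best > running_max[i]:
                running_max[i] = block_best
    return sum(running_max)


# Toy example: 2 query tokens and 3 document tokens, embedding dimension 2.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.2]]

print(maxsim_naive(query, doc))      # 1.5
print(maxsim_streaming(query, doc))  # 1.5
```

Both functions return the same score because the maximum over all document tokens equals the maximum of the per-tile maxima. A GPU kernel would apply the same reduction in parallel over tiles held in fast on-chip memory.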

Similar Articles

@ErikKaum: Releasing my first kernel on @huggingface: MaxSim Late-interaction retrieval (ColBERT / PyLate) bottlenecks on material…

X AI KOLs Following

Releases a kernel on Hugging Face that accelerates MaxSim late-interaction retrieval by using tiled scoring with SIMD group matrix operations (Metal and WMMA), achieving 3–5× speedup over the naive implementation.

@antoine_chaffin: Whether you are GPU poor or GPU rich, today's release of PyLate has something for you! GPU maxxers: MaxSim kernels grea…

X AI KOLs Following

The release of PyLate introduces MaxSim kernels for GPU-accelerated training with lower memory requirements and TACHIOM for fast multi-vector indexing and search on CPU.

@bo_wangbo: okay maybe it's a good time? We have a small colbert model trained at pplx, it is a continue-training of pplx-embed-0.6…

X AI KOLs Following

Perplexity AI releases pplx-embed-v1-late-0.6b, a small ColBERT late-interaction embedding model for retrieval, fine-tuned from their existing embedding model and optimized for MaxSim scoring, now open-source on HuggingFace.

@leopardracer: https://x.com/leopardracer/status/2055341758523883631

X AI KOLs Timeline

A user shares their experience setting up a dual-GPU local AI lab with RTX 4080 Super and 5060 Ti, running Qwen 3.6 models via llama.cpp and llama-swap to reduce API costs and enable unrestricted experimentation.

2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all the available compute.

Reddit r/LocalLLaMA

Packed Twin Inference (PTI) is a technique that achieves ~2× LLM throughput by running multiple token sequences in a single batch decode, exploiting weight sharing in llama.cpp without needing a draft model or additional VRAM.
