@raphaelsrty: Computing max similarity (scoring step of colbert, colpali) on gpus can be optimized and this is what @tonywu_71 did. I…
Summary
Tony Wu released late-interaction-kernels (LIK): fused Triton kernels for MaxSim, the scoring step behind ColBERT and ColPali. The kernels are integrated into PyLate and colpali-engine and are described as numerically equivalent to PyTorch at a fraction of the memory.
Cached at: 06/10/26, 03:54 PM
Computing max similarity (the scoring step of ColBERT and ColPali) on GPUs can be optimized, and this is what @tonywu_71 did.
It’s available in PyLate and will accelerate both training and inference of multi-vector models
pip install "pylate[lik]"
so cool, from @tonywu_71 and @Aurelien_L_
Tony Wu (@tonywu_71): Very excited to release late-interaction-kernels (LIK): fused Triton kernels for MaxSim, the scoring step behind ColBERT, ColPali & LateOn. 🚀
Numerically equivalent to PyTorch at a fraction of the memory, with day-0 support in PyLate & colpali-engine. (1/N 🧵)
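For readers new to MaxSim, here is a minimal pure-Python sketch of the score itself: for each query token embedding, take the highest dot product against any document token embedding, then sum those maxima over the query tokens. This is a reference illustration only, not the LIK kernels and not the PyLate or colpali-engine API; the names `maxsim`, `query`, `doc_a`, and `doc_b` are made up for the example.

```python
def maxsim(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) score between one query and one document.

    query_vecs: list of query token embeddings (each a list of floats)
    doc_vecs:   list of document token embeddings (same dimension)
    """
    score = 0.0
    for q in query_vecs:
        # Best-matching document token for this query token (max dot product).
        best = max(sum(qi * di for qi, di in zip(q, d)) for d in doc_vecs)
        score += best
    return score


# Toy 2-dimensional embeddings, hypothetical values chosen for illustration.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0], [0.0, 0.5]]

print(maxsim(query, doc_a))  # 1.5
print(maxsim(query, doc_b))  # 1.0
```

Note that the loop keeps only a running maximum per query token and never stores the full query-token by document-token similarity matrix. A straightforward batched tensor implementation typically does build that matrix before reducing it, which is one plausible reason a fused kernel could use less memory; the thread does not spell out the mechanism, so treat that as an assumption.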
Similar Articles
@ErikKaum: Releasing my first kernel on @huggingface: MaxSim Late-interaction retrieval (ColBERT / PyLate) bottlenecks on material…
Releases a kernel on Hugging Face that accelerates MaxSim late-interaction retrieval by using tiled scoring with SIMD group matrix operations (Metal and WMMA), achieving 3–5× speedup over the naive implementation.
@antoine_chaffin: Whether you are GPU poor or GPU rich, today's release of PyLate has something for you! GPU maxxers: MaxSim kernels grea…
The release of PyLate introduces MaxSim kernels for GPU-accelerated training with lower memory requirements and TACHIOM for fast multi-vector indexing and search on CPU.
@bo_wangbo: okay maybe it's a good time? We have a small colbert model trained at pplx, it is a continue-training of pplx-embed-0.6…
Perplexity AI releases pplx-embed-v1-late-0.6b, a small ColBERT late-interaction embedding model for retrieval, fine-tuned from their existing embedding model and optimized for MaxSim scoring, now open-source on Hugging Face.
@leopardracer: https://x.com/leopardracer/status/2055341758523883631
A user shares their experience setting up a dual-GPU local AI lab with RTX 4080 Super and 5060 Ti, running Qwen 3.6 models via llama.cpp and llama-swap to reduce API costs and enable unrestricted experimentation.
2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all the available compute.
Packed Twin Inference (PTI) is a technique that achieves ~2× LLM throughput by running multiple token sequences in a single batch decode, exploiting weight sharing in llama.cpp without needing a draft model or additional VRAM.