Cross-Family Speculative Decoding for Polish Language Models on Apple Silicon: An Empirical Evaluation of Bielik 11B with UAG-Extended MLX-LM
Summary
This paper presents the first systematic evaluation of cross-family speculative decoding for Polish LLMs on Apple Silicon, extending MLX-LM with Universal Assisted Generation (UAG) to enable speculative decoding across mismatched tokenizers. It finds that context-aware token translation improves draft acceptance rates, but that unified-memory bandwidth limits keep the realized speedup below the theoretical one; the best configuration reaches a 1.7x throughput gain on structured text.
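The core mechanism in UAG-style cross-tokenizer decoding is translating draft tokens into the target model's vocabulary before verification. The sketch below shows the naive text round trip plus greedy prefix verification; the tokenizer names are placeholders, and the paper's context-aware variant additionally re-encodes with surrounding context to repair token-boundary splits, which this sketch omits.

```python
# Minimal sketch: cross-tokenizer draft translation, the step UAG adds on
# top of ordinary speculative decoding. Tokenizer names are placeholders;
# any draft/target pair with different vocabularies works the same way.
from transformers import AutoTokenizer

draft_tok = AutoTokenizer.from_pretrained("gpt2")              # draft vocabulary
target_tok = AutoTokenizer.from_pretrained("bert-base-cased")  # target vocabulary

def translate_draft_tokens(draft_ids: list[int]) -> list[int]:
    """Map draft-model token IDs into the target vocabulary by
    round-tripping through text, since the two vocabularies differ."""
    text = draft_tok.decode(draft_ids, skip_special_tokens=True)
    return target_tok.encode(text, add_special_tokens=False)

def verify(target_greedy_ids: list[int], translated_ids: list[int]) -> int:
    """Greedy verification: length of the longest prefix of translated
    draft tokens matching what the target model would have produced."""
    accepted = 0
    for d, t in zip(translated_ids, target_greedy_ids):
        if d != t:
            break
        accepted += 1
    return accepted
```

The acceptance rate the paper measures is, in these terms, the fraction of translated draft tokens that survive `verify` per speculation round.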
Similar Articles
@_avichawla: Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy) Speculative decoding is quite an effe…
Researchers introduced DFlash, a technique using block diffusion models for speculative decoding that accelerates LLM inference by up to 8.5x without accuracy loss. It is already integrated with major frameworks like vLLM and SGLang.
Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon
Metal-Sci introduces a 10-task benchmark for optimizing scientific computing kernels on Apple Silicon, paired with an evolutionary search framework driven by large language models. The study evaluates models like Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.5, demonstrating significant speedups while using out-of-distribution testing to catch silent performance regressions.
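As a rough illustration of the evolutionary loop such a framework runs, the sketch below mutates a kernel and keeps only measured improvements. `propose_variant` and `benchmark` are hypothetical stand-ins for an LLM rewrite step and a compile-and-time step, not Metal-Sci's actual API.

```python
# Schematic evolutionary kernel search: propose variants, measure, select.
import random

def propose_variant(kernel_src: str) -> str:
    # Placeholder mutation; a real system would prompt an LLM to rewrite
    # the Metal kernel source here.
    return kernel_src + f"\n// tweak {random.randint(0, 9999)}"

def benchmark(kernel_src: str) -> float:
    # Placeholder fitness; a real system compiles and times the kernel,
    # returning e.g. GFLOP/s (higher is better).
    return random.random()

def evolve(seed_kernel: str, generations: int = 20, pop: int = 8) -> str:
    best, best_score = seed_kernel, benchmark(seed_kernel)
    for _ in range(generations):
        for cand in (propose_variant(best) for _ in range(pop)):
            score = benchmark(cand)
            # A candidate replaces the incumbent only if it measures faster,
            # so regressions on the fitness metric are never kept; the
            # out-of-distribution tests described above guard correctness.
            if score > best_score:
                best, best_score = cand, score
    return best
```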
@AlexJonesax: Two open-source MLX inference servers worth knowing about if you run LLMs on Mac: MTPLX (@youssofal) Uses a model's own…
This article highlights two open-source MLX inference servers for Mac: MTPLX, which accelerates token generation using speculative decoding without a separate draft model, and oMLX, which improves workflow efficiency for coding agents through persistent KV caches.
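For the persistent-KV-cache pattern, a minimal sketch using mlx-lm's prompt-cache API follows; it assumes a recent mlx-lm release, and the model name is a placeholder. Reusing the cache means an agent's long context is prefilled once rather than on every request.

```python
# Minimal sketch: persistent prompt (KV) cache reuse with mlx-lm.
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# The cache persists across calls, so earlier turns' KV entries stay
# resident instead of being recomputed per request.
prompt_cache = make_prompt_cache(model)

for turn in ["Summarize this repository.", "Now list its entry points."]:
    # Only the new turn is encoded; prior context lives in the cache.
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": turn}], add_generation_prompt=True
    )
    reply = generate(model, tokenizer, prompt=prompt,
                     prompt_cache=prompt_cache, max_tokens=128)
    print(reply)
```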
LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language
The article introduces LLiMba, a 3B parameter model adapted from Qwen2.5 for Sardinian using continued pretraining and supervised fine-tuning on a single consumer GPU. It evaluates various LoRA configurations, finding that adapter capacity significantly impacts performance and factual accuracy in low-resource language adaptation.
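The "adapter capacity" the study varies corresponds to LoRA hyperparameters like those sketched below with Hugging Face PEFT; the rank, alpha, and target modules shown are illustrative defaults, not the paper's actual settings.

```python
# Illustrative LoRA configuration for a Qwen2.5-class 3B base model.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B")

lora = LoraConfig(
    r=16,                 # adapter rank: the "capacity" knob the study varies
    lora_alpha=32,        # scaling factor, commonly set to 2*r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # a tiny fraction of the 3B base weights
```

Raising `r` (and the set of `target_modules`) increases adapter capacity at the cost of memory, which is the trade-off that matters on a single consumer GPU.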
Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding
This paper identifies a new vulnerability in model-based speculative decoding for large language models: small input perturbations can reduce draft-token acceptance without affecting output quality, collapsing the acceleration. The authors propose Mistletoe, an attack that jointly optimizes acceptance degradation and semantic preservation, and demonstrate significant speedup reductions across several serving systems.
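Schematically, such an attack optimizes a joint objective: drive draft-token acceptance down while holding a semantic-preservation term up. The toy sketch below uses dummy tensors in place of model-derived quantities and is not Mistletoe's actual formulation.

```python
# Toy sketch of an acceleration-collapse objective: minimize acceptance,
# penalize semantic drift. A real attack would differentiate these terms
# through the draft and target models; here they are dummy tensors.
import torch

def attack_loss(acceptance_prob: torch.Tensor,
                semantic_sim: torch.Tensor,
                lam: float = 1.0) -> torch.Tensor:
    """Lower acceptance => less speedup; higher similarity => outputs
    (and hence stealth) are preserved. lam trades the two terms off."""
    return acceptance_prob.mean() - lam * semantic_sim.mean()

# Dummy values standing in for per-token acceptance probabilities and a
# semantic-similarity score between clean and perturbed outputs.
acc = torch.tensor([0.8, 0.7, 0.9], requires_grad=True)
sim = torch.tensor([0.95, 0.97, 0.96])
loss = attack_loss(acc, sim)
loss.backward()  # gradients would steer the perturbation search
```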