Cross-Family Speculative Decoding for Polish Language Models on Apple Silicon: An Empirical Evaluation of Bielik 11B with UAG-Extended MLX-LM
Summary
This paper presents the first systematic evaluation of cross-family speculative decoding for Polish LLMs on Apple Silicon, extending MLX-LM with Universal Assisted Generation (UAG) to enable speculative decoding across mismatched tokenizers. It finds that context-aware token translation improves draft acceptance rates, but that unified-memory bandwidth limits keep the realized speedup below the theoretical one; the best configuration reaches a 1.7x throughput gain on structured text.
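The core mechanism in UAG-style cross-tokenizer decoding is translating draft tokens into the target model's vocabulary before verification. The sketch below shows the naive text round trip plus greedy prefix verification; the tokenizer names are placeholders, and the paper's context-aware variant additionally re-encodes with surrounding context to repair token-boundary splits, which this sketch omits.

```python
# Minimal sketch: cross-tokenizer draft translation, the step UAG adds on
# top of ordinary speculative decoding. Tokenizer names are placeholders;
# any draft/target pair with different vocabularies works the same way.
from transformers import AutoTokenizer

draft_tok = AutoTokenizer.from_pretrained("gpt2")              # draft vocabulary
target_tok = AutoTokenizer.from_pretrained("bert-base-cased")  # target vocabulary

def translate_draft_tokens(draft_ids: list[int]) -> list[int]:
    """Map draft-model token IDs into the target vocabulary by
    round-tripping through text, since the two vocabularies differ."""
    text = draft_tok.decode(draft_ids, skip_special_tokens=True)
    return target_tok.encode(text, add_special_tokens=False)

def verify(target_greedy_ids: list[int], translated_ids: list[int]) -> int:
    """Greedy verification: length of the longest prefix of translated
    draft tokens matching what the target model would have produced."""
    accepted = 0
    for d, t in zip(translated_ids, target_greedy_ids):
        if d != t:
            break
        accepted += 1
    return accepted
```

The acceptance rate the paper measures is, in these terms, the fraction of translated draft tokens that survive `verify` per speculation round.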
Similar Articles
@_avichawla: Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy) Speculative decoding is quite an effe…
Researchers introduced DFlash, a technique using block diffusion models for speculative decoding that accelerates LLM inference by up to 8.5x without accuracy loss. It is already integrated with major frameworks like vLLM and SGLang.
Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon
Metal-Sci introduces a 10-task benchmark for optimizing scientific computing kernels on Apple Silicon, paired with an evolutionary search framework driven by large language models. The study evaluates models like Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.5, demonstrating significant speedups while using out-of-distribution testing to catch silent performance regressions.
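As a rough illustration of the evolutionary loop such a framework runs, the sketch below mutates a kernel and keeps only measured improvements. `propose_variant` and `benchmark` are hypothetical stand-ins for an LLM rewrite step and a compile-and-time step, not Metal-Sci's actual API.

```python
# Schematic evolutionary kernel search: propose variants, measure, select.
import random

def propose_variant(kernel_src: str) -> str:
    # Placeholder mutation; a real system would prompt an LLM to rewrite
    # the Metal kernel source here.
    return kernel_src + f"\n// tweak {random.randint(0, 9999)}"

def benchmark(kernel_src: str) -> float:
    # Placeholder fitness; a real system compiles and times the kernel,
    # returning e.g. GFLOP/s (higher is better).
    return random.random()

def evolve(seed_kernel: str, generations: int = 20, pop: int = 8) -> str:
    best, best_score = seed_kernel, benchmark(seed_kernel)
    for _ in range(generations):
        for cand in (propose_variant(best) for _ in range(pop)):
            score = benchmark(cand)
            # A candidate replaces the incumbent only if it measures faster,
            # so regressions on the fitness metric are never kept; the
            # out-of-distribution tests described above guard correctness.
            if score > best_score:
                best, best_score = cand, score
    return best
```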
@AlexJonesax: Two open-source MLX inference servers worth knowing about if you run LLMs on Mac: MTPLX (@youssofal) Uses a model's own…
This article highlights two open-source MLX inference servers for Mac: MTPLX, which accelerates token generation using speculative decoding without a separate draft model, and oMLX, which improves workflow efficiency for coding agents through persistent KV caches.
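For the persistent-KV-cache pattern, a minimal sketch using mlx-lm's prompt-cache API follows; it assumes a recent mlx-lm release, and the model name is a placeholder. Reusing the cache means an agent's long context is prefilled once rather than on every request.

```python
# Minimal sketch: persistent prompt (KV) cache reuse with mlx-lm.
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# The cache persists across calls, so earlier turns' KV entries stay
# resident instead of being recomputed per request.
prompt_cache = make_prompt_cache(model)

for turn in ["Summarize this repository.", "Now list its entry points."]:
    # Only the new turn is encoded; prior context lives in the cache.
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": turn}], add_generation_prompt=True
    )
    reply = generate(model, tokenizer, prompt=prompt,
                     prompt_cache=prompt_cache, max_tokens=128)
    print(reply)
```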
LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language
The article introduces LLiMba, a 3B parameter model adapted from Qwen2.5 for Sardinian using continued pretraining and supervised fine-tuning on a single consumer GPU. It evaluates various LoRA configurations, finding that adapter capacity significantly impacts performance and factual accuracy in low-resource language adaptation.
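The "adapter capacity" the study varies corresponds to LoRA hyperparameters like those sketched below with Hugging Face PEFT; the rank, alpha, and target modules shown are illustrative defaults, not the paper's actual settings.

```python
# Illustrative LoRA configuration for a Qwen2.5-class 3B base model.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B")

lora = LoraConfig(
    r=16,                 # adapter rank: the "capacity" knob the study varies
    lora_alpha=32,        # scaling factor, commonly set to 2*r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # a tiny fraction of the 3B base weights
```

Raising `r` (and the set of `target_modules`) increases adapter capacity at the cost of memory, which is the trade-off that matters on a single consumer GPU.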
Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding
This paper identifies a new vulnerability in model-based speculative decoding for large language models: small input perturbations can reduce draft-token acceptance without affecting output quality, collapsing the acceleration. The authors propose Mistletoe, an attack that jointly optimizes acceptance degradation and semantic preservation, and demonstrate significant speedup reductions across several serving systems.
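Schematically, such an attack optimizes a joint objective: drive draft-token acceptance down while holding a semantic-preservation term up. The toy sketch below uses dummy tensors in place of model-derived quantities and is not Mistletoe's actual formulation.

```python
# Toy sketch of an acceleration-collapse objective: minimize acceptance,
# penalize semantic drift. A real attack would differentiate these terms
# through the draft and target models; here they are dummy tensors.
import torch

def attack_loss(acceptance_prob: torch.Tensor,
                semantic_sim: torch.Tensor,
                lam: float = 1.0) -> torch.Tensor:
    """Lower acceptance => less speedup; higher similarity => outputs
    (and hence stealth) are preserved. lam trades the two terms off."""
    return acceptance_prob.mean() - lam * semantic_sim.mean()

# Dummy values standing in for per-token acceptance probabilities and a
# semantic-similarity score between clean and perturbed outputs.
acc = torch.tensor([0.8, 0.7, 0.9], requires_grad=True)
sim = torch.tensor([0.95, 0.97, 0.96])
loss = attack_loss(acc, sim)
loss.backward()  # gradients would steer the perturbation search
```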