This paper presents the first systematic evaluation of cross-family speculative decoding for Polish LLMs on Apple Silicon, extending MLX-LM with UAG to enable speculative decoding across mismatched tokenizers. It finds that context-aware token translation improves draft-token acceptance rates, but unified-memory bandwidth limits prevent the theoretical speedup from being realized in practice; the best configuration reaches a 1.7x throughput gain on structured text.
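Since the summary hinges on draft-token acceptance rates, a minimal sketch of the standard speculative-decoding acceptance rule may help (this is the generic rejection-sampling criterion, not the paper's cross-tokenizer variant; `speculative_step` and the toy distributions are illustrative names, not from the paper):

```python
import random

def speculative_step(draft_probs, target_probs, drafted, rng):
    """Verify tokens proposed by a draft model against the target model.

    Each drafted token t is accepted with probability
    min(1, p_target(t) / p_draft(t)); the first rejection ends the run,
    so longer accepted runs mean fewer target-model forward passes.
    """
    accepted = []
    for t in drafted:
        p_d = draft_probs[t]
        p_t = target_probs[t]
        if rng.random() < min(1.0, p_t / p_d):
            accepted.append(t)
        else:
            break  # rejected: target resamples from here in a full implementation
    return accepted

# Toy usage: when draft and target agree exactly, every token is accepted.
rng = random.Random(0)
probs = {0: 0.5, 1: 0.5}
print(speculative_step(probs, probs, [0, 1, 0], rng))
```

The acceptance rate (fraction of drafted tokens kept) directly bounds the speedup: when draft and target distributions diverge, such as across tokenizer families, runs terminate early and the draft work is wasted.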
This paper presents a comprehensive empirical evaluation of how large language models handle corrupted chain-of-thought reasoning steps, testing 13 models across 5 perturbation types (MathError, UnitConversion, Sycophancy, SkippedSteps, ExtraSteps) on mathematical reasoning tasks. The findings reveal heterogeneous vulnerability patterns, with implications for deploying LLMs in multi-stage reasoning pipelines.