Benchmark results for running Qwen 3.6 27B on AMD MI50 GPUs using a custom vLLM fork, achieving 52.8 tokens/s text generation (TG) and 1569 tokens/s prompt processing (PP) without quantization or MTP, demonstrating usability for agentic tasks on 2018 hardware.
A Stanford lecture on AI inference emphasizes practical bottlenecks like KV-cache memory and techniques like speculative decoding and continuous batching, offering more real-world insight than typical ML courses.
The article benchmarks the compile-time cost of C++26 reflection for enum-to-string conversion against C++17 libraries and X-macro preprocessor techniques using GCC 16.
The author analyzes 262,715 Stack Overflow questions to identify common regex pain points and demonstrates how their new regex engine, RE#, solves these issues using complement and intersection operations.
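The pain point RE# targets is easiest to see side by side. A minimal sketch in Python: the standard-engine workaround uses a negative lookahead, while the commented RE# form assumes `&` (intersection) and `~` (complement) operators, since the summary does not show the engine's actual syntax.

```python
import re

# A classic Stack Overflow pain point: match strings that contain
# "foo" but do NOT contain "bar". Standard engines need a negative
# lookahead contortion:
pattern = re.compile(r"^(?!.*bar).*foo.*$")

assert pattern.match("xx foo xx")
assert not pattern.match("xx foo xx bar")

# With intersection and complement as first-class operations (the
# RE# features the article highlights; the `&` / `~` syntax here is
# an assumption), the same constraint reads as plain set algebra:
#
#   .*foo.*  &  ~(.*bar.*)
#
# i.e. (contains foo) AND NOT (contains bar), no lookaheads needed.
```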
This article analyzes unexpected memory allocation costs in Tokio's mpsc channels in Rust, revealing a fixed overhead per channel due to internal block sizing. It demonstrates how this impacts large-scale applications like Agent Gateway and suggests alternatives like futures-channel for memory efficiency.
GPT-5.5 achieved the first solve on the difficult ProgramBench SWE benchmark, significantly outperforming Opus 4.7.
A user running multiple agents reports that after upgrading to GPT-5.5, the model became noticeably worse at executing tool calls and more prone to offering suggestions instead of acting; the user speculates OpenAI may be throttling capability for load management.
A user benchmarks token generation speed on llama.cpp with the GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 environment variable set, comparing performance with and without MTP (Multi-Token Prediction). Results show a significant speedup, from 49 tok/s to 64 tok/s (~1.3x), when MTP is enabled on an RTX 5090 with a Qwen3.6-27B model.
User @binsquares reports that GPU acceleration on smolvm achieves nearly 90% of host performance when running llama.cpp via the Vulkan backend.
Eldar Kurtic presents a comprehensive study on TurboQuant, revealing its real-world effects on accuracy, latency, and throughput beyond initial evaluations.
Luce DFlash has achieved a 10-15% speedup by implementing per-layer K/V truncation in the draft graph for sliding-window attention (SWA) layers.
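For readers unfamiliar with the optimization, here is a toy sketch of why SWA layers admit per-layer truncation (illustrative semantics only, not the actual patch): a sliding-window layer never attends beyond the last `window` positions, so its K/V cache can stay capped while global-attention layers keep full history.

```python
# Illustrative sketch: per-layer K/V truncation for SWA layers.
from collections import deque

class LayerKVCache:
    def __init__(self, window: int | None):
        # window=None models a global-attention layer (keep everything);
        # an int models an SWA layer capped at `window` entries.
        self.kv = deque(maxlen=window)

    def append(self, k, v):
        self.kv.append((k, v))  # deque silently drops entries past the window

global_layer = LayerKVCache(window=None)
swa_layer = LayerKVCache(window=4)
for pos in range(10):
    global_layer.append(f"k{pos}", f"v{pos}")
    swa_layer.append(f"k{pos}", f"v{pos}")

print(len(global_layer.kv))  # 10 -- full history
print(len(swa_layer.kv))     # 4  -- memory and attention work stay bounded
```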
Python 3.15 introduces Tachyon, a statistical profiler exposed as the profiling.sampling module; it periodically samples stack snapshots with overhead low enough for both development and production use.
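The summary does not show Tachyon's API, but the mechanism it describes is easy to sketch: a background thread periodically snapshots every thread's stack and tallies which frames appear most often, so hot code dominates the histogram while the profiled program runs unmodified. A toy illustration in plain Python (not the profiling.sampling interface):

```python
# Toy statistical profiler: sample stacks on a timer, count frames.
import sys
import threading
import time
from collections import Counter

samples: Counter[str] = Counter()

def sampler(interval: float = 0.01, duration: float = 2.0) -> None:
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for frame in sys._current_frames().values():
            # Tally the innermost function on each thread's stack.
            code = frame.f_code
            samples[f"{code.co_name} ({code.co_filename}:{frame.f_lineno})"] += 1
        time.sleep(interval)  # the target code runs untouched between samples

def busy() -> None:
    total = 0
    for i in range(20_000_000):
        total += i * i

t = threading.Thread(target=sampler, daemon=True)
t.start()
busy()
t.join()
print(samples.most_common(3))  # hot functions dominate the sample counts
```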
fc is an open-source lossless compressor for IEEE-754 64-bit double streams, achieving better compression ratios than zstd and fpzip on structured data, at the cost of slower encoding.
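The summary doesn't describe fc's internals, but compressors in this class often build on the XOR-with-predecessor transform popularized by Facebook's Gorilla paper; whether fc uses it is an assumption. The trick: neighboring doubles in a structured stream share sign, exponent, and high mantissa bits, so the XOR residuals are mostly zero and entropy-code well, while decoding stays exactly lossless because XOR is invertible.

```python
# Sketch of the Gorilla-style XOR predictor (fc's actual algorithm
# is not described in the summary; this is the generic technique).
import struct

def xor_residuals(values: list[float]) -> list[int]:
    """XOR each double's bit pattern with its predecessor's.

    Smooth, structured streams yield residuals with long runs of zero
    bits, which a backend entropy coder squeezes far better than the
    raw IEEE-754 words.
    """
    prev = 0
    out = []
    for v in values:
        bits = struct.unpack("<Q", struct.pack("<d", v))[0]
        out.append(bits ^ prev)
        prev = bits
    return out

smooth = [100.0 + 0.001 * i for i in range(5)]
for r in xor_residuals(smooth):
    print(f"{r:064b}")  # long all-zero prefixes after the first value
```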
MTPLX v0.3 is released: a native runtime for Apple Silicon that uses Multi-Token Prediction (MTP) to roughly double decode speed while preserving the model's output distribution via Leviathan-Chen acceptance sampling.
This article discusses the decision to revert the incremental garbage collection feature in Python 3.14 and 3.15.
This technical article explains how to use Erlang's :counters and :atomics modules for high-performance counting and shared mutable state outside the standard process isolation model. It covers atomic operations like add_get, exchange, and compare_exchange (compare-and-swap) within the BEAM runtime.
A user benchmarked MTP (Multi-Token Prediction) on Gemma 4 with mlx-vlm on an M4 Max Studio, finding it excellent for code generation (1.53x faster, 66% acceptance), detrimental for JSON output (50% slower, only 8% acceptance), and neutral for long-form prose. They suggest MTP's benefits vanish once acceptance drops below roughly 50%.
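That ~50% rule of thumb lines up with the standard expected-speedup model for speculative decoding (Leviathan et al.). A back-of-envelope sketch; the draft depth `k` and relative draft cost `c` below are assumed values, not numbers from the post:

```python
def expected_speedup(a: float, k: int = 2, c: float = 0.3) -> float:
    """Standard speculative-decoding model: with k draft tokens per step
    and i.i.d. per-token acceptance rate a, one verification pass emits
    (1 - a**(k+1)) / (1 - a) tokens on average; drafting costs k*c extra."""
    tokens_per_pass = (1 - a ** (k + 1)) / (1 - a)
    return tokens_per_pass / (1 + k * c)

# The post's three regimes: code (66%), the claimed break-even (~50%),
# and JSON (8%). k and c are assumptions, so read this as shape, not fit.
for a in (0.66, 0.50, 0.08):
    print(f"acceptance {a:>4.0%}: ~{expected_speedup(a):.2f}x")
# acceptance  66%: ~1.31x  -> a real win, same direction as the 1.53x for code
# acceptance  50%: ~1.09x  -> hovering near break-even
# acceptance   8%: ~0.68x  -> slower than plain decoding, as seen for JSON
```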
A developer achieved 80+ t/s inference on Qwen3.6-27B with 262K context on a single RTX 4090 by combining MTP (Multi-Token Prediction) with TurboQuant's lossless KV-cache compression, sharing their implementation fork and technical details.
React Doctor v2 is an open-source CLI tool that analyzes React codebases for performance issues, bad patterns, unnecessary re-renders, and broken architecture. It supports Next.js, Vite, and React Native and can be run instantly via npx.
The article explains how the -ncmoe flag in llama.cpp improves performance for MoE models like Qwen3.6 35B A3B on limited VRAM (8-12GB) by keeping some layers' expert weights in CPU RAM, with benchmarks showing up to a 5x speedup on an RTX 3070 Ti.
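A rough back-of-envelope for why evicting expert weights specifically is the right trade; every size below is an illustrative assumption, not a measurement of Qwen3.6 35B A3B:

```python
# Why -ncmoe targets expert weights: in an MoE layer only a few experts
# fire per token, so expert tensors are huge in bytes but see little
# per-token traffic -- the cheapest thing to park in CPU RAM. All
# numbers below are illustrative assumptions, not measured values.
layers             = 48      # assumed layer count
experts_gb_per_lyr = 0.55    # assumed expert (FFN) weights per layer, GB
other_gb_per_lyr   = 0.10    # assumed attention/shared weights per layer, GB
vram_budget_gb     = 8.0     # e.g. an RTX 3070 Ti (KV-cache headroom ignored)

total_gb = layers * (experts_gb_per_lyr + other_gb_per_lyr)
overflow_gb = total_gb - vram_budget_gb
# Each layer passed to -ncmoe moves only its experts off the GPU:
n_cpu_moe = max(0, -(-overflow_gb // experts_gb_per_lyr))  # ceiling division
print(f"model ~{total_gb:.1f} GB vs {vram_budget_gb:.0f} GB VRAM "
      f"-> try -ncmoe {int(n_cpu_moe)}")
# Attention, norms, and the KV cache stay on the GPU where bandwidth
# matters; only the sparsely-activated experts cross the PCIe bus.
```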