Benchmark results for running Qwen 3.6 27B on AMD MI50 GPUs using a custom vLLM fork, achieving 52.8 tokens/s text generation (TG) and 1569 tokens/s prompt processing (PP) without quantization or MTP, demonstrating usability for agentic tasks on 2018 hardware.
A Stanford lecture on AI inference emphasizes practical bottlenecks like KV-cache memory and techniques like speculative decoding and continuous batching, offering more real-world insight than typical ML courses.
The article benchmarks the compile-time cost of C++26 reflection for enum-to-string conversion against C++17 libraries and X-macro preprocessor techniques using GCC 16.
The author analyzes 262,715 Stack Overflow questions to identify common regex pain points and demonstrates how their new regex engine, RE#, solves these issues using complement and intersection operations.
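The pain point RE# targets is easiest to see side by side. A minimal sketch in Python: the standard-engine workaround uses a negative lookahead, while the commented RE# form assumes `&` (intersection) and `~` (complement) operators, since the summary does not show the engine's actual syntax.

```python
import re

# A classic Stack Overflow pain point: match strings that contain
# "foo" but do NOT contain "bar". Standard engines need a negative
# lookahead contortion:
pattern = re.compile(r"^(?!.*bar).*foo.*$")

assert pattern.match("xx foo xx")
assert not pattern.match("xx foo xx bar")

# With intersection and complement as first-class operations (the
# RE# features the article highlights; the `&` / `~` syntax here is
# an assumption), the same constraint reads as plain set algebra:
#
#   .*foo.*  &  ~(.*bar.*)
#
# i.e. (contains foo) AND NOT (contains bar), no lookaheads needed.
```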
This article analyzes unexpected memory allocation costs in Tokio's mpsc channels in Rust, revealing a fixed overhead per channel due to internal block sizing. It demonstrates how this impacts large-scale applications like Agent Gateway and suggests alternatives like futures-channel for memory efficiency.
GPT-5.5 achieved the first solve on the difficult ProgramBench SWE benchmark, significantly outperforming Opus 4.7.
A user running multiple agents reports that after upgrading to GPT-5.5, the model became noticeably worse at executing tool calls and more prone to offering suggestions instead of acting; the user speculates OpenAI may be throttling capability for load management.
A user benchmarks token generation speed on llama.cpp with the GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 environment variable set, comparing performance with and without MTP (Multi-Token Prediction). Results show a significant speedup, from 49 tok/s to 64 tok/s (~1.3x), when MTP is enabled on an RTX 5090 with a Qwen3.6-27B model.
User @binsquares reports that GPU acceleration on smolvm achieves nearly 90% of host performance when running llama.cpp via the Vulkan backend.
Eldar Kurtic presents a comprehensive study on TurboQuant, revealing its real-world effects on accuracy, latency, and throughput beyond initial evaluations.
Luce DFlash has achieved a 10-15% speedup by implementing per-layer K/V truncation in the draft graph for sliding-window attention (SWA) layers.
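For readers unfamiliar with the optimization, here is a toy sketch of why SWA layers admit per-layer truncation (illustrative semantics only, not the actual patch): a sliding-window layer never attends beyond the last `window` positions, so its K/V cache can stay capped while global-attention layers keep full history.

```python
# Illustrative sketch: per-layer K/V truncation for SWA layers.
from collections import deque

class LayerKVCache:
    def __init__(self, window: int | None):
        # window=None models a global-attention layer (keep everything);
        # an int models an SWA layer capped at `window` entries.
        self.kv = deque(maxlen=window)

    def append(self, k, v):
        self.kv.append((k, v))  # deque silently drops entries past the window

global_layer = LayerKVCache(window=None)
swa_layer = LayerKVCache(window=4)
for pos in range(10):
    global_layer.append(f"k{pos}", f"v{pos}")
    swa_layer.append(f"k{pos}", f"v{pos}")

print(len(global_layer.kv))  # 10 -- full history
print(len(swa_layer.kv))     # 4  -- memory and attention work stay bounded
```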
Python 3.15 introduces Tachyon, a statistical profiler exposed as the profiling.sampling module; it periodically samples stack snapshots with overhead low enough for both development and production use.
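The summary does not show Tachyon's API, but the mechanism it describes is easy to sketch: a background thread periodically snapshots every thread's stack and tallies which frames appear most often, so hot code dominates the histogram while the profiled program runs unmodified. A toy illustration in plain Python (not the profiling.sampling interface):

```python
# Toy statistical profiler: sample stacks on a timer, count frames.
import sys
import threading
import time
from collections import Counter

samples: Counter[str] = Counter()

def sampler(interval: float = 0.01, duration: float = 2.0) -> None:
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for frame in sys._current_frames().values():
            # Tally the innermost function on each thread's stack.
            code = frame.f_code
            samples[f"{code.co_name} ({code.co_filename}:{frame.f_lineno})"] += 1
        time.sleep(interval)  # the target code runs untouched between samples

def busy() -> None:
    total = 0
    for i in range(20_000_000):
        total += i * i

t = threading.Thread(target=sampler, daemon=True)
t.start()
busy()
t.join()
print(samples.most_common(3))  # hot functions dominate the sample counts
```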
fc is an open-source lossless compressor for IEEE-754 64-bit double streams, achieving better compression ratios than zstd and fpzip on structured data, at the cost of slower encoding.
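The summary doesn't describe fc's internals, but compressors in this class often build on the XOR-with-predecessor transform popularized by Facebook's Gorilla paper; whether fc uses it is an assumption. The trick: neighboring doubles in a structured stream share sign, exponent, and high mantissa bits, so the XOR residuals are mostly zero and entropy-code well, while decoding stays exactly lossless because XOR is invertible.

```python
# Sketch of the Gorilla-style XOR predictor (fc's actual algorithm
# is not described in the summary; this is the generic technique).
import struct

def xor_residuals(values: list[float]) -> list[int]:
    """XOR each double's bit pattern with its predecessor's.

    Smooth, structured streams yield residuals with long runs of zero
    bits, which a backend entropy coder squeezes far better than the
    raw IEEE-754 words.
    """
    prev = 0
    out = []
    for v in values:
        bits = struct.unpack("<Q", struct.pack("<d", v))[0]
        out.append(bits ^ prev)
        prev = bits
    return out

smooth = [100.0 + 0.001 * i for i in range(5)]
for r in xor_residuals(smooth):
    print(f"{r:064b}")  # long all-zero prefixes after the first value
```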
MTPLX v0.3 is released: a native runtime for Apple Silicon that uses Multi-Token Prediction (MTP) to roughly double decode speed while preserving the model's output distribution via Leviathan-Chen acceptance sampling.
This article discusses the decision to revert the incremental garbage collection feature in Python 3.14 and 3.15.
This technical article explains how to use Erlang's :counters and :atomics modules for high-performance counting and shared mutable state outside the standard process isolation model. It covers atomic operations like add_get, exchange, and compare_exchange (compare-and-swap) within the BEAM runtime.
A user benchmarked MTP (Multi-Token Prediction) on Gemma 4 with mlx-vlm on an M4 Max Studio, finding it excellent for code generation (1.53x faster, 66% acceptance), detrimental for JSON output (50% slower, only 8% acceptance), and neutral for long-form prose. They suggest MTP's benefits vanish once acceptance drops below roughly 50%.
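That ~50% rule of thumb lines up with the standard expected-speedup model for speculative decoding (Leviathan et al.). A back-of-envelope sketch; the draft depth `k` and relative draft cost `c` below are assumed values, not numbers from the post:

```python
def expected_speedup(a: float, k: int = 2, c: float = 0.3) -> float:
    """Standard speculative-decoding model: with k draft tokens per step
    and i.i.d. per-token acceptance rate a, one verification pass emits
    (1 - a**(k+1)) / (1 - a) tokens on average; drafting costs k*c extra."""
    tokens_per_pass = (1 - a ** (k + 1)) / (1 - a)
    return tokens_per_pass / (1 + k * c)

# The post's three regimes: code (66%), the claimed break-even (~50%),
# and JSON (8%). k and c are assumptions, so read this as shape, not fit.
for a in (0.66, 0.50, 0.08):
    print(f"acceptance {a:>4.0%}: ~{expected_speedup(a):.2f}x")
# acceptance  66%: ~1.31x  -> a real win, same direction as the 1.53x for code
# acceptance  50%: ~1.09x  -> hovering near break-even
# acceptance   8%: ~0.68x  -> slower than plain decoding, as seen for JSON
```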
A developer achieved 80+ t/s inference on Qwen3.6-27B with 262K context on a single RTX 4090 by combining MTP (Multi-Token Prediction) with TurboQuant's lossless KV-cache compression, sharing their implementation fork and technical details.
React Doctor v2 is an open-source CLI tool that analyzes React codebases for performance issues, bad patterns, unnecessary re-renders, and broken architecture. It supports Next.js, Vite, and React Native and can be run instantly via npx.
The article explains how the -ncmoe flag in llama.cpp improves performance for MoE models like Qwen3.6 35B A3B on limited VRAM (8-12GB) by keeping some layers' expert weights in CPU RAM, with benchmarks showing up to a 5x speedup on an RTX 3070 Ti.
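A rough back-of-envelope for why evicting expert weights specifically is the right trade; every size below is an illustrative assumption, not a measurement of Qwen3.6 35B A3B:

```python
# Why -ncmoe targets expert weights: in an MoE layer only a few experts
# fire per token, so expert tensors are huge in bytes but see little
# per-token traffic -- the cheapest thing to park in CPU RAM. All
# numbers below are illustrative assumptions, not measured values.
layers             = 48      # assumed layer count
experts_gb_per_lyr = 0.55    # assumed expert (FFN) weights per layer, GB
other_gb_per_lyr   = 0.10    # assumed attention/shared weights per layer, GB
vram_budget_gb     = 8.0     # e.g. an RTX 3070 Ti (KV-cache headroom ignored)

total_gb = layers * (experts_gb_per_lyr + other_gb_per_lyr)
overflow_gb = total_gb - vram_budget_gb
# Each layer passed to -ncmoe moves only its experts off the GPU:
n_cpu_moe = max(0, -(-overflow_gb // experts_gb_per_lyr))  # ceiling division
print(f"model ~{total_gb:.1f} GB vs {vram_budget_gb:.0f} GB VRAM "
      f"-> try -ncmoe {int(n_cpu_moe)}")
# Attention, norms, and the KV cache stay on the GPU where bandwidth
# matters; only the sparsely-activated experts cross the PCIe bus.
```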