gpu-inference

#gpu-inference

Qwen3.6-35B-A3B APEX on a Single RTX 3090 - Getting the Most Out of It

Reddit r/LocalLLaMA ↗ · 2026-06-22

A detailed guide to running the Qwen3.6-35B-A3B APEX model on a single RTX 3090, comparing two llama.cpp forks and quantization methods for the best balance of speed and quality.

@Akashi203: i open-sourced automegakernel -- compiles any huggingface model into a single persistent megakernel batch-1 decode is b…

X AI KOLs Timeline ↗ · 2026-06-17

AutoMegaKernel is an open-source agent harness that compiles any HuggingFace model into a single persistent megakernel, fusing the entire forward pass into one GPU launch to reduce launch overhead. It achieves up to a 1.33x speedup over CUDA-graphed cuBLAS on inference-class GPUs such as the L4 and L40S, and proves that its schedules are free of deadlocks and races.

@PyTorch: ExecuTorch now has an MLX delegate that runs PyTorch models on Apple Silicon GPUs. It supports LLMs, speech-to-text, an…

X AI KOLs Following ↗ · 2026-05-18

ExecuTorch now has an MLX delegate that enables GPU-accelerated inference for PyTorch models on Apple Silicon Macs, supporting LLMs, speech-to-text, and MoE models with quantization via TorchAO.

qwen3.6 just stops

Reddit r/LocalLLaMA ↗ · 2026-05-13

A user reports that the Qwen 3.6 model stops mid-task when served via vLLM under a particular Docker and speculative-decoding configuration.

Benchmark Qwen 3.6 27B MTP on 2x3090 NVLINK

Reddit r/LocalLLaMA ↗ · 2026-05-08

A benchmark of Qwen 3.6 27B MTP on 2x RTX 3090 GPUs, showing that using NVLink for tensor parallelism improves throughput by up to 53% over PCIe configurations.

@anyscalecompute: In this session, you'll learn: - Build and scale data pipelines with Ray - What is video data curation - Stream large d…

X AI KOLs Following ↗ · 2026-05-07

Anyscale is hosting a hands-on virtual lab session teaching developers how to build and scale data pipelines with Ray, covering video data curation, distributed GPU inference, and CPU/GPU streaming pipelines.

@iotcoi: Qwen3.6-27B-FP8 + Dflash + DDTree, 256k context, 10 agents ~200 tokens/sec max decode 136t/s average on a single tiny G…

X AI KOLs Timeline ↗ · 2026-04-22

The FP8-quantized Qwen3.6-27B model reaches roughly 200 tok/s peak decode (136 tok/s average) with 256k context and 10 agents on a single 49W GB10 GPU, using the Dflash and DDTree optimizations.

@ProTekkFZS: Q4_K_M 3.6 35B at 768k with yarn on my 3090 has been a joy, I can't lie. Using the llama.cpp fork from @no_stp_on_snek …

X AI KOLs Following ↗ · 2026-04-20

A user reports successfully running a 35B-parameter mixture-of-experts model at 768K context length on an RTX 3090, using Q4_K_M quantization and YaRN via a llama.cpp fork and offloading only 8 experts to CPU while maintaining acceptable performance.
