Tag
A detailed guide on running the Qwen3.6-35B-A3B APEX model on an RTX 3090, comparing two llama.cpp forks and quantization methods for optimal speed and quality.
AutoMegaKernel is an open-source agent harness that compiles any HuggingFace model into a single persistent megakernel, fusing the entire forward pass into one GPU launch to reduce overhead. It achieves up to 1.33x speedup over CUDA-graphed cuBLAS on inference-class GPUs like L4 and L40S, while proving schedules deadlock- and race-free.
ExecuTorch now has an MLX delegate that enables GPU-accelerated inference for PyTorch models on Apple Silicon Macs, supporting LLMs, speech-to-text, and MoE models with quantization via TorchAO.
A user reports an issue where the Qwen 3.6 model stops mid-task when served via vLLM with specific Docker and speculative decoding configurations.
A benchmark analysis of Qwen 3.6 27B MTP on 4x RTX 3090 GPUs, demonstrating that using NVLink for tensor parallelism yields significant throughput improvements (up to +53%) over PCIe configurations.
Anyscale is hosting a hands-on virtual lab session teaching developers how to build and scale data pipelines with Ray, covering video data curation, distributed GPU inference, and CPU/GPU streaming pipelines.
Quantized 27B Qwen3.6 model achieves 200 tok/s peak (136 avg) with 256k context and 10 agents on a single 49W GB10 GPU using Dflash+DDTree optimizations.
User reports successfully running a 35B-parameter mixture-of-experts model at 768K context length using Q4_K_M quantization and YaRN on an RTX 3090 via a llama.cpp fork, offloading only 8 experts to CPU while maintaining acceptable performance.