Tag
OpenAI co-founder Andrej Karpathy released llm.c, an open-source guide to training LLMs from scratch with simple code that runs on any hardware, including CPUs and MacBooks, and is 7% faster than standard approaches.
Mimo 2.5 demonstrates fast performance with large context windows using dual RTX Pro 6000 GPUs.
Modal introduces Auto Endpoints, a self-serve service for optimized, production-grade LLM inference with full code ownership, transparent metrics, and autoscaling, built on their serverless GPU infrastructure.
NVIDIA technology now powers over 400 of the world's 500 fastest supercomputers (81% of the TOP500), with record GPU and networking adoption and top efficiency on the Green500 list.
Discusses multi-tier caching strategies for MoE models to improve inference speed by keeping frequently activated experts on GPU, referencing existing implementations like PowerInfer and llama.cpp branches.
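As a rough illustration of the idea of keeping frequently activated experts on the GPU, the following toy sketch simulates a small GPU tier with least-recently-used eviction and reports the hit rate for a routing trace. It is not how PowerInfer or any llama.cpp branch implements caching; the `ExpertCache` class, the `hit_rate` helper, and the routing trace are all hypothetical names and data made up for this example.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy GPU tier for MoE experts with LRU eviction (hypothetical sketch)."""

    def __init__(self, gpu_slots):
        self.gpu_slots = gpu_slots
        # expert_id -> placeholder weights, ordered from least to most recently used
        self.gpu = OrderedDict()
        self.hits = 0
        self.misses = 0

    def fetch(self, expert_id):
        if expert_id in self.gpu:
            # Expert already resident on the "GPU": mark it as recently used.
            self.gpu.move_to_end(expert_id)
            self.hits += 1
            return self.gpu[expert_id]
        # Miss: a real system would copy the weights from CPU RAM or disk here.
        self.misses += 1
        if len(self.gpu) >= self.gpu_slots:
            self.gpu.popitem(last=False)  # evict the least recently used expert
        self.gpu[expert_id] = f"weights-{expert_id}"
        return self.gpu[expert_id]


def hit_rate(routing_trace, gpu_slots):
    """Fraction of expert activations served from the GPU tier."""
    cache = ExpertCache(gpu_slots)
    for expert_id in routing_trace:
        cache.fetch(expert_id)
    return cache.hits / len(routing_trace)


# Made-up trace in which experts 0 and 1 are activated far more often than 2 and 3.
trace = [0, 1, 0, 1, 2, 0, 1, 3, 0, 1]
for slots in (2, 3):
    print(slots, hit_rate(trace, slots))
```

In this trace, a single extra GPU slot is enough to keep both hot experts resident, which is the effect such caching schemes aim for.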
SpaceX reportedly signs a $6.3 billion computing deal with Reflection AI, securing access to Nvidia GB300 GPUs at the Colossus cluster in Memphis through 2029.
Using TurboQuant, a user reached 20 tokens per second on a Qwen 3.6 35B MoE model running on a GTX 1060 3GB, a strong result for outdated hardware.
A detailed technical comparison of two dominant LLM serving frameworks, SGLang and vLLM, covering architectural differences in KV cache management (RadixAttention vs PagedAttention), throughput, latency, and deployment considerations for self-hosted environments.
Speed test results for GLM-5.2 running on llama.cpp with RTX 5090 and RTX 3090 Ti, showing prefill speeds up to 579 t/s at 8k context and decode at ~10.6 t/s.
The author built Prompt-Chain, a Streamlit app that chains a small prompter model and a large coder model with automatic VRAM swapping, enabling efficient code generation on an 8GB GPU.
JPMorgan's ASIC industry report predicts that custom AI chips are entering a golden cycle, names Broadcom and Marvell as the biggest beneficiaries, and expects AI ASIC shipments to surpass GPU shipments for the first time by 2027.
Comparison of inference engine performance on different hardware: moving from a baseline to vLLM with TP=2 on 2x RTX 3090s raises throughput from ~14.5 tok/s to ~64 tok/s, and moving to SGLang on an RTX PRO 6000 raises it from ~32 tok/s to ~110 tok/s. Recommends vLLM/SGLang for CUDA/multi-GPU setups and llama.cpp for edge devices.
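For a sense of scale, the quoted throughput figures can be turned into speedup factors with simple division. The snippet below is only arithmetic on the approximate numbers in the summary; the `speedup` helper is a hypothetical name introduced for this example.

```python
def speedup(before_tok_s, after_tok_s):
    """Ratio of new to old throughput, both in tokens per second."""
    return after_tok_s / before_tok_s


# Approximate figures quoted in the summary above.
dual_3090 = speedup(14.5, 64.0)   # baseline -> vLLM with TP=2 on 2x RTX 3090
pro_6000 = speedup(32.0, 110.0)   # baseline -> SGLang on RTX PRO 6000

print(f"2x RTX 3090:  {dual_3090:.1f}x")
print(f"RTX PRO 6000: {pro_6000:.1f}x")
```

Because the inputs are rounded estimates, the results (roughly 4.4x and 3.4x) should be read as ballpark values.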
A detailed comparison of local AI hardware in terms of memory capacity, bandwidth, and software stack, covering GPUs, Apple Silicon, AMD, Intel, Tenstorrent, and others, with a focus on what bottlenecks matter for AI inference.
Discussion about upcoming AMD GPU offerings and their potential for building an LLM rig, asking the community for build suggestions.
MSI's RTX 5090 GPU operates at 475-500W for inference or training, with a warning about cable bending.
A tweet promoting the Qwen 3.6 27B model and recommending UnslothAI for running it on any GPU.
A market observation that experience with GPUs and local AI will be highly sought after by employers.
The LQ50 and LQ50-24GB are priced at around $1200, indicating a mid-range AI hardware offering.
Explains the communication model for multi-GPU systems, covering the trade-off between latency and bandwidth, and compares MST and Ring algorithms for collective operations like broadcast.
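The latency and bandwidth trade-off can be sketched with the common alpha-beta cost model, where sending n bytes over one link takes alpha + n * beta. The formulas below are textbook assumptions rather than ones taken from the linked article: a minimum-spanning-tree (MST) broadcast is modeled as ceil(log2 p) full-message rounds, and the ring variant as a pipelined broadcast that splits the message into chunks. The function names and the alpha and beta values are hypothetical.

```python
def mst_broadcast_time(p, n_bytes, alpha, beta):
    """MST (binomial tree) broadcast: ceil(log2 p) rounds, full message each round."""
    rounds = (p - 1).bit_length()  # integer ceil(log2 p) for p >= 1
    return rounds * (alpha + n_bytes * beta)


def ring_broadcast_time(p, n_bytes, alpha, beta, chunks):
    """Pipelined ring broadcast for p >= 2: p - 2 + chunks steps, one chunk per step."""
    steps = p - 2 + chunks
    return steps * (alpha + (n_bytes / chunks) * beta)


# Illustrative values only: 10 microseconds of latency, 1 ns per byte (about 1 GB/s).
ALPHA, BETA, P = 1e-5, 1e-9, 8

small = 1024            # 1 KiB: latency dominates
large = 1_000_000_000   # 1 GB: bandwidth dominates

print("small, MST :", mst_broadcast_time(P, small, ALPHA, BETA))
print("small, ring:", ring_broadcast_time(P, small, ALPHA, BETA, chunks=1))
print("large, MST :", mst_broadcast_time(P, large, ALPHA, BETA))
print("large, ring:", ring_broadcast_time(P, large, ALPHA, BETA, chunks=1000))
```

Under these assumptions the tree wins for small messages because it needs fewer latency-bound steps, while the pipelined ring wins for large messages because each link carries the data only about once.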
Announces cuTile Rust, which brings Rust's fearless concurrency to GPU kernel programming. The accompanying paper, 'Fearless Concurrency on the GPU', is now on arXiv.