Tag
The inaugural PyTorch Meetup Singapore brought together AI practitioners for technical talks on vLLM updates, sovereign intelligence, and open-source exchange.
The author reflects on how a senior peer went from following Andrew Ng's courses four years ago to publishing papers in top journals today, and cites a blog post that explains style transfer with a PyTorch implementation.
NVIDIA introduces the Vera CPU with a neural branch predictor to accelerate agentic AI and reinforcement learning workloads by reducing CPU execution time and increasing throughput in AI factories.
This blog post continues the series on profiling in PyTorch, exploring nn.Linear, MLP blocks, and fusion techniques that use Triton kernels to optimize performance.
TorchCodec 0.14 adds HDR video decoding for CPU and CUDA, along with a fast WAV decoder, enabling efficient conversion of video and audio data into PyTorch tensors for ML workflows.
Apple announced next-generation Siri AI features at WWDC 2026, including a custom Gemini-derived model and a new Core AI library with PyTorch integration, running on NVIDIA GPUs in Google Cloud within Private Cloud Compute.
This project decouples Alibaba DAMO Academy's ZipEnhancer noise reduction model from the ModelScope pipeline, rewrites the inference logic in pure PyTorch, and packages it as a FastAPI service. It supports FP16 half-precision inference and segmentation of long audio, and exposes API endpoints that allow switching between multiple noise reduction models.
NanoQuant is a flexible binary quantization method that compresses dense transformers to less than 1 bit per weight. This repository provides a PyTorch implementation that is still a work in progress and can quantize models such as Qwen3-0.6B and Qwen3-4B.
A curated guide to studying deep learning with PyTorch via a full YouTube live course series, covering topics from tensors to GANs, organized into six parts.
A beginner-friendly, hands-on GitHub repository that breaks down GPT-like LLM architecture into simple parts, with 10 Jupyter notebooks covering tokenization, attention, transformer blocks, and a mini GPT implementation in PyTorch.
Justin Angel released a complete YouTube workshop on building your own large language model from scratch, in the style of GPT-2 and Qwen3.6. It covers the Transformer architecture and the training pipeline, pairs manual walkthroughs in Excel with hands-on Python/PyTorch code, and requires no prior background in math or ML.
Helion is a Python DSL that compiles to optimized Triton code for performance-portable GPU kernels. This tutorial at PLDI 2026 covers Helion's architecture, autotuning, and CuteDSL backend.
The PyTorch Foundation project Helion is hosting a Helion DSL Tutorial at PLDI 2026 in Denver. It's an interactive workshop for compiler researchers, kernel authors, and ML systems engineers to write, autotune, and run Helion kernels.
A hands-on PyTorch curriculum that teaches LLM training from transformer basics through fine-tuning and alignment, including RLHF and GRPO.
The author shares lessons from building NeuralDBG, an open-source debugger for PyTorch training loops that detects localized failures like vanishing/exploding gradients by monitoring per-layer gradient norm transitions instead of global loss. Practical code snippets and community questions are included.
A beginner-friendly guide to using PyTorch's torch.profiler for profiling and optimizing neural network operations, starting with matrix multiplication and bias addition. It explains how to read profiler traces and understand CPU/GPU interactions.
EAGLE 3.1, the next evolution of speculative decoding, introduces new FC normalization for improved efficiency, developed by EagleCorp in collaboration with PyTorch, vLLM, and TorchSpec.
This post from NVIDIA explains how to use the NVIDIA Model Optimizer library to quantize a CLIP model to FP8 using post-training quantization, reducing VRAM usage and improving inference performance on consumer GPUs.
Meta open-sources TLX Block Attention, a warp-specialized Triton kernel that achieves a 2.3x speedup for block-diagonal self-attention on NVIDIA Blackwell GPUs, and up to a 3.5x speedup when fused with rotary embeddings.
Thermocompute is a PyTorch emulator for thermodynamic probabilistic computing. It models neural network layers whose inference takes constant modeled physical time by exploiting a parallel thermodynamic substrate, and it provides stochastic layers that can be used on GPUs immediately.