Built an AI Accelerator and open-sourced it. [P]
Summary
The author open-sourced a custom AI accelerator (atik), implemented on an FPGA with native BF16 and attention support, and reports significant speedups over PyTorch for various models.
Similar Articles
AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
AccelOpt is a self-improving LLM agentic system that autonomously optimizes AI accelerator kernels through iterative generation and optimization memory, achieving 49-61% peak throughput improvements on AWS Trainium while being 26x cheaper than Claude Sonnet 4.
A hackable compiler to generate efficient fused GPU kernels for AI models [P]
The author presents a custom, hackable ML compiler written in Python that lowers LLMs to optimized CUDA kernels through a multi-stage IR pipeline, achieving performance competitive with or superior to PyTorch on specific operations. The article details the compiler's optimization passes, lowering rules, and CLI usage for generating efficient fused GPU kernels.
OpenAI and Broadcom announce strategic collaboration to deploy 10 gigawatts of OpenAI-designed AI accelerators
OpenAI and Broadcom announced a multi-year strategic collaboration to co-develop and deploy 10 gigawatts of custom AI accelerators and networking systems, with deployment beginning in mid-2026 and completion by end of 2029. This partnership enables OpenAI to design accelerators that embed learnings from frontier model development directly into hardware.
Wrote a custom C++ engine for MiniCPM-V 4.6 on Orange Pi AIPro (Ascend 310B) to bypass framework overhead
The author developed a custom C++ inference engine for MiniCPM-V 4.6 on the Orange Pi AIPro (Ascend 310B NPU), achieving a 2x speedup over the stock framework by writing optimized AscendC kernels for matmul and causal-conv1d and reaching 5.90 tokens/s.
I've created the fastest local AI engine for Apple Silicon. Optimised for agentic use.
The author announces the release of 'lightning-mlx', a local AI engine optimized for Apple Silicon that achieves high token speeds for coding agents and tool-calling workflows.