Built an AI Accelerator and open-sourced it. [P]

Reddit r/MachineLearning 05/31/26, 12:58 PM Tools

ai-accelerator open-source fpga attention-mechanism risc-v hardware-acceleration bf16

Summary

The author open-sourced a custom AI accelerator (atik) implemented on FPGA with native BF16 and attention support, demonstrating significant speedups over PyTorch for various models.

There is a huge gap in open-source AI accelerators, so I implemented [mine](https://github.com/AhmedZeer/atik). The popular, well-known ones are already legacy and don't support contemporary operations like attention. Here is what makes mine special:

* **Attention** mechanism smelted directly into silicon
* Prototyped end-to-end on **FPGA** (AWS F2)
* Benchmarked against **PyTorch**-based workloads
* Built on the **RocketChip** architecture (RISC-V)
* Native **BF16** support
* Up to **225×** speedup on vanilla attention mechanism
* Up to **96×** speedup on TinyBERT
* Up to **50×** speedup on ViT
* Up to **30×** speedup on GPT-2 prefill

I would really appreciate it if you check the [repo](https://github.com/AhmedZeer/atik) and give me feedback!
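For readers less familiar with the two core terms in the post, here is a minimal pure-Python sketch of vanilla scaled dot-product attention and of BF16 rounding. It is a software reference only and is not taken from the atik repo: the function names `to_bf16`, `attention`, and `attention_bf16` are hypothetical, and quantizing only the inputs and outputs while accumulating in full precision is an assumption made for illustration, not a description of how the accelerator handles precision.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round a float to the nearest BF16 value (round-to-nearest-even).

    BF16 keeps the top 16 bits of an FP32 word: 1 sign, 8 exponent, 7 mantissa.
    NaN and infinity are not handled specially in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add just under half of the dropped range, plus 1 if the kept LSB is odd,
    # so that exact ties round to the even neighbour.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d)) V on row-major lists of lists."""
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q_row in q:
        # Scaled dot product of this query with every key.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        # Numerically stable softmax: subtract the max before exponentiating.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value rows.
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return out


def attention_bf16(q, k, v):
    """Attention with inputs and outputs rounded to BF16 (assumed scheme)."""
    quantize = lambda m: [[to_bf16(x) for x in row] for row in m]
    return quantize(attention(quantize(q), quantize(k), quantize(v)))
```

Comparing `attention` and `attention_bf16` on the same inputs gives a rough feel for how much error BF16's 7-bit mantissa introduces, which is the kind of check one would want when validating accelerator output against a PyTorch baseline.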

Similar Articles

AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization

Hugging Face Daily Papers

AccelOpt is a self-improving LLM agentic system that autonomously optimizes AI accelerator kernels through iterative generation and optimization memory, achieving 49-61% peak throughput improvements on AWS Trainium while being 26x cheaper than Claude Sonnet 4.

A hackable compiler to generate efficient fused GPU kernels for AI models [P]

Reddit r/MachineLearning

The author presents a custom, hackable ML compiler written in Python that lowers LLMs to optimized CUDA kernels through a multi-stage IR pipeline, achieving performance competitive with or superior to PyTorch on specific operations. The article details the compiler's optimization passes, lowering rules, and CLI usage for generating efficient fused GPU kernels.

OpenAI and Broadcom announce strategic collaboration to deploy 10 gigawatts of OpenAI-designed AI accelerators

OpenAI Blog

OpenAI and Broadcom announced a multi-year strategic collaboration to co-develop and deploy 10 gigawatts of custom AI accelerators and networking systems, with deployment beginning in mid-2026 and completion by end of 2029. This partnership enables OpenAI to design accelerators that embed learnings from frontier model development directly into hardware.

Wrote a custom C++ engine for MiniCPM-V 4.6 on Orange Pi AIPro (Ascend 310B) to bypass framework overhead

Reddit r/LocalLLaMA

Developed a custom C++ inference engine for MiniCPM-V 4.6 on Orange Pi AIPro (Ascend 310B NPU), achieving 2x speedup over stock framework by writing optimized AscendC kernels for matmul and causal-conv1d, reaching 5.90 tokens/s.

I've created the fastest local AI engine for Apple Silicon. Optimised for agentic use.