Tag
A blog post describing a tiny compiler that demonstrates how to lower data-parallel kernels by converting for loops into vectorized loops with lanes and masks, implemented in ~180 lines of Python.