@sumitdotml: week 25, 2026: cpu tensor core basics (add/mul, reduce, stride, 2d matmul, etc.) in c, reading some arcee
Summary
The author shares progress on building a CPU-only tensor library in C, covering basics like add/mul, reduce, strides, and 2D matmul, along with insights from reading Arcee's technical blogs on foundation models.
Cached at: 06/22/26, 07:51 PM
Week 25, 2026 | sumit.ml
Source: https://sumit.ml/weekly/2026-w25/
This week, I moved from hand-written matrix exercises into a tiny tensor runtime. This is now tensor-library-ish: still CPU-only and float32 only, but that kinda was the point for now, as I wanted to get comfortable with allocation, shape metadata, strides, indexing, and correctness before adding CUDA. Also did some mild skimming of Arcee’s blog posts.
CPU Tensor Basics
The tensor in my library now carries the pieces I wanted to make explicit:
float *data;
size_t ndim;
size_t shape[TT_TENSOR_MAX_DIMS];
size_t strides[TT_TENSOR_MAX_DIMS];
size_t numel;
The stride idea materialized more inside my head this week, and I feel great about this.
For instance, for a tensor shaped [2, 3, 4], the row-major strides are [12, 4, 1], so one position becomes:
[1, 2, 3] -> 1 * 12 + 2 * 4 + 3 * 1 = 23
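The arithmetic above can be sketched in C roughly as follows. This is a minimal sketch, not the library's actual code: the function names `row_major_strides` and `flat_offset` are hypothetical, and strides are assumed to be counted in elements rather than bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fill strides[] with row-major strides (in elements).
   The last dimension gets stride 1; each earlier one is the product of the
   sizes of all dimensions after it. */
void row_major_strides(const size_t *shape, size_t ndim, size_t *strides) {
    size_t step = 1;
    for (size_t d = ndim; d-- > 0;) {
        strides[d] = step;
        step *= shape[d];
    }
}

/* Hypothetical helper: map a multidimensional index to a flat offset
   by summing index[d] * strides[d] over all dimensions. */
size_t flat_offset(const size_t *shape, const size_t *strides,
                   size_t ndim, const size_t *index) {
    size_t offset = 0;
    for (size_t d = 0; d < ndim; d++) {
        assert(index[d] < shape[d]); /* bounds check, debug builds only */
        offset += index[d] * strides[d];
    }
    return offset;
}
```

For shape [2, 3, 4] this yields strides [12, 4, 1], and index [1, 2, 3] lands on offset 23, which is also the last element (numel - 1).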
I also added the first CPU helpers around this:
- flat and multidimensional get/set
- add/mul
- sum/max/mean
- 2D matmul
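For a sense of what the elementwise and reduction helpers might look like, here is a minimal sketch over contiguous float32 buffers. The names (`add_f32`, `mul_f32`, `sum_f32`, `max_f32`, `mean_f32`) and signatures are assumptions for illustration, not the library's API, and the sketch reduces over the whole buffer rather than along an axis.

```c
#include <assert.h>
#include <stddef.h>

/* Elementwise add: out[i] = a[i] + b[i] for n contiguous elements. */
void add_f32(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = a[i] + b[i];
}

/* Elementwise multiply: out[i] = a[i] * b[i]. */
void mul_f32(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = a[i] * b[i];
}

/* Full reduction: sum of all n elements (0 for an empty buffer). */
float sum_f32(const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) acc += x[i];
    return acc;
}

/* Full reduction: largest element; undefined for n == 0, hence the assert. */
float max_f32(const float *x, size_t n) {
    assert(n > 0);
    float best = x[0];
    for (size_t i = 1; i < n; i++) {
        if (x[i] > best) best = x[i];
    }
    return best;
}

/* Full reduction: arithmetic mean, built on sum_f32. */
float mean_f32(const float *x, size_t n) {
    assert(n > 0);
    return sum_f32(x, n) / (float)n;
}
```

Because these loops walk the buffer with a flat index, they only apply as written to contiguous tensors; a strided view would need the offset computation per element.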
Kept the matmul intentionally plain for now:
[m, k] @ [k, n] -> [m, n]
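A plain version of this shape contract could look like the sketch below. It is the textbook triple loop, assuming both inputs and the output are contiguous and row-major; `matmul_2d_f32` is a hypothetical name, not necessarily what the library uses.

```c
#include <assert.h>
#include <stddef.h>

/* Naive 2D matmul: c[m, n] = a[m, k] @ b[k, n].
   All three buffers are assumed contiguous and row-major, and c must not
   alias a or b. */
void matmul_2d_f32(const float *a, const float *b, float *c,
                   size_t m, size_t k, size_t n) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            /* Dot product of row i of a with column j of b. */
            for (size_t p = 0; p < k; p++) {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}
```

As a quick check, multiplying the 2x3 matrix [[1, 2, 3], [4, 5, 6]] by the 3x2 matrix [[7, 8], [9, 10], [11, 12]] gives [[58, 64], [139, 154]].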
Basically no batching or broadcasting yet; will continue to work on these steadily from here. That said, my primary objective with this was simply to get more used to writing C so that I could reduce some of the friction while reading the PMPP book, so this has already served pretty well. As I make more progress with the book, I will start adding CUDA kernels to this repo in addition to doing more advanced CPU ops.
On Technical Paper Reads
Got pretty darn engrossed in C again this week, not too proud of this. However, I did manage to read a couple of Arcee’s technical writings:
- Arcee 4.5B deepdive
- Extending AFM-4.5B to 64k context length (this is reaaaaaally interesting) and
- The Trinity Manifesto
The blog progression of the good people at Arcee is pretty interesting in that it starts off with enterprise-focused model fine-tuning techniques and then slowly transitions towards building their own foundation models, first experimenting at a smaller scale with SLMs and then progressing to larger ones with the Trinity family.
Gives me a super nice chronological view of how Arcee did things, and studying it should teach me a lot imo.
Next Week
- start full-tensor softmax in C.
- resume the PMPP book, ideally complete chapter 3 (might be tough, but effort shall be made)
- Arcee blogs + if time is kinder, complete 2-3 passes for the Trinity Large technical report
- some personal commitments (career-related)