@sumitdotml: week 25, 2026: cpu tensor core basics (add/mul, reduce, stride, 2d matmul, etc.) in c, reading some arcee


Summary

The author shares progress on building a CPU-only tensor library in C, covering basics like add/mul, reduce, strides, and 2D matmul, along with insights from reading Arcee's technical blogs on foundation models.

week 25, 2026: cpu tensor core basics (add/mul, reduce, stride, 2d matmul, etc.) in c, reading some arcee https://t.co/ITIQ5I1xG7

Cached at: 06/22/26, 07:51 PM



Week 25, 2026 | sumit.ml

Source: https://sumit.ml/weekly/2026-w25/

This week, I moved from hand-written matrix exercises into a tiny tensor runtime. It is now tensor-library-ish: still CPU-only and float32 only, but that was kind of the point for now, as I wanted to get comfortable with allocation, shape metadata, strides, indexing, and correctness before adding CUDA. I also did some mild skimming of Arcee’s blog posts.

CPU Tensor Basics

The tensor in my library now carries the pieces I wanted to make explicit:

float *data;
size_t ndim;
size_t shape[TT_TENSOR_MAX_DIMS];
size_t strides[TT_TENSOR_MAX_DIMS];
size_t numel;

The stride idea materialized more inside my head this week, and I feel great about this.

For instance, for a tensor shaped [2, 3, 4], the row-major strides are [12, 4, 1], so one position becomes:

[1, 2, 3] -> 1 * 12 + 2 * 4 + 3 * 1 = 23
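
The same arithmetic can be sketched in C. This is only an illustration under assumptions: the function names `tt_row_major_strides` and `tt_flat_offset` are hypothetical (the post does not show the library's actual API), and the value of `TT_TENSOR_MAX_DIMS` is a guess. Strides are counted in elements, not bytes.

```c
#include <assert.h>
#include <stddef.h>

#define TT_TENSOR_MAX_DIMS 8 /* assumed cap; the post does not give the value */

/* Fill strides[] with row-major strides (in elements) for shape[].
 * The last dimension gets stride 1; each earlier dimension's stride is the
 * product of all later dimension sizes. Hypothetical helper name. */
static void tt_row_major_strides(const size_t *shape, size_t ndim, size_t *strides)
{
    assert(ndim <= TT_TENSOR_MAX_DIMS);
    size_t step = 1;
    for (size_t i = ndim; i-- > 0;) {
        strides[i] = step;
        step *= shape[i];
    }
}

/* Map a multidimensional index to a flat element offset:
 * offset = sum over i of index[i] * strides[i]. Hypothetical helper name. */
static size_t tt_flat_offset(const size_t *index, const size_t *strides, size_t ndim)
{
    size_t offset = 0;
    for (size_t i = 0; i < ndim; i++) {
        offset += index[i] * strides[i];
    }
    return offset;
}
```

Calling `tt_row_major_strides` on the shape [2, 3, 4] and then `tt_flat_offset` on the index [1, 2, 3] reproduces the strides [12, 4, 1] and the offset 23 from the example above.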

I also added the first CPU helpers around this:

  • flat and multidimensional get/set
  • add/mul
  • sum/max/mean
  • 2D matmul
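
To give a feel for what the elementwise and reduction helpers in this list might look like, here is a minimal sketch. It assumes contiguous float32 buffers of equal length and ignores strides entirely; the function names are hypothetical and not taken from the actual library.

```c
#include <assert.h>
#include <stddef.h>

/* Elementwise add: out[i] = a[i] + b[i] over n contiguous elements. */
static void tt_add_f32(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

/* Elementwise multiply: out[i] = a[i] * b[i] over n contiguous elements. */
static void tt_mul_f32(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] * b[i];
    }
}

/* Full reduction: sum of all n elements (0.0f for an empty buffer). */
static float tt_sum_f32(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += x[i];
    }
    return acc;
}

/* Full reduction: largest element. Requires at least one element. */
static float tt_max_f32(const float *x, size_t n)
{
    assert(n > 0);
    float best = x[0];
    for (size_t i = 1; i < n; i++) {
        if (x[i] > best) {
            best = x[i];
        }
    }
    return best;
}

/* Full reduction: arithmetic mean. Requires at least one element. */
static float tt_mean_f32(const float *x, size_t n)
{
    assert(n > 0);
    return tt_sum_f32(x, n) / (float)n;
}
```

Working on flat buffers keeps the sketch short; a version that honours arbitrary strides would have to walk the shape and stride arrays instead of a single index.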

I kept the matmul intentionally plain for now:

[m, k] @ [k, n] -> [m, n]
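
A plain triple-loop version of that shape contract could look like the sketch below. It assumes all three matrices are contiguous row-major float32 buffers and that the output does not alias either input; the name `tt_matmul_2d_f32` is hypothetical, not the library's real function.

```c
#include <assert.h>
#include <stddef.h>

/* c[m, n] = a[m, k] @ b[k, n], all contiguous row-major float32 buffers.
 * Naive O(m * k * n) loop: no blocking, batching, or broadcasting. */
static void tt_matmul_2d_f32(const float *a, const float *b, float *c,
                             size_t m, size_t k, size_t n)
{
    assert(a != c && b != c); /* output must not overwrite an input */
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++) {
                /* row i of a, column j of b */
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}
```

For a = [[1, 2, 3], [4, 5, 6]] and b = [[7, 8], [9, 10], [11, 12]], this yields [[58, 64], [139, 154]].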

Basically no batching or broadcasting yet; I will keep working on these steadily from here. That said, my primary objective was simply to get more used to writing C so that reading the PMPP book involves less friction, and it has already served that purpose pretty well. As I make more progress with the book, I will start adding CUDA kernels to this repo, along with more advanced CPU ops.


On Technical Paper Reads

Got pretty darn engrossed in C again this week, not too proud of this. However, I did manage to read a couple of Arcee’s technical writings:

The blog progression of the good people at Arcee is pretty interesting: it starts off with enterprise-focused model fine-tuning techniques, then slowly transitions towards building their own foundation models, first experimenting at a smaller scale with SLMs and then progressing to larger ones with the Trinity family.

It gives me a super nice chronological view of how Arcee did things, and studying it should give me lots of valuable knowledge imo.


Next Week

  • start full-tensor softmax in C.
  • resume the PMPP book, ideally complete chapter 3 (might be tough, but effort shall be made)
  • Arcee blogs + if time is kinder, complete 2-3 passes for the Trinity Large technical report
  • some personal commitments (career-related)
