@sumitdotml: week 25, 2026: cpu tensor core basics (add/mul, reduce, stride, 2d matmul, etc.) in c, reading some arcee

X AI KOLs Timeline 06/22/26, 11:04 AM News

tensor cpu c-programming machine-learning arcee personal-project

Summary

The author shares progress on building a CPU-only tensor library in C, covering basics like add/mul, reduce, strides, and 2D matmul, along with insights from reading Arcee's technical blogs on foundation models.

week 25, 2026: cpu tensor core basics (add/mul, reduce, stride, 2d matmul, etc.) in c, reading some arcee https://t.co/ITIQ5I1xG7


Week 25, 2026 | sumit.ml

Source: https://sumit.ml/weekly/2026-w25/

This week, I moved from hand-written matrix exercises into a tiny tensor runtime. This is now a tensor library-ish: still CPU-only and float32-only, but that kinda was the point for now as I wanted to get comfortable with allocation, shape metadata, strides, indexing, and correctness before adding CUDA. Also did some mild skimming of Arcee’s blog posts.

CPU Tensor Basics

The tensor in my library now carries the pieces I wanted to make explicit:

float *data;
size_t ndim;
size_t shape[TT_TENSOR_MAX_DIMS];
size_t strides[TT_TENSOR_MAX_DIMS];
size_t numel;

The stride idea materialized more inside my head this week, and I feel great about this.

For instance, for a tensor shaped [2, 3, 4], the row-major strides are [12, 4, 1], so one position becomes:

[1, 2, 3] -> 1 * 12 + 2 * 4 + 3 * 1 = 23
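
The same arithmetic can be sketched in C. This is only an illustration of the row-major rule above, not the library's actual code: the function names tt_row_major_strides and tt_flat_offset are hypothetical, and strides are assumed to be counted in elements rather than bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fill strides[] with row-major (C-order) strides,
 * measured in elements. The last dimension gets stride 1, and each
 * earlier dimension's stride is the product of all later extents. */
void tt_row_major_strides(const size_t *shape, size_t ndim, size_t *strides)
{
    size_t step = 1;
    for (size_t i = ndim; i-- > 0;) {
        strides[i] = step;
        step *= shape[i];
    }
}

/* Hypothetical helper: map a multidimensional index to a flat element
 * offset by summing index[i] * strides[i] over all dimensions. */
size_t tt_flat_offset(const size_t *index, const size_t *strides, size_t ndim)
{
    assert(index != NULL && strides != NULL);
    size_t offset = 0;
    for (size_t i = 0; i < ndim; i++)
        offset += index[i] * strides[i];
    return offset;
}
```

With shape [2, 3, 4], the first helper yields [12, 4, 1], and index [1, 2, 3] maps to offset 23, matching the hand calculation.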

I also added the first CPU helpers around this:

flat and multidimensional get/set
add/mul
sum/max/mean
2D matmul
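
As a rough idea of what the elementwise and reduction helpers in this list might look like, here is a minimal sketch. It assumes contiguous float32 buffers of equal length (so numel is all that is needed), and every function name here is made up for illustration; the post does not show the real signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical elementwise add: out[i] = a[i] + b[i] for n elements.
 * Assumes same-shaped, contiguous inputs (no broadcasting). */
void tt_add_f32(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Hypothetical full reduction: sum of all n elements. */
float tt_sum_f32(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Hypothetical full reduction: largest of n elements (n must be > 0). */
float tt_max_f32(const float *x, size_t n)
{
    assert(n > 0);
    float best = x[0];
    for (size_t i = 1; i < n; i++) {
        if (x[i] > best)
            best = x[i];
    }
    return best;
}

/* Hypothetical full reduction: arithmetic mean (n must be > 0). */
float tt_mean_f32(const float *x, size_t n)
{
    assert(n > 0);
    return tt_sum_f32(x, n) / (float)n;
}
```

A mul helper would have the same shape as the add, with `*` in place of `+`.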

The matmul is intentionally plain for now:

[m, k] @ [k, n] -> [m, n]
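
A plain version of that shape contract could look like the sketch below. It is the textbook triple loop over row-major contiguous float32 buffers, written under my own assumptions; the name tt_matmul_2d_f32 and its signature are hypothetical, not taken from the repo.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical naive 2D matmul: c[m, n] = a[m, k] @ b[k, n].
 * All three buffers are assumed row-major and contiguous, so
 * a[i][p] lives at a[i * k + p] and b[p][j] at b[p * n + j]. */
void tt_matmul_2d_f32(const float *a, const float *b, float *c,
                      size_t m, size_t k, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            /* Dot product of row i of a with column j of b. */
            for (size_t p = 0; p < k; p++)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}
```

For example, a [2, 3] matrix holding 1..6 times a [3, 2] matrix holding 7..12 gives the [2, 2] result 58, 64, 139, 154.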

Basically no batching or broadcasting yet; will continue to work on these steadily from here. That said, my primary objective with this was simply to get more used to writing C so that I could reduce some of the friction while reading the PMPP book, so this has already served pretty well. As I make more progress with the book, I will start adding CUDA kernels to this repo in addition to doing more advanced CPU ops.

On Technical Paper Reads

Got pretty darn engrossed in C again this week, not too proud of this. However, I did manage to read a couple of Arcee’s technical writings:

Arcee 4.5B deepdive
Extending AFM-4.5B to 64k context length (this is reaaaaaally interesting) and
The Trinity Manifesto

The blog progression of the good people at Arcee is pretty interesting in that it starts off with enterprise-focused model fine-tuning techniques and then slowly transitions towards building their own foundation models, first experimenting at a smaller scale with SLMs and then progressing to the larger ones with the Trinity family.

Reading them in order gives me a super nice chronological view of how Arcee did things, and I expect to learn a lot from it imo.

Next Week

start full-tensor softmax in C.
resume the PMPP book, ideally complete chapter 3 (might be tough, but effort shall be made)
Arcee blogs + if time is kinder, complete 2-3 passes for the Trinity Large technical report
some personal commitments (career-related)
