@TensorTonic: You reach for ReLU, GELU, and Softmax in almost every model you build. But could you write the forward pass and the gra…

X AI KOLs Timeline Tools

Summary

A tweet promoting TensorTonic, a platform for practicing from-scratch implementations of nine common activation functions (Sigmoid, ReLU, Tanh, Softmax, Leaky ReLU, GELU, Swish, ELU, SELU), covering both the forward pass and the gradient computation.

Cached at: 06/27/26, 01:56 PM

You reach for ReLU, GELU, and Softmax in almost every model you build. But could you write the forward pass and the gradients from memory without opening a reference?

> Sigmoid
> ReLU
> Tanh
> Softmax
> Leaky ReLU
> GELU
> Swish
> ELU
> SELU

Nine activation functions, each explained by implementing them from scratch.
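
As a rough illustration of what "from scratch" can mean for the eight element-wise functions in the list, here is a minimal sketch in plain Python. It is not TensorTonic's reference solution: the function names, the `ACTIVATIONS` table, and the defaults (Leaky ReLU slope 0.01, ELU alpha 1.0, Swish with beta fixed at 1, exact erf-based GELU, gradient 0 chosen for ReLU at x = 0) are all assumptions made for this example. The SELU constants are the standard published values.

```python
import math

# Standard SELU constants; every other default below is an assumption.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def sigmoid(x):
    # Branch on the sign so exp() is only called on non-positive values.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    # Not differentiable at 0; this sketch picks 0 there by convention.
    return 1.0 if x > 0 else 0.0

def tanh(x):
    return math.tanh(x)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def leaky_relu_grad(x, slope=0.01):
    return 1.0 if x > 0 else slope

def gelu(x):
    # Exact form: x * Phi(x), with Phi the standard normal CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x):
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

def swish(x):
    return x * sigmoid(x)

def swish_grad(x):
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * math.expm1(x)

def elu_grad(x, alpha=1.0):
    return 1.0 if x > 0 else alpha * math.exp(x)

def selu(x):
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * math.expm1(x))

def selu_grad(x):
    return SELU_SCALE * (1.0 if x > 0 else SELU_ALPHA * math.exp(x))

# Hypothetical lookup table pairing each forward pass with its derivative.
ACTIVATIONS = {
    "sigmoid": (sigmoid, sigmoid_grad),
    "relu": (relu, relu_grad),
    "tanh": (tanh, tanh_grad),
    "leaky_relu": (leaky_relu, leaky_relu_grad),
    "gelu": (gelu, gelu_grad),
    "swish": (swish, swish_grad),
    "elu": (elu, elu_grad),
    "selu": (selu, selu_grad),
}
```

Each pair works on a single float. In a real model the same formulas would be applied element-wise to a tensor, and the derivative would be multiplied by the upstream gradient during backpropagation.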

Practice all of them on TensorTonic.
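
Softmax is the one function in the list that is not element-wise, since every output depends on every input, so its gradient is usually written as a vector-Jacobian product rather than a per-element derivative. The sketch below is one possible from-scratch version in plain Python together with a finite-difference check. The names `softmax_backward` and `numerical_grad` and the example values are hypothetical and chosen only for illustration.

```python
import math

def softmax(z):
    # Subtracting the max leaves the result unchanged but prevents overflow.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_backward(z, upstream):
    # Jacobian: d s_i / d z_j = s_i * (delta_ij - s_j).
    # Contracting with the upstream gradient u gives
    # dL/dz_j = s_j * (u_j - sum_i u_i * s_i).
    s = softmax(z)
    dot = sum(u * si for u, si in zip(upstream, s))
    return [si * (u - dot) for si, u in zip(s, upstream)]

def numerical_grad(f, z, h=1e-6):
    # Central differences for a scalar-valued function of a list of floats.
    grads = []
    for i in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[i] += h
        z_minus[i] -= h
        grads.append((f(z_plus) - f(z_minus)) / (2.0 * h))
    return grads

if __name__ == "__main__":
    z = [1.0, 2.0, 3.0]          # example logits (arbitrary)
    u = [0.3, -1.2, 0.5]         # example upstream gradient (arbitrary)

    # Scalar loss whose gradient w.r.t. softmax(z) is exactly u.
    def loss(v):
        return sum(ui * si for ui, si in zip(u, softmax(v)))

    analytic = softmax_backward(z, u)
    numeric = numerical_grad(loss, z)
    max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
    print("max abs difference:", max_err)
```

Comparing an analytic gradient against central differences like this is a common way to catch sign or indexing mistakes when writing a backward pass by hand.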
