@TensorTonic: You reach for ReLU, GELU, and Softmax in almost every model you build. But could you write the forward pass and the gra…
Summary
A tweet promoting TensorTonic, a platform that allows users to practice implementing nine common activation functions (Sigmoid, ReLU, Tanh, Softmax, Leaky ReLU, GELU, Swish, ELU, SELU) from scratch, including forward pass and gradient computation.
Cached at: 06/27/26, 01:56 PM
You reach for ReLU, GELU, and Softmax in almost every model you build. But could you write the forward pass and the gradients from memory without opening a reference?
Sigmoid, ReLU, Tanh, Softmax, Leaky ReLU, GELU, Swish, ELU, SELU
Nine activation functions, each explained by implementing it from scratch.
Practice all of them on TensorTonic.
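To make the challenge in the post concrete, here is a minimal sketch of forward passes and gradients for the nine listed functions, written in plain Python with only the standard library. It is not TensorTonic's reference solution. The function names, the default `alpha` values, the choice of the exact erf-based GELU rather than the tanh approximation, and the convention of a zero derivative for ReLU at exactly 0 are all assumptions made for illustration. The `numeric_grad` helper is a hypothetical finite-difference check, not part of any library.

```python
import math

# SELU constants from the self-normalizing networks formulation.
SELU_SCALE = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772

def sigmoid(x):
    # Branch on sign so exp() never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    # Assumed convention: derivative is taken as 0 at x == 0.
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha

def tanh_grad(x):
    # Forward pass is math.tanh; derivative is 1 - tanh(x)^2.
    t = math.tanh(x)
    return 1.0 - t * t

def swish(x):
    # Swish with beta = 1 (also called SiLU): x * sigmoid(x).
    return x * sigmoid(x)

def swish_grad(x):
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x):
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha=1.0):
    return 1.0 if x > 0 else alpha * math.exp(x)

def selu(x):
    return SELU_SCALE * elu(x, SELU_ALPHA)

def selu_grad(x):
    return SELU_SCALE * elu_grad(x, SELU_ALPHA)

def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    # J[i][j] = d softmax_i / d z_j = s_i * (delta_ij - s_j).
    s = softmax(z)
    n = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
            for i in range(n)]

def numeric_grad(f, x, h=1e-5):
    # Central finite difference, used only to sanity-check the formulas.
    return (f(x + h) - f(x - h)) / (2.0 * h)
```

A quick way to use the sketch is to compare each analytic gradient against `numeric_grad` at a few points away from any kink, for example `abs(gelu_grad(0.7) - numeric_grad(gelu, 0.7))`, which should be tiny. Softmax is the odd one out: it maps a vector to a vector, so its gradient is a full Jacobian matrix rather than an elementwise derivative.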
Similar Articles
@tetsuoai: The entire core of a neural network on four cards. Neuron, forward pass, activations, backprop. Learn these four and yo…
A set of four cards covering the core concepts of neural networks: neuron, forward pass, activations, and backpropagation, aimed at helping learners understand how models from perceptrons to transformers work.
@NFTCPS: You keep talking about AI, but can't even explain what a Transformer is? There's a repo that goes all out — builds a GPT from scratch without using any high-level libraries. It lays out exactly how Attention, Multi-Head, Feed-Forward, Embedding, Residual connections, and Layer Norm are pieced together. And it's not just the model; the entire pipeline is covered…
A GitHub open-source project that implements the complete GPT training pipeline from scratch, including data preprocessing, pretraining, SFT, and RLHF post-training, all based on native PyTorch. Ideal for developers who want to deeply understand the Transformer architecture.
Bug or Feature^2: Weight Drift, Activation Sparsity, and Spikes
This paper formally proves that training neural networks with asymmetric activation functions like ReLU, GELU, or SiLU causes weights to drift negative, leading to up to 90% activation sparsity. It also shows that squared activations like ReLU² improve performance but cause activation spikes, which can be fixed by clipping, with GELU² achieving the best validation loss.
@gordic_aleksa: new in-depth blog post time: Inside the Transformer: The Life of a Token a deep dive into a modern dense transformer, i…
An in-depth blog post exploring the inner workings of modern dense transformers, covering topics such as YaRN for positional information, hybrid attention for long context lengths, soft capping, QK normalization, and transformer math including FLOPs/token formulas and cluster sizing.
@harshbhatt7585: https://x.com/harshbhatt7585/status/2063593933314113587
The author shares learnings from training a 160M parameter LLM from scratch, experimenting with architectures like multi-token prediction and hierarchical reasoning models. They emphasize the importance of fast iteration, simplifying ideas, and understanding why architectures work.