Tag
The Controlled Dynamics Attractor Transformer (CDAT) combines a mixture von Mises-Fisher attention energy with a Hopfield refinement energy and CANN-inspired excitation-inhibition modulation, providing topology-constrained dynamical systems for stable inference. It achieves state-of-the-art performance on graph anomaly detection and classification benchmarks.
AdaVoMP uses a sparse adaptive voxel structure and transformer encoder-decoder to predict spatially-varying mechanical properties for 3D objects, enabling high-resolution deformable simulations with improved accuracy and efficiency.
Looped World Models introduce iterative latent state refinement through shared transformer blocks, achieving 100x parameter efficiency while adapting computational depth to prediction complexity.
A technical analysis of two approaches to building self-evolving AI agents: model-based (via architectures such as SSMs or transformers with fast-weight updates, together with training methods) and harness-based (via memory or a meta-harness that can rewrite itself). The author offers practical recommendations for different audiences.
A repository that builds a transformer from scratch without high-level libraries, explaining attention mechanisms and the full training pipeline, trainable in a day on free Colab.
A repository that builds a GPT-style transformer from scratch without high-level libraries, covering everything from data preprocessing to generation, and includes guides for SFT and RLHF.
Explains the operating principles of large models in plain language, covering word vectors, the Transformer attention mechanism, next-word prediction training, and emergent abilities. Suitable for beginners learning basic AI concepts.
A Chinese popular-science tweet that intuitively walks through the core pipeline of large language models (LLMs): tokens, embeddings, positional encoding, attention, the FFN, the residual stream, and next-token prediction. It aims to help readers without a math background understand AI papers.
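The pipeline named in this summary can be sketched in a few lines of plain Python. This is only an illustrative toy, not code from the tweet: the function name `next_token_probs`, the random weights, the tiny dimensions, the sinusoidal position scheme, and the tied output projection are all assumptions, and layer normalization and multiple heads are left out for brevity.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def sinusoidal_position(pos, d):
    # Even indices use sin, odd indices use cos of the same frequency.
    return [
        math.sin(pos / 10000 ** (i / d)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / d))
        for i in range(d)
    ]

def next_token_probs(token_ids, vocab_size=8, d=4, seed=0):
    """Hypothetical one-layer forward pass; weights are random, not trained."""
    rng = random.Random(seed)
    E = rand_matrix(vocab_size, d, rng)            # embedding table
    Wq, Wk, Wv = (rand_matrix(d, d, rng) for _ in range(3))
    W1 = rand_matrix(2 * d, d, rng)                # FFN up-projection
    W2 = rand_matrix(d, 2 * d, rng)                # FFN down-projection

    # token -> embedding, plus positional encoding
    xs = [
        [e + p for e, p in zip(E[t], sinusoidal_position(i, d))]
        for i, t in enumerate(token_ids)
    ]

    # attention for the last position over itself and all earlier positions
    q = matvec(Wq, xs[-1])
    keys = [matvec(Wk, x) for x in xs]
    vals = [matvec(Wv, x) for x in xs]
    weights = softmax(
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    )
    attn = [sum(w * v[j] for w, v in zip(weights, vals)) for j in range(d)]

    # residual stream: each block adds its output to the running vector
    h = [x + a for x, a in zip(xs[-1], attn)]
    hidden = [max(0.0, z) for z in matvec(W1, h)]  # ReLU
    h = [a + b for a, b in zip(h, matvec(W2, hidden))]

    # next-token prediction: project back onto the vocabulary
    return softmax(matvec(E, h))

probs = next_token_probs([1, 2, 3])
```

With random weights the resulting distribution is meaningless; training adjusts the weights so that the probability assigned to the actual next token goes up.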
A GitHub guide published by Fluyeporlaweb shows how to build and train a Transformer model from scratch on The Pile dataset, implementing attention, multi-head attention, embeddings, and post-training algorithms (SFT, PPO, DPO, GRPO) without high-level libraries.
This paper introduces DRIVE, a unified Transformer-based framework for offline auto-bidding that decouples candidate action generation from decision making, combining distributional action modeling, retrieval-augmented candidate generation, and value-based evaluation to improve bidding performance under budget and cost constraints.
Zeta proposes a dual whitening optimizer that applies coordinate whitening before spectral whitening to resolve scale heterogeneity in momentum matrices, reducing orthogonalization error and improving convergence and generalization in large-scale neural network training.
Presents a Transformer-based scheduling policy trained with reinforcement learning for the open shop scheduling problem, showing that a model trained on small instances can generalize to much larger problems and compete with classical dispatching heuristics.
Taylor-Calibrate proposes a principled initialization method for hybrid linear attention models that significantly improves the efficiency of distilling pretrained Transformers into Gated DeltaNet students, achieving up to 88x improvement and reducing training tokens by 4.9x-9.2x.
MiniMaxAI releases MSA, a library for dense and sparse attention kernels optimized for NVIDIA SM100 GPUs, enabling efficient processing of million-token contexts with FlashAttention and sparse top-k attention.
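As a rough illustration of what "sparse top-k attention" means in general, the sketch below keeps only the k highest-scoring keys for a single query and runs softmax over that subset. It is a plain-Python toy and makes no claim about MSA's API or kernels; the function name `topk_attention` and its signature are hypothetical.

```python
import math

def topk_attention(query, keys, values, k):
    """Attend only to the k keys with the highest scaled dot-product score."""
    d = len(query)
    scores = [
        sum(q * x for q, x in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    # Indices of the k largest scores; every other key is ignored entirely.
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

    # Softmax restricted to the kept keys.
    m = max(scores[i] for i in keep)
    exps = {i: math.exp(scores[i] - m) for i in keep}
    total = sum(exps.values())

    out = [0.0] * len(values[0])
    for i, e in exps.items():
        w = e / total
        for j, v in enumerate(values[i]):
            out[j] += w * v
    return out, sorted(keep)

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
out, kept = topk_attention([1.0, 0.0], keys, values, k=1)
```

Because only k values are mixed per query, the cost of the weighted sum no longer grows with the full context length, which is the general motivation for this family of methods on very long contexts.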
This tweet shares a well-made explanation of the internal workings of LLMs, covering tokens, embeddings, positional encoding, attention, and feed-forward networks, via a blog post by 0xkato.
This paper introduces the Curse of Depth in LLMs, where deep layers become ineffective due to Pre-Layer Normalization causing output variance explosion. The authors propose LayerNorm Scaling to mitigate this, showing consistent improvements in pre-training and fine-tuning across model sizes up to 7B.
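The summary does not spell out the scaling rule. A minimal sketch, assuming the layer-norm output is multiplied by 1/sqrt(l) for the l-th layer (1-indexed), might look like the following; the function names and the exact factor are assumptions to be checked against the paper, and learnable gain and bias are omitted.

```python
import math

def layer_norm(x, eps=1e-5):
    # Plain layer normalization without learnable gain or bias.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def scaled_layer_norm(x, layer_index, eps=1e-5):
    # Assumed rule: damp the normalized output by 1/sqrt(layer_index),
    # so deeper layers contribute lower-variance outputs.
    scale = 1.0 / math.sqrt(layer_index)
    return [scale * v for v in layer_norm(x, eps)]

shallow = scaled_layer_norm([1.0, 2.0, 3.0, 4.0], layer_index=1)
deep = scaled_layer_norm([1.0, 2.0, 3.0, 4.0], layer_index=4)
```

Under this assumed rule, the output at layer 4 is half the magnitude of the output at layer 1, so its variance is a quarter as large.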
Cohere released a new lightweight 30B open-weight model for agentic coding tasks, built on Command A+ with a parallel transformer design and showing strong performance on agentic benchmarks such as Terminal-Bench and SWE-Bench.
Otters++ is a novel optical spiking Transformer that leverages time-to-first-spike coding and physical hardware decay for energy-efficient inference, reaching 84.17% on GLUE while maintaining a clear energy advantage over prior spiking Transformer baselines.
Introduces llm.istanbul, a WebGPU LLM workbench that lets you train small models, train tokenizers, and generate text entirely in the browser: fully local, with no server required.
This paper introduces an adaptive video tokenisation method that exploits temporal redundancy in latent space to allocate tokens dynamically, achieving efficient compression without auxiliary networks. The proposed Latent Inpainting Transformer reconstructs dropped positions, delivering 31x speedup over ElasticTok-CV and 2x over InfoTok.