Tag
A technical analysis comparing memory designs in RNNs, Transformers, and SSMs, arguing that the key question is where to store sequence state rather than which architecture is better. Discusses trade-offs between compressed hidden states, growing KV caches, and synaptic-like memory in model connectivity.
MosiAI has released MOSS-TTS Local Transformer v1.5, a text-to-speech model that supports voice cloning, over 30 languages, and high-quality 48 kHz output.
The article provides a detailed explanation of Mixture of Experts (MoE) in transformers, covering routing, load balancing, and recent innovations like fine-grained experts. It also highlights the significance of Noam Shazeer's research contributions and his move from Google to OpenAI.
Noam Shazeer, co-inventor of the Transformer architecture and key figure behind Gemini, is leaving Google to join OpenAI, marking his second departure from Google after being brought back in a $2.7 billion deal.
Noam Shazeer, co-author of the Transformer architecture and technical lead of Google's Gemini model, has left Google again and officially joined OpenAI. He will focus on discovering new underlying architectures for large models and driving research into the evolution of Transformers.
This paper introduces QG-MIL, a gated transformer aggregator that mitigates attention concentration in multiple instance learning for medical imaging, achieving domain-agnostic performance without auxiliary losses.
Grouped Query Experts (GQE) improves Transformer efficiency by applying a mixture-of-experts layer on top of grouped-query attention. It selectively activates query heads per token while keeping the key-value cache benefits, matching baseline accuracy with half the query-head compute at the 250M-parameter scale.
EveryonesLLM is an open-source Google Colab-based tutorial repository for building a nanoGPT-style LLM from scratch, with step-by-step chapters covering dataloading, embeddings, attention, training, and instruction tuning.
LoopCoder-V2 is a 7B instruction-tuned code model built on the Parallel Loop Transformer (PLT), demonstrating non-monotonic test-time scaling with two loops providing the best gain-cost trade-off and significant improvements over baselines on code generation and reasoning benchmarks.
Speculates on a progression from looped transformers to hyper-looped transformers to looped world models, hinting at a new research direction.
This paper presents a quantized, integer-only transformer implementation for jet tagging on AMD Versal AI Engines, including a reusable open-source framework that maps transformer layers to AIE tiles for low-latency trigger systems at CERN LHC.
This paper presents a discrete autoregressive transformer that generates planar mechanisms from target coupler curves, using variational autoencoder latents and tokenized joint coordinates to achieve diverse, accurate designs across multiple topologies.
This paper demonstrates that when transformers grok modular multiplication, the dense Fourier spectrum observed in previous work is an artifact of using the additive Fourier transform; using the multiplicative character transform reveals a sparse representation, leading to a reverse-engineered 'Discrete-Log Clock' algorithm analogous to the clock algorithm for modular addition.
This paper uses a Transformer-based model on MLB Statcast data to counterfactually optimize baseball pitch sequences, finding that optimizing both final and setup pitches can improve season-level statistics like K/9 by over 1.0.
This paper systematically evaluates the impact of classification model selection within the InferBERT framework for causal adverse drug event detection, finding that domain-specific pre-training (BioBERT) outperforms both simpler models and larger LLMs like Med-LLaMA.
This paper proposes Supervised Memory Training (SMT), which uses a Transformer as a "super teacher" to distill memory states in parallel, then trains an RNN with one-step supervised learning. This makes training fully parallel and shortens the gradient path from O(T) to O(1), significantly improving long-range dependency learning.
A custom FPGA implementation of a Transformer with KV cache achieves 56,000 tokens per second at 80 MHz, running microGPT on a tiny LCD.
An open-source GitHub project that implements the complete GPT training pipeline from scratch in native PyTorch, including data preprocessing, pretraining, SFT, and RLHF post-training. Aimed at developers who want to understand the Transformer architecture in depth.
EveryonesLLM is an open-source tutorial that provides 29 chapters of Colab notebooks. It teaches users step by step to build a complete large language model from scratch on Google Colab, including pre-training and instruction fine-tuning, and supports Chinese.
This paper investigates explicit encoding of ICD-10-CM hierarchy in EHR foundation models, using hierarchical token augmentation and graph-based code representations. Experiments on MIMIC-IV and eICU show improvements over flat code representations for in-domain and cross-dataset prediction tasks.
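A note on the 'Discrete-Log Clock' entry above: the analogy to the clock algorithm for modular addition rests on a standard number-theory fact. For a prime p with primitive root g, every nonzero residue can be written as g^k, so multiplication mod p becomes addition of exponents mod p - 1. The sketch below only illustrates that fact on a small example; it is not the paper's reverse-engineered circuit, and the function names and the choice p = 7, g = 3 are assumptions made for illustration.

```python
def discrete_log_table(p, g):
    """Map each nonzero residue a (mod p) to the exponent k with g**k % p == a.

    Assumes p is prime and g is a primitive root modulo p.
    """
    table = {}
    value = 1
    for k in range(p - 1):
        table[value] = k
        value = (value * g) % p
    if len(table) != p - 1:
        # Some residue was never reached, so g does not generate the group.
        raise ValueError("g is not a primitive root modulo p")
    return table


def multiply_via_logs(a, b, p, g, table):
    """Compute (a * b) % p by adding discrete logs modulo p - 1."""
    k = (table[a] + table[b]) % (p - 1)
    return pow(g, k, p)


# Illustrative values (assumed): 3 is a primitive root modulo 7.
p, g = 7, 3
table = discrete_log_table(p, g)

# Check every pair of nonzero residues against ordinary modular multiplication.
for a in range(1, p):
    for b in range(1, p):
        assert multiply_via_logs(a, b, p, g, table) == (a * b) % p
```

In this view, multiplying residues is the same as adding angles on a clock with p - 1 positions, which is why a transform built on multiplicative characters can give a sparse picture where the additive Fourier transform does not.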