transformer

#transformer

Flexformer: Flexible Linear Transformer with Learnable Attention Kernel

arXiv cs.LG · 10h ago

Flexformer proposes a flexible linear Transformer with fully learnable attention kernels using random Fourier features, achieving linear complexity while matching or exceeding softmax attention performance on language modeling and sequence classification tasks.
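As a rough illustration of why kernelized attention can run in linear time, the sketch below uses the classic fixed random Fourier features for a Gaussian kernel. It is not Flexformer's learnable kernel. The function names and sizes are assumptions made for this example, and normalization and causal masking are left out. The point is only that (φ(Q)φ(K)ᵀ)V and φ(Q)(φ(K)ᵀV) give the same result, while the second form never builds the n × n score matrix.

```python
import math
import random

def sample_rff(dim, d_feat, rng):
    """Fixed (not learned) frequencies and phases for a unit-bandwidth Gaussian kernel."""
    freqs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(d_feat)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(d_feat)]
    return freqs, phases

def rff_features(x, freqs, phases):
    """Map a vector to random Fourier features: sqrt(2/D) * cos(w.x + b)."""
    scale = math.sqrt(2.0 / len(freqs))
    return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for w, b in zip(freqs, phases)]

def quadratic_attention(phi_q, phi_k, values):
    """Unnormalized kernel attention the O(n^2) way: (phi(Q) phi(K)^T) V."""
    out = []
    for q in phi_q:
        weights = [sum(a * b for a, b in zip(q, k)) for k in phi_k]
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out

def linear_attention(phi_q, phi_k, values):
    """Same result, linear in sequence length: phi(Q) (phi(K)^T V)."""
    d_feat, d_v = len(phi_k[0]), len(values[0])
    # Summarize all keys and values once in a d_feat x d_v matrix.
    kv = [[0.0] * d_v for _ in range(d_feat)]
    for k, v in zip(phi_k, values):
        for f in range(d_feat):
            for c in range(d_v):
                kv[f][c] += k[f] * v[c]
    # Each query then reads from the summary instead of from every key.
    return [[sum(q[f] * kv[f][c] for f in range(d_feat)) for c in range(d_v)]
            for q in phi_q]

# Toy sizes chosen for illustration only.
rng = random.Random(0)
dim, d_feat, seq_len = 4, 64, 6
freqs, phases = sample_rff(dim, d_feat, rng)
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
phi_q = [rff_features(q, freqs, phases) for q in queries]
phi_k = [rff_features(k, freqs, phases) for k in keys]
slow = quadratic_attention(phi_q, phi_k, values)
fast = linear_attention(phi_q, phi_k, values)
```

Here `slow` and `fast` agree up to floating-point rounding. A practical variant would also normalize each output row and would need a feature map that keeps the normalizer positive, which plain cosine features do not guarantee.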

MultiHashFormer: Hash-based Generative Language Models

arXiv cs.CL · 10h ago

MultiHashFormer is a hash-based generative language model that represents each token as a unique hash signature, enabling parameter-efficient autoregression. It outperforms standard Transformer LMs at 100M, 1B, and 3B scales and supports multilingual vocabulary expansion without increasing parameters.

From Black-Box to Clinical Insight: A Multi-Stage Explainable Framework for Speech-Based Cognitive Impairment Detection

arXiv cs.CL · 10h ago

This paper presents a multi-stage explainable framework that combines SHAP-based token attribution, theory-informed linguistic features, and reasoning by LLaMA-3.1-70B-Instruct to interpret transformer-based speech models for cognitive impairment detection. It reports strong clinical alignment and high usability scores.

A Study of Temporal Fusion Strategies for Named Entity Recognition in Historical Texts

arXiv cs.CL · 10h ago

This paper systematically studies how temporal metadata can be structurally embedded into named entity recognition (NER) models for historical texts. Experiments with absolute and relative temporal representations injected via early or late fusion mechanisms show that late fusion strategies yield more robust performance on French and German historical datasets.

Mitigating Position Bias in Transformers via Layer-Specific Positional Embedding Scaling

arXiv cs.CL · 10h ago

The paper introduces LPES, a layer-specific positional embedding scaling method that mitigates the 'lost-in-the-middle' problem in LLMs. It assigns a distinct scaling factor to each layer using a genetic algorithm with Bézier curves, and reports accuracy gains of up to 11.2% without fine-tuning or added latency.

Prism Transformer: Progressive Head Schedules for Hierarchical Attention Processing

arXiv cs.LG · 10h ago

The Prism Transformer replaces uniform multi-head attention with a progressive head schedule that increases head count across layers, enabling a local-to-global hierarchy without extra parameters or FLOPs. It consistently outperforms standard Transformers on language modeling and zero-shot benchmarks at 124M, 354M, and 757M scales.
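The "no extra parameters or FLOPs" claim is consistent with a general property of standard multi-head attention: when the model width is fixed, changing the head count only changes the per-head dimension. The sketch below checks that arithmetic. The schedule, the width of 768, and the sequence length of 1024 are hypothetical values chosen for illustration, not the paper's configuration.

```python
def attention_cost(d_model, n_heads, seq_len):
    """Projection parameters and QK^T multiply-adds for one attention layer."""
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    head_dim = d_model // n_heads
    # Q, K, V and output projections are each d_model x d_model (biases ignored).
    proj_params = 4 * d_model * d_model
    # Each head computes a seq_len x seq_len score matrix over head_dim dimensions.
    score_macs = n_heads * seq_len * seq_len * head_dim
    return head_dim, proj_params, score_macs

# Hypothetical schedule: the head count doubles every three layers.
schedule = [2, 2, 2, 4, 4, 4, 8, 8, 8, 16, 16, 16]
costs = [attention_cost(768, h, 1024) for h in schedule]

for layer, (h, (head_dim, params, macs)) in enumerate(zip(schedule, costs)):
    print(f"layer {layer:2d}: heads={h:2d} head_dim={head_dim:3d} "
          f"params={params} score_macs={macs}")
```

Every layer prints the same `params` and `score_macs`, while `head_dim` shrinks from 384 to 48 as the head count grows. Only the per-head softmax work, which is small next to the matrix products, scales with the number of heads.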

The Context-Ready Transformer

arXiv cs.CL · 10h ago

The paper introduces the context-ready transformer, a recurrent architecture that pre-contextualizes tokens before the transformer block, achieving significant inference speedups (e.g., 1.7x on A100) while matching or exceeding standard transformer performance with fewer layers.

@stanfordnlp: The “problem” with CS336 is not the ~22 hours of videos but the larger number of hours it takes to do the assignments. …

X AI KOLs Following · 18h ago

Stanford's CS336 course on language modeling from scratch is announced, featuring intensive hands-on assignments covering tokenizers, transformers, data, and alignment.

Show HN: NanoEuler – GPT-2 scale model in pure C/CUDA from scratch

Hacker News Top · 19h ago

NanoEuler is a GPT-2-scale language model built entirely from scratch in C/CUDA without any ML libraries, including hand-written forward and backward passes, a byte-level BPE tokenizer, and a training pipeline. The project is an educational artifact demonstrating the engineering behind transformer training and runs on a single RTX 4070.

I shrank a transformer until every number fitted on the screen and made the weights editable [R]

Reddit r/MachineLearning · yesterday

An interactive web page that visualizes a tiny transformer with editable weights, allowing users to see how changes affect predictions in real time, aimed at helping developers understand the forward pass of an LLM.

@amitiitbhu: Q × Kᵀ tells the model how relevant every word is to every other word. Softmax turns that into probabilities. V deliver…

X AI KOLs Timeline · 2d ago

A tweet explaining the core formula of the attention mechanism in transformer models: Q × Kᵀ scores how relevant each word is to every other word, softmax converts those scores to probabilities, and V delivers the content. The tweet presents this formula as the foundation of modern AI.
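The three steps in that formula can be sketched in a few lines of plain Python. This is a minimal single-head version with no masking or batching. The 1/√d_k scaling is not mentioned in the tweet and is added here because it is part of the standard scaled dot-product formulation. The function names and toy vectors are made up for this example.

```python
import math

def softmax(row):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    scale = 1.0 / math.sqrt(len(K[0]))
    outputs, weights = [], []
    for q in Q:
        # Step 1: relevance of this query to every key (one row of Q K^T).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        # Step 2: softmax turns the scores into attention probabilities.
        w = softmax(scores)
        # Step 3: the output is the probability-weighted mix of value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
        weights.append(w)
    return outputs, weights

# Three toy tokens with 2-dimensional queries, keys and values.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = attention(Q, K, V)
for token, w in enumerate(weights):
    print(token, [round(p, 3) for p in w])
```

Each printed row is one token's attention distribution over all three tokens and sums to 1. The matching row of `outputs` is the blend of value vectors under those weights.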

@IlirAliu_: Forget lidar. One single camera. Runs in real time & is open source: A streaming 3D model that reconstructs scenes live…

X AI KOLs Timeline · 2d ago

LingBot-Map is an open-source, real-time streaming 3D reconstruction model that uses a single camera, running at ~20 FPS via a feed-forward geometric context transformer, outperforming both streaming and offline methods.

@snowboat84: Have you noticed that the birth of models in AI is actually quite arbitrary? Take language models as an example: first RNN, then LSTM, one day Transformer is said to be effective so everyone switches to it, later it's split into Encoder and Decoder, one moment BERT is all the rage, the next GPT is said to have emergent abilities and Scaling Law. The whole process hardly has any theoretical guidance.

X AI KOLs Timeline · 2d ago

The post argues that AI model architectures emerge largely arbitrarily, and proposes drawing inspiration from physics models, building a repository of candidate models, and formalizing the model selection process.

@henit_chobisa: Wanted to share a small achievement. For the last month I’ve been scribbling over whiteboards and notebooks trying to u…

X AI KOLs Timeline · 3d ago

A developer built a transformer model entirely from scratch in TypeScript, including a custom autograd engine, and released it as an open-source educational tool on GitHub.

@ZhihuFrontier: Half a year ago, a Zhihu contributor predicted that the next Transformer would absorb loops, recurrent state, sparse ro…

X AI KOLs Timeline · 3d ago

A Zhihu contributor's half-year-old prediction that the next Transformer would absorb loops, recurrent state, sparse routing, and latent reasoning is gaining relevance as Loop Engineering advances. The article explores how future Transformer architectures may evolve into hybrid models blending linear-complexity layers for background context with attention for precise reasoning, plus finer-grained sparsity and native System 2 reasoning.

Learning Probabilistic Filters with Strictly Proper Scoring Rules

arXiv cs.LG · 3d ago

This paper introduces the Proper Scoring Ensemble Filter (PSEF), a transformer-based method for Bayesian filtering that trains an analysis map using strictly proper scoring rules on synthetic state-observation trajectories. It demonstrates superior performance in nonlinear, non-Gaussian filtering tasks compared to classical and learning-based methods.

At the Edge of Understanding: Sparse Autoencoders Trace The Limits of Transformer Generalization

arXiv cs.LG · 3d ago

This paper proposes using sparse autoencoders to detect out-of-distribution inputs for transformers, including typos and jailbreak prompts, by analyzing spurious concept activations. The method enables a mechanistically grounded fine-tuning strategy to improve LLM robustness.

Speaking Numbers to LLMs: Multi-Wavelet Number Embeddings for Time Series Forecasting

arXiv cs.CL · 3d ago

Proposes TempoWave, a plug-and-play temporal wavelet digit interface that maps time series observations into digit-wise embeddings from multi-wavelet coefficients, improving LLM-based time series forecasting and achieving state-of-the-art on multiple benchmarks.

PMDformer: Patch-Mean Decoupling Information Transformer for Long-term Forecasting

arXiv cs.AI · 3d ago

PMDformer introduces patch-mean decoupling and specialized attention mechanisms to improve shape similarity modeling in long-term time series forecasting, outperforming existing methods on multiple benchmarks.

@FinanceYF5: Next token prediction is short-sighted. What if the Transformer learns to predict its own next hidden state? Jayden Teoh proposes Next-Latent Prediction (NextLat): a self-supervised learning method that teaches the Transformer to form...

X AI KOLs Following · 3d ago

Jayden Teoh proposes Next-Latent Prediction (NextLat), a self-supervised learning method that teaches the Transformer to predict its own next hidden state, thereby forming a compact world model for reasoning and planning. It achieves up to 3.3x inference speedup through self-speculative decoding.
