transformer

#transformer

@amitiitbhu: Q × Kᵀ tells the model how relevant every word is to every other word. Softmax turns that into probabilities. V deliver…

X AI KOLs Timeline · yesterday

A tweet explaining the core formula of the attention mechanism in transformer models: Q × Kᵀ computes relevance, Softmax converts to probabilities, and V delivers content, forming the foundation of modern AI.
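The formula summarized above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the tweet: the function names `softmax` and `attention` are made up for this example, and the division by the square root of the key dimension follows the standard scaled dot-product formulation, which the tweet summary does not mention.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors.

    Q, K: lists of vectors with the same dimension d_k.
    V: list of vectors, one per key.
    Returns (outputs, weights).
    """
    d_k = len(K[0])
    # Q x K^T: relevance of every query to every key (scaling is an assumption).
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        for q in Q
    ]
    # Softmax turns each row of scores into probabilities.
    weights = [softmax(row) for row in scores]
    # V delivers the content: a weighted average of the value vectors.
    d_v = len(V[0])
    outputs = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(d_v)]
        for w_row in weights
    ]
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, and each query puts the most weight on the key it matches, so the first output row is pulled toward the first value vector.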

#transformer

@IlirAliu_: Forget lidar. One single camera. Runs in real time & is open source: A streaming 3D model that reconstructs scenes live…

X AI KOLs Timeline · yesterday

LingBot-Map is an open-source, real-time streaming 3D reconstruction model that uses a single camera, running at ~20 FPS via a feed-forward geometric context transformer, outperforming both streaming and offline methods.

#transformer

@snowboat84: Have you noticed that the way new AI models come about is actually quite arbitrary? Take language models as an example: first came RNNs, then LSTMs; one day the Transformer was said to be effective, so everyone switched to it; later it was split into Encoder and Decoder; one moment BERT was all the rage, the next GPT was said to show emergent abilities and scaling laws. The whole process has had hardly any theoretical guidance.

X AI KOLs Timeline · yesterday

The article discusses the arbitrariness of AI model creation, proposing to draw inspiration from physics models, build a repository of candidate models, and formalize the model selection process.

#transformer

@henit_chobisa: Wanted to share a small achievement. For the last month I’ve been scribbling over whiteboards and notebooks trying to u…

X AI KOLs Timeline · yesterday

A developer built a transformer model entirely from scratch in TypeScript, including a custom autograd engine, and released it as an open-source educational tool on GitHub.

#transformer

@ZhihuFrontier: Half a year ago, a Zhihu contributor predicted that the next Transformer would absorb loops, recurrent state, sparse ro…

X AI KOLs Timeline · yesterday

A Zhihu contributor's half-year-old prediction that the next Transformer would absorb loops, recurrent state, sparse routing, and latent reasoning is gaining relevance as Loop Engineering advances. The article explores how future Transformer architectures may evolve into hybrid models blending linear-complexity layers for background context with attention for precise reasoning, plus finer-grained sparsity and native System 2 reasoning.

#transformer

Learning Probabilistic Filters with Strictly Proper Scoring Rules

arXiv cs.LG · 2d ago

This paper introduces the Proper Scoring Ensemble Filter (PSEF), a transformer-based method for Bayesian filtering that trains an analysis map using strictly proper scoring rules on synthetic state-observation trajectories. It demonstrates superior performance in nonlinear, non-Gaussian filtering tasks compared to classical and learning-based methods.

#transformer

At the Edge of Understanding: Sparse Autoencoders Trace The Limits of Transformer Generalization

arXiv cs.LG · 2d ago

This paper proposes using sparse autoencoders to detect out-of-distribution inputs for transformers, including typos and jailbreak prompts, by analyzing spurious concept activations. The method enables a mechanistically grounded fine-tuning strategy to improve LLM robustness.

#transformer

Speaking Numbers to LLMs: Multi-Wavelet Number Embeddings for Time Series Forecasting

arXiv cs.CL · 2d ago

Proposes TempoWave, a plug-and-play temporal wavelet digit interface that maps time series observations into digit-wise embeddings from multi-wavelet coefficients, improving LLM-based time series forecasting and achieving state-of-the-art on multiple benchmarks.

#transformer

PMDformer: Patch-Mean Decoupling Information Transformer for Long-term Forecasting

arXiv cs.AI · 2d ago

PMDformer introduces patch-mean decoupling and specialized attention mechanisms to improve shape similarity modeling in long-term time series forecasting, outperforming existing methods on multiple benchmarks.

#transformer

@FinanceYF5: Next token prediction is short-sighted. What if the Transformer learns to predict its own next hidden state? Jayden Teoh proposes Next-Latent Prediction (NextLat): a self-supervised learning method that teaches the Transformer to form...

X AI KOLs Following · 2d ago

Jayden Teoh proposes Next-Latent Prediction (NextLat), a self-supervised learning method that teaches the Transformer to predict its own next hidden state, thereby forming a compact world model for reasoning and planning. It achieves up to 3.3x inference speedup through self-speculative decoding.

#transformer

The KV-cache wall: why fixed-size memory sequence models keep coming back

Reddit r/ArtificialInteligence · 2d ago

Explores the growing memory bottleneck of KV-cache in transformer inference, explaining why alternative architectures with fixed-size memory like Mamba and RWKV are gaining renewed attention.
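To make the memory bottleneck concrete, here is a back-of-the-envelope sketch of how KV-cache size grows with sequence length. The helper `kv_cache_bytes` and the model shape (32 layers, 32 key/value heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures from the post.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV-cache size for a decoder-only transformer.

    The leading factor 2 counts one key tensor and one value tensor
    per layer. All shape parameters here are hypothetical.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical model shape, 16-bit cache entries.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
at_4k = kv_cache_bytes(32, 32, 128, seq_len=4096)

print(per_token / 2**20, "MiB per token")      # 0.5 MiB per token
print(at_4k / 2**30, "GiB at 4096 tokens")     # 2.0 GiB at 4096 tokens
```

The cache grows linearly with sequence length and batch size, whereas a fixed-size recurrent state does not grow with context at all, which is the contrast the post draws.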

#transformer

Optimizing Abstractive Summarization With Fine-Tuned PEGASUS

arXiv cs.CL · 3d ago

This paper presents fine-tuning of PEGASUS on the XL-Sum English corpus, achieving state-of-the-art results with significant improvements over the baseline mT5 model across ROUGE scores.

#transformer

Multi-Stream Temporal Fusion for Financial Fraud Detection

arXiv cs.LG · 3d ago

Proposes the Multi-Stream Fraud Transformer (MSFT) for financial fraud detection, which independently encodes transaction, login, and risk event streams using Transformers and fuses them with time-aware positional encoding and gated fusion, achieving 0.9961 AUROC on a large dataset.

#transformer

Why Do Accumulated Transformations Extrapolate?

arXiv cs.LG · 3d ago

This paper investigates why accumulated token-dependent orthogonal transformations, such as those used in PaTH Attention and a simplified variant with SO(2) rotations, enable length extrapolation in transformers. It proves that such transformations become incoherent after a finite number of steps, suppressing attention to distant tokens, and shows both theoretically and experimentally that this mechanism improves extrapolation but eventually degrades at extreme context lengths.

#transformer

Towards Scalable Multi-Task Reinforcement Learning with Large Decision Models

arXiv cs.LG · 3d ago

This paper introduces LDM-v0, a large decision model trained offline on trajectories from thousands of diverse reinforcement learning environments, demonstrating that a single transformer policy can match the performance of task-specific policies across robotics, autonomous driving, inventory management, cybersecurity, trading, and video games.

#transformer

@Phoenixyin13: I think this is a top-notch work in ICML 2026. The attention mechanism of traditional Transformers is essentially point-to-point matching: it cuts input into a bunch of tokens (discrete points), computes similarity between Query and Key, and then weights the Value. In NLP...

X AI KOLs Timeline · 3d ago

Introduces the ICML 2026 paper Functional Attention, which treats functions as first-class citizens and replaces softmax point-to-point similarity with structured linear operators. It addresses the discretization, resolution sensitivity, and high computational complexity that traditional Transformers face when handling continuous functions. The method matches or surpasses SOTA on tasks such as PDE solving and 3D segmentation, and exhibits strong OOD generalization.

#transformer

@agisummitai: Speaker Spotlight: Christopher Manning If you've used an LLM, you've used his research. Christopher Manning is one of t…

X AI KOLs Following · 3d ago

Christopher Manning is spotlighted as a keynote speaker at the AGI Summit, highlighting his pioneering research in NLP, including GloVe and the attention mechanism, as well as his role at Stanford.

#transformer

High Dimensional, Dynamic Rotary Positional Embedding [P]

Reddit r/MachineLearning · 3d ago

Introduces HDD-RoPE, an extension of rotary positional embeddings that uses high-dimensional chunks and data-dependent rotation rates, showing faster convergence on TinyStories compared to xPos.

#transformer

@s_scardapane: The Transformer Cookbook by @pentagonalize @davidweichiang et al. A beautiful introduction to "hardcoding" algorithms…

X AI KOLs Timeline · 3d ago

A tweet introducing 'The Transformer Cookbook', a paper that provides a beautiful introduction to hardcoding algorithms (addition, lookup, branching) inside transformer weights, following the RASP paper.

#transformer

An LLM-based Two-Stage Transformer Framework for Cross-Domain Bearing Fault Diagnosis with Limited Data

arXiv cs.LG · 4d ago

Proposes a knowledge-guided two-stage transfer learning framework using a lightweight GPT-2-style Transformer for cross-domain bearing fault diagnosis with limited data, achieving 92.61% accuracy with only 10% labeled data.
