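[New model] SupraLabs launches its Any2Any model family!
Summary
SupraLabs has released Supra-A2A-Nano-Exp, a small any-to-any autoregressive model that unifies text and image tokenization in a single Transformer. It is an educational prototype, not a production-ready system.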
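Cached: 2026/06/21 04:35
SupraLabs/Supra-A2A-Nano-Exp · Hugging Face
Source: https://huggingface.co/SupraLabs/Supra-A2A-Nano-Exp
Status: experimental / educational prototype. Not a finished product.
Supra-A2A-Nano-Exp is a small proof-of-concept model from SupraLabs (https://huggingface.co/SupraLabs): an autoregressive GPT that treats text and images as one unified sequence of discrete tokens. Text uses standard BPE tokens; images are discretized by a convolutional VQ-VAE into a set of learned "visual words" that are appended to the same vocabulary. A single Transformer with one set of weights processes the whole multimodal stream, with no separate vision encoder or diffusion head.
This release is mainly meant to demonstrate unified cross-modal tokenization on consumer hardware. With roughly 30M parameters in total and a context length of 384 tokens, it should not be expected to produce coherent long text or realistic images. Treat it as a transparent, hackable reference architecture rather than a capable generator.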
How it works
Every input (text or image) is serialized into a single token stream, wrapped in control tags:
<|text|> some text here <|image|> [64 visual tokens] <|/image|> [visual tokens][visual tokens]...
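The layout above also fixes how much room is left in the 384-token context. The following is a minimal sketch of that budget, assuming the tag order shown in the example line and 64 visual tokens per image; `build_stream`, `max_text_tokens`, and the `<vis>` placeholder are hypothetical names used only for illustration and are not part of the released script.

```python
CONTEXT_LEN = 384        # context length stated for this model
TOKENS_PER_IMAGE = 64    # one 64x64 image -> 8x8 grid of visual codes


def build_stream(text_tokens, n_images=1):
    """Lay out a symbolic stream: <|text|> ... <|image|> codes <|/image|>."""
    stream = ["<|text|>"] + list(text_tokens)
    for _ in range(n_images):
        stream.append("<|image|>")
        # "<vis>" is a hypothetical placeholder for one visual-code token.
        stream.extend(["<vis>"] * TOKENS_PER_IMAGE)
        stream.append("<|/image|>")
    return stream


def max_text_tokens(n_images=1):
    """Text tokens that still fit next to n_images in one context window."""
    overhead = 1 + n_images * (TOKENS_PER_IMAGE + 2)  # <|text|> + image spans
    return CONTEXT_LEN - overhead


stream = build_stream(["a", "dog"], n_images=1)
print(len(stream), max_text_tokens(1))
```

Under these assumptions a single image leaves a few hundred tokens for the prompt, and a sixth image no longer fits in the window at all.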
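- Text tokens come from a GPT-2-style BPE tokenizer (50,257 tokens) plus 7 control tokens (<|text|>, <|/text|>, <|image|>, <|/image|>, <|start|>, <|end|>, <|pad|>), for 50,264 text-side IDs in total.
- Visual tokens come from the 256 codebook entries of the VQ-VAE. An image is encoded by 3 strided convolutions (8x total downsampling), and each spatial cell is mapped to its nearest codebook vector. A 64x64 image becomes an 8x8 grid, i.e. 64 visual tokens.
- The 256 visual codes are appended after the text vocabulary, so the GPT's total vocabulary is exactly 50,264 + 256 = 50,520 tokens. One embedding table, one output head, and one model handle both modalities.

Because images are projected into the same ID space as text, the GPT can in principle attend across modalities: generating an image from a text prompt, or text from image content, with exactly the same attention mechanism used for next-token text prediction.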
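The ID arithmetic described above can be sketched as a pair of offset conversions. This is an illustration under the stated vocabulary sizes, not code from the release: the helper names are hypothetical, and the row-major flattening of the 8x8 grid is an assumption.

```python
BPE_VOCAB = 50_257                      # GPT-2-style BPE tokens
N_CONTROL = 7                           # <|text|>, <|/text|>, <|image|>, ...
TEXT_VOCAB = BPE_VOCAB + N_CONTROL      # 50,264 text-side IDs
N_VISUAL = 256                          # VQ-VAE codebook entries
TOTAL_VOCAB = TEXT_VOCAB + N_VISUAL     # 50,520 IDs in the unified vocabulary


def visual_code_to_token_id(code):
    """Map a codebook index (0..255) to its ID in the unified vocabulary."""
    if not 0 <= code < N_VISUAL:
        raise ValueError(f"codebook index out of range: {code}")
    return TEXT_VOCAB + code


def token_id_to_visual_code(token_id):
    """Return the codebook index, or None if the ID is on the text side."""
    if TEXT_VOCAB <= token_id < TOTAL_VOCAB:
        return token_id - TEXT_VOCAB
    return None


def grid_to_token_ids(grid):
    """Flatten an H x W grid of codes row by row (assumed ordering)."""
    return [visual_code_to_token_id(c) for row in grid for c in row]


ids = grid_to_token_ids([[0] * 8 for _ in range(8)])
print(TOTAL_VOCAB, len(ids), ids[0])
```

Because the visual codes occupy a contiguous range at the end of the vocabulary, a generated ID can be classified as text or image with a single comparison.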
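Architecture
| Component | Details |
|---|---|
| GPT backbone | 4 Transformer blocks, pre-norm, fused QKV attention, causal |
| Embedding dimension | 256 |
| Context length | 384 tokens |
| Attention heads | 4 (assumed; see the note below) |
| MLP expansion | 4x (256 -> 1024 -> 256), GELU |
| Total vocabulary | 50,520 (50,264 text + 256 visual) |
| GPT parameters | ~29.7M |
| VQ-VAE | 3-layer convolutional encoder/decoder, 8x downsampling, 256x64 codebook |
| VQ-VAE parameters | ~0.22M |
| Total parameters | ~29.9M |
| Precision | fp32 |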
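The VQ-VAE figures in the table can be cross-checked with a little arithmetic. The sketch below assumes the layer sizes used by the VQVAE class in run_supra_a2a.py (4x4 kernels, stride 2, padding 1, channels 3 -> 32 -> 64 -> 64 and back, each layer with a bias); the helper names are hypothetical.

```python
def conv_params(c_in, c_out, k=4):
    """Weights plus biases of one Conv2d or ConvTranspose2d layer."""
    return c_in * c_out * k * k + c_out


def conv_out_size(n, k=4, stride=2, pad=1):
    """Spatial size after one strided convolution."""
    return (n + 2 * pad - k) // stride + 1


ENCODER = [(3, 32), (32, 64), (64, 64)]   # (in_channels, out_channels)
DECODER = [(64, 64), (64, 32), (32, 3)]
CODEBOOK = 256 * 64                       # 256 entries of dimension 64

vqvae_params = sum(conv_params(a, b) for a, b in ENCODER + DECODER) + CODEBOOK

side = 64
for _ in ENCODER:
    side = conv_out_size(side)            # 64 -> 32 -> 16 -> 8

print(vqvae_params, side * side)
```

Under these assumptions the count comes to 216,323 parameters, which agrees with the ~0.22M listed, and a 64x64 input yields an 8x8 grid of 64 codes.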
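Note on the number of attention heads: The checkpoint stores QKV as a single fused Linear layer, so the head count cannot be recovered from the weights alone. run_supra_a2a.py defaults to 4 heads (64 dimensions per head, the GPT-2 convention). If you trained this checkpoint yourself with a different head count, change N_HEAD at the top of the script. Loading will still succeed because the shapes match, but a wrong value silently produces incorrect attention.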
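To see why a wrong head count goes unnoticed at load time yet changes the result, here is a toy illustration in plain Python. It is a sketch, not the script's implementation: `causal_attention` is a hypothetical helper that takes already-projected q, k, v vectors (as if split from a fused QKV output) and only varies how the channels are grouped into heads.

```python
import math


def causal_attention(q, k, v, n_head):
    """Causal multi-head attention over lists of T vectors of length C."""
    t, c = len(q), len(q[0])
    assert c % n_head == 0, "C must be divisible by n_head"
    hd = c // n_head
    out = [[0.0] * c for _ in range(t)]
    for h in range(n_head):
        lo, hi = h * hd, (h + 1) * hd          # channel slice of this head
        for i in range(t):
            # Scores against positions 0..i only (causal mask).
            scores = [
                sum(q[i][d] * k[j][d] for d in range(lo, hi)) / math.sqrt(hd)
                for j in range(i + 1)
            ]
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            for j, e in enumerate(exps):
                for d in range(lo, hi):
                    out[i][d] += (e / z) * v[j][d]
    return out


# Same q, k, v tensors; only the head grouping differs.
q = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
k = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
v = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]

one_head = causal_attention(q, k, v, n_head=1)
two_heads = causal_attention(q, k, v, n_head=2)
print(one_head[1], two_heads[1])
```

Both calls accept the same inputs and return outputs of the same shape, but the second token's output differs between the two groupings. No shape check can catch this kind of mismatch.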
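Note on pixel normalization: Similarly, the final activation of the VQ-VAE decoder has no learnable parameters, so it is not visible in the checkpoint either. The script defaults to sigmoid (assuming training on the [0, 1] range). If reconstructions look poor, try VQVAE_OUTPUT_ACTIVATION = "tanh".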
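The two conventions differ in how a raw decoder output maps to an 8-bit pixel. The sketch below is an illustration only: `to_pixel` is a hypothetical helper, and the rescaling of the tanh output from [-1, 1] to [0, 1] is an assumption about what a decoder trained on [-1, 1] targets would need. The visible part of the script applies the activation and then clamps to [0, 1] directly.

```python
import math


def to_pixel(raw, activation="sigmoid"):
    """Convert one raw decoder output to an 8-bit pixel value (0..255)."""
    if activation == "sigmoid":
        unit = 1.0 / (1.0 + math.exp(-raw))      # already in [0, 1]
    elif activation == "tanh":
        # Assumption: [-1, 1] training range, so shift and scale to [0, 1].
        unit = (math.tanh(raw) + 1.0) / 2.0
    else:
        unit = raw                               # "none": use the value as-is
    unit = min(1.0, max(0.0, unit))              # clamp before quantizing
    return int(unit * 255 + 0.5)


for name in ("sigmoid", "tanh", "none"):
    print(name, [to_pixel(x, name) for x in (-2.0, 0.0, 2.0)])
```

A raw output of 0 maps to mid-gray under both sigmoid and rescaled tanh, but the two curves diverge away from zero, so picking the wrong one shifts contrast rather than failing outright.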
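Files in this repository
| File | Description |
|---|---|
| model.pt | GPT backbone weights (state_dict, fp32) |
| vqvae.pt | VQ-VAE encoder/decoder/codebook weights (state_dict, fp32) |
| tokenizer.json | Fast BPE tokenizer (GPT-2 base vocabulary + 7 control tokens) |
| tokenizer_config.json | Tokenizer metadata (special tokens, GPT2Tokenizer class) |
| run_supra_a2a.py | Drop-in inference script (see below) |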
Samples
| Prompt | Response |
|---|---|
| Text-to-text: Once upon a time | Once upon a time, there was a little girl named Lily. She loved to visit her grandma’s house. Her grandma was very bossy and always told her what to do. One day, Lily went to visit her grandma and something strange happened. The sky turned dark and it started to rain. Lily was scared, but her grandma told her not to worry. They sat inside and played games until the rain stopped. |
| Text-to-image: A dog running on snow | Image (https://cdn-uploads.huggingface.co/production/uploads/68df176c403a7bf9e8ae85a8/IIq5xvzEqunqnEcYS71Gj.png) |
| Text-to-video: A snow scene | https://cdn-uploads.huggingface.co/production/uploads/68df176c403a7bf9e8ae85a8/Lv1UqRG1m8KkiT46X8RNC.mp4 |
Usage
```bash
pip install torch transformers huggingface_hub safetensors Pillow numpy
```

```python
#!/usr/bin/env python3
"""
Supra-A2A-Nano-Exp - inference runner
======================================

An experimental any-to-any model from SupraLabs: a single autoregressive GPT
operates over one unified vocabulary that mixes text (BPE, GPT-2 style) and
discrete visual codes from a convolutional VQ-VAE. Text and images are
serialized into the same token stream, delimited by control tokens
(<|text|>, <|/text|>, <|image|>, <|/image|>).

This script reconstructs the exact architecture from the raw state_dicts (no
config.json ships with the weights) and exposes a SupraA2A class with
high-level methods for text completion, image reconstruction (a VQ-VAE sanity
check), and text-conditioned image generation.

Quick start:
    python run_supra_a2a.py --mode text --prompt "Once upon a time"
    python run_supra_a2a.py --mode reconstruct --image photo.png --out recon.png
    python run_supra_a2a.py --mode text2image --prompt "a red square" --out gen.png
    python run_supra_a2a.py --mode chat

Expected weight files in --weights_dir (default: this script's directory):
    model and vqvae (each as .safetensors or .pt), tokenizer.json,
    tokenizer_config.json

If any are missing locally, the script falls back to downloading from the Hub.
"""
from __future__ import annotations

import argparse
import math
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional, Tuple

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image
from safetensors.torch import load_file as load_safetensors

# ---------------------------------------------------------------------------
# Config
# ---------------------------------------------------------------------------
MODEL_WEIGHT_STEMS = ["model", "vqvae"]  # each may exist as .safetensors or .pt
TOKENIZER_FILES = ["tokenizer.json", "tokenizer_config.json"]

# Patterns used for the Hub fallback download; covers both weight formats so
# this works whether the repo ships .safetensors (preferred), .pt, or both.
HUB_ALLOW_PATTERNS = [f"{s}.safetensors" for s in MODEL_WEIGHT_STEMS] + [
    f"{s}.pt" for s in MODEL_WEIGHT_STEMS
] + TOKENIZER_FILES

DEFAULT_REPO_ID = "SupraLabs/Supra-A2A-Nano-Exp"

# NOTE: the GPT checkpoint does not store n_head explicitly (qkv is a single
# fused Linear layer). 256 / 4 = 64 head_dim (the GPT-2 convention) is the
# most likely value and is the default here, but it cannot be verified from
# the weights alone. A wrong value will NOT break loading (shapes still
# match) but will silently produce incorrect attention. If generations look
# off, try N_HEAD = 8 (head_dim 32) instead.
N_HEAD = 4

# Pixel normalization used by the VQ-VAE decoder's final layer. There is no
# parametrized activation after the last ConvTranspose2d in the checkpoint,
# so this is also an assumption rather than a certainty. Default: sigmoid
# (assumes images were trained in [0, 1]). If reconstructions look washed
# out / inverted, try "tanh" (assumes [-1, 1] training).
VQVAE_OUTPUT_ACTIVATION = "sigmoid"  # "sigmoid" | "tanh" | "none"


# ---------------------------------------------------------------------------
# Weight resolution (local dir -> Hugging Face Hub fallback)
# ---------------------------------------------------------------------------
def _has_all_weights(d: Path) -> bool:
    weights_ok = all(
        (d / f"{stem}.safetensors").exists() or (d / f"{stem}.pt").exists()
        for stem in MODEL_WEIGHT_STEMS
    )
    tok_ok = all((d / f).exists() for f in TOKENIZER_FILES)
    return weights_ok and tok_ok


def resolve_weights_dir(weights_dir: Path, repo_id: str) -> Path:
    """Return a directory containing the tokenizer files plus, for each model in
    MODEL_WEIGHT_STEMS, either a .safetensors or a .pt file.

    Looks locally first; if anything is missing, tries to download from the Hub
    via huggingface_hub.snapshot_download (works once the
    SupraLabs/Supra-A2A-Nano-Exp repo is public).
    """
    if _has_all_weights(weights_dir):
        return weights_dir

    print(f"[info] weight files incomplete in {weights_dir}")
    print(f"[info] trying to download from huggingface.co/{repo_id} ...")
    try:
        from huggingface_hub import snapshot_download
    except ImportError as e:
        raise RuntimeError(
            "huggingface_hub is not installed and local weights are incomplete. "
            "Run: pip install huggingface_hub"
        ) from e

    try:
        downloaded = snapshot_download(repo_id=repo_id, allow_patterns=HUB_ALLOW_PATTERNS)
    except Exception as e:
        raise RuntimeError(
            f"Could not find the weights locally or on the Hub ({repo_id}). "
            f"Place tokenizer files plus {MODEL_WEIGHT_STEMS} (.safetensors or .pt) "
            f"manually in {weights_dir}.\nOriginal error: {e}"
        ) from e

    downloaded = Path(downloaded)
    if not _has_all_weights(downloaded):
        raise RuntimeError(f"Download from {repo_id} completed but required files are still missing.")
    return downloaded


def load_state_dict_any(weights_dir: Path, stem: str) -> dict:
    """Load a checkpoint by stem name, preferring .safetensors over legacy .pt."""
    st_path = weights_dir / f"{stem}.safetensors"
    pt_path = weights_dir / f"{stem}.pt"
    if st_path.exists():
        return load_safetensors(str(st_path))
    if pt_path.exists():
        return torch.load(pt_path, map_location="cpu", weights_only=False)
    raise FileNotFoundError(f"Neither {st_path.name} nor {pt_path.name} found in {weights_dir}")


# ---------------------------------------------------------------------------
# VQ-VAE (conv encoder / codebook / transposed-conv decoder)
# ---------------------------------------------------------------------------
class VectorQuantizer(nn.Module):
    """Discrete codebook with nearest-neighbor lookup (inference only, no EMA)."""

    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.num_codes = num_codes
        self.dim = dim
        self.embedding = nn.Embedding(num_codes, dim)

    def encode(self, z: torch.Tensor) -> torch.Tensor:
        """z: (B, C, H, W) -> discrete indices (B, H, W)."""
        b, c, h, w = z.shape
        z_flat = z.permute(0, 2, 3, 1).reshape(-1, c)
        codebook = self.embedding.weight  # (num_codes, dim)
        # Squared Euclidean distance, expanded as |z|^2 - 2 z.e + |e|^2.
        dist = (
            z_flat.pow(2).sum(1, keepdim=True)
            - 2 * z_flat @ codebook.t()
            + codebook.pow(2).sum(1)
        )
        idx = dist.argmin(dim=1)
        return idx.view(b, h, w)

    def decode(self, idx: torch.Tensor) -> torch.Tensor:
        """idx: (B, H, W) -> z_q (B, C, H, W)."""
        z_q = self.embedding(idx)  # (B, H, W, C)
        return z_q.permute(0, 3, 1, 2).contiguous()


class VQVAE(nn.Module):
    """Conv autoencoder with /8 downsampling, codebook size/dim read from the checkpoint."""

    def __init__(self, codebook_size: int = 256, code_dim: int = 64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, code_dim, kernel_size=4, stride=2, padding=1),
        )
        self.vq = VectorQuantizer(codebook_size, code_dim)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(code_dim, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),
        )

    def _final_activation(self, x: torch.Tensor) -> torch.Tensor:
        if VQVAE_OUTPUT_ACTIVATION == "sigmoid":
            return torch.sigmoid(x)
        if VQVAE_OUTPUT_ACTIVATION == "tanh":
            return torch.tanh(x)
        return x

    @torch.no_grad()
    def encode_to_indices(self, img: torch.Tensor) -> torch.Tensor:
        """img: (B, 3, H, W) normalized to [0, 1], H and W multiples of 8."""
        z = self.enc(img)
        return self.vq.encode(z)

    @torch.no_grad()
    def decode_from_indices(self, idx: torch.Tensor) -> torch.Tensor:
        """idx: (B, H, W) -> image (B, 3, H*8, W*8) in [0, 1]."""
        z_q = self.vq.decode(idx)
        x = self.dec(z_q)
        return self._final_activation(x).clamp(0, 1)


# ---------------------------------------------------------------------------
# GPT (nanoGPT-style: pre-norm, fused qkv, 4x MLP)
# ---------------------------------------------------------------------------
class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)
        self.proj = nn.Linear(n_embd, n_embd)
        mask = torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size)
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c = x.shape
        qkv = self.qkv(x)
        q, k, v = qkv.split(c, dim=2)
        hd = c // self.n_head
        # (B, T, C) -> (B, n_head, T, head_dim)
        q = q.view(b, t, self.n_head, hd).transpose(1, 2)
        k = k.view(b, t, self.n_head, hd).transpose(1, 2)
        v = v.view(b, t, self.n_head, hd).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(hd)
        att = att.masked_fill(self.mask[:, :, :t, :t] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v
        y = y.transpose(1, 2).contiguous().view(b, t, c)
        return self.proj(y)


class Block(nn.Module):
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x


class GPT(nn.Module):
    def __init__(self, vocab_size: int, n_embd: int, block_size: int, n_layer: int, n_head: int):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Embedding(block_size, n_embd)
        self.blocks = nn.ModuleList([Block(n_embd, n_head, block_size) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size, bias=False)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        b, t = idx.shape
        assert t <= self.block_size, f"sequence length ({t}) exceeds model context ({self.block_size})"
        pos = torch.arange(t, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)[None, :, :]
        for blk in self.blocks:
            x = blk(x)
        x = self.ln_f(x)
        return self.head(x)


# ---------------------------------------------------------------------------
# Unified pipeline
# ---------------------------------------------------------------------------
@dataclass
class VocabLayout:
    text_vocab_size: int    # BPE + control tokens (<|text|>, ...)
    visual_offset: int      # = text_vocab_size: where visual codes start
    visual_vocab_size: int  # VQ-VAE codebook size
    total_vocab_size: int   # text_vocab_size + visual_vocab_size


class SupraA2A:
    """High-level wrapper: tokenizer + VQ-VAE + GPT sharing one unified vocabulary."""

    def __init__(self, weights_dir: Path, device: Optional[str] = None, n_head: int = N_HEAD):
        self.device = torch.device(device or ("cuda" if torch.cuda.is_available() else "cpu"))

        from transformers import PreTrainedTokenizerFast
        self.tokenizer = PreTrainedTokenizerFast(tokenizer_file=str(weights_dir / "tokenizer.json"))
        self.tokenizer.pad_token = "<|endoftext|>"
        self.tokenizer.bos_token = "<|endoftext|>"
        self.tokenizer.eos_token = "<|endoftext|>"
        self.tokenizer.unk_token = "<|endoftext|>"

        vq_state = load_state_dict_an
```