Brain-IT-VQA: From Brain Signals to Answers
Summary
The Brain-IT-VQA framework decodes visual content from fMRI signals using a transformer architecture, outperforming previous fMRI-based captioning and VQA approaches. The authors also introduce NSD-VQA, a new dataset and benchmark for fMRI-based visual question answering that provides on average 20 question-answer pairs per image across 20 controlled question categories.
Cached at: 06/02/26, 03:24 AM
Source: https://huggingface.co/papers/2605.29588
Abstract
The Brain-IT-VQA framework decodes visual content from fMRI signals using a transformer-based architecture, and the authors introduce the NSD-VQA dataset for improved evaluation of visual question answering.
Decoding visual content from fMRI signals recorded while a person views images, and specifically answering questions about the seen images, is a long-standing challenge. While significant progress has been made in recent years in visual question answering (VQA) from fMRI, performance remains limited. Moreover, although recent models can make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Unlike existing image-fMRI VQA datasets, which typically provide only a few broad and weakly controlled questions per image, NSD-VQA provides on average 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images. We further analyze the contributions of different brain regions across question types.
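To make the idea of category-controlled evaluation concrete, here is a minimal Python sketch of scoring predicted answers separately for each question category. It is an illustration only, not the authors' evaluation code: the record keys (`category`, `prediction`, `answer`), the example category names, and the case-insensitive exact-match scoring rule are all assumptions that do not come from the paper.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """Return exact-match accuracy per question category.

    `records` is an iterable of dicts with the hypothetical keys
    'category', 'prediction', and 'answer' (assumed for illustration).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        category = record["category"]
        total[category] += 1
        # Assumed scoring rule: case-insensitive exact match after trimming.
        predicted = record["prediction"].strip().lower()
        expected = record["answer"].strip().lower()
        if predicted == expected:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Hypothetical toy records; the category names are made up for this sketch.
example_records = [
    {"category": "color", "prediction": "Red", "answer": "red"},
    {"category": "color", "prediction": "blue", "answer": "green"},
    {"category": "count", "prediction": "2", "answer": "2"},
]

scores = per_category_accuracy(example_records)
for category, accuracy in sorted(scores.items()):
    print(f"{category}: {accuracy:.2f}")
```

Reporting one score per category, rather than a single pooled number, is what lets a benchmark of this kind separate the forms of visual information that are decoded reliably from those that are not.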
Similar Articles
Neural Module Networks for Visual Question Answering
This article explains the Neural Module Networks (NMN) architecture from the paper 'Deep Compositional Question Answering with Neural Module Networks,' detailing how it handles the compositional structure of visual question answering tasks by decomposing questions into modular steps.
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
VisualThink-VLA introduces a visual intermediate reasoning framework for vision-language-action policies that preserves spatial precision and dramatically reduces latency compared to text-based reasoning, achieving sub-second inference and state-of-the-art success rates on robot manipulation benchmarks.
Distilling Answer-Set Programming Rules from LLMs for Neurosymbolic Visual Question Answering
This paper presents a method for distilling answer-set programming rules from large language models to enhance neurosymbolic visual question answering, showing that only a few examples are needed to generate correct rules.
SGOCR: A Spatially-Grounded OCR-focused Pipeline & V1 Dataset [P]
SGOCR is an open-source dataset pipeline for generating spatially-grounded, OCR-focused visual question answering (VQA) tuples with rich metadata to support diverse VLM training. The pipeline uses a multi-stage approach combining models like Nvidia's nemotron-ocr-v2, Gemma4, Qwen3-VL, and Gemini-2.5-Flash, along with an agentic optimization loop.
Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding
This paper introduces a meta-optimized approach for semantic visual decoding from fMRI signals that generalizes to novel subjects without fine-tuning, using in-context learning to infer unique neural encoding patterns from a small set of image-brain activation examples. The method achieves strong cross-subject and cross-scanner generalization without requiring anatomical alignment or stimulus overlap.