Brain-IT-VQA: From Brain Signals to Answers

Hugging Face Daily Papers, 05/28/26, 12:00 AM

brain-decoding fmri visual-question-answering transformer dataset benchmark

Summary

The Brain-IT-VQA framework decodes visual content from fMRI signals using a transformer architecture and outperforms previous fMRI-based captioning and VQA methods. The authors also introduce NSD-VQA, a new dataset with richer annotations for evaluating fMRI-based visual question answering.

Decoding visual content from fMRI signals recorded while a person views images, and specifically answering questions about the seen images, is a long-standing challenge. While significant progress has been made in recent years in visual question answering (VQA) from fMRI, performance remains limited. Moreover, although recent models can make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Unlike existing image-fMRI VQA datasets, which typically provide only a few broad and weakly controlled questions per image, NSD-VQA provides on average 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images. We further analyze the contributions of different brain regions across question types.
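The abstract says the model "decodes language tokens from brain activity and integrates them with a language model". One common way to feed such tokens to a language model is to prepend them as a soft prompt in embedding space. The toy sketch below illustrates only that general idea and is not the authors' Brain-IT architecture. A single linear map (a hypothetical stand-in for a learned decoder) turns one fMRI vector into a few embeddings, which are then concatenated in front of the embedded question. All sizes, weights, and names (`decode_brain_tokens`, `build_lm_input`) are illustrative assumptions.

```python
import random

# Hypothetical sizes chosen for illustration; none come from the paper.
N_VOXELS = 50       # length of the flattened fMRI response to one image
N_BRAIN_TOKENS = 4  # number of brain-decoded "tokens"
D_MODEL = 8         # language-model embedding width

def decode_brain_tokens(fmri, weights):
    """Linearly map one fMRI vector to N_BRAIN_TOKENS embeddings of width D_MODEL.

    weights[t][d] is the voxel-weight vector for output token t, dimension d.
    This single linear layer is a hypothetical stand-in for a learned decoder.
    """
    return [
        [sum(w * v for w, v in zip(weights[t][d], fmri)) for d in range(D_MODEL)]
        for t in range(N_BRAIN_TOKENS)
    ]

def build_lm_input(brain_tokens, question_embeddings):
    """Prepend brain-derived embeddings to the embedded question (soft prompt)."""
    return brain_tokens + question_embeddings

rng = random.Random(0)
weights = [[[rng.gauss(0, 0.01) for _ in range(N_VOXELS)] for _ in range(D_MODEL)]
           for _ in range(N_BRAIN_TOKENS)]
fmri = [rng.gauss(0, 1) for _ in range(N_VOXELS)]
question = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(6)]  # 6 question tokens

lm_input = build_lm_input(decode_brain_tokens(fmri, weights), question)
print(len(lm_input), len(lm_input[0]))  # 10 8
```

The resulting sequence (4 brain tokens followed by 6 question tokens) is what a decoder-only language model would consume in place of ordinary token embeddings; in a real system the decoder weights would be trained rather than sampled at random.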

Original Article


Cached at: 06/02/26, 03:24 AM

Paper page - Brain-IT-VQA: From Brain Signals to Answers

Source: https://huggingface.co/papers/2605.29588

Abstract

The Brain-IT-VQA framework decodes visual content from fMRI signals using a transformer-based architecture, and the authors introduce the NSD-VQA dataset for improved evaluation of visual question answering.
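Because NSD-VQA groups its questions into controlled categories, results become interpretable when accuracy is reported per category rather than as one pooled number. The sketch below shows that aggregation under two assumptions that are not taken from the paper: answers are scored by case-insensitive exact match, and the category names are made up.

```python
from collections import defaultdict

def per_category_accuracy(records):
    """Return {category: fraction correct} from (category, predicted, reference) triples.

    Scoring is case-insensitive exact match after stripping whitespace, an assumed
    rule; the benchmark's actual metric may differ.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, reference in records:
        total[category] += 1
        if predicted.strip().lower() == reference.strip().lower():
            correct[category] += 1
    return {c: correct[c] / total[c] for c in total}

# Hypothetical predictions; category names are illustrative only.
records = [
    ("object_presence", "yes", "Yes"),
    ("object_presence", "no", "yes"),
    ("color", "red", "red"),
    ("counting", "two", "three"),
]
print(per_category_accuracy(records))
# {'object_presence': 0.5, 'color': 1.0, 'counting': 0.0}
```

Scoring many questions per image in this way lets a small fMRI test set still yield separate estimates for each kind of visual information.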




Similar Articles

Neural Module Networks for Visual Question Answering

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Distilling Answer-Set Programming Rules from LLMs for Neurosymbolic Visual Question Answering

SGOCR: A Spatially-Grounded OCR-focused Pipeline & V1 Dataset [P]

Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding
