Brain-IT-VQA: From Brain Signals to Answers

Hugging Face Daily Papers

Summary

The Brain-IT-VQA framework answers visual questions from fMRI signals using a transformer-based architecture, and it outperforms previous fMRI-based captioning and VQA approaches. The authors also introduce NSD-VQA, a new dataset with richer annotations for evaluating fMRI-based visual question answering.

Decoding visual content from fMRI signals recorded while a person views images, and specifically answering questions about the seen images, is a long-standing challenge. While significant progress has been made in recent years in visual question answering (VQA) from fMRI, performance remains limited. Moreover, although recent models can make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Unlike existing image-fMRI VQA datasets, which typically provide only a few broad and weakly controlled questions per image, NSD-VQA provides on average 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images. We further analyze the contributions of different brain regions across question types.
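Because NSD-VQA groups its questions into controlled categories, results can be reported per category rather than as a single aggregate score. The following is a minimal sketch of that idea, not the authors' evaluation code: the record layout, the category names, and the case-insensitive exact-match scoring rule are all assumptions made for illustration.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """Compute accuracy separately for each question category.

    `records` is an iterable of (category, predicted_answer, reference_answer)
    tuples. Scoring is case-insensitive exact match, which is an assumption
    here and may differ from the metric used in the paper.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, reference in records:
        total[category] += 1
        if predicted.strip().lower() == reference.strip().lower():
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Hypothetical toy records; these category names are illustrative only
# and are not taken from NSD-VQA.
records = [
    ("object_presence", "yes", "Yes"),
    ("object_presence", "no", "yes"),
    ("scene_type", "indoor", "indoor"),
    ("dominant_color", "red", "blue"),
]

scores = per_category_accuracy(records)
for category, accuracy in sorted(scores.items()):
    print(f"{category}: {accuracy:.2f}")
```

Breaking scores down this way shows which kinds of visual or semantic information are decoded reliably, which an aggregate accuracy would hide.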

Cached at: 06/02/26, 03:24 AM

Source: https://huggingface.co/papers/2605.29588

Get this paper in your agent:

hf papers read 2605.29588

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Neural Module Networks for Visual Question Answering

ML at Berkeley

This article explains the Neural Module Networks (NMN) architecture from the paper 'Deep Compositional Question Answering with Neural Module Networks,' detailing how it handles the compositional structure of visual question answering tasks by decomposing questions into modular steps.

SGOCR: A Spatially-Grounded OCR-focused Pipeline & V1 Dataset [P]

Reddit r/MachineLearning

SGOCR is an open-source dataset pipeline for generating spatially-grounded, OCR-focused visual question answering (VQA) tuples with rich metadata to support diverse VLM training. The pipeline uses a multi-stage approach combining models like Nvidia's nemotron-ocr-v2, Gemma4, Qwen3-VL, and Gemini-2.5-Flash, along with an agentic optimization loop.

Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding

Hugging Face Daily Papers

This paper introduces a meta-optimized approach for semantic visual decoding from fMRI signals that generalizes to novel subjects without fine-tuning, using in-context learning to infer unique neural encoding patterns from a small set of image-brain activation examples. The method achieves strong cross-subject and cross-scanner generalization without requiring anatomical alignment or stimulus overlap.