Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge
Summary
This paper introduces LaViD, a framework that transfers semantic knowledge from a language-only LLM to a vision student model by generating multiple-choice questions as conceptual signatures, achieving superior fine-grained classification performance and robustness.
Cached at: 06/29/26, 05:28 AM
# Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge
Source: [https://arxiv.org/html/2606.27527](https://arxiv.org/html/2606.27527)
###### Abstract
Large Language Models \(LLMs\) possess broad conceptual knowledge acquired through large\-scale text pretraining, yet their potential to supervise models in other modalities remains underexplored\. In this work, we propose LaViD \(Language\-to\-Visual Knowledge Distillation\), a simple and effective framework for transferring high\-level semantic knowledge from a language\-only teacher to a vision\-only student model\. Instead of relying on paired multimodal data, LaViD elicits conceptual signals from an LLM by prompting it to generate multiple\-choice questions \(MCQs\) that probe semantic distinctions between visual classes\. Each class is mapped to a soft label distribution over these MCQs, forming a rich conceptual signature that guides the student through an auxiliary distillation loss\. Notably, despite using a language\-only teacher without access to image data, LaViD consistently outperforms recent methods like MaKD that distill from vision\-language models across multiple fine\-grained benchmarks\. It also achieves competitive or superior performance compared to state\-of\-the\-art visual distillation methods such as DKD and MLKD, with further gains when combined with logit standardization\. On the Waterbirds dataset, LaViD substantially improves worst\-group accuracy, demonstrating enhanced robustness to spurious correlations under distillation\. Code is available at [https://github\.com/lliangthomas/lavid](https://github.com/lliangthomas/lavid)\.
Machine Learning, ICML, Knowledge Distillation, Cross\-Modal Learning, Fine\-Grained Visual Classification, Large Language Models
## 1 Introduction
Knowledge Distillation \(KD\) \(Buciluǎ et al\., 2006; Hinton et al\., [2015](https://arxiv.org/html/2606.27527#bib.bib9); Romero et al\., [2014](https://arxiv.org/html/2606.27527#bib.bib11)\) is a foundational technique for transferring knowledge from a large teacher model to a smaller student model\. This approach \(Tian et al\., [2019](https://arxiv.org/html/2606.27527#bib.bib12); Chen et al\., [2022](https://arxiv.org/html/2606.27527#bib.bib14); Yang et al\., [2021](https://arxiv.org/html/2606.27527#bib.bib15); Zhao et al\., [2022](https://arxiv.org/html/2606.27527#bib.bib13); Hao et al\., [2023](https://arxiv.org/html/2606.27527#bib.bib16)\) typically requires a dataset\-specific teacher that guides the student's learning through its logits \(Hinton et al\., [2015](https://arxiv.org/html/2606.27527#bib.bib9); Tian et al\., [2019](https://arxiv.org/html/2606.27527#bib.bib12); Hao et al\., [2023](https://arxiv.org/html/2606.27527#bib.bib16); Sun et al\., [2024](https://arxiv.org/html/2606.27527#bib.bib17); Zhao et al\., [2022](https://arxiv.org/html/2606.27527#bib.bib13)\) or its feature representations \(Romero et al\., [2014](https://arxiv.org/html/2606.27527#bib.bib11); Chen et al\., [2022](https://arxiv.org/html/2606.27527#bib.bib14); Zhang and Ma, [2020](https://arxiv.org/html/2606.27527#bib.bib18); Park et al\., [2019](https://arxiv.org/html/2606.27527#bib.bib19); Tung and Mori, [2019](https://arxiv.org/html/2606.27527#bib.bib20)\)\. However, this reliance on purely visual supervision is often insufficient for Fine\-Grained Visual Classification, where models must discern subtle inter\-class distinctions yet tend to overfit to spurious background correlations instead of learning robust traits\.
The development of Large Language Models \(LLMs\) \(Touvron et al\., [2023a](https://arxiv.org/html/2606.27527#bib.bib24); Grattafiori et al\., [2024](https://arxiv.org/html/2606.27527#bib.bib26); Touvron et al\., [2023b](https://arxiv.org/html/2606.27527#bib.bib25); Chowdhery et al\., [2023](https://arxiv.org/html/2606.27527#bib.bib23); Brown et al\., [2020](https://arxiv.org/html/2606.27527#bib.bib21); Yang et al\., [2024](https://arxiv.org/html/2606.27527#bib.bib27); Raffel et al\., [2020](https://arxiv.org/html/2606.27527#bib.bib22)\) has revolutionized the field\. What is particularly interesting about the knowledge encoded in a large language model is its conceptual nature: it often transcends the textual modality it is trained on\. For example, describing a “cat” in language \(“a small, furry animal with pointed ears and whiskers”\) carries the same conceptual meaning as visually identifying a cat in an image\. While the modality of the input differs \(text vs\. image\), the underlying notion of “catness” remains the same\. This observation suggests that the knowledge stored in LLMs is not tied to language, but instead reflects general, modality\-agnostic concepts, as hypothesized by Huh et al\. \([2024](https://arxiv.org/html/2606.27527#bib.bib10)\)\. It raises a compelling question:
Can such conceptual knowledge, encoded purely in text, be transferred to guide visual learning?
In this work, we introduce Language\-to\-Visual Knowledge Distillation \(LaViD\), a simple yet effective approach that distills general knowledge from text\-only large language models \(LLMs\) into visual student models\. Rather than relying on paired multimodal data or task\-specific supervision, LaViD uses the broad world knowledge encoded in LLMs to provide conceptual guidance\. It does so by eliciting structured and interpretable signals through multiple\-choice questions that probe semantic distinctions between classes\. This allows visual models to learn not just from labeled data, but also from external textual knowledge, bridging the gap between language and vision without requiring aligned inputs\.
LaViD distills conceptual knowledge from a language\-only teacher into a visual student model using a two\-stage process\. First, we prompt an LLM with dataset metadata to generate multiple\-choice questions \(MCQs\) that probe semantic differences between classes\. Each question includes a placeholder token \(<object\>\), which is replaced with each class name to instantiate class\-specific prompts\. The LLM’s pre\-softmax logits over answer options are extracted and normalized into soft label distributions, forming a semantic signature for each class\. Next, the student processes input images and predicts auxiliary logits aligned with the LLM’s question space\. Training minimizes a standard classification loss along with a mean squared error \(MSE\) loss between the student’s auxiliary predictions and the LLM\-derived targets\.
The core intuition behind LaViD is that LLMs encode structured world knowledge, enabling them to express nuanced conceptual relationships between categories\. By prompting the LLM with class\-specific multiple\-choice questions, we elicit semantic distinctions \(such as coloration, shape, or behavior\) that define how different classes relate at a conceptual level\. The resulting logits provide a structured view of inter\-class similarities and differences, which we use as supervision to guide the student model\. Unlike fixed class embeddings or similarity targets, these conceptual signatures are structured across multiple semantic dimensions induced by diverse questions, providing richer relational supervision than conventional label smoothing or representation matching\. This signal pushes the student to organize its internal representations around meaningful attributes, promoting deeper generalization beyond rote memorization of class labels\.
We evaluate LaViD across six fine\-grained classification benchmarks and find it consistently outperforms both traditional KD methods and multimodal LLM\-based baselines\. Notably, despite using a language\-only teacher with no access to image data, LaViD surpasses recent approaches such as MaKD \(Lee et al\., [2025](https://arxiv.org/html/2606.27527#bib.bib58)\) that distill from vision\-language models such as InternVL \(Chen et al\., [2024](https://arxiv.org/html/2606.27527#bib.bib66)\)\. It also achieves competitive or superior performance compared to state\-of\-the\-art KD methods like DKD \(Zhao et al\., [2022](https://arxiv.org/html/2606.27527#bib.bib13)\) and MLKD \(Jin et al\., [2023](https://arxiv.org/html/2606.27527#bib.bib62)\), and can be further combined with logit standardization \(Sun et al\., [2024](https://arxiv.org/html/2606.27527#bib.bib17)\) for additional gains\. Beyond accuracy, we demonstrate that LaViD mitigates dataset bias: on the Waterbirds dataset \(Sagawa et al\., [2019](https://arxiv.org/html/2606.27527#bib.bib57)\), it significantly improves worst\-group accuracy, indicating improved robustness to spurious correlations\. Extensive ablation studies further validate the importance of our design choices, including the use of MCQs, LLMs, and their semantic structure\.
## 2 Related Work
Knowledge Distillation\. Knowledge distillation \(KD\) generally focuses on transferring knowledge from a larger teacher to a smaller student \(Buciluǎ et al\., 2006; Hinton et al\., [2015](https://arxiv.org/html/2606.27527#bib.bib9)\)\. In the unimodal setting, this process occurs within the same modality, where early work focused on matching logits \(Hinton et al\., [2015](https://arxiv.org/html/2606.27527#bib.bib9)\) or intermediate features \(Romero et al\., [2014](https://arxiv.org/html/2606.27527#bib.bib11)\)\. Later studies extended KD across heterogeneous architectures \(Liu et al\., [2022](https://arxiv.org/html/2606.27527#bib.bib39); Zhu et al\., [2023](https://arxiv.org/html/2606.27527#bib.bib38)\) and demonstrated greater gains with large teacher–student performance gaps \(Huang et al\., [2022](https://arxiv.org/html/2606.27527#bib.bib36); Fan et al\., [2024](https://arxiv.org/html/2606.27527#bib.bib37)\)\. However, conventional KD typically requires training a dataset\-specific teacher, which adds computational overhead and risks transferring dataset biases \(Ojha et al\., [2023](https://arxiv.org/html/2606.27527#bib.bib35)\)\. Cross\-modal knowledge distillation transfers supervision across different modalities \(Xue et al\., [2021](https://arxiv.org/html/2606.27527#bib.bib46), [2022](https://arxiv.org/html/2606.27527#bib.bib40); Garcia et al\., [2018](https://arxiv.org/html/2606.27527#bib.bib49); Gupta et al\., [2016](https://arxiv.org/html/2606.27527#bib.bib48)\)\. This has supported semantic generalization in open\-vocabulary recognition, where students align with textual embeddings or leverage CLIP’s image encoder \(Radford et al\., [2021](https://arxiv.org/html/2606.27527#bib.bib30); Gu et al\., [2021](https://arxiv.org/html/2606.27527#bib.bib43); Wu et al\., [2023](https://arxiv.org/html/2606.27527#bib.bib44); Xu et al\., [2023](https://arxiv.org/html/2606.27527#bib.bib45)\)\. 
Unlike these approaches, which rely on paired inputs or multimodal encoders, LaViD distills general world knowledge from a language\-only teacher to a vision\-only student without paired data, modality alignment, or a shared embedding space\.
Fine\-Grained Classification\. Fine\-Grained Visual Classification focuses on distinguishing classes within a broader meta\-class, where inter\-class differences are often subtle and challenging\. Typically, these approaches are separated into localization \(explicit discriminative regions\) \(Ge et al\., [2019](https://arxiv.org/html/2606.27527#bib.bib7); Wang et al\., [2020b](https://arxiv.org/html/2606.27527#bib.bib6); Schmidt et al\., [2025](https://arxiv.org/html/2606.27527#bib.bib2)\) and feature\-encoding \(implicit or conceptual differences\) \(Lin et al\., [2017](https://arxiv.org/html/2606.27527#bib.bib8); Zheng et al\., [2019](https://arxiv.org/html/2606.27527#bib.bib3)\)\. While DFKD\-FGVC \(Shao et al\., [2024](https://arxiv.org/html/2606.27527#bib.bib5)\) recently proposed a KD method for fine\-grained tasks, it remains constrained to homogeneous teacher–student architectures\. Our work differentiates itself by demonstrating that a visual student can learn these nuanced distinctions directly from conceptual knowledge, without requiring aligned modalities or a homogeneous teacher\.
## 3 Methodology
Figure 1: Overview of LaViD\. Stage \#1: The LLM is prompted with class metadata and names to generate diverse multiple\-choice questions \(MCQs\) that capture high\-level semantic differences\. These are instantiated with each class to extract soft label distributions over answer options, forming a conceptual signature per class\. Stage \#2: The student processes an image through a visual backbone and auxiliary head to predict logits aligned with the LLM’s question space\. It is trained with a standard classification loss \(not shown\) and an auxiliary MSE loss against the LLM\-derived targets\.

In this section, we present Language\-to\-Visual Knowledge Distillation \(LaViD\), a framework for transferring conceptual knowledge from a text\-only large language model \(LLM\) to a purely visual student model\. Unlike multimodal approaches that rely on paired vision–language inputs, LaViD distills structured semantic supervision from language alone to guide visual representation learning\. Concretely, LaViD elicits relational conceptual knowledge from the LLM through structured semantic queries and aligns the student with the resulting multi\-dimensional class relationships\. While LaViD adopts a distillation\-style objective, it differs fundamentally from conventional knowledge distillation, which transfers instance\-level predictions from task\-trained teachers; instead, it reframes distillation as a mechanism for cross\-modal concept transfer, injecting external world knowledge into visual learners rather than mimicking teacher outputs\. Importantly, the conceptual targets in LaViD are fixed per class but structured across diverse semantic dimensions induced by the generated questions, distinguishing them from class embeddings or label smoothing schemes that capture only coarse similarity structure\. 
This structured semantic regularization encourages visual representations to align with meaningful conceptual factors, which we find contributes to improved robustness against spurious correlations in practice\.
### 3\.1 Overview
Let $\mathcal{X}$ be the dataset, where $\mathcal{X}=\{\mathbf{x}_i\}_{i=1}^{N}$ and $\mathbf{x}_i\in\mathbb{R}^{3\times H\times W}$ represents an input image of size $3\times H\times W$\. Let $\mathcal{Y}=\{y_i\}_{i=1}^{N}$ be the corresponding set of labels, where $y_i\in\{0,1\}^{k}$ is the one\-hot encoded class label for the $i$\-th sample, with $k$ being the number of classes\.
We define the total loss $L$ as the sum of the supervised loss $L_{\text{sup}}$ and the distillation loss $L_{\text{LaViD}}$:
$$L=L_{\text{sup}}+\lambda L_{\text{LaViD}}$$
The supervised loss $L_{\text{sup}}$ is the standard cross\-entropy loss for classification:
$$L_{\text{sup}}=-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{k}y_{i,c}\log p(y_i=c\mid\mathbf{x}_i)$$
where $p(y_i=c\mid\mathbf{x}_i)$ is the predicted probability of class $c$ for the $i$\-th sample\.
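As a worked illustration of the objective above, the following minimal sketch evaluates the cross-entropy term for a single sample and combines it with a distillation term. The function names, the example probabilities, the distillation value, and the weight `lam=0.5` are all illustrative assumptions; this section does not state the value of $\lambda$.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy for one sample with a one-hot label: -log p(y = label | x)."""
    return -math.log(probs[label])

def total_loss(sup_loss, lavid_loss, lam):
    """L = L_sup + lambda * L_LaViD."""
    return sup_loss + lam * lavid_loss

# Hypothetical values, chosen only for illustration.
probs = [0.7, 0.2, 0.1]          # predicted class probabilities for one image
sup = cross_entropy(probs, 0)    # the true class is index 0
loss = total_loss(sup, lavid_loss=0.04, lam=0.5)
```

With a one-hot label, the inner sum over classes in $L_{\text{sup}}$ reduces to the single term for the true class, which is why `cross_entropy` only reads `probs[label]`.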

Figure 2: Structured semantics from LLM supervision\. The heatmap shows LLM logits for two questions across three classes \(Cardinal, European Goldfinch, American Goldfinch\), where the European Goldfinch aligns with the Cardinal on head color \(left\) but with the American Goldfinch on crest absence \(right\)\. These patterns provide relational supervision beyond standard class labels\.
### 3\.2 Logit Extraction from the LLM
To construct the distillation targets used in the LaViD loss, we first prompt the LLM to generate a set of multiple\-choice questions that probe semantic distinctions between classes\. These questions are then instantiated per class and used to query the LLM for logits over answer options\. The resulting distributions serve as supervision signals to guide the student model during training\. We emphasize that MCQ generation and logit extraction are performed once per dataset \(rather than per training sample\), making the supervision cost negligible relative to student training and eliminating the need to train or run large multimodal teachers during learning\.
Multiple\-choice question generation\. Let $\mathcal{C}=\{c_1,\ldots,c_k\}$ denote the set of class names in the dataset\. Given $\mathcal{C}$ and dataset metadata, we prompt the LLM to generate a set of multiple\-choice questions \(MCQs\) aimed at distinguishing between these classes\. Each question must focus on a visually grounded concept, include a special <object\> token as a placeholder for a target class, and assign each class to exactly one answer option\. The answer choices must not include any class names or direct references, as this would not force the LLM to reason semantically\. We use a single prompt to collect a set of $Q$ such questions, each accompanied by $M$ answer options\. The full prompt is in the appendix\.
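The instantiation step described above can be sketched as simple placeholder substitution. This is a minimal illustration, not the authors' code: the `MCQ` class, the `instantiate` helper, the question wording, and the answer options are hypothetical, and the exact prompt format used in the paper is given only in its appendix.

```python
from dataclasses import dataclass

PLACEHOLDER = "<object>"

@dataclass
class MCQ:
    question: str   # must contain the <object> placeholder
    options: list   # M answer options, none of which names a class

def instantiate(mcq, class_name):
    """Fill the placeholder with one class name and label the options A, B, ..."""
    if PLACEHOLDER not in mcq.question:
        raise ValueError("question is missing the <object> placeholder")
    lines = [mcq.question.replace(PLACEHOLDER, class_name)]
    for i, option in enumerate(mcq.options):
        lines.append(f"{chr(ord('A') + i)}. {option}")
    return "\n".join(lines)

# Hypothetical question about head color, a trait the paper's toy example also uses.
mcq = MCQ("What is the dominant head color of a <object>?", ["Red", "Yellow", "Black"])
prompt = instantiate(mcq, "Cardinal")
```

Calling `instantiate` once per class and per question yields the $k \times Q$ prompts that are later sent to the language teacher.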
Per\-class logit extraction\. Once the MCQs are obtained, we instantiate each question by replacing the <object\> token with the name of a specific class $c\in\mathcal{C}$, resulting in a complete prompt\. Each prompt is formatted using a chat interface, where the user poses the question followed by a list of labeled answer options \(e\.g\., “A\. option one”, “B\. option two”, etc\.\), and the assistant is expected to reply with the correct option label \(e\.g\., “A”\)\. We extract the pre\-softmax logits for the next\-token prediction following the assistant’s response prompt, focusing on the logits assigned to the first token of each answer label \(“A”, “B”, etc\.\)\. This process is repeated for all $Q$ questions\. For each class $c$, the resulting LLM supervision takes the form of a $Q\times M$ matrix, where each row corresponds to a question and contains a softmax\-normalized probability distribution over the $M$ options\. These per\-class matrices serve as the distillation targets for training the student model\.
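The normalization step can be sketched as a row-wise softmax over the option logits. The sketch below assumes the raw logits for the answer-label tokens have already been read from the language model (that step is not shown), and the numbers are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one question's option logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_signature(option_logits):
    """Map a Q x M list of raw option logits to a Q x M matrix of row-wise probabilities."""
    return [softmax(row) for row in option_logits]

# Hypothetical logits for one class, with Q = 2 questions and M = 3 options.
raw = [[4.0, 1.0, 0.5],
       [0.2, 0.1, 3.0]]
T_c = class_signature(raw)
```

Each row of `T_c` sums to one, so the matrix can be read as one soft answer per question for the class.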
### 3\.3 Student Training with Distillation Loss
To align the student model with the conceptual supervision from the LLM, we equip the visual backbone with an auxiliary linear head that maps the final feature vector into a flattened output of dimension $QM$\. Specifically, given an image $x$, the student produces a feature vector $f(x)$, which is projected to a vector $s(x)\in\mathbb{R}^{QM}$\. We reshape this into a $Q\times M$ matrix, denoted $S(x)\in\mathbb{R}^{Q\times M}$, which represents the student’s predictions over $M$ answer options for each of the $Q$ questions\.
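A dependency-free sketch of the auxiliary head follows: a linear map from the feature vector to a length-$QM$ vector, then a reshape into $Q$ rows of $M$ entries. The weights, the feature values, and the row-major ordering of the reshape are assumptions made for illustration; in practice this would be a single linear layer in a deep learning framework.

```python
def linear(features, weight, bias):
    """Compute W f + b for a weight matrix with Q*M rows and D columns."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weight, bias)]

def reshape(flat, Q, M):
    """View a length Q*M vector as a Q x M matrix (row-major order, an assumption)."""
    if len(flat) != Q * M:
        raise ValueError("expected a vector of length Q*M")
    return [flat[q * M:(q + 1) * M] for q in range(Q)]

Q, M = 2, 3
features = [1.0, 2.0]                               # stand-in for f(x), D = 2
weight = [[float(i), 0.0] for i in range(Q * M)]    # hypothetical head weights
bias = [0.0] * (Q * M)
S = reshape(linear(features, weight, bias), Q, M)   # [[0, 1, 2], [3, 4, 5]]
```

Row `q` of `S` then holds the student's scores over the `M` options of question `q`.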
Each image is associated with a ground\-truth class label $y$, which indexes a class in $\mathcal{C}$, and the corresponding LLM supervision matrix $T_y\in\mathbb{R}^{Q\times M}$ serves as the soft target\. The distillation loss is defined as:
$$L_{\text{LaViD}}(x,y)=\frac{1}{QM}\left\|S(x)-T_y\right\|_2^2.$$
This loss guides the student to align with the class\-level conceptual knowledge encoded by the LLM, providing an auxiliary training signal alongside conventional supervision\.
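The distillation loss can be sketched in a few lines. This minimal example compares the head output with the target matrix element-wise, exactly as the formula is written; the target values and the student output are hypothetical, and whether the student output is normalized before the comparison is not stated in this section.

```python
def lavid_loss(S, T):
    """Mean squared error between student outputs S and LLM targets T (both Q x M)."""
    Q, M = len(T), len(T[0])
    squared = sum((s - t) ** 2
                  for s_row, t_row in zip(S, T)
                  for s, t in zip(s_row, t_row))
    return squared / (Q * M)

# Hypothetical per-class targets; each row is a probability distribution.
targets = {0: [[0.9, 0.1], [0.2, 0.8]],
           1: [[0.1, 0.9], [0.2, 0.8]]}
S = [[0.8, 0.2], [0.2, 0.8]]   # hypothetical student output for one image
y = 0                           # ground-truth class index
loss = lavid_loss(S, targets[y])
```

Because the target is looked up by the label `y`, every image of a class is pulled toward the same signature, which is what makes the targets class-level rather than instance-level.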
### 3\.4 Structuring Class Semantics through Language
To illustrate the intuition behind LaViD, we present a toy example with two representative multiple\-choice questions generated by GPT\-4o and three bird species from the CUB dataset: Cardinal, European Goldfinch, and American Goldfinch\. Each question targets a visually grounded trait, such as head color or presence of a crest, and another language model produces logits over answer options for each class\. These logits, shown in Figure [2](https://arxiv.org/html/2606.27527#S3.F2), reveal consistent and interpretable semantic structure\.
Notably, the European Goldfinch shares the same predicted head color as the Cardinal \(“Red”\) but disagrees on the crest question, where the Cardinal is “Prominent and Upright” while the goldfinch is “Absent\.” Conversely, the European and American Goldfinch differ on head color but align on crest absence\. These relationships are not incidental: they reflect consistent semantic distinctions captured by the LLM and transferred to the student model during distillation\.
When the student is trained on a European Goldfinch image, it is encouraged to produce auxiliary logits that agree with the Cardinal on the head color question but diverge on the crest question\. In contrast, training on the American Goldfinch leads to agreement with the European Goldfinch on crest absence but not head color\. These supervision signals introduce structured relational constraints that reflect external conceptual knowledge captured by the LLM\. LaViD encourages the student to shape its internal representations in a way that reflects semantic and visual relationships across classes\.
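The relational pattern in this toy example can be made concrete by comparing the top answer of each class per question. The probabilities below are made up to mirror the qualitative pattern described in the text; they are not the values behind the paper's figure, and the helper names are hypothetical.

```python
def top_option(row):
    """Index of the highest-probability answer option."""
    return max(range(len(row)), key=row.__getitem__)

def agreement(sig_a, sig_b):
    """Per-question flags: True where two classes pick the same top option."""
    return [top_option(a) == top_option(b) for a, b in zip(sig_a, sig_b)]

# Made-up signatures. Row 0: head color [Red, Yellow]; row 1: crest [Prominent, Absent].
signatures = {
    "Cardinal":           [[0.9, 0.1], [0.95, 0.05]],
    "European Goldfinch": [[0.8, 0.2], [0.10, 0.90]],
    "American Goldfinch": [[0.1, 0.9], [0.05, 0.95]],
}
eu_vs_cardinal = agreement(signatures["European Goldfinch"], signatures["Cardinal"])
eu_vs_american = agreement(signatures["European Goldfinch"], signatures["American Goldfinch"])
# eu_vs_cardinal -> [True, False]; eu_vs_american -> [False, True]
```

Two classes can therefore be neighbors along one question and far apart along another, which a single class embedding or a smoothed label cannot express as directly.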
These structured patterns encourage the student to embed visual classes in a space where both intra\-class consistency and inter\-class structure are preserved, aligned with the knowledge encoded in language\. Conceptually, LaViD acts as a semantic regularizer that biases visual representations toward world\-knowledge\-consistent class relationships, offering a potential explanation for its robustness benefits\.
## 4 Experiments
Table 1: Top\-1 accuracy \(%\) of competing distillation approaches across six fine\-grained classification benchmarks\. RN\-18, MNV2, and SNV2 denote ResNet\-18, MobileNetV2, and ShuffleNetV2 student models, respectively\. RN\-50 \(He et al\., [2016](https://arxiv.org/html/2606.27527#bib.bib61)\) serves as the conventional visual teacher, while InternVL \(Chen et al\., [2024](https://arxiv.org/html/2606.27527#bib.bib66)\) \(InternVL2\-8B\), LLaVA \(Liu et al\., [2023](https://arxiv.org/html/2606.27527#bib.bib33)\) \(LLaVA\-1\.5\-7B\), and Qwen \(Yang et al\., [2024](https://arxiv.org/html/2606.27527#bib.bib27)\) \(Qwen2\.5\-7B\) act as multimodal or language\-only teachers\. QN\+R50 denotes a hybrid teacher setup combining Qwen2\.5\-7B and ResNet\-50\. The best and second\-best results are marked in bold and underline, respectively\.

Figure 3: Grad\-CAM visualizations on the Waterbirds dataset\. The LaViD student focuses on the bird rather than on background artifacts\.

### 4\.1 Datasets and Implementation Details
Datasets\. We evaluate LaViD on six fine\-grained classification benchmarks: CUB\-200 \(Wah et al\., [2011](https://arxiv.org/html/2606.27527#bib.bib50)\), Caltech\-101 \(Li et al\., [2022](https://arxiv.org/html/2606.27527#bib.bib51)\), 102Flowers \(Nilsback and Zisserman, [2008](https://arxiv.org/html/2606.27527#bib.bib52)\), FGVC Aircraft \(Maji et al\., [2013](https://arxiv.org/html/2606.27527#bib.bib53)\), Oxford\-IIIT Pet \(Parkhi et al\., [2012](https://arxiv.org/html/2606.27527#bib.bib54)\), and Stanford Cars \(Krause et al\., [2013](https://arxiv.org/html/2606.27527#bib.bib55)\)\. These datasets naturally emphasize subtle visual distinctions between classes, making them well\-suited for concept\-driven supervision like ours\. We further test LaViD at larger scale by evaluating it on ImageNet \(Deng et al\., [2009](https://arxiv.org/html/2606.27527#bib.bib56)\)\. Because our method has limitations in spanning large sets of general classes, we do not run the full 1000\-way classification task; instead, we group the classes into 9 semantically coherent subsets \(e\.g\., birds, instruments\) to assess the method’s scalability\. Full details are provided in the limitations and the appendix\.
Table 2: Top\-1 accuracy \(%\) for ImageNet\- and CLIP\-pretrained ViT\-B/16 models\.

Implementation Details\. We selected three well\-studied student models for our main results: ResNet\-18 \(He et al\., [2016](https://arxiv.org/html/2606.27527#bib.bib61)\), MobileNetV2 \(Sandler et al\., [2018](https://arxiv.org/html/2606.27527#bib.bib60)\), and ShuffleNetV2 \(Ma et al\., [2018](https://arxiv.org/html/2606.27527#bib.bib59)\)\. Unless otherwise specified, all experiments use Qwen2\.5\-7B \(Yang et al\., [2024](https://arxiv.org/html/2606.27527#bib.bib27)\) as the language teacher in LaViD, with MCQs generated by GPT\-4o \(OpenAI, [2024](https://arxiv.org/html/2606.27527#bib.bib67)\)\. The full hyperparameters and training configurations are detailed in Appendix [A\.2](https://arxiv.org/html/2606.27527#A1.SS2)\. All reported results are averaged over three trials\.
Baselines\. We position our work within the broader context of knowledge distillation and compare LaViD against several representative baselines\. We include the following traditional distillation methods: KD \(Hinton et al\., [2015](https://arxiv.org/html/2606.27527#bib.bib9)\), RKD \(Park et al\., [2019](https://arxiv.org/html/2606.27527#bib.bib19)\), DKD \(Zhao et al\., [2022](https://arxiv.org/html/2606.27527#bib.bib13)\), MLKD \(Jin et al\., [2023](https://arxiv.org/html/2606.27527#bib.bib62)\), and Logit Standardization \(LS\) \(Sun et al\., [2024](https://arxiv.org/html/2606.27527#bib.bib17)\)\. All these baselines require a dataset\-specific teacher model trained on the same data as the student, making them effective in\-domain\. In addition, we compare with MaKD \(Lee et al\., [2025](https://arxiv.org/html/2606.27527#bib.bib58)\), a recent method that distills from multimodal large language models \(MLLMs\) by prompting with individual images\. To further examine the effectiveness of MLLM\-based supervision, we also adapt two feature\-based distillation methods, CRD \(Tian et al\., [2019](https://arxiv.org/html/2606.27527#bib.bib12)\) and FitNet \(Romero et al\., [2014](https://arxiv.org/html/2606.27527#bib.bib11)\), using LLaVA\-1\.5 \(Liu et al\., [2023](https://arxiv.org/html/2606.27527#bib.bib33)\) as the teacher\. This establishes a more comprehensive multimodal feature\-based baseline where student models are guided by MLLM\-derived representations\. Since MLLMs operate over token sequences, we extract features from multiple layers and find that middle layers \(e\.g\., layer \-12\) tend to provide stronger supervision; full ablation results are provided in the Appendix\.
Table 3:Top\-1 accuracy \(%\) with student ResNet\-18 on ImageNet WordNet hierarchy synsets\. CONT, INST, INV, VERT denote Container, Instrumentality, Invertebrate, and Vertebrate, respectively\.
### 4\.2 Main Results
Table [1](https://arxiv.org/html/2606.27527#S4.T1) compares LaViD with both traditional KD methods and recent approaches leveraging MLLMs as teachers\. Notably, LaViD consistently outperforms MLLM\-based baselines, including MaKD \(Lee et al\., [2025](https://arxiv.org/html/2606.27527#bib.bib58)\) and adaptations of FitNet \(Romero et al\., [2014](https://arxiv.org/html/2606.27527#bib.bib11)\) and CRD \(Tian et al\., [2019](https://arxiv.org/html/2606.27527#bib.bib12)\) with LLaVA \(Liu et al\., [2023](https://arxiv.org/html/2606.27527#bib.bib33)\) as the teacher, even though our own language teacher \(Qwen2\.5\-7B\) never accesses training images\. This demonstrates the effectiveness of conceptual supervision even in the absence of aligned multimodal data\.
Figure 4: Qualitative examples of high\- and low\-entropy questions on the Flowers and CUB datasets\.

Our method also achieves competitive or superior performance compared to traditional visual teacher KD methods\. For example, LaViD surpasses DKD \(Zhao et al\., [2022](https://arxiv.org/html/2606.27527#bib.bib13)\) and MLKD \(Jin et al\., [2023](https://arxiv.org/html/2606.27527#bib.bib62)\) on several datasets, including CUB, Aircraft, and Pets\. Furthermore, we find that our approach can be effectively combined with logit standardization \(LS\) \(Sun et al\., [2024](https://arxiv.org/html/2606.27527#bib.bib17)\), leading to additional gains across multiple datasets and student architectures\.
We further evaluate LaViD on subsets of ImageNet constructed using WordNet hierarchy synsets\. As shown in Table [3](https://arxiv.org/html/2606.27527#S4.T3), LaViD consistently outperforms the baseline across all 9 semantic groups\. These results reinforce the effectiveness of language\-derived supervision even at larger scale, without requiring access to multimodal data\.
These results highlight that distillation from language\-only teachers not only provides strong standalone performance, but also complements existing visual KD techniques, establishing LaViD as a simple, modular, and broadly effective distillation paradigm\.
To assess LaViD beyond convolutional backbones, we evaluate its effectiveness on vision transformers, including ViT \(Dosovitskiy et al\., [2020](https://arxiv.org/html/2606.27527#bib.bib65)\) and CLIP \(Radford et al\., [2021](https://arxiv.org/html/2606.27527#bib.bib30)\), as student models in two variations: standard transformers initialized with ImageNet\-pretrained weights \(Deng et al\., [2009](https://arxiv.org/html/2606.27527#bib.bib56)\) to mitigate their data hunger, and CLIP models where we follow Wortsman et al\. \([2022](https://arxiv.org/html/2606.27527#bib.bib68)\) by initializing the classifier with the text embedding of “a photo of a class\.” As shown in Table [2](https://arxiv.org/html/2606.27527#S4.T2), LaViD consistently improves over the baseline, demonstrating its applicability and generalizability to transformer\-based models\.
Table 4: Top\-1 accuracy \(%\) of different distillation approaches evaluated on the Waterbirds dataset, with groups formed from the combinations of \{waterbird, landbird\} and \{water background, land background\}\. The best and second\-best results are marked in bold and underline, respectively\. Average, Best, and Worst denote the average, best\-group, and worst\-group accuracy, respectively\.
### 4\.3 Overcoming Dataset Biases
Prior work shows that student models can inherit biases from their teachers during knowledge distillation \(Ojha et al\., [2023](https://arxiv.org/html/2606.27527#bib.bib35)\)\. In contrast, LaViD leverages general\-purpose language models that are not trained on the visual data, providing supervision grounded in broad conceptual knowledge rather than dataset\-specific patterns\. This offers a unique opportunity to regularize the student with semantic guidance instead of spurious heuristics\.
We validate this on Waterbirds \(Sagawa et al\., [2019](https://arxiv.org/html/2606.27527#bib.bib57)\), where spurious correlations between species and background make worst\-group performance particularly challenging\. As shown in Table [4](https://arxiv.org/html/2606.27527#S4.T4), LaViD consistently achieves higher worst\-group accuracy across all student architectures compared to independently trained models\. The worst\-performing group reflects biased models’ tendency to rely on background cues rather than relevant features, and traditional KD methods often exacerbate this issue by reinforcing shortcuts\. LaViD, however, mitigates these effects without compromising overall performance\.
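For readers unfamiliar with the metric, worst-group accuracy can be sketched as the minimum accuracy over the four (species, background) groups. The helper names and the prediction records below are hypothetical and exist only to show the computation; they are not results from the paper.

```python
from collections import defaultdict

def group_accuracies(records):
    """records: iterable of (label, background, prediction) tuples."""
    correct, total = defaultdict(int), defaultdict(int)
    for label, background, pred in records:
        group = (label, background)
        total[group] += 1
        correct[group] += int(pred == label)
    return {g: correct[g] / total[g] for g in total}

def worst_group_accuracy(records):
    """Lowest per-group accuracy, the robustness metric discussed above."""
    return min(group_accuracies(records).values())

# Hypothetical predictions, for illustration only.
records = [
    ("waterbird", "water", "waterbird"),
    ("waterbird", "water", "waterbird"),
    ("waterbird", "land", "landbird"),   # a background-driven mistake
    ("waterbird", "land", "waterbird"),
    ("landbird", "land", "landbird"),
    ("landbird", "water", "landbird"),
]
accs = group_accuracies(records)
worst = worst_group_accuracy(records)
```

A model that leans on the background tends to score well on the majority groups and poorly on the mismatched ones, so the minimum exposes a failure that the average can hide.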
We provide further analysis with Grad\-CAM in Figure [3](https://arxiv.org/html/2606.27527#S4.F3), showing that LaViD encourages student models to focus on the bird rather than on spurious background elements\.
### 4\.4 Analysis
Unless otherwise specified, we conduct ablation studies using ResNet\-18 on the CUB dataset\. This configuration is representative yet computationally inexpensive, which lets us analyze the core design choices in LaViD\.
LLM vs\. Word Embedding\. Since LaViD’s MCQ supervision produces a logit vector that encodes inter\-class relationships, we compare it with a word\-embedding baseline that captures similar structure\. In this variant, we directly use the pretrained word embedding of each class name from MiniLM \(Wang et al\., [2020a](https://arxiv.org/html/2606.27527#bib.bib29)\) or BERT \(Devlin et al\., [2019](https://arxiv.org/html/2606.27527#bib.bib31)\) as the supervision signal\. As shown in Table [5](https://arxiv.org/html/2606.27527#S4.T5), LaViD consistently outperforms these word\-embedding baselines, demonstrating that LLM\-derived MCQs provide richer and more informative supervision than static embeddings\.

Figure 5: Ablation on different LLM teachers with a ResNet\-18 student on the CUB dataset\.
Table 5: Comparison between LaViD and variants using static word embeddings \(MiniLM, BERT\) on the CUB dataset\.
Choice of LLM Teachers\. In Figure [5](https://arxiv.org/html/2606.27527#S4.F5), we compare various LLM teachers within the LaViD framework, including Qwen2\.5 \(0\.5B, 7B, 70B\), Gemma\-3 \(12B, 27B\) \(Team et al\., [2025](https://arxiv.org/html/2606.27527#bib.bib64)\), Mistral 0\.3\-7B \(Jiang et al\., [2023](https://arxiv.org/html/2606.27527#bib.bib69)\), and LLaMA\-3 \(3B, 8B, 70B\) \(Grattafiori et al\., [2024](https://arxiv.org/html/2606.27527#bib.bib26)\)\. Performance generally improves with model size, reflecting stronger semantic understanding\. Qwen and Gemma consistently outperform LLaMA; on inspecting the logits, we find that LLaMA’s are notably softer, which may limit its effectiveness as a teacher\. We adopt Qwen2\.5\-7B for experiments as a tradeoff between performance and efficiency\.
Choice of MCQ Generators\. We further examine whether LaViD is sensitive to the LLM used for MCQ generation\. Specifically, we replace GPT\-4o \(OpenAI, [2024](https://arxiv.org/html/2606.27527#bib.bib67)\) with Gemini 2\.5 Pro \(Comanici et al\., [2025](https://arxiv.org/html/2606.27527#bib.bib70)\) for generating MCQs, while keeping Qwen2\.5\-7B fixed as the LLM teacher for extracting logits\. As shown in Table [6](https://arxiv.org/html/2606.27527#S4.T6), performance remains similar across the two MCQ generators, with only a small change on CUB and nearly identical performance on Caltech\. This suggests that LaViD is not highly sensitive to the specific frontier LLM used to generate MCQs in our setting\. This stability is consistent with our use of a constrained MCQ generation protocol, where the LLM is given the target class set and asked to generate questions that distinguish between these classes, rather than to freely propose labels or concepts\.
Table 6: Effect of changing the LLM used for MCQ generation while keeping the LLM teacher consistent\.
Figure 6: Effect of the number of MCQs \(with 5 answer options\) on accuracy of ResNet\-18 student on CUB\. Accuracy improves with more questions, but plateaus beyond 50\.
Figure 7: Effect of the number of answer options \(with 50 questions\) on CUB\. Performance stabilizes after 5 options, highlighting question quality as a key factor\.
Number of Questions and Answer Options\. We investigate how the number of MCQs and their answer options affect distillation quality\. Figure [6](https://arxiv.org/html/2606.27527#S4.F6) shows that increasing the number of questions improves student accuracy, but accuracy plateaus after 50 questions\. This is likely because later questions degrade in semantic quality, as visually relevant distinctions become saturated\. A similar trend is observed in Figure [7](https://arxiv.org/html/2606.27527#S4.F7) for the number of answer options, where performance levels off beyond five\. The effect is less pronounced, suggesting that question quality is more critical than granularity of choices\. Based on these trends, we use 50 questions and 5 answer options in all experiments unless otherwise specified\.
Question Quality Analysis\. We measure discriminability via the average prediction entropy $\mathbb{E}_{c\in\mathcal{C}}[H(p_{q}^{(c)})]$\. High entropy signals distinctive class semantics, while low entropy implies invariant attributes; examples of both are shown in Fig\. [4](https://arxiv.org/html/2606.27527#S4.F4)\. These low\-entropy questions are rare \(fewer than 5% of all questions\), and removing them has negligible impact on accuracy, demonstrating that LaViD remains robust even in the presence of less informative supervision\.
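The entropy score above can be illustrated with a short sketch\. It reads $p_{q}^{(c)}$ as the teacher's distribution over the answer options of question $q$ for class $c$, which is an interpretation of the notation since the formal definition appears earlier in the paper\. The helper names, the toy distributions, and the threshold value are all hypothetical\.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_question_entropy(per_class_dists):
    """Average entropy of one question's answer distributions over classes."""
    return sum(entropy(p) for p in per_class_dists) / len(per_class_dists)


def low_entropy_questions(question_dists, threshold):
    """Indices of questions whose mean entropy falls below `threshold`.

    `threshold` is an illustrative cutoff, not a value from the paper.
    """
    return [
        q for q, dists in enumerate(question_dists)
        if mean_question_entropy(dists) < threshold
    ]


# Toy setup: 2 questions, 3 classes, 5 answer options per question.
uniform = [0.2] * 5
peaked = [0.96, 0.01, 0.01, 0.01, 0.01]
question_dists = [
    [uniform, uniform, uniform],  # question 0: high average entropy
    [peaked, peaked, peaked],     # question 1: low average entropy
]

scores = [mean_question_entropy(d) for d in question_dists]
flagged = low_entropy_questions(question_dists, threshold=0.5)
```

With 5 answer options the score is bounded above by $\log 5$, so a cutoff would have to be chosen relative to that maximum\.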
Additional Ablation Studies\. Further ablation studies, including the effect of the LaViD loss weight and the number of questions and answer options, are provided in the Appendix\.
Overall, although LaViD relies on language\-model\-generated questions to elicit conceptual structure, our ablations demonstrate that performance degrades gracefully under reduced question diversity and remains stable across different LLM backbones\. This suggests that the method is not overly sensitive to individual prompt formulations but instead benefits from the aggregate semantic structure captured across multiple queries\.
## Limitations
While LaViD demonstrates strong performance across diverse fine\-grained classification tasks, it has some limitations\. The method relies on language\-model\-generated multiple\-choice questions for conceptual supervision\. The effectiveness of the supervision depends on the semantic coverage of the generated questions; however, our ablations show that performance degrades gracefully under reduced question diversity, suggesting that LaViD is not overly sensitive to individual prompt quality\. In domains where distinctions are difficult to verbalize or LLMs lack domain familiarity, the conceptual signal may be less complete\. Moreover, LaViD assumes access to interpretable class names or metadata; in domains where class labels are abstract, underspecified, or not semantically meaningful, the approach may be less effective\.
While our primary evaluation focuses on fine\-grained recognition, which benefits most from external conceptual supervision, the proposed framework is agnostic to dataset size and model architecture\. The main practical challenge in large\-scale regimes lies in generating sufficiently diverse semantic queries to cover heterogeneous class spaces\. Our subset experiments on ImageNet suggest that the conceptual supervision remains beneficial as class diversity increases, and scaling question generation strategies is a promising direction for future work\. Notably, because supervision is generated once per dataset and reused throughout training, scaling to larger label spaces does not incur additional per\-sample computational overhead\.
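The reuse of precomputed supervision can be sketched as follows\. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the class signatures are stored once in a lookup table keyed by class index, that the student has an auxiliary head producing one logit vector per question, and that the auxiliary term is a KL divergence averaged over questions and weighted by $\lambda$\. The names `signatures` and `lavid_style_loss` are hypothetical\.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q) if pi > 0.0
    )


def lavid_style_loss(class_logits, mcq_logits, label, signatures, lam):
    """Cross-entropy plus a weighted auxiliary distillation term.

    signatures[label] is a list of per-question target distributions that
    was computed once for the dataset; no teacher call happens here.
    """
    ce = -math.log(softmax(class_logits)[label])
    targets = signatures[label]  # constant-time lookup per sample
    aux = sum(
        kl_divergence(t, softmax(s)) for t, s in zip(targets, mcq_logits)
    ) / len(targets)
    return ce + lam * aux


# Toy example (assumed values): 2 classes, 1 question with 2 options.
signatures = {0: [[0.5, 0.5]], 1: [[0.9, 0.1]]}
loss_match = lavid_style_loss([0.0, 0.0], [[0.0, 0.0]], 0, signatures, lam=10.0)
loss_mismatch = lavid_style_loss([0.0, 0.0], [[2.0, 0.0]], 0, signatures, lam=10.0)
```

Because the table is indexed by the ground\-truth label, the per\-sample cost of the auxiliary term in this sketch depends on the number of questions and options but not on any call to the teacher\.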
## 5 Conclusion
In this work, we present LaViD, a new paradigm for cross\-modal knowledge distillation that transfers world knowledge from language\-only large language models \(LLMs\) to vision\-only student models\. Our approach leverages multiple\-choice questions generated from dataset metadata to extract structured semantic supervision without requiring paired image\-text data or multimodal pretraining\. Across six fine\-grained benchmarks, we show that LaViD, despite relying on a language\-only teacher, consistently outperforms both traditional visual KD methods and recent multimodal approaches\. Moreover, LaViD shows strong robustness to spurious correlations and dataset biases, suggesting that external conceptual knowledge can steer student models toward more meaningful representations\. Ablation studies further validate the importance of each component\. Altogether, our findings establish LaViD as a simple, effective, and general framework for infusing visual learners with high\-level semantic understanding from language\.
## Impact Statement
This work explores language\-driven supervision for visual models, using language models to provide concept\-level guidance through structured prompts\. Rather than relying on language\-vision pretraining or multimodal architectures, our method studies whether general\-purpose knowledge encoded in LLMs can be distilled into visual learners as a complementary form of supervision\.
Because LLMs are trained on broad and heterogeneous corpora, their outputs may reflect cultural assumptions, normative framing, or outdated knowledge\. Even when supervision is provided through structured prompts rather than open\-ended generation, these patterns may influence the resulting visual model\. This work therefore highlights both the potential of large models as indirect teachers and the need to better understand the responsibilities and risks involved in transferring knowledge across modalities\.
## Acknowledgment
This work was supported in part by NSF IIS2404180, and by Institute of Information & communications Technology Planning & Evaluation \(IITP\) grants funded by the Korea government \(MSIT\) \(No\. 2022\-0\-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration\) and \(No\. RS\-2025\-2543949, Environment\-Aware and Domain\-Adaptive Multimodal Embodied AI for Real\-World Interaction\)\.
## References
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell,et al\.\(2020\)Language models are few\-shot learners\.Advances in neural information processing systems33,pp\. 1877–1901\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p2.1)\.
- D\. Chen, J\. Mei, H\. Zhang, C\. Wang, Y\. Feng, and C\. Chen \(2022\)Knowledge distillation with the reused teacher classifier\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 11933–11942\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p1.1)\.
- Z\. Chen, J\. Wu, W\. Wang, W\. Su, G\. Chen, S\. Xing, M\. Zhong, Q\. Zhang, X\. Zhu, L\. Lu,et al\.\(2024\)Internvl: scaling up vision foundation models and aligning for generic visual\-linguistic tasks\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 24185–24198\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p7.1),[Table 1](https://arxiv.org/html/2606.27527#S4.T1),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.6.2)\.
- A\. Chowdhery, S\. Narang, J\. Devlin, M\. Bosma, G\. Mishra, A\. Roberts, P\. Barham, H\. W\. Chung, C\. Sutton, S\. Gehrmann,et al\.\(2023\)Palm: scaling language modeling with pathways\.Journal of Machine Learning Research24\(240\),pp\. 1–113\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p2.1)\.
- G\. Comanici, E\. Bieber, M\. Schaekermann, I\. Pasupat, N\. Sachdeva, I\. Dhillon, M\. Blistein, O\. Ram, D\. Zhang, E\. Rosen,et al\.\(2025\)Gemini 2\.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\.arXiv preprint arXiv:2507\.06261\.Cited by:[§4\.4](https://arxiv.org/html/2606.27527#S4.SS4.p4.1)\.
- J\. Deng, W\. Dong, R\. Socher, L\. Li, K\. Li, and L\. Fei\-Fei \(2009\)Imagenet: a large\-scale hierarchical image database\.In2009 IEEE conference on computer vision and pattern recognition,pp\. 248–255\.Cited by:[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p1.1),[§4\.2](https://arxiv.org/html/2606.27527#S4.SS2.p5.1)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)Bert: pre\-training of deep bidirectional transformers for language understanding\.InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 \(long and short papers\),pp\. 4171–4186\.Cited by:[Figure 5](https://arxiv.org/html/2606.27527#S4.F5.4.1.3.3.1),[Figure 5](https://arxiv.org/html/2606.27527#S4.F5.4.1.6.6.1),[Figure 5](https://arxiv.org/html/2606.27527#S4.F5.4.1.9.9.1),[§4\.4](https://arxiv.org/html/2606.27527#S4.SS4.p2.1)\.
- A\. Dosovitskiy, L\. Beyer, A\. Kolesnikov, D\. Weissenborn, X\. Zhai, T\. Unterthiner, M\. Dehghani, M\. Minderer, G\. Heigold, S\. Gelly,et al\.\(2020\)An image is worth 16x16 words: transformers for image recognition at scale\.arXiv preprint arXiv:2010\.11929\.Cited by:[§4\.2](https://arxiv.org/html/2606.27527#S4.SS2.p5.1)\.
- J\. Fan, C\. Li, X\. Liu, and A\. Yao \(2024\)ScaleKD: strong vision transformers could be excellent teachers\.arXiv preprint arXiv:2411\.06786\.Cited by:[§2](https://arxiv.org/html/2606.27527#S2.p1.1)\.
- N\. C\. Garcia, P\. Morerio, and V\. Murino \(2018\)Modality distillation with multiple stream networks for action recognition\.InProceedings of the European Conference on Computer Vision \(ECCV\),pp\. 103–118\.Cited by:[§2](https://arxiv.org/html/2606.27527#S2.p1.1)\.
- W\. Ge, X\. Lin, and Y\. Yu \(2019\)Weakly supervised complementary parts models for fine\-grained image classification from the bottom up\.External Links:1903\.02827,[Link](https://arxiv.org/abs/1903.02827)Cited by:[§2](https://arxiv.org/html/2606.27527#S2.p2.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p2.1),[§4\.4](https://arxiv.org/html/2606.27527#S4.SS4.p3.1)\.
- X\. Gu, T\. Lin, W\. Kuo, and Y\. Cui \(2021\)Open\-vocabulary object detection via vision and language knowledge distillation\.arXiv preprint arXiv:2104\.13921\.Cited by:[§2](https://arxiv.org/html/2606.27527#S2.p1.1)\.
- S\. Gupta, J\. Hoffman, and J\. Malik \(2016\)Cross modal distillation for supervision transfer\.InProceedings of the IEEE conference on computer vision and pattern recognition,pp\. 2827–2836\.Cited by:[§2](https://arxiv.org/html/2606.27527#S2.p1.1)\.
- Z\. Hao, J\. Guo, K\. Han, Y\. Tang, H\. Hu, Y\. Wang, and C\. Xu \(2023\)One\-for\-all: bridge the gap between heterogeneous architectures in knowledge distillation\.Advances in Neural Information Processing Systems36,pp\. 79570–79582\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p1.1)\.
- K\. He, X\. Zhang, S\. Ren, and J\. Sun \(2016\)Deep residual learning for image recognition\.InProceedings of the IEEE conference on computer vision and pattern recognition,pp\. 770–778\.Cited by:[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p2.1),[Table 1](https://arxiv.org/html/2606.27527#S4.T1),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.6.2)\.
- G\. Hinton, O\. Vinyals, and J\. Dean \(2015\)Distilling the knowledge in a neural network\.arXiv preprint arXiv:1503\.02531\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p1.1),[§2](https://arxiv.org/html/2606.27527#S2.p1.1),[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p3.1),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.14.14.2),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.25.25.2),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.3.3.2),[Table 4](https://arxiv.org/html/2606.27527#S4.T4.2.15.15.2),[Table 4](https://arxiv.org/html/2606.27527#S4.T4.2.3.3.2),[Table 4](https://arxiv.org/html/2606.27527#S4.T4.2.9.9.2)\.
- T\. Huang, S\. You, F\. Wang, C\. Qian, and C\. Xu \(2022\)Knowledge distillation from a stronger teacher\.Advances in Neural Information Processing Systems35,pp\. 33716–33727\.Cited by:[§2](https://arxiv.org/html/2606.27527#S2.p1.1)\.
- M\. Huh, B\. Cheung, T\. Wang, and P\. Isola \(2024\)The platonic representation hypothesis\.ICML\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p2.1)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier, L\. R\. Lavaud, M\. Lachaux, P\. Stock, T\. L\. Scao, T\. Lavril, T\. Wang, T\. Lacroix, and W\. E\. Sayed \(2023\)Mistral 7b\.External Links:2310\.06825,[Link](https://arxiv.org/abs/2310.06825)Cited by:[§4\.4](https://arxiv.org/html/2606.27527#S4.SS4.p3.1)\.
- Y\. Jin, J\. Wang, and D\. Lin \(2023\)Multi\-level logit distillation\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 24276–24285\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p7.1),[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p3.1),[§4\.2](https://arxiv.org/html/2606.27527#S4.SS2.p2.1),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.17.17.2),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.28.28.2),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.6.6.2),[Table 4](https://arxiv.org/html/2606.27527#S4.T4.2.12.12.2),[Table 4](https://arxiv.org/html/2606.27527#S4.T4.2.18.18.2),[Table 4](https://arxiv.org/html/2606.27527#S4.T4.2.6.6.2)\.
- J\. Krause, M\. Stark, J\. Deng, and L\. Fei\-Fei \(2013\)3d object representations for fine\-grained categorization\.InProceedings of the IEEE international conference on computer vision workshops,pp\. 554–561\.Cited by:[§A\.1](https://arxiv.org/html/2606.27527#A1.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p1.1)\.
- T\. Lee, J\. Bang, S\. Kwon, and T\. Kim \(2025\)Multi\-aspect knowledge distillation with large language model\.arXiv preprint arXiv:2501\.13341\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p7.1),[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p3.1),[§4\.2](https://arxiv.org/html/2606.27527#S4.SS2.p1.1),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.20.20.2),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.31.31.2),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.9.9.2)\.
- F\. Li, M\. Andreeto, M\. Ranzato, and P\. Perona \(2022\)Caltech 101\.CaltechDATA\.External Links:[Document](https://dx.doi.org/10.22002/D1.20086)Cited by:[§A\.1](https://arxiv.org/html/2606.27527#A1.SS1.SSS0.Px6.p1.1),[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p1.1)\.
- T\. Lin, A\. RoyChowdhury, and S\. Maji \(2017\)Bilinear cnns for fine\-grained visual recognition\.External Links:1504\.07889,[Link](https://arxiv.org/abs/1504.07889)Cited by:[§2](https://arxiv.org/html/2606.27527#S2.p2.1)\.
- H\. Liu, C\. Li, Q\. Wu, and Y\. J\. Lee \(2023\)Visual instruction tuning\.Advances in neural information processing systems36,pp\. 34892–34916\.Cited by:[Appendix B](https://arxiv.org/html/2606.27527#A2.p1.1),[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p3.1),[§4\.2](https://arxiv.org/html/2606.27527#S4.SS2.p1.1),[Table 1](https://arxiv.org/html/2606.27527#S4.T1),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.6.2)\.
- Y\. Liu, J\. Cao, B\. Li, W\. Hu, J\. Ding, and L\. Li \(2022\)Cross\-architecture knowledge distillation\.InProceedings of the Asian conference on computer vision,pp\. 3396–3411\.Cited by:[§2](https://arxiv.org/html/2606.27527#S2.p1.1)\.
- N\. Ma, X\. Zhang, H\. Zheng, and J\. Sun \(2018\)Shufflenet v2: practical guidelines for efficient cnn architecture design\.InProceedings of the European conference on computer vision \(ECCV\),pp\. 116–131\.Cited by:[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p2.1)\.
- S\. Maji, E\. Rahtu, J\. Kannala, M\. Blaschko, and A\. Vedaldi \(2013\)Fine\-grained visual classification of aircraft\.arXiv preprint arXiv:1306\.5151\.Cited by:[§A\.1](https://arxiv.org/html/2606.27527#A1.SS1.SSS0.Px5.p1.1),[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p1.1)\.
- M\. Nilsback and A\. Zisserman \(2008\)Automated flower classification over a large number of classes\.InIndian Conference on Computer Vision, Graphics and Image Processing,Cited by:[§A\.1](https://arxiv.org/html/2606.27527#A1.SS1.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p1.1)\.
- U\. Ojha, Y\. Li, A\. Sundara Rajan, Y\. Liang, and Y\. J\. Lee \(2023\)What knowledge gets distilled in knowledge distillation?\.Advances in Neural Information Processing Systems36,pp\. 11037–11048\.Cited by:[§2](https://arxiv.org/html/2606.27527#S2.p1.1),[§4\.3](https://arxiv.org/html/2606.27527#S4.SS3.p1.1)\.
- OpenAI \(2024\)GPT\-4o system card\.Note:[https://arxiv\.org/abs/2410\.21276](https://arxiv.org/abs/2410.21276)Accessed: 2025\-05\-16Cited by:[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p2.1),[§4\.4](https://arxiv.org/html/2606.27527#S4.SS4.p4.1)\.
- W\. Park, D\. Kim, Y\. Lu, and M\. Cho \(2019\)Relational knowledge distillation\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 3967–3976\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p1.1),[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p3.1),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.15.15.2),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.26.26.2),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.4.4.2),[Table 4](https://arxiv.org/html/2606.27527#S4.T4.2.10.10.2),[Table 4](https://arxiv.org/html/2606.27527#S4.T4.2.16.16.2),[Table 4](https://arxiv.org/html/2606.27527#S4.T4.2.4.4.2)\.
- O\. M\. Parkhi, A\. Vedaldi, A\. Zisserman, and C\. Jawahar \(2012\)Cats and dogs\.In2012 IEEE conference on computer vision and pattern recognition,pp\. 3498–3505\.Cited by:[§A\.1](https://arxiv.org/html/2606.27527#A1.SS1.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p1.1)\.
- A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark,et al\.\(2021\)Learning transferable visual models from natural language supervision\.InInternational conference on machine learning,pp\. 8748–8763\.Cited by:[§2](https://arxiv.org/html/2606.27527#S2.p1.1),[§4\.2](https://arxiv.org/html/2606.27527#S4.SS2.p5.1)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.Journal of machine learning research21\(140\),pp\. 1–67\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p2.1)\.
- A\. Romero, N\. Ballas, S\. E\. Kahou, A\. Chassang, C\. Gatta, and Y\. Bengio \(2014\)Fitnets: hints for thin deep nets\.arXiv preprint arXiv:1412\.6550\.Cited by:[Appendix B](https://arxiv.org/html/2606.27527#A2.p1.1),[§1](https://arxiv.org/html/2606.27527#S1.p1.1),[§2](https://arxiv.org/html/2606.27527#S2.p1.1),[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p3.1),[§4\.2](https://arxiv.org/html/2606.27527#S4.SS2.p1.1),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.10.10.2),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.21.21.2),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.32.32.2)\.
- S\. Sagawa, P\. W\. Koh, T\. B\. Hashimoto, and P\. Liang \(2019\)Distributionally robust neural networks for group shifts: on the importance of regularization for worst\-case generalization\.arXiv preprint arXiv:1911\.08731\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p7.1),[§4\.3](https://arxiv.org/html/2606.27527#S4.SS3.p2.1)\.
- M\. Sandler, A\. Howard, M\. Zhu, A\. Zhmoginov, and L\. Chen \(2018\)Mobilenetv2: inverted residuals and linear bottlenecks\.InProceedings of the IEEE conference on computer vision and pattern recognition,pp\. 4510–4520\.Cited by:[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p2.1)\.
- J\. Schmidt, S\. Stober, J\. Denzler, and P\. Bodesheim \(2025\)Saccadic vision for fine\-grained visual classification\.External Links:2509\.15688,[Link](https://arxiv.org/abs/2509.15688)Cited by:[§2](https://arxiv.org/html/2606.27527#S2.p2.1)\.
- R\. Shao, W\. Zhang, J\. Yin, and J\. Wang \(2024\)Data\-free knowledge distillation for fine\-grained visual categorization\.External Links:2404\.12037,[Link](https://arxiv.org/abs/2404.12037)Cited by:[§2](https://arxiv.org/html/2606.27527#S2.p2.1)\.
- A\. Steiner, A\. Kolesnikov, X\. Zhai, R\. Wightman, J\. Uszkoreit, and L\. Beyer \(2021\)How to train your vit? data, augmentation, and regularization in vision transformers\.arXiv preprint arXiv:2106\.10270\.Cited by:[§A\.2](https://arxiv.org/html/2606.27527#A1.SS2.p1.1)\.
- S\. Sun, W\. Ren, J\. Li, R\. Wang, and X\. Cao \(2024\)Logit standardization in knowledge distillation\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 15731–15740\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p1.1),[§1](https://arxiv.org/html/2606.27527#S1.p7.1),[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p3.1),[§4\.2](https://arxiv.org/html/2606.27527#S4.SS2.p2.1),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.18.18.2),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.29.29.2),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.7.7.2)\.
- G\. Team, A\. Kamath, J\. Ferret, S\. Pathak, N\. Vieillard, R\. Merhej, S\. Perrin, T\. Matejovicova, A\. Ramé, M\. Rivière,et al\.\(2025\)Gemma 3 technical report\.arXiv preprint arXiv:2503\.19786\.Cited by:[§4\.4](https://arxiv.org/html/2606.27527#S4.SS4.p3.1)\.
- Y\. Tian, D\. Krishnan, and P\. Isola \(2019\)Contrastive representation distillation\.arXiv preprint arXiv:1910\.10699\.Cited by:[Appendix B](https://arxiv.org/html/2606.27527#A2.p1.1),[§1](https://arxiv.org/html/2606.27527#S1.p1.1),[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p3.1),[§4\.2](https://arxiv.org/html/2606.27527#S4.SS2.p1.1),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.11.11.2),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.22.22.2),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.33.33.2)\.
- H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar,et al\.\(2023a\)Llama: open and efficient foundation language models\.arXiv preprint arXiv:2302\.13971\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p2.1)\.
- H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale,et al\.\(2023b\)Llama 2: open foundation and fine\-tuned chat models\.arXiv preprint arXiv:2307\.09288\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p2.1)\.
- F\. Tung and G\. Mori \(2019\)Similarity\-preserving knowledge distillation\.InProceedings of the IEEE/CVF international conference on computer vision,pp\. 1365–1374\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p1.1)\.
- C\. Wah, S\. Branson, P\. Welinder, P\. Perona, and S\. Belongie \(2011\)The caltech\-ucsd birds\-200\-2011 dataset\.Cited by:[§A\.1](https://arxiv.org/html/2606.27527#A1.SS1.SSS0.Px4.p1.1),[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p1.1)\.
- W\. Wang, F\. Wei, L\. Dong, H\. Bao, N\. Yang, and M\. Zhou \(2020a\)Minilm: deep self\-attention distillation for task\-agnostic compression of pre\-trained transformers\.Advances in neural information processing systems33,pp\. 5776–5788\.Cited by:[Figure 5](https://arxiv.org/html/2606.27527#S4.F5.4.1.2.2.2),[Figure 5](https://arxiv.org/html/2606.27527#S4.F5.4.1.5.5.2),[Figure 5](https://arxiv.org/html/2606.27527#S4.F5.4.1.8.8.2),[§4\.4](https://arxiv.org/html/2606.27527#S4.SS4.p2.1)\.
- Z\. Wang, S\. Wang, H\. Li, Z\. Dou, and J\. Li \(2020b\)Graph\-propagation based correlation learning for weakly supervised fine\-grained image classification\.Proceedings of the AAAI Conference on Artificial Intelligence34\(07\),pp\. 12289–12296\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/6912),[Document](https://dx.doi.org/10.1609/aaai.v34i07.6912)Cited by:[§2](https://arxiv.org/html/2606.27527#S2.p2.1)\.
- M\. Wortsman, G\. Ilharco, J\. W\. Kim, M\. Li, S\. Kornblith, R\. Roelofs, R\. G\. Lopes, H\. Hajishirzi, A\. Farhadi, H\. Namkoong,et al\.\(2022\)Robust fine\-tuning of zero\-shot models\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 7959–7971\.Cited by:[§4\.2](https://arxiv.org/html/2606.27527#S4.SS2.p5.1)\.
- S\. Wu, W\. Zhang, S\. Jin, W\. Liu, and C\. C\. Loy \(2023\)Aligning bag of regions for open\-vocabulary object detection\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 15254–15264\.Cited by:[§2](https://arxiv.org/html/2606.27527#S2.p1.1)\.
- X\. Xu, T\. Xiong, Z\. Ding, and Z\. Tu \(2023\)Masqclip for open\-vocabulary universal image segmentation\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 887–898\.Cited by:[§2](https://arxiv.org/html/2606.27527#S2.p1.1)\.
- Z\. Xue, Z\. Gao, S\. Ren, and H\. Zhao \(2022\)The modality focusing hypothesis: towards understanding crossmodal knowledge distillation\.arXiv preprint arXiv:2206\.06487\.Cited by:[§2](https://arxiv.org/html/2606.27527#S2.p1.1)\.
- Z\. Xue, S\. Ren, Z\. Gao, and H\. Zhao \(2021\)Multimodal knowledge expansion\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 854–863\.Cited by:[§2](https://arxiv.org/html/2606.27527#S2.p1.1)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei,et al\.\(2024\)Qwen2\. 5 technical report\.arXiv preprint arXiv:2412\.15115\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p2.1),[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p2.1),[Table 1](https://arxiv.org/html/2606.27527#S4.T1),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.6.2)\.
- J\. Yang, B\. Martinez, A\. Bulat, G\. Tzimiropoulos,et al\.\(2021\)Knowledge distillation via softmax regression representation learning\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p1.1)\.
- L\. Zhang and K\. Ma \(2020\)Improve object detection with feature\-based knowledge distillation: towards accurate and efficient detectors\.InInternational conference on learning representations,Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p1.1)\.
- B\. Zhao, Q\. Cui, R\. Song, Y\. Qiu, and J\. Liang \(2022\)Decoupled knowledge distillation\.InProceedings of the IEEE/CVF Conference on computer vision and pattern recognition,pp\. 11953–11962\.Cited by:[§1](https://arxiv.org/html/2606.27527#S1.p1.1),[§1](https://arxiv.org/html/2606.27527#S1.p7.1),[§4\.1](https://arxiv.org/html/2606.27527#S4.SS1.p3.1),[§4\.2](https://arxiv.org/html/2606.27527#S4.SS2.p2.1),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.16.16.2),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.27.27.2),[Table 1](https://arxiv.org/html/2606.27527#S4.T1.2.1.5.5.2),[Table 4](https://arxiv.org/html/2606.27527#S4.T4.2.11.11.2),[Table 4](https://arxiv.org/html/2606.27527#S4.T4.2.17.17.2),[Table 4](https://arxiv.org/html/2606.27527#S4.T4.2.5.5.2)\.
- H\. Zheng, J\. Fu, Z\. Zha, and J\. Luo \(2019\)Learning deep bilinear transformation for fine\-grained image representation\.External Links:1911\.03621,[Link](https://arxiv.org/abs/1911.03621)Cited by:[§2](https://arxiv.org/html/2606.27527#S2.p2.1)\.
- J\. Zhu, Y\. Luo, X\. Zheng, H\. Wang, and L\. Wang \(2023\)A good student is cooperative and reliable: cnn\-transformer collaborative learning for semantic segmentation\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 11720–11730\.Cited by:[§2](https://arxiv.org/html/2606.27527#S2.p1.1)\.
## Appendix
## Appendix A Additional Experimental Information
### A\.1 Dataset Details
We conduct experiments on a diverse set of widely used fine\-grained visual classification benchmarks\. Below, we provide a brief description of each dataset used in our evaluation\.
#### Stanford Cars
The Stanford Cars dataset \(Krause et al\., [2013](https://arxiv.org/html/2606.27527#bib.bib55)\) comprises 16,185 images across 196 fine\-grained car categories defined by make, model, and year \(e\.g\., 2012 Tesla Model S, 2012 BMW M3 Coupe\)\. The data are split into 8,144 training and 8,041 testing samples, with each class approximately balanced between the two splits\.
#### Oxford Pets
Oxford Pets \(Parkhi et al\., [2012](https://arxiv.org/html/2606.27527#bib.bib54)\) contains 7,384 images of 37 cat and dog breeds, with roughly 200 samples per class\. The dataset is divided into 3,690 training and 3,694 testing images and is characterized by significant variability in scale, pose, lighting, and appearance, making it a useful benchmark for robust recognition\.
#### 102 Flowers
The 102 Flowers dataset \(Nilsback and Zisserman, [2008](https://arxiv.org/html/2606.27527#bib.bib52)\) includes images of 102 flower species, with 6,552 images used for training and 1,637 for testing\. Each class contains between 40 and 258 samples, exhibiting considerable diversity in viewpoint, scale, and illumination conditions\.
#### CUB\-200
CUB\-200 \(Wah et al\., [2011](https://arxiv.org/html/2606.27527#bib.bib50)\) is a standard benchmark for fine\-grained visual categorization, consisting of 11,788 bird images spanning 200 species\. The dataset is split into 5,994 training and 5,794 testing samples\. Each image is annotated with rich metadata, including part locations, binary attributes, and bounding boxes, enabling more detailed analysis beyond classification accuracy\.
#### FGVC\-Aircraft
FGVC\-Aircraft \(Maji et al\., [2013](https://arxiv.org/html/2606.27527#bib.bib53)\) comprises 9,967 images covering 100 aircraft model variants, with around 100 samples per class\. The dataset is divided into 6,667 training and 3,300 testing images\. Each image is provided with a tight bounding box and a four\-level hierarchical label describing the aircraft type and model\.
#### Caltech\-101
\(Li et al\., [2022](https://arxiv.org/html/2606.27527#bib.bib51)\) Caltech\-101 contains images from 101 object categories\. To focus exclusively on object classification, the background category included in the original release is excluded\. The dataset consists of 4,310 training and 4,367 testing images, with most categories containing around 50 samples\.
### A\.2Training Details
On all datasets apart from ImageNet, we train the student models for 240 epochs with a batch size of 16\. The initial learning rate is 0\.01, divided by 10 at epochs 150, 180, and 210\. The optimizer is SGD with a momentum of 0\.9 and a weight decay of 5e\-4\. For the traditional KD methods, the teachers are trained with the same hyperparameters\. On ImageNet, we follow standard practice and train all models for 100 epochs with a batch size of 512 on 8 GPUs\. The initial learning rate is 0\.2, divided by 10 at epochs 30, 60, and 90\. ViT models are fine\-tuned for 500 epochs with a batch size of 512 using a cosine learning rate scheduler \(Steiner et al\., [2021](https://arxiv.org/html/2606.27527#bib.bib63)\), while CLIP models are fine\-tuned for 75 epochs with a batch size of 16\. The LaViD loss weight $\lambda$ used in each experiment is listed in Table [A](https://arxiv.org/html/2606.27527#A1.T1)\. We also show the impact of the loss weight, using the CUB dataset as an example \(Figure [A](https://arxiv.org/html/2606.27527#A1.F1)\)\.
Figure A: Effect of loss weight $\lambda$ on student accuracy for ResNet\-18, MobileNetV2, and ShuffleNetV2 on the CUB dataset\.

Table A: LaViD loss weight $\lambda$ used in training \(tested at intervals of 10\)\.
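The step schedule described in the training details can be sketched in a few lines. This is a minimal illustration, not the authors' training code: the function name `step_lr` is hypothetical, and it assumes epochs are 0-indexed and that each drop takes effect at the start of the milestone epoch.

```python
def step_lr(epoch, base_lr=0.01, milestones=(150, 180, 210), gamma=0.1):
    """Learning rate in effect during a given epoch of a step schedule.

    Assumption: epochs are 0-indexed and the rate is multiplied by `gamma`
    (i.e. divided by 10) once for every milestone already reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


if __name__ == "__main__":
    # Fine-grained datasets: 240 epochs, initial rate 0.01, drops at 150/180/210.
    for epoch in (0, 149, 150, 180, 210, 239):
        print("fine-grained", epoch, step_lr(epoch))

    # ImageNet: 100 epochs, initial rate 0.2, drops at 30/60/90.
    for epoch in (0, 30, 60, 90):
        print("imagenet", epoch, step_lr(epoch, base_lr=0.2, milestones=(30, 60, 90)))
```

With these defaults the rate stays at 0.01 through epoch 149 and is ten times smaller after each of the three milestones.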
### A\.3Computational Resources
All experiments in this work were conducted on consumer\-grade GPUs, specifically NVIDIA GeForce RTX 2080 Ti or RTX 3090, depending on resource availability at the time of training\. Because of occasional hardware scheduling differences, the actual GPU hours may vary slightly across runs\. Nevertheless, the total training time for each experiment remained small, typically on the order of a few hours, reflecting the lightweight nature of the models and tasks considered in this study\.
### A\.4ImageNet Hierarchy Synsets
As described in Section 4\.2, we evaluate LaViD on nine separate subsets of ImageNet, each corresponding to a semantic group defined by the WordNet hierarchy\. These include domains such as mammal, bird, and artifact\. Out of the 1,000 ImageNet classes, 858 are associated with WordNet synsets that fall into one of these groups\. Below, we provide the full list of classes for each group, corresponding to the results reported in Table [3](https://arxiv.org/html/2606.27527#S4.T3)\.
Artifact:altar, apiary, bakery, Band\-Aid, baluster / handrail, barbershop, barn, bath towel, lighthouse, bell tower, baby bib, ring binder, birdhouse, boathouse, bookstore, bottle cap, brass memorial plaque, breakwater, breastplate, butcher shop, castle, chain\-link fence, chain mail, church, movie theater, cliff dwelling, cloak, clogs, spiral or coil, candy store, cowboy boot, cuirass, dam, dishcloth, dock, dome, doormat, fire screen, fountain, gas mask or respirator, greenhouse, radiator grille, grocery store, handkerchief, holster, home theater, honeycomb, lampshade, lens cap, library, slip\-on shoe, sawmill, manhole cover, megalith, monastery, mosque, mosquito net, tent, necklace, baby pacifier, obelisk, palace, paper towel, patio, pedestal, Pickelhaube, picket fence, pillow, planetarium, plate rack, prison, quilt, restaurant, sneaker, sandal, scabbard, scoreboard, shield, shoe store, shoji screen / room divider, balaclava ski mask, sliding door, stage, through arch bridge, stone wall, stupa, suspension bridge, teddy bear, thatched roof, tile roof, tobacco shop, totem pole, toy store, triumphal arch, turnstile, umbrella, vaulted or arched ceiling, velvet fabric, viaduct, window screen, window shade, wool, split\-rail fence, yurt, dust jacket
Bird:rooster, hen, ostrich, brambling, goldfinch, house finch, junco, indigo bunting, American robin, bulbul, jay, magpie, chickadee, American dipper, kite \(bird of prey\), bald eagle, vulture, great grey owl, black grouse, ptarmigan, ruffed grouse, prairie grouse, peafowl, quail, partridge, african grey parrot, macaw, sulphur\-crested cockatoo, lorikeet, coucal, bee eater, hornbill, hummingbird, jacamar, toucan, duck, red\-breasted merganser, goose, black swan, white stork, black stork, spoonbill, flamingo, little blue heron, great egret, bittern bird, crane bird, limpkin, common gallinule, American coot, bustard, ruddy turnstone, dunlin, common redshank, dowitcher, oystercatcher, pelican, king penguin, albatross
Container:ambulance, amphibious vehicle, trash can, backpack, barrel, wheelbarrow, bathtub, station wagon, beaker, beer bottle, beer glass, tandem bicycle, bucket, taxicab, cauldron, cardboard box / carton, cassette, storage chest, cocktail shaker, coffee mug, coffeemaker, convertible, crate, electric locomotive, envelope, fire truck, forklift, freight car, garbage truck, goblet, go\-kart, golf cart, half\-track, hamper, horse\-drawn vehicle, jeep, rickshaw, ladle, limousine, messenger bag, mailbox, measuring cup, milk can, minivan, mixing bowl, mobile home, ford model t, moped, mortar and pestle, vespa, mountain bike, moving van, bullock cart, product packet / packaging, railroad car, pencil case, Petri dish, pickup truck, piggy bank, pill bottle, drink pitcher, plastic bag, police van, soda bottle, plant pot, purse, race car, rain barrel, recreational vehicle, safe, salt shaker, shopping basket, shopping cart, sleeping bag, snowmobile, snowplow, soap dispenser, soup bowl, sports car, steam locomotive, tram, tank, teapot, thimble, tow truck, tractor, semi\-trailer truck, tray, tricycle, hot tub, unicycle, vase, wallet, sink, water bottle, water jug, water tower, whiskey jug, wine bottle, wooden spoon
Device:abacus, accordion, acoustic guitar, analog clock, assault rifle, banjo, barometer, bassoon, binoculars, hunting bow, buckle, candle, cannon, car mirror, carousel, car wheel, automated teller machine, cello, chainsaw, bell or wind chime, combination lock, computer keyboard, cornet, construction crane, desktop computer, digital clock, digital watch, disc brake, drum, electric fan, electric guitar, flute, French horn, gas pump, gong, grand piano, guillotine, hair clip, hand\-held computer, hard disk drive, harmonica, harp, combine harvester, hook, hourglass, carved pumpkin, joystick, knot, laptop computer, lighter, music speaker, loupe magnifying glass, magnetic compass, maraca, marimba, maypole, microphone, missile, computer mouse, mousetrap, muzzle, metal nail, neck brace, notebook computer, oboe, ocarina, odometer, oil filter, pipe organ, oxygen mask, paddle wheel, padlock, paintbrush, pan flute, parking meter, plectrum, pier, pinwheel, potter’s wheel, power drill, printer, projector, hockey puck, radiator, radio telescope, fishing casting reel, remote control, revolver, rifle, ruler measuring stick, safety pin, saxophone, weighing scale, CRT monitor, screw, ski, slide rule, slot machine, snorkel, solar thermal collector, space heater, spider web, spotlight, steel drum, stethoscope, stopwatch, stove, strainer, sundial, sunglasses, swing, electrical switch, syringe, threshing machine, torch, tripod, trombone, typewriter keyboard, upright piano, vending machine, violin, wall clock, whistle, airplane wing, website
Instrumentality:aircraft carrier, airliner, airship, balance beam, balloon, ballpoint pen, barbell, barber chair, baseball, basketball, bassinet, bobsleigh, bookcase, broom, high\-speed train, canoe, can opener, tool kit, cassette player, catamaran, CD player, mobile phone, chain, chiffonier, china cabinet, cleaver, container ship, corkscrew, cradle, infant bed, Crock Pot, croquet ball, crutch, desk, rotary dial telephone, dining table, dog sled, drilling rig, drumstick, dumbbell, entertainment center, face powder, filing cabinet, fireboat, flagpole, folding chair, fountain pen, four\-poster bed, frying pan, golf ball, gondola, hair spray, hammer, hatchet, gymnastic horizontal bar, iPod, jigsaw puzzle, lawn mower, letter opener, lifeboat, ocean liner, lipstick, lotion, matchstick, maze, medicine cabinet, minibus, modem, monitor, oscilloscope, paddle, parachute, parallel bars, park bench, payphone, pencil sharpener, perfume, photocopier, ping\-pong ball, pirate ship, block plane, farm plow, plunger, Polaroid camera, pole, pool table, prayer rug, punching bag, quill, racket, radio, reflex camera, rocking chair, eraser, rugby ball, school bus, schooner, screwdriver, shovel, shower curtain, soccer ball, keyboard space bar, space shuttle, spatula, motorboat, spindle, stretcher, couch, submarine, sunscreen, mop, table lamp, tape player, television, tennis ball, front curtain, throne, toilet seat, trimaran, trolleybus, volleyball, wardrobe, military aircraft, wok, shipwreck, sailboat, comic book, crossword
Invertebrate:trilobite, harvestman, scorpion, yellow garden spider, barn spider, European garden spider, southern black widow, tarantula, wolf spider, tick, centipede, jellyfish, sea anemone, brain coral, flatworm, nematode, conch, snail, slug, sea slug, chiton, chambered nautilus, Dungeness crab, rock crab, fiddler crab, red king crab, American lobster, spiny lobster, crayfish, hermit crab, isopod, tiger beetle, ladybug, ground beetle, longhorn beetle, leaf beetle, dung beetle, rhinoceros beetle, weevil, fly, bee, ant, grasshopper, cricket insect, stick insect, cockroach, praying mantis, cicada, leafhopper, lacewing, dragonfly, damselfly, red admiral butterfly, ringlet butterfly, monarch butterfly, small white butterfly, sulphur butterfly, gossamer\-winged butterfly, starfish, sea urchin, sea cucumber
Mammal:tusker, echidna, platypus, wallaby, koala, wombat, grey whale, killer whale, dugong, sea lion, Chihuahua, Japanese Chin, Maltese, Pekingese, Shih Tzu, King Charles Spaniel, Papillon, toy terrier, Rhodesian Ridgeback, Afghan Hound, Basset Hound, Beagle, Bloodhound, Bluetick Coonhound, Black and Tan Coonhound, Treeing Walker Coonhound, English foxhound, Redbone Coonhound, borzoi, Irish Wolfhound, Italian Greyhound, Whippet, Ibizan Hound, Norwegian Elkhound, Otterhound, Saluki, Scottish Deerhound, Weimaraner, Staffordshire Bull Terrier, American Staffordshire Terrier, Bedlington Terrier, Border Terrier, Kerry Blue Terrier, Irish Terrier, Norfolk Terrier, Norwich Terrier, Yorkshire Terrier, Wire Fox Terrier, Lakeland Terrier, Sealyham Terrier, Airedale Terrier, Cairn Terrier, Australian Terrier, Dandie Dinmont Terrier, Boston Terrier, Miniature Schnauzer, Giant Schnauzer, Standard Schnauzer, Scottish Terrier, Tibetan Terrier, Australian Silky Terrier, Soft\-coated Wheaten Terrier, West Highland White Terrier, Lhasa Apso, Flat\-Coated Retriever, Curly\-coated Retriever, Golden Retriever, Labrador Retriever, Chesapeake Bay Retriever, German Shorthaired Pointer, Vizsla, English Setter, Irish Setter, Gordon Setter, Brittany dog, Clumber Spaniel, English Springer Spaniel, Welsh Springer Spaniel, Cocker Spaniel, Sussex Spaniel, Irish Water Spaniel, Kuvasz, Schipperke, Groenendael dog, Malinois, Briard, Australian Kelpie, Komondor, Old English Sheepdog, Shetland Sheepdog, collie, Border Collie, Bouvier des Flandres dog, Rottweiler, German Shepherd Dog, Dobermann, Miniature Pinscher, Greater Swiss Mountain Dog, Bernese Mountain Dog, Appenzeller Sennenhund, Entlebucher Sennenhund, Boxer, Bullmastiff, Tibetan Mastiff, French Bulldog, Great Dane, St\. Bernard, husky, Alaskan Malamute, Siberian Husky, Dalmatian, Affenpinscher, Basenji, pug, Leonberger, Newfoundland dog, Great Pyrenees dog, Samoyed, Pomeranian, Chow Chow, Keeshond, brussels griffon, Pembroke Welsh Corgi, Cardigan Welsh Corgi, Toy Poodle, Miniature Poodle, Standard Poodle, Mexican hairless dog \(xoloitzcuintli\), grey wolf, Alaskan tundra wolf, red wolf or maned wolf, coyote, dingo, dhole, African wild dog, hyena, red fox, kit fox, Arctic fox, grey fox, tabby cat, tiger cat, Persian cat, Siamese cat, Egyptian Mau, cougar, lynx, leopard, snow leopard, jaguar, lion, tiger, cheetah, brown bear, American black bear, polar bear, sloth bear, mongoose, meerkat, cottontail rabbit, hare, Angora rabbit, hamster, porcupine, fox squirrel, marmot, beaver, guinea pig, common sorrel horse, zebra, pig, wild boar, warthog, hippopotamus, ox, water buffalo, bison, ram \(adult male sheep\), bighorn sheep, Alpine ibex, hartebeest, impala \(antelope\), gazelle, arabian camel, llama, weasel, mink, European polecat, black\-footed ferret, otter, skunk, badger, armadillo, three\-toed sloth, orangutan, gorilla, chimpanzee, gibbon, siamang, guenon, patas monkey, baboon, macaque, langur, black\-and\-white colobus, proboscis monkey, marmoset, white\-headed capuchin, howler monkey, titi monkey, Geoffroy’s spider monkey, common squirrel monkey, ring\-tailed lemur, indri, Asian elephant, African bush elephant, red panda, giant panda
Vertebrate:tench, goldfish, great white shark, tiger shark, hammerhead shark, electric ray, stingray, fire salamander, smooth newt, newt, spotted salamander, axolotl, American bullfrog, tree frog, tailed frog, loggerhead sea turtle, leatherback sea turtle, mud turtle, terrapin, box turtle, banded gecko, green iguana, Carolina anole, desert grassland whiptail lizard, agama, frilled\-necked lizard, alligator lizard, Gila monster, European green lizard, chameleon, Komodo dragon, Nile crocodile, American alligator, triceratops, worm snake, ring\-necked snake, eastern hog\-nosed snake, smooth green snake, kingsnake, garter snake, water snake, vine snake, night snake, boa constrictor, African rock python, Indian cobra, green mamba, sea snake, Saharan horned viper, eastern diamondback rattlesnake, sidewinder rattlesnake, snoek fish, eel, silver salmon, rock beauty fish, clownfish, sturgeon, gar fish, lionfish, pufferfish
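The grouping above can be illustrated with a small sketch. The text does not say how a class is assigned when several groups lie on its hypernym path (for example, container and device both sit under instrumentality, which sits under artifact), so the rule used here, taking the nearest group ancestor, is an assumption. The `PARENT` table is a hypothetical, heavily simplified stand-in for WordNet hypernym links, and the function names are illustrative only.

```python
# Hypothetical, simplified hypernym links (child -> parent); not real WordNet data.
PARENT = {
    "bald eagle": "bird",
    "bird": "vertebrate",
    "goldfish": "vertebrate",
    "Beagle": "mammal",
    "mammal": "vertebrate",
    "tarantula": "invertebrate",
    "beer bottle": "container",
    "container": "instrumentality",
    "violin": "device",
    "device": "instrumentality",
    "canoe": "instrumentality",
    "instrumentality": "artifact",
    "castle": "artifact",
}

# Group names as listed in this section.
GROUPS = {
    "artifact", "bird", "container", "device",
    "instrumentality", "invertebrate", "mammal", "vertebrate",
}


def assign_group(class_name):
    """Walk up the hypernym chain and return the nearest group ancestor.

    Assumption: the most specific (closest) group wins. Returns None when
    no ancestor is a group, mirroring classes that fall outside all groups.
    """
    node = PARENT.get(class_name)
    while node is not None:
        if node in GROUPS:
            return node
        node = PARENT.get(node)
    return None


def group_classes(class_names):
    """Bucket class names by group, skipping classes with no group."""
    buckets = {}
    for name in class_names:
        group = assign_group(name)
        if group is not None:
            buckets.setdefault(group, []).append(name)
    return buckets


if __name__ == "__main__":
    print(group_classes(["bald eagle", "beer bottle", "canoe", "castle", "daisy"]))
```

Under this rule each class lands in exactly one subset, which matches the non-overlapping lists above, though the actual assignment procedure may differ.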
### A\.5Multiple Choice Question Generation
To extract conceptual supervision from the language model, we prompt it to generate multiple\-choice questions that help distinguish between visual classes\. These questions serve as a bridge between semantic knowledge encoded in the LLM and the class\-level structure of the visual dataset\. Each question is designed to capture a visual or contextually related attribute that differentiates one class from another, and the resulting class\-wise logits from the LLM are used as distillation targets\. Below, we provide the exact prompt used to generate 50 questions with five answer options per question\.
```
Your task:
1. Generate 50 questions for
distinguishing between the classes in
a dataset with the requirements below.
2. Each question should be centered around
visual concepts while
slight deviation is acceptable.
An example of a deviation would
be about the environment.
3. Each question should have 5 answer
options and each class can
only have one correct answer option.
Its best to maximize
the number classes that each pick
a different answer option.
4. Each question should contain
the class in the question.
5. Questions should maximize the
separation between classes like a
decision tree maximizing entropy.
6. Use your understanding of all of
the classes and their visual
differences to create these questions.
7. Only output ALL of the questions
and answer options.
8. Do not repeat questions.
9. Do not write code.
10. Do not include class names in the
answer options.
The classes:
<classes>
Output format:
- For each question, use the specific
format:
Ψ[Question]
Ψ1. [Option 1]
- Do not add additional commentary.
- Do not include the square brackets
in the answer.
- Do not number the questions.
```
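To make the role of the generated questions concrete, the sketch below turns per-question LLM logits for a class into a soft-label signature and computes an auxiliary loss for a student. It is an illustration under stated assumptions, not the paper's implementation: the use of a softmax per question, KL divergence as the auxiliary loss, a student head that emits one logit per answer option, and all function names are assumptions made here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one question's answer options."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_signature(llm_logits):
    """Map one class's LLM logits (questions x options) to soft labels."""
    return [softmax(question_logits) for question_logits in llm_logits]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def signature_loss(student_logits, target_signature):
    """Mean KL(target || student) over questions (assumed loss form)."""
    per_question = [
        kl_divergence(target, softmax(logits))
        for logits, target in zip(student_logits, target_signature)
    ]
    return sum(per_question) / len(per_question)


def total_loss(ce_loss, student_logits, target_signature, lam):
    """Classification loss plus the lambda-weighted auxiliary term."""
    return ce_loss + lam * signature_loss(student_logits, target_signature)


if __name__ == "__main__":
    # Two questions with five options each, for a single class (made-up logits).
    teacher = [[2.0, 0.5, 0.1, -1.0, 0.0], [0.0, 0.0, 3.0, 0.0, 0.0]]
    signature = class_signature(teacher)
    uninformed_student = [[0.0] * 5, [0.0] * 5]
    print(total_loss(1.0, uninformed_student, signature, lam=10))
```

A student whose auxiliary outputs match the signature incurs no extra penalty, while an uninformed student is pushed toward the class's answer pattern in proportion to the loss weight.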
Table B: Top\-1 accuracy \(%\) for student models trained with feature\-based distillation from different LLM layers of LLaVA\-1\.5\. We compare FitNet and CRD across six fine\-grained classification datasets, using three student architectures: ResNet\-18 \(RN18\), MobileNetV2 \(MNV2\), and ShuffleNetV2 \(SNV2\)\. Each column corresponds to a different LLaVA transformer layer, with “\-1” indicating the layer closest to the output and “\-30” the layer closest to the input\. Mid\-to\-late layers often yield the best results, indicating that semantically rich supervision emerges progressively within the LLM\.
## Appendix BAblation Study on Feature\-Based Distillation from a Multimodal Teacher
To establish stronger feature\-based LLM baselines for comparison, we adapt two representative knowledge distillation methods, FitNet \(Romero et al\., [2014](https://arxiv.org/html/2606.27527#bib.bib11)\) and Contrastive Representation Distillation \(CRD\) \(Tian et al\., [2019](https://arxiv.org/html/2606.27527#bib.bib12)\), to use LLaVA \(Liu et al\., [2023](https://arxiv.org/html/2606.27527#bib.bib33)\) as the teacher\. These serve as key baselines in our main evaluation \(Section 4\.2\)\. Since traditional feature\-based KD methods rely on matching internal activations, we extract features from various layers of LLaVA and assess their impact on student performance\. We find that distillation performance varies substantially by layer, motivating a layer\-wise ablation study to fairly configure each baseline\.
#### Experimental Setup
We use the standard LLaVA\-1\.5 model and follow its prompting format, where the image is prepended to the language prompt using a special token \(e\.g\., <image\>\)\. The full prompt is structured as a user\-assistant exchange:
> USER: <image\> Is there a <class\> in this image? ASSISTANT:
We feed this prompt into LLaVA and extract the embedding of the final token at each transformer layer of the LLM\. This token embedding reflects the fused multimodal representation at various levels of abstraction\. We then use this as the distillation target for training a vision\-only student model\. Distillation is applied using either FitNet \(with $\ell_2$ regression\) or CRD \(with contrastive learning\), and we vary the teacher layer from which the token embedding is extracted\.
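The setup can be sketched without any model dependency. The example below builds the prompt, picks the final-token embedding from a chosen layer using a negative index (so that -1 is the layer closest to the output), and computes a FitNet-style regression loss. It is a simplified illustration: `hidden_states` is assumed to be a plain nested list of per-layer, per-token embeddings, the projection weights are fixed rather than learned, the loss is averaged over dimensions, and all function names are hypothetical.

```python
def build_prompt(class_name):
    """Fill the class name into the user-assistant prompt from the text."""
    return f"USER: <image> Is there a {class_name} in this image? ASSISTANT:"


def select_teacher_feature(hidden_states, layer):
    """Return the final-token embedding at the given layer.

    Assumption: hidden_states[layer][token] is an embedding (list of floats),
    ordered from input to output, so layer=-1 is the last transformer layer.
    """
    return hidden_states[layer][-1]


def project(student_feature, weight):
    """Linear map from student to teacher dimension (weight: out x in)."""
    return [sum(w * x for w, x in zip(row, student_feature)) for row in weight]


def fitnet_l2_loss(student_feature, teacher_feature, weight):
    """Mean squared error between projected student and teacher features."""
    projected = project(student_feature, weight)
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_feature)]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    # Two layers, two tokens, embedding size 2 (made-up numbers).
    hidden_states = [
        [[0.0, 0.0], [1.0, 2.0]],
        [[0.5, 0.5], [3.0, 4.0]],
    ]
    teacher = select_teacher_feature(hidden_states, layer=-1)
    weight = [[3.0], [4.0]]  # maps a 1-d student feature to 2-d
    print(build_prompt("cardinal"))
    print(fitnet_l2_loss([1.0], teacher, weight))
```

In an actual FitNet setup the projection would be trained jointly with the student; it is held fixed here only to keep the sketch self-contained.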
#### Feature Layer Selection for Feature\-based Distillation
Table [B](https://arxiv.org/html/2606.27527#A1.T2) reports top\-1 accuracy on the same six main classification datasets using ResNet\-18, MobileNetV2, and ShuffleNetV2 as student architectures\. Each column corresponds to a different transformer layer of the LLaVA LLM, with “\-1” representing the final layer and “\-30” the earliest layer\. We observe that mid\-to\-late layers \(e\.g\., \-12 to \-18\) tend to produce stronger supervision signals, suggesting that class\-level semantic structure becomes more explicit in deeper LLM layers\. Neither method consistently outperforms the other, although CRD does better in most cases, which reflects the strength of contrastive alignment in high\-dimensional spaces\. Overall, given the inconsistent performance of these feature\-based baselines across layers, LaViD remains the more effective way to harness an LLM for vision distillation, even without access to the vision modality\.