EVLA: An Electro-Aware Multimodal Assistant for Physically-Grounded Driving Reasoning and Control
Summary
Introduces EVLA, a framework that enhances vision-language driving assistants with real-time awareness of electrified powertrain states, enabling energy-optimal and physically grounded decisions.
# EVLA: An Electro-Aware Multimodal Assistant for Physically-Grounded Driving Reasoning and Control
Source: [https://arxiv.org/html/2606.28938](https://arxiv.org/html/2606.28938)
Yuxin Liu Zihan Chen Haoyu Wang Mingxuan Zhang Ruijie Lin Siyuan Zhao College of Computer Science and Technology Zhejiang University Hangzhou, Zhejiang, China
###### Abstract
Modern vision\-language models \(VLMs\) for driving assistants typically treat vehicle dynamics as a black box, resulting in decisions that lack awareness of the vehicle’s real\-time electro\-mechanical state\. To bridge this gap, we introduce the Electro\-Visual\-Language Assistant \(EVLA\)—a novel framework that combines multi\-modal scene understanding with real\-time perception of the electrified powertrain state \(e\.g\., motor torque, battery SOC\)\. Our approach features two key innovations: first, a Unified Co\-State Encoder \(UCSE\) that fuses visual, textual, and vehicle\-state inputs into a shared latent representation, augmented with an Energy\-Efficiency Field to model spatial energy costs; and second, an Electro\-aware Structured Reasoning Chain \(ESRC\), which replaces external chain\-of\-thought prompting with an internal, deterministic reasoning process grounded in physical constraints and optimization objectives\. Trained end\-to\-end with a physics\-guided joint loss, EVLA learns to generate context\-aware and energy\-optimal driving decisions\. Extensive evaluations on a driving QA benchmark demonstrate that EVLA substantially outperforms strong fine\-tuned VLM baselines, improving the final score by \+0\.0871 and accuracy by \+5\.6%\. Ablation studies validate the necessity of each component, and efficiency analyses show that EVLA achieves 36% faster inference than multi\-stage pipelines\. This work underscores that integrating vehicle\-state awareness and structured physical reasoning is crucial for developing next\-generation, physically\-grounded driving assistants\.
Figure 1: Motivation of EVLA. Existing vision–language driving assistants ignore electrified powertrain states, leading to physically ungrounded reasoning and unreliable control. EVLA explicitly integrates visual perception, language instructions, and vehicle state information to enable energy-aware and physically grounded driving decisions.
## 1 Introduction
Recent advancements in Vision-Language Models (VLMs) have demonstrated considerable promise for intelligent driving systems. By jointly processing visual scenes and natural language, VLMs can interpret road conditions, detect obstacles, and answer complex queries about the driving environment, thereby enhancing the reasoning capabilities and situational awareness of autonomous agents Zhou et al. ([2024](https://arxiv.org/html/2606.28938#bib.bib9)); Xu et al. ([2024](https://arxiv.org/html/2606.28938#bib.bib8)); Liu et al. ([2023](https://arxiv.org/html/2606.28938#bib.bib7)); Zhou et al. ([2025](https://arxiv.org/html/2606.28938#bib.bib46)); Qi et al. ([2022](https://arxiv.org/html/2606.28938#bib.bib30)). This progress is exemplified by benchmarks such as the CVPR 2024 Driving with Language challenge, which focuses on developing models capable of addressing diverse driving questions using multi-view image inputs.
Despite these advances, a fundamental limitation persists. Existing VLM-based approaches for driving largely operate as passive visual question-answering systems. They treat the autonomous vehicle as a black box, lacking explicit comprehension of its internal electro-mechanical state, including motor torque, battery state-of-charge, and thermal limits. This oversight hinders holistic reasoning for tasks such as energy-efficient planning, where decisions must integrate both external scene semantics and internal vehicle dynamics Wu et al. ([2024a](https://arxiv.org/html/2606.28938#bib.bib36)); Tian et al. ([2025](https://arxiv.org/html/2606.28938#bib.bib38)). Complementary evidence from physics-informed lane-change intention prediction shows that explicitly encoding kinematics and interaction-safety variables materially improves maneuver anticipation across both straight-highway and ramp scenarios, especially when the horizon increases Shi et al. ([2025](https://arxiv.org/html/2606.28938#bib.bib77)). Moreover, current methods often depend on heuristic post-processing or unstructured, open-ended Chain-of-Thought prompting, which can compromise robustness and physical consistency Lin ([2025a](https://arxiv.org/html/2606.28938#bib.bib40)); He et al. ([2025b](https://arxiv.org/html/2606.28938#bib.bib42)).
To address this gap, we propose the Electro-Visual-Language Assistant (EVLA), a novel framework designed for state-aware, physically-grounded driving assistance. Inspired by Qu and Ma ([2025](https://arxiv.org/html/2606.28938#bib.bib29)); Song et al. ([2025](https://arxiv.org/html/2606.28938#bib.bib43)) and building upon Wu et al. ([2024b](https://arxiv.org/html/2606.28938#bib.bib35), [c](https://arxiv.org/html/2606.28938#bib.bib32)); Cao et al. ([2025b](https://arxiv.org/html/2606.28938#bib.bib49)), our primary contribution is a unified architecture that seamlessly integrates visual perception, language understanding, and real-time powertrain state reasoning, achieving superior performance through joint modeling of scene dynamics and vehicle physics. Specifically, our work introduces three key innovations. First, extending the federated learning paradigms of Wu et al. ([2022](https://arxiv.org/html/2606.28938#bib.bib31)); Wang et al. ([2023](https://arxiv.org/html/2606.28938#bib.bib34)); Yu ([2025](https://arxiv.org/html/2606.28938#bib.bib53)), we propose a Unified Co-State Encoder (UCSE) that fuses multi-view images, textual queries, and a real-time vehicle state vector into a shared latent representation, from which an interpretable Energy-Efficiency Field (EEF) map is derived Xin et al. ([2025a](https://arxiv.org/html/2606.28938#bib.bib50)); Wang et al. ([2025b](https://arxiv.org/html/2606.28938#bib.bib60)).
Second, outperforming traditional Chain-of-Thought approaches Lin ([2025b](https://arxiv.org/html/2606.28938#bib.bib39)); Yan et al. ([2025](https://arxiv.org/html/2606.28938#bib.bib61)), we develop an Electro-aware Structured Reasoning Chain (ESRC), a deterministic internal module that replaces external prompting by performing structured parsing, constraint formalization, and symbolic deduction based on the joint scene-and-state context Bai et al. ([2025](https://arxiv.org/html/2606.28938#bib.bib56)); Wang et al. ([2024](https://arxiv.org/html/2606.28938#bib.bib63)). Third, we introduce a Physics-Guided Joint Training Objective that supervises the model not only on language generation but also on state prediction, control consistency, and EEF estimation, ensuring its reasoning is grounded in domain knowledge Wu et al. ([2020](https://arxiv.org/html/2606.28938#bib.bib33)); Wang ([2025](https://arxiv.org/html/2606.28938#bib.bib72)); Yu et al. ([2025a](https://arxiv.org/html/2606.28938#bib.bib66)).
Extensive experiments on the DriveLM-nuScenes benchmark show that EVLA substantially outperforms strong fine-tuning baselines, setting a new state-of-the-art Yang et al. ([2025](https://arxiv.org/html/2606.28938#bib.bib44)); Bi et al. ([2025](https://arxiv.org/html/2606.28938#bib.bib67)). For instance, our full model achieves a final score of 0.8548, exceeding the best baseline by a margin of +0.0871, demonstrating improvements comparable to He et al. ([2025a](https://arxiv.org/html/2606.28938#bib.bib45)); Cao et al. ([2025a](https://arxiv.org/html/2606.28938#bib.bib48)); Xu et al. ([2025](https://arxiv.org/html/2606.28938#bib.bib68)); Chen et al. ([2025a](https://arxiv.org/html/2606.28938#bib.bib86)); You et al. ([2026](https://arxiv.org/html/2606.28938#bib.bib85)); Chen et al. ([2025c](https://arxiv.org/html/2606.28938#bib.bib84)); Zhang et al. ([2026a](https://arxiv.org/html/2606.28938#bib.bib83)); Zhao et al. ([2026](https://arxiv.org/html/2606.28938#bib.bib82)); Huang et al. ([2026](https://arxiv.org/html/2606.28938#bib.bib81)); Chen et al. ([2025b](https://arxiv.org/html/2606.28938#bib.bib80)). Ablation studies validate the necessity of each proposed component, demonstrating that jointly modeling scene dynamics and vehicle physics is essential, particularly for complex prediction and planning tasks. Furthermore, EVLA’s end-to-end design provides a more efficient inference pipeline compared to prior multi-stage approaches.
The remainder of this paper is organized as follows. We review related work in [Section 2](https://arxiv.org/html/2606.28938#S2). We detail the EVLA methodology in [Section 3](https://arxiv.org/html/2606.28938#S3). The dataset, training protocol, and comprehensive experimental results are presented in [Section 4](https://arxiv.org/html/2606.28938#S4). Finally, [Section 5](https://arxiv.org/html/2606.28938#S5) summarizes our findings and contributions.
## 2 Related Work
### 2.1 Vision-Language Models for Autonomous Driving
The integration of vision-language models into autonomous driving has emerged as a promising research direction. Zhou et al. ([2024](https://arxiv.org/html/2606.28938#bib.bib9)); Han et al. ([2025](https://arxiv.org/html/2606.28938#bib.bib69)) provide a comprehensive survey on VLMs in autonomous driving, covering perception, navigation, planning, and end-to-end driving applications. Recent advances in multimodal large language models have further expanded the capabilities of such systems Liang et al. ([2024](https://arxiv.org/html/2606.28938#bib.bib17)); Xin et al. ([2025b](https://arxiv.org/html/2606.28938#bib.bib51)); Niu et al. ([2024a](https://arxiv.org/html/2606.28938#bib.bib62)); You et al. ([2025](https://arxiv.org/html/2606.28938#bib.bib71)). Early attempts focused on scene captioning and visual question answering for driving scenarios. More recently, DriveGPT4 Xu et al. ([2024](https://arxiv.org/html/2606.28938#bib.bib8)); Yu et al. ([2025b](https://arxiv.org/html/2606.28938#bib.bib76)) pioneered interpretable end-to-end autonomous driving by leveraging large language models to simultaneously predict control signals and provide natural language explanations. DriveVLM Tian et al. ([2024](https://arxiv.org/html/2606.28938#bib.bib10)); Yu et al. ([2025c](https://arxiv.org/html/2606.28938#bib.bib78)) introduced a hybrid system combining VLM reasoning with traditional driving pipelines, demonstrating improved spatial reasoning capabilities. Zhang et al. ([2025b](https://arxiv.org/html/2606.28938#bib.bib87), [e](https://arxiv.org/html/2606.28938#bib.bib88), [c](https://arxiv.org/html/2606.28938#bib.bib89), [d](https://arxiv.org/html/2606.28938#bib.bib90), [a](https://arxiv.org/html/2606.28938#bib.bib91)); Mo et al. ([2026](https://arxiv.org/html/2606.28938#bib.bib92)); Yu et al. ([2026](https://arxiv.org/html/2606.28938#bib.bib93)); Zhang et al. ([2026b](https://arxiv.org/html/2606.28938#bib.bib94))
The DriveLM benchmark Sima et al. ([2024](https://arxiv.org/html/2606.28938#bib.bib1)) established a graph-structured visual question answering framework for driving, enabling systematic evaluation of perception, prediction, and planning capabilities. Built upon the nuScenes dataset Caesar et al. ([2020](https://arxiv.org/html/2606.28938#bib.bib2)), DriveLM provides diverse QA pairs that test models’ understanding of complex driving scenarios. LLaVA Liu et al. ([2023](https://arxiv.org/html/2606.28938#bib.bib7)) and its successor LLaVA-NeXT Liu et al. ([2024a](https://arxiv.org/html/2606.28938#bib.bib16)) have become popular backbone architectures for multimodal driving assistants due to their strong visual instruction-following capabilities.
Despite these advances, existing VLM\-based approaches treat the vehicle as an opaque entity, ignoring critical internal states such as battery charge, motor efficiency, and thermal constraints\. Our work addresses this gap by explicitly modeling the electrified powertrain state within the VLM framework\.
### 2.2 Electrified Powertrain and Energy Management
Energy management in electrified vehicles has been extensively studied in the control systems community Zhang et al. ([2015](https://arxiv.org/html/2606.28938#bib.bib13)); Wei et al. ([2025a](https://arxiv.org/html/2606.28938#bib.bib26)). Key challenges include optimizing motor efficiency, managing battery state-of-charge, and balancing performance with energy consumption. Traditional approaches rely on rule-based strategies or model predictive control, which require explicit vehicle models and cannot easily integrate perceptual information Wang ([2024](https://arxiv.org/html/2606.28938#bib.bib73)); Wang and Sayil ([2024](https://arxiv.org/html/2606.28938#bib.bib74)).
Recent work has explored learning\-based approaches for energy\-optimal driving, but these typically operate independently from perception systems\. To our knowledge, EVLA is the first framework to jointly model visual perception, language understanding, and electrified powertrain dynamics within a unified architecture, enabling energy\-aware decisions that are grounded in both scene context and vehicle physics\.
### 2.3 Structured Reasoning in Language Models
Chain-of-thought (CoT) prompting has demonstrated significant improvements in complex reasoning tasks for large language models Niu et al. ([2024b](https://arxiv.org/html/2606.28938#bib.bib65)). However, external CoT prompting relies on carefully crafted templates and may produce inconsistent or physically implausible reasoning chains. Recent work has explored internalizing reasoning processes within model architectures Lin ([2025c](https://arxiv.org/html/2606.28938#bib.bib41)); Wei et al. ([2025b](https://arxiv.org/html/2606.28938#bib.bib70)).
Our proposed Electro\-aware Structured Reasoning Chain \(ESRC\) differs from generic CoT approaches by incorporating domain\-specific constraints from vehicle physics\. Rather than generating free\-form reasoning text, ESRC performs structured parsing, constraint formalization, and symbolic deduction, ensuring that reasoning outputs adhere to physical laws and powertrain limitations\.
Figure 2: Architecture of EVLA. EVLA encodes visual scenes, language instructions, and electrified vehicle states via modality-specific encoders, fuses them with a Unified Co-State Encoder, and performs electro-aware structured reasoning to generate safe, energy-efficient, and interpretable driving actions.
Beyond generic chain-of-thought prompting, structured and constraint-aware reasoning has long been recognized as a fundamental requirement for ensuring reliability and verifiability in complex systems. Prior studies on system diagnosability and network reliability have demonstrated that explicitly modeling structural constraints and feasibility conditions is crucial for dependable decision-making, particularly in large-scale interconnected systems and comparison-based diagnostic models Wang and Wang ([2016](https://arxiv.org/html/2606.28938#bib.bib25), [2018](https://arxiv.org/html/2606.28938#bib.bib27), [2019](https://arxiv.org/html/2606.28938#bib.bib24)); Wang et al. ([2025a](https://arxiv.org/html/2606.28938#bib.bib23)); Xiang et al. ([2025](https://arxiv.org/html/2606.28938#bib.bib54)); Pan et al. ([2024](https://arxiv.org/html/2606.28938#bib.bib58)). These works collectively highlight that reliable reasoning should be grounded in formal constraints and structural properties rather than unconstrained heuristic inference. This perspective directly motivates our Electro-aware Structured Reasoning Chain (ESRC), which internalizes constraint formalization and symbolic feasibility checking within the driving assistant framework.
## 3 Methodology: Electro-Visual-Language Assistant (EVLA)
We introduce the Electro\-Visual\-Language Assistant \(EVLA\), a novel framework that integrates multimodal visual\-language understanding with real\-time perception and reasoning of vehicle electrified powertrain states\. Unlike prior VLM\-based driving assistants that treat vehicle dynamics as a black box, EVLA explicitly models the interplay among visual scenes, linguistic instructions, and core electromechanical states \(e\.g\., motor torque, battery state of charge \(SOC\), inverter temperature\) to generate context\-aware and energy\-optimal decisions\.
### 3.1 Problem Formulation and Input Representation
At each time step $t$, the model receives three inputs: multi-view camera images $\mathcal{I}_{t}=\{I_{t}^{(1)},\ldots,I_{t}^{(V)}\}$, a textual query or command $Q_{t}$, and a real-time vehicle state vector $\mathbf{s}_{t}^{veh}\in\mathbb{R}^{D}$. This vector encapsulates key powertrain parameters:

$$\mathbf{s}_{t}^{veh}=[\tau_{m},\omega_{m},P_{batt},SOC,T_{inv},T_{motor},\ldots]_{t}^{T},\tag{1}$$

where $\tau_{m}$ denotes motor torque, $\omega_{m}$ motor speed, $P_{batt}$ battery power, $SOC$ state of charge, and $T_{inv}$ and $T_{motor}$ inverter and motor temperatures, respectively. The goal is to produce a holistic response $\mathcal{A}_{t}$, consisting of a natural language answer $A_{t}^{text}$ and, when applicable, a set of suggested control parameters $\mathbf{a}_{t}^{ctrl}$ (e.g., target deceleration, recuperation level) that adhere to safety, comfort, and powertrain efficiency constraints.
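As a reading aid for Eq. (1), the sketch below shows one way the state vector could be packed in code. The class name, field names, and units are assumptions made for illustration; the paper only lists the quantities and leaves the remaining entries of the vector unspecified.

```python
from dataclasses import astuple, dataclass
from typing import List


@dataclass(frozen=True)
class VehicleState:
    """One time step of the powertrain state vector s_t^veh (hypothetical layout)."""
    motor_torque: float    # tau_m, assumed N*m
    motor_speed: float     # omega_m, assumed rad/s
    battery_power: float   # P_batt, assumed kW
    soc: float             # state of charge, assumed to lie in [0, 1]
    inverter_temp: float   # T_inv, assumed degrees Celsius
    motor_temp: float      # T_motor, assumed degrees Celsius

    def to_vector(self) -> List[float]:
        """Flatten the fields in the order used by Eq. (1)."""
        return list(astuple(self))


state = VehicleState(120.0, 350.0, 42.0, 0.63, 58.0, 71.0)
vec = state.to_vector()  # a D = 6 vector in this toy layout
```

Keeping the field order fixed matters because any learned projection of the vector depends on which index carries which quantity.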
### 3.2 Unified Co-State Encoder (UCSE)
The first core innovation is the Unified Co-State Encoder (UCSE) $E_{\theta}^{UCSE}$, which projects heterogeneous inputs into a shared, semantically rich latent space that jointly represents scene content and powertrain status:

$$\mathbf{Z}_{t}^{co}=E_{\theta}^{UCSE}(\mathcal{I}_{t},\mathbf{s}_{t}^{veh},Q_{t}).\tag{2}$$

Here, $\mathbf{Z}_{t}^{co}$ denotes the cooperative latent state. $E_{\theta}^{UCSE}$ is implemented as a multimodal transformer. Visual features $\mathbf{F}_{t}^{vis}$ are extracted from $\mathcal{I}_{t}$ using a vision encoder (e.g., CLIP-ViT Radford et al. ([2021](https://arxiv.org/html/2606.28938#bib.bib11))). The design of multimodal embeddings follows recent advances in representation learning Zhang et al. ([2024](https://arxiv.org/html/2606.28938#bib.bib20), [2025f](https://arxiv.org/html/2606.28938#bib.bib64)). The vehicle state $\mathbf{s}_{t}^{veh}$ is projected via a linear layer to $\mathbf{F}_{t}^{veh}$, and the text $Q_{t}$ is tokenized into $\mathbf{F}_{t}^{text}$. These modalities are then fused through a transformer with cross-attention layers:

$$\mathbf{Z}_{t}^{co}=\text{Transformer-Fusion}(\mathbf{F}_{t}^{vis},\mathbf{F}_{t}^{veh},\mathbf{F}_{t}^{text}).\tag{3}$$

A key output derived from $\mathbf{Z}_{t}^{co}$ is the Energy-Efficiency Field (EEF) map $\mathbf{M}_{t}^{EEF}\in\mathbb{R}^{H\times W}$. For each spatial location in the egocentric view, $\mathbf{M}_{t}^{EEF}$ estimates a scalar value proportional to the expected energy cost (or recuperation potential) of a vehicle action centered at that location, given the current powertrain state $\mathbf{s}_{t}^{veh}$. This replaces and generalizes the heuristic depth-based object distance estimation used in prior work.
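To make the fusion step in Eq. (3) more concrete, here is a deliberately tiny sketch of its two ingredients: a linear projection of the state vector to the token width, followed by single-head scaled dot-product cross-attention in which the state token attends over visual and text tokens. It is not the authors' implementation; the helper names, toy dimensions, and the choice of a single head and a single query are all assumptions.

```python
import math
from typing import List

Vector = List[float]


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Compute y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Single-head scaled dot-product attention for one query token."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(logits)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]


# Toy inputs: two visual tokens, one text token, a 3-dimensional state, token width 2.
f_vis = [[1.0, 0.0], [0.0, 1.0]]
f_text = [[0.5, 0.5]]
s_veh = [0.2, 0.4, 0.6]
proj_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # hypothetical projection weights
proj_b = [0.0, 0.0]

f_veh = linear(s_veh, proj_w, proj_b)            # F_t^veh
context = f_vis + f_text                         # tokens the state token attends to
z_state = cross_attend(f_veh, context, context)  # one fused token of Z_t^co
```

A full transformer would add learned query, key, and value projections, multiple heads, and stacked layers; the sketch only shows how a powertrain token can be mixed with scene and text tokens.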
From a modeling perspective, the Energy-Efficiency Field (EEF) can be viewed as a structured spatial abstraction that embeds system-level constraints into a learnable representation. Similar abstraction principles have been extensively studied in combinatorial structures, graph connectivity, and constrained optimization over complex networks, where global properties emerge from local structural rules Mu-Jiang-shan et al. ([2010](https://arxiv.org/html/2606.28938#bib.bib28)); Wang et al. ([2013](https://arxiv.org/html/2606.28938#bib.bib55)); Lin et al. ([2017](https://arxiv.org/html/2606.28938#bib.bib22)); Wang et al. ([2011](https://arxiv.org/html/2606.28938#bib.bib57), [2012](https://arxiv.org/html/2606.28938#bib.bib59)). More recently, spatio-temporal graph learning frameworks have further demonstrated the effectiveness of integrating structured representations with data-driven models for non-stationary and dynamic systems Wei et al. ([2025a](https://arxiv.org/html/2606.28938#bib.bib26)); Deng ([2026](https://arxiv.org/html/2606.28938#bib.bib79)). These insights support our design choice of representing energy-aware driving costs as a structured field derived from the unified co-state latent space.
### 3.3 Electro-aware Structured Reasoning Chain (ESRC)
The second innovation is the Electro-aware Structured Reasoning Chain (ESRC), an internal structured reasoning module that replaces external template-based Chain-of-Thought prompting. ESRC takes $\mathbf{Z}_{t}^{co}$ as input and performs a deterministic multi-step reasoning process to produce a structured reasoning trace $\mathcal{R}_{t}^{struct}$ along with the final response components:

$$\mathcal{R}_{t}^{struct},A_{t}^{text},\mathbf{a}_{t}^{ctrl}=\text{ESRC}(\mathbf{Z}_{t}^{co}).\tag{4}$$

ESRC consists of four sequential sub-functions. The Scene & Powertrain Parser decomposes $\mathbf{Z}_{t}^{co}$ into explicit factors:

$$\mathcal{R}_{t}^{scene}=f_{parser}^{scene}(\mathbf{Z}_{t}^{co}),\tag{5}$$

$$\mathcal{R}_{t}^{powertrain}=f_{parser}^{powertrain}(\mathbf{Z}_{t}^{co}),\tag{6}$$

where $\mathcal{R}_{t}^{scene}$ captures objects, lanes, and traffic signals, while $\mathcal{R}_{t}^{powertrain}$ encodes powertrain status such as “motor in high-efficiency zone” or “battery charging limited.”
The Constraint & Objective Formalizer translates the parsed context and query intent into an optimization problem:

$$\mathcal{P}_{t}^{opt}=(\mathcal{F}_{t}^{obj},\mathcal{C}_{t})=f_{formalizer}(\mathcal{R}_{t}^{scene},\mathcal{R}_{t}^{powertrain},Q_{t}),\tag{7}$$

where $\mathcal{F}_{t}^{obj}$ is a multi-objective function balancing safety, progress, and energy efficiency, and $\mathcal{C}_{t}$ is a set of constraints from traffic rules and physical limits (e.g., $\tau_{m}\leq\tau_{\max}(\omega_{m})$, $P_{batt}\in[P_{discharge}^{\max},P_{charge}^{\max}]$).
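The two example constraints in $\mathcal{C}_{t}$ can be checked with a few lines of code. The sketch below is only an illustration under stated assumptions: the torque envelope `tau_max` uses a textbook constant-torque then constant-power shape, battery power is taken as negative when discharging so that the interval is ordered, and all limit values are invented for the example rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PowertrainLimits:
    rated_torque: float     # N*m, cap in the constant-torque region (assumed)
    rated_power: float      # W, cap in the constant-power region (assumed)
    p_discharge_max: float  # W, lower end of the P_batt interval (negative = discharging)
    p_charge_max: float     # W, upper end of the P_batt interval (positive = charging)


def tau_max(omega_m: float, limits: PowertrainLimits) -> float:
    """Hypothetical torque envelope: constant torque, then constant power."""
    if omega_m <= 0.0:
        return limits.rated_torque
    return min(limits.rated_torque, limits.rated_power / omega_m)


def violated_constraints(tau_m: float, omega_m: float, p_batt: float,
                         limits: PowertrainLimits) -> List[str]:
    """Return a human-readable list of the constraints that do not hold."""
    violations = []
    if tau_m > tau_max(omega_m, limits):
        violations.append("torque exceeds tau_max(omega_m)")
    if not (limits.p_discharge_max <= p_batt <= limits.p_charge_max):
        violations.append("battery power outside [P_discharge_max, P_charge_max]")
    return violations


limits = PowertrainLimits(300.0, 150_000.0, -150_000.0, 60_000.0)
feasible = violated_constraints(250.0, 400.0, -110_000.0, limits)
infeasible = violated_constraints(250.0, 1000.0, 80_000.0, limits)
```

In the second call the requested torque lies above the envelope at high speed and the charging power exceeds its limit, so both constraints are reported.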
The Symbolic Reasoner, implemented as a lightweight rule-augmented graph network, performs approximate feasibility checking and symbolic deduction on $\mathcal{P}_{t}^{opt}$:

$$\mathcal{R}_{t}^{reason}=f_{symbolic}(\mathcal{P}_{t}^{opt}),\tag{8}$$

producing interpretable reasoning traces such as “Path A infeasible due to thermal constraint” or “Moderate recuperation suggested for energy balance.”
Finally, the Language & Control Generator produces the natural language answer $A_{t}^{text}$ conditioned on the full reasoning context $[\mathbf{Z}_{t}^{co},\mathcal{R}_{t}^{struct}]$, where $\mathcal{R}_{t}^{struct}=(\mathcal{R}_{t}^{scene},\mathcal{R}_{t}^{powertrain},\mathcal{R}_{t}^{reason})$. In parallel, a control prediction head (a small multilayer perceptron) regresses the suggested control parameters $\mathbf{a}_{t}^{ctrl}$ from the same context. This structured approach grounds reasoning in both visual semantics and powertrain physics, addressing the limitations of open-ended CoT prompts.
### 3.4 Physics-Guided Joint Training Objective
EVLA is trained end-to-end with a joint loss function $\mathcal{L}_{joint}$ that incorporates domain knowledge, extending beyond pure language modeling:

$$\mathcal{L}_{joint}=\lambda_{1}\mathcal{L}_{LM}+\lambda_{2}\mathcal{L}_{state}+\lambda_{3}\mathcal{L}_{control}+\lambda_{4}\mathcal{L}_{EEF}.\tag{9}$$

The language modeling loss $\mathcal{L}_{LM}$ is the standard autoregressive loss for the language answer $A_{t}^{text}$. The state prediction loss $\mathcal{L}_{state}$ supervises future vehicle state prediction: for samples with temporal sequences, the model predicts a future vehicle state $\hat{\mathbf{s}}_{t+\Delta t}^{veh}$ from $(\mathbf{Z}_{t}^{co},\mathcal{R}_{t}^{struct})$, supervised by the ground-truth state, thereby enforcing learning of electromechanical dynamics. The control consistency loss $\mathcal{L}_{control}$ minimizes the difference between predicted and expert controls $\mathbf{a}_{t}^{ctrl*}$ for samples with expert control signals from simulation or logged data. The EEF estimation loss $\mathcal{L}_{EEF}$ applies an $\ell_{2}$ loss between predicted and proxy EEF maps, where the proxy ground-truth is defined based on instantaneous vehicle power consumption $P_{total}(t)$ and geometric relationships to perceived objects or areas.
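The weighted sum in Eq. (9) is simple enough to spell out numerically. The sketch below combines four scalar loss terms using the weights reported later in the training protocol ($\lambda_{1}=1.0$, $\lambda_{2}=0.5$, $\lambda_{3}=0.2$, $\lambda_{4}=0.1$). The loss values and toy maps are invented for illustration, and reading the $\ell_{2}$ EEF loss as a mean squared error over map cells is an assumption.

```python
from typing import List, Sequence


def mean_squared_error(pred: List[List[float]], target: List[List[float]]) -> float:
    """Mean squared error over all cells of two equally sized 2D maps."""
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    return sum((p - t) ** 2 for p, t in zip(flat_pred, flat_target)) / len(flat_pred)


def joint_loss(l_lm: float, l_state: float, l_control: float, l_eef: float,
               lambdas: Sequence[float] = (1.0, 0.5, 0.2, 0.1)) -> float:
    """Weighted sum of the four loss terms, in the order used by Eq. (9)."""
    terms = (l_lm, l_state, l_control, l_eef)
    return sum(lam * term for lam, term in zip(lambdas, terms))


# Toy 2x2 EEF maps (illustrative values only).
pred_eef = [[0.1, 0.2], [0.3, 0.4]]
proxy_eef = [[0.0, 0.2], [0.3, 0.6]]
l_eef = mean_squared_error(pred_eef, proxy_eef)

total = joint_loss(l_lm=2.0, l_state=0.4, l_control=0.5, l_eef=l_eef)
```

With these weights the language modeling term dominates, and the EEF term acts as a light regularizer unless its raw magnitude is much larger than the others.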
### 3.5 Implementation and Training Details
We initialize EVLA’s vision and language components from a pretrained VLM (LLaVA-NeXT Liu et al. ([2024a](https://arxiv.org/html/2606.28938#bib.bib16))). The UCSE fusion layers, ESRC modules, control head, and EEF prediction head are newly initialized. Efficient fine-tuning strategies such as LoRA Hu et al. ([2022](https://arxiv.org/html/2606.28938#bib.bib4)) are applied to the large language model components to control parameter count Xin et al. ([2024](https://arxiv.org/html/2606.28938#bib.bib52)); Deng ([2025](https://arxiv.org/html/2606.28938#bib.bib75)). Training employs a hybrid dataset combining driving scene question-answer pairs (e.g., from DriveLM-nuScenes) with newly synthesized or simulated data, where textual queries are paired with corresponding vehicle state trajectories $\{\mathbf{s}^{veh}\}$ and optimal control sequences $\{\mathbf{a}^{ctrl*}\}$. This ensures exposure to the electromechanical concepts crucial for the joint training objectives. The AdamW optimizer with a cosine learning rate schedule is used for training.
## 4 Experiments
### 4.1 Dataset
#### 4.1.1 Training Dataset
For training in the Driving with Language track, we utilize the DriveLM-nuScenes dataset Sima et al. ([2024](https://arxiv.org/html/2606.28938#bib.bib1)). Derived from the nuScenes dataset Caesar et al. ([2020](https://arxiv.org/html/2606.28938#bib.bib2)), it comprises 4,072 sample frames across 696 scenes, resulting in a total of 377,983 question-answer (QA) pairs. Each scene consists of a series of sample frames, and each frame provides six camera images (each with a resolution of $1600\times 900$), information on several pre-defined key objects, and associated QA pairs. The key object information includes the status, visual description, and 2D bounding box coordinates within the images for crucial scene entities, each tagged with a unique KeyObj identifier. The QA pairs span multiple-choice, yes/no, and dialogue formats, covering tasks related to perception, prediction, planning, and driving behavior.
To improve the model’s ability to accurately identify these key objects, we leverage their metadata to generate auxiliary QA pairs for training\. An example is provided below, where the answer corresponds to the object’s description:
Q: The image dimensions are 1600 by 900\. The tag <<c4,CAM\_FRONT,920\.8,383\.3\>\> denotes a key object whose bounding box center in the CAM\_FRONT image is at \(920\.8, 383\.3\)\. What is the object <<c4,CAM\_FRONT,920\.8,383\.3\>\> and what is its state?
A: <<c4,CAM\_FRONT,920\.8,383\.3\>\> is a white truck located in front of the ego\-vehicle\. It is moving\.
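The auxiliary QA pairs can be produced from key-object metadata with a simple template. The sketch below reproduces the wording and tag format of the example above; the function name, argument names, and the way description and state are stored are hypothetical, since the paper does not describe the generation code.

```python
from typing import Dict


def make_key_object_qa(obj_id: str, camera: str, cx: float, cy: float,
                       description: str, state: str,
                       width: int = 1600, height: int = 900) -> Dict[str, str]:
    """Build one auxiliary QA pair from a key object's metadata (hypothetical helper)."""
    # Tag format copied from the example in the text.
    tag = f"<<{obj_id},{camera},{cx},{cy}>>"
    question = (
        f"The image dimensions are {width} by {height}. "
        f"The tag {tag} denotes a key object whose bounding box center "
        f"in the {camera} image is at ({cx}, {cy}). "
        f"What is the object {tag} and what is its state?"
    )
    answer = f"{tag} is {description}. It is {state}."
    return {"question": question, "answer": answer}


qa = make_key_object_qa(
    "c4", "CAM_FRONT", 920.8, 383.3,
    description="a white truck located in front of the ego-vehicle",
    state="moving",
)
```

Generating these pairs from metadata keeps the answer text tied to the annotated description, so no extra labeling is needed.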
To enhance the precision of spatial understanding, we employ the Depth Anything model Yang et al. ([2024](https://arxiv.org/html/2606.28938#bib.bib3)) to estimate pixel-wise depth for all training images. For each key object, we compute depth values for all pixels within its provided bounding box and take the 75th percentile as the representative object depth. This numerical value is then mapped to a categorical textual description (e.g., “close”, “far”) and appended to the object’s metadata.
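The percentile-then-categorize step can be sketched as follows, assuming the depth map is a row-major 2D list and the bounding box is half-open in pixel coordinates. The single threshold separating “close” from “far” is a made-up placeholder, since the paper does not state its cut-offs, and linear interpolation between order statistics is one of several possible percentile conventions.

```python
import statistics
from typing import List


def bbox_depth_p75(depth_map: List[List[float]], x1: int, y1: int, x2: int, y2: int) -> float:
    """75th percentile of depth values inside the half-open box [x1, x2) x [y1, y2)."""
    values = [depth_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    # With n=4, the third cut point is the 75th percentile; 'inclusive' interpolates
    # linearly between order statistics.
    return statistics.quantiles(values, n=4, method="inclusive")[2]


def depth_category(depth: float, far_threshold: float = 20.0) -> str:
    """Map a depth value to a coarse label; the threshold is a hypothetical choice."""
    return "close" if depth < far_threshold else "far"


# Toy 4x4 depth map: a near object in the top-left corner, background at depth 50.
depth_map = [
    [4.0, 6.0, 50.0, 50.0],
    [8.0, 10.0, 50.0, 50.0],
    [50.0, 50.0, 50.0, 50.0],
    [50.0, 50.0, 50.0, 50.0],
]
object_depth = bbox_depth_p75(depth_map, 0, 0, 2, 2)
label = depth_category(object_depth)
```

Using an upper percentile rather than the mean or minimum makes the estimate less sensitive to a few foreground pixels that fall inside the box but belong to something nearer than the object.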
#### 4.1.2 Validation Dataset
The validation dataset follows the same distribution as the training set, containing 799 sample frames from 149 nuScenes Caesar et al. ([2020](https://arxiv.org/html/2606.28938#bib.bib2)) scenes, with 15,480 questions in total. Different evaluation metrics are applied to different question types, and the final score is a weighted sum of these individual scores. For key objects present in the validation set, we extract their coordinates from the KeyObj tags. We then sample depth values from an $11\times 11$ pixel patch centered on each coordinate and compute the object’s representative depth using the same 75th-percentile method applied during training.
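For the validation-time variant, the only change is where the depth values come from: a fixed-size patch around the tagged center instead of a bounding box. The sketch below assumes the center is rounded to the nearest pixel and that the patch is clipped at image borders; neither detail is specified in the paper.

```python
import statistics
from typing import List


def patch_values(depth_map: List[List[float]], cx: float, cy: float, size: int = 11) -> List[float]:
    """Collect depth values from a size x size patch centered at (cx, cy), clipped to the image."""
    height, width = len(depth_map), len(depth_map[0])
    half = size // 2
    xi, yi = int(round(cx)), int(round(cy))
    xs = range(max(0, xi - half), min(width, xi + half + 1))
    ys = range(max(0, yi - half), min(height, yi + half + 1))
    return [depth_map[y][x] for y in ys for x in xs]


# Toy 20x20 depth map whose value equals the column index.
depth_map = [[float(x) for x in range(20)] for _ in range(20)]

center_patch = patch_values(depth_map, 10.0, 10.0)
corner_patch = patch_values(depth_map, 0.0, 0.0)
patch_p75 = statistics.quantiles(center_patch, n=4, method="inclusive")[2]
```

Clipping means patches near the image border contain fewer than 121 pixels, which is harmless for a percentile but would matter for any statistic that assumes a fixed count.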
### 4.2 Training Protocol
We fine-tune the baseline LLaVA model using the training data described in Section [4.1](https://arxiv.org/html/2606.28938#S4.SS1). To maintain computational and parameter efficiency, we avoid full-model fine-tuning and instead employ LoRA Hu et al. ([2022](https://arxiv.org/html/2606.28938#bib.bib4)) to adapt all fully-connected layers within LLaVA’s language model. Low-rank adaptation methods have proven effective for scaling large language models Liang et al. ([2025](https://arxiv.org/html/2606.28938#bib.bib18)). We also explore DoRA Liu et al. ([2024b](https://arxiv.org/html/2606.28938#bib.bib5)), an advanced variant of LoRA. For input preparation, we process each question as follows. If the question references specific key objects, we select the corresponding camera image that contains them, prepend the textual descriptions of those objects to the question, and use this combined input. If the question contains only directional cues, we select the corresponding directional image and prepend descriptions of all key objects visible from that perspective. For questions with neither object nor direction references, we use the front-facing image and prepend descriptions of all in-view key objects. All experiments are conducted using PyTorch on a platform with an Intel Xeon Gold 5218R CPU, eight NVIDIA RTX 3090 GPUs, and 256 GB of memory. For LoRA/DoRA, we set the rank and alpha to 8 and 16, respectively. We use a cosine learning rate scheduler with an initial rate of $2\times 10^{-5}$ and a warm-up phase for the first 3% of training steps. Each system is fine-tuned for one epoch to prevent overfitting.
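The learning rate schedule described above can be written as a small function of the step index. The sketch assumes a linear warm-up and a cosine decay to zero, both common defaults; the paper states only the peak rate ($2\times 10^{-5}$), the 3% warm-up fraction, and that the schedule is cosine.

```python
import math


def learning_rate(step: int, total_steps: int,
                  base_lr: float = 2e-5, warmup_frac: float = 0.03) -> float:
    """Cosine schedule with linear warm-up (warm-up shape and final value are assumptions)."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warm-up step.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


total_steps = 1000
schedule = [learning_rate(s, total_steps) for s in range(total_steps + 1)]
```

For 1,000 steps the warm-up covers the first 30 steps, after which the rate decays smoothly from its peak.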
For our proposed Electro-Visual-Language Assistant (EVLA) (detailed in Section [3](https://arxiv.org/html/2606.28938#S3)), we extend the above protocol. We initialize the vision and language components from the LLaVA-NeXT-7B checkpoint. The newly introduced modules (the Unified Co-State Encoder (UCSE), the Electro-aware Structured Reasoning Chain (ESRC), and the control/EEF prediction heads) are trained from scratch. LoRA fine-tuning Hu et al. ([2022](https://arxiv.org/html/2606.28938#bib.bib4)) is similarly applied to the large language model components with rank and alpha set to 8 and 16. The joint training objective $\mathcal{L}_{joint}$ (Section [3](https://arxiv.org/html/2606.28938#S3)) is optimized using the AdamW optimizer with empirically set loss weights: $\lambda_{1}=1.0$, $\lambda_{2}=0.5$, $\lambda_{3}=0.2$, and $\lambda_{4}=0.1$. EVLA is trained for 2 epochs on the combined dataset (DriveLM-nuScenes and synthetic powertrain-augmented data) with a batch size of 16 per GPU, following the same learning rate schedule as the baselines.
Figure 3: Training dynamics comparison between EVLA and LoRA\-LLaVA baseline\. \(a\) Training loss convergence showing EVLA’s faster optimization\. \(b\) Validation loss demonstrating better generalization\. \(c\) Validation score progression, where EVLA achieves significantly higher final performance\.
### 4\.3 Evaluation and Main Results
All models are evaluated on the validation set from Section [4\.1](https://arxiv.org/html/2606.28938#S4.SS1)\. The primary metric is the official competition score, a weighted average across different question types \(perception, prediction, planning, etc\.\)\. We also report Accuracy and BERTScore Zhang et al\. \([2020](https://arxiv.org/html/2606.28938#bib.bib6)\) for language quality assessment\. We compare our proposed EVLA model against baseline fine\-tuning methods \(LoRA and DoRA applied to LLaVA\)\. Additionally, following common practice, we implement a Fusion system that selects the best answer for each question type from the individual baseline systems via a voting or score\-maximization strategy\.
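The score-maximization variant of the Fusion system can be sketched as follows: for each question type, pick the baseline with the highest score on that type, then route every question to the chosen system. This is an illustrative sketch only. The function names, the dictionary layouts, and the alphabetical tie-break are assumptions, and the text does not specify on which data split the per-type scores are measured.

```python
from typing import Dict


def select_fusion_sources(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each question type to the system with the highest score on it.

    `scores[system][question_type]` holds a per-type score for each baseline.
    Ties are broken alphabetically by system name (an arbitrary choice).
    """
    question_types = sorted({qt for per_type in scores.values() for qt in per_type})
    chosen: Dict[str, str] = {}
    for qt in question_types:
        candidates = {s: per_type[qt] for s, per_type in scores.items() if qt in per_type}
        chosen[qt] = max(sorted(candidates), key=candidates.get)
    return chosen


def fuse_answers(
    answers: Dict[str, Dict[str, str]],
    question_types: Dict[str, str],
    chosen: Dict[str, str],
) -> Dict[str, str]:
    """Take each question's answer from the system chosen for its type.

    `answers[system][question_id]` is that system's answer;
    `question_types[question_id]` is the type of the question.
    """
    return {qid: answers[chosen[qt]][qid] for qid, qt in question_types.items()}
```

A voting strategy would instead need at least three systems, or a confidence signal, to resolve disagreements between answers.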
The overall performance is summarized in Table [1](https://arxiv.org/html/2606.28938#S4.T1)\. EVLA achieves the highest scores across all metrics, establishing a new state\-of\-the\-art on this benchmark\. It significantly outperforms the best individual baseline \(LoRA\-LLaVA\) by \+0\.0871 in the final score and \+5\.6% in Accuracy\. The fusion of baseline systems provides a moderate performance boost but still falls short of EVLA, indicating that our unified architecture is more effective than a post\-hoc ensemble of specialized models\.
Table 1: Overall performance comparison on the validation set\. The best results are in bold\.
A detailed breakdown by question category is provided in Table [2](https://arxiv.org/html/2606.28938#S4.T2)\. EVLA consistently ranks first in every category\. The most substantial improvements are observed in Prediction and Planning tasks, where the model’s ability to integrate powertrain states and perform structured reasoning via the ESRC provides a decisive advantage over baselines relying solely on visual\-language correlation\. Even for Perception tasks, the richer scene representation from the Unified Co\-State Encoder contributes to more accurate object and state identification, consistent with recent advances in object detection and semantic segmentation Ren et al\. \([2024](https://arxiv.org/html/2606.28938#bib.bib19)\)\. Figure [4](https://arxiv.org/html/2606.28938#S4.F4) visualizes the performance gains across all question categories\.
Figure 4: Performance comparison across question categories\. EVLA \(blue\) consistently outperforms both DoRA\-LLaVA \(green\) and LoRA\-LLaVA \(yellow\) baselines, with the most significant improvements in Prediction and Planning tasks\.
Table 2: Detailed performance breakdown by question type \(Score\)\.
### 4\.4 Ablation Study on EVLA Components
To validate the contribution of each key component in the EVLA framework, we conduct an ablation study, with results summarized in Table [3](https://arxiv.org/html/2606.28938#S4.T3)\. The baseline is a variant where the vehicle state input and all corresponding modules \(UCSE’s state fusion, ESRC’s powertrain parser, and physics\-guided losses\) are removed, effectively reducing it to an enhanced visual\-language model trained with our pipeline\.
Integrating the Unified Co\-State Encoder \(with vehicle state\) while using a standard language model head instead of the ESRC leads to noticeable gains, particularly in Prediction and Planning scores, demonstrating the benefit of a joint visual\-powertrain representation\. Incorporating the Electro\-aware Structured Reasoning Chain while using simple multi\-modal input concatenation \(instead of UCSE\) also improves results, highlighting the value of explicit, structured reasoning\. The complete model, integrating both UCSE and ESRC, achieves the best performance\. The synergy between a unified latent representation and a deterministic reasoning chain is evident, as the full model’s improvement exceeds the sum of gains from individual components\. This confirms our design hypothesis that jointly modeling scene dynamics and vehicle physics is crucial for advanced driving assistance\. Figure [5](https://arxiv.org/html/2606.28938#S4.F5) provides a visual comparison of component contributions\.
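The synergy statement above can be written as an explicit inequality. The symbols are introduced here for illustration only: $S_{\text{base}}$, $S_{\text{UCSE}}$, $S_{\text{ESRC}}$, and $S_{\text{full}}$ denote the final scores of the four ablation variants (baseline, UCSE only, ESRC only, and the complete model).

```latex
\underbrace{S_{\text{full}} - S_{\text{base}}}_{\text{gain of the full model}}
\;>\;
\underbrace{\left(S_{\text{UCSE}} - S_{\text{base}}\right)}_{\text{gain from UCSE alone}}
+
\underbrace{\left(S_{\text{ESRC}} - S_{\text{base}}\right)}_{\text{gain from ESRC alone}}
\quad\Longleftrightarrow\quad
S_{\text{full}} + S_{\text{base}} \;>\; S_{\text{UCSE}} + S_{\text{ESRC}}
```

The right-hand form gives a quick check against any row set of an ablation table: the sum of the full and baseline scores must exceed the sum of the two single-component scores.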
Figure 5: Ablation study visualization showing the contribution of each EVLA component\. The Full EVLA model \(rightmost\) achieves significant improvements over the baseline across all task categories, with percentage gains annotated\.
Table 3: Ablation study of the proposed EVLA components\.
### 4\.5 Efficiency of Physics\-Guided Training
We analyze the impact of the physics\-guided joint training objective $\mathcal{L}_{joint}$\. Table [4](https://arxiv.org/html/2606.28938#S4.T4) shows the performance when ablating specific loss terms during EVLA’s training\. Using only the language modeling loss \($\mathcal{L}_{LM}$\) yields the lowest performance\. Incorporating the state prediction loss \($\mathcal{L}_{state}$\) and the control consistency loss \($\mathcal{L}_{control}$\) significantly boosts performance on Prediction and Planning tasks, as they enforce the learning of vehicle dynamics\. Adding the EEF estimation loss \($\mathcal{L}_{EEF}$\) further refines the model’s spatial understanding with respect to energy efficiency, culminating in the best overall score\. This demonstrates that our multi\-task learning strategy effectively injects domain knowledge, resulting in a more capable and physically\-grounded model\. Figure [6](https://arxiv.org/html/2606.28938#S4.F6) visualizes the incremental performance gains from each loss component\.
Figure 6: Impact of training loss components on EVLA performance\. Each additional loss term contributes to improved final score, with arrows indicating incremental gains\.
Table 4: Impact of different components of the joint training loss $\mathcal{L}_{joint}$ on EVLA’s final score\.
### 4\.6 Inference Framework Comparison
The original baseline employs a complex multi\-stage inference pipeline involving offline depth estimation, object state querying, and manual prompt engineering\. In contrast, EVLA’s inference is streamlined and end\-to\-end\. The UCSE internally performs depth and state estimation via the EEF map and latent representation, while the ESRC replaces external Chain\-of\-Thought prompting with an internal structured reasoning process\. To compare efficiency, we report the average inference time per sample in Table [5](https://arxiv.org/html/2606.28938#S4.T5) and Figure [7](https://arxiv.org/html/2606.28938#S4.F7)\. Despite its richer modeling, EVLA is more efficient than the original multi\-stage pipeline because it avoids sequential calls to external models \(e\.g\., depth estimator, separate VLM\) and complex prompt construction\. This shows that our integrated architecture offers a favorable accuracy\-speed trade\-off, enhancing practicality for real\-time applications\.
Figure 7: Inference time comparison\. EVLA achieves 1\.6$\times$ faster inference compared to the multi\-stage baseline pipeline\.
Table 5: Average inference time per sample \(in seconds\) on a single NVIDIA RTX 3090 GPU\.
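For readers who want to reproduce this kind of comparison, the sketch below measures average per-sample latency and converts between the two ways the gain is reported. The helper names and the warm-up default are assumptions, and `run` stands for any callable that performs one inference. If the 36% figure in the abstract is read as a reduction in per-sample time, the implied speedup is $1/(1-0.36)\approx 1.56\times$, which agrees with the 1\.6$\times$ in Figure 7 after rounding.

```python
import time
from statistics import mean
from typing import Any, Callable, Iterable


def average_latency(run: Callable[[Any], Any], samples: Iterable[Any], warmup: int = 1) -> float:
    """Average wall-clock seconds per sample, after a few untimed warm-up calls."""
    items = list(samples)
    for item in items[:warmup]:
        run(item)  # warm-up calls are excluded from the timings
    timings = []
    for item in items:
        start = time.perf_counter()
        run(item)
        timings.append(time.perf_counter() - start)
    return mean(timings)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def time_reduction(baseline_seconds: float, candidate_seconds: float) -> float:
    """Fraction of the baseline's per-sample time that the candidate saves."""
    return 1.0 - candidate_seconds / baseline_seconds
```

When timing GPU inference in PyTorch, `run` should call `torch.cuda.synchronize()` before returning, since CUDA kernels launch asynchronously and the timer would otherwise stop too early.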
### 4\.7 Parameter Sensitivity Analysis
We investigate the sensitivity of EVLA to key hyperparameters, including the loss weights $\lambda_{state}$ and $\lambda_{control}$, as well as the LoRA rank\. Figure [8](https://arxiv.org/html/2606.28938#S4.F8) presents the results\. For $\lambda_{state}$, performance peaks at 0\.5, with both lower and higher values leading to decreased scores\. Similarly, $\lambda_{control}=0.2$ achieves optimal performance\. The LoRA rank shows relatively stable performance across values from 8 to 32, with rank 8 selected for computational efficiency\. These results demonstrate that EVLA is robust to hyperparameter choices within reasonable ranges\.
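A sensitivity study of this kind is commonly run as a one-at-a-time sweep: one hyperparameter is varied while the others stay at their selected values. The sketch below assumes that protocol, which the text does not state explicitly. The `evaluate` callable is hypothetical and stands for a full train-and-validate run that returns a score where higher is better.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def sweep_one(
    evaluate: Callable[[Dict[str, Any]], float],
    defaults: Dict[str, Any],
    name: str,
    values: Iterable[Any],
) -> Tuple[Any, Dict[Any, float]]:
    """Vary `name` over `values` with all other settings fixed at `defaults`.

    Returns the best value and the score obtained for every value tried.
    """
    results: Dict[Any, float] = {}
    for value in values:
        config = dict(defaults)  # copy so the defaults are never mutated
        config[name] = value
        results[value] = evaluate(config)
    best = max(results, key=results.get)
    return best, results
```

With the selected values from the text as defaults (`lambda_state=0.5`, `lambda_control=0.2`, `lora_rank=8`), calling `sweep_one` once per hyperparameter yields the three curves of a figure like Figure 8. The grid of values to try is not given in the text and would have to be chosen by the reader.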
Figure 8: Parameter sensitivity analysis\. \(a\) State loss weight $\lambda_{state}$ with optimal value at 0\.5\. \(b\) Control loss weight $\lambda_{control}$ with optimal value at 0\.2\. \(c\) LoRA rank showing stable performance across values 8\-32\. Red dashed lines indicate selected values\.
### 4\.8 Discussion
The experimental results consistently demonstrate the superiority of our Electro\-Visual\-Language Assistant \(EVLA\)\. By fundamentally extending the model’s understanding to include the vehicle’s powertrain state and embedding a structured reasoning process, EVLA achieves significant gains over strong visual\-language model fine\-tuning baselines\. The ablation studies confirm that both the Unified Co\-State Encoder \(UCSE\) and the Electro\-aware Structured Reasoning Chain \(ESRC\) are critical to this success\. Furthermore, the physics\-guided training objectives ensure the model’s reasoning is grounded in plausible dynamics\. EVLA’s performance advantage is most pronounced in complex tasks like prediction and planning, which require a deeper understanding of cause\-and\-effect relationships involving vehicle dynamics\. This work establishes that for next\-generation driving assistants, moving beyond a passive visual question\-answering paradigm to an active, state\-aware, and physically\-grounded reasoning framework is essential\.
## 5 Conclusion
In this work, we introduce the Electro\-Visual\-Language Assistant \(EVLA\), a novel framework that advances driving assistants by integrating multi\-modal visual\-language understanding with real\-time vehicle powertrain state awareness\. Key innovations of EVLA include the Unified Co\-State Encoder \(UCSE\), which learns a joint representation of scene and vehicle dynamics, and the Electro\-aware Structured Reasoning Chain \(ESRC\), designed for explicit, structured reasoning grounded in physical constraints\.
Our comprehensive experimental evaluation demonstrates the effectiveness of EVLA\. On the DriveLM\-nuScenes benchmark, EVLA significantly outperforms strong fine\-tuning baselines \(e\.g\., LoRA/DoRA\-LLaVA\), achieving a final score of 0\.8548, an accuracy of 79\.4%, and a BERTScore of 0\.8927\. Notably, it exhibits substantial gains in complex reasoning tasks such as prediction and planning\. Ablation studies confirm that both the UCSE and ESRC are critical components, with their combined integration yielding synergistic performance improvements beyond individual contributions\. Additionally, the physics\-guided joint training objective—incorporating state prediction, control consistency, and Energy\-Efficiency Field estimation losses—proves essential for learning physically\-grounded representations and achieving optimal performance\. Compared to multi\-stage baselines, EVLA also offers a more efficient, end\-to\-end inference pipeline\.
This work establishes that explicitly modeling and reasoning with vehicle electro\-mechanical states within a unified visual\-language framework is a promising direction for developing more capable, context\-aware, and energy\-efficient driving assistants\. As with other large language model systems, considerations around robustness and reliability remain important Peng et al\. \([2024](https://arxiv.org/html/2606.28938#bib.bib21)\)\. A current limitation is the reliance on simulated or synthesized data for powertrain states; future work will focus on validation with real\-world vehicle data and extending the framework to more complex, long\-horizon driving scenarios\.
## References
- \[1\] \(2025\) Multi\-agent collaborative framework for intelligent it operations: an aoi system with context\-aware compression and dynamic task scheduling\. arXiv preprint arXiv:2512\.13956\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p3.1)\.
- \[2\] Z\. Bi, L\. Chen, J\. Song, H\. Luo, E\. Ge, J\. Huang, T\. Wang, K\. Chen, C\. X\. Liang, Z\. Wei, et al\. \(2025\) Exploring efficiency frontiers of thinking budget in medical reasoning: scaling laws between computational resources and reasoning quality\. arXiv:2508\.12140\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p4.1)\.
- \[3\] H\. Caesar, V\. Bankiti, A\. H\. Lang, S\. Vora, V\. E\. Liong, Q\. Xu, A\. Krishnan, Y\. Pan, G\. Baldan, and O\. Beijbom \(2020\) NuScenes: a multimodal dataset for autonomous driving\. In IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\), pp\. 11621–11631\. Cited by: [§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p2.1), [§4\.1\.1](https://arxiv.org/html/2606.28938#S4.SS1.SSS1.p1.1), [§4\.1\.2](https://arxiv.org/html/2606.28938#S4.SS1.SSS2.p1.1)\.
- \[4\] Z\. Cao, Y\. He, A\. Liu, J\. Xie, Z\. Wang, and F\. Chen \(2025\) CoFi\-dec: hallucination\-resistant decoding via coarse\-to\-fine generative feedback in large vision\-language models\. In Proceedings of the 33rd ACM International Conference on Multimedia, pp\. 10709–10718\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p4.1)\.
- \[5\] Z\. Cao, Y\. He, A\. Liu, J\. Xie, Z\. Wang, and F\. Chen \(2025\) PurifyGen: a risk\-discrimination and semantic\-purification model for safe text\-to\-image generation\. In Proceedings of the 33rd ACM International Conference on Multimedia, pp\. 816–825\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p3.1)\.
- \[6\] H\. Chen, J\. Peng, D\. Min, C\. Sun, K\. Chen, Y\. Yan, X\. Yang, and L\. Cheng \(2025\) Mvi\-bench: a comprehensive benchmark for evaluating robustness to misleading visual inputs in lvlms\. arXiv preprint arXiv:2511\.14159\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p4.1)\.
- \[7\] K\. Chen, Z\. Lin, Z\. Xu, Y\. Shen, Y\. Yao, J\. Rimchala, J\. Zhang, and L\. Huang \(2025\) R2i\-bench: benchmarking reasoning\-driven text\-to\-image generation\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp\. 12606–12641\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p4.1)\.
- \[8\] K\. Chen, Z\. Xu, Y\. Shen, Z\. Lin, Y\. Yao, and L\. Huang \(2025\) SuperFlow: training flow matching models with rl on the fly\. arXiv preprint arXiv:2512\.17951\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p4.1)\.
- \[9\] X\. Deng \(2025\) Enhancing neural network performance on tabular data via knowledge distillation and rankgauss transformation\. In 2025 6th International Conference on Big Data & Artificial Intelligence & Software Engineering \(ICBASE\), pp\. 418–423\. Cited by: [§3\.5](https://arxiv.org/html/2606.28938#S3.SS5.p1.2)\.
- \[10\] X\. Deng \(2026\) Graph inference towards icd coding\. arXiv preprint arXiv:2601\.07496\. Cited by: [§3\.2](https://arxiv.org/html/2606.28938#S3.SS2.p2.1)\.
- \[11\] X\. Han, X\. Gao, X\. Qu, and Z\. Yu \(2025\) Multi\-agent medical decision consensus matrix system: an intelligent collaborative framework for oncology mdt consultations\. arXiv preprint arXiv:2512\.14321\. Cited by: [§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p1.1)\.
- \[12\] Y\. He, S\. Li, K\. Li, J\. Wang, B\. Li, T\. Shi, Y\. Xin, K\. Li, J\. Yin, M\. Zhang, et al\. \(2025\) GE\-adapter: a general and efficient adapter for enhanced video editing with pretrained text\-to\-image diffusion models\. Expert Systems with Applications, pp\. 129649\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p4.1)\.
- \[13\] Y\. He, S\. Li, J\. Wang, K\. Li, X\. Song, X\. Yuan, K\. Li, K\. Lu, M\. Huo, J\. Tang, et al\. \(2025\) Enhancing low\-cost video editing with lightweight adaptors and temporal\-aware inversion\. arXiv preprint arXiv:2501\.04606\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p2.1)\.
- \[14\] E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\) LoRA: low\-rank adaptation of large language models\. In International Conference on Learning Representations \(ICLR\)\. Cited by: [§3\.5](https://arxiv.org/html/2606.28938#S3.SS5.p1.2), [§4\.2](https://arxiv.org/html/2606.28938#S4.SS2.p1.1), [§4\.2](https://arxiv.org/html/2606.28938#S4.SS2.p2.5)\.
- \[15\] Y\. Huang, B\. Li, N\. Li, Z\. Wang, K\. Chen, H\. Ge, Q\. Si, Y\. Shen, R\. Yang, G\. Wang, et al\. \(2026\) GUI agents for continual game generation\. arXiv preprint arXiv:2605\.28258\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p4.1)\.
- \[16\] C\. X\. Liang, Z\. Bi, T\. Wang, M\. Liu, X\. Song, Y\. Zhang, J\. Song, Q\. Niu, B\. Peng, K\. Chen, et al\. \(2025\) Low\-rank adaptation for scalable large language models: a comprehensive survey\. Cited by: [§4\.2](https://arxiv.org/html/2606.28938#S4.SS2.p1.1)\.
- \[17\] C\. X\. Liang, P\. Tian, C\. H\. Yin, Y\. Yua, W\. An\-Hou, L\. Ming, T\. Wang, Z\. Bi, and M\. Liu \(2024\) A comprehensive survey and guide to multimodal large language models in vision\-language tasks\. arXiv:2411\.06284\. Cited by: [§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p1.1)\.
- \[18\] S\. Lin \(2025\) Abductive inference in retrieval\-augmented language models: generating and validating missing premises\. External Links: 2511\.04020, [Link](https://arxiv.org/abs/2511.04020) Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p2.1)\.
- \[19\] S\. Lin \(2025\) Hybrid fuzzing with llm\-guided input mutation and semantic feedback\. External Links: 2511\.03995, [Link](https://arxiv.org/abs/2511.03995) Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p3.1)\.
- \[20\] S\. Lin \(2025\) LLM\-driven adaptive source\-sink identification and false positive mitigation for static analysis\. External Links: 2511\.04023, [Link](https://arxiv.org/abs/2511.04023) Cited by: [§2\.3](https://arxiv.org/html/2606.28938#S2.SS3.p1.1)\.
- \[21\] Y\. Lin, M\. Wang, L\. Xu, and F\. Zhang \(2017\) The maximum forcing number of a polyomino\. Australas\. J\. Combin 69, pp\. 306–314\. Cited by: [§3\.2](https://arxiv.org/html/2606.28938#S3.SS2.p2.1)\.
- \[22\] H\. Liu, C\. Li, Y\. Li, B\. Li, Y\. Zhang, S\. Shen, and Y\. J\. Lee \(2024\) LLaVA\-next: improved reasoning, ocr, and world knowledge\. arXiv preprint\. Cited by: [§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p2.1), [§3\.5](https://arxiv.org/html/2606.28938#S3.SS5.p1.2)\.
- \[23\] H\. Liu, C\. Li, Q\. Wu, and Y\. J\. Lee \(2023\) Visual instruction tuning\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p1.1), [§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p2.1)\.
- \[24\] S\. Liu, C\. Wang, H\. Yin, P\. Molchanov, Y\. F\. Wang, K\. Cheng, and M\. Chen \(2024\) DoRA: weight\-decomposed low\-rank adaptation\. In International Conference on Machine Learning \(ICML\)\. Cited by: [§4\.2](https://arxiv.org/html/2606.28938#S4.SS2.p1.1)\.
- \[25\] M\. Mo, Y\. Tan, H\. Zhang, H\. Zhang, and Y\. He \(2026\) ShieldedCode: learning robust representations for virtual machine protected code\. arXiv preprint arXiv:2601\.20679\. Cited by: [§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p1.1)\.
- \[26\] W\. Mu\-Jiang\-shan, Y\. Jun, L\. Shang\-wei, et al\. \(2010\) Ordered and hamilton digraphs\. Chinese Quarterly Journal of Mathematics 25\(3\), pp\. 317–326\. Cited by: [§3\.2](https://arxiv.org/html/2606.28938#S3.SS2.p2.1)\.
- \[27\] Q\. Niu, K\. Chen, M\. Li, P\. Feng, Z\. Bi, L\. K\. Yan, Y\. Zhang, C\. H\. Yin, C\. Fei, J\. Liu, B\. Peng, T\. Wang, Y\. Wang, S\. Chen, and M\. Liu \(2024\) From text to multimodality: exploring the evolution and impact of large language models in medical practice\. External Links: 2410\.01812, [Link](https://arxiv.org/abs/2410.01812) Cited by: [§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p1.1)\.
- \[28\] Q\. Niu, J\. Liu, Z\. Bi, P\. Feng, B\. Peng, K\. Chen, M\. Li, L\. K\. Yan, Y\. Zhang, C\. H\. Yin, C\. Fei, T\. Wang, Y\. Wang, S\. Chen, and M\. Liu \(2024\) Large language models and cognitive science: a comprehensive review of similarities, differences, and challenges\. External Links: 2409\.02387, [Link](https://arxiv.org/abs/2409.02387) Cited by: [§2\.3](https://arxiv.org/html/2606.28938#S2.SS3.p1.1)\.
- \[29\] C\. Pan, Y\. Qu, Y\. Yao, and M\. Wang \(2024\) HybridGNN: a self\-supervised graph neural network for efficient maximum matching in bipartite graphs\. Symmetry 16\(12\), pp\. 1631\. Cited by: [§2\.3](https://arxiv.org/html/2606.28938#S2.SS3.p3.1)\.
- \[30\] B\. Peng, K\. Chen, M\. Li, P\. Feng, Z\. Bi, J\. Liu, and Q\. Niu \(2024\) Securing large language models: addressing bias, misinformation, and prompt attacks\. arXiv:2409\.08087\. Cited by: [§5](https://arxiv.org/html/2606.28938#S5.p3.1)\.
- \[31\] H\. Qi, Z\. Hu, Z\. Yang, J\. Zhang, J\. J\. Wu, C\. Cheng, C\. Wang, and L\. Zheng \(2022\) Capacitive aptasensor coupled with microfluidic enrichment for real\-time detection of trace sars\-cov\-2 nucleocapsid protein\. Analytical chemistry 94\(6\), pp\. 2812–2819\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p1.1)\.
- \[32\] D\. Qu and Y\. Ma \(2025\) Magnet\-bn: markov\-guided bayesian neural networks for calibrated long\-horizon sequence forecasting and community tracking\. Mathematics 13\(17\), pp\. 2740\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p3.1)\.
- \[33\] A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark, et al\. \(2021\) Learning transferable visual models from natural language supervision\. In International Conference on Machine Learning \(ICML\), pp\. 8748–8763\. Cited by: [§3\.2](https://arxiv.org/html/2606.28938#S3.SS2.p1.9)\.
- \[34\] J\. Ren, Z\. Bi, Q\. Niu, J\. Liu, B\. Peng, S\. Zhang, X\. Pan, J\. Wang, K\. Chen, C\. H\. Yin, et al\. \(2024\) Deep learning and machine learning–object detection and semantic segmentation: from theory to applications\. arXiv:2410\.15584\. Cited by: [§4\.3](https://arxiv.org/html/2606.28938#S4.SS3.p3.1)\.
- \[35\] J\. Shi, Y\. Lin, Y\. Hua, Z\. Wang, Z\. Zhang, W\. Zheng, Y\. Song, K\. Lu, and S\. Lu \(2025\) Multi\-scenario highway lane\-change intention prediction: a physics\-informed ai framework for three\-class classification\. arXiv preprint arXiv:2509\.17354\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p2.1)\.
- \[36\] C\. Sima, K\. Renz, K\. Chitta, L\. Chen, H\. Zhang, C\. Xie, P\. Luo, A\. Geiger, and H\. Li \(2024\) DriveLM: driving with graph visual question answering\. In European Conference on Computer Vision \(ECCV\)\. Cited by: [§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p2.1), [§4\.1\.1](https://arxiv.org/html/2606.28938#S4.SS1.SSS1.p1.1)\.
- \[37\] X\. Song, Y\. He, S\. Li, J\. Wang, H\. He, X\. Yuan, R\. Wang, J\. Chen, K\. Li, K\. Lu, et al\. \(2025\) Efficient temporal consistency in diffusion\-based video editing with adaptor modules: a theoretical framework\. arXiv preprint arXiv:2504\.16016\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p3.1)\.
- \[38\] X\. Tian, J\. Gu, B\. Li, Y\. Liu, C\. Hu, Y\. Wang, K\. Zhan, P\. Jia, X\. Lang, and H\. Zhao \(2024\) DriveVLM: the convergence of autonomous driving and large vision\-language models\. arXiv preprint arXiv:2402\.12289\. Cited by: [§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p1.1)\.
- \[39\] Y\. Tian, Z\. Yang, C\. Liu, Y\. Su, Z\. Hong, Z\. Gong, and J\. Xu \(2025\) CenterMamba\-sam: center\-prioritized scanning and temporal prototypes for brain lesion segmentation\. External Links: 2511\.01243, [Link](https://arxiv.org/abs/2511.01243) Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p2.1)\.
- \[40\] H\. Wang, X\. Zhang, Y\. Xia, and X\. Wu \(2023\) An intelligent blockchain\-based access control framework with federated learning for genome\-wide association studies\. Computer Standards & Interfaces 84, pp\. 103694\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p3.1)\.
- \[41\] M\. Wang, W\. Yang, and S\. Wang \(2013\) Conditional matching preclusion number for the cayley graph on the symmetric group\. Acta Math\. Appl\. Sin\. \(Chinese Series\) 36\(5\), pp\. 813–820\. Cited by: [§3\.2](https://arxiv.org/html/2606.28938#S3.SS2.p2.1)\.
- \[42\] M\. Wang and S\. Wang \(2016\) Diagnosability of cayley graph networks generated by transposition trees under the comparison diagnosis model\. Annals of Applied Mathematics 32\(2\), pp\. 166–173\. Cited by: [§2\.3](https://arxiv.org/html/2606.28938#S2.SS3.p3.1)\.
- \[43\] M\. Wang, S\. Xu, J\. Jiang, D\. Xiang, and S\. Hsieh \(2025\) Global reliable diagnosis of networks based on self\-comparative diagnosis model and g\-good\-neighbor property\. Journal of Computer and System Sciences, pp\. 103698\. Cited by: [§2\.3](https://arxiv.org/html/2606.28938#S2.SS3.p3.1)\.
- \[44\] S\. Wang, M\. Wang, K\. Feng, S\. Lin, and M\. Zhang \(2012\) Relation of the isolated scattering number of a graph and its complement graph\. Journal of Shanxi University \(Natural Science Edition\) 35\(2\), pp\. 206–210\. Cited by: [§3\.2](https://arxiv.org/html/2606.28938#S3.SS2.p2.1)\.
- \[45\] S\. Wang and M\. Wang \(2018\) The edge connectivity of expanded k\-ary n\-cubes\. Discrete Dynamics in Nature and Society 2018\(1\), pp\. 7867342\. Cited by: [§2\.3](https://arxiv.org/html/2606.28938#S2.SS3.p3.1)\.
- \[46\] S\. Wang and M\. Wang \(2019\) A note on the connectivity of m\-ary n\-dimensional hypercubes\. Parallel Processing Letters 29\(04\), pp\. 1950017\. Cited by: [§2\.3](https://arxiv.org/html/2606.28938#S2.SS3.p3.1)\.
- \[47\] S\. Wang, J\. Wangmu, Z\. Qi, and Y\. Ren \(2011\) Embedding paths into the 4\-ary n\-cube with faulty nodes\. In 2011 International Conference on Consumer Electronics, Communications and Networks \(CECNet\), pp\. 4949–4951\. Cited by: [§3\.2](https://arxiv.org/html/2606.28938#S3.SS2.p2.1)\.
- \[48\] T\. Wang, S\. Chen, Y\. Wang, Y\. Zhang, X\. Song, Z\. Bi, M\. Liu, Q\. Niu, J\. Liu, P\. Feng, X\. Sun, B\. Peng, C\. Zhang, K\. Chen, M\. Li, C\. Fei, and L\. K\. Yan \(2025\) From in silico to in vitro: a comprehensive guide to validating bioinformatics findings\. External Links: 2502\.03478, [Link](https://arxiv.org/abs/2502.03478) Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p3.1)\.
- \[49\] T\. Wang, M\. Liu, B\. Peng, X\. Song, C\. Zhang, X\. Sun, Q\. Niu, J\. Liu, S\. Chen, K\. Chen, M\. Li, P\. Feng, Z\. Bi, Y\. Wang, Y\. Zhang, C\. Fei, and L\. K\. Yan \(2024\) From bench to bedside: a review of clinical trials in drug discovery and development\. External Links: 2412\.09378, [Link](https://arxiv.org/abs/2412.09378) Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p3.1)\.
- \[50\] Y\. Wang and S\. Sayil \(2024\) Soft error evaluation and mitigation in gate diffusion input circuits\. In 2024 IEEE 6th International Conference on Power, Intelligent Computing and Systems \(ICPICS\), pp\. 121–128\. Cited by: [§2\.2](https://arxiv.org/html/2606.28938#S2.SS2.p1.1)\.
- \[51\] Y\. Wang \(2024\) Low\-power design of advanced image processing algorithms under fpga in real\-time applications\. In 2024 IEEE 4th International Conference on Power, Electronics and Computer Applications \(ICPECA\), pp\. 1080–1084\. Cited by: [§2\.2](https://arxiv.org/html/2606.28938#S2.SS2.p1.1)\.
- \[52\] Y\. Wang \(2025\) Zynq soc\-based acceleration of retinal blood vessel diameter measurement\. Archives of Advanced Engineering Science, pp\. 1–9\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p3.1)\.
- \[53\] Z\. Wei, H\. An, Y\. Yao, W\. Su, G\. Li, Saifullah, B\. Sun, and M\. Wang \(2025\) FSTGAT: financial spatio\-temporal graph attention network for non\-stationary financial systems and its application in stock price prediction\. Symmetry 17\(8\), pp\. 1344\. Cited by: [§2\.2](https://arxiv.org/html/2606.28938#S2.SS2.p1.1), [§3\.2](https://arxiv.org/html/2606.28938#S3.SS2.p2.1)\.
- \[54\] Z\. Wei, P\. Hu, S\. Lang, H\. Yan, L\. Mei, Y\. Zhang, C\. Yang, J\. Hao, and Z\. Han \(2025\) Automated red\-teaming framework for large language model security assessment: a comprehensive attack generation and detection system\. arXiv preprint arXiv:2512\.20677\. Cited by: [§2\.3](https://arxiv.org/html/2606.28938#S2.SS3.p1.1)\.
- \[55\] X\. Wu, J\. Dong, W\. Bao, B\. Zou, L\. Wang, and H\. Wang \(2024\) Augmented intelligence of things for emergency vehicle secure trajectory prediction and task offloading\. IEEE Internet of Things Journal 11\(22\), pp\. 36030–36043\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p2.1)\.
- \[56\] X\. Wu, H\. Wang, W\. Tan, D\. Wei, and M\. Shi \(2020\) Dynamic allocation strategy of vm resources with fuzzy transfer learning method\. Peer\-to\-Peer Networking and Applications 13\(6\), pp\. 2201–2213\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p3.1)\.
- \[57\] X\. Wu, H\. Wang, Y\. Zhang, B\. Zou, and H\. Hong \(2024\) A tutorial\-generating method for autonomous online learning\. IEEE Transactions on Learning Technologies 17, pp\. 1532–1541\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p3.1)\.
- \[58\] X\. Wu, Y\. Zhang, K\. Lai, M\. Yang, G\. Yang, and H\. Wang \(2024\) A novel centralized federated deep fuzzy neural network with multi\-objectives neural architecture search for epistatic detection\. IEEE Transactions on Fuzzy Systems 33\(1\), pp\. 94–107\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p3.1)\.
- \[59\] X\. Wu, Y\. Zhang, M\. Shi, P\. Li, R\. Li, and N\. N\. Xiong \(2022\) An adaptive federated learning scheme with differential privacy preserving\. Future Generation Computer Systems 127, pp\. 362–372\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p3.1)\.
- \[60\] D\. Xiang, S\. Hsieh, et al\. \(2025\) G\-good\-neighbor diagnosability under the modified comparison model for multiprocessor systems\. Theoretical Computer Science 1028, pp\. 115027\. Cited by: [§2\.3](https://arxiv.org/html/2606.28938#S2.SS3.p3.1)\.
- \[61\] Y\. Xin, J\. Du, Q\. Wang, Z\. Lin, and K\. Yan \(2024\) Vmt\-adapter: parameter\-efficient transfer learning for multi\-task dense scene understanding\. In Proceedings of the AAAI conference on artificial intelligence, Vol\. 38, pp\. 16085–16093\. Cited by: [§3\.5](https://arxiv.org/html/2606.28938#S3.SS5.p1.2)\.
- \[62\] Y\. Xin, Q\. Qin, S\. Luo, K\. Zhu, J\. Yan, Y\. Tai, J\. Lei, Y\. Cao, K\. Wang, Y\. Wang, et al\. \(2025\) Lumina\-dimoo: an omni diffusion large language model for multi\-modal generation and understanding\. arXiv preprint arXiv:2510\.06308\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p3.1)\.
- \[63\] Y\. Xin, J\. Yan, Q\. Qin, Z\. Li, D\. Liu, S\. Li, V\. S\. Huang, Y\. Zhou, R\. Zhang, L\. Zhuo, et al\. \(2025\) Lumina\-mgpt 2\.0: stand\-alone autoregressive image modeling\. arXiv preprint arXiv:2507\.17801\. Cited by: [§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p1.1)\.
- \[64\] S\. Xu, H\. L\. Kao, T\. Xu, H\. Zhang, J\. Wang, R\. Ding, G\. Liu, T\. Shi, Z\. Yu, G\. Pan, et al\. \(2025\) Adaptive detector\-verifier framework for zero\-shot polyp detection in open\-world settings\. arXiv preprint arXiv:2512\.12492\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p4.1)\.
- \[65\] Z\. Xu, Y\. Zhang, E\. Xie, Z\. Zhao, Y\. Guo, K\. K\. Wong, Z\. Li, and H\. Zhao \(2024\) DriveGPT4: interpretable end\-to\-end autonomous driving via large language model\. IEEE Robotics and Automation Letters\. Cited by: [§1](https://arxiv.org/html/2606.28938#S1.p1.1), [§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p1.1)\.
- \[66\]L\. K\. Q\. Yan, Q\. Niu, M\. Li, Y\. Zhang, C\. H\. Yin, C\. Fei, B\. Peng, Z\. Bi, P\. Feng, K\. Chen, T\. Wang, Y\. Wang, S\. Chen, M\. Liu, J\. Liu, X\. Song, R\. Bao, Z\. Jiang, and Z\. Qin\(2025\)Large language model benchmarks in medical tasks\.External Links:2410\.21348,[Link](https://arxiv.org/abs/2410.21348)Cited by:[§1](https://arxiv.org/html/2606.28938#S1.p3.1)\.
- \[67\]C\. Yang, Y\. He, A\. X\. Tian, D\. Chen, J\. Wang, T\. Shi, A\. Heydarian, and P\. Liu\(2025\)Wcdt: world\-centric diffusion transformer for traffic scene generation\.In2025 IEEE International Conference on Robotics and Automation \(ICRA\),pp\. 6566–6572\.Cited by:[§1](https://arxiv.org/html/2606.28938#S1.p4.1)\.
- \[68\]L\. Yang, B\. Kang, Z\. Huang, X\. Xu, J\. Feng, and H\. Zhao\(2024\)Depth anything: unleashing the power of large\-scale unlabeled data\.InIEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),Cited by:[§4\.1\.1](https://arxiv.org/html/2606.28938#S4.SS1.SSS1.p5.1)\.
- \[69\]M\. You, K\. Chen, and D\. Cheng\(2026\)Drdgrl: dual\-relational dynamic graph representation learning for delay\-sensitive stock trend prediction\.InInternational Conference on Database Systems for Advanced Applications,pp\. 35–50\.Cited by:[§1](https://arxiv.org/html/2606.28938#S1.p4.1)\.
- \[70\]W\. You, Z\. Yu, Z\. Han, X\. Liu, and Y\. Zhang\(2025\)Large language models for enhanced user experience in virtual and augmented reality: a comprehensive framework for ranking and recommendation systems\.Available at SSRN 5964834\.Cited by:[§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p1.1)\.
- \[71\]L\. Yu, X\. Han, Y\. Kang, C\. Tseng, D\. Zhang, Z\. Bi, and Z\. Han\(2025\)Affective multimodal agents with proactive knowledge grounding for emotionally aligned marketing dialogue\.arXiv preprint arXiv:2511\.21728\.Cited by:[§1](https://arxiv.org/html/2606.28938#S1.p3.1)\.
- \[72\]W\. Yu, S\. Wei, J\. Liu, Y\. Li, M\. Hu, A\. Liu, H\. Zhang, and I\. King\(2026\)Probability\-entropy calibration: an elastic indicator for adaptive fine\-tuning\.arXiv preprint arXiv:2602\.01745\.Cited by:[§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p1.1)\.
- \[73\]Z\. Yu, M\. Y\. I\. Idris, P\. Wang, Y\. Xia, and Y\. Xiang\(2025\)Forgetme: benchmarking the selective forgetting capabilities of generative models\.Engineering Applications of Artificial Intelligence 161, pp\. 112087\.Cited by:[§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p1.1)\.
- \[74\]Z\. Yu, J\. Wang, and M\. Y\. I\. Idris\(2025\)Iidm: improved implicit diffusion model with knowledge distillation to estimate the spatial distribution density of carbon stock in remote sensing imagery\.Knowledge\-Based Systems, pp\. 115131\.Cited by:[§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p1.1)\.
- \[75\]Z\. Yu\(2025\)Ai for science: a comprehensive review on innovations, challenges, and future directions\.International Journal of Artificial Intelligence for Science \(IJAI4S\) 1\(1\)\.Cited by:[§1](https://arxiv.org/html/2606.28938#S1.p3.1)\.
- \[76\]C\. Zhang, B\. Peng, X\. Sun, Q\. Niu, J\. Liu, K\. Chen, M\. Li, P\. Feng, Z\. Bi, M\. Liu,et al\.\(2024\)From word vectors to multimodal embeddings: techniques, applications, and future directions for large language models\.arXiv:2411\.05036\.Cited by:[§3\.2](https://arxiv.org/html/2606.28938#S3.SS2.p1.9)\.
- \[77\]H\. Zhang, B\. Huang, Z\. Li, X\. Xiao, H\. Y\. Leong, Z\. Zhang, X\. Long, T\. Wang, and H\. Xu\(2025\)Sensitivity\-lora: low\-load sensitivity\-based fine\-tuning for large language models\.arXiv preprint arXiv:2509\.09119\.Cited by:[§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p1.1)\.
- \[78\]H\. Zhang, Z\. Li, R\. Bao, Y\. Gao, X\. Xiao, H\. Zhang, S\. Zhang, B\. Huang, Y\. Wu, T\. Wang,et al\.\(2025\)HyperAdaLoRA: accelerating lora rank allocation during training via hypernetworks without sacrificing performance\.arXiv preprint arXiv:2510\.02630\.Cited by:[§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p1.1)\.
- \[79\]H\. Zhang, M\. Lyu, Z\. Chen, X\. Xing, Y\. Ao, and Y\. Lin\(2025\)Pdtrim: targeted pruning for prefill\-decode disaggregation in inference\.arXiv preprint arXiv:2509\.04467\.Cited by:[§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p1.1)\.
- \[80\]H\. Zhang, M\. Lyu, C\. He, Y\. Ao, and Y\. Lin\(2025\)Trimtokenator: towards adaptive visual token pruning for large multimodal models\.arXiv preprint arXiv:2509\.00320\.Cited by:[§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p1.1)\.
- \[81\]H\. Zhang, M\. Lyu, B\. Huang, Y\. Ao, and Y\. Lin\(2025\)TrimTokenator\-lc: towards adaptive visual token pruning for large multimodal models with long contexts\.arXiv preprint arXiv:2512\.22748\.Cited by:[§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p1.1)\.
- \[82\]H\. Zhang, X\. Mao, G\. Dong, Z\. Li, X\. Su, K\. Chen, J\. Yang, and Z\. Lin\(2026\)MemMark: state\-evolution attribution watermarking for agent long\-term memory systems\.arXiv preprint arXiv:2605\.25002\.Cited by:[§1](https://arxiv.org/html/2606.28938#S1.p4.1)\.
- \[83\]H\. Zhang, H\. You, Z\. Zhang, L\. Gan, H\. Zhang, W\. Huang, and J\. Huang\(2026\)Mitigating generic token dominance in cross\-domain foundation model for text\-attributed graphs\.In International Conference on Database Systems for Advanced Applications, pp\. 251–265\.Cited by:[§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p1.1)\.
- \[84\]P\. Zhang, F\. Yan, and C\. Du\(2015\)A survey on energy management strategy for hybrid electric vehicles\.IEEE Transactions on Vehicular Technology 64\(5\), pp\. 1694–1707\.Cited by:[§2\.2](https://arxiv.org/html/2606.28938#S2.SS2.p1.1)\.
- \[85\]T\. Zhang, V\. Kishore, F\. Wu, K\. Q\. Weinberger, and Y\. Artzi\(2020\)BERTScore: evaluating text generation with bert\.In International Conference on Learning Representations \(ICLR\)\.Cited by:[§4\.3](https://arxiv.org/html/2606.28938#S4.SS3.p1.1)\.
- \[86\]Y\. Zhang, N\. Deng, X\. Song, Z\. Bi, T\. Wang, Z\. Yao, K\. Chen, M\. Li, Q\. Niu, J\. Liu, B\. Peng, S\. Zhang, M\. Liu, L\. Zhang, X\. Pan, J\. Wang, P\. Feng, Y\. Wen, L\. K\. Yan, H\. Tseng, Y\. Zhong, Y\. Wang, Z\. Qin, B\. Jing, J\. Yang, J\. Zhou, C\. X\. Liang, and J\. Song\(2025\)Advanced deep learning methods for protein structure prediction and design\.External Links:2503\.13522,[Link](https://arxiv.org/abs/2503.13522)Cited by:[§3\.2](https://arxiv.org/html/2606.28938#S3.SS2.p1.9)\.
- \[87\]Q\. Zhao, Z\. Dou, D\. Zhang, X\. Li, C\. Song, Z\. Wan, X\. Li, Y\. Zhang, K\. Chen, Q\. Pan,et al\.\(2026\)STRIDE: strategic trajectory reasoning via discriminative estimation for verifiable reinforcement learning\.arXiv preprint arXiv:2606\.15866\.Cited by:[§1](https://arxiv.org/html/2606.28938#S1.p4.1)\.
- \[88\]X\. Zhou, M\. Liu, E\. Yurtsever, B\. L\. Zagar, W\. Zimmer, H\. Cao, and A\. C\. Knoll\(2024\)Vision language models in autonomous driving: a survey and outlook\.IEEE Transactions on Intelligent Vehicles\.Cited by:[§1](https://arxiv.org/html/2606.28938#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.28938#S2.SS1.p1.1)\.
- \[89\]Y\. Zhou, Y\. He, Y\. Su, S\. Han, J\. Jang, G\. Bertasius, M\. Bansal, and H\. Yao\(2025\)ReAgent\-v: a reward\-driven multi\-agent framework for video understanding\.arXiv preprint arXiv:2506\.01300\.Cited by:[§1](https://arxiv.org/html/2606.28938#S1.p1.1)\.