FusionSense: Tri-Stage Near-Sensor Learning for Runtime-Adaptive Multimodal Edge Intelligence

arXiv cs.LG 05/25/26, 04:00 AM Papers
Summary
FusionSense introduces a tri-stage near-sensor learning framework for multimodal edge intelligence that jointly reduces compute and communication by using fusion-aware filtering, achieving up to 33× energy savings and significant data-reduction gains on RGB-Depth/LiDAR tasks.
arXiv:2605.22868v1 Announce Type: new Abstract: Autonomous systems and smart-industry deployments increasingly split computation across near-sensor, edge, and cloud resources, where tight energy, latency, and reliability budgets demand run-time adaptivity. In practice, deciding what to compute and transmit at each point is pivotal; yet as multimodal sensor suites (cameras, LiDAR/depth, etc.) proliferate at the edge, most prior approaches either (i) fuse modalities on powerful servers or (ii) apply uni-modal near-sensor filters that ignore cross-modal dependencies, leading to redundant transmissions or missed events. We present FusionSense, a fusion-aware intelligent sensing framework for energy-constrained autonomous edge systems. Lightweight near-sensor classifiers are trained via a three-step procedure: (i) a server-side fusion model learns the downstream task, (ii) filter-out-safe (FoS) labels quantify each modality's necessity relative to the fused decision, and (iii) an edge-side fusion model is compacted by injecting near-sensor predictions as auxiliary signals. The result is a run-time decision layer that jointly reduces compute and communication while scaling linearly with sensor count. On a dual-modality (RGB+Depth/LiDAR) setup with SynDrone, FusionSense sustains task quality at substantially higher data-reduction rates than uni-modal filters and delivers large end-to-end gains: up to 33x lower energy at 1% FoI prevalence, 11x at 10%, a 92.3% reduction in quality loss at a fixed 30% data reduction, and roughly 1.5x higher energy savings than the best prior filtering baseline.
Cached at: 05/25/26, 08:54 AM
# Tri-Stage Near-Sensor Learning for Runtime-Adaptive Multimodal Edge Intelligence
Source: [https://arxiv.org/html/2605.22868](https://arxiv.org/html/2605.22868)
Ryozo Masukawa (University of California, Irvine, Irvine, CA, USA; rmasukaw@uci.edu), Minhyoung Na (Kookmin University, Seoul, South Korea; minhyoung0724@kookmin.ac.kr), Hyunwoo Oh (University of California, Irvine, Irvine, CA, USA; hyunwooo@uci.edu), Yoshiki Yamaguchi (Shibaura Institute of Technology, Saitama, Japan; bp21016@shibaura-it.ac.jp), Wenjun Huang (University of California, Irvine, Irvine, CA, USA; wenjunh3@uci.edu), SungHeon Jeong (University of California, Irvine, Irvine, CA, USA; sungheoj@uci.edu), and Mohsen Imani (University of California, Irvine, Irvine, CA, USA; m.imani@uci.edu)

###### Abstract\.

Autonomous systems and smart-industry deployments increasingly split computation across near-sensor, edge, and cloud resources, where tight energy, latency, and reliability budgets demand *run-time* adaptivity. In practice, deciding *what* to compute and transmit at each point is pivotal; yet as multimodal sensor suites (cameras, LiDAR/depth, etc.) proliferate at the edge, most prior approaches either (i) fuse modalities on powerful servers or (ii) apply uni-modal near-sensor filters that ignore cross-modal dependencies, leading to redundant transmissions or missed events. We present FusionSense, a fusion-aware intelligent sensing framework for energy-constrained autonomous edge systems. Lightweight near-sensor classifiers are trained via a three-step procedure: (i) a server-side fusion model learns the downstream task, (ii) *filter-out-safe* (FoS) labels quantify each modality’s necessity relative to the fused decision, and (iii) an edge-side fusion model is compacted by injecting near-sensor predictions as auxiliary signals. The result is a run-time decision layer that jointly reduces compute and communication while scaling linearly with sensor count. On a dual-modality (RGB+Depth/LiDAR) setup with SynDrone, FusionSense sustains task quality at substantially higher data-reduction rates than uni-modal filters and delivers large end-to-end gains: up to 33× lower energy at 1% FoI prevalence, 11× at 10%, a 92.3% reduction in quality loss at a fixed 30% data reduction, and roughly 1.5× higher energy savings than the best prior filtering baseline.

## 1\. Introduction

Autonomous systems in modern manufacturing increasingly split workloads across near-sensor/on-device processing, edge nodes, and cloud back-ends, where platforms must make run-time decisions under tight energy, latency, reliability, and thermal constraints (Alikhani et al., [2024](https://arxiv.org/html/2605.22868#bib.bib362); Isuwa et al., [2023](https://arxiv.org/html/2605.22868#bib.bib363); Taufique et al., [2025](https://arxiv.org/html/2605.22868#bib.bib364); Yun et al., [2025b](https://arxiv.org/html/2605.22868#bib.bib354), [2026c](https://arxiv.org/html/2605.22868#bib.bib370)). As multimodal sensors (e.g., RGB, LiDAR/depth, IMU) become standard at the edge, it is essential to model inter-sensor correlations *before* transmission and fusion; otherwise, systems waste uplink bandwidth and risk accuracy drops when complementary cues are required (Alikhani et al., [2023](https://arxiv.org/html/2605.22868#bib.bib365)). While large back-end models often exploit such cross-modal structure, edge devices must also internalize it to remain both efficient and accurate (Taufique et al., [2025](https://arxiv.org/html/2605.22868#bib.bib364); Xun et al., [2024](https://arxiv.org/html/2605.22868#bib.bib366)). In practice, *intelligent sensing* (pre-processing at or near the sensor) has been adopted in agriculture (Liakos et al., [2018](https://arxiv.org/html/2605.22868#bib.bib252)), healthcare (Bahri et al., [2018](https://arxiv.org/html/2605.22868#bib.bib253)), logistics (Woschank et al., [2020](https://arxiv.org/html/2605.22868#bib.bib256)), privacy (Pramanik et al., [2023](https://arxiv.org/html/2605.22868#bib.bib254)), and security (Khalid et al., [2023](https://arxiv.org/html/2605.22868#bib.bib255)); however, most deployments are uni-modal and task-specific, and thus fail to capture cross-modal dependencies.

As the number of edge devices increases, the amount of data that must be processed at servers also increases, leading to severe scalability issues. A promising direction is near-sensor adaptation (Yun et al., [2024](https://arxiv.org/html/2605.22868#bib.bib353); Huang et al., [2024](https://arxiv.org/html/2605.22868#bib.bib355)), where tiny AI models positioned at sensors filter redundant data so that only task-relevant information is relayed for server-side operations. This strategy is effective when the probability of observing a *frame of interest* (FoI) is low, conserving energy with modest quality loss. Yet when modern edge nodes integrate *multiple* sensors, simply applying independent uni-modal filters becomes suboptimal: it (i) transmits redundant modalities even when a subset suffices for the fused decision, and (ii) may over-filter and miss events that require complementary cues. Moreover, data volume and compute grow with each added modality, shifting scalability and thermal challenges to the edge itself.

We propose FusionSense, a fusion-aware intelligent sensing framework that follows a pragmatic design principle. We treat multi-sensor streams as correlated random variables and train lightweight near-sensor classifiers to approximate the joint decision boundaries of a more complex server-side fusion model. Our methodology is informed by multimodal fusion research (Sharma et al., [2022](https://arxiv.org/html/2605.22868#bib.bib226); Kim et al., [2022](https://arxiv.org/html/2605.22868#bib.bib233)) and validated empirically. In particular, FusionSense introduces a *three-step* learning process that (1) trains a high-capacity server fusion model for the downstream task, (2) derives *filter-out-safe* (FoS) labels that indicate when each modality is unnecessary for the fused decision, and (3) compacts an edge-side fusion model by injecting near-sensor predictions as auxiliary signals. The result is a run-time decision layer that co-optimizes compute and communication across the edge–cloud continuum and scales *linearly* with the number of sensors.

We instantiate and evaluate FusionSense on a dual-modality camera–LiDAR setup. Experiments demonstrate that FusionSense maintains task quality at substantially higher data-reduction rates than uni-modal filters and unlocks large end-to-end gains: up to 33× lower energy at 1% FoI prevalence and 11× at 10%; at a fixed 30% data reduction, we observe a 92.3% reduction in quality loss relative to a uni-modal filter; and compared with the best prior filtering baseline, our method yields roughly 1.5× higher energy savings.

This work makes the following contributions:

- We present FusionSense, a *fusion-aware intelligent sensing* framework that reduces the total energy usage of multimodal fusion systems by learning near-sensor keep/drop decisions conditioned on the fused task, addressing a growing gap in multimodal edge sensing for smart-industry autonomy.
- FusionSense introduces a three-step training method that (i) learns a server fusion model, (ii) derives modality-wise FoS supervision from the fused decision, and (iii) compacts an edge fusion model via near-sensor predictions, enabling deployment on power-constrained devices across the compute continuum.
- On camera–LiDAR experiments, FusionSense demonstrates up to 33× lower energy at 1% FoI and 11× at 10%, a 92.3% reduction in quality loss at 30% data reduction, and roughly 1.5× higher energy savings than the best prior filtering baseline, while scaling *linearly* with sensor count.

## 2\. Related Work

### 2\.1\. Multi\-modal Sensor Fusion

Understanding complex environments often requires fusing heterogeneous sensor streams so that complementary cues can be jointly exploited for perception and decision making (Sharma et al., [2022](https://arxiv.org/html/2605.22868#bib.bib226)). Recent progress in multimodal representation learning further shows that diverse modalities can be embedded into shared spaces to improve recognition robustness and transfer (Girdhar et al., [2023](https://arxiv.org/html/2605.22868#bib.bib241)). In autonomous perception, camera–LiDAR fusion has become a cornerstone for reliable 3D understanding under diverse operating conditions (Sato et al., [2023](https://arxiv.org/html/2605.22868#bib.bib242)). Concretely, feature-level and bird’s-eye-view pipelines such as BEVFusion (Liu et al., [2023](https://arxiv.org/html/2605.22868#bib.bib243)), 3D Dual-Fusion (Kim et al., [2022](https://arxiv.org/html/2605.22868#bib.bib233)), EA-LSS (Hu et al., [2023](https://arxiv.org/html/2605.22868#bib.bib236)), and UFO (Kim et al., [2024](https://arxiv.org/html/2605.22868#bib.bib237)) report strong results on large-scale benchmarks like nuScenes (Caesar et al., [2020](https://arxiv.org/html/2605.22868#bib.bib235)). These systems typically presume that all modalities (or suitably preprocessed features) are available to a central fusion model; their primary objective is to maximize downstream accuracy and robustness given full-modality inputs. In contrast, our perspective emphasizes the *pre-fusion* stage, that is, deciding *what* to transmit or compute before fusion, so that multimodal benefits can be retained while reducing end-to-end resource usage.

### 2\.2\. Energy Efficient Edge/AIoT Computing

Deep neural networks have delivered state-of-the-art results in vision, audio, and beyond, and are increasingly deployed on mobile/embedded platforms (Liu et al., [2020](https://arxiv.org/html/2605.22868#bib.bib249); Javed et al., [2017](https://arxiv.org/html/2605.22868#bib.bib239); Yun et al., [2025a](https://arxiv.org/html/2605.22868#bib.bib367), [2026b](https://arxiv.org/html/2605.22868#bib.bib369)). This trend has catalyzed numerous AIoT applications spanning smart environments and monitoring (Li and Han, [2022](https://arxiv.org/html/2605.22868#bib.bib356); Teng, [2021](https://arxiv.org/html/2605.22868#bib.bib357); Madhusudhan and Pravisha, [2024](https://arxiv.org/html/2605.22868#bib.bib358); Yun et al., [2026a](https://arxiv.org/html/2605.22868#bib.bib368)). To fit tight energy and latency envelopes, classic efficiency levers include quantization, pruning, and specialized low-power hardware (Sun et al., [2023](https://arxiv.org/html/2605.22868#bib.bib359)). A complementary line explores *near-sensor intelligence*, where lightweight models suppress or summarize raw data so only task-relevant information is forwarded downstream (Yun et al., [2024](https://arxiv.org/html/2605.22868#bib.bib353); Huang et al., [2024](https://arxiv.org/html/2605.22868#bib.bib355)). However, in distributed settings the communication subsystem (e.g., Wi-Fi/5G) can dominate energy and latency (Shi et al., [2016](https://arxiv.org/html/2605.22868#bib.bib248); Yang et al., [2023](https://arxiv.org/html/2605.22868#bib.bib247)), meaning that sending “everything, but compressed” (Chen et al., [2020](https://arxiv.org/html/2605.22868#bib.bib231); Wang et al., [2022](https://arxiv.org/html/2605.22868#bib.bib225)) may still be suboptimal when informative events are sparse or when certain modalities are temporarily unhelpful. These observations motivate *fusion-aware* decisions that co-optimize compute and communication by conditioning transmission on the contribution of each modality to the fused task, rather than applying modality-agnostic compression or uni-modal filtering.

### 2\.3\. Intelligent Sensing over Dynamic Multimodal Data

Real-world operation exposes sensors to modality-specific degradations (e.g., low light, glare, fog) that can impair individual streams and shift which modalities are most informative (Linnhoff et al., [2022](https://arxiv.org/html/2605.22868#bib.bib244)); in parallel, security concerns such as LiDAR spoofing highlight the need for redundancy and graceful degradation (Sato et al., [2023](https://arxiv.org/html/2605.22868#bib.bib242)). A rich literature therefore studies dynamic fusion and robustness: learning when and how to trust each modality (Liu et al., [2017](https://arxiv.org/html/2605.22868#bib.bib219); Tsai et al., [2018](https://arxiv.org/html/2605.22868#bib.bib227); Takahashi and Tan, [2019](https://arxiv.org/html/2605.22868#bib.bib229); Zhi-Xuan et al., [2020](https://arxiv.org/html/2605.22868#bib.bib228)). Notably, the Crossmodal Compensation Model (CCM) detects corrupted inputs and compensates in a self-supervised manner (Lee et al., [2021](https://arxiv.org/html/2605.22868#bib.bib232)), while DynMM uses mixture-of-experts routing across modality combinations to cut computation without sacrificing accuracy (Xue and Marculescu, [2023](https://arxiv.org/html/2605.22868#bib.bib230)). Yet most of these methods assume that all modalities (or features) have already traversed the uplink to a central model; they optimize *post-acquisition* fusion rather than *pre-fusion* acquisition/transmission policies. Our work is complementary: we leverage the fused model’s decision to supervise lightweight near-sensor selectors that act *before* transmission, thereby avoiding unnecessary data movement while preserving the benefits of dynamic, robust multimodal fusion downstream.

![Refer to caption](https://arxiv.org/html/2605.22868v1/x1.png)

Figure 1. Comparison of our proposed sensing and information processing pipeline with other approaches: (a) the conventional approach, (b) a compression-based approach, (c) a previously proposed filter-out approach designed for a single-sensor environment, and (d) ours.

## 3\. Intelligent Multi\-Sensing Design

Complex machine learning tasks often require substantial models that are challenging to deploy on edge devices. For example, a cutting-edge transformer-based classification model (Chen et al., [2022](https://arxiv.org/html/2605.22868#bib.bib343)) required 80 hours of training on four NVIDIA Tesla V100 GPUs, even though it was designed to use fewer computational resources than its predecessors. Many studies also rely on heavy models for intricate tasks, such as large pre-trained multi-modality models (Wu et al., [2022](https://arxiv.org/html/2605.22868#bib.bib344)) and substantial transformer models (Baade et al., [2022](https://arxiv.org/html/2605.22868#bib.bib345)). As a result, running these deep learning-based multi-modal fusion tasks in real time on edge sensors poses significant practical difficulties. Our approach addresses these challenges by binarizing the task: the near-sensor side only detects the essential “frame of interest” (FoI) data from multiple sensors that the complex operations need. Unlike earlier methods that used near-sensor modules, our strategy achieves greater energy savings through a design tailored for a fusion environment. In this section, we present our intelligent multi-sensing framework, compare it with prior methods in a fusion context, and explain our model training technique.

### 3\.1\. Intelligent Multi\-Sensing Framework

Our approach differs fundamentally from previous approaches (Chen et al., [2020](https://arxiv.org/html/2605.22868#bib.bib231); Huang et al., [2024](https://arxiv.org/html/2605.22868#bib.bib355); Yun et al., [2024](https://arxiv.org/html/2605.22868#bib.bib353)) because it is designed specifically for multi-sensor fusion scenarios. [Figure 1](https://arxiv.org/html/2605.22868#S2.F1) compares four pipelines: (a) the conventional approach without a near-sensor paradigm, (b) a compression-based approach that compresses sensed data before transmission, (c) a previously proposed filter-out approach designed for a single-sensor environment, and (d) ours. The conventional approach sends all sensed data from every sensor, so transmission and processing are excessive and energy efficiency is very poor when the probability of encountering a frame of interest is low. The compression-based approach places a compression module near each sensor. It lowers communication cost in proportion to the compression rate and is therefore more energy-efficient than the conventional approach, but the computation cost remains, so it gains little in the low-FoI-probability scenario. The filter-out approach, designed for low FoI probability, places a near-sensor model at each sensor. It achieved promising energy efficiency in single-sensor settings, but applying it naively in a multi-sensor setting is inadequate: each near-sensor model acts independently, unaware of the relationships between modalities in the fusion model, so it cannot reach the maximum energy efficiency and incurs quality loss from erroneous filter-outs.

To keep the system deployable on real\-world edge devices, we do not require specialized hardware beyond standard accelerators \(e\.g\., Google Edge TPU or NVIDIA Jetson\) for near\-sensor models\. Scalability is addressed by the linear addition of near\-sensor modules per sensor and a single centralized fusion process at either the edge or server\. Thus, from a hardware perspective, the design neither multiplies complexity nor forces sophisticated networking conditions\.

### 3\.2\. Three\-step Model Training

![Refer to caption](https://arxiv.org/html/2605.22868v1/x2.png)

Figure 2. Overview of the proposed three-step training method: (a) a schematic of the entire three-step training process; (b) the first training phase; (c) the second training phase; (d) the third training phase.

Table 1. Truth table (decision table) of label augmentation for near-sensor model training. Labels for the RGB and depth near-sensor models (right side) are determined by three binary values: FoI, RGB FoS, and Depth FoS (left side).

[Figure 2](https://arxiv.org/html/2605.22868#S3.F2) illustrates our proposed three-step training, which spans the near-sensor, edge, and server sides; [Figure 2](https://arxiv.org/html/2605.22868#S3.F2).(a) gives the overall view in terms of the edge-server architecture. In this architecture, there are three types of machine learning models to train: (i) near-sensor models, (ii) an edge-side fusion model, and (iii) a server-side fusion model. We first train the server-side fusion model, which solves complex tasks requiring heavy machine learning models, as shown in [Figure 2](https://arxiv.org/html/2605.22868#S3.F2).(b). Here, we use a late fusion model that fuses modalities by combining the feature embeddings produced by separate per-modality feature extractors. In our implementation, each sensor (RGB or depth) has a dedicated backbone CNN, and a final fully connected fusion layer merges these features. Training is standard supervised learning for multi-label classification. The late fusion design is not required, however: our three-step training places no constraints on the server-side model, whose architecture can vary with the task. We want each near-sensor model to be aware of the cross-modal correlations captured by this server-side fusion model. This is achieved by the next training step, illustrated in [Figure 2](https://arxiv.org/html/2605.22868#S3.F2).(c).

To train fusion-aware near-sensor models, we take a data augmentation approach by introducing a Filter-out Safe (FoS) label for each modality. If FoS is 1 for a modality, then withholding that modality from the server-side fusion model (that is, filtering it out) does not change the model’s output or decision. If FoS is 0, filtering that modality out changes the output. We anticipate that FoS will capture cases where a modality either duplicates information already contained in other modalities or carries irrelevant data, for example when severe weather affects certain sensors. This allows the system to disregard such modalities, thereby reducing redundancy. To obtain FoS for each modality, we first retrieve the decision of the server-side fusion model when it is given all modalities (1). Next, we retrieve its decision when RGB data is filtered out (2) and when depth data is filtered out (3). Filtering out is done by feeding zeroed data with the same dimensionality as the original input of the corresponding modality. We then compare the output for the original input with the outputs for the filtered-out inputs. FoS for RGB is set to 1 if the output from the original input and the output from (2) are the same, and to 0 otherwise (4). For the depth modality, the same procedure is applied using the depth-filtered-out case (3) instead of the RGB-filtered-out case (5). These FoS values for the RGB and depth modalities are retrieved for all data points, augmenting the dataset (6). Finally, the near-sensor models for RGB (7) and depth (8) are trained by label augmentation using the newly added FoS information. The label augmentation for each modality uses three pieces of information, FoI, RGB FoS, and Depth FoS, as shown in [Table 1](https://arxiv.org/html/2605.22868#S3.T1).
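Steps (1) to (6) above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `toy_server_model` is a hypothetical stand-in for the trained server-side fusion model, inputs are plain Python lists rather than images, and the function names are invented for this example. The mapping from (FoI, RGB FoS, Depth FoS) to near-sensor training labels in Table 1 is not reproduced here.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Decision = Tuple[int, ...]
FusionModel = Callable[[Vector, Vector], Decision]


def zeros_like(x: Sequence[float]) -> Vector:
    """Zeroed input with the same dimensionality as x (the filter-out operation)."""
    return [0.0] * len(x)


def fos_labels(model: FusionModel, rgb: Vector, depth: Vector) -> Tuple[int, int]:
    """Return (RGB FoS, Depth FoS) for one sample."""
    full = model(rgb, depth)                   # (1) decision with all modalities
    no_rgb = model(zeros_like(rgb), depth)     # (2) RGB filtered out
    no_depth = model(rgb, zeros_like(depth))   # (3) depth filtered out
    rgb_fos = int(no_rgb == full)              # (4) 1 if dropping RGB changes nothing
    depth_fos = int(no_depth == full)          # (5) 1 if dropping depth changes nothing
    return rgb_fos, depth_fos


def toy_server_model(rgb: Vector, depth: Vector) -> Decision:
    """Hypothetical fusion model with two binary outputs (assumption, for illustration).

    Output 0 fires if either modality is strong; output 1 needs both modalities.
    """
    r, d = sum(rgb), sum(depth)
    return (int(r > 1.0 or d > 1.0), int(r > 0.5 and d > 0.5))


def augment(model: FusionModel, dataset):
    """(6) Attach FoS labels to every (rgb, depth) pair in the dataset."""
    return [(rgb, depth, *fos_labels(model, rgb, depth)) for rgb, depth in dataset]
```

In this toy setting, a sample with a strong RGB signal and a weak depth signal yields a depth FoS of 1 (depth is redundant for the fused decision) and an RGB FoS of 0, while a sample whose decision needs both modalities yields 0 for both.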

Lastly, the server-side fusion model is compacted for deployment on the edge side, as illustrated in [Figure 2](https://arxiv.org/html/2605.22868#S3.F2).(d). The edge-side fusion model should be as compact as possible, because in practice edge devices must operate without external energy sources. We achieve this by feeding the prediction scores of each near-sensor model into the edge-side late fusion model. Our assumption is that these scores couple the fusion model to the near-sensor models weakly and in one direction (from the near-sensor models to the fusion model), so that the fusion model behaves like a larger model. Based on this assumption, we train a fusion model with the same architecture as the server-side fusion model but a smaller size, using filtered modality data together with the prediction scores from the near-sensor models.
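One way to picture the score injection is as extra entries appended to the late-fusion input. The sketch below is only an assumption about the mechanics: the paper states that near-sensor prediction scores are introduced to the late fusion model but does not specify how, so the concatenation, the score semantics (higher means keep), and the `keep_threshold` value are all hypothetical.

```python
from typing import List, Sequence


def fusion_input(
    rgb_feat: Sequence[float],
    depth_feat: Sequence[float],
    rgb_score: float,
    depth_score: float,
    keep_threshold: float = 0.5,  # hypothetical cut-off, not from the paper
) -> List[float]:
    """Build the input vector for a compact late-fusion head.

    A modality whose near-sensor score falls below the threshold is treated
    as filtered out and replaced by zeros; both scores are appended as
    auxiliary signals.
    """
    rgb_part = list(rgb_feat) if rgb_score >= keep_threshold else [0.0] * len(rgb_feat)
    depth_part = list(depth_feat) if depth_score >= keep_threshold else [0.0] * len(depth_feat)
    return rgb_part + depth_part + [rgb_score, depth_score]
```

With this layout the fusion head sees both the (possibly zeroed) features and the confidence of each near-sensor model, which is the one-directional coupling the paragraph above describes.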

## 4\. Experiments

![Refer to caption](https://arxiv.org/html/2605.22868v1/x3.png)

Figure 3. Comparative distribution of energy consumption across four methods (the conventional method, the compressive near-sensor approach, the previous filtering-out approach using individual near-sensor models, and our proposed method) across varying probabilities of FoI. The total energy consumption values are normalized to the total of the conventional method and displayed at the center of each distribution.

### 4\.1\. Experimental Setup

We implemented our framework in software and ran it on GPU accelerators. Specifically, the implementation uses PyTorch and NumPy, which support CNN layers and classification. For model training, we used the Adam optimizer, an exponential learning rate scheduler with $\gamma = 0.95$, and binary cross-entropy loss. We trained for 60 epochs in total and selected the best model based on a 10% validation split from the training set, and we performed a grid search over a small set of hyperparameters (batch sizes of 8, 16, and 32 and learning rates of $10^{-3}$ or $10^{-4}$). For feature extractors, we used MobileNetV3 (Koonce and Koonce, [2021](https://arxiv.org/html/2605.22868#bib.bib360)) for near-sensor models and RegNet 400mf (Radosavovic et al., [2020](https://arxiv.org/html/2605.22868#bib.bib361)) for the server-side model. We avoided overly large architectures (such as ResNet101 or Transformers) on the near-sensor side, thus simplifying real deployment. Note that our near-sensor backbones (MobileNetV3-level complexity) are significantly lighter than the server-side fusion model, keeping the computational overhead small. In the late fusion models, we used multiple linear layers with the ReLU activation function.
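The training configuration above amounts to a six-point grid and a simple decay rule. The following sketch spells both out; it assumes the scheduler decays once per epoch (the text does not say how often it steps), and the function name is illustrative.

```python
from itertools import product

EPOCHS = 60
GAMMA = 0.95                      # exponential decay factor from the setup
BATCH_SIZES = (8, 16, 32)
LEARNING_RATES = (1e-3, 1e-4)


def lr_at_epoch(base_lr: float, epoch: int, gamma: float = GAMMA) -> float:
    """Exponential schedule: lr_t = base_lr * gamma**t (assumes one step per epoch)."""
    return base_lr * gamma ** epoch


# Every (batch size, learning rate) pair visited by the grid search.
grid = list(product(BATCH_SIZES, LEARNING_RATES))

if __name__ == "__main__":
    for batch_size, lr in grid:
        final_lr = lr_at_epoch(lr, EPOCHS - 1)
        print(f"batch={batch_size:2d} lr0={lr:.0e} lr_last_epoch={final_lr:.2e}")
```

Under the per-epoch assumption, the learning rate in the last of the 60 epochs is a little under 5% of its initial value.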

### 4\.2\. Dataset

We evaluated our proposed method using the SynDrone dataset (Rizzoli et al., [2023](https://arxiv.org/html/2605.22868#bib.bib257)), a public multi-modal dataset designed for urban classification applications for UAVs. We selected RGB images and depth maps to train our fusion-based intelligent sensing system. The dataset includes seven coarse classes (e.g., roads and construction), each containing several subclasses. We chose the vehicle coarse class as our object of interest and treat any frame containing such an object as a frame of interest. The vehicle coarse class contains 6 subclasses: Car, Truck, Bus, Train, Motorcycle, and Bicycle. In our evaluation, we assume that the edge-side or server-side fusion model performs a complex task, namely multi-label classification over these 6 subclasses. We used 80% of the dataset as the training set and the remaining 20% as the test set. In all evaluations, models are trained only on the training set and evaluated on the test set. Since our focus is to showcase how near-sensor multi-modal filtering can improve energy efficiency, we limit ourselves to this representative dataset, which already offers a variety of urban environment conditions and object classes.

### 4\.3\. Evaluation of trade\-off between quality loss and effective data generation

![Refer to caption](https://arxiv.org/html/2605.22868v1/x4.png)

Figure 4. Trade-off between data efficiency (the portion of data, by size, that is saved) and quality loss (the performance drop rate of the server-side model when using filtered-out data).

To demonstrate the data selection efficiency of our near-sensor model compared to the previous intelligent sensing approach, we analyzed the trade-off between data filter-out rate and quality loss. This relationship is depicted in [Figure 4](https://arxiv.org/html/2605.22868#S4.F4). We define the data filter-out rate, or data efficiency, as the proportion of data volume that is filtered out relative to the total data volume of all frames in the test set. Quality loss is the reduction in performance, measured by macro F1 score, of the server-side fusion model when it processes the filtered data. From a theoretical perspective, each near-sensor model tries to approximate the server-side fusion output by leveraging the correlations discovered during training. The results reveal a distinct trade-off curve; our approach exhibits a steeper curve, achieving significantly higher data efficiency with substantially lower quality loss. Specifically, we observed a 92.3% reduction in quality loss while targeting a 30% reduction in data generation.
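The two axes of this trade-off can be computed as follows. This is a sketch under stated assumptions: quality loss is taken as the relative drop in macro F1 (the text says "performance drop rate" without giving a formula), a class with no positives and no predictions scores an F1 of 0, and all function names are illustrative.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Macro F1 for multi-label data: unweighted mean of per-class F1 scores."""
    n_classes = len(y_true[0])
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 0 and p[c] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # convention: empty class -> 0
    return sum(scores) / n_classes


def data_filter_out_rate(filtered_bytes: float, total_bytes: float) -> float:
    """Share of the test-set data volume that was filtered out."""
    return filtered_bytes / total_bytes


def quality_loss(f1_full: float, f1_filtered: float) -> float:
    """Relative macro-F1 drop when the server model sees filtered data (assumed form)."""
    return (f1_full - f1_filtered) / f1_full


def loss_reduction(baseline_loss: float, our_loss: float) -> float:
    """Fraction by which quality loss shrinks relative to a baseline filter."""
    return 1.0 - our_loss / baseline_loss
```

A figure such as the reported 92.3% corresponds to `loss_reduction` evaluated at a matched data filter-out rate, here 30%, for the uni-modal baseline and the fusion-aware filter.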

### 4\.4\. Evaluation of edge fusion model compactization

![Refer to caption](https://arxiv.org/html/2605.22868v1/x5.png)

Figure 5. Quality loss for different edge-side fusion model sizes, in terms of the number of trainable parameters relative to the server-side fusion model (left), and after converting model size into energy efficiency relative to the energy consumption of the server-side fusion model in Joules (right).

In this evaluation, we demonstrate the effectiveness of our proposed compact edge fusion model, designed to perform multi-label classification tasks similarly to the server-side fusion model. Unlike the latter, our compact model utilizes prediction scores from the near-sensor models during the late fusion stage. For comparison, we trained a baseline fusion model that does not incorporate near-sensor model scores yet maintains the same number of trainable parameters as our compact model.

[Figure 5](https://arxiv.org/html/2605.22868#S4.F5) presents the evaluation results, showing the quality loss relative to the server-side model across varying levels of model size efficiency, that is, the reduction in trainable parameters compared to the server-side fusion model. The figure also includes a conversion of model size to energy efficiency. To estimate energy consumption, we considered a scenario using the Raspberry Pi 3 Model B as the edge computing device. Based on previous research (Velasco-Montero et al., [2018](https://arxiv.org/html/2605.22868#bib.bib220)), we converted the model size to energy usage (in Joules) for different model sizes running on a Raspberry Pi.

While the parameter count of the edge model does not explode for two modalities, we note that in principle more modalities could increase total parameters. However, our approach partially mitigates this effect by shifting some complexity to near-sensor modules that individually remain lightweight and can be parallelized. The results show that our compact model achieves approximately 1% lower model size and energy consumption than a typical edge fusion model, with a permissible quality loss above 20%. Although this is a modest improvement in energy efficiency, it suggests that further optimizations, particularly in how directional information flows from near-sensor models to the edge-side model, could make the edge model more compact without full integration.

### 4\.5\.Energy efficiency on end\-to\-end system

Inspired by a previous study on transmission energy consumption \(Nirjon et al\., [2013](https://arxiv.org/html/2605.22868#bib.bib223)\), we evaluate the end\-to\-end energy consumption of our proposed intelligent sensing framework\. In this analysis, we focus on a scenario in which near\-sensor models transmit data to the cloud for processing by a server\-side fusion model, where the server is equipped with a 13th Gen Intel\(R\) Core\(TM\) i9\-13900KF CPU and an NVIDIA RTX 4070 GPU\. We exclude edge\-side modeling and focus instead on real\-time network communication to the server\-side model\. We tested three scenarios with FoI probabilities of 1%, 5%, and 10%\. Because the overall energy efficiency of the system depends on it, an energy\-optimized near\-sensor model is crucial\. We implemented our models on the Google Edge TPU, which uses ASIC acceleration, and estimated its energy consumption by measuring latency and taking average power consumption figures from a recent study \(Ni et al\., [2022](https://arxiv.org/html/2605.22868#bib.bib222)\)\.
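
The latency\-based estimate described above amounts to multiplying a measured inference latency by an average power figure\. Written out, with symbols introduced here only for illustration \(they are not the paper's notation\):

$$
E_{\text{TPU}} \approx \bar{P}_{\text{TPU}} \cdot t_{\text{lat}}
$$

Here $\bar{P}_{\text{TPU}}$ is the average power in watts taken from the characterization study and $t_{\text{lat}}$ is the measured per\-inference latency in seconds, so $E_{\text{TPU}}$ is in joules\. As a placeholder example, an average draw of 2 W over a 5 ms inference would correspond to roughly 0\.01 J; these values are hypothetical and are not measurements from the paper\.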

[Figure 3](https://arxiv.org/html/2605.22868#S4.F3) provides an energy consumption breakdown for four approaches: the conventional method, which naively transmits all data; compressive sensing, specifically Bit Depth Compression \(BDC\), which reduces energy costs and has been highlighted in recent real\-time applications \(Hwang et al\., [2023](https://arxiv.org/html/2605.22868#bib.bib350)\); the previous filtering\-out approaches \(Yun et al\., [2024](https://arxiv.org/html/2605.22868#bib.bib353); Huang et al\., [2024](https://arxiv.org/html/2605.22868#bib.bib355)\); and our proposed method, which employs fusion\-aware near\-sensor models developed through the three\-step training process\. The energy costs, normalized to the conventional approach, are shown at the center of each distribution\. The compression method significantly reduces communication costs relative to the conventional approach\. However, it still sends out all data, so it cannot exploit low FoI probability scenarios and fails to reduce total energy consumption effectively\. In contrast, our method demonstrates the most substantial energy savings, achieving up to 33× lower energy consumption than the conventional approach and outperforming other state\-of\-the\-art methods, which show at most a 22× reduction\.
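
To make the comparison above concrete, the following Python sketch models the expected per-sample energy of a filtering approach against the conventional send-everything baseline. It is a simplified accounting under stated assumptions: every sample pays the near-sensor inference cost, and only the transmitted fraction pays transmission and server costs. The function names and all numeric costs are hypothetical and are not the paper's measurements.

```python
def expected_energy(
    e_near_sensor: float,
    e_transmit: float,
    e_server: float,
    transmit_fraction: float,
) -> float:
    """Expected per-sample energy when only a fraction of samples is sent.

    Assumption: the near-sensor model runs on every sample, while the
    transmission and server costs apply only to samples that are kept.
    """
    if not 0.0 <= transmit_fraction <= 1.0:
        raise ValueError("transmit_fraction must lie in [0, 1]")
    return e_near_sensor + transmit_fraction * (e_transmit + e_server)


def savings_factor(
    e_near_sensor: float,
    e_transmit: float,
    e_server: float,
    transmit_fraction: float,
) -> float:
    """Energy of the conventional baseline divided by the filtered energy."""
    conventional = e_transmit + e_server  # baseline sends every sample
    filtered = expected_energy(
        e_near_sensor, e_transmit, e_server, transmit_fraction
    )
    return conventional / filtered


# Hypothetical per-sample costs in arbitrary energy units.
for fraction in (1.0, 0.10, 0.02):
    factor = savings_factor(0.01, 1.0, 0.5, fraction)
    print(f"transmit {fraction:.0%} of samples -> {factor:.1f}x lower energy")
```

Under this model the savings factor grows as the transmitted fraction shrinks, which is consistent with the larger gains reported at low FoI probability. A compression-only method corresponds to a smaller `e_transmit` with `transmit_fraction` fixed at 1, so it cannot benefit from rare FoI.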

While the proposed system appears complex, it can be feasibly deployed with off\-the\-shelf edge TPUs or GPUs\. The near\-sensor models are light enough for real\-time use, and adding new sensors only requires adding lightweight binary classifiers\. Hence, we believe that the complexity is manageable, and the energy savings justify this design\.
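
The scaling argument above can be sketched as a decision layer that holds one lightweight binary classifier per sensor. The Python below is a minimal illustration, assuming each classifier returns a filter-out-safe (FoS) score in [0, 1] where a high score means the sample can be dropped; the function name `keep_drop_decisions`, the threshold, and the stand-in classifiers are hypothetical.

```python
from typing import Any, Callable, Dict


def keep_drop_decisions(
    samples: Dict[str, Any],
    fos_classifiers: Dict[str, Callable[[Any], float]],
    threshold: float = 0.5,
) -> Dict[str, bool]:
    """Return True (transmit) or False (drop) for each sensor.

    Assumption: a score at or above the threshold marks the sample as
    safe to filter out, so it is dropped before fusion.
    """
    decisions = {}
    for name, sample in samples.items():
        fos_score = fos_classifiers[name](sample)  # one call per sensor
        decisions[name] = fos_score < threshold
    return decisions


# Toy stand-ins for trained near-sensor models (hypothetical scores).
classifiers = {
    "rgb": lambda frame: 0.9,    # confident the RGB frame is safe to drop
    "depth": lambda frame: 0.2,  # depth frame is likely needed
}
decisions = keep_drop_decisions({"rgb": None, "depth": None}, classifiers)
```

Supporting an extra sensor in this sketch means adding one more entry to `classifiers`, so the number of classifier calls per time step grows linearly with the sensor count.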

## 5\.Conclusions

We introduced FusionSense, a fusion\-aware intelligent sensing framework that moves part of multimodal decision making to the sensing front\-end\. Using a three\-step recipe \(server\-side fusion learning, *filter\-out\-safe* supervision, and a compact edge fusion model fed by near\-sensor scores\), our system makes *pre\-fusion* keep/drop decisions that co\-optimize compute and communication\. On camera–LiDAR experiments, FusionSense achieves up to 33× lower energy at 1% FoI and 11× at 10%, with a 92\.3% reduction in quality loss at 30% data reduction and roughly 1\.5× higher energy savings than the best prior filter\. The design scales *linearly* with sensor count and runs on common accelerators without specialized hardware\. Future work will extend to additional modalities and tasks and explore online adaptation under dynamic energy and deadline constraints\.

###### Acknowledgements\.

This work was supported in part by the DARPA Young Faculty Award, the National Science Foundation \(NSF\) under Grants \#2127780, \#2319198, \#2321840, \#2312517, and \#2235472, the Semiconductor Research Corporation \(SRC\), the Office of Naval Research through the Young Investigator Program Award and Grants \#N00014\-21\-1\-2225 and \#N00014\-24\-1\-2547, and Army Research Office Grant \#W911NF2410360\. Additionally, support was provided by the Air Force Office of Scientific Research under Award \#FA9550\-22\-1\-0253\.

## References

- H\. Alikhani, A\. Kanduri, P\. Liljeberg, A\. M\. Rahmani, and N\. Dutt \(2023\) DynaFuse: dynamic fusion for resource efficient multimodal machine learning inference\. IEEE Embedded Systems Letters 15\(4\), pp\. 222–225\. Cited by: [§1](https://arxiv.org/html/2605.22868#S1.p1.1)\.
- H\. Alikhani, Z\. Wang, A\. Kanduri, P\. Liljeberg, A\. M\. Rahmani, and N\. Dutt \(2024\) SEAL: sensing efficient active learning on wearables through context\-awareness\. In 2024 Design, Automation & Test in Europe Conference & Exhibition \(DATE\), pp\. 1–2\. Cited by: [§1](https://arxiv.org/html/2605.22868#S1.p1.1)\.
- A\. Baade, P\. Peng, and D\. Harwath \(2022\) Mae\-ast: masked autoencoding audio spectrogram transformer\. arXiv preprint arXiv:2203\.16691\. Cited by: [§3](https://arxiv.org/html/2605.22868#S3.p1.1)\.
- S\. Bahri, N\. Zoghlami, M\. Abed, and J\. M\. R\. Tavares \(2018\) Big data for healthcare: a survey\. IEEE access 7, pp\. 7397–7408\. Cited by: [§1](https://arxiv.org/html/2605.22868#S1.p1.1)\.
- H\. Caesar, V\. Bankiti, A\. H\. Lang, S\. Vora, V\. E\. Liong, Q\. Xu, A\. Krishnan, Y\. Pan, G\. Baldan, and O\. Beijbom \(2020\) NuScenes: a multimodal dataset for autonomous driving\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\)\. Cited by: [§2\.1](https://arxiv.org/html/2605.22868#S2.SS1.p1.1)\.
- K\. Chen, X\. Du, B\. Zhu, Z\. Ma, T\. Berg\-Kirkpatrick, and S\. Dubnov \(2022\) HTS\-at: a hierarchical token\-semantic audio transformer for sound classification and detection\. In ICASSP 2022 \- 2022 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\), pp\. 646–650\. External Links: [Document](https://dx.doi.org/10.1109/ICASSP43922.2022.9746312)\. Cited by: [§3](https://arxiv.org/html/2605.22868#S3.p1.1)\.
- Z\. Chen, K\. Fan, S\. Wang, L\. Duan, W\. Lin, and A\. C\. Kot \(2020\) Toward intelligent sensing: intermediate deep feature compression\. IEEE Transactions on Image Processing 29, pp\. 2230–2243\. External Links: [Document](https://dx.doi.org/10.1109/TIP.2019.2941660)\. Cited by: [§2\.2](https://arxiv.org/html/2605.22868#S2.SS2.p1.1), [§3\.1](https://arxiv.org/html/2605.22868#S3.SS1.p1.1)\.
- R\. Girdhar, A\. El\-Nouby, Z\. Liu, M\. Singh, K\. V\. Alwala, A\. Joulin, and I\. Misra \(2023\) Imagebind: one embedding space to bind them all\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 15180–15190\. Cited by: [§2\.1](https://arxiv.org/html/2605.22868#S2.SS1.p1.1)\.
- H\. Hu, F\. Wang, J\. Su, Y\. Wang, L\. Hu, W\. Fang, J\. Xu, and Z\. Zhang \(2023\) EA\-lss: edge\-aware lift\-splat\-shot framework for 3d bev object detection\. arXiv preprint arXiv:2303\.17895\. Cited by: [§2\.1](https://arxiv.org/html/2605.22868#S2.SS1.p1.1)\.
- W\. Huang, A\. Rezvani, H\. Chen, Y\. Ni, S\. Yun, S\. Jeong, and M\. Imani \(2024\) A plug\-in tiny ai module for intelligent and selective sensor data transmission\. arXiv preprint arXiv:2402\.02043\. Cited by: [§1](https://arxiv.org/html/2605.22868#S1.p2.1), [§2\.2](https://arxiv.org/html/2605.22868#S2.SS2.p1.1), [§3\.1](https://arxiv.org/html/2605.22868#S3.SS1.p1.1), [§4\.5](https://arxiv.org/html/2605.22868#S4.SS5.p2.2)\.
- S\. Hwang, K\. Kim, S\. Kim, and J\. W\. Kwak \(2023\) Lossless data compression for time\-series sensor data based on dynamic bit packing\. Sensors 23\(20\), pp\. 8575\. Cited by: [§4\.5](https://arxiv.org/html/2605.22868#S4.SS5.p2.2)\.
- S\. Isuwa, D\. Amos, A\. K\. Singh, B\. M\. Al\-Hashimi, and G\. V\. Merrett \(2023\) Content\-and lighting\-aware adaptive brightness scaling for improved mobile user experience\. In 2023 Design, Automation & Test in Europe Conference & Exhibition \(DATE\), pp\. 1–2\. Cited by: [§1](https://arxiv.org/html/2605.22868#S1.p1.1)\.
- A\. Javed, H\. Larijani, A\. Ahmadinia, R\. Emmanuel, M\. Mannion, and D\. Gibson \(2017\) Design and implementation of a cloud enabled random neural network\-based decentralized smart controller with intelligent sensor nodes for hvac\. IEEE Internet of Things Journal 4\(2\), pp\. 393–403\. External Links: [Document](https://dx.doi.org/10.1109/JIOT.2016.2627403)\. Cited by: [§2\.2](https://arxiv.org/html/2605.22868#S2.SS2.p1.1)\.
- B\. Khalid, K\. N\. Qureshi, K\. Z\. Ghafoor, and G\. Jeon \(2023\) An improved biometric based user authentication and key agreement scheme for intelligent sensor based wireless communication\. Microprocessors and Microsystems 96, pp\. 104722\. Cited by: [§1](https://arxiv.org/html/2605.22868#S1.p1.1)\.
- O\. Kim, J\. Seo, S\. Ahn, and C\. H\. Kim \(2024\) UFO: uncertainty\-aware lidar\-image fusion for off\-road semantic terrain map estimation\. arXiv preprint arXiv:2403\.02642\. Cited by: [§2\.1](https://arxiv.org/html/2605.22868#S2.SS1.p1.1)\.
- Y\. Kim, K\. Park, M\. Kim, D\. Kum, and J\. W\. Choi \(2022\) 3D dual\-fusion: dual\-domain dual\-query camera\-lidar fusion for 3d object detection\. arXiv preprint arXiv:2211\.13529\. Cited by: [§1](https://arxiv.org/html/2605.22868#S1.p3.1), [§2\.1](https://arxiv.org/html/2605.22868#S2.SS1.p1.1)\.
- B\. Koonce and B\. Koonce \(2021\) MobileNetV3\. Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization, pp\. 125–144\. Cited by: [§4\.1](https://arxiv.org/html/2605.22868#S4.SS1.p1.3)\.
- M\. A\. Lee, M\. Tan, Y\. Zhu, and J\. Bohg \(2021\) Detect, reject, correct: crossmodal compensation of corrupted sensors\. In 2021 IEEE International Conference on Robotics and Automation \(ICRA\), pp\. 909–916\. External Links: [Document](https://dx.doi.org/10.1109/ICRA48506.2021.9561847)\. Cited by: [§2\.3](https://arxiv.org/html/2605.22868#S2.SS3.p1.1)\.
- J\. Li and H\. Han \(2022\) Emotional design strategy of smart furniture for small households based on user experience\. In International Conference on Human\-Computer Interaction, pp\. 311–320\. Cited by: [§2\.2](https://arxiv.org/html/2605.22868#S2.SS2.p1.1)\.
- K\. G\. Liakos, P\. Busato, D\. Moshou, S\. Pearson, and D\. Bochtis \(2018\) Machine learning in agriculture: a review\. Sensors 18\(8\), pp\. 2674\. Cited by: [§1](https://arxiv.org/html/2605.22868#S1.p1.1)\.
- C\. Linnhoff, K\. Hofrichter, L\. Elster, P\. Rosenberger, and H\. Winner \(2022\) Measuring the influence of environmental conditions on automotive lidar sensors\. Sensors 22\(14\), pp\. 5266\. Cited by: [§2\.3](https://arxiv.org/html/2605.22868#S2.SS3.p1.1)\.
- G\. Liu, A\. Siravuru, S\. Prabhakar, M\. Veloso, and G\. Kantor \(2017\) Learning end\-to\-end multimodal sensor policies for autonomous navigation\. In Conference on Robot Learning, pp\. 249–261\. Cited by: [§2\.3](https://arxiv.org/html/2605.22868#S2.SS3.p1.1)\.
- Z\. Liu, E\. Ren, F\. Qiao, Q\. Wei, X\. Liu, L\. Luo, H\. Zhao, and H\. Yang \(2020\) NS\-cim: a current\-mode computation\-in\-memory architecture enabling near\-sensor processing for intelligent iot vision nodes\. IEEE Transactions on Circuits and Systems I: Regular Papers 67\(9\), pp\. 2909–2922\. External Links: [Document](https://dx.doi.org/10.1109/TCSI.2020.2984161)\. Cited by: [§2\.2](https://arxiv.org/html/2605.22868#S2.SS2.p1.1)\.
- Z\. Liu, H\. Tang, A\. Amini, X\. Yang, H\. Mao, D\. L\. Rus, and S\. Han \(2023\) Bevfusion: multi\-task multi\-sensor fusion with unified bird’s\-eye view representation\. In 2023 IEEE international conference on robotics and automation \(ICRA\), pp\. 2774–2781\. Cited by: [§2\.1](https://arxiv.org/html/2605.22868#S2.SS1.p1.1)\.
- R\. Madhusudhan and P\. Pravisha \(2024\) Blockchain based artificial intelligence of things \(aiot\) for wildlife monitoring\. In International Conference on Advanced Information Networking and Applications, pp\. 25–36\. Cited by: [§2\.2](https://arxiv.org/html/2605.22868#S2.SS2.p1.1)\.
- Y\. Ni, Y\. Kim, T\. Rosing, and M\. Imani \(2022\) Online performance and power prediction for edge tpu via comprehensive characterization\. In 2022 Design, Automation & Test in Europe Conference & Exhibition \(DATE\), pp\. 612–615\. Cited by: [§4\.5](https://arxiv.org/html/2605.22868#S4.SS5.p1.1)\.
- S\. Nirjon, R\. F\. Dickerson, P\. Asare, Q\. Li, D\. Hong, J\. A\. Stankovic, P\. Hu, G\. Shen, and X\. Jiang \(2013\) Auditeur: a mobile\-cloud service platform for acoustic event detection on smartphones\. In Proceeding of the 11th annual international conference on Mobile systems, applications, and services, pp\. 403–416\. Cited by: [§4\.5](https://arxiv.org/html/2605.22868#S4.SS5.p1.1)\.
- S\. Pramanik, D\. Pandey, S\. Joardar, M\. Niranjanamurthy, B\. K\. Pandey, and J\. Kaur \(2023\) An overview of iot privacy and security in smart cities\. In AIP Conference Proceedings, Vol\. 2495\. Cited by: [§1](https://arxiv.org/html/2605.22868#S1.p1.1)\.
- I\. Radosavovic, R\. P\. Kosaraju, R\. Girshick, K\. He, and P\. Dollár \(2020\) Designing network design spaces\. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp\. 10428–10436\. Cited by: [§4\.1](https://arxiv.org/html/2605.22868#S4.SS1.p1.3)\.
- G\. Rizzoli, F\. Barbato, M\. Caligiuri, and P\. Zanuttigh \(2023\) SynDrone\-multi\-modal uav dataset for urban scenarios\. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp\. 2210–2220\. Cited by: [§4\.2](https://arxiv.org/html/2605.22868#S4.SS2.p1.1)\.
- T\. Sato, Y\. Hayakawa, R\. Suzuki, Y\. Shiiki, K\. Yoshioka, and Q\. A\. Chen \(2023\) Revisiting lidar spoofing attack capabilities against object detection: improvements, measurement, and new attack\. arXiv preprint arXiv:2303\.10555\. Cited by: [§2\.1](https://arxiv.org/html/2605.22868#S2.SS1.p1.1), [§2\.3](https://arxiv.org/html/2605.22868#S2.SS3.p1.1)\.
- A\. Sharma, V\. Sharma, M\. Jaiswal, H\. Wang, D\. N\. K\. Jayakody, C\. M\. W\. Basnayaka, and A\. Muthanna \(2022\) Recent trends in ai\-based intelligent sensing\. Electronics 11\(10\), pp\. 1661\. Cited by: [§1](https://arxiv.org/html/2605.22868#S1.p3.1), [§2\.1](https://arxiv.org/html/2605.22868#S2.SS1.p1.1)\.
- W\. Shi, J\. Cao, Q\. Zhang, Y\. Li, and L\. Xu \(2016\) Edge computing: vision and challenges\. IEEE Internet of Things Journal 3\(5\), pp\. 637–646\. External Links: [Document](https://dx.doi.org/10.1109/JIOT.2016.2579198)\. Cited by: [§2\.2](https://arxiv.org/html/2605.22868#S2.SS2.p1.1)\.
- K\. Sun, X\. Wang, and Q\. Zhao \(2023\) A review of aiot\-based edge devices and lightweight deployment\. Authorea Preprints\. Cited by: [§2\.2](https://arxiv.org/html/2605.22868#S2.SS2.p1.1)\.
- K\. Takahashi and J\. Tan \(2019\) Deep visuo\-tactile learning: estimation of tactile properties from images\. In 2019 International Conference on Robotics and Automation \(ICRA\), pp\. 8951–8957\. Cited by: [§2\.3](https://arxiv.org/html/2605.22868#S2.SS3.p1.1)\.
- Z\. Taufique, A\. Vyas, A\. Miele, P\. Liljeberg, and A\. Kanduri \(2025\) HiDP: hierarchical dnn partitioning for distributed inference on heterogeneous edge platforms\. In 2025 Design, Automation & Test in Europe Conference \(DATE\), pp\. 1–7\. Cited by: [§1](https://arxiv.org/html/2605.22868#S1.p1.1)\.
- D\. Teng \(2021\) AIoT powered wild animal tracing and protection system research proposal for mres in engineering science supervised by niki trigoni\. Cited by: [§2\.2](https://arxiv.org/html/2605.22868#S2.SS2.p1.1)\.
- Y\. H\. Tsai, P\. P\. Liang, A\. Zadeh, L\. Morency, and R\. Salakhutdinov \(2018\) Learning factorized multimodal representations\. arXiv preprint arXiv:1806\.06176\. Cited by: [§2\.3](https://arxiv.org/html/2605.22868#S2.SS3.p1.1)\.
- D\. Velasco\-Montero, J\. Fernández\-Berni, R\. Carmona\-Galán, and Á\. Rodríguez\-Vázquez \(2018\) Performance analysis of real\-time dnn inference on raspberry pi\. In Real\-Time Image and Video Processing 2018, Vol\. 10670, pp\. 115–123\. Cited by: [§4\.4](https://arxiv.org/html/2605.22868#S4.SS4.p2.1)\.
- S\. Wang, S\. Wang, W\. Yang, X\. Zhang, S\. Wang, S\. Ma, and W\. Gao \(2022\) Towards analysis\-friendly face representation with scalable feature and texture compression\. IEEE Transactions on Multimedia 24, pp\. 3169–3181\. External Links: [Document](https://dx.doi.org/10.1109/TMM.2021.3094300)\. Cited by: [§2\.2](https://arxiv.org/html/2605.22868#S2.SS2.p1.1)\.
- M\. Woschank, E\. Rauch, and H\. Zsifkovits \(2020\) A review of further directions for artificial intelligence, machine learning, and deep learning in smart logistics\. Sustainability 12\(9\), pp\. 3760\. Cited by: [§1](https://arxiv.org/html/2605.22868#S1.p1.1)\.
- H\. Wu, P\. Seetharaman, K\. Kumar, and J\. P\. Bello \(2022\) Wav2clip: learning robust audio representations from clip\. In ICASSP 2022\-2022 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\), pp\. 4563–4567\. Cited by: [§3](https://arxiv.org/html/2605.22868#S3.p1.1)\.
- Z\. Xue and R\. Marculescu \(2023\) Dynamic multimodal fusion\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\) Workshops, pp\. 2575–2584\. Cited by: [§2\.3](https://arxiv.org/html/2605.22868#S2.SS3.p1.1)\.
- L\. Xun, M\. Hu, H\. Zhao, A\. K\. Singh, J\. Hare, and G\. V\. Merrett \(2024\) Fluid dynamic dnns for reliable and adaptive distributed inference on edge devices\. In 2024 Design, Automation & Test in Europe Conference & Exhibition \(DATE\), pp\. 1–2\. Cited by: [§1](https://arxiv.org/html/2605.22868#S1.p1.1)\.
- X\. Yang, Z\. Liu, K\. Tang, X\. Yin, C\. Zhuo, Q\. Wei, and F\. Qiao \(2023\) Breaking the energy\-efficiency barriers for smart sensing applications with “sensing with computing” architectures\. Science China Information Sciences 66\(10\), pp\. 200409\. Cited by: [§2\.2](https://arxiv.org/html/2605.22868#S2.SS2.p1.1)\.
- S\. Yun, H\. Chen, R\. Masukawa, H\. Errahmouni Barkam, A\. Ding, W\. Huang, A\. Rezvani, S\. Angizi, and M\. Imani \(2024\) HyperSense: hyperdimensional intelligent sensing for energy\-efficient sparse data processing\. Advanced Intelligent Systems 6\(12\), pp\. 2400228\. Cited by: [§1](https://arxiv.org/html/2605.22868#S1.p2.1), [§2\.2](https://arxiv.org/html/2605.22868#S2.SS2.p1.1), [§3\.1](https://arxiv.org/html/2605.22868#S3.SS1.p1.1), [§4\.5](https://arxiv.org/html/2605.22868#S4.SS5.p2.2)\.
- S\. Yun, R\. Masukawa, H\. Chen, S\. Jeong, W\. Huang, A\. Rezvani, M\. Na, Y\. Yamaguchi, and M\. Imani \(2025a\) Hyperdimensional intelligent sensing for efficient real\-time audio processing on extreme edge\. IEEE Access\. Cited by: [§2\.2](https://arxiv.org/html/2605.22868#S2.SS2.p1.1)\.
- S\. Yun, R\. Masukawa, R\. Hassan, M\. Na, and M\. Imani \(2026a\) Contextual fusion strategies for multimodal gnn\-based reasoning: performance and computational trade\-offs\. IEEE Access\. Cited by: [§2\.2](https://arxiv.org/html/2605.22868#S2.SS2.p1.1)\.
- S\. Yun, R\. Masukawa, M\. Na, and M\. Imani \(2025b\) Missiongnn: hierarchical multimodal gnn\-based weakly supervised video anomaly recognition with mission\-specific knowledge graph generation\. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision \(WACV\), pp\. 4736–4745\. Cited by: [§1](https://arxiv.org/html/2605.22868#S1.p1.1)\.
- S\. Yun, H\. Oh, R\. Masukawa, and M\. Imani \(2026b\) DecoHD: decomposed hyperdimensional classification under extreme memory budgets\. In 2026 Design, Automation & Test in Europe Conference \(DATE\)\. Cited by: [§2\.2](https://arxiv.org/html/2605.22868#S2.SS2.p1.1)\.
- S\. Yun, H\. Oh, R\. Masukawa, P\. Mercati, N\. D\. Bastian, and M\. Imani \(2026c\) LogHD: robust compression of hyperdimensional classifiers via logarithmic class\-axis reduction\. In 2026 Design, Automation & Test in Europe Conference \(DATE\)\. Cited by: [§1](https://arxiv.org/html/2605.22868#S1.p1.1)\.
- T\. Zhi\-Xuan, H\. Soh, and D\. Ong \(2020\) Factorized inference in deep markov models for incomplete multimodal time series\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 34, pp\. 10334–10341\. Cited by: [§2\.3](https://arxiv.org/html/2605.22868#S2.SS3.p1.1)\.