Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming
Summary
Co-GLANCE is a real-time onboard perception and decision-making system for heterogeneous robot teams that distills vision-language model capabilities into efficient models and uses conformal prediction with selective abstention to quantify and resolve perceptual uncertainty, outperforming cloud-based VLM baselines by 25-36% while achieving 350x lower latency.
# Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming
Source: [https://arxiv.org/html/2606.09919](https://arxiv.org/html/2606.09919)
Michal P. Podolinsky∗, Neel P. Bhatt∗, Pranay Samineni, Rohan Siva, Christian Ellis, Ufuk Topcu
The University of Texas at Austin
∗Equal Contribution
{michal.podolinsky, npbhatt, pranay_s, rohansiva}@utexas.edu, chrisitan.ellis@austin.utexas.edu, utopcu@utexas.edu
###### Abstract
Perceptual uncertainty is a central challenge for heterogeneous robot teams operating in unstructured outdoor environments, where no single viewpoint affords reliable scene understanding. Perceptual uncertainty, arising from sources such as occlusions, manifests differently across robot viewpoints depending on scene structure. Detecting and resolving sources of perceptual uncertainty requires both scene-based contextual reasoning and capability-aware robot allocation. While vision-language models provide strong semantic priors for both, they are computationally prohibitive for onboard inference and lack calibrated uncertainty quantification. We introduce Co-GLANCE, a real-time onboard perception and decision-making system for uncertainty resolution in heterogeneous robot teams. Co-GLANCE distills the semantic reasoning capabilities of a vision-language model into an end-to-end model for occlusion segmentation and robot allocation, eliminating the need for cloud-based inference. To quantify perceptual uncertainty, Co-GLANCE combines conformal prediction with selective abstention to provide statistically valid coverage guarantees for segmentation, robot allocation, and detection outputs. These calibrated uncertainty estimates directly trigger active perception, dispatching the most appropriate robot to acquire informative viewpoints and resolve uncertainty. Across real-world scenarios, Co-GLANCE outperforms cloud-based vision-language model baselines in occlusion segmentation and robot allocation accuracy by 25% and 36%, respectively, while reducing per-frame inference latency by $350\times$. We also release an air-ground dataset for future research. Code, videos, and dataset are available at: [co-glance.github.io](https://co-glance.github.io/).
> Keywords: Heterogeneous Robot Teams, Active Perception, Uncertainty Quantification, Vision-Language Models, Knowledge Distillation
## 1 Introduction
Figure 1: Air-ground robot teaming setting.
Figure 2: Co-GLANCE system overview: (1) perceptual uncertainty detection, (2) occlusion uncertainty, (3) resolution of high-uncertainty areas, (4) object detection, (5) detection uncertainty, and (6) uncertainty-driven active perception.
Heterogeneous air-ground robot teams provide complementary sensing and mobility capabilities for operating in complex outdoor environments. However, no single viewpoint affords reliable scene understanding in unstructured settings. Perceptual uncertainty arising from occlusions manifests differently across platforms depending on scene geometry and traversal capability: vegetation may obstruct an aerial robot while remaining transparent to a ground robot beneath the canopy, whereas obstacles insignificant from above may fully occlude a ground robot's view. Hence, detecting and resolving uncertainty requires contextual scene understanding and capability-aware robot coordination.
Recent advances in Vision\-Language Models \(VLMs\) have shown promise for semantic reasoning in heterogeneous robotic systems\[[19](https://arxiv.org/html/2606.09919#bib.bib62),[16](https://arxiv.org/html/2606.09919#bib.bib32),[21](https://arxiv.org/html/2606.09919#bib.bib2),[5](https://arxiv.org/html/2606.09919#bib.bib271)\]\. In principle, they can identify ambiguous regions and reason about which platform should resolve them\. In practice, however, they are computationally expensive, often require cloud inference, and lack calibrated uncertainty estimates\. Existing active perception methods similarly rely on heuristic or uncalibrated confidence, limiting reliability in safety\-critical settings\.
Conformal prediction provides distribution\-free coverage guarantees for uncertainty quantification\[[2](https://arxiv.org/html/2606.09919#bib.bib178)\]\. However, standard methods produce prediction sets rather than decisions, while selective prediction introduces abstention without resolving uncertainty\. In heterogeneous teams, uncertainty must be actionable: deciding when additional sensing is needed and which robot should acquire it\.
To address these challenges, we introduce Co-GLANCE, an onboard uncertainty-aware perception and decision-making framework for heterogeneous robot teams. Co-GLANCE distills semantic reasoning from a VLM into a lightweight end-to-end model for occlusion segmentation and robot allocation, removing cloud inference. We further introduce a contextual self-review mechanism that improves consistency of VLM-generated supervision via multi-turn refinement in a cached conversational context. We combine selective abstention with conformal prediction to produce calibrated uncertainty estimates for segmentation, robot allocation, and detection, which directly drive active perception and robot dispatch.
- Onboard Uncertainty-Aware Perception for Heterogeneous Robot Teams. Co-GLANCE performs real-time occlusion segmentation and robot allocation, improving accuracy by 25% and 36% over cloud-based vision-language model baselines.
- Calibrated Uncertainty Estimation for Active Perception. We combine selective abstention and conformal prediction to provide statistically valid uncertainty guarantees for segmentation, robot allocation, and object detection outputs.
- Contextual Self-Review for VLM Distillation. We introduce a multi-turn self-review mechanism that improves the consistency of VLM-generated supervision for occlusion reasoning and robot assignment.
- Real-World Deployment and Dataset Release. We validate Co-GLANCE on aerial-ground robots, achieving $350\times$ lower inference latency and releasing a multimodal air-ground dataset.
## 2 Related Work
Foundation Models for Multi-Robot Systems. Recent work has explored large language and vision-language models for heterogeneous robot coordination and planning [[9](https://arxiv.org/html/2606.09919#bib.bib60), [12](https://arxiv.org/html/2606.09919#bib.bib59), [26](https://arxiv.org/html/2606.09919#bib.bib58), [8](https://arxiv.org/html/2606.09919#bib.bib57)]. SPINE and SPINE-HT [[21](https://arxiv.org/html/2606.09919#bib.bib2), [19](https://arxiv.org/html/2606.09919#bib.bib62)] extend these ideas to unstructured environments via semantic mapping and feasibility-aware planning. While effective for high-level reasoning and task decomposition, these approaches focus less on uncertainty-aware perception and typically rely on cloud inference without calibrated reliability guarantees. In contrast, our work targets onboard uncertainty-aware perception and capability-aware robot allocation for heterogeneous teams.
VLM Distillation for Robotics. Recent work uses VLMs as training-time supervisors for lightweight downstream models [xuVLMAD2024, [14](https://arxiv.org/html/2606.09919#bib.bib24)], transferring multimodal reasoning into compact deployable networks. Applications span autonomous driving, navigation, medical segmentation, and remote perception. Most relevant, [[20](https://arxiv.org/html/2606.09919#bib.bib270)] distills language reasoning for onboard inference but still relies on external visual reasoning modules and does not address uncertainty-aware perception. In contrast, we distill both occlusion reasoning and robot allocation into an end-to-end onboard model, while refining pseudo-labels via contextual self-review.
Active Perception and Uncertainty Quantification. Active perception methods aim to select informative viewpoints to reduce ambiguity [[4](https://arxiv.org/html/2606.09919#bib.bib158), yangHEHAHierarchicalPlanning2025, [16](https://arxiv.org/html/2606.09919#bib.bib32)], but often rely on heuristic or uncalibrated confidence signals. Conformal prediction provides distribution-free uncertainty guarantees [[22](https://arxiv.org/html/2606.09919#bib.bib175), [2](https://arxiv.org/html/2606.09919#bib.bib178)], while selective prediction improves reliability via abstention [[1](https://arxiv.org/html/2606.09919#bib.bib83), [6](https://arxiv.org/html/2606.09919#bib.bib139)]. However, conformal methods yield prediction sets that are difficult to use directly in planning, and selective prediction does not resolve abstentions. We instead combine both within a heterogeneous perception framework where calibrated uncertainty directly drives robot allocation and active perception.
## 3 Methodology
We provide a visual overview of Co-GLANCE in [Figure 2](https://arxiv.org/html/2606.09919#S1.F2). Co-GLANCE combines context-aware occlusion segmentation (§[3.1](https://arxiv.org/html/2606.09919#S3.SS1), [Figure 2](https://arxiv.org/html/2606.09919#S1.F2) ①), calibrated perception guarantees (§[3.2](https://arxiv.org/html/2606.09919#S3.SS2), [Figure 2](https://arxiv.org/html/2606.09919#S1.F2) ② ⑤), and capability-aware robot allocation (§[3.3](https://arxiv.org/html/2606.09919#S3.SS3), [Figure 2](https://arxiv.org/html/2606.09919#S1.F2) ③) to resolve visible occlusions under onboard computational constraints. Low-confidence detections ([Figure 2](https://arxiv.org/html/2606.09919#S1.F2) ④) trigger active perception ([Figure 2](https://arxiv.org/html/2606.09919#S1.F2) ⑥) until a high-confidence observation is confirmed ([Figure 2](https://arxiv.org/html/2606.09919#S1.F2) ⑤).
We define: (1) Occluded area: A region currently not visible because an object lies between it and the active viewpoint. We further require that the occluded area is large enough to hypothetically conceal a person in any posture. (2) Platform allocation label: Encodes which platform is *necessary* to resolve an occlusion, not which is currently closest or most convenient. The label space is $\{\texttt{ground}, \texttt{both}, \texttt{either}\}$, where `ground` requires the ground robot, `both` requires both robots, and `either` permits flexible assignment. (3) Active perception: The deliberate dispatch of an agent to a viewpoint that reduces uncertainty on the identity of an ambiguous object [[3](https://arxiv.org/html/2606.09919#bib.bib5)].
### 3.1 Perceptual Uncertainty Detection
Figure 3: Perceptual uncertainty detection: (1) occlusion segmentation and robot allocation by VLM with self-review, (2) knowledge distillation, and (3) onboard inference using the distilled model.
Co-GLANCE distills occlusion segmentation and platform allocation from a large VLM into a lightweight YOLO-seg-nano model for onboard inference (Figure [3](https://arxiv.org/html/2606.09919#S3.F3)). The VLM is prompted to generate occluder keywords from aerial RGB frames (Figure [3](https://arxiv.org/html/2606.09919#S3.F3)(1a)), which are passed to an open-vocabulary segmentation model to produce candidate masks (Figure [3](https://arxiv.org/html/2606.09919#S3.F3)(1b)). Since the VLM cannot reliably predict how its keywords will be grounded, the resulting masks often misalign with the intended regions or miss occlusions entirely. To address this, a contextual self-review stage presents the candidate masks back to the VLM in a multi-turn conversation, allowing it to remove incorrect masks, refine misaligned regions, propose new keywords, and assign platform labels (Figure [3](https://arxiv.org/html/2606.09919#S3.F3)(1c–d)). The distilled model trained on these pseudo-labels segments occlusions and allocates platforms in a single forward pass. Full VLM prompts and additional details are provided in [Appendix A](https://arxiv.org/html/2606.09919#A1).
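The control flow of the pseudo-labeling loop described above can be sketched as follows. This is only an illustrative outline: the `Candidate` and `Review` types, the `propose`, `ground`, and `review` callables, and the `max_turns` cap are hypothetical names introduced here, and the toy functions stand in for the VLM and the open-vocabulary segmenter. Mask refinement is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

LABELS = ("ground", "both", "either")  # allocation label space from Section 3


@dataclass
class Candidate:
    keyword: str                 # occluder keyword proposed by the VLM
    mask_id: int                 # handle to a mask from the segmenter
    label: Optional[str] = None  # platform label assigned during review


@dataclass
class Review:
    keep: Dict[int, str]     # mask_id -> platform label for accepted masks
    new_keywords: List[str]  # extra keywords the reviewer wants grounded


def build_pseudo_labels(frame, propose, ground, review, max_turns=3):
    """Keyword proposal -> grounding -> self-review, repeated up to max_turns."""
    keywords = propose(frame)
    accepted: Dict[int, Candidate] = {}
    for _ in range(max_turns):
        if not keywords:
            break
        candidates = ground(frame, keywords)
        verdict = review(frame, candidates)
        for cand in candidates:
            label = verdict.keep.get(cand.mask_id)
            if label in LABELS:  # masks missing from `keep` are dropped
                cand.label = label
                accepted[cand.mask_id] = cand
        keywords = verdict.new_keywords  # only new keywords are re-grounded
    return list(accepted.values())


# Toy stand-ins (hypothetical) so the sketch runs without any model.
_ids = iter(range(1000))


def toy_propose(frame):
    return ["tree", "sky"]


def toy_ground(frame, keywords):
    return [Candidate(k, next(_ids)) for k in keywords]


def toy_review(frame, candidates):
    keep = {c.mask_id: "ground" for c in candidates if c.keyword != "sky"}
    new = ["wall"] if any(c.keyword == "tree" for c in candidates) else []
    return Review(keep, new)


labels = build_pseudo_labels(None, toy_propose, toy_ground, toy_review)
print([(c.keyword, c.label) for c in labels])
```

In this toy run the reviewer rejects the `sky` mask and requests one extra keyword, so the loop terminates after the second review turn.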
### 3.2 Uncertainty Quantification
Co-GLANCE applies a two-stage uncertainty quantification scheme to two prediction types: joint occlusion segmentation and platform allocation, and person detection. The scheme is identical for both. The risk-controlled stage uses selective abstention [[1](https://arxiv.org/html/2606.09919#bib.bib83)] to produce guaranteed singleton predictions wherever confidence is sufficient; the coverage-controlled stage uses conformal prediction [[2](https://arxiv.org/html/2606.09919#bib.bib178)] to provide calibrated set predictions on the remainder, informing active perception while it is underway. The two-stage uncertainty quantification scheme is further detailed in [Appendix B](https://arxiv.org/html/2606.09919#A2).
#### Stage 1 \- Risk\-Controlled Stage
Let $(X_i, Y_i)_{i=1,\ldots,n}$ be i.i.d. sample-label pairs, $\hat{Y}(X_i) = \arg\max_y \hat{f}(X_i)$ the model prediction, and $\hat{P}(X_i) = \max_y \hat{f}(X_i)$ its confidence. For occlusion segmentation and allocation, $\hat{Y}(X_i)$ is the correct mask and allocation label jointly; for person detection, $\hat{Y}(X_i)$ is the correct person mask. The empirical risk is $\hat{R}(\lambda) = \frac{1}{n(\lambda)} \sum_{i=1}^{n} \mathbb{1}\{Y_i \neq \hat{Y}(X_i) \text{ and } \hat{P}(X_i) \geq \lambda\}$, where $n(\lambda) = \sum_{i=1}^{n} \mathbb{1}\{\hat{P}(X_i) \geq \lambda\}$. Treating $\hat{R}(\lambda)$ as a Binomial random variable, its upper confidence bound is $\hat{R}^{+}(\lambda) = \sup\{r : \mathrm{BinomCDF}(\hat{R}(\lambda); n(\lambda), r) \geq \delta\}$. We select $\hat{\lambda}$ as the last value on a discrete grid (fixed sequence testing) where $\hat{R}^{+}(\lambda) \leq \alpha$, yielding:
$$\mathbb{P}\!\big(\mathbb{P}(Y_{\text{true}} = Y_{\text{pred}} \mid \hat{P}(X_{\text{test}}) \geq \hat{\lambda}) \geq 1-\alpha\big) \geq 1-\delta, \tag{1}$$
where $\alpha$ is the risk tolerance, $\delta$ the confidence parameter, and the outer probability is over the calibration set [[1](https://arxiv.org/html/2606.09919#bib.bib83)]. In words: with probability at least $1-\delta$, predictions above $\hat{\lambda}$ are correct with probability at least $1-\alpha$. Predictions above $\hat{\lambda}$ are used for decision-making; those below $\hat{\lambda}$ are passed to stage 2 and generate an active perception request.
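A minimal sketch of the stage 1 calibration, assuming calibration confidences and correctness flags are available as plain Python lists. The function names, the bisection search, the descending grid order, and the toy data are illustrative assumptions, not the authors' implementation; the binomial CDF is evaluated at the error count $n(\lambda)\hat{R}(\lambda)$.

```python
from math import exp, lgamma, log


def binom_cdf(k, n, r):
    """P[Binomial(n, r) <= k], summed in log-space for numerical stability."""
    if r <= 0.0:
        return 1.0
    if r >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for j in range(k + 1):
        log_pmf = (lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
                   + j * log(r) + (n - j) * log(1.0 - r))
        total += exp(log_pmf)
    return min(total, 1.0)


def risk_upper_bound(errors, n, delta, tol=1e-6):
    """Approximate sup{r : BinomCDF(errors; n, r) >= delta} by bisection."""
    if n == 0:
        return 1.0  # nothing accepted: no evidence, so the bound is trivial
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binom_cdf(errors, n, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return hi  # slightly conservative (at most tol above the supremum)


def select_threshold(conf, correct, grid, alpha, delta):
    """Walk the grid from strictest to loosest; stop at the first failure."""
    chosen = None
    for lam in sorted(grid, reverse=True):
        kept = [c for p, c in zip(conf, correct) if p >= lam]
        errors = sum(1 for c in kept if not c)
        if risk_upper_bound(errors, len(kept), delta) > alpha:
            break
        chosen = lam
    return chosen  # None means even the strictest threshold failed


# Toy calibration set: 200 confident and correct, 100 uncertain and half wrong.
conf = [0.9] * 200 + [0.5] * 100
correct = [True] * 200 + [True] * 50 + [False] * 50
lam_hat = select_threshold(conf, correct, grid=[0.8, 0.4], alpha=0.15, delta=0.1)
print(lam_hat)
```

Because the walk stops at the first grid value whose bound exceeds $\alpha$, the grid should start at a threshold that still retains enough calibration samples for the bound to be informative.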
#### Stage 2 \- Coverage\-Controlled Stage
Let $(X_i, Y_i)_{i=1,\ldots,n}$ be i.i.d. calibration samples with $\hat{P}(X_i) < \hat{\lambda}$. We define the nonconformity score $S(X_i, Y_i) = 1 - \hat{f}_{Y_i}(X_i)$, where higher values indicate worse agreement between prediction and label, and compute the calibration quantile $\hat{q}$ at level $\lceil (n+1)(1-\epsilon) \rceil / n$. The prediction set is then:
$$\hat{C}(X_{\text{test}}) = \left\{ y : \hat{f}_y(X_{\text{test}}) \geq 1 - \hat{q} \right\}, \tag{2}$$
which yields the marginal coverage guarantee $\mathbb{P}[Y_{\text{test}} \in \hat{C}(X_{\text{test}})] \geq 1-\epsilon$ [[2](https://arxiv.org/html/2606.09919#bib.bib178)], where probability is taken jointly over calibration and test samples. This allows Co-GLANCE to make predictions on all model outputs.
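The stage 2 computation can be sketched in a few lines, assuming per-class probabilities are stored in dictionaries. The helper names and the toy calibration scores are hypothetical; the quantile is taken as the $\lceil (n+1)(1-\epsilon) \rceil$-th smallest calibration score, which is the order statistic corresponding to the level stated above.

```python
from math import ceil


def nonconformity(probs, true_label):
    """S(X, Y) = 1 - f_Y(X): large when the true label gets little mass."""
    return 1.0 - probs[true_label]


def conformal_quantile(scores, eps):
    """Return the ceil((n + 1)(1 - eps))-th smallest calibration score."""
    n = len(scores)
    rank = ceil((n + 1) * (1 - eps))  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few samples: every label must be included
    return sorted(scores)[rank - 1]


def prediction_set(probs, q_hat):
    """C(X) = {y : f_y(X) >= 1 - q_hat}."""
    return {y for y, p in probs.items() if p >= 1 - q_hat}


# Toy scores from calibration samples that fell below the stage 1 threshold.
cal_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.55, 0.70, 0.90]
q_hat = conformal_quantile(cal_scores, eps=0.25)
labels = prediction_set({"ground": 0.50, "both": 0.35, "either": 0.15}, q_hat)
print(q_hat, sorted(labels))
```

When the calibration set is too small for the requested $\epsilon$, the sketch falls back to the full label set, which keeps the coverage statement valid at the cost of an uninformative prediction.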
### 3.3 Uncertainty Resolution
#### Robot Allocation and Routing
We feed certified allocation labels and agent positions into a Heterogeneous Vehicle Routing Problem (HVRP) [ortools], minimizing total heuristic travel cost. Euclidean distances serve as cost approximations, and ground robot traversal is weighted $10\times$ higher than aerial traversal to reflect its slower speed and terrain constraints. Occlusions labeled `ground` are assigned exclusively to the ground robot, occlusions labeled `either` are assigned to either robot, and occlusions labeled `both` or flagged as low-confidence appear in both routes as independent nodes. The ground robot is dispatched to a ground-level viewpoint for each assigned occlusion while the aerial robot performs a fixed circular sweep around each occlusion. See plots in [Appendix C](https://arxiv.org/html/2606.09919#A3).
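To make the label-to-route mapping concrete, the sketch below builds per-robot node lists and orders them with a greedy nearest-neighbor heuristic. The greedy ordering is a deliberately simple stand-in, not the HVRP solver used in the paper, and the function names, the tie-breaking rule for `either`, and the coordinates are assumptions made for illustration.

```python
from math import dist

GROUND_WEIGHT = 10.0  # ground traversal weighted 10x aerial, as stated above
AERIAL_WEIGHT = 1.0


def assign_nodes(occlusions, starts):
    """occlusions: list of (xy, label); label None marks a low-confidence case."""
    nodes = {"ground": [], "aerial": []}
    for xy, label in occlusions:
        if label == "ground":
            nodes["ground"].append(xy)
        elif label == "either":
            # Assumption: pick the robot with the cheaper weighted approach.
            g = GROUND_WEIGHT * dist(starts["ground"], xy)
            a = AERIAL_WEIGHT * dist(starts["aerial"], xy)
            nodes["ground" if g < a else "aerial"].append(xy)
        else:  # "both" or abstained: an independent node in both routes
            nodes["ground"].append(xy)
            nodes["aerial"].append(xy)
    return nodes


def greedy_route(start, nodes, weight):
    """Nearest-neighbor visiting order with weighted Euclidean cost."""
    route, cost, here, todo = [], 0.0, start, list(nodes)
    while todo:
        nxt = min(todo, key=lambda p: dist(here, p))
        cost += weight * dist(here, nxt)
        route.append(nxt)
        todo.remove(nxt)
        here = nxt
    return route, cost


starts = {"ground": (0.0, 0.0), "aerial": (0.0, 0.0)}
occlusions = [((3.0, 4.0), "ground"), ((6.0, 8.0), "either"),
              ((0.0, 5.0), "both"), ((1.0, 0.0), None)]
nodes = assign_nodes(occlusions, starts)
ground_route, ground_cost = greedy_route(starts["ground"], nodes["ground"], GROUND_WEIGHT)
aerial_route, aerial_cost = greedy_route(starts["aerial"], nodes["aerial"], AERIAL_WEIGHT)
print(ground_route, aerial_route)
```

A real solver would optimize both routes jointly; the sketch only shows how the three labels and abstentions translate into routing nodes and weighted costs.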
#### Active Perception
Co-GLANCE supports active perception through two channels: uncertain robot allocation and uncertain object detection. For robot allocation, stage 1 singleton labels are passed directly to the planner; abstentions trigger a conservative both-agent dispatch, ensuring all high-uncertainty regions are visited. For object detection, stage 1 abstentions trigger an active perception request dispatching the ground agent to acquire a closer viewpoint, while stage 2 CP sets inform the system of the most likely object classes present. In our experiments, allocation abstentions are resolved conservatively via both-agent dispatch rather than through CP prediction sets; the sets could enable a dynamic re-routing strategy as agents observe the environment. For object detection, CP sets can be used to track possible object identities during active perception.
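The two channels can be summarized as a small decision rule. The sketch below is an assumption-laden illustration: the function names, the returned dictionary layout, and the class names `person` and `background` are hypothetical, and the thresholds are placeholder values rather than calibrated ones.

```python
def plan_allocation(label, confidence, lam_hat):
    """Stage 1 singleton goes to the planner; an abstention becomes 'both'."""
    return label if confidence >= lam_hat else "both"


def plan_detection(probs, lam_hat, q_hat):
    """Return a stage 1 decision, or a stage 2 candidate set plus a dispatch flag."""
    best = max(probs, key=probs.get)
    if probs[best] >= lam_hat:
        return {"decision": best, "dispatch_ground": False, "candidates": {best}}
    # Abstention: request a closer ground viewpoint and keep the CP set
    # as the list of identities still considered plausible.
    candidates = {y for y, p in probs.items() if p >= 1 - q_hat}
    return {"decision": None, "dispatch_ground": True, "candidates": candidates}


sure = plan_detection({"person": 0.92, "background": 0.08}, lam_hat=0.8, q_hat=0.6)
unsure = plan_detection({"person": 0.55, "background": 0.45}, lam_hat=0.8, q_hat=0.6)
print(sure["decision"], unsure["dispatch_ground"], sorted(unsure["candidates"]))
```

Mapping allocation abstentions to `both` mirrors the conservative dispatch used in the experiments; replacing that branch with the CP set would be one way to explore the re-routing strategy mentioned above.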
## 4 Experiments
We evaluate Co-GLANCE in semi-structured outdoor environments containing both natural vegetation and built infrastructure, where occlusions are highly viewpoint-dependent and often require complementary aerial and ground observations. In addition to validating the system in real-world conditions, we release a multimodal air-ground perception dataset collected in these environments to support future research in heterogeneous robot perception and uncertainty-aware active perception.
### 4.1 Experimental Setup
#### Robot Platform
We pair a DJI Matrice 600 aerial robot with a Boston Dynamics Spot quadruped ground robot\. The quadruped can traverse rough terrain and dense vegetation that wheeled platforms cannot\[biswal2021development\], while the aerial robot provides unconstrained overhead views\. Additional information can be found in[Appendix D](https://arxiv.org/html/2606.09919#A4)\.
#### Environment
We train the distilled model on over 10000 masks and test it on nearly 200 masks from a site containing the aforementioned vegetation and structural characteristics. In addition, we performed a real-world demonstration of Co-GLANCE in two scenarios while initializing the robots from opposite sides of the scene, thereby inducing different initial viewing geometries.
#### Demonstration Scenarios
In each scenario, Co-GLANCE operates in a single deployment in which both the aerial and ground robots are initialized from the same initial observation and executed simultaneously under identical environmental conditions. The aerial robot is responsible for onboard perception and for generating waypoints for both platforms, while the ground robot executes the received trajectory. We evaluate two initial configurations in the same environment so as to induce different initial viewing geometries without changing the underlying perception problem.
#### Baselines
We compare against three baselines for occlusion prediction and robot allocation, all sharing the same routing module for navigation (§[3.3](https://arxiv.org/html/2606.09919#S3.SS3.SSS0.Px1)). Expert serves as an empirical upper bound as it has access to ground truth masks. The VLM baseline is a zero-shot, cloud-based system combining ChatGPT-5.4 with Grounding DINO-base [[13](https://arxiv.org/html/2606.09919#bib.bib7)] and SAM 2 Hiera-Large [[18](https://arxiv.org/html/2606.09919#bib.bib6)]. VLM + Self-Review augments this pipeline with our contextual self-review mechanism (§[3.1](https://arxiv.org/html/2606.09919#S3.SS1)).
#### Metrics
We evaluate along three axes. Perception quality includes occlusion detection accuracy (fraction of correctly identified occlusions), robot allocation accuracy (agreement with expert assignments), and segmentation performance measured via precision, recall, and F1 score at an IoU threshold of 0.5 between predicted and ground-truth occlusion masks. Calibration characterizes both empirical and guaranteed stage 1 error rates, the abstention rate, and the stage 2 set-valued error rate, quantifying adherence to the target risk while maintaining actionable predictions. Efficiency considers model size, per-frame inference latency, and API token usage.
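As an illustration of the segmentation metrics, the sketch below scores predicted masks against ground-truth masks at an IoU threshold of 0.5. Masks are represented as sets of pixel coordinates, and the greedy one-to-one matching rule is an assumption: the text does not specify how predictions are matched to ground truth.

```python
def iou(a, b):
    """Intersection over union of two masks given as sets of pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def precision_recall_f1(pred, gt, thr=0.5):
    """Greedy one-to-one matching: each ground-truth mask is matched at most once."""
    unmatched = list(range(len(gt)))
    tp = 0
    for p in pred:
        best = max(unmatched, key=lambda j: iou(p, gt[j]), default=None)
        if best is not None and iou(p, gt[best]) >= thr:
            tp += 1
            unmatched.remove(best)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gt_masks = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6)}]
pred_masks = [{(0, 0), (0, 1), (1, 0)}, {(9, 9)}]
scores = precision_recall_f1(pred_masks, gt_masks)
print(scores)
```

The first prediction overlaps its ground-truth mask with IoU 0.75 and counts as a true positive; the second overlaps nothing and counts as a false positive, while the second ground-truth mask is missed.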
### 4.2 Dataset
Table 1: Comparison of datasets for aerial, quadruped, and air-ground collaborative perception.

| Dataset | Robots | Platforms | Real / Sim | Sensors | Ground Truth | RTK / GPS | Full Bags |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QROD-111 [[11](https://arxiv.org/html/2606.09919#bib.bib71)] | 1 | Quadruped | Real | RGB | 2D boxes, tracking IDs | ✗ | ✗ |
| EAGLE / CEAR [[25](https://arxiv.org/html/2606.09919#bib.bib72)] | 1 | Quadruped | Real | Event, RGB-D, IMU, LiDAR, Joint Encoder | Robot perception / odometry labels | ✗ | ✗ |
| CDrone [[15](https://arxiv.org/html/2606.09919#bib.bib70)] | 1 | UAV | Sim | RGB | 2D/3D boxes, tracking, depth, segmentation | ✗ | ✗ |
| UVCPNet / V2U-COO [[24](https://arxiv.org/html/2606.09919#bib.bib65)] | 2 | UAV + vehicle | Sim | Camera | 3D object boxes | ✗ | ✗ |
| M3OT [[17](https://arxiv.org/html/2606.09919#bib.bib69)] | 2 | UAVs | Real | RGB, thermal | 2D boxes, tracking IDs | ✗ | ✗ |
| Griffin [[23](https://arxiv.org/html/2606.09919#bib.bib66)] | 2 | UAV + vehicle | Sim | Camera / LiDAR | 3D boxes, tracking, occlusion labels | ✗ | ✗ |
| Our Dataset | 2 | UAV + quadruped | Real | RGB, RTK GPS, IMU | 2D boxes, tracking IDs | ✓ | ✓ |

Real air-ground data is costly to collect, requiring two robots operating outdoors simultaneously with synchronized sensing and metric localization across platforms. As shown in [Table 1](https://arxiv.org/html/2606.09919#S4.T1), many existing datasets sidestep this with simulation, road scenes, or homogeneous UAV teams, and release processed labels rather than raw sensor streams. Our dataset addresses this by providing more than 4000 synchronized aerial and ground frames across several scenarios, recorded with a DJI Matrice 600 and a Boston Dynamics Spot in semi-structured outdoor terrain. Depending on the scenario, available streams include RGB, estimated depth, RTK GPS, and IMU; raw ROS 2 bags from both platforms are also released to support evaluation of perception and autonomy stacks beyond static image benchmarks. A full breakdown of dataset scenarios and sensor availability is provided in [Appendix E](https://arxiv.org/html/2606.09919#A5). The dataset is available at [co-glance.github.io](https://co-glance.github.io/).
### 4.3 Quantitative Results
#### Distilled Model Performance
We present a comparison of the VLM baseline (with and without self-review) against the distilled model in Table [2](https://arxiv.org/html/2606.09919#S4.T2). The distilled model outperforms the VLM without review on precision, recall, and F1 score by 22, 15, and 19 percentage points, respectively, which highlights that the distilled model can generalize away from noise in the VLM's pseudo-labels. Moreover, self-review improves recall and allocation accuracy by over 15% by grounding the VLM's keywords against the generated segmentation masks. The distilled model trails VLM self-review on allocation accuracy by 5 percentage points, because the self-review allocation step is harder to distill than the segmentation task itself. However, given the gains on other metrics, Co-GLANCE provides an overall balanced approach. Additional quantitative results are provided in [Appendix F](https://arxiv.org/html/2606.09919#A6).
Table 2: Model-level evaluation on $n=199$ hand-annotated masks across 34 held-out frames. These frames were not seen during model training.

| System | Precision | Recall | F1 | Alloc. Acc. |
| --- | --- | --- | --- | --- |
| VLM (no review) | 0.458 | 0.543 | 0.497 | 0.694 |
| VLM (self-review) | 0.489 | 0.668 | 0.565 | 0.850 |
| Co-GLANCE (distilled) [ours] | 0.680 | 0.693 | 0.687 | 0.797 |
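As a quick consistency check on Table 2, the short script below recomputes F1 as the harmonic mean of the reported precision and recall and lists the gaps between the distilled model and the VLM without review. It uses only the rounded table values, so the recomputed F1 can differ from the reported one in the last digit; the variable names are illustrative.

```python
# (precision, recall, reported F1) copied from Table 2.
table2 = {
    "VLM (no review)": (0.458, 0.543, 0.497),
    "VLM (self-review)": (0.489, 0.668, 0.565),
    "Co-GLANCE (distilled)": (0.680, 0.693, 0.687),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {name: f1(p, r) for name, (p, r, _) in table2.items()}

# Absolute gaps, in percentage points, between the distilled model and the
# VLM without review for precision, recall, and F1.
base = table2["VLM (no review)"]
ours = table2["Co-GLANCE (distilled)"]
gains_pp = [round(100 * (o - b), 1) for o, b in zip(ours, base)]

for name, value in recomputed.items():
    print(f"{name}: recomputed F1 = {value:.3f}, reported = {table2[name][2]:.3f}")
print("gains (percentage points):", gains_pp)
```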
#### Effect of Uncertainty Quantification
Table [3](https://arxiv.org/html/2606.09919#S4.T3) summarizes the effect of incorporating uncertainty quantification across robot allocation and person detection. The guaranteed error rates are chosen to balance error rate against abstention rate, as different tasks may tolerate higher abstention in exchange for lower error. We find that $\alpha = 15\%$ for robot allocation and $\alpha = 20\%$ for person detection offer a reasonable operating point under this tradeoff. The stage 2 error rate $\epsilon$ is chosen to match the stage 1 error rate for increased interoperability. For both tasks, stage 1 ensures by construction that, with probability at least $1-\delta$ ($\delta = 0.1$), the guaranteed error rate is not exceeded.
For robot allocation, stage 1 (calibration $n=778$ masks, test $n=222$ masks) reduces empirical error relative to the non-UQ baseline while introducing a minimal abstention rate of just above $10\%$. Stage 2 conformal prediction further provides coverage-certified predictions for all abstained instances, ensuring that no inputs are left without a statistically valid decision. A similar trend is observed in person detection (calibration $n=126$ masks, test $n=30$ masks), where selective abstention reduces the effective empirical error rate by 16 percentage points compared to the non-UQ baseline, at the cost of a higher abstention rate due to the increased difficulty of the task and the use of an off-the-shelf detector not trained on aerial viewpoints. Despite this, stage 2 conformal prediction again ensures bounded error on the remaining ambiguous cases, completing the decision pipeline with formal guarantees.
Table 3: Probabilistic guarantees provided by Co-GLANCE. Guarantees hold by construction; empirical error rates are reported for reference, not as validation.

| Task | System | Error Rate (Stage 1) ↓ | Empirical Rate (Stage 1) ↓ | Abstention Rate | Error Rate (Stage 2) ↓ |
| --- | --- | --- | --- | --- | --- |
| Robot allocation | w/o UQ | – | 15% | – | – |
| Robot allocation | w/ UQ | ≤ 15% | 12.6% | 10.4% | ≤ 15% |
| Person detection | w/o UQ | – | 36% | – | – |
| Person detection | w/ UQ | ≤ 20% | 20% | 50% | ≤ 20% |
### 4.4 Demonstrations
[Figure 4](https://arxiv.org/html/2606.09919#S4.F4) shows the trajectories of the Expert, VLM, and Co-GLANCE plans across two real-world air-ground scenarios initialized from opposite sides of the environment. The Expert baseline has access to ground-truth occlusion labels, whereas the VLM baseline relies on cloud-based reasoning without calibrated uncertainty. Co-GLANCE generates onboard occlusion-aware robot allocations and invokes active perception when the aerial robot spots the person with high uncertainty, dispatching the ground robot to acquire complementary viewpoints. Across both scenarios, Co-GLANCE achieves more complete occlusion coverage and invokes active perception to resolve uncertainty while operating entirely onboard.
Table [4](https://arxiv.org/html/2606.09919#S4.T4) summarizes end-to-end system performance across two real-world deployment scenarios. Co-GLANCE achieves 25% higher occlusion detection accuracy and 36% higher robot allocation accuracy than the cloud-based VLM baseline while operating entirely onboard and without network connectivity. In addition to improving task performance, Co-GLANCE reduces per-frame inference latency by about $350\times$ and eliminates API token usage altogether. These results demonstrate that distilling VLM reasoning into a lightweight onboard model enables practical real-time deployment while preserving strong perception and allocation performance in heterogeneous air-ground settings. Video demonstrations of Co-GLANCE are available in [Appendix G](https://arxiv.org/html/2606.09919#A7).
Figure 4: Co-GLANCE compared with baselines for both scenarios.
Table 4: System evaluation over two real-world scenarios. Inference time and tokens are per-frame, dataset-wide (Co-GLANCE: 1000 frames, Xavier NX; VLM: >>1828 frames, RTX 4090).

| System | Det. Acc. | Alloc. Acc. | Infer. Time (ms/fr.) | API Tokens (avg./fr.) |
| --- | --- | --- | --- | --- |
| Expert | 12/12 | 12/12 | – | 0 |
| VLM (Cloud) | 6/12 | 5.5/12 | >>13,000 | 16.7k |
| Co-GLANCE [ours] | 7.5/12 | 7.5/12 | 37 | 0 |
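The headline figures can be reproduced from Table 4 with a few lines of arithmetic. The sketch below assumes the 25% and 36% improvements are relative to the VLM baseline, and it treats the VLM latency as exactly 13,000 ms even though the table reports it only as a lower bound, so the resulting speedup is itself a lower bound.

```python
# Values copied from Table 4 (accuracies out of 12, latency in ms per frame).
vlm = {"det": 6 / 12, "alloc": 5.5 / 12, "latency_ms": 13000}
ours = {"det": 7.5 / 12, "alloc": 7.5 / 12, "latency_ms": 37}

rel_det = ours["det"] / vlm["det"] - 1        # relative detection gain
rel_alloc = ours["alloc"] / vlm["alloc"] - 1  # relative allocation gain
speedup = vlm["latency_ms"] / ours["latency_ms"]

print(f"detection: +{100 * rel_det:.1f}%")
print(f"allocation: +{100 * rel_alloc:.1f}%")
print(f"latency: {speedup:.0f}x lower")
```

In absolute terms, the same detection gap is 1.5 of 12 occlusions, or 12.5 percentage points.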
## 5 Conclusion
We introduced Co-GLANCE, an uncertainty-aware active perception framework for heterogeneous air-ground robot teams. By distilling semantic reasoning from vision-language models into a lightweight onboard model, Co-GLANCE enables real-time occlusion segmentation, robot allocation, and uncertainty-driven active perception without cloud connectivity. Experimental results demonstrate improved perception and allocation performance over cloud-based VLM baselines while reducing inference latency by approximately $350\times$. We additionally release a multimodal air-ground perception dataset to support future research.
## 6 Limitations and Future Work
While Co-GLANCE demonstrates strong real-world performance, several limitations remain: (1) the system is evaluated in semi-structured outdoor environments with a limited number of robots, (2) the active perception policy is heuristic rather than jointly optimized with downstream planning objectives, and (3) the distilled model may inherit biases from the VLM-generated pseudo-labels used during training. Future work will explore larger heterogeneous robot teams, planner-aware active perception, and multimodal uncertainty fusion for long-horizon autonomous operation.
## References
- \[1\] \(2022\-09\) Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control\. arXiv\. Note: arXiv:2110\.01052 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2110.01052), [Document](https://dx.doi.org/10.48550/arXiv.2110.01052)\.
- \[2\] A\. N\. Angelopoulos and S\. Bates \(2022\-12\) A Gentle Introduction to Conformal Prediction and Distribution\-Free Uncertainty Quantification\. arXiv\. Note: arXiv:2107\.07511 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2107.07511), [Document](https://dx.doi.org/10.48550/arXiv.2107.07511)\.
- \[3\] R\. Bajcsy, Y\. Aloimonos, and J\. K\. Tsotsos \(2016\-03\) Revisiting Active Perception\. arXiv\. Note: arXiv:1603\.02729 \[cs\.CV\]\. External Links: [Link](http://arxiv.org/abs/1603.02729), [Document](https://dx.doi.org/10.48550/arXiv.1603.02729)\.
- \[4\] N\. P\. Bhatt, Y\. Yang, R\. Siva, D\. Milan, U\. Topcu, and Z\. Wang \(2025\-04\) Know Where You’re Uncertain When Planning with Multimodal Foundation Models: A Formal Framework\. arXiv\. Note: arXiv:2411\.01639 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2411.01639), [Document](https://dx.doi.org/10.48550/arXiv.2411.01639)\.
- \[5\] F\. Cladera, Z\. Ravichandran, J\. Hughes, V\. Murali, C\. Nieto\-Granda, M\. A\. Hsieh, G\. J\. Pappas, C\. J\. Taylor, and V\. Kumar \(2025\-05\) Air\-Ground Collaboration for Language\-Specified Missions in Unknown Environments\. arXiv\. Note: arXiv:2505\.09108 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2505.09108), [Document](https://dx.doi.org/10.48550/arXiv.2505.09108)\.
- \[6\] S\. Feldman, L\. Ringel, S\. Bates, and Y\. Romano \(2023\-01\) Achieving Risk Control in Online Learning Settings\. arXiv\. Note: arXiv:2205\.09095 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2205.09095), [Document](https://dx.doi.org/10.48550/arXiv.2205.09095)\.
- \[7\] G\. Jocher and J\. Qiu \(2026\) Ultralytics YOLO26\. External Links: [Link](https://github.com/ultralytics/ultralytics)\.
- \[8\] P\. Gupta, D\. Isele, E\. Sachdeva, P\. Huang, B\. Dariush, K\. Lee, and S\. Bae \(2025\-01\) Generalized Mission Planning for Heterogeneous Multi\-Robot Teams via LLM\-constructed Hierarchical Trees\. arXiv\. Note: arXiv:2501\.16539 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2501.16539), [Document](https://dx.doi.org/10.48550/arXiv.2501.16539)\.
- \[9\] S\. S\. Kannan, V\. L\. N\. Venkatesh, and B\. Min \(2024\-03\) SMART\-LLM: Smart Multi\-Agent Robot Task Planning using Large Language Models\. arXiv\. Note: arXiv:2309\.10062 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2309.10062), [Document](https://dx.doi.org/10.48550/arXiv.2309.10062)\.
- \[10\] A\. Krizhevsky \(2009\) Learning Multiple Layers of Features from Tiny Images\. Technical report\.
- \[11\] Y\. Li, K\. Zhang, Z\. Chen, W\. Ouyang, M\. Cui, C\. Jiang, D\. Yang, and Z\. Chen \(2023\-12\) Towards object tracking for quadruped robots\. Journal of Visual Communication and Image Representation 97, pp\. 103958\. External Links: ISSN 10473203, [Link](https://linkinghub.elsevier.com/retrieve/pii/S1047320323002080), [Document](https://dx.doi.org/10.1016/j.jvcir.2023.103958)\.
- \[12\] K\. Liu, Z\. Tang, D\. Wang, Z\. Wang, X\. Li, and B\. Zhao \(2025\-03\) COHERENT: Collaboration of Heterogeneous Multi\-Robot System with Large Language Models\. arXiv\. Note: arXiv:2409\.15146 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2409.15146), [Document](https://dx.doi.org/10.48550/arXiv.2409.15146)\.
- \[13\] S\. Liu, Z\. Zeng, T\. Ren, F\. Li, H\. Zhang, J\. Yang, Q\. Jiang, C\. Li, J\. Yang, H\. Su, J\. Zhu, and L\. Zhang \(2023\-03\) Grounding DINO: Marrying DINO with Grounded Pre\-Training for Open\-Set Object Detection\. External Links: [Link](https://arxiv.org/abs/2303.05499v5)\.
- \[14\] A\. M\. Mansourian, R\. Ahmadi, M\. Ghafouri, A\. M\. Babaei, E\. B\. Golezani, Z\. Y\. Ghamchi, V\. Ramezanian, A\. Taherian, K\. Dinashi, A\. Miri, and S\. Kasaei \(2025\-10\) A Comprehensive Survey on Knowledge Distillation\. arXiv\. Note: arXiv:2503\.12067 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2503.12067), [Document](https://dx.doi.org/10.48550/arXiv.2503.12067)\.
- \[15\] J\. Meier, L\. Scalerandi, O\. Dhaouadi, J\. Kaiser, N\. Araslanov, and D\. Cremers \(2024\-10\) CARLA Drone: Monocular 3D Object Detection from a Different Perspective\. arXiv\. Note: arXiv:2408\.11958 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2408.11958), [Document](https://dx.doi.org/10.48550/arXiv.2408.11958)\.
- \[16\] D\. Morilla\-Cabello and E\. Montijano \(2026\-01\) CHORAL: Traversal\-Aware Planning for Safe and Efficient Heterogeneous Multi\-Robot Routing\. arXiv\. Note: arXiv:2601\.10340 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2601.10340), [Document](https://dx.doi.org/10.48550/arXiv.2601.10340)\.
- \[17\] Z\. Nie, L\. Xue, Z\. Fang, J\. Ren, Y\. Wei, and J\. Zheng \(2025\-12\) M3OT: A Multi\-Drone Multi\-Modality dataset for Multi\-Object Tracking\. Scientific Data 12\(1\), pp\. 1927\. External Links: ISSN 2052\-4463, [Link](https://www.nature.com/articles/s41597-025-06204-0), [Document](https://dx.doi.org/10.1038/s41597-025-06204-0)\.
- \[18\] N\. Ravi, V\. Gabeur, Y\. Hu, R\. Hu, C\. Ryali, T\. Ma, H\. Khedr, R\. Rädle, C\. Rolland, L\. Gustafson, E\. Mintun, J\. Pan, K\. V\. Alwala, N\. Carion, C\. Wu, R\. Girshick, P\. Dollár, and C\. Feichtenhofer \(2024\-08\) SAM 2: Segment Anything in Images and Videos\. External Links: [Link](https://arxiv.org/abs/2408.00714v2)\.
- \[19\] Z\. Ravichandran, F\. Cladera, A\. Prabhu, J\. Hughes, V\. Murali, C\. Taylor, G\. J\. Pappas, and V\. Kumar \(2025\) Heterogeneous Robot Collaboration in Unstructured Environments with Grounded Generative Intelligence\. arXiv\. Note: Version Number: 1\. External Links: [Link](https://arxiv.org/abs/2510.26915), [Document](https://dx.doi.org/10.48550/ARXIV.2510.26915)\.
- \[20\] Z\. Ravichandran, I\. Hounie, F\. Cladera, A\. Ribeiro, G\. J\. Pappas, and V\. Kumar \(2025\) Distilling On\-device Language Models for Robot Planning with Minimal Human Intervention\. arXiv\. Note: Version Number: 1\. External Links: [Link](https://arxiv.org/abs/2506.17486), [Document](https://dx.doi.org/10.48550/ARXIV.2506.17486)\.
- \[21\] Z\. Ravichandran, V\. Murali, M\. Tzes, G\. J\. Pappas, and V\. Kumar \(2025\-03\) SPINE: Online Semantic Planning for Missions with Incomplete Natural Language Specifications in Unstructured Environments\. arXiv\. Note: arXiv:2410\.03035 \[cs\.RO\]\. External Links: [Link](http://arxiv.org/abs/2410.03035), [Document](https://dx.doi.org/10.48550/arXiv.2410.03035)\.
- \[22\] V\. Vovk, A\. Gammerman, and G\. Shafer \(2005\) Algorithmic learning in a random world\. Springer US\. External Links: ISBN 978\-0\-387\-00152\-4, [Link](https://www.researchwithrutgers.com/en/publications/algorithmic-learning-in-a-random-world/), [Document](https://dx.doi.org/10.1007/b106715)\.
- \[23\] J\. Wang, X\. Cao, J\. Zhong, Y\. Zhang, H\. Yu, L\. He, and S\. Xu \(2025\-03\) Griffin: Aerial\-Ground Cooperative Detection and Tracking Dataset and Benchmark\. arXiv\. Note: arXiv:2503\.06983 \[cs\], version 1\. External Links: [Link](http://arxiv.org/abs/2503.06983), [Document](https://dx.doi.org/10.48550/arXiv.2503.06983)\.
- \[24\] Y\. Wang, P\. Cheng, P\. Tian, Z\. Yuan, L\. Zhao, J\. Tian, W\. Wang, Z\. Wang, and X\. Sun \(2024\-06\) UVCPNet: A UAV\-Vehicle Collaborative Perception Network for 3D Object Detection\. arXiv\. Note: arXiv:2406\.04647 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2406.04647), [Document](https://dx.doi.org/10.48550/arXiv.2406.04647)\.
- \[25\] S\. Zhu, Z\. Xiong, and D\. Kim \(2024\-04\) EAGLE: The First Event Camera Dataset Gathered by an Agile Quadruped Robot\. arXiv\. Note: arXiv:2404\.04698 \[cs\], version 1\. External Links: [Link](http://arxiv.org/abs/2404.04698), [Document](https://dx.doi.org/10.48550/arXiv.2404.04698)\.
- \[26\] Y\. Zhu, J\. Chen, X\. Zhang, M\. Guo, and Z\. Li \(2025\-08\) DEXTER\-LLM: Dynamic and Explainable Coordination of Multi\-Robot Systems in Unknown Environments via Large Language Models\. arXiv\. Note: arXiv:2508\.14387 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2508.14387), [Document](https://dx.doi.org/10.48550/arXiv.2508.14387)\.
## Appendix A Additional Information on Perceptual Uncertainty Detection
#### VLM Prompts
The VLM baseline without self\-review \([Table 2](https://arxiv.org/html/2606.09919#S4.T2)\) uses the prompt in [Figure 5](https://arxiv.org/html/2606.09919#A1.F5)\. The VLM baseline with self\-review uses the prompt in [Figure 6](https://arxiv.org/html/2606.09919#A1.F6) to detect occlusions and the prompts in [Figure 7](https://arxiv.org/html/2606.09919#A1.F7) and [Figure 8](https://arxiv.org/html/2606.09919#A1.F8) to review them and allocate the robots\.
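The single\-pass prompt \(Figure 5\) asks the VLM to reply with three parallel fields: `TEXT_PROMPT`, `EXPLANATIONS`, and `REQUIRED_PLATFORM`. The sketch below shows one way such a reply could be validated before use. It is an illustration only, not the authors' code: the function names are hypothetical, and it assumes each field sits on a single line exactly as in the prompt's template.

```python
import ast
import re

# Allowed allocation labels, as listed in the single-pass prompt.
ALLOWED_LABELS = {"ground", "either", "both"}


def _field(reply: str, name: str) -> str:
    """Return the right-hand side of `NAME = ...` (assumed to fit on one line)."""
    pattern = rf"^{re.escape(name)}\s*=\s*(.+)$"
    match = re.search(pattern, reply, flags=re.MULTILINE)
    if match is None:
        raise ValueError(f"missing field: {name}")
    return match.group(1).strip()


def _string_list(reply: str, name: str) -> list[str]:
    """Parse `NAME = np.array([...])` into a list of strings without eval()."""
    inner = re.fullmatch(r"np\.array\((\[.*\])\)", _field(reply, name))
    if inner is None:
        raise ValueError(f"{name} is not of the form np.array([...])")
    values = ast.literal_eval(inner.group(1))
    if not all(isinstance(v, str) for v in values):
        raise ValueError(f"{name} must contain only strings")
    return list(values)


def parse_reply(reply: str) -> list[tuple[str, str, str]]:
    """Return (keyword, explanation, platform) triples from a VLM reply."""
    text_prompt = ast.literal_eval(_field(reply, "TEXT_PROMPT"))
    if not isinstance(text_prompt, str):
        raise ValueError("TEXT_PROMPT must be a quoted string")
    # Keywords are separated by " . " in the prompt's template.
    keywords = [k.strip() for k in text_prompt.split(".") if k.strip()]
    explanations = _string_list(reply, "EXPLANATIONS")
    platforms = _string_list(reply, "REQUIRED_PLATFORM")
    if not (len(keywords) == len(explanations) == len(platforms)):
        raise ValueError("fields must have the same length")
    unknown = set(platforms) - ALLOWED_LABELS
    if unknown:
        raise ValueError(f"unknown platform labels: {sorted(unknown)}")
    return list(zip(keywords, explanations, platforms))
```

Parsing the list literal with `ast.literal_eval` rather than `eval` avoids executing arbitrary model output; a malformed reply raises an exception that a caller could treat as a failed frame.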
#### Processing Time Breakdown
The self\-review pipeline \(see [Section 3\.1](https://arxiv.org/html/2606.09919#S3.SS1)\) comprises three components: ChatGPT\-5\.4 as the VLM, Grounding DINO\-base \[[13](https://arxiv.org/html/2606.09919#bib.bib7)\] for open\-vocabulary detection, and SAM 2 Hiera\-Large \[[18](https://arxiv.org/html/2606.09919#bib.bib6)\] for mask segmentation; hardware specifications are provided in [Table 6](https://arxiv.org/html/2606.09919#A1.T6)\. As shown in [Table 5](https://arxiv.org/html/2606.09919#A1.T5), the VLM dominates processing time, accounting for over 93% of the total\. Open\-vocabulary detection and segmentation are invoked an average of 2\.39 times per frame, where each call corresponds to one keyword or keyword pair\. The VLM also triggers more than two self\-review passes per frame on average, reflecting the difficulty of selecting keywords for open\-vocabulary segmentation\. Together, these observations motivate distilling the VLM’s reasoning into a lightweight onboard model, reducing per\-frame inference from over 13 seconds to real\-time operation\.
Table 5: Timing breakdown of the contextual self\-review distillation pipeline \(n = 1,828 frames\)\.

| Stage | Count | Calls/Frame | Avg \(s\) | Min \(s\) | Max \(s\) | Total \(s\) |
| --- | --- | --- | --- | --- | --- | --- |
| Pass 1 \(VLM Annotation\) | 1,828 | 1\.00 | 3\.961 | 1\.638 | 41\.306 | 7,241\.2 |
| Open Vocabulary Detection | 4,365 | 2\.39 | 0\.264 | 0\.130 | 2\.912 | 1,152\.6 |
| Open Vocabulary Segmentation | 4,365 | 2\.39 | 0\.098 | 0\.000 | 0\.294 | 429\.1 |
| Pass 2 \(Self\-Review\) | 3,770 | 2\.06 | 4\.107 | 2\.031 | 24\.127 | 15,482\.5 |
| Total \(average∗ / frame\) | \- | \- | \- | \- | \- | 13\.344∗ |

∗ Averaged over 1,828 frames, accounting for multiple detection, segmentation, and review passes per frame\.
Table 6: Development machine hardware and software specifications\.

| Component | Specification |
| --- | --- |
| CPU | Intel Core i9\-13900K \(24\-core\) |
| GPU | NVIDIA RTX 4090 |
| RAM | 32 GB |
| OS | Ubuntu 22\.04\.5 LTS \(Jammy Jellyfish\) |
| Kernel | Linux 6\.8\.0\-124\-generic x86\_64 |
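The share of time attributed to the VLM and the calls\-per\-frame figures can be recomputed from the counts and totals reported in Table 5. The short Python sketch below does this arithmetic; the dictionary keys and function names are illustrative choices, not identifiers from the paper.

```python
# Totals in seconds, copied from the "Total (s)" column of Table 5.
STAGE_TOTALS_S = {
    "pass_1_vlm_annotation": 7241.2,
    "open_vocab_detection": 1152.6,
    "open_vocab_segmentation": 429.1,
    "pass_2_self_review": 15482.5,
}
# Call counts, copied from the "Count" column of Table 5.
STAGE_COUNTS = {
    "pass_1_vlm_annotation": 1828,
    "open_vocab_detection": 4365,
    "open_vocab_segmentation": 4365,
    "pass_2_self_review": 3770,
}
N_FRAMES = 1828
VLM_STAGES = ("pass_1_vlm_annotation", "pass_2_self_review")


def vlm_share(totals: dict[str, float]) -> float:
    """Fraction of summed stage time spent in the two VLM passes."""
    vlm_time = sum(totals[stage] for stage in VLM_STAGES)
    return vlm_time / sum(totals.values())


def calls_per_frame(counts: dict[str, int], n_frames: int) -> dict[str, float]:
    """Average number of calls to each stage per frame."""
    return {stage: count / n_frames for stage, count in counts.items()}


if __name__ == "__main__":
    print(f"VLM share of summed stage time: {vlm_share(STAGE_TOTALS_S):.1%}")
    for stage, rate in calls_per_frame(STAGE_COUNTS, N_FRAMES).items():
        print(f"{stage}: {rate:.2f} calls/frame")
```

Summing the four stage totals and dividing by the frame count gives roughly 13\.3 seconds per frame, in line with the reported per\-frame average of 13\.344 seconds.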
#### Class Distribution
[Table 7](https://arxiv.org/html/2606.09919#A1.T7) shows that the class distribution is skewed toward "either" \(54\.7%\), followed by "ground" \(32\.2%\) and "both" \(13\.1%\)\. This distribution is expected: most occlusions can be resolved by either platform repositioning independently, a smaller subset requires ground\-level inspection specifically, and the fewest cases demand simultaneous coverage from both robots\.
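The percentages quoted above follow directly from the mask counts reported in Table 7. A minimal Python check, with illustrative names that are not from the paper:

```python
# Mask counts per allocation label, copied from Table 7.
LABEL_COUNTS = {"either": 4477, "ground": 2641, "both": 1072}


def class_shares(counts: dict[str, int]) -> dict[str, float]:
    """Percentage of mask instances carrying each allocation label."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


if __name__ == "__main__":
    for label, share in class_shares(LABEL_COUNTS).items():
        print(f"{label}: {share:.1f}%")
```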
Table 7: Class distribution of platform allocation labels across 8,190 mask instances in 1,828 frames of the VLM\-annotated dataset\.

| Class | Count | % |
| --- | --- | --- |
| Either | 4,477 | 54\.7% |
| Ground | 2,641 | 32\.2% |
| Both | 1,072 | 13\.1% |
| Total | 8,190 | 100\.0% |

Single\-Pass Occlusion Detection and Robot Allocation Prompt

You are analyzing an aerial drone image to identify true physical occlusions: visible objects or structures that block the drone’s line of sight to meaningful hidden space behind, under, or inside them\.

PLATFORM CONTEXT:
- The drone flies at approximately 10 meters altitude with its camera pointing diagonally downward at roughly 45 degrees from horizontal\. Small drone movements of 1\-2 meters change the viewpoint very little \-\- only large repositioning meaningfully changes what is visible\.
- A quadruped ground robot operates at ground level with a forward\-facing camera at approximately 0\.5 meters height\. It can navigate uneven terrain and walk around obstacles, but cannot fly, climb walls, or access elevated surfaces\.

Focus on the distinction between "object present" and "occlusion present"\. A visible object is only an occlusion when it hides meaningful space\.

VALID OCCLUSION:
- A physical object, vegetation mass, structure, vehicle, or terrain form blocks line of sight\.
- The hidden space is large enough to conceal a standing or crouching person\. This is a size constraint only \-\- flag any hidden space large enough to contain a person regardless of whether you expect a person to actually be there\. Do not use likelihood of human presence as a criterion\.
- The hidden space is hidden by geometry, not by blur, distance, crop, darkness or glare\.

NOT AN OCCLUSION:
- Open ground, pavement, grass, dirt, sky, water, shadows, markings, reflections, or image borders\.
- Thin poles, wires, signs, sparse branches, isolated trunks, small rocks, low plants, or flat visible surfaces\.

FROM AERIAL VIEW, FLAG:
- Tree canopy or dense foliage hiding ground underneath\.
- Dense bushes, hedges, shrub clusters, or tall vegetation hiding interior/ground space\.
- Building walls/corners, roof overhangs, bridges, underpasses, covered bays\.
- Large vehicles, trailers, containers, or equipment hiding space behind/beside/under them\.
- Ditches, gullies, berms, ravines, or steep terrain edges hiding bottom or far\-side space\.

KEYWORDS:
- Use 2\-3 word detector\-friendly noun phrases\.
- Good: "tree canopy", "dense bushes", "building wall", "roof overhang", "cargo trailer", "ditch bank"\.
- Bad: "shadow", "open ground", "maybe hidden area", "large dark green tree canopy"\.

PLATFORM LABELS:
For each detected occlusion assign exactly one label:
- ground: underside/interior/beneath\-canopy space not visible from above but inspectable by the ground robot\.
- either: either drone or ground robot can independently resolve it, such as a simple wall or vehicle with walk\-around/fly\-over access\.
- both: both platforms are needed; use sparingly for compound or dense occlusions neither platform alone resolves such as foliage plus wall/terrain or a thicket hiding interior space from both views\.

Return only:
TEXT\_PROMPT = "object one \. object two \."
EXPLANATIONS = np\.array\(\["Why object one hides meaningful space\.", "Why object two hides meaningful space\."\]\)
REQUIRED\_PLATFORM = np\.array\(\["label one", "label two"\]\)

TEXT\_PROMPT, EXPLANATIONS, and REQUIRED\_PLATFORM must all have the same length and be in the same order\.
Each REQUIRED\_PLATFORM entry must be exactly one of: ground, either, both\.

If no true occlusions are visible:
TEXT\_PROMPT = ""
EXPLANATIONS = np\.array\(\[\]\)
REQUIRED\_PLATFORM = np\.array\(\[\]\)

Figure 5: Prompt for single\-pass, end\-to\-end occlusion segmentation and allocation used for the VLM baseline\.

Initial Occlusion Detection Prompt

You are analyzing an aerial drone image to identify true physical occlusions: visible objects or structures that block the drone’s line of sight to meaningful hidden space behind, under, or inside them\.

PLATFORM CONTEXT:
The drone flies at approximately 10 meters altitude with its camera pointing diagonally downward at roughly 45 degrees from horizontal, giving an oblique forward\-downward view of the scene\. A quadruped ground robot operates at ground level with a forward\-facing camera at approximately 0\.5 meters height, giving a near\-horizontal ground\-level perspective\.

Focus on the distinction between "object present" and "occlusion present"\. A visible object is only an occlusion when it hides meaningful space\.

VALID OCCLUSION:
A physical object, vegetation mass, structure, vehicle, or terrain form blocks line of sight\. The hidden space is large enough to conceal a standing or crouching person\. This is a size constraint only \-\- flag any hidden space large enough to contain a person regardless of whether you expect a person to actually be there\. Do not use likelihood of human presence as a criterion\. The hidden space is hidden by geometry, not by blur, distance, crop, darkness, glare, or uncertainty\.

NOT AN OCCLUSION:
Open ground, pavement, grass, dirt, sky, water, shadows, markings, reflections, or image borders\. Thin poles, wires, signs, sparse branches, isolated trunks, small rocks, low plants, or flat visible surfaces\. A detector\-friendly object that is visible but does not hide meaningful space\.

FROM AERIAL VIEW, FLAG:
Tree canopy or dense foliage hiding ground underneath\. Dense bushes, hedges, shrub clusters, or tall vegetation hiding interior/ground space\. Building walls/corners, roof overhangs, bridges, underpasses, covered bays\. Large vehicles, trailers, containers, or equipment hiding space behind/beside/under them\. Ditches, gullies, berms, ravines, or steep terrain edges hiding bottom or far\-side space\.

KEYWORDS:
Use 2\-3 word detector\-friendly noun phrases\. Good: "tree canopy", "dense bushes", "building wall", "roof overhang", "cargo trailer", "ditch bank"\. Bad: "shadow", "open ground", "maybe hidden area", "large dark green tree canopy"\.

Return only:
TEXT\_PROMPT = "object one \. object two \."
EXPLANATIONS = np\.array\(\["Why object one hides meaningful space\.", "Why object two hides meaningful space\."\]\)

If no true occlusions are visible:
TEXT\_PROMPT = ""
EXPLANATIONS = np\.array\(\[\]\)

Figure 6: Prompt for the initial open\-vocabulary segmentation used in the self\-review mechanism\.

VLM Self\-Review Prompt

You are reviewing segmented occlusion masks on an aerial drone image\. Colored regions labeled with letters \(a, b, c, \.\.\.\) mark areas previously identified as potential occlusions from the drone’s perspective\. There are \{n\_masks\} labeled masks in total\.

CONTEXT:
- The drone flies at approximately 10 meters altitude with its camera at roughly 45 degrees from horizontal, giving an oblique downward view
- A ground robot navigates the same environment with its camera at approximately 0\.5 meters height, giving a near\-horizontal perspective
- At 10 meters altitude, small drone movements \(1\-2 meters\) change the viewpoint very little \-\- only large repositioning meaningfully changes visibility

PLATFORM LABEL RULES:
- "ground": ground robot alone can resolve it \(e\.g\. beneath a low overhang or inside a tunnel the drone cannot descend into\)
- "either": drone OR ground robot alone is sufficient \-\- the drone can fly over AND the ground robot can walk around to observe it independently
- "both": drone AND ground robot are both needed together \-\- neither alone is sufficient \(e\.g\. simultaneously beneath a canopy the drone cannot penetrate AND behind a wall the ground robot cannot see over\)

GOOD MASK CRITERIA:
- Covers a physically meaningful occluding structure \(wall, canopy, building, dense vegetation, vehicle\)
- The hidden space behind or beneath it is large enough to conceal a person
- The mask boundary reasonably matches the occluding object’s visible extent

BAD MASK CRITERIA:
- Covers open ground, sky, shadows, or flat surfaces with no hidden space behind them
- Mask boundary is clearly misaligned with any real structure \(floating region, random patch\)
- The occluding object is too thin or small to create meaningful hidden space

KEYWORD RULES \(for updated and new keywords only\):
- 2\-3 words maximum
- Simple, generic terms describing object class and basic shape only
- No color, size qualifiers, or complex descriptions
- Good: "concrete wall", "dense bushes", "tree canopy", "cargo truck"
- Bad: "large grey wall", "overgrown hedge row", "big dark green canopy"

TASK \-\- perform the following steps in order:

STEP 1: Review each labeled mask \(a, b, c, \.\.\.\) and for each decide:

KEEP \-\- the mask is a valid occlusion:
- Assign exactly one platform label: ground, either, or both
- For "either" labels, confirm in the explanation that both the drone can fly over AND the ground robot can walk around independently

REMOVE \(bad mask\) \-\- the mask is misaligned or low quality but the occluder is real:
- Drop the mask and provide a corrected 2\-3 word keyword in UPDATED\_KEYWORDS to regenerate it

REMOVE \(wrong\) \-\- the mask is not a valid occlusion at all:
- Drop it entirely, no keyword needed

STEP 2: After reviewing all masks, check whether any occluding structures were missed entirely \-\- tall grass, dense bushes, trees, walls, buildings, vehicles, or terrain features that create meaningful hidden space according to the context above\.
- If yes: provide new 2\-3 word keywords in NEW\_KEYWORDS
- If no: return an empty array for NEW\_KEYWORDS

\[\.\.\.\]

Figure 7: Prompt used in the contextual self\-review loop after the initial open\-vocabulary segmentation\.

VLM Self\-Review Prompt \(Continued\)[⬇](data:text/plain;base64,Wy4uLl0KCk9VVFBVVCBGT1JNQVQ6ClJldHVybiBPTkxZIHRoZXNlIGZpdmUgUHl0aG9uIHZhcmlhYmxlIGFzc2lnbm1lbnRzIGluIGV4YWN0bHkgdGhpcyBmb3JtYXQgYW5kIGluIHRoaXMgb3JkZXI6CgpSRVZJRVcgPSBucC5hcnJheShbIm9uZSBzZW50ZW5jZSB2ZXJkaWN0IGZvciBtYXNrIGEiLCAib25lIHNlbnRlbmNlIHZlcmRpY3QgZm9yIG1hc2sgYiIsIC4uLl0pCktFRVAgPSBucC5hcnJheShbImEiLCAiYyIsIC4uLl0pClJFUVVJUkVEX1BMQVRGT1JNID0gbnAuYXJyYXkoWyJsYWJlbCBhIiwgImxhYmVsIGMiLCAuLi5dKQpVUERBVEVEX0tFWVdPUkRTID0gbnAuYXJyYXkoWyJjb3JyZWN0ZWQga2V5d29yZCBmb3IgZHJvcHBlZCBiYWQgbWFzayIsIC4uLl0pCk5FV19LRVlXT1JEUyA9IG5wLmFycmF5KFsibmV3IGtleXdvcmQgb25lIiwgIm5ldyBrZXl3b3JkIHR3byIsIC4uLl0pCgpSdWxlczoKLSBSRVZJRVcgbXVzdCBoYXZlIGV4YWN0bHkge25fbWFza3N9IGVudHJpZXMsIG9uZSBwZXIgbWFzayBpbiBhbHBoYWJldGljYWwgb3JkZXIsIGVhY2ggc3RhdGluZzoga2VlcC9yZW1vdmUgYW5kIGEgYnJpZWYgb25lIHNlbnRlbmNlIHJlYXNvbgotIEtFRVAgYW5kIFJFUVVJUkVEX1BMQVRGT1JNIG11c3QgYmUgdGhlIHNhbWUgbGVuZ3RoIGFuZCBpbiBhbHBoYWJldGljYWwgbWFzayBvcmRlcgotIFVQREFURURfS0VZV09SRFMgY29udGFpbnMgY29ycmVjdGVkIGtleXdvcmRzIG9ubHkgZm9yIG1hc2tzIHJlbW92ZWQgYXMgYmFkIHF1Y
WxpdHksIG5vdCBmb3IgbWFza3MgcmVtb3ZlZCBhcyB3cm9uZwotIE5FV19LRVlXT1JEUyBjb250YWlucyBrZXl3b3JkcyBmb3Igb2NjbHVkZXJzIG1pc3NlZCBlbnRpcmVseSBpbiB0aGUgb3JpZ2luYWwgc2VnbWVudGF0aW9uLCBvciBhbiBlbXB0eSBhcnJheSBpZiBub25lCi0gRWFjaCBSRVFVSVJFRF9QTEFURk9STSBlbnRyeSBtdXN0IGJlIGV4YWN0bHkgb25lIG9mOiBncm91bmQsIGVpdGhlciwgYm90aAotIERvIG5vdCBoYWxsdWNpbmF0ZSBtYXNrcyBvciBvYmplY3RzIG5vdCB2aXNpYmxlIGluIHRoZSBpbWFnZQoKRXhhbXBsZSAtLSBpbWFnZSBoYXMgMyBtYXNrcyAoYSwgYiwgYyk6IG1hc2sgYSBpcyBhIHZhbGlkIHRyZWUgY2Fub3B5LCBtYXNrIGIgaXMgYSBtaXNhbGlnbmVkIHdhbGwgbWFzaywgbWFzayBjIGlzIGEgc2hhZG93IGluY29ycmVjdGx5IGZsYWdnZWQuIE9uZSBhZGRpdGlvbmFsIHRhbGwgZ3Jhc3MgYXJlYSB3YXMgbWlzc2VkLgoKUkVWSUVXID0gbnAuYXJyYXkoWyJrZWVwIC0gZGVuc2UgY2Fub3B5IGNyZWF0ZXMgbWVhbmluZ2Z1bCBoaWRkZW4gc3BhY2UgYmVuZWF0aCB0aGF0IHRoZSBkcm9uZSBjYW5ub3Qgc2VlIHRocm91Z2guIiwgInJlbW92ZSBiYWQgLSBtYXNrIGJvdW5kYXJ5IGlzIG1pc2FsaWduZWQgd2l0aCB0aGUgd2FsbCwgYnV0IHRoZSB3YWxsIGlzIGEgcmVhbCBvY2NsdWRlciB3b3J0aCByZWdlbmVyYXRpbmcuIiwgInJlbW92ZSB3cm9uZyAtIHRoaXMgcmVnaW9uIGlzIGEgc2hhZG93IHdpdGggbm8gcGh5c2ljYWwgaGlkZGVuIHNwYWNlIGJlaGluZCBpdC4iXSkKS0VFUCA9IG5wLmFycmF5KFsiYSJdKQpSRVFVSVJFRF9QTEFURk9STSA9IG5wLmFycmF5KFsiZ3JvdW5kIl0pClVQREFURURfS0VZV09SRFMgPSBucC5hcnJheShbImNvbmNyZXRlIHdhbGwiXSkKTkVXX0tFWVdPUkRTID0gbnAuYXJyYXkoWyJ0YWxsIGdyYXNzIl0pCgpJZiB0aGVyZSBhcmUgbm8gbWFza3MgdG8ga2VlcCBhbmQgbm8gbmV3IG9yIHVwZGF0ZWQga2V5d29yZHMsIG9yIGlmIGFsbCBtYXNrcyBhcmUgdmFsaWQsIGFsd2F5cyBvdXRwdXQgUkVWSUVXIHdpdGggb25lIHNlbnRlbmNlIG9mIHJlYXNvbmluZyBwZXIgbWFzayByZWdhcmRsZXNzIC0tIGRvIG5vdCBza2lwIG9yIGFiYnJldmlhdGUgaXQuIEV2ZXJ5IHJ1biBtdXN0IHByb2R1Y2UgYSBSRVZJRVcgYXJyYXkgd2l0aCBleGFjdGx5IHtuX21hc2tzfSBlbnRyaWVzOgoKUkVWSUVXID0gbnAuYXJyYXkoWyJrZWVwIC0gZGVuc2UgY2Fub3B5IGNyZWF0ZXMgbWVhbmluZ2Z1bCBoaWRkZW4gc3BhY2UgYmVuZWF0aCB0aGF0IG5laXRoZXIgcGxhdGZvcm0gY2FuIHJlc29sdmUgYWxvbmUuIiwgInJlbW92ZSB3cm9uZyAtIHRoaXMgcmVnaW9uIGlzIGEgc2hhZG93IHdpdGggbm8gcGh5c2ljYWwgaGlkZGVuIHNwYWNlIGJlaGluZCBpdC4iLCAuLi5dKQpLRUVQID0gbnAuYXJyYXkoW10pClJFUVVJUkVEX1BMQVRGT1JNID0gbnAuYXJyYXkoW10pClVQREFURURfS0VZV09SRFMgPSBuc
C5hcnJheShbXSkKTkVXX0tFWVdPUkRTID0gbnAuYXJyYXkoW10p)\[\.\.\.\]OUTPUTFORMAT:ReturnONLYthesefivePythonvariableassignmentsinexactlythisformatandinthisorder:REVIEW=np\.array\(\["onesentenceverdictformaska","onesentenceverdictformaskb",\.\.\.\]\)KEEP=np\.array\(\["a","c",\.\.\.\]\)REQUIRED\_PLATFORM=np\.array\(\["labela","labelc",\.\.\.\]\)UPDATED\_KEYWORDS=np\.array\(\["correctedkeywordfordroppedbadmask",\.\.\.\]\)NEW\_KEYWORDS=np\.array\(\["newkeywordone","newkeywordtwo",\.\.\.\]\)Rules:\-REVIEWmusthaveexactly\{n\_masks\}entries,onepermaskinalphabeticalorder,eachstating:keep/removeandabriefonesentencereason\-KEEPandREQUIRED\_PLATFORMmustbethesamelengthandinalphabeticalmaskorder\-UPDATED\_KEYWORDScontainscorrectedkeywordsonlyformasksremovedasbadquality,notformasksremovedaswrong\-NEW\_KEYWORDScontainskeywordsforoccludersmissedentirelyintheoriginalsegmentation,oranemptyarrayifnone\-EachREQUIRED\_PLATFORMentrymustbeexactlyoneof:ground,either,both\-DonothallucinatemasksorobjectsnotvisibleintheimageExample\-\-imagehas3masks\(a,b,c\):maskaisavalidtreecanopy,maskbisamisalignedwallmask,maskcisashadowincorrectlyflagged\.Oneadditionaltallgrassareawasmissed\.REVIEW=np\.array\(\["keep\-densecanopycreatesmeaningfulhiddenspacebeneaththatthedronecannotseethrough\.","removebad\-maskboundaryismisalignedwiththewall,butthewallisarealoccluderworthregenerating\.","removewrong\-thisregionisashadowwithnophysicalhiddenspacebehindit\."\]\)KEEP=np\.array\(\["a"\]\)REQUIRED\_PLATFORM=np\.array\(\["ground"\]\)UPDATED\_KEYWORDS=np\.array\(\["concretewall"\]\)NEW\_KEYWORDS=np\.array\(\["tallgrass"\]\)Iftherearenomaskstokeepandnoneworupdatedkeywords,orifallmasksarevalid,alwaysoutputREVIEWwithonesentenceofreasoningpermaskregardless\-\-donotskiporabbreviateit\.EveryrunmustproduceaREVIEWarraywithexactly\{n\_masks\}entries:REVIEW=np\.array\(\["keep\-densecanopycreatesmeaningfulhiddenspacebeneaththatneitherplatformcanresolvealone\.","removewrong\-thisregionisashadowwithnophysicalhiddenspacebehindit\.",\.\
.\.\]\)KEEP=np\.array\(\[\]\)REQUIRED\_PLATFORM=np\.array\(\[\]\)UPDATED\_KEYWORDS=np\.array\(\[\]\)NEW\_KEYWORDS=np\.array\(\[\]\)Figure 8:Prompt used in the contextual self\-review loop after the initial open\-vocabulary segmentation \(Continued\)\.
## Appendix B Additional Information on Uncertainty Quantification
#### Two\-Stage Guarantee Scheme
To certify model outputs for occlusion segmentation and allocation as well as object detection, Co-GLANCE employs a two-stage guarantee scheme illustrated in [Figure 10](https://arxiv.org/html/2606.09919#A2.F10).
Example: Object Classification on CIFAR-10 \[[10](https://arxiv.org/html/2606.09919#bib.bib1)\]
To provide an intuitive illustration of the two-stage guarantee scheme, we calibrate both stages on the outputs of a small model trained on CIFAR-10 \[[10](https://arxiv.org/html/2606.09919#bib.bib1)\]. [Table 8](https://arxiv.org/html/2606.09919#A2.T8) reports the calibration thresholds for a scheme calibrated on $n = 4000$ samples at the same risk levels as the person detection experiment detailed in [subsection 4.4](https://arxiv.org/html/2606.09919#S4.SS4), along with empirical validation on $n = 6000$ samples confirming that the guarantees hold in practice. A detailed example of what this scheme enables is provided below.
Table 8: Two-stage uncertainty quantification on CIFAR-10. The risk-controlled stage enforces a bounded error rate via selective abstention \[[1](https://arxiv.org/html/2606.09919#bib.bib83)\]; the coverage-controlled stage enforces set coverage via conformal prediction \[[2](https://arxiv.org/html/2606.09919#bib.bib178)\] on the abstained samples.

| Stage | Quantity | Value |
| --- | --- | --- |
| Risk-controlled | Target error rate $\alpha_{\text{sel}}$ | 0.200 |
| Risk-controlled | Failure probability $\delta$ | 0.100 |
| Risk-controlled | Threshold $\hat{\lambda}$ | 0.6403 |
| Risk-controlled | Empirical error $\hat{R}(\hat{\lambda})$ | 0.1845 |
| Risk-controlled | Worst-case upper bound $\hat{R}^{+}(\hat{\lambda})$ | 0.1990 |
| Risk-controlled | Calibration samples retained (risk-ctrl.) | 1274 (31.9%) |
| Risk-controlled | Calibration samples abstained (cov.-ctrl.) | 2726 (68.2%) |
| Coverage-controlled | Target miscoverage $\alpha_{\text{cp}}$ | 0.200 |
| Coverage-controlled | Target coverage $1 - \alpha_{\text{cp}}$ | 0.800 |
| Coverage-controlled | CP calibration samples | 2726 |
| Coverage-controlled | Quantile $\hat{q}$ | 0.8855 |
| Coverage-controlled | Softmax inclusion threshold $1 - \hat{q}$ | 0.1145 |
| Coverage-controlled | Achieved test coverage | 0.8046 |
| Coverage-controlled | Average prediction set size $\lvert\mathcal{C}\rvert$ | 2.85 |

Empirical validation

| Stage | $n$ | Frac. | Accuracy | Err. rate | Coverage |
| --- | --- | --- | --- | --- | --- |
| Risk-controlled | 1896 | 0.316 | 0.8165 ✓ | 0.1835 ✓ | – |
| Coverage-controlled | 4104 | 0.684 | – | – | 0.8046 ✓ |
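As a rough illustration of how the coverage-controlled stage in Table 8 could be calibrated, the sketch below implements standard split conformal prediction. It assumes the nonconformity score is one minus the softmax probability of the true class, which is consistent with the table's inclusion threshold $1 - \hat{q}$ but is not confirmed by the text here. The function names `conformal_quantile` and `prediction_set` are hypothetical.

```python
import math

def conformal_quantile(true_class_probs, alpha_cp):
    """Split-conformal quantile of the scores s_i = 1 - p_i(true class).

    Assumption: this score function is a common default, not necessarily
    the one used in the paper.
    """
    scores = sorted(1.0 - p for p in true_class_probs)
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1.0 - alpha_cp))
    if rank > n:
        # Too few calibration samples: fall back to including every class.
        return 1.0
    return scores[rank - 1]

def prediction_set(softmax, q_hat):
    """Indices of classes whose softmax probability is at least 1 - q_hat."""
    return [k for k, p in enumerate(softmax) if p >= 1.0 - q_hat]
```

With the reported $\hat{q} = 0.8855$, a softmax vector such as `[0.5, 0.3, 0.15, 0.05]` would yield a three-class set, since only the last entry falls below the inclusion threshold 0.1145.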
(a) Correct detections at the Risk-Controlled Stage. High-confidence detections are guaranteed to be correct at least 80% of the time.
(b) Incorrect detections at the Risk-Controlled Stage.
(c) Covered detections at the Coverage-Controlled Stage. Low-confidence detections are guaranteed to be covered at least 80% of the time.
(d) Uncovered detections at the Coverage-Controlled Stage.
Figure 9: The two-stage probabilistic guarantee scheme used in Co-GLANCE aims to produce singleton predictions for high-confidence detections and coverage sets for low-confidence detections.

- (1) Detection: While searching for people in a search-and-rescue mission, the aerial robot makes a detection. See examples in Figures [9(a)](https://arxiv.org/html/2606.09919#A2.F9.sf1) to [9(d)](https://arxiv.org/html/2606.09919#A2.F9.sf4).
- (2) Risk-controlled stage: The model prediction (associated with a softmax confidence $\texttt{conf}$) is checked against the threshold determined through selective abstention calibration (see [subsection 3.2](https://arxiv.org/html/2606.09919#S3.SS2)). For this CIFAR-10 example, $\hat{\lambda} = 0.6403$.
- (3a) [$\texttt{conf} \geq 0.6403$] The prediction meets the Stage 1 requirement, so the detected object is guaranteed to be correctly identified in at least 80% of cases; active perception is not required. See Figure [9(a)](https://arxiv.org/html/2606.09919#A2.F9.sf1).
- (3b) [$\texttt{conf} < 0.6403$] The prediction does not meet the Stage 1 requirement; an active perception request is dispatched to the ground robot. Stage 2 simultaneously produces a calibrated set guaranteed to contain the true label at least 80% of the time. See Figure [9(c)](https://arxiv.org/html/2606.09919#A2.F9.sf3). This set represents the system's calibrated estimate of the possible semantic class and can be used to guide active perception. For example, if searching for a car, directing active perception toward $\{\texttt{car, ship}\}$ is more informative than toward $\{\texttt{deer, horse}\}$.
- (4) Active perception: The ground robot navigates to the requested location and observes the object, reporting back its detection (following the same four-step process until a detection passes the Stage 1 threshold).
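The decision flow in steps (2) to (3b) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the thresholds are the CIFAR-10 values from Table 8, the set rule assumes classes are included when their softmax probability is at least $1 - \hat{q}$, and `two_stage_decision` together with its return format is hypothetical.

```python
LAMBDA_HAT = 0.6403  # Stage 1 threshold reported in Table 8
Q_HAT = 0.8855       # Stage 2 conformal quantile reported in Table 8

def two_stage_decision(softmax, class_names,
                       lambda_hat=LAMBDA_HAT, q_hat=Q_HAT):
    """Return a singleton (Stage 1) or a prediction set (Stage 2)."""
    conf = max(softmax)
    top_label = class_names[softmax.index(conf)]
    if conf >= lambda_hat:
        # (3a) Confident enough: commit to one label, no active perception.
        return {"stage": 1, "labels": [top_label], "active_perception": False}
    # (3b) Abstain, request active perception, and report a calibrated set.
    labels = [c for c, p in zip(class_names, softmax) if p >= 1.0 - q_hat]
    return {"stage": 2, "labels": labels, "active_perception": True}
```

For instance, a softmax of `[0.45, 0.40, 0.10, 0.05]` over `car, ship, deer, horse` falls below the Stage 1 threshold and produces the set `car, ship`, which is the kind of output the text describes as useful for guiding the ground robot.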
Example: Occlusion Segmentation and Allocation
For occlusion segmentation and allocation, a similar active perception strategy can be employed in which robots do not fully commit to exploring an occlusion until it can be certified with high probability that it is a true occlusion and that it is correctly allocated\. In our experiments, we adopt a conservative approach, dispatching both robots to any occlusion that falls below the Stage 1 threshold\.
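The conservative dispatch rule described above might look like the sketch below. It assumes that a mask is retained when its confidence is at least the Stage 1 threshold (0.3243 in Table 10) and that the labels carry the meaning given in the self-review prompt: `ground` needs the ground robot, `either` can be resolved by one robot of either type, and `both` needs both. The function name and return format are hypothetical.

```python
STAGE1_THRESHOLD = 0.3243  # occlusion threshold reported in Table 10

def platforms_to_dispatch(label, conf, threshold=STAGE1_THRESHOLD):
    """Return (candidate_platforms, require_all) for one occlusion mask.

    require_all=True means every listed platform must visit the occlusion;
    require_all=False means the router may pick any one of them.
    """
    if conf < threshold:
        # Abstained prediction: conservatively send both robots.
        return ("aerial", "ground"), True
    if label == "ground":
        return ("ground",), True
    if label == "either":
        return ("aerial", "ground"), False
    if label == "both":
        return ("aerial", "ground"), True
    raise ValueError(f"unknown allocation label: {label!r}")
```

Keeping the abstention branch first means an uncertain label is never trusted, whatever class it names.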
Figure 10: Two-stage uncertainty guarantee scheme. (1) Detection, (2) Risk-controlled stage, (3a) Downstream decision-making informed by singleton output, (3b) Coverage-controlled stage, (4) Active perception informed by set-valued output.
#### Risk\-Controlled Stage Calibration
Following[subsection 3\.2](https://arxiv.org/html/2606.09919#S3.SS2), we calibrate the Stage 1 threshold on object detections focusing on the “person” class and on occlusion segmentation and robot allocation jointly\.
Figure 11: Risk-controlled selective abstention for person detection (aerial viewpoint, YOLO26-nano \[[7](https://arxiv.org/html/2606.09919#bib.bib150)\]).

Selective Abstention on Object Detection
Using a subset of our dataset (see [subsection 4.2](https://arxiv.org/html/2606.09919#S4.SS2)), we calibrate the Stage 1 threshold on person detections using 126 calibration and 30 test detections. Empirical results are provided for reference. The calibration process is illustrated in [Figure 11](https://arxiv.org/html/2606.09919#A2.F11): the blue line shows the empirical error rate, the orange line the worst-case upper bound under the selected $\delta$, the green dashed line the fraction of samples retained at each threshold, and the red dashed line the target error rate $\alpha$. See \[[2](https://arxiv.org/html/2606.09919#bib.bib178), [1](https://arxiv.org/html/2606.09919#bib.bib83)\] for more information on how to read the plots.
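To make the quantities in the plot concrete, the sketch below computes, for each candidate threshold, the retained samples, their empirical error, and an upper confidence bound, then picks the smallest threshold whose bound stays below $\alpha$. The bound used here is a one-sided exact binomial (Clopper-Pearson) bound, which is an assumption: the paper's actual bound is defined in its methods section and may differ. The sketch also omits any multiple-testing correction across thresholds, and all function names are hypothetical.

```python
from math import exp, lgamma, log

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), evaluated in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    return sum(
        exp(lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1.0 - p))
        for i in range(k + 1)
    )

def risk_upper_bound(errors, n, delta):
    """One-sided exact binomial upper bound holding with prob. 1 - delta."""
    if n == 0 or errors >= n:
        return 1.0
    lo, hi = errors / n, 1.0
    for _ in range(60):  # bisection: the CDF decreases as p grows
        mid = (lo + hi) / 2.0
        if binom_cdf(errors, n, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi

def calibrate_threshold(confs, correct, alpha, delta):
    """Smallest candidate threshold whose bound is <= alpha, else None."""
    best = None
    for lam in sorted(set(confs), reverse=True):
        kept = [ok for c, ok in zip(confs, correct) if c >= lam]
        errors = sum(1 for ok in kept if not ok)
        if risk_upper_bound(errors, len(kept), delta) <= alpha:
            best = lam  # later (smaller) passing thresholds overwrite this
    return best
```

Choosing the smallest passing threshold retains as many detections as possible while keeping the bounded error rate below the target.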
Table 9: Risk-controlled stage calibration results for aerial person detection ($\alpha = 0.2$, $\delta = 0.1$).

| Section | Quantity | Value |
| --- | --- | --- |
| Calibration setup | Calibration fraction | 0.80 |
| Calibration setup | Calibration detections | 126 |
| Calibration setup | Test detections | 30 |
| Selective abstention results | Target error rate $\alpha$ | 0.200 |
| Selective abstention results | Failure probability $\delta$ | 0.100 |
| Selective abstention results | Threshold $\hat{\lambda}$ | 0.5065 |
| Selective abstention results | Empirical error $\hat{R}(\hat{\lambda})$ | 0.0741 |
| Selective abstention results | Worst-case upper bound $\hat{R}^{+}(\hat{\lambda})$ | 0.1791 |
| Selective abstention results | Guarantee satisfied $\hat{R}^{+}(\hat{\lambda}) \leq \alpha$ | ✓ |
| Empirical results | Detections retained | 15 (50.0%) |
| Empirical results | Detections abstained | 15 (50.0%) |
| Empirical results | Selective precision | 0.800 |
| Empirical results | Selective error | 0.200 |
| Empirical results | Mean confidence (retained) | 0.618 |

Selective Abstention on Occlusion Segmentation and Allocation
Using a subset of our dataset (see [subsection 4.2](https://arxiv.org/html/2606.09919#S4.SS2)) and our distilled model (see [subsection 3.1](https://arxiv.org/html/2606.09919#S3.SS1)), we calibrate the Stage 1 threshold on occlusion segmentation and allocation. As noted in [subsection 3.2](https://arxiv.org/html/2606.09919#S3.SS2), the risk of incorrect segmentation and incorrect allocation is controlled jointly as a single prediction unit. Calibration parameters and results are reported in [Table 10](https://arxiv.org/html/2606.09919#A2.T10) and illustrated in [Figure 12](https://arxiv.org/html/2606.09919#A2.F12).
Figure 12: Risk-controlled selective abstention for occlusion segmentation and allocation (Co-GLANCE (distilled)).

Table 10: Risk-controlled stage calibration results for occlusion segmentation and allocation ($\alpha = 0.15$, $\delta = 0.1$).

| Section | Quantity | Value |
| --- | --- | --- |
| Calibration setup | Calibration fraction | 0.80 |
| Calibration setup | Calibration detections | 778 |
| Calibration setup | Test detections | 222 |
| Selective abstention results | Target error rate $\alpha$ | 0.150 |
| Selective abstention results | Failure probability $\delta$ | 0.100 |
| Selective abstention results | Threshold $\hat{\lambda}$ | 0.3243 |
| Selective abstention results | Empirical error $\hat{R}(\hat{\lambda})$ | 0.1322 |
| Selective abstention results | Worst-case upper bound $\hat{R}^{+}(\hat{\lambda})$ | 0.1496 |
| Selective abstention results | Guarantee satisfied $\hat{R}^{+}(\hat{\lambda}) \leq \alpha$ | ✓ |
| Empirical results (aggregate) | Masks retained | 199 (89.6%) |
| Empirical results (aggregate) | Masks abstained | 23 (10.4%) |
| Empirical results (aggregate) | Selective label accuracy | 0.874 |
| Empirical results (aggregate) | Selective label error | 0.126 |
| Per-class breakdown (retained fraction / label accuracy) | ground | 1.000 / 0.865 |
| Per-class breakdown (retained fraction / label accuracy) | both | 0.947 / 0.778 |
| Per-class breakdown (retained fraction / label accuracy) | either | 0.871 / 0.950 |
## Appendix C Additional Information on Uncertainty Resolution
#### Robot Allocation and Routing
Robot allocation and routing is a three-step process, demonstrated visually in [Figure 13](https://arxiv.org/html/2606.09919#A3.F13) and [Figure 14](https://arxiv.org/html/2606.09919#A3.F14) on the expert demonstration. Note that the expert is used only to determine the location and type of occlusions, not to specify robot paths. First, each robot is assigned a sequence of occlusions to visit by minimizing a heuristic travel cost using \[ortools\] (see [subsection 3.3](https://arxiv.org/html/2606.09919#S3.SS3) and [Figure 13](https://arxiv.org/html/2606.09919#A3.F13)). Second, viewpoints are assigned per occlusion: the ground robot receives one viewpoint per occlusion, generated to avoid conflicts with known obstacles identified through satellite imagery; the aerial robot receives eight viewpoints arranged in a circular sweep around each occlusion (see [Figure 14](https://arxiv.org/html/2606.09919#A3.F14)). Third, each robot navigates its assigned sequence of viewpoints.
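The aerial robot's circular sweep in the second step can be sketched as below. The text specifies only that eight viewpoints are arranged in a circle around each occlusion, so the sweep radius, the local planar frame in metres, and the choice to point the camera heading at the occlusion centre are all assumptions. `circular_viewpoints` is a hypothetical helper.

```python
import math

def circular_viewpoints(center_xy, radius_m, n_views=8):
    """Evenly spaced (x, y, heading) viewpoints on a circle around a centre.

    Assumptions: coordinates are in a local planar frame in metres and each
    heading (radians) faces the occlusion centre.
    """
    cx, cy = center_xy
    views = []
    for i in range(n_views):
        theta = 2.0 * math.pi * i / n_views
        x = cx + radius_m * math.cos(theta)
        y = cy + radius_m * math.sin(theta)
        heading = math.atan2(cy - y, cx - x)  # look back toward the centre
        views.append((x, y, heading))
    return views
```

Calling `circular_viewpoints((10.0, 5.0), 4.0)` returns eight poses, each four metres from the occlusion centre, which a planner could then visit in order.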
Figure 13: Robot routing, step 1: occlusion allocation (expert demonstration; the expert determines which occlusions to visit, not the robot paths).

Figure 14: Robot routing, step 2: viewpoint allocation (expert demonstration; the expert determines which occlusions to visit, not the robot paths).
## Appendix D Additional Information on Experimental Setup
#### Robot Platforms
Co-GLANCE is deployed on two complementary platforms. The ground robot is a Boston Dynamics Spot quadruped, whose legged locomotion enables navigation through dense vegetation and uneven terrain, making it uniquely suited for ground-level occlusion resolution. It runs an NVIDIA Jetson AGX Thor for onboard compute, uses its front-facing RGB cameras for perception, and is equipped with RTK-corrected GPS for localization. The aerial robot is a DJI Matrice 600 Pro, which provides unconstrained overhead coverage of the scene but cannot resolve occlusions beneath vegetation or behind structures. It runs an NVIDIA Jetson Xavier NX, uses an Arducam HQ IMX477 camera for perception, and relies on triple-redundant GPS for localization. Both platforms communicate over local WiFi.
## Appendix E Additional Information on the Dataset
The Co-GLANCE dataset provides over 4,000 synchronized RGB frames (over 2,000 frame pairs) from aerial and ground viewpoints, collected across two outdoor scenarios on semi-structured terrain. Raw ROS 2 bags from both platforms are also released to support evaluation beyond static image benchmarks.
#### Construction Scenario
A construction worker traverses a construction site tracked by both robots\. Ground truth bounding boxes are generated from a GoPro Hero 10 mounted on the ground robot; aerial frames are captured by an Arducam HQ IMX477\. The scenario comprises 4 runs and 1,209 annotated frame pairs \([Table 11](https://arxiv.org/html/2606.09919#A5.T11)\): \(1\) a construction worker standing then walking unoccluded; \(2\) the worker traversing the construction site in one direction; \(3\) the same traverse in the opposite direction; \(4\) a traverse of an adjacent site\. The aerial robot loiters overhead across all runs\.
#### Camouflage Scenario
Two camouflage\-wearing individuals move through a visually occluded area\. Ground truth bounding boxes are generated from a stitched view of the Spot’s onboard cameras; aerial frames are captured by a GoPro Hero 13\. The scenario comprises 3 runs and 862 annotated frame pairs \([Table 11](https://arxiv.org/html/2606.09919#A5.T11)\): \(1\) the ground robot follows the individuals through thick brush; \(2\) the ground robot observes the individuals moving around the brush from an adjacent grassy area; \(3\) the ground robot follows the individuals while they are occluded by large crates\. The aerial robot loiters overhead across all runs\.
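| Scenario | Run | Frame Pairs |
| --- | --- | --- |
| Construction | 1 | 118 |
| Construction | 2 | 326 |
| Construction | 3 | 280 |
| Construction | 4 | 485 |
| Construction | Total | 1,209 |
| Camouflage | 1 | 186 |
| Camouflage | 2 | 545 |
| Camouflage | 3 | 131 |
| Camouflage | Total | 862 |
| Overall Total | | 2,071 |

Table 11: Frame pair counts per scenario and run.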
## Appendix F Additional Information on Quantitative Results
To complement [Table 2](https://arxiv.org/html/2606.09919#S4.T2), we provide a per-class breakdown in [Table 12](https://arxiv.org/html/2606.09919#A6.T12). Co-GLANCE outperforms both baselines on the $\{\texttt{both}\}$ and $\{\texttt{either}\}$ classes across all segmentation metrics. For the $\{\texttt{ground}\}$ class, VLM (self-review) achieves higher precision and F1; however, Co-GLANCE still outperforms VLM (no review) in these cases, confirming that VLM (self-review) represents a considerably stronger baseline. The one exception is allocation accuracy for the $\{\texttt{both}\}$ class, where Co-GLANCE trails both baselines, which we hypothesize reflects the difficulty of distilling the compound scene-level reasoning required to identify occlusions that neither platform can resolve alone.
Table 12: Per-class model-level evaluation on hand-annotated masks across 34 held-out frames. These frames were not seen during model training. ∗ Co-GLANCE outperforms the VLM baseline without self-review.

| Class | System | Precision | Recall | F1 | Alloc. Acc. |
| --- | --- | --- | --- | --- | --- |
| both | VLM (no review) | 0.583 | 0.488 | 0.532 | 0.667 |
| both | VLM (self-review) | 0.500 | 0.674 | 0.574 | 0.586 |
| both | Co-GLANCE (distilled) | 0.689 | 0.721 | 0.705 | 0.516 |
| either | VLM (no review) | 0.406 | 0.532 | 0.460 | 0.897 |
| either | VLM (self-review) | 0.424 | 0.670 | 0.520 | 0.932 |
| either | Co-GLANCE (distilled) | 0.704 | 0.697 | 0.700 | 0.921∗ |
| ground | VLM (no review) | 0.509 | 0.617 | 0.558 | 0.310 |
| ground | VLM (self-review) | 0.738 | 0.660 | 0.697 | 0.903 |
| ground | Co-GLANCE (distilled) | 0.620∗ | 0.660 | 0.639∗ | 0.774∗ |
## Appendix G Additional Information on Demonstrations