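@vintcessun: With centralized fusion in large-scale surveillance, once you reach dozens or hundreds of cameras the compute bottleneck chokes the system: it cannot scale, and the central station alone eats most of the budget. That is why multi-view tracking cannot really ship without going distributed: the cost of scaling a centralized design soars exponentially with the number of nodes, while engineering needs a single large-scale, low-cost deployment…
Summary
MV3DT is a fully distributed multi-view 3D tracking framework that removes the compute bottleneck of centralized fusion through peer-to-peer coordination. It runs at 30 FPS on 100 cameras with only 2.2% communication overhead, deploys zero-shot given camera calibrations, and matches or exceeds centralized methods.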
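Cached: 2026/06/16 03:16
With centralized fusion in large-scale surveillance, once you reach dozens or hundreds of cameras the compute bottleneck chokes the system: it cannot scale, and the central station alone eats most of the budget. That is why multi-view tracking cannot really ship without going distributed: the cost of scaling a centralized design soars exponentially with the number of nodes, while engineering needs a single large-scale, low-cost deployment. MV3DT offers a field-ready answer: each camera runs monocular 3D perception plus local visual-geometric association, nodes exchange only small state and confidence messages, the system runs in real time at 30 FPS, communication overhead is held to 2.2%, and it works zero-shot given camera calibrations.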
Fully Distributed Multi-View 3D Tracking in Real-Time
Source: https://arxiv.org/html/2606.13127
1 University of Florida, Gainesville, FL, USA, {bhernandezosorio,hmedeiros}@ufl.edu
2 NVIDIA Corporation, Santa Clara, CA, USA, {bhernandez,fangyul,aotianw,pshin,kpurandare}@nvidia.com
Fangyu Li, Aotian Wu, Paul J. Shin, Kaustubh Purandare, Henry Medeiros
Abstract
Multi-camera tracking with overlapping fields of view typically relies on centralized fusion, which creates computational bottlenecks that prevent deployment at scale. We present MV3DT, a fully distributed framework for real-time multi-view 3D tracking that achieves accurate identity propagation and occlusion recovery through peer-to-peer coordination, eliminating the need for central aggregation. Each camera node executes a lightweight modular pipeline comprising monocular 3D perception, distributed multi-view association, and collaborative fusion via lightweight messaging. MV3DT achieves 96.5% IDF1, 93.1% MOTA, and 94.6% MOTP on WILDTRACK, competitive with state-of-the-art centralized methods, and unprecedented 41.7% IDF1 and 50.9% MOTA on SCOUT while demonstrating superior scalability: sustaining 30 FPS on 100 cameras with <10 ms inter-camera latency and only 2.2% communication overhead. MV3DT operates in a zero-shot regime given camera calibrations, requiring no scene-specific learning and making it directly deployable in new environments. These results establish MV3DT as a practical solution for real-time multi-view tracking in large-scale overlapping camera networks.
1 Introduction
Figure 1: MV3DT Overview. MV3DT deploys a modular pipeline on each camera node without requiring a central server. Monocular Detection extracts 2D bounding boxes. Then, 3D foot location estimates and full-body bounding boxes are computed for Data Association, where detection-to-target matches, both intra-view and multi-view, are found using several similarity measures. Target Management maintains target state and ID consistency across overlapping cameras through distributed ID propagation, and integrates Kalman filtering for multi-view measurement fusion. Distributed Communication uses MQTT publish/subscribe messaging for peer-to-peer coordination. Each camera maintains a local database of shared target states, enabling coordinated tracking without centralized aggregation. MV3DT achieves highly effective ID propagation and multi-view integration through fully distributed interactions, allowing for online and real-time deployment of large camera networks.
Multi-camera multi-target tracking (MCMT) is a prevalent problem in computer vision. Large-scale applications such as warehouse monitoring and intelligent cities require tens to thousands of cameras to effectively cover the region of interest[24]. The increasing participation in the AI City Challenge reflects the growing importance of large-scale MCMT[25,45,41]. MCMT techniques can be classified as centralized, decentralized, or distributed, depending on how they execute processes and aggregate data[14]. They may also focus on camera topologies with overlapping or non-overlapping fields of view (FOV). The amount of FOV overlap determines the availability of multi-view geometric cues to improve tracking accuracy. On the other hand, non-overlapping camera systems typically use appearance representations and trajectory prediction based on camera linking models to achieve effective tracking[31,1].
Centralized approaches leverage overlapping camera setups to exploit global information from all cameras in a single fusion stage, which can improve accuracy but concentrates all computation and communication in one logical node[14]. In large deployments, such centralized fusion often becomes impractical due to bandwidth, latency, and robustness constraints. Many distributed MCMT methods target non-overlapping or sparsely overlapping camera topologies, where cross-camera association relies primarily on appearance and temporal constraints. In these settings, cameras often operate without explicit 3D calibration and maintain consistency mainly through label and appearance exchange. The current paradigm of distributed techniques for overlapping camera setups relies on parallel single-camera tracking (SCT) processes followed by a centralized multi-camera aggregation stage. This two-stage dependency on a central entity hinders real-time deployment and limits scalability.
Overlapping FOVs provide complementary 3D geometric information that reduces the impact of occlusions, which is one of the most challenging issues in multi-object tracking[4,18]. Although centralized methods have long benefited from overlapping FOVs, fully distributed MCMT systems that operate directly in the 3D ground plane and are demonstrated at large scale on fixed, calibrated overlapping camera networks remain scarce[1,51,33]. The fundamental difficulty is a self-conflicting requirement: the multi-view geometric cues improve tracking accuracy, yet exploiting them in a distributed manner is challenging and has limited the scalability of accurate multi-view tracking.
One of the main challenges for scalable deployment is the availability of computing resources. Advances in computer and communication technology have enabled larger interconnected camera networks for real-time supervision. Simultaneously, models and algorithm complexity have grown proportionally. Therefore, accurately tracking multiple objects in real-time across large camera networks while leveraging overlapping views remains an open challenge. Recent fully distributed approaches already demonstrate that peer-to-peer coordination can sustain online multi-camera tracking[51,33], but they primarily operate in 2D image space and do not exploit calibrated 3D ground-plane geometry in large-scale overlapping camera networks.
As illustrated in Fig. 1, we propose a fully distributed MCMT framework in which a modular pipeline processes each video feed in parallel without a central aggregator. Each camera node executes a pipeline comprising modules for data association, target management, motion estimation, monocular 3D perception, inter-camera communication, and distributed multi-view fusion. The framework enables multi-view identity propagation and 3D fusion through lightweight inter-camera communication, allowing each node to reason locally while achieving globally consistent associations across overlapping views.
Our main contributions are highlighted below:
- A fully distributed multi-view 3D tracking framework for calibrated overlapping cameras, in which each node performs local 3D tracking, peer-to-peer ID propagation, and multi-view fusion without a central aggregation server.
- An occlusion-aware monocular 3D detector with mechanisms that turn noisy single-view detections into reliable multi-view measurements.
- A three-stage distributed ID propagation protocol that ensures global ID convergence without a central node.
- Large-scale evaluation with state-of-the-art accuracy on standard benchmarks, measured communication overhead, and synchronized deployment.
2 Related Work
Early object tracking systems based on multiple cameras[39,19] used triangulation techniques or exploited the geometry of the scene[22] to combine information from different perspectives. One of the main motivations for the development of multi-camera tracking techniques was the resolution of target occlusions in single-view systems[4]. These approaches were designed to operate on a single computer or on multiple computers orchestrated by a leader. As the scale of multi-camera systems increased, it became clear that these early centralized approaches were limited to small areas covered by few cameras[40].
Decentralized approaches group cameras into clusters and designate lead nodes or cluster heads to aggregate information from neighboring cameras, reducing communication overhead[21,53]. Cluster heads coordinate within their groups and communicate summaries across cluster boundaries. Fully distributed approaches rely on peer-to-peer strategies in which all camera nodes operate as equal participants without hierarchical coordination entities[7]. In such systems, cameras exchange information and reach consensus through decentralized algorithms, eliminating any dependency on leader nodes or coordinators. Distributed methods have mainly focused on camera networks with disjoint FOVs so far, where appearance-based object re-identification (reID) and inter-camera linking suffice for cross-camera association. Recent fully distributed MCMT systems further demonstrate that peer-to-peer coordination can sustain online multi-camera tracking. Some approaches tackle the association problem by sharing ID labels and appearance features across cameras and maintaining a distributed label-appearance table to reach ID consensus[51]; others share full tracklets across cameras and fuse these hypotheses into consistent multi-camera trajectories[33]. However, these methods operate primarily in 2D image space and cannot perform 3D ground-plane tracking in large-scale overlapping camera networks.
Camera topology shapes algorithm design[31,1]. For non-overlapping or sparsely overlapping networks, linking models discover spatial or topological connectivity between camera views, learning which cameras observe adjacent or connected regions to establish inter-camera associations[15,34]. Recent approaches present reID strategies when targets are not visible for prolonged periods[37,26]. These methods rely on discriminative appearance features, which are also used for multi-camera associations.
For overlapping FOVs, geometric approaches are commonly used for association. Examples include the projection of image coordinates onto a global coordinate system[19] and homography-based matching[38]. Recent methods leverage transformer-based architectures and bird's-eye view (BEV) representations to aggregate multi-view information early in the pipeline[42,46]. BEVFormer[20] introduces spatiotemporal transformers that project multi-camera features onto a unified BEV space, enabling robust 3D object detection and tracking. Building on this, TrackTacular[42] combines temporal feature aggregation with appearance and motion cues for multi-view pedestrian and vehicle tracking. BEV-SUSHI[46] extends the BEV paradigm with hierarchical graph neural networks for long-term identity association. MVTrajecter[50] incorporates BEV motion and appearance costs, achieving state-of-the-art performance on pedestrian benchmarks. Other recent methods explore end-to-end temporal aggregation[52] and unified graph-based frameworks, such as the Unified Message Passing Network (UMPN)[12].
Tracking accuracy depends on detection quality[3], and multi-view detection fusion methods follow the same taxonomy: in non-overlapping networks, each camera runs a 2D detector and reID or linking models handle cross-camera association; in overlapping scenarios, centralized methods may perform fusion at detection time with multi-view or BEV detectors, while distributed systems keep detection local (single-view 2D per node) and perform 3D reasoning at association. Widely used single-view 2D detectors include YOLO[35,36,16], DETR and its variants[5,57,56], and recent YOLO iterations[44,43]. Multi-view detectors[9,8,11,2] improve accuracy in overlapping FOVs but require centralized aggregation and do not scale to distributed deployments.
3 Multi-Object Tracking Framework
MV3DT introduces a novel fully distributed and modular MCMT paradigm. Our framework aims for a real-time, online, and accurate pipeline for multi-object tracking on multiple cameras; occlusion handling and scalability are key objectives. Similar in philosophy to [21,47], we exploit simultaneous information from multiple views in a fully 3D setting. We use a lightweight peer-to-peer communication strategy to share multi-view information and resolve tracking ambiguities in real-time. Rather than resorting to a centralized multi-view tracking mechanism or aggregating the results of multiple single-view trackers, our method treats each camera as an independent agent. Hence, it can be instantiated as a single process that communicates with other cameras within a communication network. This section describes the core components: detection, data association, target management, multi-view fusion, and communications.
3.1 Object Detection Module
This module produces a set of object bounding boxes $\{\mathbf{b}_{d}\}$ for each input frame, where $\mathbf{b}_{d}=[u,v,w,h]$, $(u,v)$ are the pixel coordinates of its top-left corner, and $(w,h)$ are its width and height.
3.1.1 Monocular Foot Localization with Occlusion Handling.
To enable 3D geometric reasoning, we model target objects as cylinders $C=(r_{m},h_{m})$ with radius $r_{m}$ and height $h_{m}$ and assume that targets move on the ground plane ($z=0$). This recovery is applied to each $\mathbf{b}_{d}$ using a default cylinder with nominal height $h_{m}=1.65$ m and radius $r_{m}=0.3$ m following anthropometric conventions[32].



Figure 2: Full body bounding box and foot location recovered from an occluded detection: (left) projection of the cylinder model at the expected waist point $p_{\mathrm{waist}}$, (center) convex hull of the projected cylinder used to recover the full body, (right) adjusting the projection based on top-edge comparison to handle occlusions.
Algorithm 1: Recover 3D coordinates from bounding box
1: Input: $\mathbf{b}=[u,v,w,h]$, cylinder model $C=(r_{m},h_{m})$, camera projection matrix $P_{c}$ (per camera)
2: Output: recovered box $\mathbf{b}_{\mathrm{rec}}$, ground location $(x,y,0)$, visibility $v_{\mathrm{obj}}$, camera-object distance $d_{\mathrm{cam}}$
3: $p_{\mathrm{waist}}\leftarrow(u+w/2,\,v+h/2)$
4: Project $p_{\mathrm{waist}}$ to $z=h_{m}/2$ using $P_{c}$.
5: Project cylinder $C$ to the image using $P_{c}$.
6: Compute the convex hull of the projected cylinder (foot and head circles), take its axis-aligned bounding rectangle $\mathbf{b}_{C}$
7: $y_{\mathrm{diff}}\leftarrow v_{C}-v$ (compare top edges)
8: if $y_{\mathrm{diff}}<0$ then
9:     $y_{\mathrm{shift}}\leftarrow y_{\mathrm{diff}}$ (align top)
10:     $x_{\mathrm{shift}}$ proportional to the projective "leaning"
11: else
12:     $y_{\mathrm{shift}}\leftarrow(h_{C}-h)/2$
13:     $x_{\mathrm{shift}}\leftarrow 0$
14: end if
15: $p_{\mathrm{waist}}^{\mathrm{adj}}\leftarrow(p_{\mathrm{waist}}^{x}-x_{\mathrm{shift}},\,p_{\mathrm{waist}}^{y}-y_{\mathrm{shift}})$
16: Re-project and update $\mathbf{b}_{\mathrm{rec}}$ with adjusted waist
17: Use $P_{c}$ to back-project $p_{\mathrm{waist}}^{\mathrm{adj}}$ to world $z=0$: $(x,y,0)$
18: Compute $v_{\mathrm{obj}}$ and $d_{\mathrm{cam}}$ using Eq. 1
19: return $\mathbf{b}_{\mathrm{rec}},(x,y,0),v_{\mathrm{obj}},d_{\mathrm{cam}}$
We further assume that cameras are positioned at heights $z>h_{m}$ with angles $\theta<90^{\circ}$ with respect to the $z$-axis to map each $\mathbf{b}_{d}$ to a location on the ground plane. When these assumptions are met, most partial occlusions affect only the lower portion of the target. Fig. 2 illustrates the steps for full body recovery. The waist point $p_{\mathrm{waist}}$ (the center of the detection box) is back-projected to the world plane at $z=h_{m}/2$ using the camera projection matrix; the cylinder is placed there and projected back into the image to obtain a predicted silhouette (blue convex hull in Fig. 2). We then compare this silhouette to $\mathbf{b}_{d}$. If the projected model is taller than the detection, we treat the lower body as occluded and align the top edges; otherwise we align the bottom edges to refine the foot location. Algorithm 1 summarizes the steps.
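The back-projection step can be illustrated with a small sketch. It assumes a standard 3x4 pinhole projection matrix and intersects the viewing ray of a pixel with a horizontal plane at a chosen height; the helper names `project` and `back_project_to_plane` and the toy overhead camera in the usage lines are hypothetical and not taken from the paper.

```python
from typing import Sequence, Tuple

Matrix3x4 = Sequence[Sequence[float]]  # assumed 3x4 pinhole projection matrix


def project(P: Matrix3x4, x: float, y: float, z: float) -> Tuple[float, float]:
    """Project a world point to pixel coordinates with P."""
    hom = [row[0] * x + row[1] * y + row[2] * z + row[3] for row in P]
    return hom[0] / hom[2], hom[1] / hom[2]


def back_project_to_plane(P: Matrix3x4, u: float, v: float, z: float) -> Tuple[float, float]:
    """Intersect the viewing ray of pixel (u, v) with the horizontal plane at height z.

    From u = (P0 . X) / (P2 . X) and v = (P1 . X) / (P2 . X) with X = (x, y, z, 1),
    we get two linear equations in the unknowns x and y, solved by Cramer's rule.
    """
    a11 = P[0][0] - u * P[2][0]
    a12 = P[0][1] - u * P[2][1]
    a21 = P[1][0] - v * P[2][0]
    a22 = P[1][1] - v * P[2][1]
    b1 = -(P[0][2] - u * P[2][2]) * z - (P[0][3] - u * P[2][3])
    b2 = -(P[1][2] - v * P[2][2]) * z - (P[1][3] - v * P[2][3])
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("viewing ray is parallel to the plane")
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det


# Toy overhead camera (an assumption for illustration): height 10 m, focal length 100 px.
P_TOY = [
    [100.0, 0.0, -50.0, 500.0],
    [0.0, 100.0, -40.0, 400.0],
    [0.0, 0.0, -1.0, 10.0],
]
H_M = 1.65  # nominal cylinder height from the text
waist_px = project(P_TOY, 2.0, 3.0, H_M / 2)
waist_xy = back_project_to_plane(P_TOY, waist_px[0], waist_px[1], H_M / 2)
```

In this toy setup, projecting a waist point and back-projecting its pixel to the same plane recovers the original ground coordinates, which is the round trip the recovery procedure relies on.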
This procedure yields the recovered (full-body) bounding box of the projected cylinder ($\mathbf{b}_{\mathrm{rec}}$) and ground-plane foot location $(x,y)$. From those, we derive the target visibility $v_{\mathrm{obj}}$ and camera-target distance $d_{\mathrm{cam}}$ according to
$$v_{\mathrm{obj}}=\min\left(1,\frac{\mathrm{area}(\mathbf{b})}{\mathrm{area}(\mathbf{b}_{\mathrm{rec}})}\right),\qquad d_{\mathrm{cam}}=\left\|\mathbf{c}_{\mathrm{cam}}-(x,y,h_{m}/2)\right\|_{2},\tag{1}$$
where $\mathbf{c}_{\mathrm{cam}}$ is the camera location. These metrics are used for weighting associations, gating matches, and prioritizing targets in multi-view fusion stages.
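As a minimal sketch of Eq. 1, the two quantities can be computed as follows. Boxes are assumed to be `[u, v, w, h]` tuples as in Sec. 3.1, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def visibility(det_box: Sequence[float], rec_box: Sequence[float]) -> float:
    """v_obj: ratio of detected area to recovered full-body area, capped at 1."""
    _, _, w, h = det_box
    _, _, w_rec, h_rec = rec_box
    return min(1.0, (w * h) / (w_rec * h_rec))


def camera_distance(cam_pos: Tuple[float, float, float],
                    foot_xy: Tuple[float, float],
                    h_m: float = 1.65) -> float:
    """d_cam: Euclidean distance from the camera to the target's waist point."""
    x, y = foot_xy
    return math.dist(cam_pos, (x, y, h_m / 2))


# A detection covering the upper half of the recovered box is 50% visible.
v_obj = visibility((100, 50, 40, 60), (100, 50, 40, 120))
d_cam = camera_distance((3.0, 4.0, 0.825), (0.0, 0.0))
```

A half-occluded pedestrian yields `v_obj = 0.5`, so downstream stages can down-weight or gate that measurement.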
3.2 Data Association
Data association modules ensure the consistency of target identities on multiple views and across time. They depend on the target tracking state, which can be one of the following:
Tentative: on probation but not confirmed.
Active: confirmed and visible.
Quasi-Active: confirmed in other views but not visible.
Inactive: confirmed but currently not visible.
Terminated: no longer available.
The transitions between tracking states are shown in Fig. 3.
Figure 3: MV3DT track lifecycle and recovery logic. Tracks begin as Tentative, are promoted to Active after a short probation with consistent matches, and fall back to Inactive for shadow tracking when detections are missed. Quasi-Active denotes targets confirmed by peer cameras, enabling multi-view continuity, while Terminated closes stale tracks.
3.2.1 Single View Data Association
Single-view data association ensures the consistency of target identities from frame to frame by matching detections $\mathbf{b}_{d}$ to target predictions $\mathbf{b}_{t}$. Data association is performed using intersection over union as an overlap similarity $S_{\mathrm{IoU}}$, size similarity $S_{\mathit{Size}}(d,t)=\min\left(\frac{\mathrm{area}(\mathbf{b}_{d})}{\mathrm{area}(\mathbf{b}_{t})},\frac{\mathrm{area}(\mathbf{b}_{t})}{\mathrm{area}(\mathbf{b}_{d})}\right)$ (ratio of bounding box areas), and ReID similarity $S_{\mathrm{ReID}}$ for appearance.
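The two geometric similarities can be sketched as follows, assuming boxes in the `[u, v, w, h]` top-left format of Sec. 3.1. The function names are illustrative only.

```python
from typing import Sequence


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two [u, v, w, h] boxes (top-left corner format)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def size_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Smaller-to-larger area ratio, so the score is symmetric and lies in [0, 1]."""
    area_a, area_b = a[2] * a[3], b[2] * b[3]
    if area_a <= 0 or area_b <= 0:
        return 0.0
    return min(area_a / area_b, area_b / area_a)
```

Taking the minimum of the two area ratios makes the size score independent of which box is the detection and which is the prediction.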
3.2.2 ReID Features for Appearance.
A dedicated ReID module extracts a feature vector $\mathbf{f}_{d}$ from each detection crop using a ReID model as in [48] and maintains a gallery of such features $\{\mathbf{f}_{t,i}\}$ from each target's history. ReID similarity $S_{\mathrm{ReID}}$ is given by the maximum cosine similarity between the detection feature and the elements of the gallery: $S_{\mathrm{ReID}}(d,t)=\max_{i}\mathbf{f}_{d}^{\intercal}\mathbf{f}_{t,i}$ (with $\|\mathbf{f}_{d}\|=1$).
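A minimal sketch of the gallery lookup follows. It normalizes both the detection feature and the gallery features before taking dot products, which is an assumption (the text only states that the detection feature has unit norm), and it returns 0 for an empty gallery as an arbitrary illustrative choice.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def reid_similarity(det_feat: Sequence[float], gallery: Sequence[Sequence[float]]) -> float:
    """Maximum cosine similarity between a detection feature and a target's gallery."""
    if not gallery:
        return 0.0  # assumption: no appearance evidence yet
    f_d = l2_normalize(det_feat)
    best = -1.0
    for feat in gallery:
        f_t = l2_normalize(feat)
        best = max(best, sum(a * b for a, b in zip(f_d, f_t)))
    return best
```

Taking the maximum over the gallery lets a detection match whichever past appearance of the target it resembles most, for example a previous viewing angle.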
The overall similarity for data association is a convex combination:
$$S_{\mathrm{Global}}=w_{1}S_{\mathrm{ReID}}+w_{2}S_{\mathrm{IoU}}+w_{3}S_{\mathit{Size}},\qquad w_{1}+w_{2}+w_{3}=1.\tag{2}$$
Similar to ByteTrack[55], our method performs greedy association at each phase, but it comprises three logically distinct stages:
- Stage 1: High confidence detections are associated with confirmed Active and Inactive targets.
- Stage 2: Active targets unmatched in the first stage are associated with low confidence detections.
- Stage 3: Remaining high confidence detections are associated with Tentative targets.
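The combination in Eq. 2 and the three-stage cascade above can be sketched as follows. The weights, the confidence threshold, the minimum similarity, and the data layout (dictionaries keyed by hypothetical IDs) are all assumptions chosen for illustration; the paper does not give these values here.

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple


def global_similarity(s_reid: float, s_iou: float, s_size: float,
                      weights: Tuple[float, float, float] = (0.4, 0.4, 0.2)) -> float:
    """Convex combination of the three similarities (weights are assumed values)."""
    w1, w2, w3 = weights
    assert abs(w1 + w2 + w3 - 1.0) < 1e-9, "weights must sum to 1"
    return w1 * s_reid + w2 * s_iou + w3 * s_size


def greedy_match(dets: Iterable[Hashable], tgts: Iterable[Hashable],
                 sim: Callable[[Hashable, Hashable], float],
                 min_sim: float) -> Dict[Hashable, Hashable]:
    """Accept pairs in decreasing similarity order, one match per detection and target."""
    tgts = list(tgts)
    pairs = sorted(((sim(d, t), d, t) for d in dets for t in tgts),
                   key=lambda p: p[0], reverse=True)
    matches: Dict[Hashable, Hashable] = {}
    used = set()
    for score, d, t in pairs:
        if score < min_sim:
            break
        if d in matches or t in used:
            continue
        matches[d] = t
        used.add(t)
    return matches


def cascade(dets: Dict[Hashable, float], targets: Dict[Hashable, str],
            sim: Callable[[Hashable, Hashable], float],
            conf_thresh: float = 0.5, min_sim: float = 0.3) -> Dict[Hashable, Hashable]:
    """dets maps detection id -> confidence; targets maps target id -> state name."""
    high = [d for d, c in dets.items() if c >= conf_thresh]
    low = [d for d, c in dets.items() if c < conf_thresh]
    # Stage 1: high-confidence detections vs. confirmed (Active and Inactive) targets.
    confirmed = [t for t, s in targets.items() if s in ("Active", "Inactive")]
    m1 = greedy_match(high, confirmed, sim, min_sim)
    # Stage 2: Active targets left unmatched vs. low-confidence detections.
    taken = set(m1.values())
    active_left = [t for t, s in targets.items() if s == "Active" and t not in taken]
    m2 = greedy_match(low, active_left, sim, min_sim)
    # Stage 3: remaining high-confidence detections vs. Tentative targets.
    high_left = [d for d in high if d not in m1]
    tentative = [t for t, s in targets.items() if s == "Tentative"]
    m3 = greedy_match(high_left, tentative, sim, min_sim)
    return {**m1, **m2, **m3}
```

In practice `sim` would evaluate `global_similarity` on a detection-target pair; it is passed in as a callable here so the cascade logic stays independent of how the individual scores are computed.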
3.2.3 Multi-View Association
Multi-view association shares data among cameras to maintain track continuity. From the multi-camera point of view, we define the ego-cam as the camera view under consideration, and peer-cams as cameras in the network whose FOVs overlap with the ego-cam's. Each ego-cam stores tracklet data from its peer-cams in a PeerTargetDB structure.
Multi-View Tracklet Matching.
Central to multi-view association is tracklet matching: we compare ego and peer tracklets, each a sequence of $(f,x,y)$ (frame index and 3D feet positions). We search over a bounded time shift $\Delta t\in[-\Delta_{\max},\Delta_{\max}]$ to account for small synchronization offsets between cameras. For each shift we collect the frames where both tracklets are present, and if the number of common frames $n$ satisfies $n\geq n_{\min}$, we compute $S_{T}=1/(1+\sqrt{D^{2}/n})$, where $D^{2}=\sum_{i}d_{i}^{2}$ is the sum of the squared Euclidean distances $d_{i}$ between the two feet positions at the $i$-th common frame. We take the maximum score over all shifts. Algorithm 2 summarizes the procedure. Our multi-view target association mechanism allows for fully distributed ID propagation, which is executed simultaneously by each camera and consists of three stages.
Algorithm 2: Tracklet match score (ego vs. peer)
1: Input: tracklets $\mathcal{T}_{1}$, $\mathcal{T}_{2}$ (lists of $(f,x,y)$), time-shift range $\Delta_{\max}$, minimum common frames $n_{\min}$
2: Output: similarity score $S_{\mathrm{T}}\in[0,1]$
3: Build map: frame $f\mapsto$ index in $\mathcal{T}_{2}$
4: $S_{\mathrm{max}}\leftarrow 0$
5: for $\Delta t=-\Delta_{\max}$ to $\Delta_{\max}$ do
6:     $n\leftarrow 0$, $D^{2}\leftarrow 0$
7:     for each $(f_{1},x_{1},y_{1})$ in $\mathcal{T}_{1}$ do
8:         Look up $f_{2}=f_{1}+\Delta t$ in $\mathcal{T}_{2}$ to get $(x_{2},y_{2})$
9:         if entry exists then
10:             $D^{2}\leftarrow D^{2}+(x_{1}-x_{2})^{2}+(y_{1}-y_{2})^{2}$; $n\leftarrow n+1$
11:         end if
12:     end for
13:     if $n\geq n_{\min}$ then
14:         $S\leftarrow 1/(1+\sqrt{D^{2}/n})$; $S_{\mathrm{max}}\leftarrow\max(S_{\mathrm{max}},S)$
15:     end if
16: end for
17: return $S_{\mathrm{max}}$
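Algorithm 2 translates almost line by line into Python. The sketch below assumes tracklets are lists of `(frame, x, y)` tuples and that `n_min` is at least 1; the function name is illustrative.

```python
import math
from typing import List, Tuple

Tracklet = List[Tuple[int, float, float]]  # (frame index, x, y) on the ground plane


def tracklet_match_score(t1: Tracklet, t2: Tracklet, delta_max: int, n_min: int) -> float:
    """Best score over time shifts of 1 / (1 + RMS distance on common frames)."""
    pos2 = {f: (x, y) for f, x, y in t2}  # frame -> position lookup for the peer tracklet
    s_max = 0.0
    for dt in range(-delta_max, delta_max + 1):
        n = 0
        d2 = 0.0
        for f1, x1, y1 in t1:
            hit = pos2.get(f1 + dt)
            if hit is not None:
                x2, y2 = hit
                d2 += (x1 - x2) ** 2 + (y1 - y2) ** 2
                n += 1
        if n > 0 and n >= n_min:
            s_max = max(s_max, 1.0 / (1.0 + math.sqrt(d2 / n)))
    return s_max


# The peer sees the same straight-line walk, but its frame index lags by two frames.
ego = [(f, float(f), 0.0) for f in range(10)]
peer = [(f + 2, float(f), 0.0) for f in range(10)]
aligned = tracklet_match_score(ego, peer, delta_max=3, n_min=5)
unaligned = tracklet_match_score(ego, peer, delta_max=0, n_min=5)
```

With the shift search enabled the two tracklets align exactly (`aligned` is 1.0), while without it every common frame is 2 m apart and the score drops to 1/3, which shows why the bounded time shift matters for loosely synchronized cameras.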
Stage 1 – Re-Association at ID Acquisition.
When a target becomes Active, we first try to re-associate it within the ego-cam (matching and recovering a shadow track) as described in Sec. 3.3. If no match is found, we match the target trajectory against PeerTargetDB and adopt a peer ID if $S_{T}>\tau_{peer}$. Since multiple cameras may simultaneously track the same target, the IDs are timestamped and preference is given to the oldest ID. This ensures ID convergence for the same target across cameras. If both steps fail, the target receives a new ID.
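One possible reading of this stage is sketched below. The `PeerEntry` layout, the `score` callable (standing in for the tracklet match score of Algorithm 2), and the rule of picking the oldest timestamp among all peers above the threshold are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Tracklet = List[Tuple[int, float, float]]


@dataclass
class PeerEntry:
    target_id: int
    id_timestamp: float  # when the ID was first issued (smaller means older)
    tracklet: Tracklet


def acquire_id(tracklet: Tracklet,
               shadow_tracks: Dict[int, Tracklet],
               peer_db: List[PeerEntry],
               score: Callable[[Tracklet, Tracklet], float],
               tau_ego: float, tau_peer: float,
               new_id: Callable[[], int]) -> Tuple[int, str]:
    """Return (target ID, source), where source is 'ego', 'peer', or 'new'."""
    # Step 1: try to recover a shadow track within the ego-cam.
    best_id, best_score = None, tau_ego
    for tid, shadow in shadow_tracks.items():
        s = score(tracklet, shadow)
        if s > best_score:
            best_id, best_score = tid, s
    if best_id is not None:
        return best_id, "ego"
    # Step 2: match against peer tracklets and prefer the oldest matching ID.
    candidates = [p for p in peer_db if score(tracklet, p.tracklet) > tau_peer]
    if candidates:
        oldest = min(candidates, key=lambda p: p.id_timestamp)
        return oldest.target_id, "peer"
    # Step 3: nothing matched, so issue a fresh ID.
    return new_id(), "new"
```

Because every camera applies the same oldest-ID preference, two cameras that match the same pair of peer entries end up choosing the same ID without any coordinator.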
Stage 2 – Late Peer Re-Association.
In certain scenarios, such as initialization or in the presence of non-negligible communication delays, tracks on different cameras may become Active nearly simultaneously. To alleviate this problem, we allow tracklet matching to occur for a period $T_{rec}$ (in frames) after the target's activation. During this period, the target is considered recently active.
Stage 3 – ID Correction.
When multiple targets with similar motion patterns are close to each other or when targets remain in the image border for a long period, their 3D location estimates can be inaccurate, potentially leading to incorrect associations. This stage detects such errors by performing tracklet matching between associated targets. When any associated target no longer meets the matching criteria, i.e., $S_{T}<\tau_{peer}$, the algorithm checks the age of the targets, and the older target keeps the ID.
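Stages 2 and 3 reduce to two small checks, sketched below. The text does not say what happens to the younger target once it loses the shared ID, so issuing a fresh ID here is purely an assumption, as are the function names and the tie-breaking rule.

```python
from typing import Callable


def is_recently_active(frames_since_activation: int, t_rec: int) -> bool:
    """Stage 2: late peer re-association is allowed for t_rec frames after activation."""
    return 0 <= frames_since_activation <= t_rec


def corrected_id(shared_id: int, ego_age: int, peer_age: int,
                 match_score: float, tau_peer: float,
                 new_id: Callable[[], int]) -> int:
    """Stage 3: return the ID the ego target should carry after the consistency check."""
    if match_score >= tau_peer:
        return shared_id  # the association still holds
    if ego_age >= peer_age:
        return shared_id  # the older target keeps the ID (ties favor the ego target here)
    return new_id()       # assumption: the younger target is relabeled with a fresh ID
```

If the ID changes, the camera would then announce it to its peers with an `adoptedID` message as described in Sec. 3.5.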
3.3 Target Management
Target management handles the lifecycle of tracked objects and their trajectories based on ego-cam and peer-cam data.
3.3.1 Single-View Target Management.
Following the core principles of SORT[3] and DeepSORT[48], and as illustrated in Fig. 3, unmatched detections initialize new tracks in Tentative mode, and after a probation age, the target becomes Active if it accumulates enough matches. Targets without a current match transition to Inactive, which keeps track through misdetections and brief occlusions by updating the target locations using motion prediction, a strategy we call shadow tracking.
Since target motion estimates can drift during shadow tracking, recovering a target after occlusion based solely on its current position can be inaccurate. Hence, we maintain a re-association database (ReAssocDB) that stores trajectory projections from targets in shadow tracking. When a track becomes Active after its probation period, its trajectory is compared with all Inactive trajectories in ReAssocDB to obtain $S_{\mathrm{T}}$. If a match occurs ($S_{\mathrm{T}}>\tau_{ego}$), the new track inherits the identity and history of the corresponding Inactive target.
If an Inactive target is matched with a detection before reaching its maximum shadow tracking age, it transitions back to Active. Otherwise, it is terminated and its trajectory removed from ReAssocDB.
3.3.2 Multi-View Target Management.
The following mechanisms use peer data to manage the lifecycle of the targets.
When no measurements are available in the ego-cam, we can use peer-cam observations to keep tracking the target. We introduce the quasi-active state for those cases. In quasi-active tracking, the target state is updated using only peer-cam measurements. The target returns to active if matched with a detection or is terminated if not matched and peer measurements are no longer available.
To better address the early birth of targets, our method also initiates the association process based on peer-cam observations. When a new target is partially detected (i.e., it is becoming unoccluded or is at the edges of the FOV) or far from the camera, its detections can have very low confidence. If such detections align with a confident peer target, they probably represent an actual target. Thus, MV3DT starts a new track in these early detection cases.
Similar to the early tracker instantiation, when targets are fully occluded before the ego-cam can see them for the first time, MV3DT can “see-through” the occlusions by instantiating a new track for such targets. A see-through target is initialized in quasi-active mode, as long as a peer-cam provides a confident state and the target location projects onto the ego-cam FOV.
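The lifecycle described in Secs. 3.2 and 3.3 can be summarized as a small state machine. This is a sketch under several assumptions that the text leaves open: the probation and shadow-age thresholds are placeholder values, a Tentative track that fails probation is terminated, and an unmatched Active track goes to Quasi-Active when peer measurements exist and to Inactive otherwise.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    TENTATIVE = "Tentative"
    ACTIVE = "Active"
    QUASI_ACTIVE = "Quasi-Active"
    INACTIVE = "Inactive"
    TERMINATED = "Terminated"


def next_state(state: State, matched: bool, peer_available: bool,
               age: int = 0, hits: int = 0, missed: int = 0,
               probation_age: int = 3, min_hits: int = 3,
               max_shadow_age: int = 30) -> State:
    """One lifecycle step; all thresholds are illustrative, not values from the paper."""
    if state is State.TENTATIVE:
        if age < probation_age:
            return State.TENTATIVE
        return State.ACTIVE if hits >= min_hits else State.TERMINATED
    if state is State.ACTIVE:
        if matched:
            return State.ACTIVE
        return State.QUASI_ACTIVE if peer_available else State.INACTIVE
    if state is State.INACTIVE:  # shadow tracking on motion prediction only
        if matched:
            return State.ACTIVE
        return State.TERMINATED if missed >= max_shadow_age else State.INACTIVE
    if state is State.QUASI_ACTIVE:  # updated from peer-cam measurements only
        if matched:
            return State.ACTIVE
        return State.QUASI_ACTIVE if peer_available else State.TERMINATED
    return State.TERMINATED


def initial_state(detected: bool, confident_peer: bool,
                  projects_into_fov: bool) -> Optional[State]:
    """Track birth: a detection starts Tentative; a see-through target starts Quasi-Active."""
    if detected:
        return State.TENTATIVE
    if confident_peer and projects_into_fov:
        return State.QUASI_ACTIVE
    return None
```

Keeping the transition rule a pure function of the current state and a few counters makes it easy to test each edge of Fig. 3 in isolation.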
3.4 Multi-View Measurement Fusion.
Motion estimation is performed by a Kalman filter[17] to maintain the state (feet location and constant velocity) of each target and predict their locations for data association using ego-cam measurements. Our system also fuses measurements whenever a target is visible in different camera views. However, factors such as low visibility or a long distance from the camera degrade measurement quality, sometimes so much so that a peer measurement should be treated as an outlier. We therefore apply the following criteria before fusing a peer measurement with the ego-cam estimate: i) it projects onto the ego-cam FOV; ii) its visibility (from the monocular detection module) is greater than a minimum threshold; iii) its distance from the ego-cam target prediction is less than a threshold; iv) the peer target visibility is greater than the ego target visibility; and v) the distance from the target to the peer camera is smaller than its distance to the ego-cam. Criteria (i)–(iii) ensure the measurement is geometrically and qualitatively acceptable; (iv) and (v) ensure that fusing it actually improves over the ego estimate. Measurements meeting these criteria are fed into the state estimator jointly with the ego-cam measurements to update the target states. Each Kalman filter iteration runs one prediction step and multiple correction steps, one for each measurement.
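The gating criteria and the "one prediction, several corrections" pattern can be sketched as follows. For brevity the filter treats the x and y axes as independent constant-velocity models with scalar noise terms; this simplification, the threshold defaults, and all names are assumptions rather than the paper's implementation.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Measurement:
    xy: Tuple[float, float]  # ground-plane foot location
    visibility: float        # v_obj from the detection module
    cam_dist: float          # d_cam from the detection module


def accept_peer(peer: Measurement, ego: Measurement, predicted_xy: Tuple[float, float],
                in_ego_fov: bool, min_visibility: float = 0.3,
                max_distance: float = 1.0) -> bool:
    """Criteria (i)-(v) from the text; threshold values are placeholders."""
    return (in_ego_fov
            and peer.visibility > min_visibility
            and math.dist(peer.xy, predicted_xy) < max_distance
            and peer.visibility > ego.visibility
            and peer.cam_dist < ego.cam_dist)


class AxisKF:
    """Constant-velocity Kalman filter for one axis; covariance is [[a, b], [b, c]]."""

    def __init__(self, pos: float, var_pos: float = 1.0, var_vel: float = 1.0):
        self.p, self.v = pos, 0.0
        self.a, self.b, self.c = var_pos, 0.0, var_vel

    def predict(self, dt: float, q_pos: float = 0.01, q_vel: float = 0.01) -> None:
        self.p += self.v * dt
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]]; update a, then b, then c.
        self.a += 2.0 * dt * self.b + dt * dt * self.c + q_pos
        self.b += dt * self.c
        self.c += q_vel

    def correct(self, z: float, r: float) -> None:
        s = self.a + r                      # innovation variance for a position measurement
        k0, k1 = self.a / s, self.b / s     # Kalman gain
        innovation = z - self.p
        self.p += k0 * innovation
        self.v += k1 * innovation
        # P <- (I - K H) P with H = [1, 0]; c must use the old b.
        self.c -= k1 * self.b
        self.b *= 1.0 - k0
        self.a *= 1.0 - k0


def fuse_step(kf_x: AxisKF, kf_y: AxisKF, dt: float,
              measurements: Iterable[Measurement], r: float = 0.25) -> None:
    """One prediction followed by one correction per accepted measurement."""
    kf_x.predict(dt)
    kf_y.predict(dt)
    for m in measurements:
        kf_x.correct(m.xy[0], r)
        kf_y.correct(m.xy[1], r)
```

Each extra correction shrinks the position variance further, which is how a clear peer view tightens the estimate of a target that the ego-cam sees poorly.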
3.5 Inter-camera Communication
The communication module is based on a publish/subscribe paradigm, in which each camera publishes its data on a dedicated topic and other cameras selectively subscribe to these topics to receive the information they require.
Message Types.
The protocol defines three message types (Fig. 4): tracklet, sent when a target becomes active and used to create entries in the PeerTargetDB; stateUpdate, sent after each Kalman update to refresh existing entries; and adoptedID, sent when an ID changes due to late peer re-association or ID correction to update the corresponding entry.
Figure 4: Message fields: all message types include frame, camID, targetID, and targetID Ts (timestamp). tracklet and stateUpdate also carry targetAge, state, stateTime, visibility, and camDist; tracklet further includes the tracklet payload, while adoptedID adds only prevID (the ID replaced by targetID).
The communication and data processing overhead in MV3DT depends on a configurable subscription graph. For large-scale deployments, a naive, fully connected graph (“all-to-all” configuration) would flood every camera with irrelevant data, creating high network usage and CPU overhead. To prevent this, each camera subscribes only to its peers, i.e., other cameras with overlapping FOVs.
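The message layout of Fig. 4 and its effect on the PeerTargetDB can be sketched with a single dataclass. The field names follow the figure, but the field types, the extra `msgType` discriminator, the JSON encoding, and the `(camID, targetID)` database key are assumptions made for this illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Message:
    msgType: str                 # "tracklet", "stateUpdate", or "adoptedID" (assumed field)
    frame: int
    camID: str
    targetID: int
    targetIDTs: float            # timestamp of the target ID
    targetAge: Optional[int] = None
    state: Optional[List[float]] = None
    stateTime: Optional[float] = None
    visibility: Optional[float] = None
    camDist: Optional[float] = None
    tracklet: Optional[List[List[float]]] = None  # rows of [frame, x, y]
    prevID: Optional[int] = None                  # only used by adoptedID


def encode(msg: Message) -> str:
    return json.dumps(asdict(msg))


def decode(payload: str) -> Message:
    return Message(**json.loads(payload))


PeerTargetDB = Dict[Tuple[str, int], Message]


def apply_message(db: PeerTargetDB, msg: Message) -> None:
    """Update the local peer database according to the message type."""
    key = (msg.camID, msg.targetID)
    if msg.msgType == "tracklet":        # create an entry
        db[key] = msg
    elif msg.msgType == "stateUpdate":   # refresh an existing entry
        entry = db.get(key)
        if entry is not None:
            entry.frame, entry.targetAge = msg.frame, msg.targetAge
            entry.state, entry.stateTime = msg.state, msg.stateTime
            entry.visibility, entry.camDist = msg.visibility, msg.camDist
    elif msg.msgType == "adoptedID":     # re-key the entry under its new ID
        entry = db.pop((msg.camID, msg.prevID), None)
        if entry is not None:
            entry.targetID, entry.targetIDTs = msg.targetID, msg.targetIDTs
            db[key] = entry
```

Only the `tracklet` message carries a trajectory payload; the per-frame `stateUpdate` stays small, which is consistent with the low communication overhead reported for the system.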
MV3DT's modular design allows for different communication technologies and protocols. We currently support an in-memory module for single-computer deployments, using shared memory for low-latency message passing, and an MQTT-based module[23] for distributed setups, using Eclipse Mosquitto[13] as the message broker.
4 Experiments and Results
We evaluate our approach on three benchmark datasets, comparing 3D tracking accuracy and scalability against state-of-the-art methods. Our experiments demonstrate that MV3DT achieves competitive accuracy while offering superior scalability for large-scale deployments.
We use NVIDIA DeepStream[27] for video ingestion and generic people detectors from NVIDIA TAO[30]. We denote as PNT the transformer-based detector PeopleNet Transformer 1.1[29] used in the tracking accuracy assessment and as PN3 the lightweight detector PeopleNet 2.6.3[28] used for better time performance in the scalability evaluation.
4.1 Datasets
WILDTRACK[6] is a widely used multi-view pedestrian tracking benchmark featuring 7 synchronized cameras with overlapping fields of view covering an outdoor plaza. The dataset provides video sequences for 2,000 frames at 10 FPS with 400 annotated frames at 2 FPS. Camera calibrations are also provided.
SCOUT[12] is a recent larger scale dataset with 25 calibrated cameras capturing real outdoor pedestrian traffic over long sequences. It provides 12,000 frames at 10 FPS covering a 450-meter path, with ground truth annotations for 8 cameras.
AI City 2024 Warehouse Synthetic Dataset[45] is one of the AI City Challenge scenes, representing a 100-camera simulation of a large-scale warehouse environment for 6 minutes. It features dense camera coverage and realistic occlusion patterns. The dataset is used to stress-test scalability and analyze real-time performance.
4.2 Tracking Accuracy on WILDTRACK
Tab. 1 compares MV3DT against recent state-of-the-art multi-view tracking methods on WILDTRACK. MV3DT achieves 96.5% IDF1, 93.1% MOTA, and 94.6% MOTP at 27 FPS, tying the best reported IDF1 while achieving the highest MOTP compared to the state-of-the-art methods. Although UMPN[12] and MVTrajecter[50] achieve slightly higher MOTA, they rely on learned, scene-specific models. MV3DT trades marginal MOTA for deployability, scalability, and real-time throughput.
Table 1: Results on the WILDTRACK test set: last 10% of the 7 sequences.
4.3 Tracking Accuracy and Performance on SCOUT
Tab. 2 presents results on the labeled subset of the SCOUT dataset (8 annotated cameras), using a 50% train/test split as in [12]. MV3DT outperforms the baseline (UMPN+SP) by a margin of +14.7 IDF1, +25.9 MOTA, and +19.3 MOTP percentage points when using PNT. With the PN3 detector, MV3DT also shows scale-up capabilities: higher throughput and still better accuracy than UMPN.
Table 2:Results on the SCOUT test set: last 50% of the 8 annotated cameras.
4.4 Ablation of the ID propagation protocol
To quantify the contribution of each stage of our ID propagation protocol, we run an ablation on WILDTRACK by progressively enabling Stages 1–3 of multi-view association. As shown in Tab. 3, multi-view processing dramatically improves ID consistency over a single-view baseline (multi-view communication disabled). Adding Stage 1 (ID Acquisition) more than doubles both metrics, showing the benefit of cross-camera matching. Late Re-Association brings a particularly large gain on WILDTRACK because the sequences are short, so a significant proportion of tracks are activated nearly simultaneously during pipeline initialization. Finally, ID Correction further reduces residual ID errors, reaching 96.5% IDF1 and 93.1% MOTA, matching the full-system WILDTRACK results in Tab. 1.
Table 3: Ablation of the three-stage ID propagation on WILDTRACK. Each row adds the indicated component. “Single-view only” disables multi-view interactions.
4.5 Large-Scale Deployment
To validate the scalability of our method, we use two NVIDIA H100 NVL GPUs to process the AI City 2024 Warehouse Synthetic Dataset (50 streams per GPU).
Synchronized real-time performance:
Barrier-based frame synchronization limits each camera’s input stream to at most one frame per $1/f_{s}$ period. In this scenario, MV3DT sustains 30 FPS across all 100 streams. GPU utilization averages 95% and the network bandwidth for inter-camera communication is 44 MB/s total (0.44 MB/s per camera), indicating efficient resource usage and demonstrating the lightweight nature of the communications. The average inter-camera message latency, which is critical for real-time multi-view association, is 5 ms with the 95th percentile at 12 ms, well below the typical 33 ms real-time budget.
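The figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `sync_budget` and its keys are hypothetical, 1 MB is taken as 1000 KB, and the per-frame payload assumes that inter-camera traffic is spread evenly over frames, which the text does not state.

```python
def sync_budget(fps, total_mb_per_s, num_cameras, mean_latency_ms, p95_latency_ms):
    """Derive per-camera and per-frame figures from the reported aggregates."""
    frame_period_ms = 1000.0 / fps                       # 1 / f_s, in milliseconds
    per_camera_mb_per_s = total_mb_per_s / num_cameras   # average share per camera
    # Assumption: traffic is spread evenly over frames, and 1 MB = 1000 KB.
    per_frame_kb = per_camera_mb_per_s * 1000.0 / fps
    return {
        "frame_period_ms": frame_period_ms,
        "per_camera_mb_per_s": per_camera_mb_per_s,
        "per_frame_kb": per_frame_kb,
        "mean_headroom_ms": frame_period_ms - mean_latency_ms,
        "p95_headroom_ms": frame_period_ms - p95_latency_ms,
    }


# Values reported for the synchronized 100-camera run.
budget = sync_budget(fps=30, total_mb_per_s=44.0, num_cameras=100,
                     mean_latency_ms=5.0, p95_latency_ms=12.0)
for name, value in budget.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions each camera exchanges roughly 15 KB per frame period, and even the 95th-percentile latency leaves about 21 ms of the 33 ms frame period unused.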
Asynchronous stress test:
Without frame synchronization, MV3DT achieves 44.5 FPS on 100 streams. For comparison, a single-view baseline achieves 45.5 FPS, indicating that distributed ID propagation and fusion add an overhead of only 2.2%. Tab. 4 reports tracking accuracy with and without frame synchronization on the 100-camera warehouse scene. We also provide the current AI City Challenge leaderboard as a reference for expected accuracy. Note, however, that our accuracy evaluation covers only the specific warehouse scene.
Table 4: Tracking accuracy on the 100-camera AI City Warehouse Scene: synchronized vs. unsynchronized (stress test). Leaderboard results on the complete AICity’24 dataset are provided for reference. These methods do not report real-time performance but are unlikely to achieve real-time operation at such a scale.
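The 2.2% figure is consistent with a relative throughput drop measured against the single-view baseline. The text does not spell out the formula, so the sketch below is one plausible reading; the function name `relative_overhead` is hypothetical.

```python
def relative_overhead(baseline_fps, system_fps):
    """Fraction of baseline throughput lost when multi-view processing is enabled."""
    return (baseline_fps - system_fps) / baseline_fps


# Reported throughputs on 100 streams without frame synchronization.
overhead = relative_overhead(baseline_fps=45.5, system_fps=44.5)
print(f"overhead: {overhead * 100:.1f}%")
```

Normalizing by the MV3DT throughput instead of the baseline changes the result only in the second decimal place, so either reading rounds to the reported value.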
5 Conclusions
MV3DT addresses a critical gap in multi-camera tracking: the need for real-time, scalable systems that operate without scene-specific training or retraining for different camera configurations. Unlike learned approaches, such as UMPN, BEV-SUSHI, and MVTrajecter, MV3DT is a zero-shot method that requires only camera calibrations. Most competing methods cannot be deployed to different camera configurations without retraining, effectively locking solutions to their training geometry.
Our experiments demonstrate that MV3DT achieves competitive accuracy on WILDTRACK, SCOUT, and the AI City 2024 warehouse scene, while providing superior scalability. On the 100-camera warehouse setup, MV3DT sustains 30 FPS with 5 ms average inter-camera latency and only 2.2% communication overhead. On SCOUT (8 cameras), it reaches 28–55 FPS depending on the detector, compared to 2 FPS for UMPN. The ability to scale out by adding more nodes and cameras is where the architectural difference matters most. Most methods require all cameras to share a single GPU, so throughput and memory bounds prevent them from scaling beyond a handful of cameras. MV3DT, in contrast, is designed for distributed deployment: each camera stream runs in a separate process (machine or GPU), and nodes communicate only via lightweight messaging (e.g., MQTT). Scaling is thus limited by network bandwidth rather than GPU memory or compute. This positions MV3DT as uniquely suited for large-scale real-world deployments in warehouses, airports, and smart cities.
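To give a sense of what lightweight messaging can mean in practice, the sketch below serializes a single per-track state update and measures its size. It is not the MV3DT wire format: the `TrackState` class, all field names, and the JSON encoding are assumptions made for illustration, and no MQTT client is involved. The payload bytes would simply be handed to whatever publish call the messaging layer provides.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TrackState:
    """Hypothetical per-track update shared between camera nodes."""
    camera_id: str
    track_id: int
    timestamp_ms: int
    position_xyz: tuple   # 3D position in a shared world frame (assumed)
    confidence: float


def encode(state):
    """Serialize a track state to compact UTF-8 JSON bytes."""
    return json.dumps(asdict(state), separators=(",", ":")).encode("utf-8")


def decode(payload):
    """Rebuild a TrackState from bytes produced by encode()."""
    fields = json.loads(payload.decode("utf-8"))
    fields["position_xyz"] = tuple(fields["position_xyz"])  # JSON lists -> tuple
    return TrackState(**fields)


state = TrackState(camera_id="cam_007", track_id=42, timestamp_ms=1718500000000,
                   position_xyz=(12.5, 3.25, 0.0), confidence=0.91)
payload = encode(state)
print(len(payload), "bytes")
```

Even with a verbose text encoding, a message of this kind stays well under 200 bytes, far smaller than a video frame or a feature map, which is why per-camera traffic can remain a small fraction of a typical network link.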
References
- [1]Amosa, T.I., Sebastian, P., Izhar, L.I., Ibrahim, O., Ayinla, L.S., Bahashwan, A.A., Bala, A., Samaila, Y.A.: Multi-camera multi-object tracking: a review of current trends and future advances. Neurocomputing (2023)
- [2]Aung, S., Park, H., Jung, H., Cho, J.: Enhancing multi-view pedestrian detection through generalized 3d feature pulling. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1196–1205 (2024)
- [3]Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and real-time tracking. In: 2016 IEEE International Conference on Image Processing (ICIP) (2016)
- [4]Black, J., Ellis, T.: Multi camera image tracking. Image and Vision Computing (2006)
- [5]Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European Conference on Computer Vision (2020)
- [6]Chavdarova, T., Baqué, P., Bouquet, S., Maksai, A., Jose, C., Bagautdinov, T., Lettry, L., Fua, P., Van Gool, L., Fleuret, F.: WILDTRACK: A multi-camera HD dataset for dense unscripted pedestrian detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
- [7]Chen, K.W., Lai, C.C., Lee, P.J., Chen, C.S., Hung, Y.P.: Adaptive learning for target tracking and true linking discovering across multiple non-overlapping cameras. IEEE Transactions on Multimedia (2011)
- [8]Daryani, A.E., Bhutta, M., Hernandez, B., Medeiros, H.: Camuvid: Calibration-free multi-view detection. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1220–1229 (2025)
- [9]Dong, Z., Zhang, Y., Huang, X., Ji, H., Shi, Z., Zhan, X., Chen, J.: Mv-detr: Multi-modality indoor object detection by multi-view detecton transformers. arXiv preprint arXiv:2408.06604 (2024)
- [10]Engilberge, M., Liu, W., Fua, P.: Multi-view tracking using weakly supervised human motion prediction. In: IEEE Winter Conference on Applications of Computer Vision (2023)
- [11]Engilberge, M., Shi, H., Wang, Z., Fua, P.: Two-level data augmentation for calibrated multi-view detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 128–136 (2023)
- [12]Engilberge, M., Vrkic, I., Grosche, F.W., Pilet, J., Turetken, E., Fua, P.: One graph to track them all: Dynamic gnns for single-and multi-view tracking. arXiv preprint arXiv:2507.08494 (2025)
- [13]Foundation, E.: Eclipse mosquitto. https://mosquitto.org/ (2026), message broker for MQTT
- [14]Iguernaissi, R., Merad, D., Aziz, K., Drap, P.: People tracking in multi-camera systems: a review. Multimedia Tools and Applications 78, 10773–10793 (2019)
- [15]Jiang, N., Bai, S., Xu, Y., Xing, C., Zhou, Z., Wu, W.: Online inter-camera trajectory association exploiting person re-identification and camera topology. In: Proceedings of the 26th ACM International Conference on Multimedia (2018)
- [16]Jiang, P., Ergu, D., Liu, F., Cai, Y., Ma, B.: A review of YOLO algorithm developments. Procedia Computer Science (2022)
- [17]Kalman, R.E.: A new approach to linear filtering and prediction problems. ASME Journal of Basic Engineering (1960)
- [18]Kim, J., Shin, W., Park, H., Baek, J.: Addressing the occlusion problem in multi-camera people tracking with human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5463–5469 (2023)
- [19]Lee, L., Romano, R., Stein, G.: Monitoring activities from multiple video streams: Establishing a common coordinate frame. IEEE Transactions on Pattern Analysis and Machine Intelligence (2000)
- [20]Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., Qiao, Y., Dai, J.: BEVFormer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In: European Conference on Computer Vision. pp. 1–18. Springer (2022)
- [21]Medeiros, H., Park, J., Kak, A.: Distributed object tracking using a cluster-based Kalman filter in wireless camera networks. IEEE Journal of Selected Topics in Signal Processing (2008)
- [22]Mikic, I., Santini, S., Jain, R.: Video processing and integration from multiple cameras. In: Proceedings of the 1998 Image Understanding Workshop, Morgan-Kaufman, San Francisco (1998)
- [23]MQTT.org: MQTT - The Standard for IoT Messaging. https://mqtt.org/ (2019)
- [24]Naphade, M., Anastasiu, D.C., Sharma, A., Jagrlamudi, V., Jeon, H., Liu, K., Chang, M.C., Lyu, S., Gao, Z.: The NVIDIA AI City Challenge. In: 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI) (2017)
- [25]Naphade, M., Wang, S., Anastasiu, D.C., Tang, Z., Chang, M.C., Yao, Y., Zheng, L., Rahman, M.S., Arya, M.S., Sharma, A., et al.: The 7th AI city challenge. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)
- [26]Nguyen, Q.Q.V., Le, H.D.A., Chau, T.T.T., Luu, D.T., Chung, N.M., Ha, S.V.U.: Multi-camera people tracking with mixture of realistic and synthetic knowledge. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)
- [27]NVIDIA Corporation: NVIDIA deepstream SDK. https://developer.nvidia.com/deepstream-sdk (2026)
- [28]NVIDIA Corporation: NVIDIA TAO peoplenet. https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/peoplenet (2026), deployable quantized ONNX v2.6.3
- [29]NVIDIA Corporation: NVIDIA TAO peoplenet transformer. https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/peoplenet_transformer (2026), deployable v1.1
- [30]NVIDIA Corporation: NVIDIA TAO toolkit. https://developer.nvidia.com/tao-toolkit (2026)
- [31]Olagoke, A.S., Ibrahim, H., Teoh, S.S.: Literature survey on multi-camera system and its application. IEEE Access (2020)
- [32]Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3D hands, face, and body from a single image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)
- [33]Previtali, F., Bloisi, D.D., Iocchi, L.: A distributed approach for real-time multi-camera multiple object tracking. Machine Vision and Applications (2017)
- [34]Quach, K.G., Nguyen, P., Le, H., Truong, T.D., Duong, C.N., Tran, M.T., Luu, K.: DyGLIP: A dynamic graph model with link prediction for accurate multi-camera multiple object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13784–13793 (2021)
- [35]Redmon, J.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
- [36]Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
- [37]Ristani, E., Tomasi, C.: Features for multi-target multi-camera tracking and re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
- [38]Siddique, A., Medeiros, H.: Tracking passengers and baggage items using multiple overhead cameras at security checkpoints. IEEE Transactions on Systems, Man, and Cybernetics: Systems (2022)
- [39]Stein, G.P.: Tracking from multiple view points: Self-calibration of space and time. In: Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (1999)
- [40]Taj, M., Cavallaro, A.: Distributed and decentralized multicamera tracking. IEEE Signal Processing Magazine (2011)
- [41]Tang, Z., Wang, S., Anastasiu, D.C., Chang, M.C., Sharma, A., Kong, Q., Kobori, N., Gochoo, M., Batnasan, G., Otgonbold, M.E., Alnajjar, F., Hsieh, J.W., Kornuta, T., Li, X., Zhao, Y., Zhang, H., Radhakrishnan, S., Jain, A., Kumar, R., Murali, V.N., Wang, Y., Pusegaonkar, S.S., Wang, Y., Biswas, S., Wu, X., Zheng, Z., Chakraborty, P., Chellappa, R.: The 9th AI City Challenge (2025), https://arxiv.org/abs/2508.13564
- [42]Teepe, T., Wolters, P., Gilg, J., Herzog, F., Rigoll, G.: Lifting multi-view detection and tracking to the bird’s eye view. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 667–676 (2024)
- [43]Wang, A., Chen, H., Liu, L., Chen, K., Lin, Z., Han, J., Ding, G.: YOLOv10: Real-time end-to-end object detection. In: Advances in Neural Information Processing Systems (2024)
- [44]Wang, C.Y., Yeh, I.H., Liao, H.Y.M.: YOLOv9: Learning what you want to learn using programmable gradient information. arXiv preprint arXiv:2402.13616 (2024)
- [45]Wang, S., Anastasiu, D.C., Tang, Z., Chang, M.C., Yao, Y., Zheng, L., Rahman, M.S., Arya, M.S., Sharma, A., Chakraborty, P., et al.: The 8th AI City Challenge. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
- [46]Wang, Y., Meinhardt, T., Cetintas, O., Yang, C.Y., Satish Pusegaonkar, S., Missaoui, B., Biswas, S., Tang, Z., Leal-Taixé, L.: Bev-sushi: Multi-target multi-camera 3d detection and tracking in bird’s-eye view. arXiv e-prints pp. arXiv–2412 (2024)
- [47]Wang, Y.: Distributed multi-object tracking with multi-camera systems composed of overlapping and non-overlapping cameras. The University of Nebraska-Lincoln (2013)
- [48]Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP) (2017)
- [49]Xie, Z., Ni, Z., Yang, W., Zhang, Y., Chen, Y., Zhang, Y., Ma, X.: A robust online multi-camera people tracking system with geometric consistency and state-aware re-id correction. In: CVPR Workshop. Seattle, WA, USA (2024)
- [50]Yamane, T., Masumura, R., Suzuki, S., Orihashi, S.: Mvtrajecter: Multi-view pedestrian tracking with trajectory motion cost and trajectory appearance cost. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13270–13280 (2025)
- [51]Yang, S., Ding, F., Li, P., Hu, S.: Distributed multi-camera multi-target association for real-time tracking. Scientific Reports (2022)
- [52]Yang, Y., Xu, M., Ralph, J.F., Ling, Y., Pan, X.: An end-to-end tracking framework via multi-view and temporal feature aggregation. Computer Vision and Image Understanding 249, 104203 (2024)
- [53]Yoder, J., Medeiros, H., Park, J., Kak, A.C.: Cluster-based distributed face tracking in camera networks. IEEE Transactions on Image Processing (2010)
- [54]Yoshida, R., Okubo, J., Fujii, J., Amakata, M., Yamashita, T.: Overlap suppression clustering for offline multi-camera people tracking. In: CVPR Workshop. Seattle, WA, USA (2024)
- [55]Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., Wang, X.: Bytetrack: Multi-object tracking by associating every detection box. In: European Conference on Computer Vision (2022)
- [56]Zhao, Y., Lv, W., Xu, S., Wei, J., Wang, G., Dang, Q., Liu, Y., Chen, J.: RT-DETR: DETRs beat YOLOs on real-time object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16788–16797 (2024)
- [57]Zong, Z., Song, G., Liu, Y.: DETRs with collaborative hybrid assignments training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)