ResAware: Cross-Environment Website Fingerprinting via Resource-Privileged Distillation
Summary
ResAware proposes a resource-aware distillation framework to improve website fingerprinting robustness across different network environments by training a teacher model on resource-level features and distilling knowledge to a student model that uses only encrypted traffic, achieving significant gains under temporal drift and other perturbations.
# ResAware: Cross-Environment Website Fingerprinting via Resource-Privileged Distillation

Source: [https://arxiv.org/html/2606.17462](https://arxiv.org/html/2606.17462)

Chongru Fan (Chongrufan@bupt.edu.cn), Beijing University of Posts and Telecommunications, Beijing, China, and Zhongguancun Laboratory, Beijing, China; Wei Wang, Zhongguancun Laboratory, Beijing, China; Wentao Huang, Beijing University of Posts and Telecommunications, Beijing, China; Zhenquan Ding (dingzq@zgclab.edu.cn), Zhongguancun Laboratory, Beijing, China; Jinqiao Shi (shijinqiao@bupt.edu.cn), Beijing University of Posts and Telecommunications, Beijing, China; Lei Cui, Zhongguancun Laboratory, Beijing, China; Zhiyu Hao, Zhongguancun Laboratory, Beijing, China; and Xiaochun Yun, Zhongguancun Laboratory, Beijing, China

###### Abstract

While Website Fingerprinting (WF) attacks achieve high accuracy in controlled laboratory settings, they often degrade substantially in real-world environments due to factors such as spatio-temporal drift, browser heterogeneity, and proxy obfuscation. This limitation stems from their sole reliance on low-level traffic features that are noisy and highly sensitive to environmental perturbations. To address this problem, we propose ResAware, a cross-environment resource-aware distillation framework under a training-rich/inference-poor asymmetric setting. Specifically, ResAware trains a teacher model on resource-level features, and then distills the resulting privileged knowledge into a student model through heterogeneous knowledge distillation. At deployment time, the student model performs inference using only encrypted traffic, incurring zero additional cost. We evaluate ResAware on a large-scale dataset collected over five months from six globally distributed vantage points, comprising more than 160,000 paired samples.
The results show that ResAware significantly enhances the cross-environment robustness of diverse WF baselines. Under a 150-day temporal drift, for example, ResAware improves the F1-score of Var-CNN from 72.77% to 81.49% and the open-world TPR at 1% FPR from 22.40% to 27.20%. Our results demonstrate that resource-level supervision improves WF robustness without expanding online observation capabilities.

Keywords: Website Fingerprinting, Encrypted Traffic Analysis, Knowledge Distillation, Cross-Environment Robustness

CCS Concepts: Networks → Network privacy and anonymity; Security and privacy → Pseudonymity, anonymity and untraceability; Computing methodologies → Machine learning algorithms

## 1. Introduction

With the widespread adoption of HTTPS and related encryption protocols, the contents of web traffic are now largely hidden from direct inspection (Rescorla et al., [2025](https://arxiv.org/html/2606.17462#bib.bib75); Hoffman and McManus, [2018](https://arxiv.org/html/2606.17462#bib.bib76)). However, encryption does not eliminate side-channel leakage: observable traffic patterns, such as packet length, direction, and timing, can still reveal sensitive information about user activities. Website Fingerprinting (WF) exploits such leakage to infer visited websites from encrypted traffic traces (Hintz, [2002](https://arxiv.org/html/2606.17462#bib.bib77); Hayes and Danezis, [2016](https://arxiv.org/html/2606.17462#bib.bib55); Wang et al., [2014](https://arxiv.org/html/2606.17462#bib.bib64)), making it an important privacy threat to encrypted web communications.

Figure 1. A website’s identity is reflected in its architecture and resource loading patterns.
A browsing instance can be viewed as a sequence of resource deliveries, which, after being shaped by environmental noise, appears as the observable network traffic. A diagram illustrating that a website’s static resource dependency topology generates a stable application-layer resource sequence, which is then mapped to an encrypted traffic trace. ResAware uses the resource sequence as privileged offline supervision while keeping the online attacker restricted to encrypted traffic only.

Although deep learning-based WF models have achieved strong performance in closed, IID experimental settings (Sirinam et al., [2018](https://arxiv.org/html/2606.17462#bib.bib58); Bhat et al., [2019](https://arxiv.org/html/2606.17462#bib.bib68); Rahman et al., [2020](https://arxiv.org/html/2606.17462#bib.bib19); Deng et al., [2025a](https://arxiv.org/html/2606.17462#bib.bib28)), a substantial deployment gap remains between laboratory scenarios and real-world network environments. In practice, traffic features are highly susceptible to temporal evolution, geographic variation, and obfuscated proxy protocol conversion, all of which induce significant distribution shifts (Cherubin et al., [2022](https://arxiv.org/html/2606.17462#bib.bib13); Deng et al., [2025b](https://arxiv.org/html/2606.17462#bib.bib1); Li et al., [2025](https://arxiv.org/html/2606.17462#bib.bib23); Shusterman et al., [2026](https://arxiv.org/html/2606.17462#bib.bib21)). When a model is trained in one environment and evaluated in another with substantial feature discrepancies, the accuracy of mainstream WF models deteriorates markedly. This degradation suggests that existing models rely heavily on transient, environment-specific network artifacts and therefore generalize poorly across environments.

Prior work on mitigating this problem generally follows two trajectories.
The first seeks more robust traffic representations through manual feature engineering, data augmentation, or contrastive learning (Shen et al., [2023](https://arxiv.org/html/2606.17462#bib.bib20); Bahramali et al., [2023](https://arxiv.org/html/2606.17462#bib.bib33); Xie et al., [2024](https://arxiv.org/html/2606.17462#bib.bib31)). The second adopts domain adaptation, such as few-shot fine-tuning or inference-time calibration on unlabeled target traffic (Sirinam et al., [2019](https://arxiv.org/html/2606.17462#bib.bib29); Deng et al., [2026](https://arxiv.org/html/2606.17462#bib.bib14); Zhang et al., [2023](https://arxiv.org/html/2606.17462#bib.bib74)). Despite these efforts, existing methods remain confined to a traffic-only observational perspective: both training and inference rely exclusively on signals derived from encrypted traffic. Such signals are readily distorted by network variation, browser scheduling, and protocol encapsulation. As a result, current methods attempt to recover website identities from unstable observations, while overlooking the deterministic application-layer resources that give rise to these traffic patterns.

As illustrated in Figure [1](https://arxiv.org/html/2606.17462#S1.F1), our key insight is that a website’s identity determines its application-layer resource composition and dependency patterns (Wang et al., [2013](https://arxiv.org/html/2606.17462#bib.bib78); Netravali et al., [2016](https://arxiv.org/html/2606.17462#bib.bib79)). During page loading, the resulting resource sequence reflects relatively stable website-specific loading logic. By contrast, the observed network traffic is only a noisy projection of this process, shaped by substantial stochasticity and environmental variation (Juarez et al., [2014](https://arxiv.org/html/2606.17462#bib.bib80)).
Recent resource-aware WF studies suggest that resource-level information enjoys a natural robustness advantage in cross-environment settings (Cheng et al., [2025b](https://arxiv.org/html/2606.17462#bib.bib15); Gao et al., [2025](https://arxiv.org/html/2606.17462#bib.bib26); Cheng et al., [2025a](https://arxiv.org/html/2606.17462#bib.bib11)). However, obtaining such information in practice typically requires traffic decryption or end-host compromise, both of which exceed the capabilities of a standard passive eavesdropper (Panchenko et al., [2016b](https://arxiv.org/html/2606.17462#bib.bib81)).

To exploit resource-level stability without expanding the online attack surface, we formalize an asymmetric threat setting termed training-rich/inference-poor. During offline fingerprint database construction, the attacker can collect both encrypted traffic and resource-level information using controlled crawlers; during online inference, however, the attacker remains a standard passive eavesdropper limited to encrypted traffic alone. Under this setting, resource-level information is available during training but unavailable during online inference, which naturally qualifies it as Privileged Information (Vapnik and Izmailov, [2015](https://arxiv.org/html/2606.17462#bib.bib67)): an auxiliary supervisory signal that guides learning without being available as an input at inference time.

Motivated by the Learning Using Privileged Information (LUPI) paradigm (Vapnik and Izmailov, [2015](https://arxiv.org/html/2606.17462#bib.bib67); Lopez-Paz et al., [2016](https://arxiv.org/html/2606.17462#bib.bib66)), we propose ResAware, a resource-aware distillation framework for cross-environment WF under the training-rich/inference-poor setting.
Using paired traffic-resource samples, ResAware trains a resource-side teacher model on resource-level features and distills the resulting privileged knowledge into a student model through cross-modal knowledge distillation (Hinton et al., [2015](https://arxiv.org/html/2606.17462#bib.bib71); Lopez-Paz et al., [2016](https://arxiv.org/html/2606.17462#bib.bib66)). The student model operates on encrypted traffic alone. At deployment time, the resource sequences and the teacher model are removed, leaving a standard traffic-only WF classifier. In this way, ResAware improves robustness without strengthening the online attacker beyond the standard passive threat model.

We evaluate ResAware under multidimensional distribution shifts, including temporal, spatial, browser, and proxy variations, on a large-scale dataset spanning five months across six globally distributed nodes and more than 160,000 samples. The results show that ResAware consistently improves the cross-environment generalization of mainstream WF baselines with zero additional inference cost. Moreover, ResAware is orthogonal and complementary to existing target-domain adaptation techniques (Sirinam et al., [2019](https://arxiv.org/html/2606.17462#bib.bib29); Deng et al., [2026](https://arxiv.org/html/2606.17462#bib.bib14); Zhang et al., [2023](https://arxiv.org/html/2606.17462#bib.bib74)).

The main contributions of this paper are as follows:

- Asymmetric Threat Model Formalization. We introduce and formalize a training-rich/inference-poor asymmetric setting for WF. Under this setting, application-layer resource information is available only during training and naturally serves as privileged information, improving the robustness of traffic-only WF models while preserving the standard passive eavesdropper assumption.
- Cross-Modal Distillation Framework. We propose ResAware, a cross-modal knowledge distillation framework for cross-environment WF. ResAware trains a resource-side teacher model on resource-level features and distills the resulting privileged knowledge into a traffic-only student model, injecting stable resource-side supervision without requiring resource access at inference time.
- Plug-and-Play Integration with Zero Inference Overhead. ResAware can be incorporated into existing WF models through the training objective alone, without modifying backbone architectures. At deployment time, it operates directly on encrypted traffic with zero additional inference overhead and remains complementary to existing domain adaptation techniques.
- Large-Scale Benchmark and Evaluation. We construct a large-scale paired traffic-resource dataset spanning multidimensional cross-environment scenarios. On this benchmark, ResAware consistently improves the robustness of mainstream WF baselines. Under a 150-day temporal drift, it raises the F1-score of Var-CNN from 72.77% to 81.49% and improves the open-world TPR from 22.40% to 27.20% at 1% FPR. Under the more challenging obfuscated proxy drift setting, it further delivers absolute F1-score gains of 8.96 and 3.88 percentage points for Var-CNN and RF, respectively.

## 2. Threat Model

Figure 2. The training-rich/inference-poor asymmetric threat model. During offline construction, the attacker collects paired traffic and resource sequences via TLS key logging; at online inference, only encrypted traffic is observable. A two-phase diagram showing the offline construction phase where the attacker collects paired traffic and resource sequences via TLS key logging, and the online inference phase where the attacker is a passive eavesdropper observing only encrypted packet-level traffic.

As depicted in Figure [2](https://arxiv.org/html/2606.17462#S2.F2), we formalize an asymmetric threat model for cross-environment WF, termed training-rich/inference-poor.
The asymmetry lies in the fact that resource-level information is available during offline training but unavailable during online inference.

Offline Construction Phase. As shown in the upper portion of Figure [2](https://arxiv.org/html/2606.17462#S2.F2), the attacker deploys instrumented crawlers in a controlled environment to visit websites and collect both encrypted traffic and the corresponding TLS key logs. These key logs are generated solely by attacker-controlled crawlers during offline data collection and are not available from the victim during online inference. This enables offline parsing of encrypted traffic and extraction of high-fidelity application-layer resource information, which is used solely as privileged supervision during training.

Online Inference Phase. The lower portion of Figure [2](https://arxiv.org/html/2606.17462#S2.F2) depicts the online attacker as a client-side path observer, such as a local Autonomous System (AS), an Internet Service Provider (ISP), or a malicious local router. The attacker can only passively observe packet-level characteristics of the victim’s encrypted connections, including packet direction, length, and timing. The attacker cannot decrypt payloads or inject, modify, delay, or drop packets, and has no control over the victim’s endpoint. We further exclude side-channel metadata that could directly reveal the target website, including DNS queries, TLS SNI fields, certificate contents, HTTP Host headers, IP-to-domain mappings, and browser-side API fingerprints. Consistent with mainstream page-load-level WF research (Cheng et al., [2025a](https://arxiv.org/html/2606.17462#bib.bib11); Sirinam et al., [2018](https://arxiv.org/html/2606.17462#bib.bib58); Huang et al., [2023](https://arxiv.org/html/2606.17462#bib.bib82)), we assume that each test sample corresponds to a single isolated page-load event.
## 3. Motivating Analysis: Stability and Transferability of Resource-Level Features

To motivate ResAware, we examine two fundamental questions. (1) Resource-Level Feature Robustness: Are resource-level features stable and discriminative under cross-environment distribution shifts? (2) Knowledge Transferability: Can knowledge derived from offline resource-level information be effectively distilled into a traffic-only student model?

### 3.1. Are Resource-Level Features Stable and Discriminative Across Environments?

Modern web page loading follows a structured process shaped by HTML, CSS, JavaScript, and asynchronous resource fetching (Fielding et al., [2022](https://arxiv.org/html/2606.17462#bib.bib83)). While dynamic updates, advertisement injections, and A/B testing may introduce localized variation, the overall resource loading sequence of a website typically retains stable macroscopic patterns across visits (Li et al., [2023](https://arxiv.org/html/2606.17462#bib.bib84); Panchenko et al., [2016a](https://arxiv.org/html/2606.17462#bib.bib85)). In contrast, low-level traffic features, such as packet length, direction, and burst intervals, are heavily affected by transport- and network-layer dynamics, including congestion, routing changes, and TCP control behavior. This suggests that resource loading sequences may provide a more stable basis for cross-environment identification than low-level traffic features.

To validate this hypothesis, we conduct an empirical analysis along two dimensions. First, we quantify the cross-environment drift and class separability of resource representations in feature space (Finding 1). Second, we evaluate whether this relative stability yields more robust classification performance under cross-environment distribution shifts (Finding 2).
Figure 3. (a) CESM comparison between resource features ($F_{\mathrm{cat}}$, $F_{\mathrm{size}}$) and traffic bursts ($F_{\mathrm{burst}}$) under spatial and temporal drift; (b) classification F1-score decay for a resource-only vs. a traffic-only model.

Finding 1: Compared with traffic features, resource-level features exhibit substantially stronger intra-class stability and inter-class separability across environments. For each page load, we extract two resource sequences ordered by request initiation time. The first, $F_{\mathrm{size}}$, records the log-scaled payload size of each fetched resource. The second, $F_{\mathrm{cat}}$, represents resource categories as one-hot vectors. As a traffic-side baseline, we derive $F_{\mathrm{burst}}$ from encrypted traces by grouping contiguous packets traveling in the same direction and representing each burst by its signed log-scaled size. We perform cross-regional and cross-temporal measurements on 100 monitored websites. Because these sequences have variable length and may exhibit local alignment shifts under cross-environment loading conditions (Li et al., [2025](https://arxiv.org/html/2606.17462#bib.bib23)), we use Normalized Dynamic Time Warping (nDTW) (Berndt and Clifford, [1994](https://arxiv.org/html/2606.17462#bib.bib70)) to measure distances between site prototypes.
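The burst feature described above can be illustrated with a minimal sketch. It assumes packets are given as signed byte counts (positive for outgoing, negative for incoming) and uses `log1p` as the log scaling. Both are assumptions made for illustration, since the exact sign convention and scaling are not specified here, and `burst_sequence` is a hypothetical helper name.

```python
import math


def burst_sequence(packets):
    """Collapse signed packet lengths into signed log-scaled burst sizes.

    `packets` is a list of signed integers: positive for outgoing bytes and
    negative for incoming bytes (a convention assumed for this sketch).
    """
    bursts = []
    total = 0  # bytes accumulated in the current burst
    sign = 0   # direction of the current burst; 0 until the first packet
    for length in packets:
        if length == 0:
            continue  # zero-length entries carry no direction
        s = 1 if length > 0 else -1
        if sign != 0 and s != sign:
            # Direction changed: close the current burst and start a new one.
            bursts.append(sign * math.log1p(total))
            total = 0
        sign = s
        total += abs(length)
    if sign != 0:
        bursts.append(sign * math.log1p(total))  # flush the final burst
    return bursts


# Two outgoing packets, two incoming packets, one outgoing packet
# collapse into three bursts: out, in, out.
example = burst_sequence([100, 200, -1500, -1500, 50])
```

The sign of each entry keeps the burst direction, while the log scaling damps the influence of very large transfers.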
For a feature $F$, we define the same-site cross-environment drift ($\Delta_{\text{same}}$) and the different-site cross-environment distance ($\Delta_{\text{diff}}$) between a source environment $s$ and a target environment $t$ as follows:

$$
\Delta_{\text{same}}^{F} = \frac{1}{N}\sum_{i=1}^{N} d\left(p_{i,s}^{F}, p_{i,t}^{F}\right), \qquad
\Delta_{\text{diff}}^{F} = \frac{1}{N(N-1)}\sum_{i \neq j} d\left(p_{i,s}^{F}, p_{j,t}^{F}\right), \tag{1}
$$

where $N$ is the number of monitored sites, $p_{i,e}^{F}$ is the prototype of site $i$ in environment $e$ under feature $F$, and $d(\cdot,\cdot)$ is the nDTW distance. To jointly evaluate whether features retain discriminative power while suppressing environmental noise, we introduce the Cross-Environment Stability Margin (CESM):

$$
\mathrm{CESM}_{F}(s,t) = 1 - \frac{\Delta_{\text{same}}^{F}}{\Delta_{\text{diff}}^{F}}. \tag{2}
$$

A higher CESM indicates that cross-environment intra-site variation remains much smaller than inter-site discrepancy, and therefore reflects stronger robustness to environmental noise. As shown in Figure [3](https://arxiv.org/html/2606.17462#S3.F3)(a), both $F_{\mathrm{size}}$ and $F_{\mathrm{cat}}$ exhibit substantially stronger cross-environment stability than $F_{\mathrm{burst}}$. In cross-regional experiments spanning five geographic regions, $F_{\mathrm{cat}}$ and $F_{\mathrm{size}}$ achieve average CESM values of 0.675 and 0.577, respectively, compared with 0.218 for $F_{\mathrm{burst}}$, that is, 3.09$\times$ and 2.65$\times$ the $F_{\mathrm{burst}}$ value. The same trend persists under temporal drift: after 150 days, $F_{\mathrm{cat}}$ (CESM = 0.592) and $F_{\mathrm{size}}$ (CESM = 0.456) remain well above $F_{\mathrm{burst}}$ (CESM = 0.224), yielding 2.64$\times$ and 2.04$\times$ larger margins, respectively.
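As a rough illustration of Eqs. (1) and (2), the sketch below computes CESM from per-site prototypes. It rests on several assumptions that are not taken from the paper: prototypes are scalar sequences (as for $F_{\mathrm{size}}$ or $F_{\mathrm{burst}}$), the DTW cost is normalized by the sum of the two sequence lengths, and the local cost is the absolute difference. The names `ndtw` and `cesm` are hypothetical, and one-hot $F_{\mathrm{cat}}$ sequences would need a vector-valued local cost instead.

```python
def ndtw(a, b):
    """DTW cost between two scalar sequences, divided by len(a) + len(b).

    The normalization is an assumption of this sketch.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("sequences must be non-empty")
    inf = float("inf")
    # cost[i][j]: minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m] / (n + m)


def cesm(source, target, dist=ndtw):
    """CESM = 1 - delta_same / delta_diff, following Eqs. (1) and (2).

    source[i] and target[i] are the prototypes of site i in the source and
    target environments, respectively.
    """
    n = len(source)
    if n < 2 or len(target) != n:
        raise ValueError("need the same N >= 2 sites in both environments")
    delta_same = sum(dist(source[i], target[i]) for i in range(n)) / n
    delta_diff = sum(
        dist(source[i], target[j])
        for i in range(n)
        for j in range(n)
        if i != j
    ) / (n * (n - 1))
    if delta_diff == 0:
        raise ValueError("inter-site distance is zero; CESM is undefined")
    return 1.0 - delta_same / delta_diff


# Toy prototypes: each site drifts slightly, but the two sites stay far apart.
toy_source = [[0.0], [10.0]]
toy_target = [[1.0], [11.0]]
toy_margin = cesm(toy_source, toy_target)
```

In the toy data the same-site drift is small relative to the distance between different sites, so the margin is close to 1. A margin near 0 would mean that a site drifts across environments about as much as it differs from other sites.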
Together, these CESM measurements show that resource-level features are inherently more robust to environmental variation and preserve stronger discriminative structure across diverse deployment conditions.

Finding 2: The stability advantage of resource-level features yields stronger task-level robustness. To examine whether the feature-space advantage carries over to downstream classification robustness, we control all other factors and compare two classifiers built on the same Deep Fingerprinting (DF) (Sirinam et al., [2018](https://arxiv.org/html/2606.17462#bib.bib58)) architecture: a Resource-Only model, which takes $F_{\mathrm{size}}$ and $F_{\mathrm{cat}}$ as input, and a Traffic-Only model, which takes packet-level traffic sequences as input. As shown in Figure [3](https://arxiv.org/html/2606.17462#S3.F3)(b), the Resource-Only model degrades much more slowly under environmental drift. Under temporal drift, both models initially achieve near-perfect source-domain F1-scores. However, after 150 days, the Traffic-Only model drops by 33.30 percentage points to 64.85%, whereas the Resource-Only model declines by only 14.22 points and still maintains 83.50%. In cross-regional evaluation, the Resource-Only model achieves an average F1-score of 91.49% across all target regions, outperforming the Traffic-Only model by 8.35 percentage points on average.

Takeaway. These empirical results show that resource-level features are substantially more stable and robust to environmental noise than low-level traffic features under cross-environment shifts. Yet such robust resource-side signals are unavailable to a passive eavesdropper at deployment time. This gap motivates the core design of ResAware: the key challenge is not to seek stronger features from online observations alone, but to transfer resource-side robustness to a traffic-only classifier through offline cross-modal supervision.

### 3.2. Can Resource-Side Robustness Be Transferred to Traffic-Only Models?
This leads to the central methodological question behind ResAware: can the stability available only through privileged supervision during training be transferred across modalities, and if so, how can it improve a classifier that must rely solely on low-level traffic at deployment time?

Our answer is yes, but not by assuming that resource sequences can be faithfully reconstructed from encrypted traffic. In modern web communications, concurrent browser scheduling, HTTP/2/3 multiplexing, transport-layer dynamics, and network latency variation collectively entangle multiple object-level requests within a continuous packet stream. Recovering precise object boundaries from encrypted traffic is therefore highly ill-posed in practice. Building stability transfer on such packet-to-object reconstruction would not only be impractical, but would also force the model to depend on fragile local alignment assumptions.

Instead, ResAware follows a more robust transfer pathway grounded in the Learning Using Privileged Information (LUPI) paradigm (Vapnik and Izmailov, [2015](https://arxiv.org/html/2606.17462#bib.bib67); Lopez-Paz et al., [2016](https://arxiv.org/html/2606.17462#bib.bib66)) and generalized knowledge distillation. The resource view, available only during training, does not need to be reconstructed at inference time. As long as it provides a cleaner inductive signal than encrypted traffic alone, it can reshape the decision boundaries of a single-modality classifier through a teacher-student framework.

In cross-environment WF, resource-side privileged supervision fits this paradigm particularly well. Under standard hard-label supervision with Empirical Risk Minimization (ERM), a single-modality model can easily fall into shortcut learning, relying on spurious yet separable packet-level cues in the source environment and thus forming brittle decision boundaries that fail under distribution shift.
ResAware mitigates this problem by introducing a structural prior from the resource modality through soft-target supervision. This prior captures class-level similarity relationships at the application layer and guides the traffic-only model toward representations that better reflect intrinsic website identity, rather than transient environment-specific traffic patterns. What is transferred is not the raw resource sequence itself, but the class-level relational knowledge encoded in the resource modality. The appropriate role of the resource modality is therefore not as an auxiliary runtime input, but as a source of privileged supervision during training. How effectively this knowledge is inherited by the student, and how it affects decision-boundary calibration under long-term drift, are quantitatively analyzed in §[5.7](https://arxiv.org/html/2606.17462#S5.SS7).

## 4. ResAware Overview and Design

Figure 4. Overview of the ResAware framework. Offline training first trains a resource-only teacher, then distills its knowledge into a traffic-only student; all resource-side components are discarded before online deployment. A pipeline diagram of ResAware divided into two phases: offline training with resource extraction, resource teacher training, and cross-modal knowledge distillation to the traffic-only student; and online inference where only the traffic student is deployed with no resource access.

ResAware is a training-time privileged knowledge distillation framework for cross-environment WF. Its core idea is straightforward: during the offline construction phase, a resource-only teacher model is trained on resource loading sequences collected under controlled conditions, and the resulting resource-side knowledge is transferred to a traffic-only student model through heterogeneous cross-modal distillation.
In this way, ResAware introduces stable resource-side supervision into the student model without requiring resource access at inference time. The framework is designed under three constraints:

- Privileged Information Isolation. High-fidelity resource features are available only during offline training and remain inaccessible during online inference.
- Zero Online Overhead and Interface Compatibility. At deployment time, the framework must preserve the standard traffic-only observation interface of a passive WF attacker, without introducing additional online cost.
- Plug-and-Play Integration. ResAware makes no assumptions about the underlying WF backbone. It can be instantiated on top of any existing WF model following a standard end-to-end classification pipeline.

### 4.1. Design Principle: Privileged Resource Distillation

As illustrated in Figure [4](https://arxiv.org/html/2606.17462#S4.F4), ResAware operates in two strictly separated phases: offline training and online inference. During offline training, the framework has access to paired samples $(x, x^{*}, y)$, where $x$ denotes the encrypted traffic trace of a single page load, $x^{*}$ denotes the corresponding resource loading sequence extracted under controlled conditions, and $y$ is the ground-truth website label.

The distillation pipeline proceeds in three steps. First, a resource-only teacher model is trained in the offline source environment to capture stable resource-side patterns that are less affected by network noise. Second, the teacher model is frozen, and its soft target outputs are used to supervise a traffic-only student model, transferring resource-side relational knowledge into the traffic feature space. Third, all resource-side components are discarded before deployment. The complete training and deployment procedure is summarized in Algorithm [1](https://arxiv.org/html/2606.17462#algorithm1) (Appendix [C](https://arxiv.org/html/2606.17462#A3)).
### 4.2. Training the Resource-Only Teacher

In the controlled source environment, ResAware constructs page-load-level paired triplets $(x, x^{*}, y)$, where $x$ denotes the encrypted traffic trace, $x^{*}$ denotes the corresponding resource loading sequence, and $y$ is the website label. This requires only page-load-level correspondence between $x$ and $x^{*}$, avoiding any need for fragile packet-to-object reconstruction.

To improve robustness under cross-environment extraction, we order resource events by request initiation time, as triggered by the browser engine, rather than by response completion order. Initiation order more directly reflects the parsing progress of the root document and the triggering logic of resource dependencies, whereas completion order is far more sensitive to network latency, congestion control, and HTTP multiplexing. Using initiation order therefore decouples the resource sequence from transport-side timing variation.

To convert a variable-length resource loading sequence into a fixed-size model input, we represent each page load as a sequence of $N$ resource events:

$$
Z = \{(c_i, \tilde{s}_i)\}_{i=1}^{N} \tag{3}
$$

where $c_i$ is the resource category ID and $\tilde{s}_i$ is the log-scaled payload size. Unless otherwise specified, we set $N = 200$ for truncation and padding. The sequence captures two feature channels:

- Categorical Channel. Based on the Content-Type field in HTTP response headers, each resource is mapped into one of nine categories: HTML, Tiny Image, Regular Image, CSS, JS, Font, JSON/API, Document, and Unknown. Image resources are further divided by payload size: those smaller than 5 KB are classified as Tiny Image, while those of 5 KB or larger are classified as Regular Image. This taxonomy captures the major resource types that commonly appear in modern web pages.
- Size Channel. We use the byte size of each resource as a continuous feature. To reduce the influence of large resources on optimization, the size of the $i$-th resource is log-scaled to $\tilde{s}_i$ before being fed into the model.

We deliberately discard absolute request timestamps and preserve temporal structure only through event order, encoded by positional embeddings. This design prevents the teacher from overfitting to source-environment-specific latency patterns. The resulting fixed-length sequence is then used to train a teacher model $T(\cdot)$ based on a Transformer encoder (Vaswani et al., [2017](https://arxiv.org/html/2606.17462#bib.bib69)), whose parameters are frozen after supervised training on source-domain hard labels.

### 4.3. Distilling Resource Knowledge into a Traffic-Only Student

During cross-modal knowledge distillation, the student model $S(\cdot)$ receives only encrypted traffic as input. ResAware is agnostic to the student’s input representation and can be instantiated on top of any compatible WF backbone. The original input format, whether packet length sequences, burst sequences, or Traffic Aggregation Matrices (TAM), remains unchanged. At each forward pass, the frozen teacher model produces logits $z_T = T(x^{*})$, while the student model produces logits $z_S = S(x)$. The student parameters $\theta_S$ are optimized using two loss terms:

Classification Loss. To preserve the student’s discriminative accuracy on the traffic modality, we compute the cross-entropy loss against ground-truth labels:

$$
\mathcal{L}_{cls} = -\sum_{c=1}^{C} y_c \log\left(\frac{\exp(z_{S,c})}{\sum_{j=1}^{C}\exp(z_{S,j})}\right) \tag{4}
$$

where $C$ is the number of monitored websites.
Resource-Privileged Distillation Loss. To mitigate shortcut learning under ERM, we transfer the teacher’s soft knowledge via KL divergence with temperature $\tau$:

$$
\mathcal{L}_{kd} = \tau^{2} \cdot \mathcal{D}_{KL}\left(\sigma\left(\frac{z_T}{\tau}\right) \parallel \sigma\left(\frac{z_S}{\tau}\right)\right) \tag{5}
$$

where $\sigma(\cdot)$ is the Softmax function. The temperature $\tau$ flattens the posterior distribution, amplifying inter-class similarity signals beyond the target class. Minimizing this loss guides the student to internalize the inter-class relationships encoded by the resource modality, acting as a form of semantic regularization.

Joint Objective. The student’s total training objective is a weighted combination:

$$
\mathcal{L}_{total} = (1-\alpha)\mathcal{L}_{cls} + \alpha\mathcal{L}_{kd} \tag{6}
$$

where $\alpha \in [0,1]$ controls the trade-off between the classification objective and the privileged distillation objective. At the boundary $\alpha = 0$, $\mathcal{L}_{total}$ reduces to $\mathcal{L}_{cls}$, and the student degenerates into a standard traffic-only classifier trained under ordinary ERM. In practice, the optimal $\alpha$ is primarily determined by the student backbone and exhibits relatively low sensitivity to the specific training and testing datasets; we analyze its effect in the ablation study (§[F](https://arxiv.org/html/2606.17462#A6)). Mechanistically, $\mathcal{L}_{cls}$ preserves the student’s ability to discriminate ground-truth website labels, while $\mathcal{L}_{kd}$ answers: “which websites share similar resource loading structures?” Together, they prevent the student from relying solely on one-hot supervision, effectively suppressing overfitting to transient, environment-specific traffic patterns.
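The objective in Eqs. (4) to (6) can be sketched for a single sample in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the logits are toy values, the temperature $\tau = 4$ is a hypothetical choice (this section does not state the value used), and the function names are invented for the example. A real training loop would evaluate the same quantities on batches inside a deep learning framework.

```python
import math


def log_softmax(logits, tau=1.0):
    """Log of the temperature-scaled softmax, computed stably via max shift."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_norm for z in scaled]


def cls_loss(z_s, label):
    """Eq. (4): cross-entropy of the student logits against a one-hot label."""
    return -log_softmax(z_s)[label]


def kd_loss(z_t, z_s, tau):
    """Eq. (5): tau^2 * KL(softmax(z_T / tau) || softmax(z_S / tau))."""
    log_p = log_softmax(z_t, tau)  # teacher soft targets (log domain)
    log_q = log_softmax(z_s, tau)  # student soft predictions (log domain)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return tau * tau * kl


def total_loss(z_t, z_s, label, alpha, tau):
    """Eq. (6): (1 - alpha) * L_cls + alpha * L_kd."""
    return (1.0 - alpha) * cls_loss(z_s, label) + alpha * kd_loss(z_t, z_s, tau)


# Toy logits over three sites; the true site has index 0.
teacher_logits = [2.0, 1.0, 0.1]  # z_T = T(x*), from the frozen teacher
student_logits = [1.5, 0.2, 0.3]  # z_S = S(x), from the traffic-only student
loss = total_loss(teacher_logits, student_logits, label=0, alpha=0.7, tau=4.0)
```

Setting `alpha=0.0` recovers the plain cross-entropy objective, which matches the boundary case described above. The teacher logits enter only through `kd_loss`, so nothing from the resource side is needed once training ends.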
### 4.4. Online Inference and Deployment

After distillation, ResAware retains only the student model $S$ for deployment. All training-specific components, including the resource parser, the teacher model, and the distillation objective, are removed before deployment. At inference time, the deployed model takes a single encrypted traffic trace as input, without expanding the online attack surface. All additional computation introduced by ResAware is confined to the offline training phase. As a result, the deployed model has the same inference latency and memory footprint as the underlying baseline, making ResAware a plug-and-play enhancement with zero additional online overhead.

## 5. Evaluation

This section evaluates the effectiveness, generality, and underlying mechanisms of ResAware under diverse cross-environment WF settings.

### 5.1. Experimental Setup

Datasets and Evaluation Protocols. Since existing public WF datasets lack application-layer resource events synchronized with traffic traces, we collect a large-scale evaluation dataset of paired traffic-resource samples. Each sample is represented as $(x,x^{*},y)$, where $x$ is the encrypted traffic trace, $x^{*}$ is the privileged resource sequence (accessible only at training time), and $y$ is the website label. Detailed resource sequence construction procedures are provided in Appendix [D.1](https://arxiv.org/html/2606.17462#A4.SS1). Data collection spanned November 2025 to April 2026 across six geographically distributed vantage points (US, Japan, Singapore, South Africa, Australia, and Germany). The monitored set comprises 100 stable websites randomly sampled from the Tranco Top 100K (Pochat et al., [2019](https://arxiv.org/html/2606.17462#bib.bib63)). The unmonitored set consists of 83,645 reachable background websites excluding the monitored set.
In total, we collected over 160,000 page-load traces (collection pipelines and distribution statistics are detailed in Appendix [D.2](https://arxiv.org/html/2606.17462#A4.SS2)). In the source domain, each monitored site contributes 150 traces for model training; in each target-domain test set, each monitored site contributes 25–30 traces per snapshot. For open-world evaluation, the background split contains 1 trace per unmonitored site. In all cross-environment experiments, target-domain samples are strictly excluded from model training, distillation hyperparameter selection, and threshold tuning. Distillation hyperparameters are tuned once on the source-domain validation set and fixed thereafter. All experiments are run five times with different random seeds; we report the mean performance.

Table 1. Summary of backbone models, input feature representations, and the source-validated distillation weight $\alpha$.

| Backbone | Input Features | $\alpha$ |
| --- | --- | --- |
| AWF (Rimmer et al., [2018](https://arxiv.org/html/2606.17462#bib.bib24)) | Packet direction sequence | 0.1 |
| DF (Sirinam et al., [2018](https://arxiv.org/html/2606.17462#bib.bib58)) | Packet direction sequence | 0.5 |
| RF (Wang et al., [2014](https://arxiv.org/html/2606.17462#bib.bib64)) | Traffic aggregated features | 0.5 |
| Var-CNN (Bhat et al., [2019](https://arxiv.org/html/2606.17462#bib.bib68)) | Packet direction sequence | 0.7 |
| Tik-Tok (Rahman et al., [2020](https://arxiv.org/html/2606.17462#bib.bib19)) | Packet direction and timestamp sequence | 0.5 |
| CountMamba (Deng et al., [2025a](https://arxiv.org/html/2606.17462#bib.bib28)) | Packet direction, length, and timestamp sequence | 0.7 |

Table 2. Zero-shot closed-world F1-score with and without ResAware across four drift settings. $\Delta$ denotes the absolute gain in percentage points.

| Model | Temporal Drift (Day 150) w/o | Temporal w/ | Temporal $\Delta$ | Spatial Drift (Avg.) w/o | Spatial w/ | Spatial $\Delta$ | Proxy Drift (Avg.) w/o | Proxy w/ | Proxy $\Delta$ | Browser Drift (Avg.) w/o | Browser w/ | Browser $\Delta$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AWF | 33.25 | 32.25 | -1.00 | 49.23 | 48.76 | -0.47 | 17.53 | 18.03 | +0.50 | 5.91 | 6.06 | +0.15 |
| CountMamba | 28.94 | 29.16 | +0.22 | 72.91 | 76.03 | +3.12 | 61.21 | 62.50 | +1.29 | 7.11 | 9.50 | +2.39 |
| RF | 36.64 | 38.27 | +1.63 | 76.11 | 78.61 | +2.50 | 62.86 | 66.74 | +3.88 | 18.15 | 22.83 | +4.68 |
| Tik-Tok | 54.64 | 57.67 | +3.03 | 82.85 | 85.10 | +2.25 | 44.52 | 44.88 | +0.36 | 4.79 | 6.05 | +1.26 |
| DF | 61.39 | 65.79 | +4.40 | 84.71 | 86.64 | +1.93 | 48.32 | 47.28 | -1.04 | 4.07 | 6.66 | +2.59 |
| Var-CNN | 72.77 | 81.49 | +8.72 | 82.66 | 86.96 | +4.30 | 38.14 | 47.10 | +8.96 | 17.24 | 21.45 | +4.21 |

We design five evaluation scenarios to cover realistic deployment shifts:

- Temporal Drift. Models are trained on the source domain and tested on temporal snapshots collected at $\sim$30-day intervals to evaluate cross-time generalization.
- Spatial Drift. The test set consists of samples from five geographic locations within the same time window, assessing generalization across diverse network paths and CDN deployments.
- Obfuscation Proxy Drift. The test set covers six obfuscation proxy protocols (Shadowsocks (shadowsocks.org, [2016](https://arxiv.org/html/2606.17462#bib.bib86)), Trojan (trojan-gfw, [2023](https://arxiv.org/html/2606.17462#bib.bib87)), VLESS-XTLS-Vision, VMess-WS-TLS, VMess-TLS, and VMess (Project X Community, [2020](https://arxiv.org/html/2606.17462#bib.bib88))) to evaluate resilience against transport-layer obfuscation.
- Browser Drift. Chrome serves as the source domain for training, and Edge and Firefox serve as target domains, to evaluate robustness against variations in rendering engines and connection management.
- Open-World Temporal Drift. Using temporal snapshots, we mix 100 monitored classes (8 samples each) with 80,000 unmonitored classes (1 sample each) at a 1:100 ratio. This scenario evaluates detection capability under strict false-positive constraints (TPR@1%FPR).

Backbone Models. To cover diverse input representations and modeling paradigms, we select six representative WF architectures as student backbones.
Table [1](https://arxiv.org/html/2606.17462#S5.T1) lists the input features and validated $\alpha$ values for each model; the optimal $\alpha$ is coupled to the model architecture, a relationship we analyze in §[F](https://arxiv.org/html/2606.17462#A6). For each baseline, we compare its native version against its ResAware-distilled counterpart.

Teacher Model and Reporting Role. Unless otherwise specified, “Teacher” refers exclusively to the resource-only teacher defined in §[4.2](https://arxiv.org/html/2606.17462#S4.SS2). This Transformer-based model is trained on the same source-domain monitored websites as the student models, using only resource sequences. Because it relies on a privileged modality unavailable to the online attacker, its predictions are reported only as an oracle-style diagnostic reference for resource-side stability; they are neither a deployable baseline nor a formal upper bound for traffic-side student models.

Training Protocol and Hyperparameter Fairness. To ensure that performance gains are attributable to ResAware rather than additional hyperparameter tuning, we follow the architectures, optimizers, learning rate schedules, batch sizes, and training epochs from the original papers or official implementations for each backbone. For the native and ResAware-distilled versions of the same baseline, all training configurations are kept identical except for the distillation mechanism itself. ResAware-specific hyperparameters (temperature and distillation weight $\alpha$) are tuned once on the source-domain validation set and fixed for all subsequent experiments; they are never re-tuned for target environments.

Metrics. For closed-world tasks, we adopt the F1-score as the primary evaluation metric, supplemented by Precision and Recall. For open-world tasks, given the attacker’s sensitivity to false alarms, we prioritize True Positive Rate at a fixed False Positive Rate (TPR@1%FPR) as the primary metric.
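To make the open-world metric concrete, the sketch below computes TPR at a fixed FPR budget from two lists of detector scores. It assumes each trace has already been reduced to a single "monitored-ness" score (for example, the maximum class probability); that scoring rule, the function name, and the strict-threshold convention are illustrative assumptions, since the section does not specify how the operating point is selected.

```python
def tpr_at_fpr(monitored_scores, unmonitored_scores, max_fpr=0.01):
    """Return the TPR at the loosest threshold whose FPR stays <= max_fpr.

    A trace is flagged as monitored when its score is strictly above the
    threshold. Both input lists are assumed to be non-empty.
    """
    neg = sorted(unmonitored_scores, reverse=True)
    # Number of unmonitored traces we are allowed to flag by mistake.
    allowed_fp = int(max_fpr * len(neg) + 1e-9)
    # Using the (allowed_fp + 1)-th highest unmonitored score as a strict
    # threshold leaves at most allowed_fp unmonitored traces above it.
    threshold = neg[allowed_fp] if allowed_fp < len(neg) else float("-inf")
    true_positives = sum(1 for s in monitored_scores if s > threshold)
    return true_positives / len(monitored_scores)

# Toy scores (hypothetical, not data from the paper).
unmonitored = [i / 1000 for i in range(1000)]   # 0.000 ... 0.999
monitored = [0.995, 0.99, 0.9, 0.5]
tpr = tpr_at_fpr(monitored, unmonitored, max_fpr=0.01)
```

Because the threshold is derived only from unmonitored scores, the false-positive budget is respected regardless of how many monitored traces are present.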
Implementation Details. We implement ResAware in Python 3.12 with PyTorch 2.10.0. All training, distillation, and inference experiments run on a single workstation (Ubuntu 24.04 LTS) equipped with dual Intel Xeon Platinum 8352S CPUs, 128 GB RAM, and an NVIDIA RTX 4090 GPU (24 GB VRAM). Unless otherwise noted, all backbone implementations use the same random seeds to ensure reproducibility.

### 5.2. Zero-Shot Robustness under Cross-Environment Drift

We evaluate whether ResAware improves the zero-shot robustness of WF models under four cross-environment drift scenarios, where models are trained on the source domain with no access to target-domain samples.

Overall Results. Table [2](https://arxiv.org/html/2606.17462#S5.T2) summarizes the zero-shot closed-world F1-scores across six backbone models (see Appendix [E](https://arxiv.org/html/2606.17462#A5) for per-environment breakdowns). ResAware yields positive gains in 21 of 24 backbone $\times$ drift combinations (87.5%).

Temporal Drift. Under a 150-day drift, ResAware improves Var-CNN from 72.77% to 81.49% (+8.72%), with additional gains of +4.40% for DF and +3.03% for Tik-Tok. The temporal decay curves in Figure [5](https://arxiv.org/html/2606.17462#S5.F5) show that the performance gap generally widens over longer intervals for the stronger sequential backbones, indicating that ResAware slows degradation under long-horizon drift rather than merely improving source-domain fit. Table [3](https://arxiv.org/html/2606.17462#S5.T3) further reports per-snapshot Precision, Recall, and F1 for all six backbones; the ResAware student (Var-CNN backbone) remains close to the Teacher across all five snapshots, dropping only from 93.95% at Day 30 to 81.49% at Day 150, while the vanilla Var-CNN baseline falls to 72.77%.
Notably, the Teacher model maintains high accuracy after 150 days, confirming that page-level resource organization is substantially more stable over time than packet morphology. ResAware exploits this asymmetry to regularize the student’s decision boundaries.

Table 3. Precision, Recall, and F1-score (%) under temporal drift for all six backbones and the resource-only teacher across five test snapshots (Day 30–150).

| Model | Day 30 P | Day 30 R | Day 30 F1 | Day 60 P | Day 60 R | Day 60 F1 | Day 90 P | Day 90 R | Day 90 F1 | Day 120 P | Day 120 R | Day 120 F1 | Day 150 P | Day 150 R | Day 150 F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Teacher | 97.35 | 96.77 | 96.49 | 95.55 | 94.77 | 94.44 | 89.74 | 90.93 | 89.27 | 89.61 | 89.57 | 87.61 | 90.49 | 98.80 | 88.97 |
| AWF | 55.89 | 54.05 | 51.23 | 52.78 | 51.64 | 48.18 | 43.72 | 44.30 | 40.14 | 41.86 | 42.94 | 38.20 | 35.77 | 37.95 | 33.25 |
| CountMamba | 88.39 | 89.31 | 87.79 | 44.73 | 36.71 | 35.26 | 38.19 | 33.39 | 31.79 | 36.35 | 31.93 | 29.79 | 34.73 | 31.36 | 28.94 |
| Tik-Tok | 90.30 | 90.15 | 89.07 | 80.45 | 80.97 | 78.63 | 69.46 | 69.51 | 66.57 | 63.24 | 62.38 | 59.34 | 58.40 | 57.83 | 54.64 |
| RF | 91.25 | 91.52 | 90.46 | 61.96 | 47.12 | 45.88 | 52.24 | 41.82 | 39.84 | 49.83 | 38.84 | 36.53 | 47.84 | 39.47 | 36.64 |
| DF | 91.99 | 92.08 | 91.28 | 88.17 | 88.33 | 86.75 | 74.82 | 76.90 | 73.60 | 68.60 | 71.65 | 67.35 | 62.08 | 66.19 | 61.39 |
| Var-CNN | 92.90 | 92.52 | 91.68 | 90.30 | 89.56 | 88.63 | 83.57 | 82.03 | 80.37 | 82.33 | 81.62 | 79.44 | 74.51 | 76.22 | 72.77 |
| ResAware | 94.11 | 94.66 | 93.95 | 94.22 | 94.00 | 93.38 | 87.93 | 87.81 | 86.37 | 89.46 | 89.19 | 87.60 | 82.57 | 84.28 | 81.49 |

Figure 5. Closed-world F1-score (%) with and without ResAware over five temporal test snapshots (Day 30–150) for WF backbones.

Spatial Drift. Across the five international vantage points, ResAware improves five of the six backbones on average. The largest gains appear for Var-CNN (+4.30%, 82.66% $\to$ 86.96%), followed by CountMamba (+3.12%) and RF (+2.50%), while DF and Tik-Tok also improve by +1.93% and +2.25%, respectively. AWF is the only exception, showing a slight average drop (-0.47%), which indicates that spatial drift is generally mild enough for resource supervision to help, but low-capacity students may still fail to absorb the transferred topology consistently.
Obfuscated Proxy Drift. Obfuscation proxy protocols heavily distort packet-level morphology while leaving page resource structure largely intact. The largest improvement is observed for Var-CNN, whose F1-score increases from 38.14% to 47.10% (+8.96%), followed by RF (+3.88%) and CountMamba (+1.29%). AWF and Tik-Tok obtain only marginal gains (+0.50% and +0.36%), while DF is the only backbone with a slight degradation (-1.04%). This exception indicates that, under severe proxy-induced deformation, cross-modal distillation is not uniformly beneficial across architectures; its effectiveness still depends on the student’s ability to align traffic representations with the resource-level topology transferred by the teacher.

Cross-Browser Drift. Browser drift is the most challenging of the four scenarios: under the vanilla setting, four of the six backbones remain below 10% average F1-score, with only RF (18.15%) and Var-CNN (17.24%) retaining limited discriminative power. ResAware still yields consistent gains for all six backbones, led by RF (+4.68%), Var-CNN (+4.21%), and DF (+2.59%), although the absolute performance remains far below that under temporal and spatial drift. This result suggests that browser switching perturbs how application resources are rendered, scheduled, and multiplexed into observable traffic, making cross-modal topology transfer substantially harder than in the other drift settings.

Takeaways. The zero-shot results support two conclusions: (1) Resource supervision during training improves most traffic-only WF models, with the clearest and most stable benefits under long-term temporal drift.
(2) ResAware is not an unconditional enhancer: its effectiveness depends on whether resource sequences remain predictive of observable traffic morphology and whether the student has sufficient capacity to absorb the teacher’s inter-class topology; when browser execution or proxy encapsulation disrupts this correspondence, or when the student cannot accommodate the distillation constraint, gains may become limited or turn into negative transfer (§[5.6](https://arxiv.org/html/2606.17462#S5.SS6)).

### 5.3. Open-World Detection under Temporal Drift

Table 4. Open-world temporal drift results (TPR@1%FPR) for DF, Tik-Tok, and Var-CNN with and without ResAware across five temporal snapshots (100 monitored sites vs. 80K unmonitored, 1:100 ratio). $\Delta$ denotes the absolute gain in percentage points.

| Model | Type | Day 30 | Day 60 | Day 90 | Day 120 | Day 150 |
| --- | --- | --- | --- | --- | --- | --- |
| DF | w/o | 50.68 | 39.52 | 28.45 | 24.87 | 20.40 |
| DF | w/ | 54.23 | 40.70 | 28.63 | 26.00 | 21.02 |
| DF | $\Delta$ | +3.55 | +1.18 | +0.18 | +1.13 | +0.62 |
| Tik-Tok | w/o | 27.50 | 8.60 | 6.73 | 5.50 | 4.52 |
| Tik-Tok | w/ | 50.15 | 22.63 | 16.43 | 12.93 | 10.17 |
| Tik-Tok | $\Delta$ | +22.65 | +14.03 | +9.70 | +7.43 | +5.65 |
| Var-CNN | w/o | 48.75 | 35.70 | 27.57 | 24.92 | 22.40 |
| Var-CNN | w/ | 55.07 | 41.05 | 30.43 | 28.85 | 27.20 |
| Var-CNN | $\Delta$ | +6.32 | +5.35 | +2.86 | +3.93 | +4.80 |

Closed-world performance alone is insufficient to assess the practical threat of WF attacks; we therefore evaluate ResAware under the open-world temporal drift setting. We focus on the three strongest backbones in this regime, namely DF, Tik-Tok, and Var-CNN, and report TPR at a stringent operating point of 1% FPR under the 1:100 monitored-to-unmonitored imbalance. Table [4](https://arxiv.org/html/2606.17462#S5.T4) shows that ResAware improves TPR@1%FPR for all three backbones across the full 150-day window. The gains are most pronounced for Tik-Tok, where TPR rises by +22.65%, +14.03%, and +9.70% over the first three snapshots, and remains +5.65% higher even at Day 150.
Var-CNN exhibits consistently positive improvements (+2.86% to +6.32%), while DF remains comparatively robust and still benefits from modest gains (+0.18% to +3.55%). Even when the baseline detector degrades substantially under long-term drift, resource supervision preserves meaningful monitored-site detection capability at a strict false-positive budget.

Takeaway. The closed-world robustness gains from ResAware carry over to the more operationally relevant open-world setting: training-time resource supervision improves low-FPR monitored-site detection under temporal aging and severe class imbalance, with the largest benefits appearing in backbones whose traffic-side decision boundaries are otherwise most vulnerable to long-term drift.

### 5.4. Target-Domain Data Efficiency: Few-Shot and Zero-Label Adaptation

Figure 6. Few-shot adaptation F1-score (%) of Var-CNN with and without ResAware under Shadowsocks, Trojan, VLESS-XTLS-Vision and VMess-WS-TLS proxy drifts.

In practice, WF attackers may occasionally obtain limited target-domain information after initial deployment. Such information takes two forms: a stronger but more costly variant in which the attacker acquires a small number of labeled target traces via controlled probing or repeated visits (Sirinam et al., [2019](https://arxiv.org/html/2606.17462#bib.bib29); Chen et al., [2021](https://arxiv.org/html/2606.17462#bib.bib89)), and a weaker but more scalable variant in which the attacker observes unlabeled target traffic for Test-Time Adaptation (TTA) (Wang et al., [2021](https://arxiv.org/html/2606.17462#bib.bib90); Li et al., [2020](https://arxiv.org/html/2606.17462#bib.bib91)). Using Var-CNN as the backbone, we evaluate whether ResAware reduces the target-domain data required to recover performance under distribution shift.
Few-Shot Adaptation with Labeled Target Samples. Following the standard few-shot evaluation protocol adopted in prior WF and domain adaptation works (Sirinam et al., [2019](https://arxiv.org/html/2606.17462#bib.bib29); Chen et al., [2021](https://arxiv.org/html/2606.17462#bib.bib89)), we freeze the backbone and update only the linear classifier with $K$ labeled target-domain traces. Figure [6](https://arxiv.org/html/2606.17462#S5.F6) shows that ResAware provides the largest benefits in the low-shot regime. Under the Trojan proxy, ResAware achieves 88.33% with just 1 shot, whereas vanilla Var-CNN reaches only 77.78%. Under the more disruptive VMess-WS-TLS setting, ResAware with 5 shots matches the performance of vanilla Var-CNN with 10 shots (94.50% vs. 94.11%), effectively halving the label requirement.

Unlabeled Target-Domain Adaptation. We further assess ResAware’s compatibility with Proteus (Deng et al., [2026](https://arxiv.org/html/2606.17462#bib.bib14)), a state-of-the-art unlabeled adaptation framework, under six obfuscation proxy drift settings (Table [5](https://arxiv.org/html/2606.17462#S5.T5)). ResAware proves to be highly complementary to adaptation techniques. Proteus raises the average F1-score of vanilla Var-CNN from 38.79% to 54.89%, while ResAware + Proteus reaches 69.14%. This complementarity reflects their distinct operational stages: ResAware focuses on environment-agnostic representation learning in the source domain, whereas Proteus facilitates target-domain calibration. Consequently, ResAware serves as a stronger feature initializer rather than a substitute for test-time adaptation.

Takeaways. ResAware improves the efficiency of both labeled and unlabeled target-domain adaptation: it reduces the labeled sample requirement for few-shot adaptation and provides a more robust feature initialization for unlabeled adaptation.
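The labeled few-shot protocol described above (frozen backbone, trainable linear classifier) can be sketched without any deep-learning framework. In the sketch below, the feature vectors stand in for the outputs of a frozen backbone, and the optimizer (full-batch gradient descent), learning rate, and epoch count are assumptions made for illustration; the section does not state how the linear head is actually optimized.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def head_logits(W, b, x):
    """Linear head: one logit per class from a frozen feature vector x."""
    return [sum(w * xi for w, xi in zip(W[c], x)) + b[c] for c in range(len(b))]

def fit_linear_head(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit only the linear head on K-shot features; the backbone is untouched."""
    dim = len(features[0])
    n = len(features)
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        grad_W = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(head_logits(W, b, x))
            for c in range(num_classes):
                # Gradient of cross-entropy with respect to logit c.
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err / n
                for j in range(dim):
                    grad_W[c][j] += err * x[j] / n
        for c in range(num_classes):
            b[c] -= lr * grad_b[c]
            for j in range(dim):
                W[c][j] -= lr * grad_W[c][j]
    return W, b

def predict(W, b, x):
    """Return the index of the highest-scoring class."""
    logits = head_logits(W, b, x)
    return max(range(len(logits)), key=lambda c: logits[c])

# Hypothetical frozen-backbone features: K = 2 labeled target traces per class.
shots = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
shot_labels = [0, 0, 1, 1]
W, b = fit_linear_head(shots, shot_labels, num_classes=2)
```

Only `W` and `b` change during adaptation, so the quality of the frozen features determines how few labeled traces are needed.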
Table 5. Closed-world F1-score (%) under obfuscation proxy drift for Var-CNN across six obfuscation protocols, under four configurations of ResAware and Proteus (Deng et al., [2026](https://arxiv.org/html/2606.17462#bib.bib14)). ResAware operates at training time (source-side); Proteus operates at inference time (target-side); their combination consistently outperforms either component alone.

| Protocol | w/o ResAware, w/o Proteus | w/o ResAware, w/ Proteus | w/ ResAware, w/o Proteus | w/ ResAware, w/ Proteus |
| --- | --- | --- | --- | --- |
| Shadowsocks | 40.87% | 56.43% | 49.83% | 74.70% |
| Trojan | 43.07% | 60.72% | 52.96% | 82.32% |
| VLESS-XTLS-Vision | 29.75% | 34.47% | 34.98% | 39.86% |
| VMess | 46.28% | 64.87% | 55.58% | 85.04% |
| VMess-TLS | 41.88% | 62.96% | 51.33% | 75.25% |
| VMess-WS-TLS | 30.90% | 49.87% | 34.35% | 57.66% |
| AVG | 38.79% | 54.89% | 46.51% | 69.14% |

### 5.5. Ablation Analysis: What Makes ResAware Work?

Prior experiments establish that ResAware improves the robustness of traffic-only WF models across most distribution shifts. We now investigate the sources of these gains through two ablation studies. First, we verify whether the improvements stem from correctly aligned privileged resource supervision or merely from the regularization effect of soft-label distillation. Second, we ablate individual resource channels (size, category, and order) to quantify each channel’s contribution. The sensitivity analysis of the distillation weight $\alpha$ is provided in Appendix [F](https://arxiv.org/html/2606.17462#A6).

Table 6. Ablation study verifying the necessity of correctly aligned privileged supervision under 150-day temporal drift (F1-score (%)). Three conditions are compared: ResAware with a resource teacher, KD with a traffic teacher, and KD with class-shuffled resource soft labels.

| Model | w/ Resource KD (Ours) | Baseline | w/ Traffic KD | w/ Class-Shuffled Resource KD |
| --- | --- | --- | --- | --- |
| Teacher | 88.97% | - | 77.15% | - |
| AWF | 32.25% | 33.25% | 31.02% | 29.44% |
| DF | 65.79% | 61.39% | 55.31% | 62.58% |
| RF | 38.27% | 36.64% | 30.98% | 37.71% |
| Tik-Tok | 57.67% | 54.64% | 40.61% | 54.52% |
| Var-CNN | 81.49% | 72.77% | 51.48% | 74.92% |
| CountMamba | 29.16% | 28.94% | 24.36% | 28.11% |

Privileged Resource KD vs. Generic KD. We design two control conditions to isolate the source of gains: Traffic KD replaces the teacher’s input with traffic burst features (retaining the distillation pipeline but removing the resource modality), and Class-Shuffled Resource KD preserves the numerical distribution of the resource soft labels but randomly permutes the class assignments (to test whether gains arise solely from soft-label regularization). As shown in Table [6](https://arxiv.org/html/2606.17462#S5.T6), both control conditions score below ResAware for every backbone. Traffic KD falls below the baseline for all six backbones: it not only fails to improve robustness but exacerbates degradation, indicating that a same-modality teacher reinforces the student’s reliance on source-domain-specific spurious correlations. Class-Shuffled Resource KD stays close to the baseline, ranging from 3.81 points below it (AWF) to 2.15 points above it (Var-CNN), which rules out soft-label regularization as the primary driver of gains. These results confirm that the student inherits the correct inter-class topology from the resource teacher, not merely the numerical smoothing of its soft labels.

Figure 7. Per-channel ablation F1-score (%) for the resource-only teacher and ResAware Var-CNN student across temporal drift (Days 30–150).

Contribution of Resource Channels. We evaluate the teacher and Var-CNN student under temporal drift by selectively ablating resource size (no size), type (no type), or request order (no order) to quantify each channel’s contribution.
As shown in Figure [7](https://arxiv.org/html/2606.17462#S5.F7), resource size is the strongest discriminative signal: removing it drops the teacher’s 150-day F1-score from 88.97% to 16.16%, while the student drops from 81.49% to 77.34%. Resource type and order are nonetheless non-redundant: ablating type incurs a 6.65% F1-score drop for the teacher under 150-day drift, and shuffling order causes a 14.76% drop. Even when distilling from only partial resource channels (size and type), the student’s cross-environment robustness remains above the traffic-only baseline. All resource channels thus encode transferable, stable structural information.

Sensitivity of Distillation Weight $\alpha$. The weight $\alpha\in[0,1]$ balances hard-label classification and resource-topology supervision in $\mathcal{L}_{total}$. Across backbone and capacity sweeps (Figure [9](https://arxiv.org/html/2606.17462#A6.F9) and Table [9](https://arxiv.org/html/2606.17462#A6.T9)), the near-optimal range is primarily capacity-dependent and stable across training and testing datasets; we therefore tune $\alpha$ once per backbone on the source-domain validation set and fix it for all target environments, with the full analysis deferred to Appendix [F](https://arxiv.org/html/2606.17462#A6).

Takeaways. The ablation studies confirm three points: (1) Gains stem from correctly aligned privileged resource knowledge, not from the distillation mechanism or soft-label regularization alone; (2) Resource size is the strongest single channel, while type and order provide complementary structural constraints; (3) $\alpha$ is a capacity-matching parameter for each student backbone, calibrated once on the source-domain validation set (full analysis in Appendix [F](https://arxiv.org/html/2606.17462#A6)).

### 5.6. Applicability Analysis: When Does ResAware Fail?

The preceding experiments show that ResAware is not an unconditional plug-in enhancer.
Its effectiveness relies on two conditions. First, resource sequences must continue to encode stable website identity across the source and target domains. Second, the student must have sufficient capacity to compress the resource teacher’s inter-class topology into a traffic-only representation. When either condition is weakened, the distillation term may provide only limited benefit; when both are violated, it can induce negative transfer.

Failure from Broken Traffic-Resource Correspondence. The first failure mode arises when the correspondence between resource structure and observable traffic morphology is substantially disrupted. ResAware is most suitable for temporal and spatial drift, where the perturbation mainly affects the network-observation layer while the resource set, category sequence, and size distribution remain comparatively stable. In contrast, browser drift changes page-load scheduling, connection reuse, preloading behavior, and script execution order. Obfuscation proxy drift can also systematically rewrite the projection from resource events to traffic packet sequences through tunnel multiplexing, fragmentation, outer TLS encapsulation, or WebSocket framing. As a result, ResAware still yields relative gains under browser drift, but the absolute macro-F1 remains low. Obfuscation proxy drift also exhibits clear model dependence: Var-CNN and RF benefit, whereas DF shows slight negative transfer. These results indicate that once the target shift enters the browser execution layer or the protocol encapsulation layer, the teacher’s resource-side soft labels may become a mismatched constraint rather than a stable prior.

Failure from Insufficient Student Capacity. The second failure mode comes from limited student capacity. ResAware does not expose resource features to the student at inference time; instead, it asks a traffic-only student to fit both hard-label decision boundaries and the resource teacher’s soft topology.
The Var-CNN width-scaling experiment in Appendix [F](https://arxiv.org/html/2606.17462#A6) shows that smaller students have narrower best $\alpha$ ranges and lower gain ceilings. The full-width Var-CNN maintains 80.25%–82.22% macro-F1 within the best range of $\alpha=0.1$–$0.7$, reaching a maximum gain of 9.45 percentage points. In contrast, the $0.125\times$ Var-CNN has a best range of only $\alpha=0.1$–$0.3$, with a maximum gain of 4.60 percentage points. Thus, low-capacity students can still benefit from resource supervision, but they absorb only a weaker teacher constraint. An overly large $\alpha$ turns the KD term from structural regularization into an optimization burden.

Deployment Guidelines. Based on the above analysis, we derive three practical guidelines. First, when drift mainly occurs below the resource layer, such as temporal aging, geographic relocation, CDN routing changes, or link-state variation, ResAware is a suitable default training enhancement. When the drift involves browser execution or complex proxy encapsulation, it should be validated per scenario, with the $\alpha=0$ traffic-only baseline retained as a fallback. Second, $\alpha$ should be matched to student capacity and inductive bias: DF, Tik-Tok, and RF benefit most from moderate weights; Var-CNN and CountMamba benefit from medium-to-high weights; AWF should use smaller weights and be checked for negative transfer. Third, deployment should retain only the traffic-only student, with no resource parser or teacher model in the inference pipeline. If a small amount of target-domain traffic is available, ResAware is best used as a stronger source-domain initialization that can be combined with few-shot or unlabeled adaptation.

### 5.7. Mechanism Analysis: What Does the Student Inherit?

Figure 8. Calibration and confidence distributions of Var-CNN at Day 150 with and without ResAware.
The left panel shows the reliability diagram, while the middle and right panels show the confidence KDEs for correct and incorrect predictions, respectively.

This section investigates a core mechanistic question: what information does the resource teacher transfer to the traffic-only student through heterogeneous distillation? We find that the student inherits the resource-induced inter-class topology and acquires more stable decision boundaries.

ResAware Improves Model Robustness and Calibration. Figure [8](https://arxiv.org/html/2606.17462#S5.F8) characterizes the effect of ResAware on Var-CNN’s output distribution at Day 150. The reliability diagram (Minderer et al., [2021](https://arxiv.org/html/2606.17462#bib.bib72)) shows that the baseline exhibits severe overconfidence (ECE = 0.138), whereas ResAware reduces ECE to 0.034, a roughly fourfold reduction. The confidence KDE reveals a complementary pattern: for correct predictions, ResAware produces a sharper, more concentrated peak near 1.0, indicating higher decisiveness; for incorrect predictions, the baseline clusters errors near high-confidence regions, whereas ResAware shifts the error mass toward lower confidence. These results show that ResAware does not uniformly suppress confidence; instead, it achieves structural calibration: the student is more confident when correct and more conservative when wrong. The teacher thus imprints resource-side structural invariants onto the student, guiding it away from overfitting to transient traffic noise.

Table 7. Inter-class topology alignment between Var-CNN with and without ResAware and the resource teacher over 150 days. KL divergence ($\downarrow$) measures the distributional distance between student and teacher soft outputs; Spearman $\rho$ ($\uparrow$) measures the rank correlation of per-class similarity orderings.

| Days | KL to Teacher ($\downarrow$), w/o Res. | KL to Teacher ($\downarrow$), w/ Res. | Rel. Spearman $\rho$ ($\uparrow$), w/o Res. | Rel. Spearman $\rho$ ($\uparrow$), w/ Res. |
| --- | --- | --- | --- | --- |
| 30 | 2.5516 | 0.3838 | 0.0373 | 0.0806 |
| 60 | 2.4311 | 0.3522 | 0.0273 | 0.0775 |
| 90 | 2.2566 | 0.3261 | 0.0324 | 0.0860 |
| 120 | 2.1713 | 0.3140 | 0.0297 | 0.0962 |
| 150 | 2.1133 | 0.3029 | 0.0287 | 0.1022 |

The Student Aligns with the Teacher’s Soft Topology. To quantify how much of the teacher’s inter-class topology the student inherits, we track two metrics over the 150-day drift window. KL divergence measures the distributional distance between the student’s and teacher’s soft output distributions: a lower value indicates that the student assigns per-class probability mass similar to the teacher’s. Spearman $\rho$ measures the rank correlation between the student’s and teacher’s per-class similarity orderings: a higher value indicates that the student preserves the teacher’s relative inter-class structure (Huang et al., [2022](https://arxiv.org/html/2606.17462#bib.bib92)). As shown in Table [7](https://arxiv.org/html/2606.17462#S5.T7), Var-CNN without ResAware remains far from the teacher throughout the window, with KL divergence in the range of 2.1–2.6 and Spearman $\rho$ near zero (0.027–0.037). In contrast, Var-CNN with ResAware maintains substantially closer alignment: KL divergence stays within 0.30–0.38, and Spearman $\rho$ rises from 0.081 at Day 30 to 0.102 at Day 150 after a small dip at Day 60. These results confirm that the student inherits the resource teacher’s inter-class soft topology rather than simply mimicking hard-label predictions.

Takeaway. ResAware converts resource-side structural priors from the training phase into stable relational supervision signals. At inference, the student relies exclusively on encrypted traffic, yet its decision boundaries are regularized by the resource-induced topology, enabling both higher F1-score and better calibration under long-term drift.

## 6. Limitations and Future Work

ResAware assumes that resource structure retains adequate stability in the target environment.
For highly personalized pages, sites under frequent A/B testing, heavily ad-injected platforms, or dynamically generated templates, resource sizes, categories, and loading sequences may fluctuate substantially, weakening the structural priors provided by the teacher. Future work could explore more abstract resource representations (such as dependency graphs, initiator graphs, or rendering-stage topologies) to reduce sensitivity to specific object sizes.

We do not evaluate strong anonymity networks such as Tor. Tor's fixed-size cells, multiplexing, and congestion control further obscure the correspondence between application-layer resource sizes and observable packet sequences. Extending ResAware to such networks would likely require a resource teacher that de-emphasizes object sizes in favor of sequential or graph-based representations.

Finally, ResAware does not uniformly benefit all backbone–drift combinations. As shown in §[5.2](https://arxiv.org/html/2606.17462#S5.SS2) and §[5.6](https://arxiv.org/html/2606.17462#S5.SS6), severe browser or obfuscated proxy drift can break the traffic-resource correspondence, while low-capacity students may fail to absorb the teacher's inter-class topology, leading to diminished gains or negative transfer. Future work should develop capacity-aware weighting and drift-aware criteria for falling back to traffic-only training when resource supervision becomes unreliable.

## 7. Related Work

Website Fingerprinting Attacks. Deep learning-based WF models, including DF (Sirinam et al., [2018](https://arxiv.org/html/2606.17462#bib.bib58)), Var-CNN (Bhat et al., [2019](https://arxiv.org/html/2606.17462#bib.bib68)), Tik-Tok (Rahman et al., [2020](https://arxiv.org/html/2606.17462#bib.bib19)), and CountMamba (Deng et al., [2025a](https://arxiv.org/html/2606.17462#bib.bib28)), achieve high accuracy in closed-world IID settings by learning low-level packet statistics.
However, packet-level signals are jointly shaped by website content structure *and* transport-layer dynamics: TCP congestion control, HTTP/2 multiplexing, CDN routing, and browser scheduling all modulate observable traffic independently of the page being loaded. Consequently, even modest changes in network path, obfuscated proxy protocol, or browser engine suffice to invalidate learned patterns (Cherubin et al., [2022](https://arxiv.org/html/2606.17462#bib.bib13); Shusterman et al., [2026](https://arxiv.org/html/2606.17462#bib.bib21); Li et al., [2025](https://arxiv.org/html/2606.17462#bib.bib23); Deng et al., [2025b](https://arxiv.org/html/2606.17462#bib.bib1); Shadbeh et al., [2026](https://arxiv.org/html/2606.17462#bib.bib12)), reflecting a fundamental mismatch between the stability of application-layer content structure and the volatility of its encrypted traffic projection (Juarez et al., [2014](https://arxiv.org/html/2606.17462#bib.bib80)).

Improving Robustness within the Traffic-Only Threat Model. Existing efforts fall into two categories. The first improves traffic feature representations: Shen et al. (Shen et al., [2023](https://arxiv.org/html/2606.17462#bib.bib20)) use feature attribution and contrastive regularization to suppress defense-induced perturbations; Bahramali et al. (Bahramali et al., [2023](https://arxiv.org/html/2606.17462#bib.bib33)) simulate network-condition variation via trace augmentation; and Shen et al. (shen2025swallow) pursue transfer-robust training objectives. While each reduces sensitivity to a specific perturbation type, all operate within the encrypted-traffic domain; because the instability of these signals originates outside the traffic itself, any representation-level fix has limited reach. The second category introduces post-deployment target-domain observations.
Few-shot adaptation methods fine-tune the classifier with a small number of labeled target traces (Sirinam et al., [2019](https://arxiv.org/html/2606.17462#bib.bib29); Chen et al., [2021](https://arxiv.org/html/2606.17462#bib.bib89); Zou et al., [2022](https://arxiv.org/html/2606.17462#bib.bib6)); test-time adaptation methods operate on unlabeled target traffic via entropy minimization or distribution alignment (Zhang et al., [2023](https://arxiv.org/html/2606.17462#bib.bib74); Deng et al., [2026](https://arxiv.org/html/2606.17462#bib.bib14)). Both improve accuracy under drift but require post-deployment data collection and do not eliminate the root cause: the adapted model still anchors its decision boundaries to volatile traffic-domain signals. ResAware is orthogonal and complementary to these methods (§[5.4](https://arxiv.org/html/2606.17462#S5.SS4)): by providing a more stable source-domain initialization, it amplifies the benefit of subsequent adaptation rather than replacing it.

Resource-Aware Website Fingerprinting. A separate research thread exploits application-layer resource structure as a more stable website identifier. Li et al. (Li et al., [2023](https://arxiv.org/html/2606.17462#bib.bib84)) show that resource loading sequences exhibit substantially greater cross-environment stability than packet features. HOLMES & WATSON (Cheng et al., [2025b](https://arxiv.org/html/2606.17462#bib.bib15)) infers HTTP parallelism patterns directly from traffic as lightweight fingerprints; MRCGCN (Gao et al., [2025](https://arxiv.org/html/2606.17462#bib.bib26)) constructs multi-level resource dependency graphs; and STAR (Cheng et al., [2025a](https://arxiv.org/html/2606.17462#bib.bib11)) trains dual encoders to align traffic and resource representations for zero-shot cross-modal retrieval. These works confirm that resource-level signals offer a more stable encoding of website identity.
However, their deployment assumptions differ: HOLMES & WATSON infers structural signals from traffic alone and therefore needs no resource access at inference, but is bounded by what traffic can reveal about resource structure. MRCGCN and STAR directly incorporate resource graphs or embeddings that must be available at inference time, expanding the attacker's observational requirement beyond standard passive eavesdropping. ResAware takes a different position: resource information is used exclusively as a privileged training-time supervision signal and fully discarded before deployment, leaving the online model with the same footprint as a conventional traffic-only classifier.

Learning Using Privileged Information and Knowledge Distillation. Vapnik and Izmailov (Vapnik and Izmailov, [2015](https://arxiv.org/html/2606.17462#bib.bib67); Vapnik and Vashist, [2009](https://arxiv.org/html/2606.17462#bib.bib95)) formalize Learning Using Privileged Information (LUPI): auxiliary features available only at training time can substitute for larger datasets by providing richer concept supervision. Lopez-Paz et al. (Lopez-Paz et al., [2016](https://arxiv.org/html/2606.17462#bib.bib66)) unify LUPI and knowledge distillation in a single framework, showing that teacher-student training on enriched data implements privileged supervision through soft-label transfer. Hinton et al. (Hinton et al., [2015](https://arxiv.org/html/2606.17462#bib.bib71)) demonstrate that temperature-scaled KL divergence provides a rich inter-class relational signal beyond one-hot labels. Industrial applications confirm this paradigm: at Taobao, post-click behavioral signals (privileged at training but unavailable at serving) are distilled into click-through-rate predictors with significant accuracy gains (Xu et al., [2020](https://arxiv.org/html/2606.17462#bib.bib73)).
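For reference, the temperature-scaled soft targets discussed above are usually written as follows. This is the standard formulation from Hinton et al., shown here only as a sketch; the symbols $z^{t}$, $z^{s}$, $T$, and $\mathcal{L}_{\mathrm{KD}}$ are notation assumed for illustration and may differ from the notation ResAware uses in §4.

$$
p_i^{(T)}(z) = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)},
\qquad
\mathcal{L}_{\mathrm{KD}} = T^{2}\, \mathrm{KL}\!\left( p^{(T)}(z^{t}) \,\middle\|\, p^{(T)}(z^{s}) \right)
$$

Here $z^{t}$ and $z^{s}$ denote teacher and student logits. At $T = 1$ the expression reduces to the ordinary softmax; larger $T$ flattens the distribution so that the relative probabilities of non-target classes, which carry the inter-class relational signal, contribute more to the loss. The $T^{2}$ factor is the common convention for keeping gradient magnitudes comparable across temperatures.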
In the security domain, KD has been applied to traffic classification primarily for model compression (Huang et al., [2022](https://arxiv.org/html/2606.17462#bib.bib92); Pan et al., [2024](https://arxiv.org/html/2606.17462#bib.bib94)), not cross-modal generalization. To our knowledge, no prior WF work has formalized webpage resource structure as privileged information or exploited the training-rich / inference-poor asymmetry to provide environment-agnostic supervision. ResAware fills this gap by treating the resource modality as a privileged teacher that transfers inter-class topology to a traffic-only student without expanding the online attacker's observational boundary.

## 8. Conclusion

This paper presents ResAware to address the performance degradation of WF models under environmental shift. By formalizing a training-rich / inference-poor asymmetric threat model, ResAware uses stable application-layer resource sequences as privileged supervision to regularize traffic-only student models. Our findings show that internalizing resource-induced class topology allows the student to move beyond the observational limitations of the traffic modality, anchoring on a website's intrinsic identity rather than environment-specific traffic artifacts. Evaluated on a dataset spanning 5 months and 6 global vantage points, ResAware consistently improves the robustness of diverse WF architectures, including an 8.72% F1-score gain for Var-CNN under 150-day temporal drift. With zero inference overhead and orthogonal compatibility with existing adaptation methods, ResAware provides a practical foundation for robust website fingerprinting in real-world deployments.

## References

- A\. Bahramali, A\. Bozorgi, and A\. Houmansadr \(2023\)Realistic website fingerprinting by augmenting network traces\.InProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, CCS 2023, Copenhagen, Denmark, November 26\-30, 2023,W\. Meng, C\. D\. Jensen, C\. 
Cremers, and E\. Kirda \(Eds\.\),New York, NY, USA,pp\. 1035–1049\.External Links:[Link](https://doi.org/10.1145/3576915.3616639),[Document](https://dx.doi.org/10.1145/3576915.3616639)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p3.1),[§7](https://arxiv.org/html/2606.17462#S7.p2.1)\. - D\. J\. Berndt and J\. Clifford \(1994\)Using dynamic time warping to find patterns in time series\.InProceedings of the 3rd International Conference on Knowledge Discovery and Data Mining,AAAIWS’94,pp\. 359–370\.Cited by:[§3\.1](https://arxiv.org/html/2606.17462#S3.SS1.p5.5)\. - S\. Bhat, D\. Lu, A\. Kwon, and S\. Devadas \(2019\)Var\-cnn: A data\-efficient website fingerprinting attack based on deep learning\.Proc\. Priv\. Enhancing Technol\.2019\(4\),pp\. 292–310\.External Links:[Link](https://doi.org/10.2478/popets-2019-0070),[Document](https://dx.doi.org/10.2478/POPETS-2019-0070)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p2.1),[Table 1](https://arxiv.org/html/2606.17462#S5.T1.3.1.5.1),[§7](https://arxiv.org/html/2606.17462#S7.p1.1)\. - M\. Chen, Y\. Wang, H\. Xu, and X\. Zhu \(2021\)Few\-shot website fingerprinting attack\.Comput\. Netw\.198\(C\)\.External Links:ISSN 1389\-1286,[Link](https://doi.org/10.1016/j.comnet.2021.108298),[Document](https://dx.doi.org/10.1016/j.comnet.2021.108298)Cited by:[§5\.4](https://arxiv.org/html/2606.17462#S5.SS4.p1.1),[§5\.4](https://arxiv.org/html/2606.17462#S5.SS4.p2.1),[§7](https://arxiv.org/html/2606.17462#S7.p3.1)\. - Y\. Cheng, Y\. Zhu, B\. Li, X\. Deng, Y\. Cai, Y\. Ren, and Q\. Liu \(2025a\)STAR: semantic\-traffic alignment and retrieval for zero\-shot HTTPS website fingerprinting\.CoRRabs/2512\.17667\.External Links:[Link](https://doi.org/10.48550/arXiv.2512.17667),[Document](https://dx.doi.org/10.48550/ARXIV.2512.17667),2512\.17667Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p4.1),[§2](https://arxiv.org/html/2606.17462#S2.p3.1),[§7](https://arxiv.org/html/2606.17462#S7.p4.1)\. - Y\. Cheng, Y\. Zhu, B\. Li, P\. 
Sun, Y\. Ding, X\. Deng, and Q\. Liu \(2025b\)HOLMES & WATSON: A robust and lightweight HTTPS website fingerprinting through HTTP version parallelism\.InProceedings of the ACM on Web Conference 2025, WWW 2025, Sydney, NSW, Australia, 28 April 2025\- 2 May 2025,G\. Long, M\. Blumestein, Y\. Chang, L\. Lewin\-Eytan, Z\. H\. Huang, and E\. Yom\-Tov \(Eds\.\),pp\. 1078–1092\.External Links:[Link](https://doi.org/10.1145/3696410.3714578),[Document](https://dx.doi.org/10.1145/3696410.3714578)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p4.1),[§7](https://arxiv.org/html/2606.17462#S7.p4.1)\. - G\. Cherubin, R\. Jansen, and C\. Troncoso \(2022\)Online website fingerprinting: evaluating website fingerprinting attacks on tor in the real world\.In31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, August 10\-12, 2022,K\. R\. B\. Butler and K\. Thomas \(Eds\.\),pp\. 753–770\.External Links:[Link](https://www.usenix.org/conference/usenixsecurity22/presentation/cherubin)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p2.1),[§7](https://arxiv.org/html/2606.17462#S7.p1.1)\. - X\. Deng, R\. Zhao, Y\. Wang, M\. Zhan, Z\. Xue, and Y\. Wang \(2025a\)Countmamba: a generalized website fingerprinting attack via coarse\-grained representation and fine\-grained prediction\.In2025 IEEE Symposium on Security and Privacy \(SP\),Vol\.,pp\. 1419–1437\.External Links:[Document](https://dx.doi.org/10.1109/SP61157.2025.00154)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p2.1),[Table 1](https://arxiv.org/html/2606.17462#S5.T1.3.1.7.1),[§7](https://arxiv.org/html/2606.17462#S7.p1.1)\. - X\. Deng, J\. Chen, L\. Yu, Y\. Zhang, Z\. Gu, C\. Qiu, X\. Zhao, K\. Xu, and Q\. 
Li \(2025b\)Beyond a single perspective: towards a realistic evaluation of website fingerprinting attacks\.CoRRabs/2510\.14283\.External Links:[Link](https://doi.org/10.48550/arXiv.2510.14283),[Document](https://dx.doi.org/10.48550/ARXIV.2510.14283),2510\.14283Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p2.1),[§7](https://arxiv.org/html/2606.17462#S7.p1.1)\. - X\. Deng, Y\. Zhang, Q\. Li, Z\. Liu, Y\. Wang, and K\. Xu \(2026\)Enhancing website fingerprinting attacks against traffic drift\.InNetwork and Distributed System Security \(NDSS\) Symposium,External Links:[Link](https://www.ndss-symposium.org/ndss-paper/enhancing-website-fingerprinting-attacks-against-traffic-drift/)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p3.1),[§1](https://arxiv.org/html/2606.17462#S1.p7.1),[§5\.4](https://arxiv.org/html/2606.17462#S5.SS4.p3.1),[Table 5](https://arxiv.org/html/2606.17462#S5.T5),[§7](https://arxiv.org/html/2606.17462#S7.p3.1)\. - R\. T\. Fielding, M\. Nottingham, and J\. Reschke \(2022\)HTTP Semantics\.Request for Comments,RFC Editor\.Note:RFC 9110External Links:[Document](https://dx.doi.org/10.17487/RFC9110),[Link](https://www.rfc-editor.org/info/rfc9110)Cited by:[§3\.1](https://arxiv.org/html/2606.17462#S3.SS1.p1.1)\. - B\. Gao, W\. Liu, G\. Liu, F\. Nie, and J\. Huang \(2025\)Multi\-level resource\-coherented graph learning for website fingerprinting attacks\.IEEE Trans\. Inf\. Forensics Secur\.20,pp\. 693–708\.External Links:[Link](https://doi.org/10.1109/TIFS.2024.3520014),[Document](https://dx.doi.org/10.1109/TIFS.2024.3520014)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p4.1),[§7](https://arxiv.org/html/2606.17462#S7.p4.1)\. - J\. Hayes and G\. Danezis \(2016\)K\-fingerprinting: A robust scalable website fingerprinting technique\.In25th USENIX Security Symposium, USENIX Security 16, Austin, TX, USA, August 10\-12, 2016,T\. Holz and S\. Savage \(Eds\.\),pp\. 
1187–1203\.External Links:[Link](https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/hayes)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p1.1)\. - G\. E\. Hinton, O\. Vinyals, and J\. Dean \(2015\)Distilling the knowledge in a neural network\.CoRRabs/1503\.02531\.External Links:[Link](http://arxiv.org/abs/1503.02531),1503\.02531Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p6.1),[§7](https://arxiv.org/html/2606.17462#S7.p5.1)\. - A\. Hintz \(2002\)Fingerprinting websites using traffic analysis\.InProceedings of the 2nd International Conference on Privacy Enhancing Technologies,PET’02,Berlin, Heidelberg,pp\. 171–178\.External Links:ISBN 354000565XCited by:[§1](https://arxiv.org/html/2606.17462#S1.p1.1)\. - P\. E\. Hoffman and P\. McManus \(2018\)DNS Queries over HTTPS \(DoH\)\.Request for Comments,RFC Editor\.Note:RFC 8484External Links:[Document](https://dx.doi.org/10.17487/RFC8484),[Link](https://www.rfc-editor.org/info/rfc8484)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p1.1)\. - G\. Huang, C\. Ma, M\. Ding, Y\. Qian, C\. Ge, L\. Fang, and Z\. Liu \(2023\)Efficient and low overhead website fingerprinting attacks and defenses based on tcp/ip traffic\.InProceedings of the ACM Web Conference 2023,WWW ’23,New York, NY, USA,pp\. 1991–1999\.External Links:ISBN 9781450394161,[Link](https://doi.org/10.1145/3543507.3583200),[Document](https://dx.doi.org/10.1145/3543507.3583200)Cited by:[§2](https://arxiv.org/html/2606.17462#S2.p3.1)\. - T\. Huang, S\. You, F\. Wang, C\. Qian, and C\. Xu \(2022\)Knowledge distillation from a stronger teacher\.InProceedings of the 36th International Conference on Neural Information Processing Systems,NIPS ’22,Red Hook, NY, USA\.External Links:ISBN 9781713871088Cited by:[§5\.7](https://arxiv.org/html/2606.17462#S5.SS7.p4.1),[§7](https://arxiv.org/html/2606.17462#S7.p5.1)\. - M\. Juarez, S\. Afroz, G\. Acar, C\. Diaz, and R\. 
Greenstadt \(2014\)A critical evaluation of website fingerprinting attacks\.InProceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security,CCS ’14,New York, NY, USA,pp\. 263–274\.External Links:ISBN 9781450329576,[Link](https://doi.org/10.1145/2660267.2660368),[Document](https://dx.doi.org/10.1145/2660267.2660368)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p4.1),[§7](https://arxiv.org/html/2606.17462#S7.p1.1)\. - C\. Li, L\. Nie, L\. Zhao, and K\. Li \(2023\)Robust website fingerprinting through resource loading sequence\.World Wide Web \(WWW\)26\(5\),pp\. 2329–2349\.External Links:[Link](https://doi.org/10.1007/s11280-023-01138-2),[Document](https://dx.doi.org/10.1007/S11280-023-01138-2)Cited by:[§3\.1](https://arxiv.org/html/2606.17462#S3.SS1.p1.1),[§7](https://arxiv.org/html/2606.17462#S7.p4.1)\. - D\. Li, Q\. Yuan, T\. Li, S\. Chen, and J\. Yang \(2020\)Cross\-domain network traffic classification using unsupervised domain adaptation\.In2020 International Conference on Information Networking \(ICOIN\),Vol\.,pp\. 245–250\.External Links:[Document](https://dx.doi.org/10.1109/ICOIN48656.2020.9016470)Cited by:[§5\.4](https://arxiv.org/html/2606.17462#S5.SS4.p1.1)\. - J\. Li, D\. Wang, Y\. Liu, Y\. Gao, X\. Zhang, Z\. Lin, X\. Ma, X\. Luo, and X\. Guan \(2025\)Cross\-environmental website fingerprinting\.InIEEE INFOCOM 2025 \- IEEE Conference on Computer Communications, London, United Kingdom, May 19\-22, 2025,pp\. 1–10\.External Links:[Link](https://doi.org/10.1109/INFOCOM55648.2025.11044569),[Document](https://dx.doi.org/10.1109/INFOCOM55648.2025.11044569)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p2.1),[§3\.1](https://arxiv.org/html/2606.17462#S3.SS1.p5.5),[§7](https://arxiv.org/html/2606.17462#S7.p1.1)\. - D\. Lopez\-Paz, L\. Bottou, B\. Schölkopf, and V\. 
Vapnik \(2016\)Unifying distillation and privileged information\.In4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2\-4, 2016, Conference Track Proceedings,Y\. Bengio and Y\. LeCun \(Eds\.\),External Links:[Link](http://arxiv.org/abs/1511.03643)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p6.1),[§3\.2](https://arxiv.org/html/2606.17462#S3.SS2.p3.1),[§7](https://arxiv.org/html/2606.17462#S7.p5.1)\. - M\. Minderer, J\. Djolonga, R\. Romijnders, F\. Hubis, X\. Zhai, N\. Houlsby, D\. Tran, and M\. Lucic \(2021\)Revisiting the calibration of modern neural networks\.InAdvances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6\-14, 2021, virtual,M\. Ranzato, A\. Beygelzimer, Y\. N\. Dauphin, P\. Liang, and J\. W\. Vaughan \(Eds\.\),pp\. 15682–15694\.External Links:[Link](https://proceedings.neurips.cc/paper/2021/hash/8420d359404024567b5aefda1231af24-Abstract.html)Cited by:[§5\.7](https://arxiv.org/html/2606.17462#S5.SS7.p2.1)\. - R\. Netravali, A\. Goyal, J\. Mickens, and H\. Balakrishnan \(2016\)Polaris: faster page loads using fine\-grained dependency tracking\.In13th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2016, Santa Clara, CA, USA, March 16\-18, 2016,K\. J\. Argyraki and R\. Isaacs \(Eds\.\),pp\. 123–136\.External Links:[Link](https://www.usenix.org/conference/nsdi16/technical-sessions/presentation/netravali)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p4.1)\. - Q\. Pan, Y\. Yu, H\. Yan, M\. Wang, and B\. Qi \(2024\)ETKD: a semi\-supervised learning\-based knowledge distillation model for encrypted traffic classification\.In2024 IEEE International Conference on Systems, Man, and Cybernetics \(SMC\),Vol\.,pp\. 4528–4533\.External Links:[Document](https://dx.doi.org/10.1109/SMC54092.2024.10831035)Cited by:[§7](https://arxiv.org/html/2606.17462#S7.p5.1)\. - A\. Panchenko, F\. Lanze, J\. 
Pennekamp, T\. Engel, A\. Zinnen, M\. Henze, and K\. Wehrle \(2016a\)Website fingerprinting at internet scale\.In23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21\-24, 2016,External Links:[Link](http://wp.internetsociety.org/ndss/wp-content/uploads/sites/25/2017/09/website-fingerprinting-internet-scale.pdf)Cited by:[§3\.1](https://arxiv.org/html/2606.17462#S3.SS1.p1.1)\. - A\. Panchenko, F\. Lanze, J\. Pennekamp, T\. Engel, A\. Zinnen, M\. Henze, and K\. Wehrle \(2016b\)Website fingerprinting at internet scale\.In23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21\-24, 2016,External Links:[Link](http://wp.internetsociety.org/ndss/wp-content/uploads/sites/25/2017/09/website-fingerprinting-internet-scale.pdf)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p4.1)\. - V\. L\. Pochat, T\. van Goethem, S\. Tajalizadehkhoob, M\. Korczynski, and W\. Joosen \(2019\)Tranco: A research\-oriented top sites ranking hardened against manipulation\.In26th Annual Network and Distributed System Security Symposium, NDSS 2019, San Diego, California, USA, February 24\-27, 2019,External Links:[Link](https://www.ndss-symposium.org/ndss-paper/tranco-a-research-oriented-top-sites-ranking-hardened-against-manipulation/)Cited by:[Appendix B](https://arxiv.org/html/2606.17462#A2.p2.1),[2nd item](https://arxiv.org/html/2606.17462#A4.I1.i2.p1.1),[§5\.1](https://arxiv.org/html/2606.17462#S5.SS1.p2.1)\. - Project X Community \(2020\)Note:[https://xtls\.github\.io/](https://xtls.github.io/)Accessed: 2026\-04\-28Cited by:[3rd item](https://arxiv.org/html/2606.17462#S5.I1.i3.p1.1)\. - M\. S\. Rahman, P\. Sirinam, N\. Mathews, K\. G\. Gangadhara, and M\. Wright \(2020\)Tik\-tok: the utility of packet timing in website fingerprinting attacks\.Proc\. Priv\. Enhancing Technol\.2020\(3\),pp\. 
5–24\.External Links:[Link](https://doi.org/10.2478/popets-2020-0043),[Document](https://dx.doi.org/10.2478/POPETS-2020-0043)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p2.1),[Table 1](https://arxiv.org/html/2606.17462#S5.T1.3.1.6.1),[§7](https://arxiv.org/html/2606.17462#S7.p1.1)\. - E\. Rescorla, K\. Oku, N\. Sullivan, and C\. A\. Wood \(2025\)TLS Encrypted Client Hello\.Internet\-DraftTechnical Reportdraft\-ietf\-tls\-esni\-25,Internet Engineering Task Force,Internet Engineering Task Force\.Note:Work in ProgressExternal Links:[Link](https://datatracker.ietf.org/doc/draft-ietf-tls-esni/25/)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p1.1)\. - V\. Rimmer, D\. Preuveneers, M\. Juarez, T\. van Goethem, and W\. Joosen \(2018\)Automated website fingerprinting through deep learning\.In25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18\-21, 2018,External Links:[Link](https://www.ndss-symposium.org/wp-content/uploads/2018/02/ndss2018%5C_03A-1%5C_Rimmer%5C_paper.pdf)Cited by:[Table 1](https://arxiv.org/html/2606.17462#S5.T1.3.1.2.1)\. - M\. Shadbeh, K\. Khajavi, and T\. Wang \(2026\)Reality check for tor website fingerprinting in the open world\.CoRRabs/2603\.07412\.External Links:[Link](https://doi.org/10.48550/arXiv.2603.07412),[Document](https://dx.doi.org/10.48550/ARXIV.2603.07412),2603\.07412Cited by:[§7](https://arxiv.org/html/2606.17462#S7.p1.1)\. - shadowsocks\.org \(2016\)External Links:[Link](https://shadowsocks.org/)Cited by:[3rd item](https://arxiv.org/html/2606.17462#S5.I1.i3.p1.1)\. - M\. Shen, K\. Ji, Z\. Gao, Q\. Li, L\. Zhu, and K\. Xu \(2023\)Subverting website fingerprinting defenses with robust traffic representation\.In32nd USENIX Security Symposium \(USENIX Security 23\),Anaheim, CA,pp\. 
607–624\.External Links:ISBN 978\-1\-939133\-37\-3,[Link](https://www.usenix.org/conference/usenixsecurity23/presentation/shen-meng)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p3.1),[§7](https://arxiv.org/html/2606.17462#S7.p2.1)\. - A\. Shusterman, R\. David, and Y\. Oren \(2026\)Understanding and addressing concept drift in website fingerprinting\.Computer Networks275\(English\)\.Note:Publisher Copyright: © 2025 The Author\(s\)External Links:[Document](https://dx.doi.org/10.1016/j.comnet.2025.111811),ISSN 1389\-1286Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p2.1),[§7](https://arxiv.org/html/2606.17462#S7.p1.1)\. - P\. Sirinam, M\. Imani, M\. Juarez, and M\. Wright \(2018\)Deep fingerprinting: undermining website fingerprinting defenses with deep learning\.InProceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS 2018, Toronto, ON, Canada, October 15\-19, 2018,D\. Lie, M\. Mannan, M\. Backes, and X\. Wang \(Eds\.\),pp\. 1928–1943\.External Links:[Link](https://doi.org/10.1145/3243734.3243768),[Document](https://dx.doi.org/10.1145/3243734.3243768)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p2.1),[§2](https://arxiv.org/html/2606.17462#S2.p3.1),[§3\.1](https://arxiv.org/html/2606.17462#S3.SS1.p10.2),[Table 1](https://arxiv.org/html/2606.17462#S5.T1.3.1.3.1),[§7](https://arxiv.org/html/2606.17462#S7.p1.1)\. - P\. Sirinam, N\. Mathews, M\. S\. Rahman, and M\. Wright \(2019\)Triplet fingerprinting: more practical and portable website fingerprinting with n\-shot learning\.InProceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, CCS 2019, London, UK, November 11\-15, 2019,L\. Cavallaro, J\. Kinder, X\. Wang, and J\. Katz \(Eds\.\),pp\. 
1131–1148\.External Links:[Link](https://doi.org/10.1145/3319535.3354217),[Document](https://dx.doi.org/10.1145/3319535.3354217)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p3.1),[§1](https://arxiv.org/html/2606.17462#S1.p7.1),[§5\.4](https://arxiv.org/html/2606.17462#S5.SS4.p1.1),[§5\.4](https://arxiv.org/html/2606.17462#S5.SS4.p2.1),[§7](https://arxiv.org/html/2606.17462#S7.p3.1)\. - trojan\-gfw \(2023\)Note:[https://trojan\-gfw\.github\.io/trojan/](https://trojan-gfw.github.io/trojan/)Accessed: 2026\-04\-28Cited by:[3rd item](https://arxiv.org/html/2606.17462#S5.I1.i3.p1.1)\. - V\. Vapnik and R\. Izmailov \(2015\)Learning using privileged information: similarity control and knowledge transfer\.J\. Mach\. Learn\. Res\.16,pp\. 2023–2049\.External Links:[Link](https://dl.acm.org/doi/10.5555/2789272.2886814),[Document](https://dx.doi.org/10.5555/2789272.2886814)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p5.1),[§1](https://arxiv.org/html/2606.17462#S1.p6.1),[§3\.2](https://arxiv.org/html/2606.17462#S3.SS2.p3.1),[§7](https://arxiv.org/html/2606.17462#S7.p5.1)\. - V\. Vapnik and A\. Vashist \(2009\)A new learning paradigm: learning using privileged information\.Neural Networks22\(5\-6\),pp\. 544–557\.External Links:[Link](https://doi.org/10.1016/j.neunet.2009.06.042),[Document](https://dx.doi.org/10.1016/J.NEUNET.2009.06.042)Cited by:[§7](https://arxiv.org/html/2606.17462#S7.p5.1)\. - A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.InProceedings of the 31st International Conference on Neural Information Processing Systems,NIPS’17,Red Hook, NY, USA,pp\. 6000–6010\.External Links:ISBN 9781510860964Cited by:[§4\.2](https://arxiv.org/html/2606.17462#S4.SS2.p6.1)\. - D\. Wang, E\. Shelhamer, S\. Liu, B\. Olshausen, and T\. 
Darrell \(2021\)Tent: fully test\-time adaptation by entropy minimization\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=uXl3bZLkr3c)Cited by:[§5\.4](https://arxiv.org/html/2606.17462#S5.SS4.p1.1)\. - T\. Wang, X\. Cai, R\. Nithyanand, R\. Johnson, and I\. Goldberg \(2014\)Effective attacks and provable defenses for website fingerprinting\.InProceedings of the 23rd USENIX Security Symposium, San Diego, CA, USA, August 20\-22, 2014,K\. Fu and J\. Jung \(Eds\.\),pp\. 143–157\.External Links:[Link](https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/wang%5C_tao)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p1.1),[Table 1](https://arxiv.org/html/2606.17462#S5.T1.3.1.4.1)\. - X\. S\. Wang, A\. Balasubramanian, A\. Krishnamurthy, and D\. Wetherall \(2013\)Demystifying page load performance with WProf\.In10th USENIX Symposium on Networked Systems Design and Implementation \(NSDI 13\),Lombard, IL,pp\. 473–485\.External Links:ISBN 978\-1\-931971\-00\-3,[Link](https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/wang_xiao)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p4.1)\. - Y\. Xie, J\. Feng, W\. Huang, Y\. Zhang, X\. Sun, X\. Chen, and X\. Luo \(2024\)Contrastive fingerprinting: A novel website fingerprinting attack over few\-shot traces\.InProceedings of the ACM on Web Conference 2024, WWW 2024, Singapore, May 13\-17, 2024,T\. Chua, C\. Ngo, R\. Kumar, H\. W\. Lauw, and R\. K\. Lee \(Eds\.\),pp\. 1203–1214\.External Links:[Link](https://doi.org/10.1145/3589334.3645575),[Document](https://dx.doi.org/10.1145/3589334.3645575)Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p3.1)\. - C\. Xu, Q\. Li, J\. Ge, J\. Gao, X\. Yang, C\. Pei, F\. Sun, J\. Wu, H\. Sun, and W\. 
Ou \(2020\)Privileged features distillation at taobao recommendations\.InProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining,KDD ’20,New York, NY, USA,pp\. 2590–2598\.External Links:ISBN 9781450379984,[Link](https://doi.org/10.1145/3394486.3403309),[Document](https://dx.doi.org/10.1145/3394486.3403309)Cited by:[§7](https://arxiv.org/html/2606.17462#S7.p5.1)\. - G\. Zhang, J\. Cao, M\. Xu, and X\. Deng \(2023\)Unsupervised and adaptive tor website fingerprinting\.InInternational Conference on Security and Privacy in Communication Systems,pp\. 209–229\.Cited by:[§1](https://arxiv.org/html/2606.17462#S1.p3.1),[§1](https://arxiv.org/html/2606.17462#S1.p7.1),[§7](https://arxiv.org/html/2606.17462#S7.p3.1)\. - H\. Zou, J\. Su, Z\. Wei, S\. Chen, and B\. Zhao \(2022\)An efficient cross\-domain few\-shot website fingerprinting attack with brownian distance covariance\.Comput\. Networks219,pp\. 109461\.External Links:[Link](https://doi.org/10.1016/j.comnet.2022.109461),[Document](https://dx.doi.org/10.1016/J.COMNET.2022.109461)Cited by:[§7](https://arxiv.org/html/2606.17462#S7.p3.1)\.

## Appendix A Open Science

Research artifacts have been de-identified for double-blind review and are hosted at: [https://github.com/aimafan123/ResAware](https://github.com/aimafan123/ResAware). The repository includes the full training and evaluation code for the ResAware distillation framework and teacher model, along with implementations of six WF backbones (AWF, DF, Var-CNN, Tik-Tok, RF, and CountMamba) reproduced from their original papers or official codebases. The artifact suite also includes scripts for zero-shot evaluation across four drift scenarios (temporal, spatial, obfuscated proxy, and browser), open-world temporal drift testing, and few-shot adaptation supporting both supervised and Proteus-based unsupervised modes.
Due to storage constraints and privacy considerations, we provide featurized versions of our cross-environment datasets. Automated pipelines for data processing, training, and evaluation are included to facilitate efficient reproduction of the core experimental results.

## Appendix B Ethical Considerations

This research follows the established ethical guidelines of the security community for network measurement and privacy analysis. Our practices are as follows.

Data Collection and Privacy Protection. All network traffic was generated by automated headless browsers on vantage points controlled by our team. We accessed only publicly indexable websites (a subset of the Tranco Top 100K (Pochat et al., [2019](https://arxiv.org/html/2606.17462#bib.bib63))). Data collection complied with each site’s robots.txt directives and terms of service. No real user browsing behavior, personally identifiable information (PII), or private communication content was involved at any stage. TLS session keys were extracted within our controlled client processes solely to reconstruct application-layer resource events and were neither retained beyond this purpose nor shared. Upon publication, we will release only featurized representations: packet direction and length sequences for traffic, and category and size sequences for resource loading. No raw payloads, URLs, IP addresses, or user-linked identifiers will be included, fully mitigating privacy risks while preserving research utility.

Selection of Monitored Sites. The 100 monitored sites were randomly sampled from the Tranco Top 100K list. To minimize exposure to politically sensitive content, we manually excluded sites identified by international human rights organizations as subject to mandatory censorship. The 83,645 unmonitored sites were also drawn from the Tranco rankings and do not target specific user groups or sensitive content.
Dual-Use and Responsible Disclosure. WF attack research is widely recognized as a prerequisite for improving privacy defenses: only by understanding and quantifying attacker capabilities can defenders design effective countermeasures. This paper focuses on the root causes of cross-environment WF failures rather than expanding online attacker capabilities. The core insight of ResAware, that unstable traffic-side supervision is the fundamental driver of generalization failure, provides direct guidance for defensive research. Defenders can introduce targeted perturbations at the resource-loading level (randomizing loading orders, injecting dummy requests, or diversifying resource type distributions) to undermine the stable resource-side inductive bias that ResAware exploits, complementing existing packet-level defenses. We will also release the large-scale paired traffic-resource dataset to lower the barrier for cross-environment WF research. Covering temporal, spatial, obfuscated proxy, and browser distribution shifts, this dataset represents one of the most comprehensive cross-environment WF benchmarks and will help the community evaluate attacks and defenses under unified protocols.

## Appendix C ResAware Training and Deployment Protocol

Algorithm [1](https://arxiv.org/html/2606.17462#algorithm1) formalizes the three-stage training and deployment protocol of ResAware introduced in §[4](https://arxiv.org/html/2606.17462#S4). Stage 1 extracts two-channel privileged resource features from raw resource records and trains the resource-only teacher under hard-label supervision. Stage 2 freezes the teacher and distills its soft-target distributions into the traffic-only student through a weighted combination of classification loss and KL-divergence distillation loss.
Stage 3 discards all resource-side components (the resource extractor, teacher model, cached soft labels, and distillation loss) before deployment, so that the deployed student operates on encrypted traffic alone with zero additional inference overhead.

Input: Source-domain paired training set $\mathcal{D}_s=\{(x_i,R_i,y_i)\}_{i=1}^{n}$, where $R_i$ denotes raw resource records; resource teacher $T_{\theta_T}$; traffic student $S_{\theta_S}$; truncation length $N$; temperature $\tau$; distillation weight $\alpha$

Output: Deployed traffic-only student $S_{\theta_S}$

Stage 1: extract two-channel privileged resource features and train the teacher

1. For each sample $(x_i,R_i,y_i)\in\mathcal{D}_s$:
   - $Z_i\leftarrow\text{SortByRequestOrder}(R_i)$
   - $c_i\leftarrow\text{MapTypeToCategory}(Z_i)$ (categorical channel)
   - $\tilde{s}_i\leftarrow\log(1+\text{PayloadBytes}(Z_i))$ (size channel)
   - $x_i^{*}\leftarrow\text{Pad/Truncate}_N\big([(c_{i,1},\tilde{s}_{i,1}),\ldots,(c_{i,|Z_i|},\tilde{s}_{i,|Z_i|})]\big)$
2. Construct privileged set $\mathcal{D}_s^{*}=\{(x_i,x_i^{*},y_i)\}_{i=1}^{n}$.
3. For each mini-batch $\mathcal{B}^{*}\subset\{(x_i^{*},y_i)\mid(x_i,x_i^{*},y_i)\in\mathcal{D}_s^{*}\}$:
   - $z_T\leftarrow T_{\theta_T}(x^{*})$
   - $\mathcal{L}_T\leftarrow\mathrm{CE}(\sigma(z_T),y)$
   - update $\theta_T$ by minimizing $\mathcal{L}_T$
4. Freeze $\theta_T$.

Stage 2: distill resource knowledge into the traffic-only student

5. For each mini-batch $\mathcal{B}\subset\mathcal{D}_s^{*}$:
   - $z_T\leftarrow T_{\theta_T}(x^{*})$ (privileged branch; no gradient to $T$)
   - $z_S\leftarrow S_{\theta_S}(x)$ (deployable branch)
   - $\mathcal{L}_{cls}\leftarrow\mathrm{CE}(\sigma(z_S),y)$
   - $\mathcal{L}_{kd}\leftarrow\tau^{2}D_{KL}\big(\sigma(z_T/\tau)\,\|\,\sigma(z_S/\tau)\big)$
   - $\mathcal{L}_{total}\leftarrow(1-\alpha)\mathcal{L}_{cls}+\alpha\mathcal{L}_{kd}$
   - update $\theta_S$ by minimizing $\mathcal{L}_{total}$

Stage 3: discard privileged components before deployment

6. Discard resource extractor, $T_{\theta_T}$, cached soft labels, and distillation losses.
7. Return $S_{\theta_S}$ for online inference on encrypted traffic $x$ only.

Algorithm 1. Resource-Privileged Distillation for Traffic-Only WF

## Appendix D Dataset Construction and Collection Details

### D.1. Methodology for Traffic-Resource Pairing

This section outlines the methodology for constructing trace-level traffic-resource pairs from a single controlled page visit. For each page load, we define a trace as the aggregate network activity triggered by a complete page visit. For each crawler visit, we simultaneously capture raw encrypted traffic and the corresponding TLS session keys. In the offline phase, we reconstruct application-layer resource sequences and align them with the captured traffic traces. The detailed procedure for resource sequence recovery is provided in Algorithm [2](https://arxiv.org/html/2606.17462#algorithm2).
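The stream-grouping step of this offline reconstruction can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it starts from frames that have already been decrypted, and the `Frame` schema, the `infer_resource_type` category map, and the choice to sort streams without a request time last are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Flow = Tuple[str, int, str, int]  # (src_ip, src_port, dst_ip, dst_port)


@dataclass
class Frame:
    """One decrypted application-layer frame (hypothetical schema)."""
    flow: Flow
    stream_id: int
    kind: str  # "request_headers", "response_headers", or "data"
    time: float
    length: int = 0
    content_type: Optional[str] = None


@dataclass
class Stream:
    """Per-resource state accumulated while scanning frames."""
    t_req: Optional[float] = None
    rtype: Optional[str] = None
    size: int = 0


def infer_resource_type(content_type: Optional[str]) -> str:
    """Toy stand-in for resource-type inference; the real category map is an assumption."""
    ct = (content_type or "").lower()
    if "javascript" in ct:
        return "script"
    if ct.startswith("image/"):
        return "image"
    if "css" in ct:
        return "stylesheet"
    if "html" in ct:
        return "document"
    return "other"


def reconstruct(frames: List[Frame]) -> List[Tuple[str, int]]:
    """Group frames by (flow, stream id) and emit (type, size) pairs in request order."""
    streams: Dict[Tuple[Flow, int], Stream] = {}
    for a in frames:
        s = streams.setdefault((a.flow, a.stream_id), Stream())
        if a.kind == "request_headers":
            s.t_req = a.time
        elif a.kind == "response_headers":
            s.rtype = infer_resource_type(a.content_type)
        elif a.kind == "data":
            s.size += a.length
    # Keep only complete streams: a known type and a non-empty body.
    complete = [s for s in streams.values() if s.rtype is not None and s.size > 0]
    # Assumption: streams with no recorded request time are placed last.
    complete.sort(key=lambda s: s.t_req if s.t_req is not None else float("inf"))
    return [(s.rtype, s.size) for s in complete]


flow = ("10.0.0.1", 50000, "192.0.2.1", 443)
example = [
    Frame(flow, 3, "request_headers", 0.20),
    Frame(flow, 1, "request_headers", 0.10),
    Frame(flow, 1, "response_headers", 0.15, content_type="text/html"),
    Frame(flow, 3, "response_headers", 0.25, content_type="image/png"),
    Frame(flow, 3, "data", 0.26, length=700),
    Frame(flow, 1, "data", 0.16, length=1000),
    Frame(flow, 1, "data", 0.17, length=200),
    Frame(flow, 5, "request_headers", 0.30),  # never answered, so it is dropped
]
print(reconstruct(example))  # [('document', 1200), ('image', 700)]
```

Keying the state map on the pair (flow tuple, stream id) keeps multiplexed streams on one connection apart, which is why interleaved frames in the example still yield one entry per resource.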
Input: Encrypted packet sequence $x$, TLS session keys $K$

Output: Two-channel privileged resource sequence $x^{*}$

Recover application-layer records offline

1. $x_{dec}\leftarrow\text{DecryptTraffic}(x,K)$
2. $C\leftarrow\emptyset$ (connection state map); $Z\leftarrow\emptyset$

Group decrypted application frames into resource streams

3. For each application frame $a\in x_{dec}$:
   - $f\leftarrow\text{FlowTuple}(a)$; $sid\leftarrow\text{StreamID}(a)$
   - if $sid\notin C[f]$ then $C[f][sid]\leftarrow\text{NewStream}()$
   - $S\leftarrow C[f][sid]$
   - if $a\in\text{RequestHeaders}$ then $S.t_{req}\leftarrow a.\text{time}$
   - if $a\in\text{ResponseHeaders}$ then $S.\text{type}\leftarrow\text{InferResourceType}(a)$
   - if $a\in\text{DataFrame}$ then $S.\text{size}\leftarrow S.\text{size}+a.\text{length}$

Keep complete streams and form a two-channel sequence

4. For each stream $S\in C$: if $S.\text{type}$ exists and $S.\text{size}>0$ then $Z\leftarrow Z\cup\{(S.t_{req},S.\text{type},S.\text{size})\}$
5. $Z\leftarrow\text{SortByRequestTime}(Z)$
6. $x^{*}\leftarrow[(\text{type}_1,\text{size}_1),\ldots,(\text{type}_{|Z|},\text{size}_{|Z|})]$
7. Return $x^{*}$.

Algorithm 2. Offline Privileged Resource Sequence Reconstruction

### D.2. Dataset Information

To evaluate cross-environment robustness, we construct a dataset suite we term the ResAware Dataset Suite.
This suite is partitioned into multiple subsets, each isolating a distinct experimental factor: temporal evolution, spatial diversity, obfuscated proxy encapsulation, browser variation, and open-world background traffic. To ensure environmental consistency across subsets, all vantage points run on Virtual Private Servers (VPS) hosted by Vultr ([https://www.vultr.com/](https://www.vultr.com/)). Each VPS uses an identical base configuration: Debian 13 OS, 1 vCPU, 2 GB RAM, 64 GB NVMe storage, and 2 TB bandwidth. During collection, two isolated Docker containers ran concurrently on each VPS to execute crawling tasks, providing a clean and reproducible environment. Each access targeted the site’s homepage with a fixed 50-second capture window to cover initial page loads, asynchronous requests, and deferred resource loading. The automated browser scrolled the page three times at random intervals to trigger lazy-loaded images, scripts, and advertisement resources. After each visit, a screenshot was saved and a quality control (QC) pipeline filtered out failed visits, error pages, blank pages, and incomplete loads. The collected subsets are described below:

- Train-Base: The source-domain training set for temporal and spatial drift experiments. Collected on November 21, 2025 from 6 VPS in New York, US, using Chrome with standard HTTPS/TLS. It covers 100 monitored sites at 150 traces per site (15,000 paired traces total).
- Open-World: The unmonitored background pool for open-world evaluation, collected on November 21, 2025 from 6 VPS in New York, US with settings identical to Train-Base. Starting from 100,000 Tranco Top 100K (Pochat et al., [2019](https://arxiv.org/html/2606.17462#bib.bib63)) candidate sites, we retain 83,645 after excluding monitored-set overlap and filtering failed or anomalous visits. Each site contributes one trace (83,645 total), used exclusively as the negative background pool.
- Geo-Drift: Used for spatial drift experiments. Collected on November 21, 2025 across five international vantage points (Tokyo, Japan; Singapore; Johannesburg, South Africa; Sydney, Australia; and Frankfurt, Germany) using 10 VPS in total. All settings mirror Train-Base. It covers the same 100 monitored sites at 25–30 traces per site per location (14,087 paired traces across five locations).
- Time-Drift: Used for temporal drift experiments, comprising five snapshots collected on December 21, 2025; January 20, 2026; February 19, 2026; March 21, 2026; and April 20, 2026, corresponding to 30, 60, 90, 120, and 150 days after Train-Base. Each snapshot uses 2 VPS in New York, US, with 30 traces per site for the 100 monitored sites (15,000 paired traces across five snapshots).
- Train-Base-2: The source-domain training set for obfuscated proxy and browser drift experiments. Collected on March 21, 2026 from 6 VPS in New York, US with configurations identical to Train-Base, covering 100 monitored sites at 150 traces per site (15,000 paired traces). Its temporal alignment with the obfuscated proxy and browser test sets controls for long-term temporal drift, so that observed performance differences are attributable primarily to protocol or browser variation.
- Obfuscated-Proxy-Drift: Used for obfuscated proxy drift experiments. Collected on March 21, 2026 using 12 client VPS in New York, US; all traffic was forwarded through Xray proxies. Two additional VPS served as Xray proxy servers, each handling three obfuscation protocols, with two client VPS assigned per protocol. Both clients and servers run Xray-core v26.1.23 ([https://github.com/XTLS/Xray-core/releases/tag/v26.1.23](https://github.com/XTLS/Xray-core/releases/tag/v26.1.23)). It covers 100 monitored sites at 30 traces per site per protocol (18,000 paired traces across six protocols).
- Browser-Drift: Used for browser drift experiments.
Collected on March 21, 2026 using 4 VPS in New York, US (2 per browser: Edge and Firefox); all other settings match Train-Base-2. It covers 100 monitored sites at 25–30 traces per browser per site (5,523 paired traces total).

## Appendix E Complete Per-Environment Zero-Shot Results Across All Drift Scenarios

Table [8](https://arxiv.org/html/2606.17462#A5.T8) reports the complete per-environment closed-world F1-score for all six backbones evaluated in §[5.2](https://arxiv.org/html/2606.17462#S5.SS2), complementing the aggregated results in Table [2](https://arxiv.org/html/2606.17462#S5.T2). Across 108 backbone-environment combinations, ResAware yields positive gains in 84 cases (77.78%) and negative gains in 24 cases, indicating broad but not unconditional effectiveness. Negative cases mainly arise from low-capacity AWF under temporal/spatial/proxy/browser drift and from several protocol-induced proxy settings for DF/Tik-Tok, consistent with the applicability analysis in §[5.6](https://arxiv.org/html/2606.17462#S5.SS6).

Table 8. Complete per-environment zero-shot closed-world F1-score for all six backbones with (w/) and without (w/o) ResAware across all drift scenarios. Values are Mean ± SD.

| Scenario | Target Env. | AWF w/o | AWF w/ | DF w/o | DF w/ | RF w/o | RF w/ | Tik-Tok w/o | Tik-Tok w/ | Var-CNN w/o | Var-CNN w/ | CountMamba w/o | CountMamba w/ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Temporal Drift | Day 30 | 51.26±9.16 | 51.04±3.70 | 91.28±0.36 | 91.72±0.69 | 90.46±0.95 | 89.99±0.82 | 89.07±0.66 | 90.39±0.30 | 91.68±0.80 | 93.95±0.65 | 87.79±1.32 | 87.06±0.67 |
| Temporal Drift | Day 60 | 48.17±8.36 | 49.00±2.97 | 86.75±1.24 | 87.25±0.62 | 45.88±1.94 | 47.71±1.85 | 78.63±1.89 | 81.45±1.22 | 88.63±1.78 | 93.38±0.44 | 35.26±3.31 | 36.09±3.76 |
| Temporal Drift | Day 90 | 40.15±6.10 | 39.58±2.00 | 73.60±0.67 | 76.46±0.89 | 39.84±1.89 | 42.37±1.46 | 66.57±1.13 | 69.72±0.56 | 80.37±1.95 | 86.37±0.73 | 31.79±2.03 | 32.81±1.03 |
| Temporal Drift | Day 120 | 38.20±5.35 | 37.89±2.28 | 67.35±0.92 | 71.38±1.55 | 36.53±2.04 | 38.73±1.90 | 59.34±1.14 | 62.67±1.30 | 79.44±1.99 | 87.60±1.42 | 29.79±2.89 | 29.38±1.22 |
| Temporal Drift | Day 150 | 33.25±5.23 | 32.25±3.41 | 61.39±1.11 | 65.79±1.49 | 36.64±2.46 | 38.27±1.06 | 54.64±0.84 | 57.67±0.65 | 72.77±1.63 | 81.49±1.61 | 28.94±2.34 | 29.16±2.21 |
| Temporal Drift | AVG | 42.21±6.63 | 41.95±2.76 | 76.07±0.56 | 78.52±0.93 | 49.87±1.69 | 51.41±1.30 | 69.65±1.01 | 72.38±0.50 | 82.58±1.54 | 88.56±0.84 | 42.71±2.33 | 42.90±1.70 |
| Spatial Drift | AU | 53.83±8.71 | 53.33±4.08 | 87.04±0.26 | 88.13±0.45 | 79.01±0.73 | 79.65±0.72 | 85.44±0.25 | 86.98±0.29 | 84.74±0.73 | 87.44±0.54 | 74.87±1.25 | 79.31±0.95 |
| Spatial Drift | DE | 40.80±7.88 | 40.07±3.16 | 81.99±0.81 | 84.65±1.11 | 77.21±0.87 | 83.14±0.62 | 80.87±0.88 | 82.29±0.31 | 81.99±0.45 | 85.66±1.07 | 77.05±0.88 | 76.56±0.18 |
| Spatial Drift | JP | 50.95±8.56 | 50.73±3.98 | 84.50±0.41 | 86.38±0.22 | 78.80±0.42 | 80.69±0.50 | 83.05±0.39 | 84.92±0.24 | 83.21±1.06 | 88.11±0.68 | 74.14±1.64 | 76.84±0.37 |
| Spatial Drift | SG | 51.64±7.69 | 51.18±2.84 | 86.84±0.52 | 88.65±0.33 | 79.79±1.02 | 82.35±0.34 | 83.97±0.35 | 86.51±0.58 | 84.71±1.28 | 88.64±0.21 | 75.44±0.64 | 77.77±0.21 |
| Spatial Drift | ZA | 48.93±5.55 | 48.50±2.94 | 83.16±1.02 | 85.37±0.40 | 65.76±1.95 | 67.23±1.37 | 80.93±0.75 | 84.79±0.28 | 78.67±0.54 | 84.97±0.53 | 63.05±1.48 | 69.69±1.61 |
| Spatial Drift | AVG | 49.23±7.57 | 48.76±3.36 | 84.71±0.34 | 86.64±0.28 | 76.11±0.36 | 78.61±0.38 | 82.85±0.37 | 85.10±0.22 | 82.66±0.60 | 86.96±0.40 | 72.91±1.07 | 76.03±0.59 |
| Obfuscated Proxy Drift | Shadowsocks | 12.09±2.32 | 14.91±2.16 | 43.94±0.99 | 44.09±0.91 | 59.32±2.07 | 61.98±3.23 | 45.90±1.59 | 45.64±1.20 | 40.21±1.95 | 50.34±2.88 | 60.84±1.45 | 63.01±0.16 |
| Obfuscated Proxy Drift | Trojan | 13.02±2.49 | 15.13±1.95 | 45.26±1.47 | 45.37±0.97 | 63.27±2.80 | 66.46±2.56 | 46.77±1.63 | 46.66±1.46 | 42.07±1.36 | 53.25±2.49 | 64.47±2.05 | 64.58±0.03 |
| Obfuscated Proxy Drift | VLESS-XTLS-Vision | 18.48±0.88 | 16.77±0.54 | 41.33±1.14 | 40.05±0.69 | 48.30±1.97 | 51.58±1.99 | 35.94±1.84 | 37.12±0.55 | 29.33±2.65 | 36.25±3.16 | 46.89±1.88 | 47.72±0.54 |
| Obfuscated Proxy Drift | VMess-TLS | 15.32±2.82 | 18.09±2.10 | 48.90±1.38 | 47.38±1.37 | 64.33±2.17 | 67.53±22.9 | 45.16±3.11 | 45.47±0.88 | 45.44±1.82 | 55.73±2.13 | 63.91±0.46 | 65.41±1.32 |
| Obfuscated Proxy Drift | VMess | 25.03±0.74 | 22.94±2.04 | 57.68±1.28 | 55.11±0.66 | 72.02±2.21 | 76.90±2.63 | 47.76±4.18 | 48.69±1.52 | 41.46±3.07 | 52.21±3.11 | 65.34±0.62 | 66.37±1.23 |
| Obfuscated Proxy Drift | VMess-WS-TLS | 21.25±0.94 | 20.32±1.73 | 52.82±1.64 | 51.70±0.49 | 69.94±2.13 | 75.99±2.29 | 45.57±3.73 | 45.71±1.60 | 30.30±5.81 | 34.85±3.15 | 65.82±0.51 | 67.89±0.83 |
| Obfuscated Proxy Drift | AVG | 17.53±1.34 | 18.03±1.45 | 48.32±0.77 | 47.28±0.73 | 62.86±2.12 | 66.74±2.36 | 44.52±2.34 | 44.88±0.35 | 38.14±1.81 | 47.10±2.46 | 61.21±1.04 | 62.50±0.44 |
| Browser Drift | Edge | 10.23±1.88 | 09.98±1.09 | 07.66±0.69 | 12.18±0.42 | 27.08±1.57 | 32.60±0.41 | 09.18±0.52 | 11.80±0.85 | 26.16±2.11 | 33.64±0.98 | 12.80±1.22 | 16.37±2.28 |
| Browser Drift | Firefox | 01.60±0.26 | 02.15±0.29 | 00.49±0.10 | 01.13±0.22 | 09.22±0.64 | 13.07±1.37 | 00.40±0.23 | 00.30±0.18 | 08.31±0.68 | 09.27±1.59 | 01.42±0.45 | 02.63±0.24 |
| Browser Drift | AVG | 05.91±0.87 | 06.06±0.69 | 04.07±0.32 | 06.66±0.21 | 18.15±1.04 | 22.83±0.82 | 04.79±0.30 | 06.05±0.47 | 17.24±1.20 | 21.45±0.72 | 07.11±0.51 | 09.50±1.02 |

## Appendix F Sensitivity Analysis of $\alpha$

Figure 9. Performance gain $\Delta$ (%) over the $\alpha=0$ baseline as a function of distillation weight $\alpha$ for six backbones. The best-performing $\alpha$ range remains largely stable for each backbone, indicating that the distillation weight is mainly coupled to student capacity rather than to a particular source training window.

This appendix provides the complete sensitivity analysis of the distillation weight $\alpha$. Our goal is to understand how strongly resource-privileged supervision should be injected into the traffic-only student, and whether this choice is tied to a particular target environment or instead reflects an intrinsic property of the student backbone. In the joint objective

$$\mathcal{L}_{total}=(1-\alpha)\mathcal{L}_{cls}+\alpha\mathcal{L}_{kd},\tag{7}$$

the weight $\alpha\in[0,1]$ controls how much optimization pressure is assigned to the resource-induced inter-class topology, relative to hard-label discrimination. When $\alpha$ is too small, the student receives little structural supervision from the resource teacher and largely degenerates to ordinary traffic-only ERM. When $\alpha$ is too large, the student may be forced to fit soft topological constraints that exceed the representational capacity of its traffic-side feature space.
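As a concrete reading of the joint objective, the following pure-Python sketch evaluates the classification term, the temperature-scaled KL term, and their $\alpha$-weighted sum for a single sample. It is an illustration only: the function names are hypothetical, it works on plain lists of logits rather than a training framework, and a real training loop would average over a mini-batch and stop gradients through the frozen teacher.

```python
import math
from typing import List, Sequence, Tuple


def log_softmax(logits: Sequence[float], tau: float = 1.0) -> List[float]:
    """Numerically stable log-softmax of logits / tau."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_norm for z in scaled]


def joint_loss(
    z_s: Sequence[float],  # student logits from encrypted traffic
    z_t: Sequence[float],  # frozen teacher logits from resource features
    y: int,                # ground-truth class index
    alpha: float,          # distillation weight in [0, 1]
    tau: float,            # softmax temperature
) -> Tuple[float, float, float]:
    """Return (L_total, L_cls, L_kd) for one sample."""
    # Hard-label cross-entropy on the student at temperature 1.
    l_cls = -log_softmax(z_s)[y]
    # KL(teacher || student) on temperature-softened distributions, scaled by tau^2.
    log_q_t = log_softmax(z_t, tau)
    log_q_s = log_softmax(z_s, tau)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_q_t, log_q_s))
    l_kd = tau ** 2 * kl
    return (1 - alpha) * l_cls + alpha * l_kd, l_cls, l_kd


# Setting alpha = 0 recovers ordinary traffic-only training on hard labels.
total, l_cls, l_kd = joint_loss([2.0, 0.0, -1.0], [0.0, 1.0, 3.0], y=0, alpha=0.0, tau=4.0)
assert total == l_cls
```

The $\tau^{2}$ factor follows the distillation loss as written in Algorithm 1; it keeps the magnitude of the soft-target term comparable across temperatures.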
To characterize this trade-off, we perform a full $\alpha$ scan over all six student backbones under temporal drift from two independent training datasets. We additionally verify the same trend under spatial drift from the source training dataset. Figure [9](https://arxiv.org/html/2606.17462#A6.F9) reports the temporal-drift scan and shows that the best single $\alpha$ may shift slightly across training and testing datasets, but each backbone exhibits a stable best-performing range. This range is primarily governed by student capacity rather than the specific training window. DF, Tik-Tok, and RF benefit most from moderate distillation weights, whereas Var-CNN and CountMamba benefit from medium-to-high distillation weights; AWF is the most sensitive and can degrade when the distillation term dominates.

This pattern indicates that $\alpha$ should be interpreted as a capacity-matching parameter on the student side. The resource teacher’s inter-class topology must ultimately be compressed into a traffic-only representation space. A higher-capacity student can absorb this structural prior while preserving hard-label decision boundaries; a lower-capacity student has a smaller representation budget and is more prone to objective interference between $\mathcal{L}_{cls}$ and $\mathcal{L}_{kd}$.

Var-CNN Capacity Scaling. To further isolate the effect of model capacity, we keep the residual topology of Var-CNN fixed and scale only the channel width. We then repeat the $\alpha$ scan under 150-day temporal drift. This controlled experiment removes architectural differences and concentrates the comparison on student capacity itself. Table [9](https://arxiv.org/html/2606.17462#A6.T9) summarizes the best $\alpha$ range for each width, where the best range denotes weights that remain close to the capacity-specific optimum and clearly outperform the $\alpha=0$ baseline.

Table 9. Capacity scaling analysis for Var-CNN under 150-day temporal drift. The best $\alpha$ range denotes distillation weights that remain close to the best result for each width and outperform the $\alpha=0$ baseline.

| Var-CNN Width | $\alpha=0$ | Best $\alpha$ Range | F1 in Best Range | Max Gain |
|---|---|---|---|---|
| $1\times$ | 72.77 | 0.1–0.7 | 80.25–82.22 | +9.45 pp |
| $0.5\times$ | 74.36 | 0.1–0.6 | 80.31–80.94 | +6.58 pp |
| $0.25\times$ | 72.18 | 0.2–0.4 | 76.82–78.35 | +6.17 pp |
| $0.125\times$ | 70.35 | 0.1–0.3 | 73.14–74.95 | +4.60 pp |

Table [9](https://arxiv.org/html/2606.17462#A6.T9) shows that capacity controls both the ceiling of distillation gains and the best-performing range of $\alpha$. The full-width Var-CNN has the widest best range: for $\alpha=0.1$–$0.7$, it maintains 80.25%–82.22% macro-F1 and reaches a maximum gain of 9.45 percentage points. Reducing the width to $0.5\times$ still preserves a broad best range of $\alpha=0.1$–$0.6$, but the maximum gain drops to 6.58 percentage points. Further reducing the width to $0.25\times$ and $0.125\times$ narrows the best ranges to $0.2$–$0.4$ and $0.1$–$0.3$, respectively, while the maximum gains decrease to 6.17 and 4.60 percentage points. Under the same teacher, training data, and residual topology, smaller students therefore convert less privileged resource supervision into peak robustness gains and exhibit narrower near-optimal $\alpha$ ranges.

This result does not imply that low-capacity students cannot benefit from resource supervision. Rather, they can absorb only a limited strength of teacher topology. For high-capacity students, the KD term mainly acts as a structural regularizer that reshapes decision boundaries without overwhelming hard-label discrimination. For low-capacity students, an overly large $\alpha$ allocates too much of the representation budget to matching the teacher distribution, turning the KD term from useful regularization into an optimization constraint beyond the student’s capacity.
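One way to make the notion of a best $\alpha$ range operational is sketched below. The text does not state how close to the optimum a weight must be to count, so the `tolerance` parameter is an assumption, as is reporting the range as the smallest and largest qualifying weight. The scan values in the example are invented placeholders, not results from the paper.

```python
from typing import Dict, Optional, Tuple


def best_alpha_range(
    scan: Dict[float, float], tolerance: float
) -> Optional[Tuple[float, float]]:
    """Return (min, max) of alpha values whose F1 beats the alpha=0 baseline
    and lies within `tolerance` points of the best distilled result.

    `scan` maps each distillation weight to its F1 and must contain alpha=0.0.
    Returns None when no distilled setting beats the baseline.
    """
    baseline = scan[0.0]
    distilled = {a: f1 for a, f1 in scan.items() if a > 0.0}
    if not distilled:
        return None
    best = max(distilled.values())
    keep = sorted(
        a for a, f1 in distilled.items()
        if f1 > baseline and best - f1 <= tolerance
    )
    return (keep[0], keep[-1]) if keep else None


# Hypothetical scan for illustration only (not values reported in the paper).
toy_scan = {0.0: 70.0, 0.1: 73.0, 0.3: 75.0, 0.5: 74.5, 0.9: 69.0}
print(best_alpha_range(toy_scan, tolerance=2.0))  # (0.1, 0.5)
print(best_alpha_range(toy_scan, tolerance=1.0))  # (0.3, 0.5)
```

A tighter tolerance narrows the reported range, so any comparison of range widths across student sizes should hold this threshold fixed.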
Takeaways. The distillation weight $\alpha$ should be viewed as the strength of resource-privileged supervision matched to student capacity. Architectural differences affect the exact optimum, but the underlying mechanism is whether the student has sufficient representation budget to internalize the resource teacher’s inter-class topology. In practice, we tune $\alpha$ once per backbone on the source-domain validation set and keep it fixed across all target environments; the reported robustness gains do not rely on target-domain retuning. This analysis also clarifies the applicability boundary of ResAware: when the student has adequate capacity and the correspondence between resource structure and traffic observations remains stable, moderate or large $\alpha$ can substantially improve cross-environment generalization; when student capacity is limited, $\alpha$ should be reduced to avoid over-distillation; when drift disrupts the cross-modal correspondence itself, $\alpha$ should be further reduced or the model should fall back to traffic-only training.