# Multi-Quantile Regression for Extreme Precipitation Downscaling
Source: [https://arxiv.org/html/2605.12762](https://arxiv.org/html/2605.12762)
Hamed Najafi, Florida International University, Miami, FL (hnaja002@fiu.edu) · Gareth Lagerwall, Everglades Foundation, Miami, FL (glagerwall@evergladesfoundation.org) · Jayantha Obeysekera, Florida International University, Miami, FL (jobeysek@fiu.edu) · Jason Liu, Florida International University, Miami, FL (liux@fiu.edu)
###### Abstract
Deep super-resolution networks for precipitation downscaling achieve strong bulk skill yet systematically under-predict the heavy-tail events that drive flood risk. The natural remedy, augmenting training with synthetic extremes, turns out fruitless. We argue that the obstacle is the loss, not the data: under intensity-weighted MAE, real and synthetic labels at the same input are simply averaged, so augmentation shifts the predicted mean rather than the conditional distribution. We resolve this with Q-SRDRN, a multi-quantile super-resolution network trained with pinball loss at τ ∈ {0.50, 0.95, 0.99, 0.999}. Two CNN-specific design choices make this practical: *IncrementBound* enforces monotonicity while preserving each quantile channel's gradient identity, and separate per-quantile output heads give the bulk and tail detectors independent filter banks. Under this design, data augmentation (cVAE) becomes complementary rather than harmful: the median head absorbs synthetic extreme patterns without contaminating the upper quantiles. Empirically, on Florida (convective, tropical-cyclone-dominated) the un-augmented Q-SRDRN P999 head detects 1,598 of 2,111 events at 200 mm/day versus 88 for the deterministic baseline, an 18× detection-rate gain (4.2% → 75.7%), with 63% lower KL divergence and 3.9% lower RMSE. Additionally, 83 cVAE-generated samples lift the P50 channel from 14 to 1,038 hits at 200 mm/day (74×). On California (atmospheric-river-dominated), the architecture alone reaches near-perfect detection (P999 SEDI ≥ 0.996 through 300 mm/day). On a Texas substate, the architecture-only effect sharpens further: the deterministic baseline catches only 2 of 10,720 events at 200 mm/day while the P999 head catches 8,776 (81.9%).
Interestingly, the FL-tuned cVAE does not transfer to California or the Texas substate, exposing augmentation itself as a region-specific component. Multi-quantile regression captures extremes wherever the large-scale signal is strong; cVAE augmentation, when its architecture matches the regime, rescues the median where it is not.
## 1 Introduction
Precipitation downscaling, which maps coarse reanalysis fields to high-resolution observations, is a routine input to flood-risk modeling, infrastructure planning, and operational hydrology. Modern super-resolution networks match the bulk of observed precipitation distributions well (Vandal et al., [2017](https://arxiv.org/html/2605.12762#bib.bib26); Sha et al., [2020](https://arxiv.org/html/2605.12762#bib.bib24); Wang et al., [2021](https://arxiv.org/html/2605.12762#bib.bib28)), yet their predictions degrade systematically at the heaviest events that matter most for downstream risk. On Florida ERA5/PRISM data, a state-of-the-art deterministic baseline (Wang et al., [2021](https://arxiv.org/html/2605.12762#bib.bib28)) over-predicts the climatological mean by 43% (5.94 vs. 4.14 mm/day) while detecting only 88 of 2,111 observed events at ≥ 200 mm/day (4.2% probability of detection, POD). The textbook fix, enlarging the training set with synthetic extremes from a generative model, also fails: performance degrades rather than improves as augmentation grows.
The failure is a property of the loss, not the data. An intensity-weighted MAE converges to a weighted conditional median (Koenker and Hallock, [2001](https://arxiv.org/html/2605.12762#bib.bib14)): at a pixel where 99% of days are light, the optimal prediction sits near the bulk, and a synthetic extreme is simply averaged into the median. The natural alternative is pinball-loss multi-quantile regression (Koenker and Hallock, [2001](https://arxiv.org/html/2605.12762#bib.bib14); Bremnes, [2004](https://arxiv.org/html/2605.12762#bib.bib4)), which assigns one output per quantile level and penalizes under-prediction τ/(1−τ) times more heavily than over-prediction (a 999× asymmetry at τ = 0.999).
Two CNN-specific obstacles must be addressed simultaneously: per-pixel sorting (the textbook non-crossing fix) scrambles the gradient identity of each output channel so no convolutional filter specializes, and a shared Conv2D(4) output layer forces bulk and tail detectors to compete for the same filter bank. We resolve both with *IncrementBound*, a cumulative-softplus construction that pins each output channel to a fixed quantile, combined with *separate* Conv2D(1) heads. An unexpected payoff is that the P50 head, trained with unweighted pinball loss, converges to the true conditional median and is robust to a small fraction of synthetic extremes, so multi-quantile regression and augmentation, orthogonal in isolation, become complementary.
We validate on three U.S. domains: Florida (convective, tropical-cyclone), California (atmospheric-river), and a Texas Gulf-Coast substate (mixed convective/tropical-cyclone landfall). cVAE augmentation is reported only on Florida because the same FL-tuned generator architecture mildly harms California's P50 channel and was therefore not run on the Texas substate, a finding we treat as a separate result on the regime-specificity of augmentation (§[4.5](https://arxiv.org/html/2605.12762#S4.SS5)).
Overall, our contributions are threefold. First, we identify a CNN-specific failure mode in sort-based quantile monotonicity (per-pixel permutations destroy spatial filter specialization) and resolve it with Q-SRDRN, which combines IncrementBound with separate Conv2D(1) heads. Without any augmentation, P999 detection at 200 mm/day on Florida rises from 4.2% to 75.7% POD (88 to 1,598 events). Second, we validate the same architecture and hyperparameters across three climatologically distinct domains. On California (atmospheric-river-dominated) the un-augmented architecture reaches P999 SEDI ≥ 0.996 at every threshold up to 300 mm/day; on a Texas substate the un-augmented P999 head catches 8,776 of 10,720 events at 200 mm/day (81.9%, vs. 0.02% for the deterministic baseline) and 1,441 of 2,265 at 300 mm/day. Finally, we show that multi-quantile regression is what makes cVAE augmentation useful on Florida: under pinball loss the median is shielded from the label-averaging that poisons MAE-trained networks, lifting Florida P50 SEDI at 200 mm/day from 0.405 to 0.866 and reducing KL divergence by 50%.
## 2 Related Work
**Statistical and learning-based downscaling.** Statistical downscalers, including bias correction and spatial disaggregation (Wood et al., [2004](https://arxiv.org/html/2605.12762#bib.bib27)), generalized linear models (Chandler and Wheater, [2002](https://arxiv.org/html/2605.12762#bib.bib7)), and analog methods (Zorita and von Storch, [1999](https://arxiv.org/html/2605.12762#bib.bib30)), are calibrated by construction but cannot exploit high-dimensional reanalysis fields. Deep learning approaches, including CNNs (Vandal et al., [2017](https://arxiv.org/html/2605.12762#bib.bib26); Baño-Medina et al., [2020](https://arxiv.org/html/2605.12762#bib.bib1)), super-resolution residual networks (Wang et al., [2021](https://arxiv.org/html/2605.12762#bib.bib28)), and U-Nets (Sha et al., [2020](https://arxiv.org/html/2605.12762#bib.bib24)), lift bulk skill substantially but inherit the mean-seeking bias of their training objectives, whether pixel-wise MAE/MSE point losses or expected-value parametric likelihoods (e.g., the Bernoulli–gamma NLL used for precipitation in Baño-Medina et al., [2020](https://arxiv.org/html/2605.12762#bib.bib1)); this is the very failure mode this paper diagnoses.
Stochastic extensions sidestep mean-seeking by sampling rather than by changing the loss: GANs (Harris et al., [2022](https://arxiv.org/html/2605.12762#bib.bib12); Price and Rasp, [2022](https://arxiv.org/html/2605.12762#bib.bib21)) produce ensemble samples, and diffusion-based samplers, whether for km-scale regional downscaling (CorrDiff (Mardani et al., [2025](https://arxiv.org/html/2605.12762#bib.bib17)), 2 km Taiwan) or global ensemble forecasting (GenCast (Price et al., [2025](https://arxiv.org/html/2605.12762#bib.bib22)), 0.25° latitude–longitude), extend ensemble sampling to calibrated probabilistic distributions at the cost of an iterative multi-step generation process per realization (roughly 12 denoising iterations for CorrDiff and 39 network evaluations per timestep for GenCast). Foundation-scale weather models (Pathak et al., [2022](https://arxiv.org/html/2605.12762#bib.bib20); Bi et al., [2023](https://arxiv.org/html/2605.12762#bib.bib2); Nguyen et al., [2023](https://arxiv.org/html/2605.12762#bib.bib19)) target global forecasting and are complementary in scope. Q-SRDRN differs from the sampling-based family by replacing the loss: a single forward pass returns a calibrated discrete CDF at every pixel, including a dedicated extreme-tail quantile.
**Quantile regression and monotonic networks.** Pinball-loss quantile regression for precipitation has a long history (Bremnes, [2004](https://arxiv.org/html/2605.12762#bib.bib4); Cannon, [2011](https://arxiv.org/html/2605.12762#bib.bib5)). Non-crossing has been enforced through several mechanisms: cumulative softplus increments between successive quantile heads (Cannon, [2018](https://arxiv.org/html/2605.12762#bib.bib6)), structurally monotonic feed-forward networks built from positive weights or integrated positive functions (Daniels and Velikova, [2010](https://arxiv.org/html/2605.12762#bib.bib8); Wehenkel and Louppe, [2019](https://arxiv.org/html/2605.12762#bib.bib29); Brando et al., [2022](https://arxiv.org/html/2605.12762#bib.bib3)), and constrained quantile increments inside recurrent forecasters (Liu et al., [2023](https://arxiv.org/html/2605.12762#bib.bib16)). Each of these approaches assumes a one-to-one mapping between output unit and quantile, which holds in fully-connected and recurrent decoders but breaks in spatial CNNs, where a single convolutional filter is reused at every pixel. We diagnose the resulting CNN-specific failure mode, in which per-pixel sort permutations scramble the gradient identity of each channel (Sec. [4.3](https://arxiv.org/html/2605.12762#S4.SS3)), and resolve it with IncrementBound combined with separate Conv2D(1) output heads, which together preserve a fixed gradient-to-channel routing. Sampling-based uncertainty estimators (Gal and Ghahramani, [2016](https://arxiv.org/html/2605.12762#bib.bib10); Lakshminarayanan et al., [2017](https://arxiv.org/html/2605.12762#bib.bib15); Vandal et al., [2018](https://arxiv.org/html/2605.12762#bib.bib18)) are a probabilistic alternative but, as our comparisons show (§[4.4](https://arxiv.org/html/2605.12762#S4.SS4)), collapse to a quantile-invariant predictive variance at the precipitation tails.
**Data augmentation for imbalanced geophysical data.** Variational autoencoders (Kingma and Welling, [2014](https://arxiv.org/html/2605.12762#bib.bib13)) and their conditional extensions (Sohn et al., [2015](https://arxiv.org/html/2605.12762#bib.bib25)), alongside GANs, have been used to synthesize weather fields (Ravuri et al., [2021](https://arxiv.org/html/2605.12762#bib.bib23)), but augmenting training sets with such samples to address the under-represented precipitation tail remains underexplored. Existing work treats the generator as a drop-in data source independent of the supervised loss. We show that the *interaction* between augmented samples and the loss, rather than the quality of the synthetic samples, determines whether augmentation helps or hurts: the same cVAE samples that destabilize an MAE-trained network are absorbed by a calibrated multi-quantile predictor without contaminating the upper quantile heads. Architecture and augmentation are therefore inseparable: the former determines whether the latter helps or harms.
## 3 Method
### 3.1 Problem Setup
Table 1: All three study domains share the same 15 ERA5 input variables over ~12,400 training days (1980–2013). The Texas substate covers a 6° × 6° Gulf-Coast box (28–34°N, 98–92°W).

We downscale daily ERA5 reanalysis at ~25 km resolution to PRISM precipitation at ~4 km. Let $x\in\mathbb{R}^{C\times H\times W}$ be the coarse input over C = 15 ERA5 channels: 2-meter temperature with daily extrema and soil temperature (t2m, mx2t, mn2t, stl1); 10-meter wind (u10, v10); precipitation totals with daily extremes (tp, cp, mxtpr, mntpr); cloud cover at three levels (lcc, tcc, hcc); and surface runoff and evaporation (sro, e). Let $y\in\mathbb{R}_{\geq 0}^{H'\times W'}$ be the fine-resolution PRISM target. A land mask $m\in\{0,1\}^{H'\times W'}$ restricts loss and evaluation to the $N_{\text{land}}=\sum_{h,w}m_{h,w}$ land pixels. A deterministic downscaler returns a single point estimate $\hat{y}_{h,w}$; we instead predict a discrete conditional CDF comprising four quantile estimates $\hat{q}_{\tau,h,w}$ per pixel for $\tau\in\mathcal{T}=\{0.50, 0.95, 0.99, 0.999\}$. Table [1](https://arxiv.org/html/2605.12762#S3.T1) summarizes the three study domains, which share the same input variables and training period but differ sharply in extreme-precipitation regime.
### 3.2 SRDRN Backbone
All models in this paper share a common feature extractor: the SRDRN of Wang et al. ([2021](https://arxiv.org/html/2605.12762#bib.bib28)), a convolutional super-resolution network that bilinearly upsamples x to the PRISM grid, refines it through 16 residual blocks with SpatialDropout2D, and recovers fine-scale detail through asymmetric pixel-shuffle upsampling; we call the resulting per-pixel feature map the *shared backbone features*. The deterministic baseline maps these features to ŷ through a single 9×9 Conv2D(1) head trained with intensity-weighted MAE, $w=\mathrm{clip}(y/11.7,\,0.1,\,2.0)$, where 11.7 mm/day is the training-set standard deviation of wet-day precipitation and the weight saturates at y = 23.4 mm/day. Q-SRDRN reuses this backbone unchanged, replacing only the output head and the loss; any performance difference is therefore attributable to the quantile design, not to feature extraction.
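For concreteness, the baseline's intensity weighting can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the paper's implementation (which is a TensorFlow network); `intensity_weight` and `weighted_mae` are illustrative names.

```python
import numpy as np

def intensity_weight(y, sigma_wet=11.7):
    """w = clip(y / sigma_wet, 0.1, 2.0).

    sigma_wet is the training-set standard deviation of wet-day
    precipitation (mm/day); the weight therefore saturates at
    y = 2 * sigma_wet = 23.4 mm/day and floors at 0.1 for dry pixels.
    """
    return np.clip(y / sigma_wet, 0.1, 2.0)

def weighted_mae(y, y_hat, mask):
    """Intensity-weighted MAE averaged over land pixels only."""
    w = intensity_weight(y)
    return float(np.sum(mask * w * np.abs(y - y_hat)) / np.sum(mask))
```

Because the weight is a bounded function of the label, a rare 200 mm/day day is weighted at most 20× more than a dry day, which is far too little to move the conditional median, consistent with the failure mode diagnosed above.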
### 3.3 Q-SRDRN: Multi-Quantile Regression
Turning the deterministic head into a multi-quantile predictor exposes four coupled problems: (1) an asymmetric per-quantile loss; (2) sufficient gradient signal at the upper tail despite ~1% extreme-day prevalence; (3) non-crossing monotonicity in τ that does not destroy spatial filter specialization in a CNN; and (4) per-channel gradient routing so the median and P999 detectors do not compete.
**(1) Pinball loss.** For quantile τ and residual $e=y-\hat{q}_\tau$, the pinball loss (Koenker and Hallock, [2001](https://arxiv.org/html/2605.12762#bib.bib14); Bremnes, [2004](https://arxiv.org/html/2605.12762#bib.bib4)) $\rho_\tau(e)=\max(\tau e,\,(\tau-1)e)$ penalizes under-prediction τ/(1−τ) times more heavily than over-prediction (99× at τ = 0.99). Aggregated over $\mathcal{T}=\{0.50, 0.95, 0.99, 0.999\}$ and land pixels:
$$\mathcal{L}=\sum_{\tau\in\mathcal{T}}\frac{1}{N_{\text{land}}}\sum_{h,w}m_{h,w}\,w_\tau\,\rho_\tau\!\bigl(y_{h,w}-\hat{q}_{\tau,h,w}\bigr).\tag{1}$$
**(2) Per-quantile event weighting.** $w_\tau=1+\alpha\,\mathbf{1}[y>z_{\text{thresh}}]$ with α = 5 on the upper-tail heads (P95/P99/P999) upweights extreme days by 6×; $z_{\text{thresh}}$ is the 75th-percentile precipitation in log1p z-score units. The P50 head is exempt ($w_{0.50}=1$) so its convergence to the true conditional median, the property that lets it absorb augmentation without bias (§[3.4](https://arxiv.org/html/2605.12762#S3.SS4)), is preserved.
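The loss of Eq. (1) with the event weighting above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the training code: the stacked `(quantile, H, W)` layout and the function names are our assumptions.

```python
import numpy as np

TAUS = [0.50, 0.95, 0.99, 0.999]

def pinball(e, tau):
    """rho_tau(e) = max(tau*e, (tau-1)*e): under-prediction (e > 0)
    costs tau/(1-tau) times more than the same over-prediction."""
    return np.maximum(tau * e, (tau - 1.0) * e)

def multi_quantile_loss(y, q_hat, mask, z_thresh, alpha=5.0):
    """Eq. (1): masked pinball loss summed over the four heads.

    q_hat[k] is the k-th quantile map (same shape as y). The upper-tail
    heads (P95/P99/P999) upweight days with y > z_thresh by 1 + alpha;
    the P50 head keeps w = 1 so it converges to the conditional median.
    """
    n_land = mask.sum()
    total = 0.0
    for k, tau in enumerate(TAUS):
        w = 1.0 if tau == 0.50 else 1.0 + alpha * (y > z_thresh)
        total += float((mask * w * pinball(y - q_hat[k], tau)).sum() / n_land)
    return total
```

The asymmetry is visible directly: at τ = 0.99, under-predicting by 1 mm costs 0.99 while over-predicting by 1 mm costs only 0.01.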
**(3) IncrementBound: gradient-stable monotonicity.** The textbook non-crossing fix, per-pixel sorting of output channels, is destructive on a CNN: sorting induces a per-pixel permutation, so the same convolutional filter receives P999-scale gradients at some pixels and P50-scale gradients at others, preventing any filter from specializing. Adapting the cumulative-softplus construction of Cannon ([2018](https://arxiv.org/html/2605.12762#bib.bib6)) pins each output channel to a fixed quantile identity:
$$\hat{q}_{0.50}=8\tanh(r_0/8),\qquad\hat{q}_{\tau_k}=\hat{q}_{\tau_{k-1}}+\mathrm{softplus}(r_k),\quad k=1,2,3.\tag{2}$$

This gives $\hat{q}_{0.50}\leq\hat{q}_{0.95}\leq\hat{q}_{0.99}\leq\hat{q}_{0.999}$ by construction, fixes the k-th channel's gradient identity, and lets the upper quantiles grow without an explicit cap (a physical 1,200 mm cap is applied only in post-processing).
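A minimal NumPy sketch of the IncrementBound mapping in Eq. (2); the function name and the stacked-channel layout are illustrative, not the paper's TensorFlow implementation.

```python
import numpy as np

def softplus(r):
    # Numerically stable softplus: log(1 + exp(r)) = logaddexp(0, r).
    return np.logaddexp(0.0, r)

def increment_bound(r):
    """Map raw channels r[0..3] to monotone quantiles (Eq. 2):

        q_50      = 8 * tanh(r0 / 8)
        q_{tau_k} = q_{tau_{k-1}} + softplus(r_k),  k = 1, 2, 3.

    Since softplus > 0, non-crossing holds at every pixel, and channel k
    always owns quantile k: its gradient identity is fixed, whereas
    per-pixel sorting would reroute gradients pixel by pixel.
    """
    q = [8.0 * np.tanh(r[0] / 8.0)]
    for k in range(1, 4):
        q.append(q[-1] + softplus(r[k]))
    return np.stack(q)  # shape (4, H, W): P50 <= P95 <= P99 <= P999
```

Note that only the median channel is bounded (|q₅₀| ≤ 8 in normalized units); the upper channels can grow arbitrarily, matching the "no explicit cap" design above.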
**(4) Separate output heads.** A shared Conv2D(4) output layer forces all four quantiles into the same filter bank, but P50 (≈ 0 at most pixels) and P999 (rare extreme firings) need fundamentally different spatial detectors. We use four independent Conv2D(1) heads (one 9×9 kernel each), concatenated and passed through IncrementBound, so every quantile gets its own dedicated spatial detector.
Figure 1: Q-SRDRN architecture. The shared backbone (16 residual blocks) feeds four separate Conv2D(1) output heads. IncrementBound enforces monotonicity via cumulative softplus increments (Eq. [2](https://arxiv.org/html/2605.12762#S3.E2)). P50 receives pure pinball loss; P95/P99/P999 receive event-weighted pinball loss (α = 5).
### 3.4 cVAE Augmentation
In regimes where extreme spatial patterns are sparse in the historical record, the P50 head is data-starved at the thresholds that matter most. Synthetic augmentation is the natural remedy and, unlike under MAE, it works here because the upper-tail heads no longer rely on the median to reach extremes, so the median can absorb augmented samples without contaminating them. We use a conditional VAE (Sohn et al., [2015](https://arxiv.org/html/2605.12762#bib.bib25)), building on the variational autoencoder framework of Kingma and Welling ([2014](https://arxiv.org/html/2605.12762#bib.bib13)), that learns a stochastic ERA5 → PRISM mapping with a 256-dim Gaussian latent z; skip connections are omitted (forcing the decoder to use z), GroupNorm is used for train/eval consistency, and $\beta_{\text{KL}}=0.008$ keeps the posterior informative. For large output grids the fully-connected decoder becomes a bottleneck (e.g., California's spatial correlation is 0.707 on 76K pixels), so we add a *Spatial cVAE* variant that removes the FC layer and upsamples through convolutional blocks with AdaIN z-conditioning, raising California's correlation to 0.835. Florida uses 83 synthetic samples (~0.67% of training; the cliff to harm is near ~1%); California uses 385 samples (~3.1%, tolerable because the upper-tail channel is augmentation-independent in this regime). Full architecture, training, and augmentation budgets are in Appendix [B](https://arxiv.org/html/2605.12762#A2).
## 4 Experiments
### 4.1 Experimental Design
Our experiments are organized as a single 2×2 factorial study that cleanly separates the architecture contribution from the augmentation contribution. We evaluate four model configurations per domain, the cross of {SRDRN, Q-SRDRN} with {no augmentation, cVAE augmentation}, so that every comparison either fixes the loss family and varies the data, or fixes the data and varies the loss family. All models share the same backbone, ERA5 inputs, and training/test splits. Q-SRDRN additionally uses separate output heads, IncrementBound, and per-quantile event weighting (α = 5). Test-set metrics are reported throughout.
At each precipitation threshold T we form the 2×2 contingency table with a = hits (ŷ > T and y > T), b = false alarms (ŷ > T and y ≤ T), c = misses (ŷ ≤ T and y > T), and d = correct rejections. The Probability of Detection, also called the hit rate, POD = a/(a+c), is the share of observed extreme events the model captures, and the False Alarm Ratio, FAR = b/(a+b), is the share of forecast events that do not occur. Neither score is informative on its own at extreme thresholds ("always wet" achieves POD = 1 and "always dry" achieves FAR = 0, both with zero skill), so we report them only where the absolute event count is itself informative. The Symmetric Extremal Dependence Index (SEDI, Ferro and Stephenson, [2011](https://arxiv.org/html/2605.12762#bib.bib9)) combines POD and the false-alarm rate b/(b+d) into a single skill score that converges to a non-degenerate limit as the base rate → 0, making it our headline extreme-event metric. Kullback–Leibler (KL) divergence measures distributional fidelity, while the Continuous Ranked Probability Score (CRPS) and interval coverage (available only for Q-SRDRN) assess calibration quality.
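The contingency metrics above can be sketched as follows; `contingency` and `sedi` are illustrative helper names, with SEDI in the Ferro and Stephenson (2011) form built from the hit rate H = a/(a+c) and the false-alarm rate F = b/(b+d).

```python
import numpy as np

def contingency(y_hat, y, T):
    """2x2 contingency counts at threshold T over evaluated pixels."""
    a = np.sum((y_hat > T) & (y > T))     # hits
    b = np.sum((y_hat > T) & (y <= T))    # false alarms
    c = np.sum((y_hat <= T) & (y > T))    # misses
    d = np.sum((y_hat <= T) & (y <= T))   # correct rejections
    return a, b, c, d

def sedi(a, b, c, d):
    """Symmetric Extremal Dependence Index:
    [ln F - ln H - ln(1-F) + ln(1-H)] / [ln F + ln H + ln(1-F) + ln(1-H)],
    with H = a/(a+c) and F = b/(b+d). Undefined when H or F is 0 or 1,
    so degenerate 'always wet'/'always dry' forecasts score no skill."""
    H = a / (a + c)
    F = b / (b + d)
    num = np.log(F) - np.log(H) - np.log(1 - F) + np.log(1 - H)
    den = np.log(F) + np.log(H) + np.log(1 - F) + np.log(1 - H)
    return num / den
```

A skillful table such as (a, b, c, d) = (9, 1, 1, 89) gives H = 0.9 with a low false-alarm rate, yielding SEDI close to 1.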
### 4.2 Architecture: SRDRN vs. Q-SRDRN across three domains
Table 2: Bulk fit (central-tendency channels). P95, P99, and P999 heads target levels above the observation by construction. **Bold** is best in row.

We begin with the architecture-only comparison: SRDRN's single output versus the four heads of Q-SRDRN, both trained on the original (un-augmented) ERA5/PRISM data, on Florida, California, and the Texas substate. This isolates what the multi-head pinball architecture buys before any augmentation is introduced; cVAE-augmented results, which apply only to Florida, are deferred to §[4.5](https://arxiv.org/html/2605.12762#S4.SS5). Bulk fit (RMSE, Pearson r, KL divergence) is summarised in Table [2](https://arxiv.org/html/2605.12762#S4.T2) for the central-tendency predictors (SRDRN and the Q-SRDRN P50 head); upper-quantile heads (P95/P99/P999) target distributional levels above the observation by design, so RMSE/Pearson/KL are not informative for them and we score them through their natural extreme-detection metrics. Table [3](https://arxiv.org/html/2605.12762#S4.T3) reports SEDI and hit counts across all four Q-SRDRN heads at thresholds matched to each head's climatological exceedance level (P95 at 10–20 mm, P99 at 30–60 mm, P999 at ≥ 70 mm). P50 and matched-head SEDI envelopes are visualised in Appendix [J](https://arxiv.org/html/2605.12762#A10) and [K](https://arxiv.org/html/2605.12762#A11).
Table 3: Architecture comparison (without augmentation). SRDRN (single output) vs. Q-SRDRN. **Bold** = best SEDI per threshold within each domain.

| Model | SEDI @50 mm | @100 mm | @150 mm | @200 mm | @300 mm | Hits @200 mm | Hits @300 mm |
|---|---|---|---|---|---|---|---|
| **Florida** (n₂₀₀ = 2,111; n₃₀₀ = 75) | | | | | | | |
| SRDRN | 0.572 | 0.694 | 0.725 | 0.564 | 0.000 | 88 | 0 |
| Q-SRDRN P50 | 0.522 | 0.688 | 0.607 | 0.405 | 0.000 | 14 | 0 |
| Q-SRDRN P95 | 0.712 | 0.739 | 0.831 | 0.831 | 0.000 | 899 | 0 |
| Q-SRDRN P99 | **0.778** | 0.799 | 0.852 | 0.896 | 0.352 | 1,309 | 1 |
| Q-SRDRN P999 | 0.774 | **0.859** | **0.879** | **0.932** | **0.710** | 1,598 | 19 |
| **California** (n₂₀₀ = 537; n₃₀₀ = 15) | | | | | | | |
| SRDRN | 0.823 | 0.777 | 0.755 | 0.705 | 0.000 | 54 | 0 |
| Q-SRDRN P50 | 0.826 | 0.824 | 0.825 | 0.833 | 0.794 | 179 | 3 |
| Q-SRDRN P95 | 0.972 | 0.975 | 0.983 | 0.980 | **1.000** | 484 | 15 |
| Q-SRDRN P99 | 0.988 | 0.992 | 0.994 | 0.994 | **1.000** | 521 | 15 |
| Q-SRDRN P999 | **0.989** | **0.996** | **0.997** | **0.998** | **1.000** | 534 | 15 |
| **Texas substate** (n₂₀₀ = 10,720; n₃₀₀ = 2,265) | | | | | | | |
| SRDRN | 0.609 | 0.550 | 0.462 | 0.240 | 0.000 | 2 | 0 |
| Q-SRDRN P50 | 0.601 | 0.604 | 0.534 | 0.263 | 0.000 | 7 | 0 |
| Q-SRDRN P95 | 0.847 | 0.815 | 0.784 | 0.794 | 0.558 | 4,143 | 160 |
| Q-SRDRN P99 | **0.899** | 0.882 | 0.871 | 0.868 | 0.758 | 6,307 | 695 |
| Q-SRDRN P999 | 0.836 | **0.935** | **0.934** | **0.942** | **0.891** | 8,776 | 1,441 |

**Bulk performance.** Q-SRDRN's P50 channel attains lower RMSE, higher Pearson r, and lower KL divergence than SRDRN on all three domains (Table [2](https://arxiv.org/html/2605.12762#S4.T2)). The largest improvement is on the heaviest-tailed regime (Texas substate: RMSE −10.6%, KL 0.176 → 0.014); on California, where SRDRN is already well-calibrated by virtue of the atmospheric-river regime, the gain is smaller (RMSE −1.6%). The diagnostic is sharpest on Florida, where SRDRN's 43% over-prediction bias (predicted mean 5.94 vs. observed 4.14 mm/day) inflates extreme-threshold hits at the cost of distributional fidelity (KL 0.124 vs. 0.046).
**P50 extreme detection.** The three domains tell different stories at the extreme tail. On California, the P50 head alone already dominates SRDRN at every threshold above 100 mm: SEDI at 200 mm rises from 0.705 to 0.833 (+18%), and at 300 mm, where SRDRN is blind, P50 reaches 0.794.
On Florida and the Texas substate the comparison reveals a calibration–detection trade-off. Q-SRDRN's P50 without augmentation is more conservative at extreme thresholds than SRDRN (FL SEDI at 200 mm: 0.405 vs. 0.564), because the true conditional median is low for most pixels while SRDRN's bulk over-prediction inflates hits at the cost of distributional fidelity (KL 0.124 vs. 0.046). In short, SRDRN buys SEDI at extreme thresholds with bulk over-prediction, whereas Q-SRDRN's P50 buys distributional calibration at the cost of extreme-day exposure. The multi-head design resolves the trade-off without choosing: each remaining head is the optimal detector for the threshold range matching its climatological exceedance level, and the matched-head envelope dominates SRDRN at every threshold on all three domains (Table [3](https://arxiv.org/html/2605.12762#S4.T3), P95–P999 rows; matched-head envelope in Appendix [K](https://arxiv.org/html/2605.12762#A11)).
**P999 detection across the three domains.** The P999 head delivers the headline lift on every domain. On Florida it raises POD at 200 mm from 4.2% (88 hits) to 75.7% (1,598 hits) and detects 19 of 75 events at 300 mm where SRDRN is blind (SEDI = 0.710). On California it reaches near-perfect detection (SEDI ≥ 0.996 through 300 mm; 534 of 537 events at 200 mm; 15 of 15 at 300 mm), reflecting California's atmospheric-river regime, which provides a clearer large-scale signal than Florida's scattered convection. The Texas substate is the most extreme demonstration: SRDRN catches 2 of 10,720 events at 200 mm and zero of 2,265 at 300 mm, while the Q-SRDRN P999 head catches 8,776 (POD = 81.9%, SEDI = 0.942) and 1,441 (POD = 63.6%, SEDI = 0.891) respectively. This regime has ~5× more 200 mm events than Florida and ~30× more at 300 mm, so the deep-tail detection signal is statistically resolved in a way Florida and California cannot match.
**Head selection provides the operational advantage.** Beyond raw skill numbers, the multi-head design lets operational users select the quantile head matching their risk tolerance: P50 (agriculture/water management), P95 (urban drainage), P99 (flood warning), P999 (emergency management). The spread between P50 and P999 also encodes tail risk directly: two pixels with an identical 10 mm at P50 carry vastly different flood risk if their P999 values are 20 mm versus 220 mm. Such information is invisible to any deterministic model.
### 4.3 IncrementBound and Separate Heads
Having established the gain from specialized heads, we next isolate the two architectural choices that make that specialization possible\. Both target a single failure mode: the loss of gradient identity in spatial quantile heads\. In this section, we discuss the diagnosis, the fix, and a capacity\-control ablation that rules out the most natural alternative explanation\.
**Problem:** Multi-quantile regression requires that predicted quantiles never cross: $\hat{q}_{0.5}\leq\hat{q}_{0.95}\leq\hat{q}_{0.99}\leq\hat{q}_{0.999}$ for every pixel. Without enforcement, a CNN can freely predict $\hat{q}_{0.5}>\hat{q}_{0.99}$, a mathematically invalid output that violates the definition of a quantile function ($F^{-1}(\tau)$ is non-decreasing in τ). The standard fix is to sort the output channels, $\hat{\mathbf{q}}=\mathrm{sort}(\mathbf{r})$, where $\mathbf{r}$ are the raw network outputs. Sorting guarantees valid quantiles but introduces a permutation that potentially changes at every pixel. During backpropagation, the P999 pinball gradient flows to whichever raw channel happened to be largest, a different channel at each pixel and at each forward pass. Thus, no channel can specialize: the P999 head receives P50-scale gradients half the time and vice versa. Enforcing quantile monotonicity destroys gradient identity.
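A tiny NumPy illustration of the pathology (the array values are invented for the example): per-pixel sorting yields valid non-crossing quantiles, but the raw channel that lands in the P999 slot differs from pixel to pixel, so the P999-scale gradient would be routed to a different convolutional filter at each pixel.

```python
import numpy as np

# Four raw output channels at two pixels, shape (channels, pixels).
r = np.array([[0.3, 2.0],    # channel 0
              [1.5, 0.1],    # channel 1
              [0.9, 0.4],    # channel 2
              [0.1, 1.1]])   # channel 3

q_sorted = np.sort(r, axis=0)       # valid, non-crossing quantiles
order = np.argsort(r, axis=0)       # which raw channel fills each slot
p999_source = order[-1]             # raw channel acting as P999 per pixel
```

Here `p999_source` is `[1, 0]`: channel 1 plays the P999 role at pixel 0 but channel 0 plays it at pixel 1, so no single filter ever owns the P999 detector. IncrementBound removes the permutation entirely, which is the subject of the fix below.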
**Solution:** We replace tf.sort with the cumulative-softplus parameterisation of Eq. [2](https://arxiv.org/html/2605.12762#S3.E2), which guarantees monotonicity by construction while preserving a fixed gradient identity for each quantile: the P999 channel always receives P999 gradients. Combined with four separate Conv2D(1) output heads (eliminating filter competition in a shared output layer), this yields a significant improvement.
A capacity-control ablation (Appendix [A](https://arxiv.org/html/2605.12762#A1)) rules out the alternative hypothesis that P999 detection is parameter-limited: adding a 259K-parameter Conv2D(64)+PReLU expansion before the P999 output collapses SEDI@300 mm from 0.735 to −0.210 (zero hits) and P50 SEDI@200 mm from 0.866 to 0.000. The 300 mm ceiling is set by the 25 km ERA5 input resolution, not by model capacity.
### 4.4 Comparison Against Uncertainty-Quantification Baselines
To verify that Q-SRDRN's gains are not reproducible by simpler probabilistic schemes, we ablate against four external baselines that fix the backbone, training data, and 83-sample cVAE augmentation and vary only the predictive head. (i) mc_dropout (Gal and Ghahramani, [2016](https://arxiv.org/html/2605.12762#bib.bib10)) extracts empirical P50/P95/P99/P999 from K = 30 stochastic forward passes; the predictive variance turns out to be approximately quantile-invariant, so all upper quantiles collapse onto roughly the same wet-day band, giving an empirical P999 exceedance of 25.5% on Florida (250× above the 0.10% target) and 14.4% on California (144×). (ii) single_head_qr (Cannon, [2011](https://arxiv.org/html/2605.12762#bib.bib5)) is a shared-head QR-CNN in the standard QRNN lineage, with post-hoc np.sort non-crossing applied at inference; it matches Q-SRDRN on bulk fit and on deep-tail P999 detection, so the cost of shared-head sorting is paid in the predictive-*interval shape*, not in bulk skill or event hits.
The gradient\-permutation pathology of §[4\.3](https://arxiv.org/html/2605.12762#S4.SS3)widens the P99−\-P50 spread to3838–138138mm vs\.88–2222mm for Q\-srdrn\+cvaeand inflates KL by41%41\\%\(0\.0320\.032vs\.0\.0230\.023\), so the predicted quantiles cannot be used as calibrated decision intervals\.\(iii\)deep\_ensemble\(Lakshminarayanan et al\.,[2017](https://arxiv.org/html/2605.12762#bib.bib15)\)quantile\-aggregatesM=5M\{=\}5deterministic SRDRNs \(rather than the dual\-head Gaussian\-NLL form of the original paper\), so the ensemble spread captures only epistemic model uncertainty and not heteroscedastic data noise; it consequently reproduces the same quantile\-invariant variance asmc\_dropoutvia a different mechanism, with25\.0%25\.0\\%empirical P999 exceedance on Florida\.\(iv\)multi\_seedruns five seeds of Q\-srdrnitself \(\{11,23,47,71,97\}\\\{11,23,47,71,97\\\}\); all five land within a tight band \(RMSE8\.640±0\.0428\.640\\,\\pm\\,0\.042, P999 SEDI@200 mm0\.928±0\.0040\.928\\,\\pm\\,0\.004\) confirming that the headline numbers are not a lucky seed\.
Sampling-based tail failure is structural, not a tuning artifact. mc_dropout's weak Florida bulk fit (RMSE 16.7 vs. Q-srdrn's 8.7, Pearson 0.33 vs. 0.63) has a clear mechanistic origin rather than a tuning origin: applying dropout inside a CNN acts as an aggressive spatial regularizer that smears Florida's localized convective cells, which degrades any pixel-wise mean-field metric. Two independent comparisons confirm this attribution. First, the same mc_dropout configuration is essentially bulk-competitive with Q-srdrn on California's smoother synoptic/orographic precipitation (RMSE 2.99 vs. 2.88, Pearson 0.84 vs. 0.85), so the Florida bulk gap is a climatology-dependent regularization effect, not a hyperparameter failure. Second, deep_ensemble aggregates deterministic SRDRNs *without* internal dropout and therefore recovers Q-srdrn's bulk RMSE and Pearson everywhere (8.62/0.65 on Florida, 2.73/0.86 on California). Yet whether the bulk fit is poor (mc_dropout, Florida), competitive (mc_dropout, California), or matched (deep_ensemble, both regions), all three settings collapse to the *same* τ-invariant predictive variance and a near-identical ~25% Florida and ~13–14% California P999 exceedance, 144–250× above the 0.10% target. The inability to capture heavy-tail precipitation is therefore a structural property of sampling-based UQ in deep learning (epistemic spread alone cannot stretch to cover heavy-tailed targets) and not an artifact of either spatial blurring or sub-optimal baseline configuration.
The full table and per-baseline diagnostics, including the CA subset and the multi-seed calibration check, are in Appendix [F](https://arxiv.org/html/2605.12762#A6). In one sentence: explicit per-quantile pinball regression directly optimizes the target loss at each τ and is not substitutable by post-hoc empirical-quantile wrappers around point predictors.
### 4.5 cvae Augmentation
A region-specific cvae *generator* was trained on each of the three domains, but the cvae *architecture and input-variable set* were tuned only for Florida and reused unchanged on California and the Texas substate. Under this configuration, augmentation *helped* Florida but *mildly harmed* California's P50 channel (P50 SEDI@200 mm 0.833 → 0.739). We therefore present results only for Florida (Table [4](https://arxiv.org/html/2605.12762#S4.T4)).
Table 4: cvae augmentation on Florida. The 83-sample cvae augmentation set is the same one used by the headline Q-srdrn+cvae runs throughout the paper. **Bold** = best in row. P999 rows are reported only for Q-srdrn configurations because srdrn is single-output. California and Texas-substate cvae-augmented entries are omitted; see the discussion above for why. *Bulk and P50 detection.* Q-srdrn+cvae attains the best RMSE (−5.4%) and KL (−81%); augmentation lifts the P50 head from 14 to 1,038 hits at 200 mm (74×) and SEDI@200 mm from 0.405 to 0.866, while the matched-head P999 channel is essentially unchanged (SEDI@200 mm 0.932 → 0.922; hits@300 mm 19 → 23): augmentation concentrates its value in the median channel. The asymmetry against srdrn follows from the loss: MAE amplifies label conflicts (a 20× gradient on a synthetic 170 mm pixel), whereas pinball treats real and synthetic pixels equally, so 83 synthetic days among 12,400 shift the median by <0.6% (full mechanism in Appendix [D](https://arxiv.org/html/2605.12762#A4)).
*Augmentation is region-specific.* The lesson is that the augmentation architecture and input-channel set are themselves regime-dependent: the convective (FL), atmospheric-river (CA), and tropical-cyclone-coastal (Texas substate) regimes likely require different generator backbones, latent-dim budgets, and conditioning variables. Multi-quantile regression, by contrast, transfers across all three domains without retuning.
## 5 Discussion
Florida and California are not independent benchmarks; they are two halves of the same contribution probed from opposite directions, with the Texas substate adding a third architecture\-only confirmation\. The remainder of this section reads the empirical record through that decomposition: the relative roles of architecture and augmentation across the three regimes, the quality of the predictive distribution itself, and the limits beyond which the framework cannot extend\.
### 5.1 Three Domains, Two Roles for the Two Components
The three domains isolate the two halves of the contribution. California (atmospheric-river regime) is the architecture-only test bed: un-augmented Q-srdrn already reaches P999 SEDI ≥ 0.99 at every threshold through 300 mm/day and P50 SEDI = 0.794 at 300 mm. Florida (sparse scattered convection) is the rescue test bed: extreme spatial patterns are too rare for the median head to learn from real data alone, and 83 synthetic cvae samples (~0.67% of training) lift P50 SEDI@200 mm from 0.405 to 0.866. The Texas substate is a third architecture-only confirmation on a heavier-tailed regime (n₂₀₀ = 10,720, n₃₀₀ = 2,265). The full per-component decomposition is in Appendix [I](https://arxiv.org/html/2605.12762#A9) (Table [7](https://arxiv.org/html/2605.12762#A9.T7)); the pattern is consistent across all three domains, with the CA P50 entry telling the most informative story: augmentation *decreases* P50 SEDI@200 mm by 0.094 because synthetic samples smooth the conditional median toward the population mean in a regime that did not need them, while seasonal RMSE (Table [8](https://arxiv.org/html/2605.12762#A12.T8) in Appendix [L](https://arxiv.org/html/2605.12762#A12)) reinforces this: Q-srdrn+cvae wins seasonally on FL but only sporadically on CA. We therefore recommend Q-srdrn (no augmentation) as the operational California and Texas-substate model and Q-srdrn+cvae as the operational Florida model: the two components are complementary, not additive.
### 5.2 Probabilistic Calibration Diagnostics
Beyond point skill, the predictive distribution itself is calibrated: all eight head × domain combinations show positive calibration-gap closure (5–24%, mean 13%); FL [P50, P99] interval coverage is 46.9% (target 49%) and CA is 48.0%. CRPS skill by intensity band mirrors the regime split: cvae improves CRPS in every FL band ≥ 10 mm (peak +17.7% at ≥ 200 mm) but flips to −29.5% on CA's heaviest band. Sharpness pinpoints the mechanism. CRPS combines a sharpness term and a calibration term, and the CA loss is in calibration: cvae narrows the predictive spread on *both* domains' extreme bands (sharpness improves), but on CA the narrower distribution concentrates around a biased median (P50 SEDI −0.094, Table [7](https://arxiv.org/html/2605.12762#A9.T7); cf. §[5.1](https://arxiv.org/html/2605.12762#S5.SS1)): overconfidence, not skill. Full reliability, CRPS, and sharpness diagrams are in Appendix [G](https://arxiv.org/html/2605.12762#A7).
### 5.3 Limitations
**ERA5 resolution wall.** At 300 mm/day only 75 events exist in Florida's 11.6M pixel-day test set, and a capacity-expansion ablation (Appendix [A](https://arxiv.org/html/2605.12762#A1)) collapses detection entirely: the 25 km ERA5 forcing cannot resolve the meso-γ-scale dynamics that produce ultra-extreme point rainfall. Higher-resolution forcing (HRRR 3 km, 1 km RCM) is the natural next step; the architecture is input-resolution agnostic.
**Further limits.** cvae generator smoothness vs. diffusion, the residual P999 calibration gap (15.12× wet-day on FL), and transfer beyond precipitation are detailed in Appendix [H](https://arxiv.org/html/2605.12762#A8). Headline numbers are single-seed; FL seed variance is bounded by the 5-seed multi_seed comparator (Appendix [F](https://arxiv.org/html/2605.12762#A6)); the CA multi_seed run lands in the camera-ready.
### 5.4 Broader Impacts and Reproducibility
**Broader impacts.** Improved downscaling supports flood-risk, water, agricultural, and climate-adaptation planning in hurricane- and atmospheric-river-prone regions; risks are limited to over-trusting the median or applying the model outside the training domains/period (1980–2013 ERA5–PRISM): the calibrated quantile heads, not a point prediction, are the intended interface. **Reproducibility.** ERA5 and PRISM are public; full code (training scripts, pinball + IncrementBound, postprocessing, cvae pipeline, figure scripts) is released on acceptance. Each Q-srdrn run is 12 h (FL and TX) / 22 h (CA) on one H100 (80 GB); hyperparameters are in Appendix [E](https://arxiv.org/html/2605.12762#A5).
## 6 Conclusion
MAE-trained downscalers fail at extremes because their conditional median cannot reach the upper tail. Replacing the loss with multi-quantile pinball regression, gradient-decoupling the heads via IncrementBound and separate output convolutions, and, only where the regime requires it, adding a small cvae-generated extreme set lifts P999 detection at 200 mm/day from 88 to 1,598 events on FL (18×), to SEDI ≥ 0.996 through 300 mm on CA, and from 2 of 10,720 to 8,776 (81.9%) on the Texas substate, all without scaling up the network. The dominant bottleneck at extremes is the loss, not the network or the regime.
## Acknowledgments and Disclosure of Funding
Omitted for double\-blind review\.
## References
- Baño-Medina et al. [2020] J. Baño-Medina, R. Manzanas, and J. M. Gutiérrez. Configuration and intercomparison of deep learning neural models for statistical downscaling. *Geoscientific Model Development*, 13(4):2109–2124, 2020. DOI: [10.5194/gmd-13-2109-2020](https://doi.org/10.5194/gmd-13-2109-2020).
- Bi et al. [2023] K. Bi, L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian. Accurate medium-range global weather forecasting with 3D neural networks. *Nature*, 619(7970):533–538, 2023. DOI: [10.1038/s41586-023-06185-3](https://doi.org/10.1038/s41586-023-06185-3).
- Brando et al. [2022] A. Brando, J. Gimeno, J. A. Rodríguez-Serrano, and J. Vitrià. Deep non-crossing quantiles through the partial derivative. In *Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS)*, volume 151 of *PMLR*, pages 7902–7914, 2022. arXiv: [2201.12848](https://arxiv.org/abs/2201.12848).
- Bremnes [2004] J. B. Bremnes. Probabilistic forecasts of precipitation in terms of quantiles using NWP model output. *Monthly Weather Review*, 132(1):338–347, 2004. DOI: [10.1175/1520-0493(2004)132<0338:PFOPIT>2.0.CO;2](https://doi.org/10.1175/1520-0493(2004)132%3C0338:PFOPIT%3E2.0.CO;2).
- Cannon [2011] A. J. Cannon. Quantile regression neural networks: Implementation in R and application to precipitation downscaling. *Computers & Geosciences*, 37(9):1277–1284, 2011. DOI: [10.1016/j.cageo.2010.07.005](https://doi.org/10.1016/j.cageo.2010.07.005).
- Cannon [2018] A. J. Cannon. Non-crossing nonlinear regression quantiles by monotone composite quantile regression neural network, with application to rainfall extremes. *Stochastic Environmental Research and Risk Assessment*, 32(11):3207–3225, 2018. DOI: [10.1007/s00477-018-1573-6](https://doi.org/10.1007/s00477-018-1573-6).
- Chandler and Wheater [2002] R. E. Chandler and H. S. Wheater. Analysis of rainfall variability using generalized linear models: a case study from the west of Ireland. *Water Resources Research*, 38(10):1192, 2002. DOI: [10.1029/2001WR000906](https://doi.org/10.1029/2001WR000906).
- Daniels and Velikova [2010] H. Daniels and M. Velikova. Monotone and partially monotone neural networks. *IEEE Transactions on Neural Networks*, 21(6):906–917, 2010. DOI: [10.1109/TNN.2010.2044803](https://doi.org/10.1109/TNN.2010.2044803).
- Ferro and Stephenson [2011] C. A. T. Ferro and D. B. Stephenson. Extremal dependence indices: Improved verification measures for deterministic forecasts of rare binary events. *Weather and Forecasting*, 26(5):699–713, 2011. DOI: [10.1175/WAF-D-10-05030.1](https://doi.org/10.1175/WAF-D-10-05030.1).
- Gal and Ghahramani [2016] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In *Proceedings of the 33rd International Conference on Machine Learning (ICML)*, volume 48 of *PMLR*, pages 1050–1059, 2016.
- Gneiting and Raftery [2007] T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. *Journal of the American Statistical Association*, 102(477):359–378, 2007. DOI: [10.1198/016214506000001437](https://doi.org/10.1198/016214506000001437).
- Harris et al. [2022] L. Harris, A. T. T. McRae, M. Chantry, P. D. Dueben, and T. N. Palmer. A generative deep learning approach to stochastic downscaling of precipitation forecasts. *Journal of Advances in Modeling Earth Systems*, 14(10):e2022MS003120, 2022. DOI: [10.1029/2022MS003120](https://doi.org/10.1029/2022MS003120).
- Kingma and Welling [2014] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In *International Conference on Learning Representations (ICLR)*, 2014. arXiv: [1312.6114](https://arxiv.org/abs/1312.6114).
- Koenker and Hallock [2001] R. Koenker and K. F. Hallock. Quantile regression. *Journal of Economic Perspectives*, 15(4):143–156, 2001. DOI: [10.1257/jep.15.4.143](https://doi.org/10.1257/jep.15.4.143).
- Lakshminarayanan et al. [2017] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In *Advances in Neural Information Processing Systems (NeurIPS)*, pages 6402–6413, 2017. arXiv: [1612.01474](https://arxiv.org/abs/1612.01474).
- Liu et al. [2023] H. Liu, S. Zhu, and L. Mo. A novel daily runoff probability density prediction model based on simplified minimal gated memory–non-crossing quantile regression and kernel density estimation. *Water*, 15(22):3947, 2023. DOI: [10.3390/w15223947](https://doi.org/10.3390/w15223947).
- Mardani et al. [2025] M. Mardani, N. Brenowitz, Y. Cohen, J. Pathak, C.-Y. Chen, C.-C. Liu, A. Vahdat, M. A. Nabian, T. Ge, A. Subramaniam, K. Kashinath, J. Kautz, and M. Pritchard. Residual corrective diffusion modeling for km-scale atmospheric downscaling. *Communications Earth & Environment*, 6:124, 2025. DOI: [10.1038/s43247-025-02042-5](https://doi.org/10.1038/s43247-025-02042-5).
- Nguyen et al. [2023] T. Nguyen, J. Brandstetter, A. Kapoor, J. K. Gupta, and A. Grover. ClimaX: A foundation model for weather and climate. In *Proceedings of the 40th International Conference on Machine Learning (ICML)*, pages 25904–25938, 2023. arXiv: [2301.10343](https://arxiv.org/abs/2301.10343).
- Pathak et al. [2022] J. Pathak, S. Subramanian, P. Harrington, S. Raja, A. Chattopadhyay, M. Mardani, T. Kurth, D. Hall, Z. Li, K. Azizzadenesheli, P. Hassanzadeh, K. Kashinath, and A. Anandkumar. FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv preprint, 2022. arXiv: [2202.11214](https://arxiv.org/abs/2202.11214).
- Price and Rasp [2022] I. Price and S. Rasp. Increasing the accuracy and resolution of precipitation forecasts using deep generative models. In *Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS)*, volume 151 of *PMLR*, pages 10555–10571, 2022.
- Price et al. [2025] I. Price, A. Sanchez-Gonzalez, F. Alet, T. R. Andersson, A. El-Kadi, D. Masters, T. Ewalds, J. Stott, S. Mohamed, P. Battaglia, R. Lam, and M. Willson. Probabilistic weather forecasting with machine learning. *Nature*, 637(8044):84–90, 2025. DOI: [10.1038/s41586-024-08252-9](https://doi.org/10.1038/s41586-024-08252-9).
- Ravuri et al. [2021] S. Ravuri, K. Lenc, M. Willson, D. Kangin, R. Lam, P. Mirowski, M. Fitzsimons, M. Athanassiadou, S. Kashem, S. Madge, R. Prudden, A. Mandhane, A. Clark, A. Brock, K. Simonyan, R. Hadsell, N. Robinson, E. Clancy, A. Arribas, and S. Mohamed. Skilful precipitation nowcasting using deep generative models of radar. *Nature*, 597(7878):672–677, 2021. DOI: [10.1038/s41586-021-03854-z](https://doi.org/10.1038/s41586-021-03854-z).
- Sha et al. [2020] Y. Sha, D. J. Gagne II, G. West, and R. Stull. Deep-learning-based gridded downscaling of surface meteorological variables in complex terrain. Part I: Daily maximum and minimum 2-m temperature. *Journal of Applied Meteorology and Climatology*, 59(12):2057–2073, 2020. DOI: [10.1175/JAMC-D-20-0057.1](https://doi.org/10.1175/JAMC-D-20-0057.1).
- Sohn et al. [2015] K. Sohn, X. Yan, and H. Lee. Learning structured output representation using deep conditional generative models. In *Advances in Neural Information Processing Systems (NeurIPS)*, pages 3483–3491, 2015.
- Vandal et al. [2017] T. Vandal, E. Kodra, S. Ganguly, A. Michaelis, R. Nemani, and A. R. Ganguly. DeepSD: Generating high resolution climate change projections through single image super-resolution. In *Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)*, pages 1663–1672, 2017. DOI: [10.1145/3097983.3098004](https://doi.org/10.1145/3097983.3098004).
- Vandal et al. [2018] T. Vandal, E. Kodra, J. Dy, S. Ganguly, R. Nemani, and A. R. Ganguly. Quantifying uncertainty in discrete-continuous and skewed data with Bayesian deep learning. In *Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD)*, pages 2377–2386, 2018. DOI: [10.1145/3219819.3219996](https://doi.org/10.1145/3219819.3219996).
- Wang et al. [2021] F. Wang, D. Tian, L. Lowe, L. Kalin, and J. Lehrter. Deep learning for daily precipitation and temperature downscaling. *Water Resources Research*, 57(4):e2020WR029308, 2021. DOI: [10.1029/2020WR029308](https://doi.org/10.1029/2020WR029308).
- Wehenkel and Louppe [2019] A. Wehenkel and G. Louppe. Unconstrained monotonic neural networks. In *Advances in Neural Information Processing Systems (NeurIPS)*, pages 1543–1553, 2019. arXiv: [1908.05164](https://arxiv.org/abs/1908.05164).
- Wood et al. [2004] A. W. Wood, L. R. Leung, V. Sridhar, and D. P. Lettenmaier. Hydrologic implications of dynamical and statistical approaches to downscaling climate model outputs. *Climatic Change*, 62(1–3):189–216, 2004. DOI: [10.1023/B:CLIM.0000013685.99609.9e](https://doi.org/10.1023/B:CLIM.0000013685.99609.9e).
- Zorita and von Storch [1999] E. Zorita and H. von Storch. The analog method as a simple statistical downscaling technique: comparison with more complicated methods. *Journal of Climate*, 12(8):2474–2489, 1999. DOI: [10.1175/1520-0442(1999)012<2474:TAMAAS>2.0.CO;2](https://doi.org/10.1175/1520-0442(1999)012%3C2474:TAMAAS%3E2.0.CO;2).
## Appendix A Architectural Ablation Details
We add architectural changes one at a time, evaluating each step against the previous configuration. Step (i) is the baseline: Q-srdrn with `tf.sort` for monotonicity and a shared Conv2D(4) output layer. Steps (ii)–(iv) below progressively replace these components, each in isolation, with backbone, loss, data, and epochs held identical across the chain.
#### Step \(ii\): IncrementBound\.
We replace `tf.sort` with the cumulative-softplus parameterisation of Eq. [2](https://arxiv.org/html/2605.12762#S3.E2). The prior tanh bound on r₀ becomes unnecessary because monotonicity is now guaranteed by construction; upper quantiles grow without an explicit cap (physical caps of 1,200 mm are applied only in post-processing).
P999 detection improves at all thresholds ≥ 200 mm: POD at 300 mm nearly doubles (0.133 → 0.253) and SEDI rises from 0.567 to 0.708 (+24.9%). A side effect: P50 SEDI at 150–250 mm also improves (e.g., +22% at 250 mm), because consistent gradient routing lets each channel's filters specialize rather than averaging the four quantile signals.
#### Step \(iii\): Separate output heads\.
We replace the shared Conv2D(4, kernel=9) with four independent Conv2D(1, kernel=9) heads, preserving the total parameter count. P999 SEDI at 300 mm improves further to 0.735 (+3.8%); P50 SEDI at 200 mm jumps from 0.684 to 0.866 (+26.6%); KL divergence reaches its best value of 0.023.
#### Step \(iv\): Deeper P999 head \(FAILED\)\.
We add a Conv2D(64, kernel=3)+PReLU expansion before the P999 output head (259K added parameters). P999 SEDI at 300 mm collapses from 0.735 to −0.210 (zero hits). The failure has three coupled causes: the 512 → 64 channel reduction loses spatial information, the 259K extra parameters trained on ~0.1% of pixel-days learn a conservative mapping, and the degraded P999 gradients propagate back through IncrementBound into the shared backbone. The result confirms that the 300 mm detection limit is imposed by ERA5's 25 km input resolution, not by model capacity.
## Appendix B Augmentation Experiments Summary
Across 23 srdrn augmentation experiments, only one configuration (83 samples, a 0.67% ratio) succeeded. Key failures include:
- Higher ratios: 2.1% (Corr −0.007), 1.5% (RMSE +0.16, SEDI@100 −0.08), 9.4% (catastrophic collapse).
- Two-stage fine-tuning: fine-tuning the SRDRN trained on non-augmented data with augmented data. RMSE = 10.81; pred max = 1,919 mm. No real-data anchor causes uncontrolled gradient drift.
- Loss modification: raising w_max = 2.0 to w_max = 5.0 (pred mean +72.5%). Amplifies gradients for all events, not just synthetic ones.
For Q-srdrn, the 0.67% ceiling persists: increasing to 186 samples (1.5%) regressed on RMSE (+0.16), Corr (−0.02), KL (+52%), and SEDI@100 (−0.08). The transition is a cliff near ~1%, confirming that the label-averaging effect is the root cause.
## Appendix C Failed Experiments Table
This appendix compiles the negative results referenced throughout the main text, each establishing a hard constraint of the framework. The rows in Table [5](https://arxiv.org/html/2605.12762#A3.T5) fall into three groups: augmentation-ratio failures that locate the ~1% cliff (rows 1, 4, 6); loss and optimisation failures from amplifying tail gradients indiscriminately (rows 2, 5, 7); and capacity-versus-input-resolution failures that show the deep-tail ceiling is set by ERA5 rather than by model depth (rows 3, 8).
Table 5: Negative results establishing hard constraints of the framework.
## Appendix D Why Augmentation Helps P50 but Not P999
For quantile τ, the pinball loss penalizes under-prediction τ/(1−τ) times more heavily than over-prediction (1× at τ = 0.50, rising to 999× at τ = 0.999). This ratio determines how much extreme-event gradient signal each head receives from the real training data alone:
The pattern is monotonic: the higher the asymmetric amplification from the loss function itself, the less the head benefits from additional extreme training examples\.
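The asymmetry can be read directly off the pinball loss ρ_τ(u) = max(τu, (τ−1)u); a small sketch (ours, purely illustrative) of the under- vs. over-prediction penalty ratio per head:

```python
import numpy as np

def pinball(y, q, tau):
    """Pinball (quantile) loss for target y, prediction q, quantile level tau."""
    u = y - q
    return np.maximum(tau * u, (tau - 1.0) * u)

# Penalty for under-predicting by 1 mm vs. over-predicting by 1 mm.
for tau in (0.50, 0.95, 0.99, 0.999):
    under = pinball(1.0, 0.0, tau)  # truth sits above the quantile estimate
    over = pinball(0.0, 1.0, tau)   # truth sits below the quantile estimate
    print(f"tau={tau}: ratio {under / over:.0f}x")
# ratios: 1x, 19x, 99x, 999x — i.e. tau/(1 - tau)
```

This is exactly why the P999 head already extracts a 999×-amplified signal from every real extreme pixel-day, while the P50 head (1×) sees extremes only as rare, ordinary-magnitude gradients.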
#### P50 is blind to extremes without augmentation\.
With a symmetric loss (1×), a 200 mm event produces the same gradient magnitude as a 2 mm event. The P50 head learns "when ERA5 shows a storm, predict ~5–10 mm": accurate for typical days, useless at 200 mm. Adding 83 synthetic extreme days provides the P50 head its only meaningful exposure to extreme-event spatial patterns, explaining the 0.405 → 0.866 SEDI jump at 200 mm.
#### P999 is already saturated from real data\.
With a 999× asymmetric loss, each extreme pixel-day contributes a 999×-amplified gradient. Adding synthetic days increases the effective gradient budget by only ~10%. Moreover, the cvae's spatial correlation is 0.885, not 1.0: every spatial misplacement is amplified 999×, offsetting the marginal benefit with amplified noise. The net result is a near-zero augmentation effect (SEDI at 200 mm: 0.932 → 0.922).
#### Why augmentation helps Q\-SRDRN’s P50 but not SRDRN\.
Both srdrn and P50 predict a median-like quantity, yet augmentation barely helped srdrn (SEDI +2.3%, KL *worsened*) while dramatically improving P50 (SEDI@200 mm +114%, KL −50%). Three mechanisms explain the difference:
1. Intensity weighting amplifies label conflicts in SRDRN. A synthetic 170 mm pixel receives 20× the gradient of a real dry pixel in srdrn, far exceeding its 0.67% data fraction. P50's unweighted pinball loss makes real and synthetic pixels contribute equal gradient, so 83 synthetic days shift the median by <0.6%.
2. Gradient saturation caps SRDRN's extreme-event signal. Events above 23.4 mm produce identical gradients in srdrn. Pinball loss has no saturation: the residual |y − q̂| scales linearly with event severity.
3. Single output vs. separate heads. srdrn's single output must simultaneously serve all purposes. Q-srdrn's separate heads let augmentation improve P50's extreme-day predictions without affecting P999's filters.
#### Scope of the “P999 immune to augmentation” claim\.
The immunity statement holds only in the regime tested here: 83 cvae samples (~0.67% of training), τ = 0.999, FL/CA precipitation, cvae sampling correlation 0.83–0.89. At higher augmentation ratios or with lower-quality cvaes, the 999× amplification becomes a liability rather than an asymptote (Run 11 in Appendix [B](https://arxiv.org/html/2605.12762#A2): the 1.5% ratio regressed all metrics). We do not claim immunity outside this operating point.
## Appendix E Hyperparameters
#### Q\-SRDRN backbone\.
16 residual blocks (each: two Conv2D 3×3 kernels with PReLU and SpatialDropout2D(0.1)), 256 filters, asymmetric pixel-shuffle upsampling (6× lat, 4× lon for FL; 6×6 for CA), four separate Conv2D(1, kernel=9) output heads concatenated and passed through IncrementBound. Backbone weights initialised via He normal.
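The asymmetric pixel-shuffle amounts to a reshape/transpose that trades channel depth for spatial resolution; a NumPy sketch (ours; the paper's implementation and the exact layout convention may differ, and standard `depth_to_space` ops assume equal factors):

```python
import numpy as np

def pixel_shuffle(x, r_lat, r_lon):
    """Rearrange (H, W, C * r_lat * r_lon) -> (H * r_lat, W * r_lon, C)."""
    H, W, D = x.shape
    C = D // (r_lat * r_lon)
    x = x.reshape(H, W, r_lat, r_lon, C)
    x = x.transpose(0, 2, 1, 3, 4)           # interleave the upsampling factors
    return x.reshape(H * r_lat, W * r_lon, C)

# FL-style asymmetric factors: 6x along latitude, 4x along longitude.
up = pixel_shuffle(np.zeros((10, 20, 6 * 4 * 1)), r_lat=6, r_lon=4)
assert up.shape == (60, 80, 1)
```

The point of parameterising upsampling this way is that the convolution produces all sub-pixel values at coarse resolution; no transposed-convolution checkerboarding is introduced.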
#### Training\.
Adam, lr = 10⁻⁵, no scheduler. 160 epochs, batch size 64 (FL). Inputs (the 15 ERA5 surface variables enumerated in §[3.1](https://arxiv.org/html/2605.12762#S3.SS1)) z-normalised per channel; targets log1p-transformed and z-normalised, inverted at evaluation. The loss is the masked pinball objective of Eq. [1](https://arxiv.org/html/2605.12762#S3.E1) with τ ∈ {0.50, 0.95, 0.99, 0.999} and event weight α = 5.0 on the P95/P99/P999 heads (z_thresh = 0.5 in log1p z-score units); the P50 head is unweighted.
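Those loss hyperparameters compose as follows in our reading of the description (a NumPy sketch, not the paper's Eq. 1 code; the normalisation choice and helper names are ours):

```python
import numpy as np

TAUS = np.array([0.50, 0.95, 0.99, 0.999])
ALPHA = 5.0      # event weight on the P95/P99/P999 heads only
Z_THRESH = 0.5   # event threshold, in log1p z-score units

def masked_multi_quantile_loss(y, q, mask):
    """y: (H, W) log1p z-scored target; q: (H, W, 4) quantiles; mask: (H, W) in {0, 1}."""
    u = y[..., None] - q                           # residual per quantile head
    pin = np.maximum(TAUS * u, (TAUS - 1.0) * u)   # pinball loss per head
    w = np.ones_like(pin)
    w[..., 1:] = np.where(y[..., None] > Z_THRESH, ALPHA, 1.0)  # upper heads weighted
    return float((w * pin * mask[..., None]).sum() / (4 * mask.sum()))
```

The P50 head stays unweighted so that real and synthetic pixels contribute equal gradient, which is the property Appendix D leans on.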
#### cVAE\.
PyTorch, latent z ∈ ℝ²⁵⁶, β_KL = 0.008, encoder/decoder share the same 5-block backbone with GroupNorm and no skip connections. Adam, lr = 2×10⁻⁴, 300 epochs, batch size 16. The FL augmentation set is 83 samples drawn from the posterior z, filtered by a min-intensity-ratio ≥ 1.03 over the original PRISM target.
#### Postprocessing\.
Per-channel physical caps (P50/P95: 600 mm; P99/P999: 1,200 mm) applied after IncrementBound; no sort needed (IncrementBound is monotone by construction). Log1p inverse and target denormalization applied last.
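A sketch of that postprocessing order (ours, illustrative; cap values are from the text, `mean`/`std` stand for the target-normalisation scalars). Because log1p and z-scoring are monotone, capping in the transformed space, before the inverse transform, is equivalent to capping in millimetres:

```python
import numpy as np

CAPS_MM = np.array([600.0, 600.0, 1200.0, 1200.0])  # P50, P95, P99, P999

def postprocess(q_norm, mean, std):
    """q_norm: (..., 4) monotone quantiles in log1p z-score space (post-IncrementBound)."""
    caps_norm = (np.log1p(CAPS_MM) - mean) / std      # caps expressed in training space
    q_norm = np.minimum(q_norm, caps_norm)            # per-channel cap; stays monotone
    return np.maximum(np.expm1(q_norm * std + mean), 0.0)  # inverse transform last

huge = postprocess(np.full((1, 4), 50.0), 0.0, 1.0)
assert np.allclose(huge, CAPS_MM)  # over-cap predictions clamp to 600 / 1,200 mm
```

Note the clamp is per channel, so the P50/P95 pair can saturate at 600 mm while P99/P999 still move, without breaking the quantile ordering.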
## Appendix F UQ-Baseline Comparison: Detailed Results
This appendix contains the full results table and per-baseline diagnostics underlying the summary in §[4.4](https://arxiv.org/html/2605.12762#S4.SS4). All four comparators fix Q-srdrn's backbone, optimiser, training data, and 83-sample cvae augmentation, and vary only the predictive head and the mechanism that produces tail quantiles.
Table 6: Florida and California: Q-srdrn alongside complementary uncertainty-quantification designs. Backbone, optimiser, training data, and 83-sample cvae augmentation are held fixed across the four comparators (the identical augmentation set used by the headline Q-srdrn+cvae runs of Table [4](https://arxiv.org/html/2605.12762#S4.T4)); only the head architecture and the mechanism that produces tail quantiles vary. Each comparator answers a different question (does pinball replace MC sampling? does sharing all weights across τ work? does ensembling deterministic predictors reproduce a quantile spread?), so this table is not a head-to-head ranking; no row-best emphasis is applied. The P999 calibration target is 0.10%; values in the 20–30% range indicate the head's predictive variance is approximately constant across τ, i.e., the model cannot distinguish median from extreme. † mc_dropout's low KL reflects an aggregated wet-day distribution match; the per-quantile structure (rows below) is broken. ‡ At 300 mm only n = 75 observed events exist; this +0.04 inversion is within deep-tail sampling noise. § deep_ensemble's 0.5% RMSE / 1.9 p.p. Pearson edge is the expected variance-reduction artefact of averaging 5 deterministic predictors; its upper-quantile rows below show the cost. multi_seed entries report mean ± 1σ across seeds {11, 23, 47, 71, 97}; per-seed empirical-exceedance means are reported without σ (per-seed spread < 0.4 p.p. across all four quantiles). The single-seed Q-srdrn+cvae reference column reports POD and hit counts recomputed directly from its verified test predictions; bulk metrics and the P999-head SEDI/POD/hit rows are reproduced by the multi_seed mean within ≤ 1σ. ∥ California multi_seed data was not ready at the time of submission and will be populated in the camera-ready version.
All other comparators here are trained on the same 83-sample cvae augmentation set used by the reference Q-srdrn+cvae column, so the comparison isolates head architecture and quantile mechanism only.
#### MC\-Dropout: predictive variance is approximately quantile\-invariant\.
The mc_dropout columns in Table [6](https://arxiv.org/html/2605.12762#A6.T6) reveal a single mechanism behind two failure modes. Empirical exceedance rates at P95, P99, and P999 are 26.7%, 25.7%, and 25.5%: essentially identical, and 5–250× above their respective targets. Dropout's predictive variance is a near-constant function of the input rather than a function of τ: every quantile collapses onto roughly the same wet-day band. The same effect drives the bulk-RMSE blow-up: noisy K-pass averaging hallucinates extremes, nearly doubling RMSE from 8.665 to 16.726. This is a strong domain-specific demonstration of the MC-Dropout failure mode flagged by Gal and Ghahramani [[2016](https://arxiv.org/html/2605.12762#bib.bib10)] for tail prediction: *a 250× overshoot at the 99.9th-percentile target on a real precipitation downscaling task*. The same quantile-invariance failure mode reproduces on California (P95/P99/P999 empirical exceedance = 15.0%, 14.5%, 14.4%, i.e., 3×, 14×, 144× above their respective targets), with a milder bulk-RMSE penalty (+4% vs. Q-srdrn+cvae) than on Florida because California's lighter convective tail gives the noisy empirical median less room to hallucinate extremes. Explicit pinball regression is therefore not optional for extreme-quantile prediction.
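The collapse is easy to reproduce in miniature (a toy of ours with synthetic numbers, not the paper's data): with K = 30 roughly Gaussian forward passes whose spread is set by the input rather than by τ, the empirical P95 and P999 land within a fraction of one standard deviation of each other, and against a heavy-tailed truth the "P999" is exceeded orders of magnitude more often than the 0.1% target.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 30  # stochastic forward passes per input

# Dropout-style predictive samples: input-driven mean and spread, no tau-dependence.
mean, sigma = 8.0, 3.0
passes = rng.normal(mean, sigma, size=(50_000, K))
p95 = np.quantile(passes, 0.95, axis=1)
p999 = np.quantile(passes, 0.999, axis=1)

# Both upper quantiles sit in the same narrow band.
assert np.median(p999 - p95) < sigma

# Against a heavy-tailed (lognormal) truth, the 'P999' exceedance blows past 0.1%.
truth = rng.lognormal(2.0, 1.0, size=50_000)
exceedance = (truth > np.median(p999)).mean()
assert exceedance > 0.01  # already >10x the nominal target in this toy
```

The toy mirrors the table's diagnosis: an epistemic spread of fixed width cannot be reshaped into distinct τ-levels, no matter how many passes are drawn.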
#### Single-head QR-CNN: bulk-competitive, but the median cannot reach extremes.

single_head_qr is the strongest comparator on bulk metrics: RMSE within 1.5% of Q-SRDRN, Pearson within 0.011, per-channel calibration within 0.6 p.p. of Q-SRDRN+cVAE, and deep-tail P999 detection at 300 mm slightly above the Q-SRDRN reference in both SEDI (0.774 vs. 0.735) and raw hits (29 vs. 23 of n=75). At n=75 this +0.04 SEDI inversion sits within deep-tail sampling spread; the post-hoc np.sort is sufficient to enforce non-crossing once the network has learned roughly correct quantiles, and the P999 head's own gradient signal does most of the work for ultra-extreme detection regardless of whether channels are shared. The cost of the shared-head tf.sort shows up indirectly in the *shape* of the predictive distribution: KL rises by 41% (0.023 → 0.032), and the sharpness table (Appendix [A](https://arxiv.org/html/2605.12762#A1)) shows P99−P50 medians widening to 38–138 mm for single_head_qr vs. 8–22 mm for Q-SRDRN+cVAE, the empirical signature of the gradient-permutation pathology of §[4.3](https://arxiv.org/html/2605.12762#S4.SS3): per-pixel tf.sort re-routes gradients across channels, so no filter specializes. On California, where the conditional median is far more predictable owing to the atmospheric-river regime (§[4.2](https://arxiv.org/html/2605.12762#S4.SS2)), single_head_qr and Q-SRDRN+cVAE are essentially indistinguishable on bulk metrics (RMSE 2.856 vs. 2.881; Pearson 0.849 vs. 0.848) and on P999 detection (P999 SEDI@200 mm 0.999 vs. 0.993; both reach 15/15 hits at 300 mm).
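For readers less familiar with the two ingredients being compared, a minimal NumPy sketch of multi-τ pinball loss plus the post-hoc sort that enforces non-crossing quantiles (function and variable names are ours, not from the released code):

```python
import numpy as np

# Quantile grid used throughout the paper
TAUS = np.array([0.50, 0.95, 0.99, 0.999])

def pinball_loss(q_pred, y, taus=TAUS):
    """Mean pinball (quantile) loss over samples and quantile channels.

    q_pred: (n, 4) predicted quantiles; y: (n,) observations.
    A positive residual means the quantile is under-predicted.
    """
    resid = y[:, None] - q_pred
    loss = np.maximum(taus * resid, (taus - 1.0) * resid)
    return loss.mean()

def enforce_noncrossing(q_pred):
    """Post-hoc per-sample sort: output quantiles become monotone in tau."""
    return np.sort(q_pred, axis=1)

# Usage: crossed raw outputs are repaired without changing the value set
raw = np.array([[12.0, 10.0, 30.0, 25.0]])   # P95 < P50 and P999 < P99
fixed = enforce_noncrossing(raw)             # [[10., 12., 25., 30.]]
```

Applied after inference, as here, the sort is harmless. Placed *inside* the training graph (the shared-head tf.sort variant), the same operation permutes which channel receives which gradient per pixel, which is the specialization-destroying pathology discussed above.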
#### Deep ensemble: same quantile-invariant variance, different mechanism.

deep_ensemble clusters tightly across seeds (per-member RMSE std = 0.09 mm), and the 5-member mean attains the lowest bulk RMSE in Table [6](https://arxiv.org/html/2605.12762#A6.T6) (8.618 vs. 8.665), the standard variance-reduction artefact of averaging multiple trained models. But upper-quantile calibration collapses in exactly the same shape as mc_dropout: P95/P99/P999 empirical exceedance is 25.6/25.1/25.0%, again 5–250× above target. The mechanism differs (M=5 deterministic predictions cannot span the 0.999 quantile rank by construction), but the symptom is identical, and the pinball CRPS on the [200, ∞) band (68.8) is the worst of any comparator (vs. mc_dropout's 52.8 and single_head_qr's 54.2). P999 SEDI at 300 mm degrades to 0.603, a moderate loss relative to Q-SRDRN+cVAE. Together, mc_dropout and deep_ensemble demonstrate that *empirical-quantile aggregation of point predictors cannot calibrate the deep tail*, regardless of whether the source of spread is dropout noise (K=30) or seed variance (M=5). This is the central architectural argument of the section: explicit per-quantile pinball regression directly optimises the target loss at each τ and is not substitutable by post-hoc empirical-quantile wrappers.
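The "cannot span the 0.999 rank by construction" point is purely combinatorial and easy to verify. With five values, every empirical quantile above rank 4/5 = 0.8 collapses onto the sample maximum under the inverted-CDF definition (the member values below are made up for illustration):

```python
import numpy as np

# Five point predictions from a hypothetical 5-member ensemble
members = np.array([11.0, 12.0, 13.0, 14.0, 15.0])

q95  = np.quantile(members, 0.95,  method="inverted_cdf")
q99  = np.quantile(members, 0.99,  method="inverted_cdf")
q999 = np.quantile(members, 0.999, method="inverted_cdf")

# All three equal 15.0: the ensemble's "deep tail" is just its best member,
# so P95, P99, and P999 share a single exceedance rate -- the
# quantile-invariant collapse reported for deep_ensemble in Table 6.
```

This is why the 25.6/25.1/25.0% exceedance rows are nearly identical: all three nominal quantiles degenerate to the same statistic of the 5-member spread.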
#### Multi-seed: Q-SRDRN+cVAE is a typical seed, not a lucky one.

multi_seed fixes the Q-SRDRN architecture and 83-sample augmentation and varies only the random seed across {11, 23, 47, 71, 97}. All five seeds land within a tight band: RMSE 8.640 ± 0.042 (0.5% spread), Pearson 0.630 ± 0.005, KL 0.022 ± 0.006, P999 SEDI@200 mm 0.928 ± 0.004, and P999 SEDI@300 mm 0.726 ± 0.035. The 5-seed mean is statistically indistinguishable from the Q-SRDRN+cVAE reference column on every metric the architecture is designed to optimise (RMSE within 1σ, Pearson within 0.2σ, KL within 0.2σ, P999 SEDI@300 mm within 0.3σ), and matches or slightly exceeds Q-SRDRN+cVAE on KL (0.022 vs. 0.023). Critically, per-quantile calibration is preserved across every seed: empirical exceedance is 4.32/1.71/0.47% at P95/P99/P999 (target 5/1/0.1%), in the same regime as Q-SRDRN+cVAE and single_head_qr, and nowhere near the ≈25% quantile-invariant collapse of mc_dropout and deep_ensemble. This contrast is mechanism-revealing: averaging *already-calibrated pinball heads* across M=5 seeds preserves quantile structure, whereas extracting empirical quantiles *from the spread of point predictors* (whether dropout passes or ensemble members) destroys it. The headline architectural argument therefore holds across the spread of Q-SRDRN itself.
#### California.

Three of the four comparators are populated in Table [6](https://arxiv.org/html/2605.12762#A6.T6); the multi_seed data was not ready at the time of submission and will be populated in the camera-ready version. Two findings are already visible. First, the mc_dropout quantile-invariance failure mode reproduces on California: P95/P99/P999 empirical exceedance = 15.0%, 14.5%, 14.4% (3×, 14×, 144× above target), with a much smaller bulk-RMSE penalty (+4% vs. Q-SRDRN+cVAE) than on Florida, because California's lighter convective tail gives the noisy empirical median less room to hallucinate extremes; the calibration mechanism is the same, the bulk amplification is regime-driven. Second, deep_ensemble reproduces the same empirical-quantile-aggregation ceiling as on Florida: P95/P99/P999 saturate at 13.6%, 13.3%, 13.3% (133× above the P999 target), confirming that M=5 deterministic predictors cannot span the predictive tail in either regime, even when their per-member bulk fit is competitive (RMSE 2.731 vs. Q-SRDRN+cVAE's 2.881).
## Appendix G Probabilistic Calibration Diagnostics: Detailed Results

This appendix contains the full reliability, CRPS, and sharpness diagrams underlying the calibration summary in §[5.2](https://arxiv.org/html/2605.12762#S5.SS2).
#### Reliability.

Fig. [2](https://arxiv.org/html/2605.12762#A7.F2) reports calibration gap closure from cVAE augmentation for each of the four quantile heads, restricted to wet days (y > 1 mm) to remove the zero-inflation that makes all-sky exceedance uninterpretable for τ = 0.5. Defining the calibration ratio r = Pr(y > q̂_τ)/(1 − τ), r = 1 is perfect calibration and r > 1 means the model under-predicts the quantile. All eight head/domain combinations have r > 1 at baseline, and r widens monotonically with τ (FL: 1.53× at P50, 5.52× at P99, 15.91× at P999; CA: 1.49× at P50, 11.05× at P999), the residual under-prediction that the architectural design of §[3.3](https://arxiv.org/html/2605.12762#S3.SS3) aims to compress. The bar lengths in Fig. [2](https://arxiv.org/html/2605.12762#A7.F2) report the relative reduction in calibration gap, 1 − (r_aug − 1)/(r_base − 1). All 8/8 head × domain combinations show positive gap closure with magnitudes 5–24% (mean 13%; largest on CA P50 at +24% and FL P95 at +16%). Crucially, this is gap closure on top of *not* breaking wet-day calibration on either domain, including California, where augmentation hurts P50 extreme *detection*. The all-sky P50 exceedance of ∼32% on FL and ∼18% on CA is *not* miscalibration: ∼66% (FL) and ∼78% (CA) of pixel-days are dry, so the true conditional median is exactly 0 mm and the all-sky exceedance reduces to the probability of precipitation, not 50%. The [P50, P99] interval coverage confirms joint calibration: FL 46.9% (target 49%), CA 48.0%, both near-ideal.
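The two diagnostics above reduce to a few lines of NumPy. This is an illustrative sketch (helper names are ours); the only paper-derived numbers are the FL P50 annotation values 1.53× → 1.44× read off Fig. 2.

```python
import numpy as np

def calibration_ratio(y, q_hat, tau, wet_mm=1.0):
    """Empirical-to-nominal exceedance ratio r on wet days (y > wet_mm).

    r = 1 is perfect calibration; r > 1 means the quantile is under-predicted.
    """
    wet = y > wet_mm
    return np.mean(y[wet] > q_hat[wet]) / (1.0 - tau)

def gap_closure(r_base, r_aug):
    """Relative reduction in calibration gap; 1.0 means the gap fully closed."""
    return 1.0 - (r_aug - 1.0) / (r_base - 1.0)

# Reading the FL P50 annotation off Fig. 2: 1.53x -> 1.44x
print(round(gap_closure(1.53, 1.44), 3))  # ~0.17, i.e. ~17% of the gap closed
```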
#### CRPS skill and sharpness.

The decisive evidence for the differential augmentation effect is in Fig. [3](https://arxiv.org/html/2605.12762#A7.F3), which reports a CRPS skill score per observed-intensity band, s = (CRPS_base − CRPS_aug)/CRPS_base, where the underlying score is the pinball-CRPS proxy CRPS_proxy = (1/4) Σ_τ ρ_τ(q̂_τ, y), which on an even quantile grid is proportional to CRPS [Gneiting and Raftery, [2007](https://arxiv.org/html/2605.12762#bib.bib11)]. Plotting the skill score (rather than the raw proxy on a log axis, which compresses all relative changes) makes the regime split of Sec. [5.1](https://arxiv.org/html/2605.12762#S5.SS1) quantitative. On Florida, augmentation improves the CRPS proxy in every band ≥ 10 mm, with the largest gain in the heaviest band (56.3 → 46.3 mm/day at obs ≥ 200 mm/day, +17.7% skill). On California the score is positive in the mid-intensity bands [10, 100) (up to +7.8%) but *negative* in the heaviest band (14.3 → 18.5 mm/day, −29.5% skill), direct probabilistic evidence for the same regime-dependent role of cVAE that the threshold-detection table already showed. Fig. [4](https://arxiv.org/html/2605.12762#A7.F4) reports the underlying sharpness directly. Median predictive spread (P99−P50 and P999−P95) grows monotonically with observed intensity on both domains, confirming that the upper quantile heads stretch where the data demands. cVAE augmentation narrows the spread (raises sharpness) on the extreme bands of *both* domains, substantially more on CA than on FL, so sharpness alone does not mirror the CRPS-skill split.
The CA contrast is the informative one: the same augmentation that sharpens the predictive distribution there also costs CRPS skill in the heaviest band, so the additional concentration is uncalibrated rather than informative, consistent with the P50@200 mm SEDI loss in Table [7](https://arxiv.org/html/2605.12762#A9.T7).
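The two scores above can be written down directly; the sketch below is illustrative NumPy with our own helper names, using Fig. 3's heaviest Florida band (56.3 → 46.3 mm/day) as the worked example.

```python
import numpy as np

TAUS = np.array([0.50, 0.95, 0.99, 0.999])

def crps_proxy(q_pred, y, taus=TAUS):
    """Average pinball loss over the quantile grid; on an even grid this is
    proportional to CRPS (Gneiting and Raftery, 2007)."""
    resid = y[:, None] - q_pred
    pinball = np.maximum(taus * resid, (taus - 1.0) * resid)
    return pinball.sum(axis=1).mean() / len(taus)

def skill(crps_base, crps_aug):
    """Positive when augmentation improves (reduces) the proxy."""
    return (crps_base - crps_aug) / crps_base

# Fig. 3's heaviest Florida band: 56.3 -> 46.3 mm/day
print(round(skill(56.3, 46.3), 3))  # ~0.178, i.e. roughly +17.7% skill
```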
Figure 2: Calibration gap closure with cVAE augmentation (wet days, y > 1 mm). Bar length is the relative reduction in calibration gap from cVAE augmentation: gap closure = 1 − (r_aug − 1)/(r_base − 1), where r = Pr(y > q̂_τ)/(1 − τ) is the empirical-to-nominal exceedance ratio (a value of 1 is perfect calibration). Positive (green) bars mean augmentation moves the head toward perfect calibration; negative bars would mean it moves away. Annotations to the left of each bar report the absolute baseline and augmented ratios; e.g. "1.53× → 1.44×" for FL P50 means the wet-day exceedance rate is 1.53× its nominal value without augmentation and 1.44× with it. All 8/8 head × domain combinations show positive gap closure (5–24%, mean 13%). Under-prediction widens monotonically with τ (∼1.5× at P50, ∼16× at P999), motivating IncrementBound and per-quantile event weighting. We plot the gap closure rather than the raw ratios on a log axis because the ratios span 1.5× to 16× and visually crush percentage-level changes from augmentation into invisibility.

Figure 3: CRPS skill score by observed intensity band. Bars show s = (CRPS_base − CRPS_aug)/CRPS_base per band: positive (green) bars mean cVAE augmentation improves the pinball-CRPS proxy; negative (red) bars mean it hurts. Underlying CRPS values (mm/day) are printed beneath each bar so the absolute magnitude is recoverable. On Florida, augmentation produces positive skill in every band ≥ 10 mm/day, peaking at +17.7% in the heaviest band; on California the skill is positive in the mid bands [10, 100) but flips to −29.5% in the heaviest band, a regime-dependent signature consistent with Sec. [5.1](https://arxiv.org/html/2605.12762#S5.SS1). We use the skill score rather than absolute log bars because the proxy spans almost three orders of magnitude across bands, which makes the (small but consequential) percentage changes invisible on a log axis.

Figure 4: Sharpness conditioned on observed intensity. Median predictive spread P99−P50 (circles) and P999−P95 (squares), binned by observed precipitation. Spread grows monotonically with intensity on both domains, confirming that the upper quantile heads stretch where the data demands. cVAE augmentation narrows the predictive spread (raises sharpness) on the extreme bands of *both* domains, and the effect is substantially larger on CA than on FL. Sharpness therefore does not mirror the CRPS-skill split of Fig. [3](https://arxiv.org/html/2605.12762#A7.F3): the same CA sharpening coincides with negative CRPS skill in the heaviest band, so the added concentration is uncalibrated rather than informative (consistent with the P50@200 mm SEDI loss in Table [7](https://arxiv.org/html/2605.12762#A9.T7)).
## Appendix H Limitations: Detailed Discussion

#### Generator quality: cVAE versus conditional diffusion.

The augmentation generator in this paper is a cVAE with sampling spatial correlations of 0.885 on Florida and 0.707 (vanilla) / 0.835 (with AdaIN) on California, the latter still well below 1.0. cVAEs are known to produce smoother, lower-variance samples than state-of-the-art conditional diffusion models because of their Gaussian-posterior parameterisation; spatial misplacements in the synthetic fields are then amplified by the 999× pinball gradient at the deep tail (Appendix [D](https://arxiv.org/html/2605.12762#A4)), which is why cVAE augmentation is neutral on the P999 head in the regime tested here. A higher-fidelity generator (e.g., a physics-constrained conditional diffusion model) could plausibly produce samples crisp enough to lift P99 and P999 detection beyond what real data alone supports. The natural design is a *hybrid*: diffusion is used only offline to expand the training set, while Q-SRDRN retains its single-pass inference. This separates the (expensive) sample-generation step from the (cheap) downscaling step, preserving the computational profile that distinguishes Q-SRDRN from end-to-end diffusion downscalers.
#### Residual calibration at the deep tail.

The reliability diagnostic of §[5.2](https://arxiv.org/html/2605.12762#S5.SS2) closes 5–24% of the calibration gap across all eight head–domain combinations, but the P999 head retains a 15.12× wet-day under-prediction ratio on Florida; the architecture compresses the residual gap without eliminating it. Two gradient-side extensions are natural candidates for closing it further. First, the event weight α = 5 is fixed during training; making α *dynamic* (scaling with the running batch variance of the pinball loss at each head, or with the empirical exceedance gap accumulated over a sliding window) would let the upper-tail heads receive proportionally more gradient when calibration is drifting and less when it is on target. Second, a Focal-Loss-inspired dynamic scaling on the pinball residual (down-weighting easily predicted residuals and up-weighting hard ones) would concentrate optimisation pressure on the deep tail. Both modifications operate at the gradient level only, compose with IncrementBound and the separate-heads architecture without changing head topology, and leave the quantile set 𝒯 intact; we leave their empirical validation to future work.
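To make the two proposed extensions concrete, here is a hypothetical sketch of each. These are future-work ideas, not the trained model's loss; the focal exponent `gamma`, the relative-hardness form of the focal factor, and the exceedance-gap form of the dynamic weight are all our illustrative choices.

```python
import numpy as np

def focal_pinball(q_pred, y, tau, gamma=2.0):
    """Pinball loss scaled by a focal factor that grows with relative residual
    size: hard (deep-tail) cases are up-weighted, already-fit cases damped."""
    resid = y - q_pred
    base = np.maximum(tau * resid, (tau - 1.0) * resid)
    focal = (base / (base.mean() + 1e-8)) ** gamma   # relative hardness weight
    return (focal * base).mean()

def dynamic_alpha(tau, q_pred, y, alpha0=5.0):
    """Scale the fixed event weight alpha0 by the empirical exceedance gap:
    a head whose observed exceedance overshoots the nominal (1 - tau) rate
    (the ratio r of Sec. 5.2) receives proportionally more gradient."""
    gap = np.mean(y > q_pred) / (1.0 - tau)
    return alpha0 * max(gap, 1.0)
```

Both operate purely on the scalar loss, so they would compose with IncrementBound and the separate heads without touching the head topology, as stated above.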
#### Broader applicability.

The Q-SRDRN framework is architecture-agnostic: IncrementBound and per-quantile event weighting can be applied to any CNN backbone (U-Net, diffusion models) and to any right-skewed geophysical variable (wind gusts, streamflow, wave height). Demonstrating transfer to those settings is the clearest route to establishing whether the heavy-tail failure mode we diagnose is general or precipitation-specific.
## Appendix I Per-Component Contributions Table

This appendix decomposes the headline numbers of §[4](https://arxiv.org/html/2605.12762#S4) into the additive contribution of each component: the architecture switch SRDRN → Q-SRDRN (column *Arch.*) and the further cVAE augmentation on top of Q-SRDRN (column *cVAE*). All values in Table [7](https://arxiv.org/html/2605.12762#A9.T7) are single-seed test-set deltas signed in the improvement direction. The architecture switch dominates the extreme-tail metrics (P999 SEDI@300 mm: +0.71 Florida, +1.00 California, +0.89 Texas substate), while cVAE augmentation provides the largest median lift (P50 SEDI@200 mm: +0.46 Florida) and is region-specific (§[4.5](https://arxiv.org/html/2605.12762#S4.SS5)).

Table 7: Per-component contributions, by metric and domain. *Arch.* compares Q-SRDRN (no aug) to SRDRN; *cVAE* compares Q-SRDRN+cVAE to Q-SRDRN. Positive values indicate improvement in the listed direction. Numbers are single-seed test-set deltas and should be read as point estimates, not significance-tested effects. Texas substate cVAE-column entries are not reported; the FL-tuned cVAE configuration was not run there (§[4.5](https://arxiv.org/html/2605.12762#S4.SS5), Table [4](https://arxiv.org/html/2605.12762#S4.T4)).
## Appendix J P50 SEDI Envelope Figure

This appendix supplies the per-domain median-head SEDI envelope referenced in §[4.5](https://arxiv.org/html/2605.12762#S4.SS5): the SEDI of the Q-SRDRN P50 head as a function of precipitation threshold, compared against the deterministic SRDRN baseline. Figure [5](https://arxiv.org/html/2605.12762#A10.F5) shows that on California the un-augmented Q-SRDRN P50 already surpasses SRDRN at thresholds ≥ 100 mm/day (the median-quantile loss is sufficient on a smooth synoptic regime), whereas on Florida the un-augmented P50 trails SRDRN at extreme thresholds, with the gap closed only by cVAE augmentation.
(a) Florida

(b) California

Figure 5: P50 SEDI across precipitation thresholds. SRDRN (single output) vs. Q-SRDRN P50. On CA, P50 alone surpasses SRDRN at ≥ 100 mm; on FL the un-augmented P50 trails SRDRN at extreme thresholds, a gap closed by augmentation (§[4.5](https://arxiv.org/html/2605.12762#S4.SS5)).
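For completeness, a minimal sketch of how one point on such an envelope is computed, assuming the standard Symmetric Extremal Dependence Index of Ferro and Stephenson (2011); the helper names and clipping constant are ours, not from the paper's evaluation code.

```python
import numpy as np

def sedi(obs, pred, thresh):
    """SEDI for the binary event {value >= thresh}.

    Built from the hit rate H and false-alarm rate F:
      SEDI = [ln F - ln H - ln(1-F) + ln(1-H)] / [ln F + ln H + ln(1-F) + ln(1-H)]
    SEDI = 1 is a perfect forecast; SEDI = 0 is no skill (H = F).
    """
    o = obs >= thresh
    p = pred >= thresh
    H = np.logical_and(p, o).sum() / max(o.sum(), 1)       # hit rate
    F = np.logical_and(p, ~o).sum() / max((~o).sum(), 1)   # false-alarm rate
    H, F = np.clip([H, F], 1e-9, 1 - 1e-9)                 # avoid log(0)
    num = np.log(F) - np.log(H) - np.log(1 - F) + np.log(1 - H)
    den = np.log(F) + np.log(H) + np.log(1 - F) + np.log(1 - H)
    return num / den

# Sweeping `thresh` over 10, 50, 100, 200, ... mm/day traces one envelope curve.
```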
## Appendix K Matched-Head Envelope Figure

This appendix shows the per-head SEDI envelope of Q-SRDRN (no augmentation), with each quantile head plotted only over its natural climatological range: P50 for ≤ 10 mm, P95 for 10–20 mm, P99 for 30–60 mm, and P999 for ≥ 70 mm. Figure [6](https://arxiv.org/html/2605.12762#A11.F6) demonstrates that the matched-head envelope (black) lies above the deterministic SRDRN baseline (dashed grey) at every threshold on both Florida and California, and that the California P999 head sustains SEDI ≥ 0.99 across the entire ≥ 50 mm range.

Figure 6: Each quantile head in its natural range (Q-SRDRN, no augmentation). Head assignment follows climatological exceedance rates: P50 (blue, ≤ 10 mm), P95 (green, 10–20 mm), P99 (orange, 30–60 mm), P999 (red, ≥ 70 mm). The matched-head envelope (black) exceeds the SRDRN baseline (dashed gray) at every threshold on both domains. On California, the P999 head achieves SEDI ≥ 0.99 across the entire ≥ 50 mm range.
## Appendix L Seasonal RMSE

This appendix breaks the headline RMSE numbers down by meteorological season (DJF, MAM, JJA, SON) for each of the four configurations evaluated in the paper. Table [8](https://arxiv.org/html/2605.12762#A12.T8) shows that Q-SRDRN+cVAE attains the lowest seasonal RMSE in 7 of 8 Florida and California season pairs, and that on the un-augmented Texas substate the Q-SRDRN P50 head improves over SRDRN in all four seasons (17–49% RMSE reduction), confirming that the architecture-driven gain is not a single-season artefact.

Table 8: Seasonal RMSE (mm/day). Q-SRDRN+cVAE achieves the lowest seasonal RMSE in 7 of 8 Florida and California season pairs; on the un-augmented Texas substate, the Q-SRDRN P50 head beats SRDRN in all four seasons (17–49% RMSE reduction). ∗Texas substate is reported without cVAE augmentation; see §[4.5](https://arxiv.org/html/2605.12762#S4.SS5), Table [4](https://arxiv.org/html/2605.12762#S4.T4).
## NeurIPS Paper Checklist
1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

Answer: [Yes]

Justification: The abstract and introduction precisely state our three contributions: (i) we identify a CNN-specific failure mode in sort-based quantile monotonicity and resolve it with Q-SRDRN (IncrementBound + separate Conv2D(1) heads); (ii) we validate the same architecture and hyperparameters across three climatologically distinct domains (Florida, California, and a Texas Gulf-Coast substate); and (iii) we show that multi-quantile regression is what makes cVAE augmentation useful on Florida, by shielding the median from the label-averaging that poisons MAE-trained networks. These are exactly what the experimental sections evaluate.
2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: Section 5.3 (Limitations) and the augmentation discussion in Section 4.5 explicitly address (i) the ERA5 25 km resolution wall that caps deep-tail (300 mm/day) detection regardless of model capacity, demonstrated by a capacity-expansion ablation; (ii) the region-specificity of cVAE augmentation (the Florida-tuned generator architecture and input-variable set mildly harm California's P50 channel and were not run on the Texas substate); (iii) cVAE generator smoothness vs. conditional diffusion, and the residual P999 wet-day calibration gap (15.12× on Florida); (iv) the unverified transfer to non-precipitation right-skewed geophysical variables; and (v) the single-seed headline numbers, with FL seed variance bounded by the 5-seed multi_seed comparator and CA multi_seed deferred to camera-ready.
3. Theory assumptions and proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [N/A]

Justification: The paper presents an empirical methodology (Q-SRDRN architecture, IncrementBound layer, cVAE augmentation) with no formal theorems or proofs; all results are experimental.
4. Experimental result reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: Section 5.4 (Broader Impacts and Reproducibility) and the Hyperparameters appendix specify the full SRDRN/Q-SRDRN architecture, IncrementBound layer, cVAE v5c configuration, optimiser/schedule, batch sizes, multi-seed protocol, and dataset preprocessing (ERA5 15 variables, PRISM target, log1p + per-pixel normalization, train/test split). Both ERA5 and PRISM are publicly available, so an external reader has every detail required to rebuild the pipeline from scratch.
5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [No]

Justification: Code is not released at submission time but will be made publicly available upon acceptance. The paper and appendix nonetheless describe all architectural, training, and preprocessing details (Section 5.4 Broader Impacts and Reproducibility, Hyperparameters appendix) sufficient for an independent re-implementation, and both underlying data sources, ERA5 (Copernicus CDS) and PRISM (Oregon State University PRISM Climate Group), are publicly downloadable.
6. Experimental setting/details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer) necessary to understand the results?

Answer: [Yes]

Justification: Section 4 (Experiments) and the Hyperparameters appendix specify the train/test split, the Adam optimizer with documented learning rates and schedules, batch sizes (64 for SRDRN/Q-SRDRN, 48 for cVAE), epoch counts, loss-function weights (pinball quantiles, α = 5 event weighting, KL β = 0.008), the normalization scheme, and how each hyperparameter was selected (architectural ablations and the failed-experiments table report the search trajectory).
30. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them\. - •The full details can be provided either with the code, in appendix, or as supplemental material\.
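The justification above cites pinball-loss weights at the paper's quantile set τ ∈ {0.50, 0.95, 0.99, 0.999}. As a minimal NumPy sketch of the standard pinball (quantile) loss — illustrative names only, not the authors' released TensorFlow/Keras code, and the summation over quantile channels is an assumption:

```python
import numpy as np

# Quantile levels used in the paper's Q-SRDRN heads.
TAUS = [0.50, 0.95, 0.99, 0.999]

def pinball_loss(y_true, y_pred_quantiles, taus=TAUS):
    """Sum of per-quantile pinball losses.

    y_pred_quantiles has one trailing channel per quantile level.
    For each tau: tau * err if err >= 0, else (tau - 1) * err,
    so under-predicting the 0.999 quantile costs ~999x more than
    over-predicting it -- the asymmetry that pulls the head toward the tail.
    """
    total = 0.0
    for i, tau in enumerate(taus):
        err = y_true - y_pred_quantiles[..., i]
        total += np.mean(np.maximum(tau * err, (tau - 1.0) * err))
    return total
```

With a target of 1.0 and all four channels predicting 0.0 (under-prediction, err = +1), each channel contributes its own τ, while over-predicting by the same margin contributes only 1 − τ per channel — the asymmetry that distinguishes this loss from the intensity-weighted MAE the paper argues against.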
7. Experiment statistical significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: The Florida Comparison Against Uncertainty-Quantification Baselines table (Section 4.4) reports 5-seed mean ± standard deviation across seeds {11, 23, 47, 71, 97}, capturing variability from initialization and stochastic optimization, and confirms the headline UQ findings are statistically robust. Other tables and figures show single-seed results due to compute constraints; this scope is acknowledged in Section 5.3 (Limitations).

Guidelines:
- The answer [N/A] means that the paper does not include experiments.
- The authors should answer [Yes] if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
8. Experiments compute resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: The Reproducibility content within Section 5.4 (Broader Impacts and Reproducibility) reports the GPU type (single NVIDIA H100 80 GB) and per-run wall-clock time for every model and region: SRDRN baseline ~6 h (FL and TX) / ~13 h (CA); Q-SRDRN ~12 h (FL and TX) / ~22 h (CA); cVAE ~4 h (FL) / ~8 h (CA). The Hyperparameters appendix lists batch size, optimizer, and other run-level details.

Guidelines:
- The answer [N/A] means that the paper does not include experiments.
- The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).
9. Code of ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?

Answer: [Yes]

Justification: The research uses publicly available reanalysis (ERA5) and gauge-derived (PRISM) precipitation datasets, involves no human subjects, no personally identifiable information, and no models with foreseeable dual-use risks. All authors have reviewed the NeurIPS Code of Ethics and confirm conformance.

Guidelines:
- The answer [N/A] means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer [No], they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
10. Broader impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [Yes]

Justification: Section 5.4 (Broader Impacts and Reproducibility) discusses positive applications — flood-risk assessment, water-resource management, agricultural planning, and climate adaptation in hurricane- and atmospheric-river-affected regions — and the principal negative-impact risks (over-reliance on the median output without consulting the calibrated quantile spread, and out-of-distribution use beyond the trained Florida/California/Texas Gulf-Coast domains and 1980–2013 ERA5–PRISM period), with a mitigation that emphasizes calibrated uncertainty as the intended user interface.

Guidelines:
- The answer [N/A] means that there is no societal impact of the work performed.
- If the authors answer [N/A] or [No], they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate Deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pre-trained language models, image generators, or scraped datasets)?

Answer: [N/A]

Justification: The released artefacts are a regression model that maps ERA5 atmospheric reanalysis to gridded precipitation and a small set of cVAE-generated synthetic precipitation fields used purely for data augmentation. Neither poses misuse risks of the kind contemplated by this question (no language generation, no scraped or sensitive data, no image generation).

Guidelines:
- The answer [N/A] means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: ERA5 reanalysis is provided by ECMWF/Copernicus Climate Change Service under the Copernicus license; PRISM precipitation data is provided by the PRISM Climate Group at Oregon State University and is freely available for research use. Both datasets are cited in the paper, and we use them within their stated terms (research/non-commercial). Baseline architectures (SRDRN backbone, residual blocks, and the BCSD/QM/SRGAN comparators referenced in Related Work) are credited to their original publications.

Guidelines:
- The answer [N/A] means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, [paperswithcode.com/datasets](https://paperswithcode.com/datasets) has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.
13. New assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [Yes]

Justification: The new assets — Q-SRDRN training code (TensorFlow/Keras), the IncrementBound layer and pinball loss implementation, the cVAE v5c PyTorch model and sampling pipeline, the calibration-diagnostics script, and the 83-sample FL cVAE augmentation set — will be released on acceptance under an open-source license, with a README and configuration files documenting architecture, training command, hyperparameters, and dataset preparation steps (Section 5.4 Broader Impacts and Reproducibility).

Guidelines:
- The answer [N/A] means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
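Among the assets listed above is the IncrementBound layer, which the abstract describes as enforcing monotonicity across the quantile channels. As a minimal NumPy sketch of one plausible construction — the cumulative non-negative-increment scheme shown here is an assumption, not the authors' released implementation, and all names are illustrative:

```python
import numpy as np

def increment_bound(raw_channels):
    """Hypothetical monotone stacking of quantile channels.

    The first channel is taken as the median head's raw output; each later
    channel adds a non-negative increment on top of the previous one, so the
    outputs satisfy q_0.50 <= q_0.95 <= q_0.99 <= q_0.999 by construction.
    """
    base = raw_channels[..., :1]                   # median channel, unmodified
    incs = np.maximum(raw_channels[..., 1:], 0.0)  # clamp increments to >= 0
    return np.concatenate([base, base + np.cumsum(incs, axis=-1)], axis=-1)
```

Under this construction the median channel passes through untouched — one way to read the paper's claim that the constraint preserves each quantile channel's "gradient identity" — while any negative raw increment simply collapses a higher quantile onto the one below it rather than letting the channels cross.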
14. Crowdsourcing and research with human subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [N/A]

Justification: The paper does not involve crowdsourcing or research with human subjects. All data are publicly available reanalysis (ERA5) and gauge-derived gridded precipitation (PRISM).

Guidelines:
- The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
15. Institutional review board (IRB) approvals or equivalent for research with human subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [N/A]

Justification: The paper does not involve human subjects, so IRB approval is not applicable.

Guidelines:
- The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.
16. Declaration of LLM usage

Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required.

Answer: [N/A]

Justification: LLMs were not used as a component of the core methodology. The Q-SRDRN architecture, IncrementBound layer, pinball loss, and cVAE augmentation pipeline are all conventional CNN/VAE-based models trained from scratch on ERA5 and PRISM. Any LLM usage was limited to writing assistance and code formatting, which the NeurIPS policy explicitly excludes from the declaration requirement.

Guidelines:
- The answer [N/A] means that the core method development in this research does not involve LLMs as any important, original, or non-standard components.
- Please refer to our LLM policy in the NeurIPS handbook for what should or should not be described.