Data-efficient flood depth prediction through domain-aware coreset selection and tabular foundation models

arXiv cs.LG 06/05/26, 04:00 AM Papers
Summary
This paper proposes a domain-aware coreset construction pipeline that enables a tabular foundation model to predict flood depth with only 0.7% of the training data, achieving 98.5% of the supervised reference accuracy and allowing transfer across watersheds without retraining.
arXiv:2606.05265v1 Announce Type: new Abstract: Near-real-time flood depth prediction demands surrogate models that are accurate, fast, and transferable across watersheds. Supervised surrogates can match physics-based simulators in accuracy but need millions of training rows per watershed and cannot extrapolate beyond their original mesh. We propose a domain-aware coreset construction pipeline that conditions a tabular foundation model at inference time. The pipeline stratifies storms by return period and most-affected watershed, then samples hexagons with a target-aware spatial selector. With 0.7% of the per-watershed training pool, the model attains a mean $R^2$ of 0.663 across nine Houston-area watersheds, within 98.5% of the supervised reference ($R^2$ = 0.673). It transfers to held-out watersheds without task-specific retraining, staying ahead of a coreset-trained supervised baseline. On real storms it exceeds the supervised reference on a far out-of-distribution case and trails it on a mostly in-distribution one. Domain-aware coreset construction lets tabular foundation models deliver data-efficient, watershed-transferable flood predictions without per-watershed training.
Cached at: 06/05/26, 08:09 AM
# Data-efficient flood depth prediction through domain-aware coreset selection and tabular foundation models
Source: [https://arxiv.org/html/2606.05265](https://arxiv.org/html/2606.05265)
Adithi Srinath, Manas Singh, Junwei Ma, Ali Mostafavi

1. Urban Resilience.AI Lab, Zachry Department of Civil and Environmental Engineering, Texas A&M University, College Station, TX, USA
2. Department of Computer Science and Engineering, Texas A&M University, College Station, TX, USA
3. Resilitix Intelligence LLC, Houston, TX, USA
4. Institute for a Disaster Resilient Texas, Texas A&M University, College Station, TX, USA

Corresponding author ORCID: 0009-0005-3732-7319

CRediT roles as listed: Conceptualization, Methodology, Software, Writing - original draft; Software, Validation; Software, Validation; Methodology, Writing - review and editing; Supervision, Writing - review and editing

###### Abstract

Near-real-time flood depth prediction demands surrogate models that are accurate, fast, and transferable across watersheds. Supervised surrogates can match physics-based simulators in accuracy but need millions of training rows per watershed and cannot extrapolate beyond their original mesh. We propose a domain-aware coreset construction pipeline that conditions a tabular foundation model at inference time. The pipeline stratifies storms by return period and most-affected watershed, then samples hexagons with a target-aware spatial selector. With 0.7% of the per-watershed training pool, the model attains a mean $R^2$ of 0.663 across nine Houston-area watersheds, within 98.5% of the supervised reference ($R^2 = 0.673$). It transfers to held-out watersheds without task-specific retraining, staying ahead of a coreset-trained supervised baseline. On real storms it exceeds the supervised reference on a far out-of-distribution case and trails it on a mostly in-distribution one. Domain-aware coreset construction lets tabular foundation models deliver data-efficient, watershed-transferable flood predictions without per-watershed training.

###### keywords:

Domain-aware coreset construction; In-context learning; Tabular foundation model; Flood depth prediction; Cross-watershed transferability; Hydrodynamic surrogate

###### Highlights

- A two-stage coreset jointly stratifies by storm return period and spatial structure.
- A vanilla foundation model matches the watershed-level baseline with a 50k coreset.
- In-context learning predicts a held-out watershed from neighbors without retraining.
- Models extrapolate on out-of-distribution storms and remain accurate in-distribution.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.05265v1/fig_main_flow_pro.png)

Figure 1: Conceptual overview of the proposed approach. (1) Physics-based flood simulation archive: a HEC-RAS knowledge base of 592 synthetic storm events across nine Houston-area watersheds, on the order of $10^8$ event-hexagon rows, with storm metadata, watershed boundaries, and NOAA Atlas 14 return-period labels. (2) Domain-aware coreset construction: a two-stage pipeline compresses the archive into a compact, hydrologically representative subset. Stage 1 stratifies events by return period and most-affected watershed, and Stage 2 selects H3 Level 10 hexagons with a target-aware facility-location strategy. (3) Tabular foundation model in-context conditioning: the coreset conditions a pretrained tabular foundation model at inference, with no per-watershed fine-tuning. (4) Flood-depth prediction outputs: the model returns peak inundation depth for query hexagons. The pipeline is evaluated under three protocols: within-watershed accuracy, cross-watershed leave-one-out transfer, and real-storm validation on Hurricane Harvey and Tropical Storm Imelda. (The schematic was prepared with the assistance of ChatGPT-5.5.)

Timely flood depth information supports emergency managers and infrastructure operators during extreme weather events, and also underpins downstream assessments of flood exposure (yin2023unsupervised), mobility disruption, and community resilience (yin2026deep). Yet generating it at scale remains computationally demanding (li2025parametric). Physics-based hydrodynamic models such as the Hydrologic Engineering Center’s River Analysis System (HEC-RAS) provide the engineering reference for inundation depth prediction. Their cost grows quickly with simulated extent and mesh resolution, especially when many storm events must be screened. A direct HEC-RAS sweep is therefore impractical for near-real-time forecasting or broad scenario testing (ma2026uncovering). Machine learning (ML) surrogates trained on simulator output have become a standard alternative and offer near-instant inference once trained (bentivoglio2022deep; mosavi2018flood).

Existing surrogate work falls into two patterns. The first trains one model per mesh cell, capturing local dynamics at fine resolution while producing a fragmented collection that cannot be applied beyond the original mesh (lee2024predicting). The second trains a single supervised model on the full knowledge base of one watershed, which yields strong in-watershed accuracy but binds the predictor to its training region (zahura2020training). Both patterns share the same operational drawback: each requires millions of training rows, and adding a new watershed or recalibrating after an event-regime shift entails another full training pass. Tree-based gradient boosting (chen2016xgboost) still dominates this space because deep tabular methods continue to underperform tree ensembles on structured input (grinsztajn2022why). Deep learning has produced striking results on adjacent hydrologic tasks such as rainfall-runoff modeling (kratzert2018rainfall), but for spatially distributed flood-depth surrogates the tree-based pattern remains the practical default.

Transformer and foundation-model architectures have recently been adapted to structured disaster-management tasks, including post-disaster building-damage categorization (xiao2025damagecat), multimodal impact assessment (xiao2026crisisense), and graph-based damage prediction (esparza2026graph). Tabular foundation models (TFMs) such as TabPFN (hollmann2025accurate) and TabICL (qu2024tabicl) offer a different path. A TFM is a single transformer pretrained on a wide distribution of synthetic tabular tasks. It solves a new task at inference time by conditioning on a labeled context set rather than through gradient updates. This in-context learning (ICL) framing avoids per-task retraining, and a sufficiently strong pretrained backbone can rival task-specific supervised models on tabular benchmarks. However, the context windows of the tabular foundation models considered in this work are bounded at roughly $10^4$ to $10^5$ rows (hollmann2025accurate; grinsztajn2025tabpfn25; qu2024tabicl), while a single watershed’s simulation knowledge base typically reaches millions. Choosing which rows to put into the context becomes the central operational question. Naive options such as random sampling or feature-only facility location ignore the spatial autocorrelation and event-magnitude imbalance that govern flood data. Building a high-quality coreset for a TFM in this setting therefore depends on encoding hydrologic and geographic prior knowledge into the selection step.

Figure [1](https://arxiv.org/html/2606.05265#S1.F1) summarizes the proposed approach. We propose that a TFM conditioned on a carefully built coreset is a feasible flood-depth surrogate at the watershed level. We then test how far the same configuration transfers to held-out watersheds under a leak-free protocol. To support both questions we develop a domain-aware coreset construction pipeline that operates in two stages. Stage one stratifies storm events on two axes: storm return period (RP) and most-affected watershed. The dual stratification ensures that rare high-RP events and their host watersheds survive sampling noise. Stage two samples from a hierarchical hexagonal (H3) grid at Level 10 with cell edge length of approximately 75 m. The selector combines facility-location coverage in static-feature space with a z-scored target-depth signal. For cross-watershed evaluation we adopt a leave-one-out (LOO) protocol in which the held-out target watershed contributes neither training rows nor context.

The contributions of this paper are:

- A two-stage domain-aware coreset construction pipeline that combines event-level stratification by return period and watershed with target-aware hexagon selection, encoding hydrologic and geographic priors in the context.
- A demonstration that a vanilla TFM conditioned on the domain-aware coreset (about 0.7% of the per-watershed training pool) recovers 98.5% of a watershed-level supervised reference in average $R^2$ across nine Houston-area watersheds.
- A leak-free leave-one-out protocol with two source-selection modes (neighboring versus all other watersheds), showing that the same vanilla TFM transfers to held-out watersheds without retraining and outperforms a coreset-trained supervised baseline at most context sizes in both modes.

## 2 Related Work

### 2.1 Return Period for Storm Stratification

Hydrology distinguishes pluvial return period (RP) from fluvial RP by their driver. Pluvial RP ranks events by short-duration rainfall intensity and supports urban drainage design where local rainfall dominates. Fluvial RP ranks events by streamflow magnitude at a gauge and supports riverine flood mapping where upstream runoff dominates. The two can rank the same storm differently, so combined RP frameworks treat both drivers jointly and are preferred in mixed-regime basins where rainfall and runoff co-vary (zscheischler2018future; wahl2015compound). All variants share the same statistical foundation: a $T$-year storm has a $1/T$ chance of being equaled or exceeded in any single year. RP underpins floodplain mapping, hydraulic structure design, and insurance rate setting. In the United States the reference curves are published in NOAA Atlas 14 (noaaatlas14), which maps storm duration and accumulated rainfall depth to RP at the county level. For supervised flood-prediction datasets, synthetic storm libraries tend to oversample moderate events, so intentional stratification by RP is needed to keep rare, high-impact storms from being under-represented in training (bentivoglio2022deep).
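As a worked illustration of the exceedance definition above, and under the added assumption (not stated in the text) that years are statistically independent, the chance of seeing at least one $T$-year event over an $n$-year horizon follows from the annual probability $1/T$:

$$
P(\text{at least one exceedance in } n \text{ years}) = 1-\left(1-\frac{1}{T}\right)^{n}
$$

For example, $T=100$ and $n=30$ give $1-0.99^{30}\approx 0.26$, which is one way to see why events labeled as rare still matter for training coverage.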

### 2.2 Coreset Selection

A coreset is a small, weighted subset of a dataset chosen so that a model trained or conditioned on the subset behaves similarly to one trained on the full data (phillips2017coresets; mirzasoleiman2020coresets; bachem2017practical). Coreset selection methods fall into three broad families. Uniform random sampling provides an unbiased default but covers feature space inefficiently when data are imbalanced. Geometric methods such as facility location (lin2011submodular; wei2015submodularity) and core-set covering (sener2018active) greedily pick samples that maximize coverage of a feature-space kernel, offering diversity guarantees but ignoring labels. Target-aware methods incorporate label statistics: gradient-based selectors (killamsetty2021glister; killamsetty2021gradmatch) pick samples whose gradients best approximate the full-batch update, proxy-based selectors (coleman2020selection) rank candidates with a cheaper surrogate model, and pruning metrics such as forgetting score or supervised classification margin separate redundant from informative examples (sorscher2022beyond). In the foundation-model era, coreset selection has taken on a second role of choosing in-context examples that condition a pretrained predictor at inference time (hollmann2025accurate; thomas2024retrieval). Most of these methods assume samples are independent, an assumption that breaks in geophysical applications where features are strongly autocorrelated, causing naive feature-space selection to cluster samples geographically and leave parts of the domain unrepresented (roberts2017cv; meyer2018improving).

### 2.3 In-Context Learning for Tabular Prediction

Tabular foundation models (TFMs) cast tabular prediction as in-context learning: a transformer is pretrained once on a large distribution of synthetic tasks and, at inference, conditions on a labeled context set $(\mathbf{X}_{ctx},\mathbf{y}_{ctx})$ to predict labels for a query set $\mathbf{X}_{q}$ without gradient updates. The framing originates in the prior-data fitted networks of muller2022transformers and was specialized to tabular classification and regression by TabPFN (hollmann2023tabpfn; hollmann2025accurate). Subsequent releases progressively expand the supported context: TabPFN-v2.5 (grinsztajn2025tabpfn25) reaches roughly $5\times 10^{4}$ rows, and TabPFN-v2.6 (priorlabs2025tabpfn26) extends this to $10^{5}$ rows. TabICL (qu2024tabicl) targets even larger context sizes through a column-then-row attention mechanism. A separate strand augments these backbones with task-specific fine-tuning (thomas2024retrieval), with full fine-tuning recently shown to be a stable baseline for TabPFN-v2 (rubachev2025finetuning). Transfer learning has likewise been used to improve tabular prediction from limited engineering data (pak2023knowledge).

### 2.4 Out-of-Distribution Evaluation

Spatially distributed predictors are vulnerable to two distinct evaluation failures. Standard random $k$-fold cross-validation underestimates prediction error on spatially autocorrelated data, because train and test folds remain close in feature and physical space (roberts2017cv). Spatial cross-validation schemes such as Leave-Location-Out hold out entire spatial blocks to break that contamination (meyer2018improving). A separate question is whether predictions made outside the training distribution can be trusted at all: meyer2021predicting formalize this as the Area of Applicability of a spatial model, and the WILDS benchmark (koh2021wilds) catalogues representative distribution shifts in machine learning.

## 3 Data

The primary data is the MaxFloodCast HEC-RAS simulation database (lee2024predicting) covering nine watersheds in Harris County, Texas (Figure [2](https://arxiv.org/html/2606.05265#S3.F2)). Harris County is the largest county in the Greater Houston Metropolitan Statistical Area, with a substantially flat topography ranging from roughly $-12$ m to $91$ m above mean sea level and a population exceeding 4.5 million. Two main hydrologic systems organize the county: Cypress Creek in the north and the Buffalo Bayou system across the central and southern portions, both draining eastward through the San Jacinto River and the Ship Channel into the Gulf of Mexico.

![Refer to caption](https://arxiv.org/html/2606.05265v1/fig_watersheds.png)

Figure 2: The nine Houston-area watersheds, dissolved from the HEC-RAS simulation mesh. Basemap rendered with contextily ([https://contextily.readthedocs.io/en/latest/index.html](https://contextily.readthedocs.io/en/latest/index.html)) using OpenStreetMap and CARTO tiles.

The database contains 592 synthetic storm events generated by applying a Rasterized Time-series Resampling Method to historic storms in the area, with durations from 1 to 33 hours and hourly rainfall grids at approximately 1,010 m resolution. Flood inundation depths are simulated by HEC-RAS 2D over an unstructured mesh of 26,301 cells (nominal size $\approx 366$ m), refined along major watercourses and breaklined at high-elevation features. From this database we inherit the cell-level peak inundation depth target and the per-event rainfall metadata. All other inputs are processed independently in this study.

To establish a uniform spatial unit across all variables, we discretize each mesh cell into H3 Level 10 hexagons (about 75 m edge length) via polygon polyfill ($\approx 7.7$ hexagons per mesh cell on average). Static geophysical features are extracted from external geospatial datasets and resampled onto the resulting 177,330 hexagons. The HEC-RAS depth output and event-level rainfall metadata are likewise resampled to the same H3 Level 10 grid. Hexagons in the same mesh cell share that cell’s dynamic features (rainfall, depth) but carry their own static features, yielding 592 events $\times$ 177,330 hexagons $\approx 105$ million rows with no missing values. Per-watershed counts range from 2.2M (Hunting Bayou) to 14.1M (Cypress Creek), with a mean of about 6.9M.

Table 1: Key variables used in modeling and event stratification, with data sources.

| Category | Variable | Data Source |
| --- | --- | --- |
| Topography | Elevation | National Elevation Dataset (NED) |
| | Imperviousness | National Land Cover Database (NLCD) |
| | Topographic Wetness Index (TWI) | NED + NLCD |
| | Road Density within 500 m radius | TxDOT roadway network |
| | Distance to Coast | National Hydrography Dataset Plus High Resolution (NHDPlusHR) |
| Hydrology | Height above Nearest Drainage (HAND) | NOAA National Water Model |
| | Distance to Nearest Stream | NHDPlusHR |
| | Distance to Stream (order $\geq 4$) | NHDPlusHR |
| Aggregated Event-specific | Cumulative Rainfall | HEC-RAS 2D |
| | Peak Rainfall Intensity | HEC-RAS 2D |
| | Precipitation Duration | HEC-RAS 2D |
| Event Annotation | Return Period (RP) | NOAA Atlas 14 |

Each row carries 11 predictors and a target (Table [1](https://arxiv.org/html/2606.05265#S3.T1)). Eight static predictors describe local topography and hydrology: elevation, imperviousness, Topographic Wetness Index (TWI), road density within a 500 m radius, distance to coast, Height above Nearest Drainage (HAND), distance to the nearest stream, and distance to the nearest stream of order $\geq 4$. Three dynamic predictors describe each storm: cumulative rainfall, peak rainfall intensity, and precipitation duration. Among the static predictors, TWI follows the standard $\ln(\mathrm{SCA}/\tan\beta)$ formulation under $D_{\infty}$ flow routing (tarboton1997new) computed in SAGA GIS (conrad2015system), and road density is the fraction of road area within a 500 m circular kernel applied to a 15 m-buffered roadway raster. The target is the per-event peak inundation depth at each hexagon simulated by HEC-RAS. The model predicts one peak-depth value per event and hexagon from static geospatial features and aggregated rainfall descriptors, not a time-evolving nowcast that updates depth as new rainfall or gauge data arrive.

Each storm is annotated with two derived fields used for stratification. The most-affected watershed is the watershed whose cells have the highest mean inundation depth during that storm. The RP bin assigns each storm to the largest exceeded NOAA Atlas 14 (noaaatlas14) county-median threshold at the PFDS duration class matching the event’s storm duration, with the procedure detailed in Section [4.1](https://arxiv.org/html/2606.05265#S4.SS1). RP is used as a stratification label rather than a model input. Bin populations cover the range from below 1-year through 1,000-year, with 42 events each in the 500-year and 1,000-year bins.

The 592 events are split into 350 for training, 121 for validation, and 121 for testing under double stratification on RP bin and most-affected watershed, with a post-processing pass that prevents any single watershed from being underrepresented at test time. Within the test set, 42 stratified-subsample events serve as the query set in all experiments.

For external validation we use two real storms simulated with observed rainfall: Hurricane Harvey (August 2017) and Tropical Storm Imelda (September 2019). Rainfall inputs come from Harris County Flood Control District rain-gage observations (hcfcd2023fws) interpolated through Thiessen polygons, while depth targets are produced by HEC-RAS 2D under the same setup as the synthetic events and resampled onto the same H3 L10 grid. Harvey lies entirely outside the synthetic training envelope on the (cumulative rainfall, duration) plane and is treated as fully out-of-distribution (OOD). Imelda mixes in-distribution and OOD cells.

## 4 Methodology

### 4.1 Return Period Construction

Each storm event is labeled with a single RP bin used to stratify the dataset split and the Stage 1 event selection. The event is first assigned a NOAA Atlas 14 duration class $d^{\ast}$ based on its storm duration $d_{e}$:

$$
d^{\ast}=\begin{cases}6\,\mathrm{h},&d_{e}\leq 6\,\mathrm{h}\\ 12\,\mathrm{h},&6\,\mathrm{h}<d_{e}\leq 12\,\mathrm{h}\\ 24\,\mathrm{h},&d_{e}>12\,\mathrm{h}\end{cases}\tag{1}
$$

which selects the smallest NOAA Atlas 14 (noaaatlas14) class containing $d_{e}$. The event’s peak cumulative rainfall across hexagons, $r_{\max}$, is then compared against the Harris-County-median NOAA Atlas 14 thresholds $\{\tau_{d^{\ast},T}\}$ at the candidate return periods $T\in\{1,2,5,10,25,50,100,200,500,1000\}$ years, and the event is assigned the largest $T$ for which the threshold is exceeded:

$$
\mathrm{RP}_{\mathrm{event}}=\max\bigl\{\,T:r_{\max}\geq\tau_{d^{\ast},T}\,\bigr\},\tag{2}
$$

with $\mathrm{RP}_{\mathrm{event}}=0$ when no threshold is exceeded.
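The labeling rule in Eqs. (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and `PLACEHOLDER_TAU` holds made-up threshold values in arbitrary depth units, not actual NOAA Atlas 14 figures.

```python
from typing import Mapping

RETURN_PERIODS = (1, 2, 5, 10, 25, 50, 100, 200, 500, 1000)


def duration_class(d_e_hours: float) -> int:
    """Eq. (1): map a storm duration to the 6 h, 12 h, or 24 h class."""
    if d_e_hours <= 6:
        return 6
    if d_e_hours <= 12:
        return 12
    return 24


def rp_bin(r_max: float, d_e_hours: float,
           thresholds: Mapping[int, Mapping[int, float]]) -> int:
    """Eq. (2): largest T whose threshold is met or exceeded, else 0."""
    tau = thresholds[duration_class(d_e_hours)]
    exceeded = [T for T in RETURN_PERIODS if r_max >= tau[T]]
    return max(exceeded, default=0)


# Placeholder thresholds (NOT real NOAA Atlas 14 values): for each duration
# class, a base depth plus 10 units per step up the return-period ladder.
PLACEHOLDER_TAU = {
    d: {T: base + 10.0 * i for i, T in enumerate(RETURN_PERIODS)}
    for d, base in ((6, 50.0), (12, 60.0), (24, 70.0))
}

print(rp_bin(95.0, 9.0, PLACEHOLDER_TAU))  # 10
print(rp_bin(40.0, 3.0, PLACEHOLDER_TAU))  # 0
```

In the first call the 9-hour storm falls in the 12 h class, whose placeholder thresholds are 60, 70, 80, 90, 100, and so on, so a peak of 95 clears the 10-year threshold but not the 25-year one.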

### 4.2 Two-Stage Coreset Construction

The choice of in-context examples is the central design decision of this paper. A naive random sample of the simulation knowledge base ignores two properties of flood data: storm-magnitude imbalance leaves rare high-RP events sparse in any such subsample, and spatial autocorrelation among static features causes the picked hexagons to cluster geographically. We address both with a two-stage construction: events are first dual-stratified by return period and watershed, then hexagons within each watershed are sampled with a target-aware spatial selector. The resulting coreset is the Cartesian product of $N_{e}$ events sampled in Stage 1 and $N_{h}$ hexagons selected per watershed in Stage 2, giving $N=N_{e}\times N_{h}$ rows per watershed. Table [2](https://arxiv.org/html/2606.05265#S4.T2) lists the $(N_{e},N_{h})$ values used for each $N$.

Table 2: Coreset decomposition: $(N_{e},N_{h})$ for each per-watershed size $N$.

| $N$ | $N_{e}$ | $N_{h}$ |
| --- | --- | --- |
| 500 | 20 | 25 |
| 1,000 | 25 | 40 |
| 2,000 | 40 | 50 |
| 5,000 | 50 | 100 |
| 10,000 | 50 | 200 |
| 50,000 | 50 | 1,000 |

The decomposition balances event diversity against per-watershed hexagon coverage within a context that fits all TFMs evaluated. The largest size $N=50\text{k}$ matches TabPFN-v2.5’s native context limit, so every coreset-based model sees the same maximum context regardless of its own capacity ceiling. Within this $N$ envelope, both $N_{e}$ and $N_{h}$ rise at small $N$ so the coreset captures both event types and watershed geometry. $N_{e}$ saturates at 50 once $N$ allows it, which keeps Stage 1 from depleting the rare-RP tail bins where the training pool itself is sparse. $N_{h}$ then grows to keep $N=N_{e}\times N_{h}$ as $N$ continues to climb.
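The Cartesian-product structure of the coreset can be sketched as follows. The `(N_e, N_h)` pairs are those of Table 2; the function name and the integer event and hexagon identifiers are hypothetical stand-ins.

```python
from itertools import product

# (N_e, N_h) pairs from Table 2, keyed by per-watershed coreset size N.
DECOMPOSITION = {
    500: (20, 25),
    1_000: (25, 40),
    2_000: (40, 50),
    5_000: (50, 100),
    10_000: (50, 200),
    50_000: (50, 1_000),
}


def coreset_index(event_ids, hexagon_ids):
    """Every (event, hexagon) pair: the row keys of one watershed's coreset."""
    return list(product(event_ids, hexagon_ids))


# Each tabulated pair multiplies out to its coreset size.
for n, (n_e, n_h) in DECOMPOSITION.items():
    assert n_e * n_h == n

# Hypothetical ids: 20 Stage 1 events crossed with 25 Stage 2 hexagons.
rows = coreset_index(range(20), range(25))
print(len(rows))  # 500
```

Because every selected hexagon appears with every selected event, the two stages can be tuned independently and the context size stays exactly $N_{e}\times N_{h}$.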

#### 4.2.1 Stage 1: Event Stratification

Given $N_{e}$ events to select in Stage 1, we draw them from the training split, stratified jointly by RP bin and most-affected watershed. Each watershed first gets a floor of $k_{\min}=\min(2,\lfloor N_{e}/n_{\mathrm{ws}}\rfloor)$ events, where $n_{\mathrm{ws}}$ is the number of watersheds. The remaining slots are allocated across watersheds in proportion to their training-event counts (largest-remainder rounding), and within each watershed across its RP bins. By construction, every watershed contributes at least $k_{\min}$ events, with within-watershed RP stratification preserving the pool’s bin diversity.
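A minimal sketch of the watershed-level allocation follows, assuming a plain largest-remainder rule. The watershed names and pool counts are hypothetical, and the sketch does not cap a watershed's allocation at its pool size, which a full implementation would need to handle. The same `largest_remainder` helper could be reused within a watershed to spread its events across RP bins.

```python
import math


def largest_remainder(total: int, weights: dict) -> dict:
    """Split `total` integer slots across keys in proportion to `weights`."""
    weight_sum = sum(weights.values())
    if total <= 0 or weight_sum <= 0:
        return {k: 0 for k in weights}
    quotas = {k: total * w / weight_sum for k, w in weights.items()}
    alloc = {k: math.floor(q) for k, q in quotas.items()}
    leftover = total - sum(alloc.values())
    # Hand the leftover slots to the largest fractional remainders.
    by_remainder = sorted(weights, key=lambda k: quotas[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


def allocate_events(n_e: int, pool_counts: dict) -> dict:
    """Floor of k_min per watershed, then proportional allocation of the rest."""
    n_ws = len(pool_counts)
    k_min = min(2, n_e // n_ws)
    remaining = n_e - k_min * n_ws
    extra = largest_remainder(remaining, pool_counts)
    return {ws: k_min + extra[ws] for ws in pool_counts}


# Hypothetical training-pool sizes for four watersheds (350 events in total).
pools = {"A": 160, "B": 110, "C": 70, "D": 10}
alloc = allocate_events(50, pools)
print(alloc)  # {'A': 21, 'B': 15, 'C': 11, 'D': 3}
```

Here $k_{\min}=2$ reserves 8 slots, and the other 42 are split proportionally (quotas 19.2, 13.2, 8.4, 1.2), with the single leftover slot going to the watershed with the largest remainder.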

#### 4.2.2 Stage 2: Hexagon Selection

Within each watershed we evaluate five hexagon\-selection methods: a Random baseline, Facility Location \(FL\), Spatial\-Penalty FL \(SP\-FL\), Depth\-Stratified Sampling \(Strat\-Depth\), and Depth\-Augmented FL \(FL\-Depth\)\.

##### Random\.

Uniform random sampling, used as a baseline\.

##### FL\.

Greedy facility location on a feature-space RBF kernel (lin2011submodular; wei2015submodularity; sener2018active), evaluated over a candidate pool $\mathcal{H}_{\mathrm{pool}}$ of 3,000 random hexagons subsampled from the watershed for tractability:

$$
\mathcal{H}_{\mathrm{fl}}=\mathop{\arg\max}\limits_{|\mathcal{H}|=N_{h}}\sum_{h\in\mathcal{H}_{\mathrm{pool}}}\max_{s\in\mathcal{H}}K(x_{h},x_{s}),\tag{3}
$$

$$
K(x_{h},x_{s})=\exp\!\left(-\frac{\|x_{h}-x_{s}\|^{2}}{\sigma_{f}^{2}}\right),\tag{4}
$$

where $x_{h}$ is the z-scored static-feature vector of hexagon $h$ and $\sigma_{f}^{2}$ is set by the median heuristic on pairwise squared distances within the pool.
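A naive NumPy sketch of greedy facility location on the kernel of Eq. (4) is given below. It is illustrative only: the function names are hypothetical, the features are random stand-ins, and each greedy step recomputes all marginal gains, which is simple but slow for a 3,000-point pool.

```python
import numpy as np


def rbf_kernel_median(x: np.ndarray) -> np.ndarray:
    """Eq. (4) with sigma_f^2 set to the median pairwise squared distance."""
    sq = np.sum(x**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)
    sigma2 = np.median(d2[np.triu_indices(len(x), k=1)])
    return np.exp(-d2 / sigma2)


def fl_objective(kernel: np.ndarray, selected: list) -> float:
    """Eq. (3) objective: each pool point scores its best match in the subset."""
    return float(kernel[:, selected].max(axis=1).sum())


def greedy_facility_location(kernel: np.ndarray, n_select: int) -> list:
    """Add, one at a time, the pool point with the largest marginal gain."""
    n = kernel.shape[0]
    best = np.zeros(n)  # best[h] = max similarity of h to the current subset
    selected = []
    for _ in range(min(n_select, n)):
        # gains[s] = sum over h of max(K[h, s] - best[h], 0)
        gains = np.maximum(kernel - best[:, None], 0.0).sum(axis=0)
        if selected:
            gains[selected] = -np.inf  # never pick the same point twice
        s = int(np.argmax(gains))
        selected.append(s)
        best = np.maximum(best, kernel[:, s])
    return selected


rng = np.random.default_rng(0)
x_pool = rng.normal(size=(200, 8))  # stand-in for z-scored static features
K = rbf_kernel_median(x_pool)
picks = greedy_facility_location(K, 10)
print(len(picks), fl_objective(K, picks))
```

The greedy rule is the standard approach for this monotone submodular objective; whether the authors use a lazy or accelerated variant is not stated in the text.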

##### SP\-FL\.

Facility location on the same pool with a hard spatial\-exclusion constraint:

$$
\|p_{h_{i}}-p_{h_{j}}\|_{\mathrm{geo}}\geq r_{\min},\qquad\forall\,h_{i},h_{j}\in\mathcal{H}_{\mathrm{sp\_fl}},\ i\neq j,\tag{5}
$$

where $p_{h}$ is the hexagon centroid and $\|\cdot\|_{\mathrm{geo}}$ is haversine distance. We fix $r_{\min}=300$ m. SP-FL can fall short of $N_{h}$ in small watersheds when the geometric constraint exhausts available hexagons (Hunting Bayou 56% filled at $N=50\text{k}$, Vince-Buffalo 71%).
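One way to enforce the exclusion constraint of Eq. (5) inside a greedy loop is to mask out every candidate within $r_{\min}$ of an already selected hexagon. The sketch below assumes that masking strategy, which the text does not spell out; the function names, the Earth-radius constant, and the toy grid are likewise assumptions.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres (sketch assumption)


def haversine_matrix_m(lat_deg, lon_deg):
    """Pairwise haversine distances in metres for centroids given in degrees."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2.0 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def spatial_penalty_fl(kernel, dist_m, n_select, r_min=300.0):
    """Greedy facility location that skips candidates within r_min of a pick."""
    n = kernel.shape[0]
    best = np.zeros(n)
    allowed = np.ones(n, dtype=bool)
    selected = []
    while len(selected) < n_select and allowed.any():
        gains = np.maximum(kernel - best[:, None], 0.0).sum(axis=0)
        gains[~allowed] = -np.inf
        s = int(np.argmax(gains))
        selected.append(s)
        best = np.maximum(best, kernel[:, s])
        # Excludes s itself (distance 0) and every neighbour closer than r_min.
        allowed &= dist_m[s] >= r_min
    return selected


# Toy 5 x 5 grid of centroids near Houston, spaced 0.001 degrees apart.
lat, lon = np.meshgrid(29.76 + 0.001 * np.arange(5), -95.37 + 0.001 * np.arange(5))
lat, lon = lat.ravel(), lon.ravel()
dist = haversine_matrix_m(lat, lon)

rng = np.random.default_rng(0)
x = rng.normal(size=(25, 8))  # stand-in static features
d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-d2 / np.median(d2[np.triu_indices(25, k=1)]))

picks = spatial_penalty_fl(K, dist, n_select=10)
print(len(picks))  # fewer than 10: the toy grid is too small for ten picks
```

The toy grid spans only a few hundred metres, so the loop stops early once no admissible candidate remains, mirroring the shortfall reported for small watersheds.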

##### Strat\-Depth\.

Stratified random sampling on a per-hexagon depth signal $y_{h}$. The depth signal is the mean simulated peak depth at hexagon $h$ over the Stage 1 event set $\mathcal{E}$ for the current $N$:

$$
y_{h}=\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\mathrm{depth}_{h,e},\tag{6}
$$

where $\mathrm{depth}_{h,e}$ is the HEC-RAS-simulated peak depth at hexagon $h$ in event $e$. Dry hexagons ($y_{h}\leq 0$) form a separate bin when present and wet hexagons are split into up to three quantile bins, for a total of at most four bins. Selections are allocated equally per bin with largest-remainder allocation to the largest pools for the residual.
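The binning and allocation can be sketched as below. The function names and the synthetic depth signal are hypothetical, and one simplification is labeled in the code: if a bin is smaller than its quota, the shortfall is not redistributed.

```python
import numpy as np


def depth_bins(y: np.ndarray, n_wet_bins: int = 3) -> np.ndarray:
    """Label hexagons: 0 = dry (y <= 0), 1..n_wet_bins = wet quantile bins."""
    labels = np.zeros(len(y), dtype=int)
    wet = y > 0
    if wet.any():
        inner_q = np.linspace(0.0, 1.0, n_wet_bins + 1)[1:-1]
        # Duplicate edges collapse, so fewer than n_wet_bins bins can result.
        edges = np.unique(np.quantile(y[wet], inner_q))
        labels[wet] = 1 + np.searchsorted(edges, y[wet], side="right")
    return labels


def strat_depth_select(y: np.ndarray, n_select: int, rng) -> np.ndarray:
    """Equal quota per occupied bin; residual slots go to the largest bins."""
    labels = depth_bins(y)
    bins = [np.flatnonzero(labels == b) for b in np.unique(labels)]
    base, residual = divmod(n_select, len(bins))
    quota = [base] * len(bins)
    for i in np.argsort([-len(b) for b in bins])[:residual]:
        quota[i] += 1
    # Simplification: a bin smaller than its quota is not topped up elsewhere.
    picks = [rng.choice(b, size=min(q, len(b)), replace=False)
             for b, q in zip(bins, quota)]
    return np.concatenate(picks)


rng = np.random.default_rng(0)
# Synthetic depth signal: 40 dry hexagons and 160 wet ones.
y = np.concatenate([np.zeros(40), rng.gamma(2.0, 0.5, size=160)])
picks = strat_depth_select(y, 20, rng)
print(np.bincount(depth_bins(y)[picks], minlength=4))  # [5 5 5 5]
```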

##### FL\-Depth\.

Facility location on the same pool with feature vector augmented by the z-scored depth signal from Eq. [6](https://arxiv.org/html/2606.05265#S4.E6):

$$
\tilde{x}_{h}=\bigl[\,x_{h}\,;\;z(y_{h})\,\bigr],\qquad z(y)=\frac{y-\mu_{y}}{\sigma_{y}},\tag{7}
$$

where $\mu_{y}$ and $\sigma_{y}$ are the pool-wide mean and standard deviation of $y_{h}$. The kernel bandwidth $\sigma_{f}^{2}$ is recomputed by the median heuristic on the augmented $\tilde{x}$ pool.
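The augmentation of Eq. (7) amounts to appending one standardized column and recomputing the bandwidth. A minimal sketch with hypothetical function names and random stand-in data follows; the use of the population standard deviation (NumPy's default) is an assumption, since the text does not say which estimator is used.

```python
import numpy as np


def augment_with_depth(x_static: np.ndarray, y_depth: np.ndarray) -> np.ndarray:
    """Eq. (7): append the pool-wide z-scored depth signal as an extra column."""
    z = (y_depth - y_depth.mean()) / y_depth.std()  # population std (assumption)
    return np.column_stack([x_static, z])


def median_heuristic_sigma2(x: np.ndarray) -> float:
    """Median of pairwise squared Euclidean distances within the pool."""
    sq = np.sum(x**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)
    return float(np.median(d2[np.triu_indices(len(x), k=1)]))


rng = np.random.default_rng(0)
x_static = rng.normal(size=(100, 8))     # stand-in z-scored static features
y_depth = rng.gamma(2.0, 0.5, size=100)  # stand-in depth signal y_h

x_aug = augment_with_depth(x_static, y_depth)
sigma2_static = median_heuristic_sigma2(x_static)
sigma2_aug = median_heuristic_sigma2(x_aug)
print(x_aug.shape, sigma2_static < sigma2_aug)  # (100, 9) True
```

The extra column can only increase pairwise distances, so the recomputed bandwidth is never smaller than the static-only one. This is why the bandwidth is refreshed rather than reused.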

### 4.3 Models

#### 4.3.1 Reference Baselines

Coreset-XGB is XGBoost (chen2016xgboost) trained on a single coreset of $N=50\text{k}$ rows per watershed, with hyperparameters from lee2024predicting: 1,000 histogram-based trees of maximum depth 5, learning rate 0.01, column subsampling rate 0.3, and $L_{1}$ regularization weight 10. Full-KB-XGB uses the same hyperparameters but is trained on *all* 350 training events for that watershed, on average $\approx 6.9$M rows. We call it the supervised reference rather than a strict ceiling, since several coreset-based models exceed it on individual watersheds and on the far out-of-distribution storm.
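The stated hyperparameters map naturally onto XGBoost's scikit-learn interface. The dictionary below is a sketch of one plausible mapping rather than the authors' configuration: in particular, reading "column subsampling rate" as `colsample_bytree` is an assumption (XGBoost also offers `colsample_bylevel` and `colsample_bynode`), and any setting not named in the text is left at the library default.

```python
# Keyword arguments for xgboost.XGBRegressor, following the values in the text.
XGB_PARAMS = {
    "n_estimators": 1000,     # 1,000 trees
    "tree_method": "hist",    # histogram-based tree construction
    "max_depth": 5,
    "learning_rate": 0.01,
    "colsample_bytree": 0.3,  # assumed reading of "column subsampling rate"
    "reg_alpha": 10,          # L1 regularization weight
}

try:
    from xgboost import XGBRegressor  # optional dependency
    model = XGBRegressor(**XGB_PARAMS)
except ImportError:
    model = None  # the parameter dictionary is still usable on its own

print(sorted(XGB_PARAMS))
```

Under this reading, Coreset-XGB and Full-KB-XGB would share `XGB_PARAMS` and differ only in the rows passed to `fit`.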

#### 4.3.2 Vanilla TFMs

We evaluate three vanilla TFMs. TabPFN-v2.6 (hollmann2025accurate; priorlabs2025tabpfn26) is the latest release in the TabPFN family. TabPFN-v2.5 (grinsztajn2025tabpfn25) provides a backbone version contrast, and TabICL (qu2024tabicl) extends the comparison to a second TFM family. At inference time the coreset $\mathcal{C}$ provides the in-context examples:

$$
\hat{y}_{q}=f_{\theta}\bigl(x_{q};\,\{(x_{i},y_{i})\}_{i\in\mathcal{C}}\bigr)\tag{8}
$$

where $f_{\theta}$ is the pretrained transformer and $\theta$ is held fixed. No gradient updates are applied at inference.

#### 4.3.3 Fine-Tuning Variants

To test whether fine-tuning (FT) improves over vanilla inference, we add two FT modes on both the TabPFN-v2.5 and TabPFN-v2.6 backbones. TabPFN-FT-v2.5 / v2.6 applies per-watershed fine-tuning on each watershed’s own FL-Depth coreset. TabPFN-FTLOO-v2.5 / v2.6 is fine-tuned across the other eight watersheds, with each episode drawing context and query from one of those eight sampled uniformly at random, so that on a held-out target both context and weights are target-free. Both modes share a single recipe: 500 episodes with fresh context and query batches of 4,000 and 2,048 rows, AdamW at learning rate $10^{-5}$ and weight decay $10^{-4}$, gradient clipping at 1.0, bfloat16 autocast, and MSE loss on context-normalized targets.

## 5 Experiments and Results

### 5.1 Coreset Construction

The coresets evaluated below come from the two-stage pipeline of Section [4.2](https://arxiv.org/html/2606.05265#S4.SS2), with event selection in Stage 1 and hexagon selection in Stage 2.

Stage 1 selects $N_{e}=50$ training events. This is large enough for proportional allocation to give every watershed several events despite the uneven per-watershed training pools, and small enough for $N=N_{e}\times N_{h}$ to fit within TabPFN-v2.5’s context cap. The allocation in Figure [3](https://arxiv.org/html/2606.05265#S5.F3) combines proportional sampling with a floor of $k_{\min}=2$, routing most of the sample to the larger bayou watersheds. The floor catches Addicks and Barker, whose small training pools would otherwise round to zero and exclude these watersheds from the downstream protocols. The below-1-year stratum visible at Addicks and Barker is a direct consequence of this floor on small pools. The two pools span only 2 and 4 RP bins, respectively, including a below-1-year bin, so the floor forces both watersheds to draw bins they would otherwise miss. The 500-year and 1,000-year classes each surface in four watersheds, providing rare-event coverage across the bayou network.

![Refer to caption](https://arxiv.org/html/2606.05265v1/fig_stage1.png)

Figure 3: Per-watershed event allocation at $N_e = 50$. Bars stacked by return period bin, with x-axis labels showing the selected / training-pool ratio per watershed.

Stage 2 selects $N_h$ hexagons per watershed under five candidate selectors defined in Section [4.2.2](https://arxiv.org/html/2606.05265#S4.SS2.SSS2). The selectors trade off feature-space coverage against spatial spread, illustrated for Brays Bayou at $N = 10\text{k}$ ($N_h = 200$) in Figure [4](https://arxiv.org/html/2606.05265#S5.F4). FL clusters tightest because its feature-space objective does not penalize spatial proximity, so two nearby hexagons with distinct features can both be selected. Random and Strat-Depth lack any spatial term and follow the underlying mesh density. FL-Depth is more spread than these spatially neutral baselines because its feature vector includes the z-scored depth signal, which varies smoothly in space and therefore makes nearby hexagons redundant under the FL objective. SP-FL has the widest spread under the $r_{\min} = 300$ m exclusion constraint.
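The selector behavior described above can be illustrated with a generic greedy facility-location routine. This is a textbook-style sketch under stated assumptions, not the selectors of Section 4.2.2: the Gaussian similarity, the function name, and the way the exclusion radius is enforced are all choices made for the example. Appending a z-scored depth column to each feature vector would mimic the FL-Depth idea, and passing `coords` with `r_min=300.0` would mimic the SP-FL constraint.

```python
import math


def greedy_facility_location(features, k, coords=None, r_min=0.0):
    """Greedily pick up to k items maximizing sum_i max_{j in S} sim(i, j).

    features: list of equal-length feature vectors, one per hexagon.
    coords, r_min: optional planar coordinates (metres) and exclusion radius;
    a candidate closer than r_min to any selected item is skipped.
    """
    n = len(features)

    def sim(a, b):
        # Gaussian similarity on squared Euclidean distance (assumed kernel).
        return math.exp(-sum((x - y) ** 2 for x, y in zip(a, b)))

    sims = [[sim(features[i], features[j]) for j in range(n)] for i in range(n)]
    best = [0.0] * n  # best[i]: similarity of item i to its closest selected item
    selected = []

    for _ in range(k):
        top_gain, top_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            if coords is not None and any(
                math.dist(coords[j], coords[s]) < r_min for s in selected
            ):
                continue
            # Marginal gain of adding j to the current selection.
            gain = sum(max(sims[i][j] - best[i], 0.0) for i in range(n))
            if gain > top_gain:
                top_gain, top_j = gain, j
        if top_j is None:
            break  # the exclusion radius left no admissible candidate
        selected.append(top_j)
        best = [max(best[i], sims[i][top_j]) for i in range(n)]
    return selected
```

Because the objective rewards covering items that are not yet well represented, a second pick from an already covered cluster earns almost no gain. That is the sense in which nearby hexagons with similar depth response become redundant once depth enters the feature vector.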

![Refer to caption](https://arxiv.org/html/2606.05265v1/fig_spatial_dist.png)

Figure 4: Spatial distribution of selected hexagons for Brays Bayou at $N = 10\text{k}$ ($N_h = 200$) under all five Stage 2 methods. Panel titles include the mean nearest-neighbor distance. The sparser sampling at the watershed's southwestern arm reflects the underlying mesh, which thins out where the watershed narrows.

Table 3: Mean $R^2$ across nine watersheds for the five hexagon selectors at each coreset size $N$. Bold marks each column's best selector. FL-Depth has the highest cross-$N$ average for both models, and XGB shows a smaller method spread than v2.5 at every $N$.

(a) Vanilla TabPFN-v2.5

| Method | $N = 500$ | $N = 1\text{k}$ | $N = 2\text{k}$ | $N = 5\text{k}$ | $N = 10\text{k}$ | $N = 50\text{k}$ | Mean |
|---|---|---|---|---|---|---|---|
| Random | 0.348 | 0.370 | 0.381 | **0.429** | 0.463 | 0.609 | 0.433 |
| FL | 0.315 | 0.283 | 0.277 | 0.346 | 0.393 | 0.609 | 0.371 |
| SP-FL | 0.320 | 0.307 | 0.298 | 0.377 | 0.417 | 0.602 | 0.387 |
| Strat-Depth | 0.381 | 0.442 | 0.453 | 0.409 | **0.471** | 0.611 | 0.461 |
| FL-Depth | **0.455** | **0.475** | **0.469** | 0.402 | 0.418 | **0.621** | **0.473** |

(b) Coreset-XGB

| Method | $N = 500$ | $N = 1\text{k}$ | $N = 2\text{k}$ | $N = 5\text{k}$ | $N = 10\text{k}$ | $N = 50\text{k}$ | Mean |
|---|---|---|---|---|---|---|---|
| Random | 0.400 | 0.427 | 0.425 | 0.486 | 0.524 | 0.601 | 0.477 |
| FL | 0.367 | 0.376 | 0.384 | 0.474 | 0.515 | **0.609** | 0.454 |
| SP-FL | 0.373 | 0.387 | 0.396 | 0.479 | 0.523 | 0.605 | 0.461 |
| Strat-Depth | 0.414 | 0.433 | 0.455 | **0.487** | **0.528** | 0.605 | 0.487 |
| FL-Depth | **0.436** | **0.466** | **0.469** | 0.458 | 0.514 | 0.606 | **0.491** |

To pick the downstream hexagon selector, we sweep Vanilla TabPFN-v2.5 as the representative TFM and Coreset-XGB as a tree-based control over all five methods at six $N$ values, reported in Table [3](https://arxiv.org/html/2606.05265#S5.T3). For v2.5 in panel (a), FL-Depth wins at four of the six $N$ values and tops the cross-$N$ average at $R^2 = 0.473$, with Strat-Depth second and the pure facility-location methods FL and SP-FL well behind. FL-Depth slips below Random at $N = 5\text{k}$ and below Strat-Depth at $N = 10\text{k}$, but these dips reflect the v2.5 backbone's weakness in this $N$ range rather than any selector failure. Outside that mid-$N$ window, depth-aware selectors dominate because the per-hexagon depth signal compresses each hexagon's response across the Stage 1 event mix into a single target-relevant dimension. The Stage 1 events span below-1-year through $1{,}000$-year storms, so hexagons selected along this signal cover the flood-response surface directly. Pure feature-space selection picks hexagons that look feature-diverse but can still cluster on flood response, leaving the target distribution under-covered. The largest size $N = 50\text{k}$ is the operating point the rest of our experiments target. It matches TabPFN-v2.5's native context cap and is the largest context that fits every TFM under parity, even though it represents only about $0.7\%$ of the per-watershed KB. At this size every reasonable coreset spans the flood-response surface, the five methods converge into a narrow band, and FL-Depth still leads. Coreset-XGB in panel (b) follows the same broad ordering but with a much smaller method spread. FL-Depth has a small early lead, but the methods converge quickly as $N$ grows because the tree ensemble absorbs row-selection noise into its fit rather than propagating it like an in-context TFM. XGB therefore acts only as a control here, and we adopt FL-Depth as the default hexagon selector for all coreset-based models in the rest of the paper.
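The selector comparison for panel (a) reduces to two simple aggregations over Table 3: the cross-$N$ mean per selector and the per-column winner. The short script below recomputes both from the reported values. The variable and function names are illustrative, and the numbers are copied from Table 3(a).

```python
# Mean R^2 per selector, copied from Table 3(a) (Vanilla TabPFN-v2.5).
SIZES = ["500", "1k", "2k", "5k", "10k", "50k"]
TABLE_3A = {
    "Random":      [0.348, 0.370, 0.381, 0.429, 0.463, 0.609],
    "FL":          [0.315, 0.283, 0.277, 0.346, 0.393, 0.609],
    "SP-FL":       [0.320, 0.307, 0.298, 0.377, 0.417, 0.602],
    "Strat-Depth": [0.381, 0.442, 0.453, 0.409, 0.471, 0.611],
    "FL-Depth":    [0.455, 0.475, 0.469, 0.402, 0.418, 0.621],
}


def cross_n_mean(table):
    """Average each selector's R^2 over all coreset sizes."""
    return {method: sum(vals) / len(vals) for method, vals in table.items()}


def column_winners(table, sizes):
    """Return the best selector at each coreset size."""
    return {
        size: max(table, key=lambda method: table[method][i])
        for i, size in enumerate(sizes)
    }


means = cross_n_mean(TABLE_3A)
winners = column_winners(TABLE_3A, SIZES)
best = max(means, key=means.get)
fl_depth_wins = sum(1 for w in winners.values() if w == "FL-Depth")
print(best, round(means[best], 3), fl_depth_wins)
```

The script recovers the Mean column entry of 0.473 for FL-Depth and its four column wins, with Random and Strat-Depth taking the $5\text{k}$ and $10\text{k}$ columns.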

### 5.2 Experiment 1: Within-Watershed

We investigate how close a coreset-based model gets to the watershed-level supervised reference and how fine-tuning shifts that gap. Using the FL-Depth coreset across six values of $N$ from $500$ to $50\text{k}$, we evaluate three vanilla TFMs, two per-watershed fine-tuned TabPFN variants, and Coreset-XGB. Full-KB-XGB sets the supervised reference, trained on all $350$ training events per watershed. We trace how mean $R^2$ scales with $N$ to see whether any coreset-based model approaches the reference, then break the $N = 50\text{k}$ results down by watershed and quantify how much fine-tuning helps each TFM backbone.

![Refer to caption](https://arxiv.org/html/2606.05265v1/fig_exp1.png)

Figure 5: Within-watershed $R^2$ scaling with coreset size $N$. Mean $R^2$ across the nine watersheds for the six coreset-based models under FL-Depth, with the Full-KB-XGB reference shown as the dotted line.

#### 5.2.1 Aggregate Scaling with $N$

The coreset-based models differ in how steeply they improve with $N$ and in which size brings them closest to the Full-KB-XGB reference. Figure [5](https://arxiv.org/html/2606.05265#S5.F5) reports the within-watershed mean $R^2$ at each $N$ for the six coreset-based models and the reference. Coreset-XGB rises monotonically with $N$ but plateaus well below the reference, since the tree ensemble has no pretrained prior and learns only from the coreset. TabICL also rises with $N$ but stays below the TabPFN-v2.6 family across the sweep. Its native context window is much larger than $50\text{k}$, so the parity cap we apply is conservative for TabICL and leaves it short of its optimum in this regime. Vanilla TabPFN-v2.5 climbs through small $N$, underperforms in the intermediate $N$ range, then recovers at $N = 50\text{k}$, a v2.5 backbone weakness that the v2.6 backbone eliminates. TabPFN-FT-v2.5 anchors weights to each target watershed and smooths the v2.5 mid-$N$ dip, but at $N = 50\text{k}$ it still trails Vanilla TabPFN-v2.6. Vanilla TabPFN-v2.6 rises monotonically with $N$ and reaches $R^2 = 0.663$ at $N = 50\text{k}$, the highest among coreset-based models and a $98.5\%$ recovery of the Full-KB-XGB reference at $0.673$. The same fine-tuning recipe applied to v2.6 yields TabPFN-FT-v2.6, which tracks Vanilla TabPFN-v2.6 closely but lands slightly lower. The v2.6 prior already encodes the per-watershed adjustment that fine-tuning extracts from v2.5, and additional fine-tuning over-fits the limited target signal in each watershed's coreset. Two patterns dominate the sweep. The backbone upgrade from v2.5 to v2.6 delivers a larger gain than fine-tuning v2.5 does, and fine-tuning v2.6 provides no benefit. Vanilla TabPFN-v2.6 is therefore the strongest within-watershed model in this evaluation. The stronger pretrained prior delivers more than per-watershed adaptation on the v2.5 backbone, and on the v2.6 backbone the prior is already too informative for fine-tuning to improve.

#### 5.2.2 Per-Watershed Model Comparison

At $N = 50\text{k}$ with the FL-Depth coreset, we evaluate per-watershed performance for each coreset-based model and the Full-KB-XGB reference. Table [4](https://arxiv.org/html/2606.05265#S5.T4) reports the $R^2$ values. By training-pool size the nine watersheds fall into three groups. Addicks and Barker are small reservoirs with $\leq 4$ events each, Hunting is a small-pool bayou with $18$ events, and the other six bayous hold $33$ to $84$ events. Volatility across watersheds tracks the TFM backbone. Vanilla TabPFN-v2.5 has the widest cross-watershed standard deviation at $\sigma = 0.090$. It reaches above the reference on Hunting and Barker where the supervised model is data-limited by small training pools, but falls well below it on Brays and Greens where larger pools give the reference more signal than v2.5's $50$-event coreset can extract. TabICL is similarly volatile and drops below Coreset-XGB on the harder bayou watersheds, since its column-row attention is designed for much larger contexts than $50\text{k}$. Per-watershed fine-tuning on the v2.5 backbone reduces $\sigma$ to $0.073$ by anchoring weights to each target watershed, and the v2.6 backbone reduces it to roughly $0.06$ regardless of fine-tuning. Backbone strength rather than fine-tuning carries the stability gain. Vanilla TabPFN-v2.6 wins six of the nine watersheds among coreset-based models, with TabPFN-FT-v2.5 taking Barker and Vince-Buffalo and Vanilla TabPFN-v2.5 taking Hunting. More striking, Vanilla TabPFN-v2.6 exceeds the Full-KB-XGB reference on Barker, Hunting, and Sims while recovering $93\%$ to $100\%$ on the remaining six. The three reference-beating watersheds share small-to-moderate training pools where the supervised reference itself is data-limited, leaving room for in-context inference with a strong prior to compensate. The remaining six watersheds have larger pools that benefit Full-KB-XGB more than a fixed-size coreset can match. At only $0.7\%$ of the per-watershed training pool, this makes the coreset approach data-efficient at the watershed level.
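The recovery figures quoted above follow directly from the per-watershed $R^2$ values in Table 4. The sketch below recomputes them for Vanilla TabPFN-v2.6 against Full-KB-XGB. The dictionary and function names are illustrative, the per-watershed values are copied from the table, and the overall ratio uses the reported (rounded) means.

```python
# Per-watershed R^2 at N = 50k, copied from Table 4.
V26 = {
    "Addicks": 0.662, "Barker": 0.749, "Brays": 0.596, "Cypress": 0.675,
    "Greens": 0.593, "Hunting": 0.748, "Sims": 0.680,
    "Vince-Buffalo": 0.601, "Whiteoak-Buffalo": 0.662,
}
REF = {
    "Addicks": 0.677, "Barker": 0.728, "Brays": 0.617, "Cypress": 0.697,
    "Greens": 0.632, "Hunting": 0.737, "Sims": 0.661,
    "Vince-Buffalo": 0.649, "Whiteoak-Buffalo": 0.663,
}


def recovery(model, ref):
    """Ratio of model R^2 to reference R^2 for every watershed."""
    return {w: model[w] / ref[w] for w in ref}


ratios = recovery(V26, REF)
beats_reference = sorted(w for w, r in ratios.items() if r > 1.0)
others = {w: r for w, r in ratios.items() if r <= 1.0}

# Overall recovery from the reported mean R^2 values (0.663 vs 0.673).
overall = 0.663 / 0.673

print(beats_reference)
print({w: round(r, 3) for w, r in others.items()})
print(round(overall, 3))
```

The lowest ratio among the six remaining watersheds is Vince-Buffalo and the highest is Whiteoak-Buffalo, which bracket the $93\%$ to $100\%$ range stated in the text.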

Table 4: Per-watershed $R^2$ at $N = 50\text{k}$ under FL-Depth, with the bottom two rows giving the cross-watershed mean and standard deviation $\sigma$. Bold marks each row's coreset-based maximum, and italics mark cells where a coreset-based model exceeds the Full-KB-XGB reference. V-TabPFN denotes Vanilla TabPFN, and TabPFN-FT is the fine-tuned variant.

| Watershed | Coreset-XGB | TabICL | V-TabPFN v2.5 | TabPFN-FT v2.5 | TabPFN-FT v2.6 | V-TabPFN v2.6 | Full-KB-XGB |
|---|---|---|---|---|---|---|---|
| Addicks | 0.592 | 0.605 | 0.615 | 0.652 | 0.651 | **0.662** | 0.677 |
| Barker | 0.670 | 0.727 | *0.766* | ***0.779*** | 0.727 | *0.749* | 0.728 |
| Brays | 0.557 | 0.526 | 0.508 | 0.586 | 0.578 | **0.596** | 0.617 |
| Cypress | 0.631 | 0.617 | 0.589 | 0.642 | 0.655 | **0.675** | 0.697 |
| Greens | 0.561 | 0.553 | 0.527 | 0.559 | 0.580 | **0.593** | 0.632 |
| Hunting | 0.667 | *0.757* | ***0.765*** | *0.762* | *0.745* | *0.748* | 0.737 |
| Sims | 0.601 | 0.645 | 0.607 | *0.668* | *0.673* | ***0.680*** | 0.661 |
| Vince-Buffalo | 0.589 | 0.580 | 0.601 | **0.630** | 0.619 | 0.601 | 0.649 |
| Whiteoak-Buffalo | 0.589 | 0.600 | 0.608 | 0.643 | 0.642 | **0.662** | 0.663 |
| Mean | 0.606 | 0.623 | 0.621 | 0.658 | 0.652 | **0.663** | 0.673 |
| Std | 0.041 | 0.076 | 0.090 | 0.073 | 0.058 | 0.059 | 0.041 |

Table 5: Touching-watershed neighbors used by the *geo* mode, derived from mesh-cell geometry.

| Target watershed | $K_T$ | Neighbors |
|---|---|---|
| Addicks Reservoir | 3 | Barker, Cypress, Whiteoak-Buffalo |
| Barker Reservoir | 4 | Addicks, Brays, Cypress, Whiteoak-Buffalo |
| Brays Bayou | 4 | Barker, Sims, Vince-Buffalo, Whiteoak-Buffalo |
| Cypress Creek | 4 | Addicks, Barker, Greens, Whiteoak-Buffalo |
| Greens Bayou | 4 | Cypress, Hunting, Vince-Buffalo, Whiteoak-Buffalo |
| Hunting Bayou | 3 | Greens, Vince-Buffalo, Whiteoak-Buffalo |
| Sims Bayou | 2 | Brays, Vince-Buffalo |
| Vince Bayou-Buffalo Bayou | 5 | Brays, Greens, Hunting, Sims, Whiteoak-Buffalo |
| Whiteoak Bayou-Buffalo Bayou | 7 | Addicks, Barker, Brays, Cypress, Greens, Hunting, Vince-Buffalo |

### 5.3 Experiment 2: Cross-Watershed LOO

We measure cross-watershed transfer by holding out one target watershed at a time. For each held-out target $T$, the inference context of size $N$ is built from the other eight watersheds' $50\text{k}$-row random coresets, with two source-selection modes determining how those eight pools feed the context. The *geo* mode uses only the $K_T$ watersheds touching $T$ in the mesh-derived adjacency graph of Table [5](https://arxiv.org/html/2606.05265#S5.T5), each contributing $N/K_T$ rows. The *all* mode uses all eight other watersheds, each contributing $N/8$ rows. The *geo* pool size $K_T$ varies by target, and the mode tests whether watersheds touching $T$ carry sufficiently representative hydrologic regimes that a smaller and more local pool matches the wider pool of all eight. Hexagons within each contributing watershed are sampled at random, since Experiment 1 already characterizes selector effects. We sweep eight context sizes from $1\text{k}$ to $50\text{k}$ and compare three model groups. Vanilla TabPFN on both backbones consumes the cross-watershed context without parameter updates. TabPFN-FTLOO on both backbones is fine-tuned across the other eight watersheds before inference, so that neither the target's data nor its weights enter the prediction. Coreset-XGB is a tree-based baseline fitted on the same cross-watershed context. The Full-KB-XGB reference from Experiment 1 is not reused here because each of its models was trained on its own watershed's full training rows and would leak target labels into a leave-one-out evaluation, and keeping every model at the same $50\text{k}$-row cross-watershed context isolates the transfer effect from data-volume differences. The leave-one-out protocol breaks the spatial leakage that a random $k$-fold split would not catch (roberts2017cv; meyer2018improving), providing a notion of cross-watershed transfer principled with respect to the spatial structure of the data (meyer2021predicting).
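The two source-selection modes can be sketched as a small context builder over the adjacency of Table 5. This is an illustration under assumptions rather than the authors' code: the function name, the `coresets` layout, and the use of integer division when $N$ is not divisible by $K_T$ are hypothetical choices. The neighbor lists are copied from Table 5 using its short watershed names.

```python
import random

# Touching-watershed adjacency copied from Table 5 (short names).
NEIGHBORS = {
    "Addicks": ["Barker", "Cypress", "Whiteoak-Buffalo"],
    "Barker": ["Addicks", "Brays", "Cypress", "Whiteoak-Buffalo"],
    "Brays": ["Barker", "Sims", "Vince-Buffalo", "Whiteoak-Buffalo"],
    "Cypress": ["Addicks", "Barker", "Greens", "Whiteoak-Buffalo"],
    "Greens": ["Cypress", "Hunting", "Vince-Buffalo", "Whiteoak-Buffalo"],
    "Hunting": ["Greens", "Vince-Buffalo", "Whiteoak-Buffalo"],
    "Sims": ["Brays", "Vince-Buffalo"],
    "Vince-Buffalo": ["Brays", "Greens", "Hunting", "Sims", "Whiteoak-Buffalo"],
    "Whiteoak-Buffalo": ["Addicks", "Barker", "Brays", "Cypress", "Greens",
                         "Hunting", "Vince-Buffalo"],
}


def build_context(target, mode, n_rows, coresets, rng):
    """Assemble a target-free inference context of about n_rows rows.

    coresets: dict mapping watershed -> list of rows (hypothetical layout).
    mode "geo" draws from the K_T touching neighbors, "all" from the other
    eight watersheds. Each source contributes an equal share; the remainder
    from integer division is simply dropped (an assumption of this sketch).
    """
    if mode == "geo":
        sources = NEIGHBORS[target]
    elif mode == "all":
        sources = [w for w in NEIGHBORS if w != target]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    per_source = n_rows // len(sources)
    context = []
    for w in sources:
        context.extend(rng.sample(coresets[w], per_source))  # random rows
    return context
```

For Sims Bayou with $K_T = 2$, the *geo* context at a given $N$ is split evenly between Brays and Vince-Buffalo, while the *all* context spreads the same budget over eight watersheds.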

![Refer to caption](https://arxiv.org/html/2606.05265v1/fig_exp2.png)

Figure 6: Cross-watershed LOO mean $R^2$ versus context size $N$ under the *geo* (left) and *all* (right) source-selection modes, averaged over nine held-out target watersheds.

Table 6: Cross-watershed LOO $R^2$ at $N \in \{2\text{k}, 10\text{k}, 50\text{k}\}$, averaged over nine held-out target watersheds under the *geo* and *all* source-selection modes. Bold marks each row's maximum.

| Mode | $N$ | Coreset-XGB | V-TabPFN v2.5 | TabPFN-FTLOO v2.5 | TabPFN-FTLOO v2.6 | V-TabPFN v2.6 |
|---|---|---|---|---|---|---|
| *geo* | $2\text{k}$ | 0.4529 | **0.5300** | 0.4641 | 0.4627 | 0.5195 |
| *geo* | $10\text{k}$ | 0.4807 | 0.4790 | 0.4875 | 0.5026 | **0.5152** |
| *geo* | $50\text{k}$ | 0.4795 | 0.3984 | 0.4691 | 0.4954 | **0.5023** |
| *all* | $2\text{k}$ | 0.4660 | **0.5141** | 0.4733 | 0.4690 | 0.5056 |
| *all* | $10\text{k}$ | 0.4910 | 0.5113 | 0.4533 | 0.4970 | **0.5169** |
| *all* | $50\text{k}$ | **0.4940** | 0.4097 | 0.4527 | 0.4815 | 0.4851 |

Cross-watershed performance separates cleanly by backbone. Vanilla TabPFN-v2.5 peaks at the smallest $N$ and decays steadily as the context grows, while Vanilla TabPFN-v2.6 holds $R^2$ between $0.50$ and $0.52$ across the entire sweep and takes the lead from $N = 10\text{k}$ onward. Figure [6](https://arxiv.org/html/2606.05265#S5.F6) traces the full sweep and Table [6](https://arxiv.org/html/2606.05265#S5.T6) pins three landmark $N$ values under both modes. The TabPFN-FTLOO variants track their backbones without overtaking Vanilla TabPFN-v2.6, so leave-one-out fine-tuning offers no usable lift on the stronger backbone, mirroring the within-watershed FT finding from Experiment 1. The mode gap stays narrow across the sweep. The one exception is Coreset-XGB at $N = 50\text{k}$ under *all*, where the wider and more diverse training pool benefits tree fitting enough to overtake every TFM. Outside that corner, the narrow mode gap supports the spatial-locality intuition behind *geo*'s design. Touching watersheds carry enough representative variation to substitute for the full eight-watershed pool in cross-watershed transfer. Vanilla TabPFN-v2.6 is therefore the strongest cross-watershed model, transferring leak-free to a held-out target at the same $50\text{k}$-row context size as within-watershed inference. As in Experiment 1, per-task fine-tuning is unnecessary on the stronger backbone.

### 5.4 Experiment 3: Real Events

We evaluate five coreset\-based models and the Full\-KB\-XGB reference on Hurricane Harvey and Tropical Storm Imelda\. Each TFM uses its own watershed’s coreset as in\-context examples\. Harvey’s storm profile lies entirely outside the synthetic training envelope on the cumulative\-rainfall and duration axes and therefore serves as a fully out\-of\-distribution test\. Imelda has cells both inside and outside this envelope across the nine watersheds\.

Table 7: Real-event $R^2$ on Hurricane Harvey and Tropical Storm Imelda, averaged over the nine watersheds. Two TFM variants exceed the Full-KB-XGB reference on Harvey, while the reference reclaims the lead on Imelda.

| Model | Harvey | Imelda |
|---|---|---|
| Coreset-XGB | 0.471 | 0.430 |
| TabICL | 0.481 | 0.380 |
| Vanilla TabPFN-v2.5 | 0.487 | 0.404 |
| Vanilla TabPFN-v2.6 | 0.558 | 0.430 |
| TabPFN-FT-v2.5 | 0.570 | 0.457 |
| Full-KB-XGB | 0.503 | 0.528 |

On Harvey the top two TFMs exceed the Full-KB-XGB reference. The TabPFN pretrained prior generalizes more reliably than tree-based fitting under sharp distribution shift, since XGB must extrapolate beyond its training envelope while the TFM stays inside its own prior. The backbone upgrade from v2.5 to v2.6 delivers $+0.071$ $R^2$ on Harvey, closing all but $0.012$ of the gap between Vanilla TabPFN-v2.5 and the leading TabPFN-FT-v2.5. On Imelda the Full-KB-XGB reference reclaims the lead. Breaking the event into its in-distribution and OOD slices localizes the gap. Full-KB-XGB pulls ahead on the OOD slice with $+0.13$ $R^2$ over Vanilla TabPFN-v2.6, while the three top models converge within $0.02$ $R^2$ on the in-distribution slice. TabPFN-FT-v2.5 retains a $0.027$ $R^2$ edge over Vanilla TabPFN-v2.6 on this mixed event. This is the only point in the entire evaluation where fine-tuning shows measurable value. In-distribution prediction depends more on training volume than on the pretrained prior. Full-KB-XGB fits 6.9M rows per watershed and stays ahead of every coreset model on Imelda, while TabPFN-FT-v2.5 extracts more from its $50\text{k}$ coreset through 500 gradient-update episodes than vanilla in-context inference does in a single forward pass. The two storms partition the real-event regime by distribution shift. Coreset TFMs hold more potential than tree-based fitting for far-OOD events, while tree-based models with full training access remain hard to beat on in-distribution events.

## 6 Conclusion

We present a domain-aware coreset construction pipeline that conditions a tabular foundation model on a small fraction of the training rows used by a watershed-level supervised baseline. With a $50\text{k}$-row coreset at about $0.7\%$ of the per-watershed training pool, Vanilla TabPFN-v2.6 reaches mean $R^2 = 0.663$ across nine Houston-area watersheds and recovers $98.5\%$ of a Full-KB-XGB reference trained on roughly $6.9$M rows per watershed. This makes Vanilla TabPFN-v2.6 the strongest TFM in our evaluation. Three design elements drive this result. Dual stratification of storm events by NOAA Atlas 14 return period and most-affected watershed ensures rare-event coverage. Target-aware spatial selection via FL-Depth picks hexagons that span the flood-response surface. A sufficiently strong pretrained backbone removes the need for per-task fine-tuning. The same model also transfers to a held-out watershed by drawing in-context examples from its touching neighbors, without any gradient updates.
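As a quick check, the two headline figures follow from the numbers quoted above, taking the coreset size as $5 \times 10^4$ rows and the per-watershed training pool as roughly $6.9 \times 10^6$ rows:

$$
\frac{N}{\text{rows per watershed}} \approx \frac{5 \times 10^{4}}{6.9 \times 10^{6}} \approx 0.0072 \approx 0.7\%,
\qquad
\frac{R^2_{\text{TabPFN-v2.6}}}{R^2_{\text{Full-KB-XGB}}} = \frac{0.663}{0.673} \approx 0.985 .
$$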

The three experiments map the operating range of the candidates. In the within-watershed evaluation, all coreset TFMs improve with context size $N$, but the trajectory depends on the backbone. The TabPFN-v2.6 family rises monotonically across the sweep, while the TabPFN-v2.5 family dips through the intermediate $N$ range and recovers only at $N = 50\text{k}$. Vanilla TabPFN-v2.6 reaches the highest mean $R^2 = 0.663$, within $1.5\%$ of the supervised reference on only $0.7\%$ of the per-watershed training data, and per-watershed fine-tuning provides no further benefit. In the cross-watershed leave-one-out evaluation, Vanilla TabPFN-v2.5 peaks at the smallest $N$ and decays as the context grows, while Vanilla TabPFN-v2.6 stays stable between $0.50$ and $0.52$ across the full sweep and leads the coreset-trained supervised baseline at most context sizes in both source-selection modes. LOO fine-tuning likewise provides no benefit. Two real storms confirm the pattern, with TFMs exceeding the supervised reference on far-OOD Hurricane Harvey and the reference reclaiming the lead on the largely in-distribution Tropical Storm Imelda. Coreset TFMs therefore hold an advantage under far-OOD conditions, while supervised models with full training access remain hard to beat in-distribution.

Several boundaries of the current scope motivate future work. Vanilla TabPFN-v2.6 has higher cross-watershed variance than the Full-KB-XGB reference ($\mathrm{std} = 0.059$ versus $0.041$), so its near-reference mean comes with a wider per-watershed spread and a weaker worst case. The $50\text{k}$-row context cap is well matched to per-watershed inference but leaves room to explore the larger native windows of TabPFN-v2.6 at roughly $100\text{k}$ and TabICL at higher still. The real-event component covers two storms that span the OOD and in-distribution regimes, and the cross-watershed evaluation draws on nine Houston-area watersheds. The mesh-to-H3 conversion standardizes spatial units but gives hexagons inheriting the same mesh cell a shared depth label, so future evaluation should report sensitivity at both the H3 and original mesh-cell resolution. Results here are reported in $R^2$, and complementary metrics such as RMSE, MAE, and high-depth tail accuracy would further characterize operational reliability. Future work includes evaluating on a wider set of historical storms, testing transfer across distinct hydroclimatic regions to probe the broader transfer-learning capability of coreset TFMs, and exploring hybrids that combine TFM in-context inference with tree-based fitting on in-distribution residuals.
