Evaluating Transformer and LSTM Frameworks for Prediction in Ungauged Basins

arXiv cs.AI Papers

Summary

This paper evaluates encoder-only Transformer and LSTM models for streamflow prediction in ungauged basins using NOAA's National Water Model simulations. Results show LSTM outperforms Transformer, and incorporating downstream information significantly improves prediction skill across both architectures.

arXiv:2606.02791v1 Announce Type: new

# EVALUATING TRANSFORMER AND LSTM FRAMEWORKS FOR PREDICTION IN UNGAUGED BASINS
Source: [https://arxiv.org/html/2606.02791](https://arxiv.org/html/2606.02791)
###### Abstract

Watershed networks exhibit convergent topologies in which multiple tributaries merge into downstream channels, integrating diverse upstream hydrological processes. In ungauged basins, the absence of direct observations increases uncertainty and limits the ability to anticipate extreme events. This study evaluates whether an encoder-only Transformer provides an advantage over an LSTM for upstream streamflow inference under limited hydrologic information, using retrospective simulations from the NOAA National Water Model (NWM). In both the upstream-only and combined configurations, the LSTM showed stronger overall performance than the Transformer. Incorporating downstream information further boosted performance for all models, increasing median NNSE by more than 60%. Rather than treating this as a leaderboard-style comparison, we interpret the experiments as a test of architectural inductive bias for hydrologic sequence inference. The results indicate that recurrent memory remains better aligned with this upstream reconstruction task than an encoder-only Transformer, while downstream hydrologic context provides a strong auxiliary constraint that substantially improves prediction skill across architectures.

## I Introduction

Streamflow measurement using gauging stations and the use of these observations to predict river discharge form the foundation of modern hydrologic forecasting\. In recent years, data\-driven approaches, particularly deep learning \(DL\) models, have shown considerable promise in learning complex hydrological relationships directly from data\. This has been greatly supported by the emergence of large\-sample datasets which provide comprehensive hydrometeorological records across hundreds of catchments, including consistent information on physical attributes, meteorological forcings, and streamflow time series\[[3](https://arxiv.org/html/2606.02791#bib.bib52)\], thereby enabling the development of more robust and generalizable models\.

Notably, retrospective datasets such as CAMELS (Catchment Attributes and Meteorology for Large-sample Studies) \[[1](https://arxiv.org/html/2606.02791#bib.bib5)\], Caravan \[[6](https://arxiv.org/html/2606.02791#bib.bib53)\], NWM \[[8](https://arxiv.org/html/2606.02791#bib.bib4)\], and EStreams \[[4](https://arxiv.org/html/2606.02791#bib.bib41)\] have become increasingly valuable with the rise of data-intensive machine learning models \[[6](https://arxiv.org/html/2606.02791#bib.bib53)\]. These developments have encouraged the application of DL models, with the Long Short-Term Memory (LSTM) model achieving notable success in predicting streamflow for ungauged basins. Nearing et al. \[[7](https://arxiv.org/html/2606.02791#bib.bib43)\] demonstrated that the LSTM network can effectively forecast extreme floods in ungauged settings, while Kratzert et al. \[[5](https://arxiv.org/html/2606.02791#bib.bib30)\] showed that LSTMs outperform conceptual models and that catchment characteristics contain sufficient information to support data-driven modeling under prediction in ungauged basins (PUB) conditions. As a gated recurrent neural network, the LSTM was designed to mitigate the vanishing-gradient limitations of standard RNNs and to better preserve long-range temporal dependencies through its cell-state memory mechanism.

The Transformer architecture has demonstrated outstanding performance across various tasks, including natural language processing, speech recognition, computer vision, and question answering, and has recently been adapted for hydrological modeling applications. Transformer models offer an alternative sequence-learning approach because self-attention can model dependencies across available time steps in parallel during training. Several recent studies have highlighted the effectiveness of Transformers in hydrological forecasting. Yin et al. \[[13](https://arxiv.org/html/2606.02791#bib.bib48)\] proposed the Transformer-XAJ, a process- and data-driven model, which achieved strong performance in both regional and ungauged basin predictions. Similarly, Amanambu et al. \[[2](https://arxiv.org/html/2606.02791#bib.bib56)\] demonstrated that the Transformer model outperformed LSTM architectures in hydrological drought forecasting across multiple prediction time steps.

Motivated by the success of LSTMs in hydrologic forecasting and the growing use of Transformer architectures in sequence modeling, this study asks whether attention-based models provide a practical advantage over recurrent models for upstream streamflow inference in ungauged basins. To guide this investigation, we seek to answer two research questions:

- RQ1: Can an LSTM more effectively capture the lagged, state-dependent dynamics of hydrologic response than an encoder-only Transformer under limited upstream information?
- RQ2: Does incorporating downstream hydrologic context improve performance across architectures by providing a network-level constraint on upstream reconstruction?

To answer these questions, we:

- evaluate recurrent and attention-based architectures under a constrained upstream-only setting using NWM retrospective simulations;
- quantify the effect of adding downstream hydrometeorological context \[[9](https://arxiv.org/html/2606.02791#bib.bib6)\]; and
- interpret the resulting performance differences in terms of hydrologic information availability and architectural inductive bias.

The goal of this study is to assess which sequence\-modeling bias is better suited for upstream streamflow inference under limited information, and how this comparison changes when the downstream hydrologic context is incorporated\.

![Refer to caption](https://arxiv.org/html/2606.02791v1/x1.png)\(a\)
![Refer to caption](https://arxiv.org/html/2606.02791v1/x2.png)\(b\)

Figure 1: Model architectures used in this study: (a) LSTM model highlighting gate mechanisms ($f_t$, $i_t$, $o_t$) and the interaction between cell state ($C_t$) and hidden state ($h_t$) during streamflow sequence learning (Xu et al., 2020) \[[12](https://arxiv.org/html/2606.02791#bib.bib67)\]; (b) encoder-only Transformer for hydrologic time series modeling, incorporating input embeddings, positional encoding, masked multi-head attention, and feed-forward layers (Vaswani et al., 2017) \[[11](https://arxiv.org/html/2606.02791#bib.bib2)\].

## II Methodology

Given the interdependence between upstream and downstream streamflow, effective modeling requires architectures capable of capturing temporal dependencies and nonlinear hydrologic behavior\. This study examines two deep learning approaches, an LSTM\-based recurrent model and an encoder\-only Transformer, each leveraging different mechanisms for sequence modeling\.

The LSTM architecture (Fig. [1](https://arxiv.org/html/2606.02791#S1.F1)a) processes time series inputs sequentially, updating its hidden state through a series of gating operations. The forget, input, and output gates ($f_t$, $i_t$, $o_t$) regulate how information is retained, updated, and exposed at each timestep, enabling the model to learn temporal dependencies through internal recurrence. The cell state ($C_t$) acts as an explicit memory pathway, helping the LSTM maintain information across long sequences.
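
For reference, the gating operations described above are commonly written as follows. This is the textbook LSTM formulation rather than anything taken from the paper's code; the weight matrices $W_\ast$ and $U_\ast$, the biases $b_\ast$, the input vector $x_t$, and the candidate state $\tilde{C}_t$ are notational assumptions introduced here.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{C}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
h_t &= o_t \odot \tanh\left(C_t\right)
\end{aligned}
```

The additive update of $C_t$ is the memory pathway referred to in the text: when $f_t$ stays close to 1, information in $C_{t-1}$ is carried forward largely unchanged.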

In contrast, the encoder-only Transformer architecture (Fig. [1](https://arxiv.org/html/2606.02791#S1.F1)b) replaces recurrence with parallel self-attention mechanisms. Instead of processing one timestep at a time, the model computes relationships across all timesteps simultaneously, learning how different points in the sequence relate to each other through multi-head self-attention layers. Transformers rely on positional encodings to represent temporal order, since this information is not captured inherently by the architecture. A causal mask is applied so that the prediction at each timestep attends only to historical information, making the model suitable for autoregressive hydrologic forecasting. Static catchment attributes and dynamic time series inputs are processed through separate embedding layers before being combined, enabling the model to leverage both time-varying and spatial characteristics of each basin.
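
The effect of the causal mask can be illustrated with a minimal, single-head sketch in plain Python. It assumes the raw attention scores (for example, scaled query-key dot products) have already been computed; the function name `causal_attention_weights` and the toy scores are hypothetical and are not drawn from the paper's implementation.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[t][s], with future positions (s > t) masked out."""
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]                  # timestep t may only see positions 0..t
        peak = max(visible)                     # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in visible]
        total = sum(exps)
        # Masked (future) positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return weights

# Toy example: 4 timesteps with uniform raw scores (illustrative values only).
scores = [[0.0] * 4 for _ in range(4)]
attn = causal_attention_weights(scores)
```

With uniform scores, each timestep spreads its attention evenly over itself and its past, so the first row attends only to itself while the last row attends to all four positions.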

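TABLE I: Hydrological attributes used in this study, grouped by category (National Water Model Retrospective Dataset v3.0).

| Category | Attribute | Unit | Description |
|---|---|---|---|
| Dynamic Forcings | `APCP_surface` | mm/s | Accumulated precipitation |
| Dynamic Forcings | `precip_rate` | mm/hr | Precipitation rate |
| Dynamic Forcings | `TMP_2maboveground` | K | Near-surface air temperature |
| Dynamic Forcings | `DSWRF_surface` | W/m² | Downward shortwave radiation |
| Dynamic Forcings | `DLWRF_surface` | W/m² | Downward longwave radiation |
| Dynamic Forcings | `PRES_surface` | Pa | Surface pressure |
| Dynamic Forcings | `UGRD_10maboveground` | m/s | East–west wind at 10 m |
| Dynamic Forcings | `VGRD_10maboveground` | m/s | North–south wind at 10 m |
| Dynamic Forcings | `SPFH_2maboveground` | kg/kg | Specific humidity at 2 m |
| Hydrologic Input | `streamflow` | m³/s | Downstream discharge |
| Catchment Attributes | `basin_length` | km | Length of the basin polygon |
| Catchment Attributes | `basin_area` | km² | Area of the basin polygon |
| Catchment Attributes | `reach_length` | km | Length of the river reach |
| Target | `streamflow` | m³/s | Upstream discharge (target variable) |
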
## III Results and Discussion

### III-A Data Collection and Integration

Hourly hydrometeorological data from February 1, 1979, to January 1, 2023 (44 years) were obtained from the NWM v3.0 dataset. USGS gauges were linked to their corresponding reach identifiers using the RouteLink topology for 671 CAMELS basins across CONUS. Upstream reaches were identified from river network connectivity, and basin geometries were merged to define upstream-downstream basin pairs. Dynamic meteorological forcings and static catchment attributes (Table [I](https://arxiv.org/html/2606.02791#S2.T1)) were included to provide both climatic and physiographic context.
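
One way to picture the upstream identification step is to invert a routing table that maps each reach to the reach it drains into. The sketch below is a hedged illustration only: the dictionary `downstream_of`, the reach identifiers, and the helper `build_upstream_index` are hypothetical stand-ins and do not reproduce the actual RouteLink schema or the authors' pipeline.

```python
from collections import defaultdict

def build_upstream_index(downstream_of):
    """Invert a reach -> downstream-reach mapping into downstream -> [upstream reaches]."""
    upstream_of = defaultdict(list)
    for reach, downstream in downstream_of.items():
        upstream_of[downstream].append(reach)
    return upstream_of

# Hypothetical toy network: reaches 101 and 102 both drain into 201, which drains into 301.
downstream_of = {101: 201, 102: 201, 201: 301}
upstream_of = build_upstream_index(downstream_of)

# Reach assumed (for illustration) to be the one linked to a gauge.
gauge_reach = 201
pairs = [(up, gauge_reach) for up in sorted(upstream_of[gauge_reach])]
```

Each tuple in `pairs` is an (upstream, downstream) reach pair of the kind the text describes; with a single upstream reach per pair, this corresponds to the simplest network configuration.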

This study uses the simple upstream configuration of $n=1$ reach. For each basin pair, meteorological forcings were spatially averaged over the upstream contributing area, and hourly streamflow was extracted for both the downstream and upstream reaches. All variables were stored in NetCDF format with consistent temporal alignment.

### III-B Data Preprocessing and Training Strategy

The dataset was processed through a structured pipeline consisting of variable filtering, temporal splitting, normalization, and sequence generation\. Filtering removed basins with substantial data gaps; splitting produced training, validation, and test periods; normalization ensured consistent feature scales; and sequence generation converted continuous time series into supervised learning samples\.

A 70–15–15 temporal split was used (29 years training, 8 years validation, 8 years testing). Static and dynamic inputs were embedded through fully connected networks (32 units for dynamic inputs and 16 units for static attributes, both with tanh activation and 0.1 dropout) and provided to each model as input windows of 256 time steps. Both the LSTM and Transformer architectures were trained under matched experimental conditions to ensure comparability. The reported configurations were selected as comparable baseline settings under the same data split, optimizer, loss function, input window, and early-stopping rule, rather than through an exhaustive architecture-specific hyperparameter search.

LSTM configuration: The LSTM consists of a recurrent layer containing 32 hidden units. A forget-gate bias of 3 and an output dropout rate of 0.1 were applied. Static and dynamic inputs were embedded through fully connected layers before being passed to the LSTM, and the prediction was generated from the final timestep using a linear regression head.
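
To see why a forget-gate bias of 3 matters, consider the gate value at initialization. The short calculation below is an illustrative sketch that assumes the weighted inputs to the gate contribute approximately zero, so the bias alone sets the activation; the 24-step horizon is an arbitrary choice for illustration, not a value from the paper.

```python
import math

def sigmoid(x):
    """Logistic function used by the LSTM gates."""
    return 1.0 / (1.0 + math.exp(-x))

forget_bias = 3.0
initial_forget = sigmoid(forget_bias)      # gate value when weighted inputs are ~0
neutral_forget = sigmoid(0.0)              # gate value with a zero bias, for comparison

# Fraction of a cell-state value that survives 24 steps if the gate stays at its initial value.
retained_with_bias = initial_forget ** 24
retained_without_bias = neutral_forget ** 24
```

Under these assumptions the biased gate starts near 0.95, so a meaningful share of the cell state persists across a day of hourly steps, whereas a zero bias starts the gate at 0.5 and the state decays to almost nothing over the same span.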

Transformer configuration: The Transformer consists of a single encoder layer with four attention heads, a feedforward dimension of 128, and dropout of 0.1 applied within the encoder and positional encoding layers. Temporal order is represented using sum-based positional encoding, and a causal mask is applied so that each prediction depends only on historical inputs. The final prediction is generated from the last encoder state through a linear regression head.

Training setup: Both models were trained to predict upstream streamflow one hour ahead under two input settings: upstream-only and combined upstream–downstream. Inputs included precipitation, temperature, wind speed, pressure, and radiation. Training used AdamW with a learning rate of $1\times 10^{-4}$, a batch size of 256, gradient clipping at 1, and an NSE loss. Early stopping was applied with a patience of 10 epochs, and the best checkpoint was used for testing. The implementation is available [here](https://github.com/Racta-1/hydro-transformer).
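
The early-stopping rule can be sketched in a few lines. This is a hedged illustration: the function name `best_epoch_with_early_stopping`, the validation-loss values, and the convention that patience counts epochs since the last improvement are assumptions, and the linked repository may implement the rule differently.

```python
def best_epoch_with_early_stopping(val_losses, patience=10):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch    # new best checkpoint
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch               # patience exhausted, stop training
    return best_epoch, len(val_losses) - 1         # ran out of epochs first

# Made-up validation losses: improvement stops after epoch 2.
val_losses = [0.9, 0.7, 0.6] + [0.65] * 12
best_epoch, stop_epoch = best_epoch_with_early_stopping(val_losses)
```

In this toy run the checkpoint from epoch 2 would be the one carried forward to testing, and training halts ten epochs later.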

TABLE II: Summary of evaluation metrics used for model performance assessment.

| Metric | Description | Range, Best Fit |
|---|---|---|
| NNSE | Normalized form of NSE bounded between 0 and 1 for intuitive interpretation. | (0, 1), best: 1 |
| KGE | Combines correlation, bias, and variability for balanced model evaluation. | (−∞, 1), best: 1 |
| Pearson-r | Measures linear correlation between simulated and observed streamflow. | (-1, 1), best: 1 |
| RMSE | Quantifies the average magnitude of simulation errors; lower indicates better short-term fit. | (0, ∞), best: 0 |

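
The metrics in Table II can be computed as in the sketch below. The paper does not spell out its exact formulas, so two assumptions are made here: NNSE is taken as the common normalization 1 / (2 − NSE), and KGE uses the widely used form built from correlation, variability ratio, and bias ratio. All function names are hypothetical.

```python
import math

def _mean(xs):
    return sum(xs) / len(xs)

def _std(xs):
    m = _mean(xs)
    return math.sqrt(_mean([(x - m) ** 2 for x in xs]))

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus error variance over observed variance."""
    mean_obs = _mean(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den

def nnse(obs, sim):
    """Normalized NSE (assumed form): maps NSE in (-inf, 1] onto (0, 1]."""
    return 1.0 / (2.0 - nse(obs, sim))

def rmse(obs, sim):
    return math.sqrt(_mean([(o - s) ** 2 for o, s in zip(obs, sim)]))

def pearson_r(obs, sim):
    mo, ms = _mean(obs), _mean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    return cov / math.sqrt(sum((o - mo) ** 2 for o in obs) * sum((s - ms) ** 2 for s in sim))

def kge(obs, sim):
    """Kling-Gupta efficiency (assumed form) from correlation, variability, and bias ratios."""
    r = pearson_r(obs, sim)
    alpha = _std(sim) / _std(obs)
    beta = _mean(sim) / _mean(obs)
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

obs = [1.0, 2.0, 3.0, 4.0]
perfect = list(obs)            # simulation identical to observations
mean_only = [2.5] * 4          # simulation that always predicts the observed mean
```

Under the assumed normalization, a simulation no better than the observed mean (NSE = 0) scores NNSE = 0.5, which is a useful reference point when reading threshold-based summaries.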
### III-C Results & Performance Analysis

To evaluate how model behavior varies with information availability, we trained both the Transformer and LSTM under two distinct input configurations\. The upstream\-only setting represents a constrained\-information setting in which upstream streamflow is predicted solely from local meteorological forcings and static basin attributes\. In contrast, the combined setting augments these inputs with downstream meteorological forcings and downstream discharge, introducing a network\-informed signal that may reflect the integrated hydrologic response of connected reaches\. This design allows us to test not only whether additional inputs improve accuracy, but also how each architecture responds to a richer hydrologic context\.

![Refer to caption](https://arxiv.org/html/2606.02791v1/x3.png)\(a\)
![Refer to caption](https://arxiv.org/html/2606.02791v1/x4.png)\(b\)
![Refer to caption](https://arxiv.org/html/2606.02791v1/x5.png)\(c\)

Figure 2: Spatial comparison of observed and model-predicted streamflow at upstream gages under the combined configuration across CONUS: (a) observed streamflow, (b) LSTM-predicted streamflow, and (c) encoder-only Transformer-predicted streamflow. Colors indicate streamflow magnitude in m³/s using a common scale across all panels.

![Refer to caption](https://arxiv.org/html/2606.02791v1/x6.png)(a)
![Refer to caption](https://arxiv.org/html/2606.02791v1/x7.png)\(b\)
![Refer to caption](https://arxiv.org/html/2606.02791v1/x8.png)\(c\)

Figure 3: Comparison of model performance under the Combined and Upstream configurations. (a) Median basin-level performance across NNSE, KGE, RMSE, and Pearson-r for LSTM and Transformer models, highlighting improvements from incorporating downstream information. (b) Percentage of basins meeting predefined performance thresholds (NNSE > 0.75, KGE > 0.50, Pearson-r > 0.70, RMSE < 2.0). (c) Cumulative distribution functions (CDFs) of NNSE scores for the LSTM, Transformer, Informer, and CNN-1D models under the Combined (left) and Upstream (right) setups. Note that lower RMSE values indicate better performance, unlike the other metrics.

As shown in Table [III](https://arxiv.org/html/2606.02791#S3.T3), the combined configuration consistently outperformed the upstream-only setup for both Transformer and LSTM models. For example, median NNSE for the Transformer model increases from 0.56 to 0.90, and for the LSTM from 0.56 to 0.93, demonstrating substantial improvements in predictive accuracy. A similar trend is observed for KGE, Pearson-r, and RMSE, indicating that the combined inputs enhance robustness and reduce error magnitudes across basins. The percentage of basins achieving NNSE > 0.5 also increases significantly in the combined setting, from 75.21% to 91.68% for the Transformer and from 79.87% to 93.84% for the LSTM, highlighting more reliable performance across a larger fraction of the domain (see Fig. [3](https://arxiv.org/html/2606.02791#S3.F3)b).
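
The relative size of these NNSE gains can be checked with a short calculation on the reported medians. The helper name `relative_increase` is hypothetical; the input values are the upstream-only and combined medians quoted above.

```python
def relative_increase(before, after):
    """Percentage increase of `after` relative to `before`."""
    return 100.0 * (after - before) / before

# Median NNSE as reported: (upstream-only, combined).
median_nnse = {
    "Transformer": (0.56, 0.90),
    "LSTM": (0.56, 0.93),
}
gains = {name: relative_increase(up, comb) for name, (up, comb) in median_nnse.items()}
```

Both gains come out slightly above 60% for the Transformer and in the mid-60s for the LSTM, consistent with the abstract's statement that median NNSE increases by more than 60%.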

This experiment shows how much predictive skill can be recovered from network\-level hydrologic context and whether that improvement varies by architecture\. Both models benefit substantially from downstream information, but the LSTM shows stronger overall performance across the two configurations\. This suggests that richer hydrologic context improves inference, but does not diminish the importance of an architecture well suited to sequential runoff dynamics\.

TABLE III: Performance comparison of the encoder-only Transformer and LSTM models for both configurations across all basins (basin count: 601).

| Metric | Transformer (Upstream) | Transformer (Combined) | LSTM (Upstream) | LSTM (Combined) |
|---|---|---|---|---|
| NNSE | 0.56 | 0.90 | 0.56 | 0.93 |
| KGE | 0.20 | 0.80 | 0.14 | 0.79 |
| Pearson-r | 0.61 | 0.96 | 0.64 | 0.98 |
| RMSE | 2.41 | 0.56 | 2.24 | 0.44 |
| % Basins NNSE > 0.5 | 75.21% | 91.68% | 79.87% | 93.84% |

Overall, the LSTM shows modest but consistent advantages over the encoder-only Transformer, with higher Pearson-r, lower RMSE, and a larger share of basins with NNSE > 0.5 in both configurations. Median NNSE is higher for the LSTM in the combined setting and equal to two decimals (0.56) in the upstream-only setting, while KGE remains comparable, with the Transformer marginally ahead, as seen in Fig. [3](https://arxiv.org/html/2606.02791#S3.F3)a. The difference in median NNSE in the combined setting (0.93 vs. 0.90) is small, indicating that both models achieve similar overall predictive skill.

These results suggest that downstream information is the primary driver of performance gains, while the LSTM provides a modest relative benefit for streamflow reconstruction\. One plausible explanation is that upstream inference is governed by temporally accumulated and state\-dependent hydrologic processes\. The LSTM’s sequential state updates align naturally with these dynamics, whereas the encoder\-only Transformer must infer temporal structure through attention and positional encoding alone, which appears slightly less effective in this setting\.

To assess spatial consistency, we compare observed and predicted streamflow across CONUS\. As shown in Fig\.[2](https://arxiv.org/html/2606.02791#S3.F2), both the LSTM and the encoder\-only Transformer capture similar large\-scale spatial patterns under the combined configuration\.

### III-D Performance Comparison of Other Models

In Fig. [3](https://arxiv.org/html/2606.02791#S3.F3)c, we extend the comparison to CNN-1D \[[10](https://arxiv.org/html/2606.02791#bib.bib68)\] and Informer \[[14](https://arxiv.org/html/2606.02791#bib.bib66)\]. Across architectures, NNSE distributions become more similar when downstream hydrologic information is included, with most basin-level scores concentrated between 0.8 and 1.0, indicating reduced performance differences between models. A plausible explanation is that downstream observations act as a strong network-level constraint, reducing uncertainty and limiting the influence of model-specific inductive biases. In contrast, removing downstream context leads to a general decline in performance, with CNN-1D showing greater variability and lower median NNSE, while Informer, Transformer, and LSTM remain comparatively more stable. This suggests that architectures capable of capturing long-range dependencies or maintaining temporal state are better able to compensate when hydrologic context is limited. Overall, downstream information improves performance and narrows the gap between model classes.
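
For readers who want to reproduce this kind of distributional comparison, an empirical CDF and a threshold summary can be computed as in the sketch below. The basin-level scores are made up purely for illustration, and the function names are hypothetical.

```python
def empirical_cdf(scores):
    """Return sorted (value, cumulative fraction of basins with score <= value) pairs."""
    ordered = sorted(scores)
    n = len(ordered)
    return [(value, (rank + 1) / n) for rank, value in enumerate(ordered)]

def fraction_above(scores, threshold):
    """Share of basins whose score strictly exceeds the threshold."""
    return sum(score > threshold for score in scores) / len(scores)

# Made-up basin-level NNSE scores (not results from the paper).
basin_nnse = [0.42, 0.58, 0.81, 0.88, 0.93]
cdf = empirical_cdf(basin_nnse)
share_above_half = fraction_above(basin_nnse, 0.5)
```

A CDF that rises late, close to 1.0 on the score axis, indicates that most basins score highly, which is the pattern the text describes for the combined configuration.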

## IV Conclusion and Future Work

In this study, we compared LSTM and encoder\-only Transformer architectures for streamflow prediction using NWM data across CAMELS basins under two inference settings: an upstream\-only configuration reflecting PUB\-style constraints, and a network\-informed configuration that incorporates downstream hydrologic context\. The latter is not a strict PUB setting, but represents a scenario where additional network\-level information is available\.

Results show that incorporating downstream information consistently improves performance across architectures, indicating that network\-level context plays a dominant role in capturing spatial hydrologic dependencies\. While both models benefit substantially, the LSTM achieves modestly higher accuracy and more stable basin\-wise performance, suggesting that recurrent and attention\-based architectures respond differently to information scarcity and network augmentation\.

Future work will extend this framework to larger upstream–downstream networks and to observed USGS discharge data, enabling broader validation under more realistic hydrologic conditions\.

## V Acknowledgments

This research was supported by the Cooperative Institute for Research to Operations in Hydrology \(CIROH\) with funding under award NA22NWS4320003 from the NOAA Cooperative Institute Program\. The statements, findings, conclusions, and recommendations are those of the author\(s\) and do not necessarily reflect the opinions of NOAA\.

## References

- \[1\] N. Addor, A. J. Newman, N. Mizukami, and M. P. Clark (2017-10). The CAMELS data set: catchment attributes and meteorology for large-sample studies. Hydrol. Earth Syst. Sci. 21(10), pp. 5293–5313.
- \[2\] A. C. Amanambu, J. Mossa, and Y. Chen (2022-11). Hydrological drought forecasting using a deep transformer model. Water (Basel) 14(22), pp. 3611. External Links: [Document](https://dx.doi.org/10.3390/w14223611)
- \[3\] G. Coxon, N. Addor, J. P. Bloomfield, J. Freer, M. Fry, J. Hannaford, N. J. K. Howden, R. Lane, M. Lewis, E. L. Robinson, T. Wagener, and R. Woods (2020-10). CAMELS-GB: hydrometeorological time series and landscape attributes for 671 catchments in Great Britain. Earth Syst. Sci. Data 12(4), pp. 2459–2483.
- \[4\] T. V. M. do Nascimento, J. Rudlang, M. Höge, R. van der Ent, M. Chappon, J. Seibert, M. Hrachowitz, and F. Fenicia (2024-08). EStreams: an integrated dataset and catalogue of streamflow, hydro-climatic and landscape variables for Europe. Sci. Data 11(1), pp. 879.
- \[5\] F. Kratzert, D. Klotz, M. Herrnegger, A. K. Sampson, S. Hochreiter, and G. S. Nearing (2019-12). Toward improved predictions in ungauged basins: exploiting the power of machine learning. Water Resour. Res. 55(12), pp. 11344–11354.
- \[6\] F. Kratzert, G. Nearing, N. Addor, T. Erickson, M. Gauch, O. Gilon, L. Gudmundsson, A. Hassidim, D. Klotz, S. Nevo, G. Shalev, and Y. Matias (2023-01). Caravan - a global community dataset for large-sample hydrology. Sci. Data 10(1), pp. 61.
- \[7\] G. Nearing, D. Cohen, V. Dube, M. Gauch, O. Gilon, S. Harrigan, A. Hassidim, D. Klotz, F. Kratzert, A. Metzger, S. Nevo, F. Pappenberger, C. Prudhomme, G. Shalev, S. Shenzis, T. Y. Tekalign, D. Weitzner, and Y. Matias (2024-03). Global prediction of extreme floods in ungauged watersheds. Nature 627(8004), pp. 559–563. External Links: [Document](https://dx.doi.org/10.1038/s41586-024-07145-1)
- \[8\] NOAA National Water Center. NOAA National Water Model CONUS Retrospective Dataset. Registry of Open Data on AWS. Note: [https://registry.opendata.aws/nwm-archive/](https://registry.opendata.aws/nwm-archive/) \[Accessed 11-08-2025\]
- \[9\] A. A. Ramírez Molina, J. M. Frame, J. Halgren, and J. Gong (2025-06). A proof of concept for improving estimates of ungauged basin streamflow via an LSTM-based synthetic network simulation approach. Journal of Geophysical Research: Machine Learning and Computation 2(2).
- \[10\] S. P. Van, H. M. Le, D. V. Thanh, T. D. Dang, H. H. Loc, and D. T. Anh (2020-05). Deep learning convolutional neural network in rainfall–runoff modelling. J. Hydroinformatics 22(3), pp. 541–561.
- \[11\] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017). Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. External Links: [Document](https://doi.org/10.48550/arXiv.1706.03762)
- \[12\] W. Xu, Y. Jiang, X. Zhang, Y. Li, R. Zhang, and G. Fu (2020-12). Using long short-term memory networks for river flow prediction. Hydrol. Res. 51(6), pp. 1358–1376.
- \[13\] H. Yin, L. Zhao, M. Zhu, and Y. Zhang (2025-12). Runoff prediction in gauged and ungauged basins using Transformer-XAJ model. J. Hydrol. (Amst.) 662, pp. 133954. External Links: [Document](https://dx.doi.org/10.1016/j.jhydrol.2025.133954)
- \[14\] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang (2021-05). Informer: beyond efficient transformer for long sequence time-series forecasting. Proceedings of the AAAI Conference on Artificial Intelligence 35(12), pp. 11106–11115. External Links: [Document](https://dx.doi.org/10.1609/aaai.v35i12.17325)
