LakeFM: Toward a Foundation Model for Aquatic Ecosystems Using Irregular Multivariate Multi-depth Time Series Data

arXiv cs.LG Papers

Summary

LakeFM is a foundation model for aquatic systems, pre-trained on large-scale ecological datasets to forecast lake dynamics using irregular multivariate multi-depth time series data, achieving competitive performance compared to existing models.

arXiv:2606.11268v1 Announce Type: new Abstract: Understanding and forecasting lake dynamics is critical for monitoring water quality and ecosystem health across lakes and reservoirs. While machine learning methods have been recently applied to ecological time-series data, existing works assume regular sampling in time and depth, and struggle to generalize across lakes with heterogeneous variables, depths, and observation patterns. To address these limitations, we introduce LakeFM, a foundation model for aquatic systems, pre-trained on large-scale ecological datasets comprising both simulated and observed lakes. Through extensive empirical evaluation, we show that LakeFM learns meaningful representations spanning broader lake-level characteristics, and achieves competitive or often superior forecasting performance compared to existing time-series foundation and non-foundation models, while producing physically plausible predictions consistent with real-world lake dynamics.

Cached at: 06/11/26, 01:46 PM

# LakeFM: Toward a Foundation Model for Aquatic Ecosystems Using Irregular Multivariate Multi-depth Time Series Data
Source: [https://arxiv.org/html/2606.11268](https://arxiv.org/html/2606.11268)
Sepideh Fatemi (Virginia Tech, Blacksburg, VA, USA; sepidehfatemi@vt.edu), Medha Sawhney (Virginia Tech, Blacksburg, VA, USA; medha@vt.edu), Kazi Sajeed Mehrab (Virginia Tech, Blacksburg, VA, USA; ksmehrab@vt.edu), Aanish Pradhan (Virginia Tech, Blacksburg, VA, USA; aanishp01@vt.edu), Bennett J. McAfee (Grand Valley State University, Muskegon, MI, USA; bennettjmcafee@gmail.com), Emma Marchisin (University of Wisconsin - Madison, Madison, WI, USA; marchisin@wisc.edu), Arka Daw (Amazon AGI, Seattle, WA, USA; dawark@amazon.com), Robert Ladwig (Aarhus University, Aarhus, Denmark; rladwig@ecos.au.dk), Cayelan C. Carey (Virginia Tech, Blacksburg, VA, USA; cayelan@vt.edu), Paul C Hanson (University of Wisconsin - Madison, Madison, WI, USA; pchanson@wisc.edu), and Anuj Karpatne (Virginia Tech, Blacksburg, VA, USA; karpatne@vt.edu)

\(2026\)

###### Abstract\.

Understanding and forecasting lake dynamics is critical for monitoring water quality and ecosystem health across lakes and reservoirs. While machine learning methods have been recently applied to ecological time-series data, existing works assume regular sampling in time and depth, and struggle to generalize across lakes with heterogeneous variables, depths, and observation patterns. To address these limitations, we introduce LakeFM, a foundation model for aquatic systems, pre-trained on large-scale ecological datasets comprising both simulated and observed lakes. Through extensive empirical evaluation, we show that LakeFM learns meaningful representations spanning broader lake-level characteristics, and achieves competitive or often superior forecasting performance compared to existing time-series foundation and non-foundation models, while producing physically plausible predictions consistent with real-world lake dynamics. Project page: [abhilash-neog.github.io/lakefm.github.io/](https://abhilash-neog.github.io/lakefm.github.io/)

Foundation Model, Time Series, AI4Science

Journal year: 2026. Copyright: CC. Conference: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD ’26), August 09–13, 2026, Jeju Island, Republic of Korea. DOI: 10.1145/3770855.3819024. ISBN: 979-8-4007-2259-2/2026/08. CCS Concepts: Computing methodologies → Machine learning.

## 1. Introduction

Monitoring the health of inland water bodies such as lakes and reservoirs is essential for ensuring sustainable and equitable use of Earth’s freshwater reserves. Lakes are governed by rich physical and biogeochemical processes that vary across geographies and time, creating unique opportunities for machine learning (ML) methods to model their temporal evolution across depths using ecological time-series data. For example, there is a growing body of work on modeling the temperature of water in lakes (Daw et al., [2022](https://arxiv.org/html/2606.11268#bib.bib8); Jia et al., [2019](https://arxiv.org/html/2606.11268#bib.bib13); Ladwig et al., [2024](https://arxiv.org/html/2606.11268#bib.bib14)). However, modeling a single variate provides only a partial view of the complex interactions among the processes governing lake water quality, which are observed at varying depths, frequencies, subsets of variables, and levels of reliability from one site (lake) to another. While recent benchmarking efforts such as LakeBeD-US (McAfee et al., [2025](https://arxiv.org/html/2606.11268#bib.bib19)) have harmonized water quality observations across multiple monitoring programs, resulting in over 500M observations across 17 variables from 21 lakes, the resulting data are still plagued by high degrees of missing values, uneven sampling frequencies, and highly variable depth and variate coverage across sites. This sparsity and heterogeneity in lake measurements, which is intrinsic to real-world environmental monitoring, severely limits the ability of ML methods to scale to broader collections of lakes using irregular multivariate multi-depth time series data.

At the same time, the broader ML community has made significant progress in developing time series (TS) foundation models such as Chronos 2 (Ansari et al., [2025](https://arxiv.org/html/2606.11268#bib.bib2)) and MOMENT (Goswami et al., [2024](https://arxiv.org/html/2606.11268#bib.bib10)) that learn task-agnostic representations from large, heterogeneous corpora for generic time-series forecasting. However, the aquatic sciences still lack a foundation model capable of unifying information across multiple lakes and variates with irregular frequencies and depths. Moreover, most TS foundation models either focus solely on univariate signals or assume clean and densely sampled data, which are difficult to find in ecology, where data are multivariate and inherently sparse across space and time. While recent efforts (Yu et al., [2025](https://arxiv.org/html/2606.11268#bib.bib33); Willard et al., [2022](https://arxiv.org/html/2606.11268#bib.bib30), [2021](https://arxiv.org/html/2606.11268#bib.bib29)) have explored building large-scale foundation models for multiple lake systems, they are restricted to predicting a small number of variates with fixed sets of inputs at regular time scales without missing values.

Motivated by this gap, we ask the following questions: (1) Can we build a foundation model for aquatic sciences that learns generic lake processes across a broad collection of lakes and variables, while retaining site-specific nuances? (2) Can we use such a foundation model to forecast lake dynamics using any subset of variables available at a lake with irregular observations across time and depth? (3) Can we extract feature representations of lakes that capture their static and time-varying characteristics, revealing novel information about their similarity and temporal evolution at macro-system scales?

To answer these questions, we introduce LakeFM, a foundation model pre-trained on a large-scale ecological dataset containing over 1.5 million samples. The dataset mixes synthetic data from physics-based simulations (over 1000 diverse lake simulations) with real-world observations with significant sparsity (60-70% on average) from 21 lakes in the LakeBeD-US dataset (McAfee et al., [2025](https://arxiv.org/html/2606.11268#bib.bib19)). To robustly handle irregular spatio-temporal data (which is common in scientific systems), LakeFM is designed to operate on an irregular grid, unlike most temporal prediction (or time-series) models. Specifically, we model the data as a one-dimensional sequence of events or tokens, where each variate observation at each depth and time is treated as an event (we refer to it as a token throughout the paper). Every event/token is distinguished by its individual embedding, which takes into account contextual metadata in the form of temporal, variate, and depth information. Furthermore, to effectively capture both time-invariant (static lake characteristics) and time-variant (dynamic lake behavior) factors, we decouple the representation space into separate static and dynamic embeddings, and jointly optimize contrastive learning objectives with prediction losses over these spaces. Overall, LakeFM aims to be a practical step towards scalable and generalizable modeling of lake ecosystems. Our main contributions are as follows.

1. We propose LakeFM, a foundation model that can ingest irregular, multivariate multi-depth data, with competitive forecasting performance on both seen and unseen lakes while also demonstrating an emergent ability to adhere to aquatic physical laws.
2. We present novel insights about the static characteristics and temporal evolution of lakes using learned lake-specific embeddings, and highlight how LakeFM representations effectively align with different ecological axes.
3. We present case studies on forecasting performance under variate and depth masking scenarios, showing how LakeFM’s ability to handle partial inputs reveals novel insights about variable interactions in lakes.

## 2. Related Works

Time-series forecasting models, including statistical approaches and deep learning architectures such as PatchTST (Nie et al., [2022](https://arxiv.org/html/2606.11268#bib.bib21)) and iTransformer (Liu et al., [2023](https://arxiv.org/html/2606.11268#bib.bib18)), have shown strong performance on benchmark TS datasets. However, these models are domain- or dataset-specific, and hence struggle to generalize across ecosystems or variate configurations. Scientific datasets, particularly in ecology and environmental modeling, involve unique challenges: missing values, irregular sampling, and multi-resolution measurements across time and depth. Models like mTAN (Shukla and Marlin, [2021](https://arxiv.org/html/2606.11268#bib.bib25)) and ContiFormer (Chen et al., [2023](https://arxiv.org/html/2606.11268#bib.bib5)) attempt to address these issues through neural ODEs, temporal embeddings, or attention over irregular grids. However, these methods are often task-specific, rely on carefully engineered architectures, and do not scale well to large multi-lake or multi-variable ecosystems. While techniques like MissTSM (Neog et al., [2026](https://arxiv.org/html/2606.11268#bib.bib20)) provide a model-agnostic approach to handling missing values, they are not very computationally scalable.

Recent Time Series Foundation Models (TSFM) aim to generalize across diverse time-series tasks by learning from large corpora of univariate signals (examples include MOMENT (Goswami et al., [2024](https://arxiv.org/html/2606.11268#bib.bib10)), Chronos (Ansari et al., [2024](https://arxiv.org/html/2606.11268#bib.bib3)), LPTM (Prabhakar Kamarthi and Prakash, [2024](https://arxiv.org/html/2606.11268#bib.bib22)), etc.) or multivariate signals (examples include Chronos 2 (Ansari et al., [2025](https://arxiv.org/html/2606.11268#bib.bib2)) and Toto (Cohen et al., [2024](https://arxiv.org/html/2606.11268#bib.bib6))). However, these models have certain limitations. Crucially, most current TSFMs operate under the assumption that data is fully observed or regularly sampled. While Chronos 2 can handle some sparsity, it remains ill-equipped for the highly irregular sampling intervals common in scientific datasets. This limitation creates a heavy dependency on external imputation methods. In scientific domains where data is significantly sparse, specialized imputation models like SAITS (Du et al., [2023](https://arxiv.org/html/2606.11268#bib.bib9)) or CSDI (Tashiro et al., [2021](https://arxiv.org/html/2606.11268#bib.bib28)) often suffer from poor performance due to a lack of sufficient training signals, which subsequently degrades the accuracy of downstream forecasting models. Our approach overcomes this by treating each time, variate, and depth observation as a token, converting the multivariate multi-depth data into a list of tuples and thereby facilitating model training under partial observations and irregular sampling.

## 3. Methodology

Background and Notations: Let $\mathcal{D}=\{\mathcal{D}_{1},\dots,\mathcal{D}_{N}\}$ denote a collection of $N$ lakes, where each lake $\mathcal{D}_{i}$ contains a multivariate, multi-depth time series: $\mathcal{D}_{i}=\left\{(\mathbf{x}_{t}^{(i)},\mathbf{m}_{t}^{(i)},\ell_{i})\right\}_{t=1}^{T_{i}}$. Here, $\mathbf{x}_{t}^{(i)}\in\mathbb{R}^{V\times D}$ represents observations of $V$ variables across $D$ depth levels for lake $i$ at time $t$. The time intervals between consecutive steps $t$ and $t+1$ are irregular, varying from one lake to another (e.g., from daily to bi-weekly and even monthly observations). Further, the binary mask $\mathbf{m}_{t}^{(i)}\in\{0,1\}^{V\times D}$ indicates missing values across variables and depths at time $t$, and $\ell_{i}$ denotes a categorical site-specific identifier of every lake, used for contrastive training.

We formulate the task of probabilistic forecasting for modeling lake systems as follows. Given a history of observations across a set of variables over $L$ irregular timesteps of a lake, the goal is to model the conditional distribution of all its lake variables over a time horizon $H$. To solve this problem, we employ an encoder-decoder framework where an encoder $f_{\theta}$ first maps the historical context $\{\mathbf{x}_{t}^{(i)}\}_{t=1}^{L}$ into a latent feature representation $\mathbf{z}_{i}\in\mathbb{R}^{d}$. This feature representation is subsequently processed by a decoder $g_{\phi}$ to generate the parameters of the future distribution of lake variables.

LakeFM Architecture: Figure [1](https://arxiv.org/html/2606.11268#S3.F1) shows the architecture of LakeFM, which operates as an encoder-decoder framework. The overall framework comprises four major components: (i) tokenization and embedding, (ii) encoder layers, (iii) static and temporal feature disentanglement strategy, and (iv) decoding and query-based forecasting strategy. We describe each of these components below.

![Refer to caption](https://arxiv.org/html/2606.11268v1/x1.png)

Figure 1. Overview of LakeFM. Left: tokenization and embedding of irregular multi-variable, multi-depth time-series data. Middle: overall model architecture, showing decoupled static and dynamic representation learning with joint forecasting and contrastive objectives. Right: the decoder.

### 3.1. Input Tokenization and Embedding

To handle the heterogeneous and irregular nature of ecological data available in lake ecosystems, we adopt a token-centric representation as described in Figure [1](https://arxiv.org/html/2606.11268#S3.F1)(A). Unlike regular grid-based approaches that require fixed depth levels, we treat every individual measurement, whether from a specific depth in the water column (2D variables) or a surface meteorological driver (1D variables), as a distinct observation tuple containing time-variate-depth information. This allows our model to naturally ingest data with varying time intervals, subsets of variables, and depth resolutions without imputation or explicit handling of missing data.

Tokenization: We represent the raw time series data for a specific lake $i$ as a set of observations $\mathcal{O}_{i}$, where each observation $o_{k}\in\mathcal{O}_{i}$ is defined as a tuple $o_{k}=(t_{k},v_{k},d_{k},x_{k})$. Here, $t_{k}$ is the absolute timestamp, $v_{k}\in\mathcal{V}$ is the variable identifier (e.g., temperature, dissolved oxygen (DO), or air temperature), $d_{k}\in\mathbb{R}$ is the continuous depth measurement (where $d_{k}=0$ denotes surface/meteorological variables), and $x_{k}\in\mathbb{R}$ is the measured scalar value. Each observation $o_{k}\in\mathcal{C}_{L}$ is treated as a distinct token, where $\mathcal{C}_{L}$ is the context set comprising all observations over the last $L$ timesteps. To form the input sequence $S$, we flatten the set $\mathcal{C}_{L}$ and sort the tokens first by variable ID, then by depth, and finally by absolute time within each $(\text{variable},\text{depth})$ series: $S=[o_{1},o_{2},\dots,o_{M}]$, where $M$ is the total number of observed triplets $(t,v,d)$ across the $L$ timesteps.
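As an illustration of the tokenization step, the following minimal Python sketch flattens a small gridded record into observation tuples and applies the (variable, depth, time) ordering described above. The function name `tokenize`, the nested-list layout, and the convention that a mask entry of 1 marks an observed value are assumptions made for this example, not details taken from the LakeFM implementation.

```python
def tokenize(times, var_ids, depths, values, mask):
    """Flatten a (time x variable x depth) grid into observation tokens.

    values[ti][vi][di] is a measurement; mask[ti][vi][di] == 1 means it was
    observed (assumed convention). Returns (t, v, d, x) tuples sorted by
    variable ID, then depth, then time.
    """
    tokens = []
    for ti, t in enumerate(times):
        for vi, v in enumerate(var_ids):
            for di, d in enumerate(depths):
                if mask[ti][vi][di]:
                    tokens.append((t, v, d, values[ti][vi][di]))
    # Sort key mirrors the ordering in the text: variable, depth, time.
    tokens.sort(key=lambda o: (o[1], o[2], o[0]))
    return tokens


# Toy record: 3 irregular timestamps, 2 variables, 2 depths, 4 missing cells.
times = [0.0, 1.0, 4.0]
var_ids = [0, 1]
depths = [0.5, 2.0]
values = [
    [[10.0, 9.0], [8.0, 7.5]],
    [[10.5, 0.0], [0.0, 7.0]],
    [[0.0, 9.5], [8.2, 0.0]],
]
mask = [
    [[1, 1], [1, 1]],
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
]
tokens = tokenize(times, var_ids, depths, values, mask)
```

Only the 8 observed cells become tokens, so the sequence length follows the data rather than the grid size.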

Embedding Layer: To map the discrete observations into a latent space suitable for the Transformer backbone, we construct a composite embedding $e_{k}\in\mathbb{R}^{d_{\text{model}}}$ for each token $k$ (see Figure [1](https://arxiv.org/html/2606.11268#S3.F1)(B)). This is formed by concatenating embeddings for time, depth, variable identity, and the scalar value as follows:

$$e_{k}=[E_{\text{time}}(t_{k})\,;\,E_{\text{depth}}(d_{k})\,;\,E_{\text{var}}(v_{k})\,;\,E_{\text{val}}(x_{k})]. \tag{1}$$

Rather than summing these embeddings with the input token representation, we concatenate them to form the final token representation, as we empirically found that concatenation leads to better performance. Specifically, concatenation preserves the semantic distinction between different embedding types, allowing the model to attend over heterogeneous subspaces independently, while summation tends to blur these roles in a shared latent space. We describe each of the four embeddings in the following.

Time Embedding ($E_{\text{time}}$): We utilize sinusoidal positional encodings to represent continuous time. For a timestamp $t_{k}$, the $j$-th dimension is given by:

$$E_{\text{time}}(t_{k})_{j}=\begin{cases}\sin(2\pi t_{k}/10000^{j/d_{\text{time}}}),&j\text{ is even}\\ \cos(2\pi t_{k}/10000^{(j-1)/d_{\text{time}}}),&j\text{ is odd}\end{cases} \tag{2}$$

where $t_{k}\in[0,1]$ is the normalized time (or day of the year).
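To make the indexing in Eq. 2 concrete, here is a small sketch of the sinusoidal time encoding in plain Python. It assumes that the dimension index $j$ starts at 0 and that the input is already normalized to $[0,1]$; the function name `time_embedding` is hypothetical.

```python
import math


def time_embedding(t, d_time):
    """Sinusoidal encoding of a normalized timestamp t in [0, 1] (Eq. 2).

    Even dimensions use sine, odd dimensions use cosine with the frequency
    of the preceding even dimension. Indexing from j = 0 is an assumption.
    """
    emb = []
    for j in range(d_time):
        if j % 2 == 0:
            emb.append(math.sin(2 * math.pi * t / 10000 ** (j / d_time)))
        else:
            emb.append(math.cos(2 * math.pi * t / 10000 ** ((j - 1) / d_time)))
    return emb
```

For example, `time_embedding(0.0, 4)` returns `[0.0, 1.0, 0.0, 1.0]`, since every sine term vanishes and every cosine term equals one at $t_k=0$.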

Depth Embedding ($E_{\text{depth}}$): Depth embeddings are generated using Fourier feature encoding, where every scalar depth $d_{k}\in\mathbb{R}$ is projected to a vector of sinusoidal components. Specifically, we apply $K$ frequency bands to produce:

$$E_{\text{depth}}(d_{k})=\big[\,d_{k}\,;\,\sin(\omega_{0}d_{k}),\cos(\omega_{0}d_{k}),\dots,\sin(\omega_{K-1}d_{k}),\cos(\omega_{K-1}d_{k})\big], \tag{3}$$

where $\omega_{k}=\frac{2^{k}\pi}{\mathrm{max\_res}}$ for the $k=0,\dots,K-1$ frequency bands, and $\mathrm{max\_res}$ is the maximum value of the input, used to scale the frequencies. The raw input $d_{k}$ is optionally concatenated (when enabled in the configuration). This Fourier feature vector is then linearly projected to the model depth embedding space via a learned matrix $\mathbf{W}_{\text{depth}}\in\mathbb{R}^{d_{\text{depth}}\times(1+2K)}$, yielding $\tilde{E}_{\text{depth}}(d_{k})=\mathbf{W}_{\text{depth}}\,E_{\text{depth}}(d_{k})$.
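The Fourier depth features of Eq. 3 and the subsequent linear projection can be sketched as follows. The helper names (`fourier_depth_features`, `project`) and the `include_raw` flag standing in for the configuration switch are assumptions for illustration; the weight matrix here is a plain nested list rather than a trained parameter.

```python
import math


def fourier_depth_features(d, num_bands, max_res, include_raw=True):
    """Fourier features of a scalar depth d with K = num_bands bands (Eq. 3)."""
    feats = [d] if include_raw else []
    for k in range(num_bands):
        omega = (2 ** k) * math.pi / max_res  # omega_k = 2^k * pi / max_res
        feats.append(math.sin(omega * d))
        feats.append(math.cos(omega * d))
    return feats


def project(W, feats):
    """Linear projection W @ feats, with W given as a list of rows."""
    return [sum(w * f for w, f in zip(row, feats)) for row in W]
```

With the raw depth included, the feature vector has $1+2K$ entries, which matches the column count of $\mathbf{W}_{\text{depth}}$.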

Variate Embedding ($E_{\text{var}}$): We employ a learnable lookup table (or embedding layer) to project the categorical variable identifier $v_{k}$ into a vector $E_{\text{var}}(v_{k})\in\mathbb{R}^{d_{\text{var}}}$.

Value Embedding ($E_{\text{val}}$): To embed the continuous scalar measurement $x_{k}$, we utilize a SwiGLU-style gated projection, similar to the approach used in Shi et al. ([2024](https://arxiv.org/html/2606.11268#bib.bib24)). This mechanism allows the model to non-linearly modulate the importance of the scalar input. Formally, let $\mathbf{W}_{a},\mathbf{W}_{b}\in\mathbb{R}^{1\times d_{\text{val}}}$ be learnable weight matrices and $\mathbf{b}_{a},\mathbf{b}_{b}\in\mathbb{R}^{d_{\text{val}}}$ be bias vectors. The value embedding is then computed as:

$$E_{\text{val}}(x_{k})=\text{Swish}(x_{k}\mathbf{W}_{a}+\mathbf{b}_{a})\odot(x_{k}\mathbf{W}_{b}+\mathbf{b}_{b}), \tag{4}$$

where $\odot$ denotes the element-wise Hadamard product, $\text{Swish}(\mathbf{z})=\mathbf{z}\odot\sigma(\mathbf{z})$ is the Swish activation function, and $\sigma$ is the logistic sigmoid.
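A scalar-input version of the gated projection in Eq. 4 is short enough to write out directly. The sketch below is one possible reading, with the $1\times d_{\text{val}}$ weight matrices represented as flat lists; the function names and the toy weights in the usage note are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(z):
    """Swish(z) = z * sigmoid(z)."""
    return z * sigmoid(z)


def value_embedding(x, W_a, b_a, W_b, b_b):
    """SwiGLU-style embedding of a scalar x (Eq. 4).

    W_a, W_b hold the single row of each 1 x d_val weight matrix;
    b_a, b_b are the matching bias vectors.
    """
    return [
        swish(x * wa + ba) * (x * wb + bb)
        for wa, ba, wb, bb in zip(W_a, b_a, W_b, b_b)
    ]
```

Calling `value_embedding(0.0, [1.0, 2.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0])` yields a zero vector because the Swish gate is closed when its pre-activation is zero, which shows how the gate branch modulates the linear branch.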

### 3.2. Encoder Layers

The encoder layers of LakeFM are a stack of Transformer blocks that transform a sequence of input tokens of length $S$ into feature representation outputs $\mathbf{H}\in\mathbb{R}^{S\times d}$. To handle the irregularity in data, we adopt Rotary Position Embeddings (RoPE) (Su et al., [2024](https://arxiv.org/html/2606.11268#bib.bib27)) for modeling relative temporal dependencies among tokens and a binary attention bias, similar to the approach used in Woo et al. ([2024](https://arxiv.org/html/2606.11268#bib.bib31)), to explicitly differentiate intra- and inter-variate interactions. Specifically, since standard self-attention treats all token interactions uniformly, we introduce a learnable additive bias to the attention logits to inject structural knowledge regarding variable identity. Let $v_{i}$ denote the variable identifier for token $i$. We define a binary indicator, $\delta_{ij}=\mathbb{I}[v_{i}=v_{j}]$, which equals 1 if tokens $i$ and $j$ belong to the same variable, and 0 otherwise. For every attention head $h$, we learn two scalar bias terms: $b_{\text{intra}}^{(h)}$, for same-variable interactions, and $b_{\text{inter}}^{(h)}$, for cross-variable interactions.
The specific bias $b_{ij}^{(h)}$ for a query-key pair is determined by:

$$b_{ij}^{(h)}=\delta_{ij}\cdot b_{\text{intra}}^{(h)}+(1-\delta_{ij})\cdot b_{\text{inter}}^{(h)}.$$

The attention score $s_{ij}^{(h)}$ is then computed as:

$$s_{ij}^{(h)}=\frac{\tilde{\mathbf{q}}_{i}^{(h)\top}\tilde{\mathbf{k}}_{j}^{(h)}}{\sqrt{d_{h}}}+b_{ij}^{(h)}+m_{ij},$$

where $\tilde{\mathbf{q}}$ and $\tilde{\mathbf{k}}$ denote the RoPE-rotated queries and keys, $d_{h}$ is the head dimension, and $m_{ij}$ represents the padding mask. This formulation allows LakeFM to learn distinct attention patterns for temporal dynamics (intra-variate) versus variable correlations (inter-variate) within distinct heads, while preserving the relative temporal information via RoPE.
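The biased attention score can be illustrated for a single head with the sketch below. It assumes the queries and keys passed in have already been RoPE-rotated, and it realizes the padding mask $m_{ij}$ as an additive $-\infty$ on padded keys, which is a common convention but an assumption here; all names are hypothetical.

```python
import math


def attention_logits(q, k, var_ids, b_intra, b_inter, key_is_real=None):
    """Single-head logits s_ij = q_i . k_j / sqrt(d_h) + b_ij + m_ij.

    q, k: lists of (already rotated) query/key vectors, one per token.
    var_ids: variable identifier of each token.
    key_is_real: optional list of booleans; False marks a padded key.
    """
    d_h = len(q[0])
    logits = []
    for i in range(len(q)):
        row = []
        for j in range(len(k)):
            dot = sum(a * b for a, b in zip(q[i], k[j]))
            # Same-variable pairs get b_intra, cross-variable pairs b_inter.
            bias = b_intra if var_ids[i] == var_ids[j] else b_inter
            pad = 0.0 if key_is_real is None or key_is_real[j] else float("-inf")
            row.append(dot / math.sqrt(d_h) + bias + pad)
        logits.append(row)
    return logits
```

Because the two biases are scalars per head, they shift whole blocks of the logit matrix, so a head with a large $b_{\text{intra}}^{(h)}$ tends to focus on the history of the same variable.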

### 3.3. Static & Temporal Feature Disentanglement

A key objective of LakeFM is to simultaneously model the temporal dynamics of lake state variables while capturing the intrinsic, time-invariant signatures unique to each lake site. To achieve this, we introduce two parallel linear projectors that map the shared encoder output $\mathbf{H}\in\mathbb{R}^{S\times d}$ into specialized subspaces as follows.

We consider a static feature projector $\mathbf{W}_{\text{stat}}\in\mathbb{R}^{d_{\text{stat}}\times d}$ and a temporal feature projector $\mathbf{W}_{\text{temp}}\in\mathbb{R}^{d_{\text{temp}}\times d}$, defined as $\mathbf{Z}_{\text{stat}}=\mathbf{H}\mathbf{W}_{\text{stat}}^{\top}$ and $\mathbf{Z}_{\text{temp}}=\mathbf{H}\mathbf{W}_{\text{temp}}^{\top}$. This formulation creates a learnable soft partition of the encoded feature outputs. The static representations $\mathbf{Z}_{\text{stat}}$ are subsequently aggregated via mean pooling to generate a point-wise representation. Specifically, given token embeddings $\mathbf{Z}^{\text{stat}}_{(i)}=[\mathbf{z}^{(i)}_{1},\ldots,\mathbf{z}^{(i)}_{S}]\in\mathbb{R}^{S\times d_{\text{stat}}}$ and mask $\mathbf{m}^{(i)}\in\{0,1\}^{S}$, we compute point-wise representations $\mathbf{z}_{i}$ as:

$$\begin{split}\bar{\mathbf{z}}^{(i)}&=\frac{1}{\sum_{t=1}^{S}m^{(i)}_{t}}\sum_{t=1}^{S}m^{(i)}_{t}\,\mathbf{z}^{(i)}_{t}\in\mathbb{R}^{d_{\text{stat}}},\\ \mathbf{z}_{i}&=g_{\text{proj}}(\bar{\mathbf{z}}^{(i)}),\end{split} \tag{5}$$

where $g_{\text{proj}}$ is a small projection head, and $\mathbf{z}_{i}$ is the final representation used during contrastive training. Conversely, the temporal representations $\mathbf{Z}_{\text{temp}}$ retain temporal information and serve as the input to the decoding and forecasting heads. By not enforcing explicit orthogonality, the network retains the flexibility to share information between static and temporal subspaces where beneficial, while allowing task-specific losses to drive the specialization of features.
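The masked mean pooling in the first line of Eq. 5 can be sketched in a few lines of Python. The function name is hypothetical, the mask is assumed to be 1 for real tokens and 0 for padding, and the projection head $g_{\text{proj}}$ is omitted.

```python
def masked_mean_pool(Z, mask):
    """Average the rows of Z whose mask entry is 1 (first line of Eq. 5).

    Z: list of S token vectors of length d_stat.
    mask: list of S entries in {0, 1}; 0 marks padding (assumed convention).
    """
    total = sum(mask)
    if total == 0:
        raise ValueError("at least one token must be unmasked")
    dim = len(Z[0])
    return [sum(m * z[c] for z, m in zip(Z, mask)) / total for c in range(dim)]
```

For instance, `masked_mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])` returns `[2.0, 3.0]`; the padded third token does not affect the pooled vector.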

### 3.4. Query-Based Forecasting Strategy

Traditional time-series decoders typically operate on a regular grid utilizing a fixed feed-forward network, which is ill-suited for the irregular sampling inherent in aquatic ecosystems. To address this, LakeFM adopts a query-based forecasting strategy (see Figure [1](https://arxiv.org/html/2606.11268#S3.F1)(C)) described in the following.

We define a set of target queries $\mathbf{Q}_{\text{target}}=\{(t_{k},v_{k},d_{k})\}_{k=1}^{K}$ representing the specific spatiotemporal points where predictions are required. Here, the timestamps $t_{k}$ of the target queries continue along the same time axis as the timestamps of the input tokens. However, unlike encoder inputs, these queries do not contain observed scalar values. We generate target embeddings $E_{\text{target}}$ using the same embedding layers used in the encoder, effectively acting as query prompts: “What is the value of variable $v$ at depth $d$ and time $t$?”

The decoder is a stack of Transformer blocks operating on target query embeddings, built from the target time, variate, and depth metadata and defined as $\mathbf{q}_{k}=[E_{\text{time}}(t_{k});E_{\text{var}}(v_{k});E_{\text{depth}}(d_{k})]\in\mathbb{R}^{d_{\text{query}}}$ and $\tilde{\mathbf{q}}_{k}=\mathbf{W}_{q}\mathbf{q}_{k}\in\mathbb{R}^{d_{\text{temp}}}$, with $\mathbf{W}_{q}\in\mathbb{R}^{d_{\text{temp}}\times d_{\text{query}}}$. Each layer first applies self-attention over the target tokens (with padding masks) followed by a feed-forward network, and then applies cross-attention from the target queries to the encoder’s temporal representations (historical context), again followed by a feed-forward network. The output $\mathbf{U}$ is defined as:

$$\mathbf{U}=\texttt{Dec}(\mathbf{Q},\mathbf{Z}_{\text{temp}})\in\mathbb{R}^{S_{t}\times d_{\text{temp}}},$$

where $S_{t}$ is the number of target tokens and $\mathbf{Q}$ is a stack of all target queries. Both self-attention and cross-attention layers use the same structured attention as the encoder (variable-wise bias and temporal projections), enabling the decoder to integrate dependencies among prediction points while selectively retrieving relevant information from the encoded history. This formulation enables flexible inference at arbitrary unobserved points. Consequently, the output grid can be user-defined (either regular or irregular) with variable context and horizon lengths.
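Since a target query carries only metadata, a forecasting grid amounts to a list of (time, variable, depth) triples. The sketch below builds such a list for a user-chosen set of future timestamps; the helper name `build_target_queries` is hypothetical, and ordering the queries by variable, depth, and then time (mirroring the encoder input) is an assumption rather than a documented requirement.

```python
def build_target_queries(future_times, var_ids, depths):
    """Return (t, v, d) query triples for every requested combination.

    future_times may be regularly or irregularly spaced; no observed value
    is attached to a query.
    """
    return [(t, v, d) for v in var_ids for d in depths for t in future_times]


# Irregular horizon: three future timestamps, two variables, two depths.
queries = build_target_queries([11.0, 12.5, 20.0], [0, 1], [0.5, 2.0])
```

An irregular output grid does not need to be a full Cartesian product; any hand-assembled list of triples would serve the same role.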

Probabilistic Prediction: To effectively capture predictive uncertainty and model the heavy-tailed noise distributions often observed in environmental data, we employ a probabilistic output head as follows. We map the final decoder representations via a feed-forward network to the parameters of a Student-$t$ distribution: $\Theta=(\mu,\sigma,\nu)$, where $\mu$ is the location, $\sigma$ is the scale, and $\nu$ represents the degrees of freedom. Specifically, we compute $\boldsymbol{\mu}=\mathbf{W}_{\mu}^{\top}\mathbf{U}+\mathbf{b}_{\mu}\in\mathbb{R}^{S_{t}}$ and $\boldsymbol{\sigma}=\texttt{softplus}(\mathbf{W}_{\sigma}^{\top}\mathbf{U}+\mathbf{b}_{\sigma})+\epsilon\in\mathbb{R}^{S_{t}}$, where $\epsilon>0$. Crucially, rather than fixing the distribution or assuming normality, we explicitly learn the degrees of freedom of the distribution, $\nu$, as a dynamic parameter. This grants LakeFM the flexibility to adaptively transition between heavy-tailed regimes (low $\nu$) and Gaussian-like regimes (as $\nu\to\infty$), which is essential for modeling the heterogeneous behaviors of scientific variables. Specifically, for a given target token $k$ with variate $v_{k}$, we compute

$$\nu_{k}=\texttt{softplus}\big(\underbrace{f_{\text{base}}(E_{\text{var}}(v_{k}))}_{b_{v_{k}}}+\underbrace{f_{\text{temp}}(\mathbf{u}_{k})}_{a_{k}}\big)+c, \tag{6}$$

where $E_{\text{var}}(v_{k})$ is the variate embedding, $f_{\text{base}}$ and $f_{\text{temp}}$ are small linear heads, $\mathbf{u}_{k}$ is the decoder output, and $c>2$ is an empirically chosen constant for numerical stability. The term $a_{k}$ represents a per-token, non-variate-specific $\nu$ that can be noisy and vary by time and depth for the same variable, whereas $b_{v_{k}}$ represents a variate-specific $\nu$, which learns per-variable uncertainty characteristics but does not adapt to different conditions (e.g., temperature at the surface vs. at the bottom). Hence, we consider a sum of the base $\nu$ ($b_{v_{k}}$) and the context-aware refinement/adjustment ($a_{k}$) in our formulation (Eq. [6](https://arxiv.org/html/2606.11268#S3.E6)).
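The positivity constraints on $\sigma$ and $\nu$ can be sketched as follows. The two linear heads of Eq. 6 are replaced by their scalar outputs `b_v` and `a_k`, and the default values `c=2.01` and `eps=1e-6` are placeholders chosen for this example only; the source states just that $c>2$ and $\epsilon>0$.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def scale(raw, eps=1e-6):
    """sigma = softplus(raw) + eps, so the scale is strictly positive."""
    return softplus(raw) + eps


def degrees_of_freedom(b_v, a_k, c=2.01):
    """nu_k = softplus(b_v + a_k) + c (Eq. 6).

    b_v: variate-specific base term (output of f_base).
    a_k: per-token context term (output of f_temp).
    c: lower bound on nu; 2.01 is a placeholder value.
    """
    return softplus(b_v + a_k) + c
```

Since softplus is non-negative, $\nu_k$ never falls below $c$, and a Student-$t$ distribution with $\nu>2$ has finite variance.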

Loss Functions: LakeFM is pre-trained to jointly optimize a probabilistic forecasting objective and a lake-wise contrastive objective. For the forecasting component, we minimize the negative log-likelihood of the ground-truth values under the predicted Student-$t$ distributions. Let $\mathcal{I}$ denote the set of indices $(k,t)$ for all valid (unmasked) target tokens. Given the predicted parameters $(\mu_{t}^{(k)},\sigma_{t}^{(k)},\nu_{t}^{(k)})$ from the projection head, the forecasting loss is:

$$\mathcal{L}_{\text{forecast}}=-\frac{1}{|\mathcal{I}|}\sum_{(k,t)\in\mathcal{I}}\log\mathcal{T}\big(y_{t}^{(k)}\mid\mu_{t}^{(k)},\sigma_{t}^{(k)},\nu_{t}^{(k)}\big).\tag{7}$$
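The forecasting loss in Eq. 7 can be sketched with the standard location-scale Student-$t$ log-density. This is an illustrative sketch, not the authors' code: the function names and the flat token indexing (one index per $(k,t)$ pair, with a boolean `valid` mask) are assumptions.

```python
import math

def student_t_logpdf(y, mu, sigma, nu):
    # Log-density of a location-scale Student-t distribution.
    z = (y - mu) / sigma
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - (nu + 1) / 2 * math.log1p(z * z / nu))

def forecast_nll(targets, params, valid):
    """Mean negative log-likelihood over valid (unmasked) target tokens.

    targets[i] is the observed value, params[i] = (mu, sigma, nu),
    and valid[i] marks whether token i belongs to the index set I.
    """
    idx = [i for i, ok in enumerate(valid) if ok]
    if not idx:
        return 0.0
    return -sum(student_t_logpdf(targets[i], *params[i]) for i in idx) / len(idx)
```

With $\nu=1$ the density reduces to a Cauchy distribution, and for very large $\nu$ it approaches a Gaussian, which mirrors the heavy-tailed and Gaussian-like regimes described above.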
To encourage lake-specific representations, we adopt a contrastive learning objective. For each batch of $B$ samples, we obtain the corresponding representations $\{\mathbf{z}_{1},\dots,\mathbf{z}_{B}\}$ and lake identifiers $\{\ell_{1},\dots,\ell_{B}\}$. We treat samples from the same lake as positives and those from different lakes as negatives. Each representation is $\ell_{2}$-normalized, and we construct a weight matrix $w_{ij}=\mathbf{1}[\ell_{i}=\ell_{j}]$ that encodes lake-wise positives. The contrastive loss uses a weighted InfoNCE formulation with temperature $\tau$:

$$\begin{gathered}\mathcal{L}^{(i)}=-\sum_{j}w_{ij}\left(\frac{\mathbf{z}_{i}^{\top}\mathbf{z}_{j}}{\tau}-\log\sum_{k}\exp(\mathbf{z}_{i}^{\top}\mathbf{z}_{k}/\tau)\right)\Big/\sum_{j}w_{ij},\qquad i=1,\dots,B,\\ \mathcal{L}_{\text{contrast}}=\frac{1}{B}\sum_{i=1}^{B}\mathcal{L}^{(i)}.\end{gathered}\tag{10}$$
The final pre-training objective combines forecasting and contrastive learning: $\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{forecast}}+\lambda_{t}\,\mathcal{L}_{\text{contrast}}$, where $\lambda_{t}$ is a time-varying weight. In our implementation, $\lambda_{t}$ is obtained by combining (i) a short warmup schedule over the first few epochs and (ii) an adaptive scaling rule that keeps the magnitude of the contrastive term comparable to the forecasting loss.
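The lake-wise contrastive term in Eq. 10 can be sketched as below. The sketch follows the equation literally, so the sums over $j$ and $k$ include the anchor itself ($w_{ii}=1$); whether the authors exclude self-pairs is not stated. The function names and the default `tau=0.1` are assumptions, and the $\lambda_{t}$ schedule is not reproduced because its exact form is not given.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
    return [x / norm for x in v]

def lake_contrastive_loss(reps, lake_ids, tau=0.1):
    """Weighted InfoNCE where samples sharing a lake id are positives."""
    z = [l2_normalize(v) for v in reps]
    batch = len(z)
    total = 0.0
    for i in range(batch):
        logits = [sum(a * b for a, b in zip(z[i], z[k])) / tau for k in range(batch)]
        m = max(logits)  # log-sum-exp with max subtraction for stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        positives = [j for j in range(batch) if lake_ids[j] == lake_ids[i]]
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
    return total / batch
```

The loss is lower when representations of the same lake point in similar directions, which is the behavior the objective is meant to reward.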

## 4. Experimental Setup

Dataset. LakeFM is pretrained on a mixture of real-world and simulation datasets. The real-world data come from the LakeBeD-US dataset (McAfee et al., [2025](https://arxiv.org/html/2606.11268#bib.bib19)), which consists of 500 million unique observations spanning 21 lakes across the United States and exhibits significant sparsity (60-70% on average). The simulation datasets are of two types: (a) WQHanson Simulations, consisting of 4 simulation lakes generated using the process-based water quality model (Hanson et al., [2023](https://arxiv.org/html/2606.11268#bib.bib11)), and (b) FCR Simulations, consisting of 1000 simulations generated using the GLM-AED process-based model (Hipsey et al., [2019](https://arxiv.org/html/2606.11268#bib.bib12)). Please refer to Appendix [A](https://arxiv.org/html/2606.11268#A1) for more details on the datasets.

Pretraining and Evaluation Setup. We partition the LakeBeD-US data into (a) an In-Distribution (ID) set consisting of 15 lakes and (b) an Out-of-Distribution (OOD) set comprising 6 lakes. LakeFM is pretrained on the ID lakes (using the first 70% of each lake's data), together with a subset of simulation lakes. We evaluate the model under two settings: (a) In-Distribution (ID) evaluation, where we test on the final 20% (in time) of held-out data from each ID lake, and (b) zero-shot generalization, where we evaluate the model on the six entirely unseen OOD lakes. Please refer to Appendix [A.1](https://arxiv.org/html/2606.11268#A1.SS1) for more details on the ID and OOD set partitioning.
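The per-lake temporal split can be sketched as below. This is a hedged illustration: it assumes the percentages apply to observation counts after sorting by time, and it returns the portion between the training and test segments separately because the text does not say how that portion is used. The function name and arguments are hypothetical.

```python
def temporal_split(timestamps, train_pct=70, test_pct=20):
    """Split observation indices of one lake chronologically.

    Returns (train, middle, test) index lists: the earliest train_pct percent,
    the latest test_pct percent, and whatever lies in between.
    """
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n = len(order)
    n_train = n * train_pct // 100  # integer arithmetic avoids float rounding
    n_test = n * test_pct // 100
    train = order[:n_train]
    middle = order[n_train:n - n_test]
    test = order[n - n_test:]
    return train, middle, test
```

Splitting in time rather than at random keeps every test observation strictly later than the training data of the same lake.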

Baselines. We evaluate LakeFM against three primary classes of baselines: (a) Time-Series Foundation Models (TSFMs), including two multivariate forecasting models, Chronos 2 (Ansari et al., [2025](https://arxiv.org/html/2606.11268#bib.bib2)) and MOIRAI (Woo et al., [2024](https://arxiv.org/html/2606.11268#bib.bib31)), and two univariate models, LPTM (Prabhakar Kamarthi and Prakash, [2024](https://arxiv.org/html/2606.11268#bib.bib22)) and MOMENT (Goswami et al., [2024](https://arxiv.org/html/2606.11268#bib.bib10)); (b) a non-foundation (local) model, iTransformer (Liu et al., [2023](https://arxiv.org/html/2606.11268#bib.bib18)); and (c) irregularly sampled time-series (IMTS) models, HyperIMTS (Li et al., [2025](https://arxiv.org/html/2606.11268#bib.bib17)) and ReIMTS (Li et al., [2026](https://arxiv.org/html/2606.11268#bib.bib16)). This selection ensures a comprehensive comparison against general-purpose pretrained models, local forecasting models, and forecasting models that inherently handle irregularly sampled data. Detailed descriptions of the baseline and LakeFM implementations are provided in Appendix [B](https://arxiv.org/html/2606.11268#A2).

## 5. Results and Discussions

### 5.1. Comparing Forecasting Performance

Figure [2](https://arxiv.org/html/2606.11268#S5.F2) compares the overall lake-wise MSE (across all variates) of LakeFM and non-IMTS baselines for five In-Distribution (ID) lakes and five Out-of-Distribution (OOD) lakes (see Table [4](https://arxiv.org/html/2606.11268#A1.T4) in Appendix [A.1](https://arxiv.org/html/2606.11268#A1.SS1) for the abbreviated lake names used throughout the paper). In the ID setting, LakeFM consistently shows the lowest MSE across all lakes, while baselines such as iTransformer show high variability on BM and GL4. On the OOD lakes, LakeFM shows the best zero-shot performance on all lakes except TR. Note that the performance of iTransformer varies widely across the OOD lakes, since it relies only on local data from a specific lake for training and, unlike LakeFM and other foundation models, does not transfer knowledge across lakes. Tables [8](https://arxiv.org/html/2606.11268#A4.T8) and [9](https://arxiv.org/html/2606.11268#A4.T9) in Appendix [D](https://arxiv.org/html/2606.11268#A4) provide a detailed comparison of LakeFM and baselines for every variate-lake combination. While baselines perform better on some variate-lake combinations, LakeFM achieves the best overall rank of 2.0 across all OOD lakes and 2.03 across all ID lakes in terms of lake-wise MSE. Figure [3](https://arxiv.org/html/2606.11268#S5.F3) shows an example time series of the Chlorophyll-a variable for lake BM, comparing the test predictions of LakeFM with baselines.
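An overall rank of this kind can be computed by ranking models within each lake and averaging the ranks. The sketch below is one plausible reading, not the authors' exact procedure (which may aggregate over variate-lake pairs and handle ties differently). The model names and MSE values in `toy_mse` are made up purely for illustration and are not results from the paper.

```python
def average_ranks(mse_by_lake):
    """mse_by_lake: {lake: {model: mse}} -> {model: mean rank}, rank 1 = lowest MSE.

    Ties are broken by dictionary order, which is an assumption of this sketch.
    """
    ranks = {}
    for scores in mse_by_lake.values():
        ordered = sorted(scores, key=scores.get)
        for rank, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(rank)
    return {model: sum(r) / len(r) for model, r in ranks.items()}

# Hypothetical numbers, for illustration only.
toy_mse = {
    "lake_1": {"model_a": 0.9, "model_b": 1.2, "model_c": 1.5},
    "lake_2": {"model_a": 1.1, "model_b": 1.0, "model_c": 1.4},
}
toy_ranks = average_ranks(toy_mse)
```

Averaging ranks rather than raw MSE keeps lakes with large error scales from dominating the comparison.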

A key practical difference between LakeFM and many forecasting baselines is that LakeFM does not require any imputation of lake data, as it works directly with irregular multivariate time-series data, which are common in many ecological applications. In contrast, standard TSFM baselines rely on imputation-based pre-processing to transform data onto regular grids, which can be unstable when data are sparse. LakeFM thus provides a new paradigm for sharing information across disparate lakes with varying forms of irregularity in time, space, and variates, going beyond the single-lake and single-variate analyses typical of previous work.

To further evaluate this imputation-free setting, we compare LakeFM with recent irregular/missing time-series (IMTS) baselines across all ID and OOD lakes. For readability, Figure [4](https://arxiv.org/html/2606.11268#S5.F4) visualizes a subset of lakes, while the complete lake-wise and variate-wise results are provided in Tables [10](https://arxiv.org/html/2606.11268#A4.T10) and [11](https://arxiv.org/html/2606.11268#A4.T11) in Appendix [D](https://arxiv.org/html/2606.11268#A4). Across the full evaluation, LakeFM obtains the best average rank among the evaluated IMTS baselines, with ranks of 1.0 and 1.27 under the OOD and ID settings, respectively. This indicates that the gains of LakeFM are not only due to avoiding standard imputation pipelines but also reflect its ability to share information across heterogeneous lakes while directly modeling irregular observations.

![Refer to caption](https://arxiv.org/html/2606.11268v1/x2.png)

![Refer to caption](https://arxiv.org/html/2606.11268v1/x3.png)

Figure 2. Overall lake-wise prediction performance (MSE) comparison between LakeFM and baselines: (top) ID lakes; (bottom) OOD lakes.

![Refer to caption](https://arxiv.org/html/2606.11268v1/x4.png)

Figure 3. An example of time-series forecasts of chlorophyll-a for lake BM at 5 m depth. The corresponding Mean Squared Error (MSE) values are: Chronos 2 (1.11), LPTM (1.23), MOMENT (1.24), and LakeFM (1.07).

![Refer to caption](https://arxiv.org/html/2606.11268v1/x5.png)

Figure 4. Overall lake-wise prediction performance (MSE) comparison between LakeFM and IMTS baselines. ID lakes: FI, TB, CRAM; OOD lakes: ME, TR, SUGG.

### 5.2. Discovering Novel Insights of Lake Variate Interactions

A unique feature of LakeFM is that it can be applied to a new lake with data available on any subset of variates and depths. This enables LakeFM to be used not only as a forecasting tool but also as a novel discovery engine for analyzing how interactions among lake variates at varying depths relate to prediction performance. We specifically study the following two questions.

#### 5.2.1. How Does Masking a Lake Variate Affect Forecasts of Other Variates?

To study this question, we conduct experiments where we mask one or more variates in the context window during inference on a target lake and observe the forecasting performance of LakeFM, which must then rely on the cross-variate interactions learned during training. Figure [5](https://arxiv.org/html/2606.11268#S5.F5) shows time-series plots of one such experiment for lake PRLA, where we mask out either Dissolved Oxygen (DO) or Water Temperature (Temp) and observe the impact on DO forecasts. Masking DO leads to a larger increase in DO MSE than masking Temp, which can be intuitively explained by the autocorrelation structure of DO. Quantitatively, DO masking results in a DO MSE of 12.57, compared to 11.00 under Temp masking. However, the uncertainty behavior reveals a more nuanced trend. Although Temp masking yields a lower MSE, it produces a higher CRPS of 2.52, and the corresponding forecast plots show narrower prediction intervals that often fail to capture the true values. This indicates that the model becomes more confident despite making inaccurate predictions. In contrast, DO masking increases the MSE but yields a lower CRPS of 1.93, with the forecast plots showing wider prediction intervals around the true trajectory. This suggests that the model responds appropriately to missing information by assigning greater uncertainty when a critical covariate is missing. Appendix [C.5](https://arxiv.org/html/2606.11268#A3.SS5) provides more visualizations of time-series forecasts with masked variables across multiple lakes, revealing similar trends. These results generate novel hypotheses about the effects of DO and Temp on the accuracy and uncertainty of lake-variable forecasts, which ecologists can verify in subsequent studies.
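To make the MSE-versus-CRPS contrast concrete, the sketch below uses the standard sample-based CRPS estimator, $\mathbb{E}|X-y|-\tfrac{1}{2}\mathbb{E}|X-X'|$. How CRPS is computed in the paper is not stated, so treating the forecast as a set of samples is an assumption, and the function name is hypothetical.

```python
def empirical_crps(samples, y):
    """Sample-based CRPS estimate for one observation y (lower is better)."""
    n = len(samples)
    # Average distance between forecast samples and the observation.
    accuracy = sum(abs(x - y) for x in samples) / n
    # Average pairwise distance between samples, which rewards honest spread.
    spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return accuracy - 0.5 * spread
```

For example, with a true value of 3.0, the narrow but off-target samples `[0.9, 1.0, 1.1]` score worse than the wide samples `[1.0, 3.0, 5.0]` centered on the truth, which is the pattern described above: overconfident intervals that miss the observation are penalized more than wider intervals that cover it.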

To further quantify the interactions between lake variates $(i,j)$ at any lake, we conduct two experiments. First, we mask variate $i$ and measure the increase in MSE of variate $j$ relative to the no-masking baseline. Second, we mask all variates except $i$ and measure the increase in MSE of variate $j$. Appendix [C.6](https://arxiv.org/html/2606.11268#A3.SS6) and [C.7](https://arxiv.org/html/2606.11268#A3.SS7) provide results for both experiments, along with ecological observations on some of the common trends across lakes.
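Both masking experiments can be expressed as one small evaluation harness. The sketch below is an assumption-laden illustration: masking is modeled by dropping a variate from the context dictionary, `forecast_fn` stands in for any model, and `toy_forecast` is a made-up rule-based forecaster used only to exercise the harness (it is not LakeFM).

```python
def mse(pred, truth):
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

def masking_impact(forecast_fn, context, targets, keep_only=False):
    """Return {(i, j): increase in MSE of variate j when variate i is masked}.

    With keep_only=True, every variate except i is masked instead.
    """
    base = forecast_fn(context)
    base_mse = {j: mse(base[j], targets[j]) for j in targets}
    impact = {}
    for i in context:
        if keep_only:
            masked = {i: context[i]}
        else:
            masked = {v: s for v, s in context.items() if v != i}
        preds = forecast_fn(masked)
        for j in targets:
            impact[(i, j)] = mse(preds[j], targets[j]) - base_mse[j]
    return impact

def toy_forecast(context):
    # Hypothetical stand-in: persist the last DO value, else guess from Temp.
    last = context["DO"][-1] if "DO" in context else 0.5 * context["Temp"][-1]
    return {"DO": [last, last]}

toy_context = {"DO": [8.0, 8.2], "Temp": [20.0, 20.0]}
toy_targets = {"DO": [8.2, 8.2]}
toy_impact = masking_impact(toy_forecast, toy_context, toy_targets)
```

Arranging the resulting values as a matrix over $(i,j)$ gives a compact view of which variates carry predictive information for which targets.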

![Refer to caption](https://arxiv.org/html/2606.11268v1/x6.png) (a) No masking
![Refer to caption](https://arxiv.org/html/2606.11268v1/x7.png) (b) Temp masked
![Refer to caption](https://arxiv.org/html/2606.11268v1/x8.png) (c) DO masked

Figure 5. Visualizing DO forecasts under masked and no-masking scenarios for Lake PRLA at depth 1.0 m.

Table 1. Impact of masking shallow ($Z_{s}$) and deep ($Z_{d}$) depth information on forecasting performance (MSE) across lakes.

#### 5.2.2. Shallow Layers vs. Deep Layers: What Matters More For Forecasting?

Similar to variate masking, we study the effect on forecasting performance of masking out all variates in the context window at shallow layers ($Z_{shallow}$) compared to deep layers ($Z_{deep}$) (see Appendix [C.8](https://arxiv.org/html/2606.11268#A3.SS8) for details on how shallow and deep layers are defined for different lakes). Table [1](https://arxiv.org/html/2606.11268#S5.SS2.SSS1) shows the results of this masking experiment on two lakes (CRAM and BARC), comparing LakeFM with Chronos 2. The LakeFM results show that, across both lakes, the shallow layers in the context window contain more predictive information about shallow-layer forecasts, and likewise the deep layers for deep-layer forecasts. Chronos 2, on the other hand, does not register any significant difference in MSE when shallow or deep layers are masked. Also note that Chronos 2 requires a variate to be present in the context window to compute its forecast. Hence, it cannot be used to analyze the impact of masking shallow (or deep) layers on forecasts for those same layers.

### 5.3. Analyzing Consistency with Physical Laws

![Refer to caption](https://arxiv.org/html/2606.11268v1/x9.png)

![Refer to caption](https://arxiv.org/html/2606.11268v1/x10.png)

Figure 6. Comparing physical consistency of LakeFM and Chronos 2 across 100 simulated lakes: (left) inversion rate with thermal stratification law ($\downarrow$); (right) Pearson $R^{2}$ with Beer-Lambert law ($\uparrow$).

We evaluate the emergent ability of LakeFM to comply with two physical laws of aquatic systems that it has not been trained for, as described below.

(1) Thermal Stratification Law. A fundamental property of lakes is that during summer, lake temperature varies monotonically with depth, i.e., it does not increase from one depth to the next deeper one ($T_{z}\geq T_{z+1}$). Deviations from this monotonic rule are inversions, which are physically inconsistent. We quantify them using the inversion rate, defined as the average number of depth-wise inversions ($T_{z}<T_{z+1}$) per day. A lower inversion rate means higher physical consistency.

(2) Beer-Lambert Law (Beer and Beer, [1852](https://arxiv.org/html/2606.11268#bib.bib4)) states that light intensity decreases exponentially with depth due to biomass in the water column. We evaluate this relationship by computing the Pearson $R^{2}$ between predicted Chlorophyll-a and Light Attenuation (higher is better).

Figure [6](https://arxiv.org/html/2606.11268#S5.F6) compares the inversion rate and $R^{2}$ values of LakeFM and Chronos 2 over 100 unseen simulated lakes. LakeFM shows higher consistency with both physical laws than Chronos 2 over a large majority of lakes (shaded green). For additional comparisons of physical consistency, see Appendix [C.1](https://arxiv.org/html/2606.11268#A3.SS1).
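The two consistency metrics above can be sketched as follows. This is a minimal illustration under stated assumptions: each daily profile is a list of predicted temperatures ordered from shallow to deep, and the chlorophyll-a and light-attenuation series are supplied as already-aligned lists (how light attenuation is derived is not specified here). Function names are hypothetical.

```python
def inversion_rate(daily_profiles):
    """Average number of depth-wise inversions (T_z < T_{z+1}) per day."""
    if not daily_profiles:
        return 0.0
    inversions = sum(
        sum(1 for upper, lower in zip(profile, profile[1:]) if upper < lower)
        for profile in daily_profiles
    )
    return inversions / len(daily_profiles)

def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant series
    return sxy * sxy / (sxx * syy)
```

For instance, the two profiles `[20, 18, 15]` and `[20, 21, 15]` contain one inversion in total, giving an inversion rate of 0.5 per day.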

### 5.4. Interpreting LakeFM Embeddings

![Refer to caption](https://arxiv.org/html/2606.11268v1/x11.png)

Figure 7. Static LakeFM embeddings of observed lakes categorized by location (State) and hydrologic regime.

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/five_lakes_2018_version_b_bm_first_in_tsne_regime_trophic.png)

Figure 8. Trajectories of LakeFM dynamic embeddings for lakes AL, BM, SP, MO, and ME in 2018.

Static Embeddings. Figure [7](https://arxiv.org/html/2606.11268#S5.F7) shows the static lake-level embeddings learned by LakeFM (from its static projector), visualized using 2D t-SNE. Lakes from similar geographic locations (US state) lie closer to each other, with finer variations among lakes within each state determined by their hydrologic regimes. For example, for Wisconsin (WI) lakes, LakeFM separates lakes with drainage (MO, ME, and WI) from those with seepage (SP, BM, CB). Figure [14](https://arxiv.org/html/2606.11268#A3.F14) in Appendix [C.2](https://arxiv.org/html/2606.11268#A3.SS2) presents additional visualizations of LakeFM embeddings based on lake trophic state, which further help to differentiate lakes such as AL (mesotrophic) from MO and ME (eutrophic). Note that none of this lake metadata was used in the training of LakeFM, demonstrating LakeFM’s ability to produce meaningful static embeddings aligned with known lake properties.

Time-varying Embeddings. We examine the ability of LakeFM to produce dynamic embeddings of lakes (from its temporal projector) that help differentiate their temporal trajectories. Figure [8](https://arxiv.org/html/2606.11268#S5.F8) shows the embedding trajectories of 5 lakes in Wisconsin for 2018 using 2D t-SNE, colored by hydrologic regime and trophic state. Both MO and ME (eutrophic lakes with drainage regime) exhibit closely aligned seasonal trajectories in the embedding space, while SP and BM, oligotrophic lakes with seepage-dominated regime, form a distinct group. While AL is geographically close to SP and BM, it is ecologically different in terms of hydrologic regime and trophic state, and thus follows a different trajectory than the other two. Figure [14](https://arxiv.org/html/2606.11268#A3.F14) in Appendix [C.2](https://arxiv.org/html/2606.11268#A3.SS2) includes an additional visualization for 2021, with similar observations, suggesting that LakeFM jointly encodes time-invariant lake characteristics and time-varying ecological behavior.

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/tsne_w_p_cyano.png)

Figure 9. LakeFM representation for 900 unseen simulation lakes, each corresponding to a different cyanobacteria value.

Embeddings of Simulated Lakes. We investigate whether LakeFM’s embeddings of simulated lakes encode information about the process-based parameters used to generate the simulations. Figure [9](https://arxiv.org/html/2606.11268#S5.F9) shows the static embeddings of 900 unseen simulated lakes generated with varying input parameter configurations (see Appendix [A.2](https://arxiv.org/html/2606.11268#A1.SS2) for simulation details), where every point is colored by the cyanobacteria-related parameter $w\_p\_cyano$, which modulates phytoplankton dynamics. We see a clear gradient across the embedding space with respect to this parameter, with clear separation of low, intermediate, and high $w\_p\_cyano$ values. We further analyze the trajectory embeddings of simulated lakes in Appendix [C.3](https://arxiv.org/html/2606.11268#A3.SS3), which shows consistent trends in temporal dynamics for lakes within the same range of $w\_p\_cyano$ values.

### 5.5. Ablations

#### 5.5.1. Model Ablations

Figure [10(a)](https://arxiv.org/html/2606.11268#S5.F10.sf1) compares LakeFM against model ablations, evaluating the contribution of contrastive learning, variate-specific likelihoods, and probabilistic training in LakeFM.

Without Contrastive Loss. We remove the contrastive objective and train the model solely with the probabilistic forecasting loss. This ablation leads to a consistent degradation in performance across all held-out lakes, highlighting that encouraging the model to learn time-invariant lake representations helps LakeFM’s forecasting performance.

Variate-specific Degrees of Freedom (DoF). We evaluate the impact of learning variate-wise DoF within the Student-$t$ distribution versus a shared or fixed DoF. Results show a strict decrease in performance without variate-specific parameterization. This confirms that different limnological variables (e.g., highly volatile Chlorophyll-a vs. stable deep-water temperature) possess distinct heavy-tail characteristics that require individualized distributional modeling.

Student-$t$ vs. Gaussian Likelihood. Replacing the Student-$t$ distribution with a Gaussian distribution results in significantly higher MSE. This degradation highlights that environmental time series frequently violate normality assumptions; the Student-$t$ provides the necessary flexibility to handle the outliers and heteroskedasticity inherent in lake ecosystems.

Probabilistic vs. Point-Estimation (MSE Loss). Training with a deterministic (non-probabilistic) MSE loss yields poor performance across held-out lakes. This highlights the high degree of epistemic uncertainty in lake modeling. A deterministic approach fails to capture the uncertain nature of future states, whereas our probabilistic framework provides a more resilient objective for zero-shot transfer.

Continuous Depth Embedding Ablation. To isolate the impact of our continuous depth embedding, we conduct an ablation study where depth is flattened into discrete, independent variates (i.e., each unique depth-variable pair is treated as a new variate). From Table [2](https://arxiv.org/html/2606.11268#S5.T2), we observe that treating depth as discrete independent variates degrades performance. Continuous depth modeling provides a spatial coordinate system that allows the model to learn vertical gradients and generalize to unseen depths. Treating depth as independent categories makes this challenging and also leads to extremely large, sparse input matrices.
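The difference between the two input representations can be illustrated with a toy example. This sketch only contrasts the data layouts; it does not show how LakeFM embeds depth, which is not described in this section. The function names, the tuple layout `(variable, depth, day, value)`, and all values are hypothetical.

```python
def flattened_variates(observations):
    """Discrete view: every unique (variable, depth) pair becomes its own variate."""
    return sorted({(var, depth) for var, depth, _day, _value in observations})

def continuous_tokens(observations, var_ids):
    """Continuous view: one token per observation, depth kept as a real coordinate."""
    return [(var_ids[var], float(depth), float(day), float(value))
            for var, depth, day, value in observations]

# Toy observations (variable, depth in m, day, value); all values are made up.
train_obs = [("Temp", 1.0, 0, 21.3), ("Temp", 5.0, 0, 14.8), ("DO", 1.0, 0, 8.9)]
new_obs = [("Temp", 2.5, 0, 18.0)]

vocab = flattened_variates(train_obs)
unseen = [obs for obs in new_obs if (obs[0], obs[1]) not in vocab]
tokens = continuous_tokens(new_obs, {"Temp": 0, "DO": 1})
```

In the flattened view, the observation at 2.5 m has no matching variate and the variate vocabulary grows with every new depth, whereas the continuous view represents it as an ordinary token with a new depth coordinate.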

Table 2. Comparing LakeFM with and without continuous-depth modeling (W/o depth embed); the latter treats each unique (depth, variable) pair as an independent discrete variate.

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/model_ablations_mse_ood.png) (a) Model ablations
![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/ood_pretraining_ablation_mse.png) (b) Training ablations

Figure 10. Model and training ablations evaluated using MSE across six held-out lakes.

#### 5.5.2. Training Ablations

Figure [10(b)](https://arxiv.org/html/2606.11268#S5.F10.sf2) compares training strategy ablations, analyzing the impact of simulation-based pretraining and temporal sampling strategies on OOD generalization.

Impact of Simulation Data in Pre-training. We compare LakeFM trained exclusively on the LakeBeD (real-world) dataset against our standard pipeline, which includes synthetic simulation data. The significantly lower generalizability of the real-only model indicates that pre-training on physically grounded simulations is critical for learning a robust representation of lake dynamics that transfers to unseen basins.

Sampling Strategies (Curriculum vs. Cycle vs. Random). We experimented with various window-sampling strategies during training: Curriculum (increasing window sizes), Cycle (alternating context/prediction lengths), and Random. We found that complex scheduling provided negligible benefits over standard random sampling, suggesting the model’s robustness is driven more by data diversity than the order of temporal exposure.

## 6. Computational Cost Analysis

We report the computational cost of LakeFM and the baselines in Table [3](https://arxiv.org/html/2606.11268#S6.T3), with measurements obtained on the ME dataset using a single H200 GPU and batch size 32.

Table 3. Computational cost comparison of LakeFM and baseline models, measured on the ME dataset.

Treating each observation as a distinct token increases the input sequence length; however, the computational cost analysis shows that this is a highly parameter-efficient trade-off that enables the model to handle extreme irregularities that baselines cannot. LakeFM achieves competitive results using significantly fewer parameters (7M) than the general-purpose TSFMs. While peak VRAM is slightly higher than some baselines, it remains well within the limits of standard modern hardware, including low-end GPUs (e.g., NVIDIA V100). Moreover, LakeFM’s inference throughput is comparable to or better than several production-grade state-of-the-art FMs such as MOMENT, Chronos 2, and MOIRAI.

## 7. How does LakeFM Advance the Science of Aquatic Systems?

The conventional approach of modeling lake systems is to use process\-based models, which require expert calibration of lake\-specific parameters using custom data from every lake, which is hard to obtain at operational scales\. ML offers a completely different solution for this problem by training a model over a large collection of lakes that can be transferred to any lake without expert calibration of parameters\. However, a major challenge in harnessing the generalization power of ML models across lakes is the irregularity in data, a challenge that no other TSFM is able to address despite its wide prevalence across many ecological applications\.

For the first time, LakeFM enables scalable knowledge transfer across a large collection of lakes (real and simulated), overcoming the challenges of sampling differences within variables across depth and time. LakeFM is a first step toward macro-system understanding of lake ecology using diverse and heterogeneous lake data. Another advantage of LakeFM for aquatic sciences is its ability to work with masked variables, which neither process-based nor existing ML models in this domain are able to handle. The variable masking experiments of LakeFM reveal novel insights about the interactions of variables in lakes that require further scientific investigation. They can inform which variables to collect, and at what depths, when working on new lakes to maximize forecasting performance. Finally, the embedding visualizations reveal novel interpretations of the static and dynamic characteristics of lakes, jointly accounting for changes in geography, hydrological regime, and trophic states.

## 8. Conclusion

We present LakeFM, a domain-specific foundation model for lake ecosystems, developed through an interdisciplinary collaboration between ML researchers and ecologists, that is able to handle irregular multivariate, multi-depth time series across diverse lake systems. By jointly modeling variable interactions and site-level dynamics, LakeFM enables reliable zero-shot cross-lake generalization and recovers physically and ecologically meaningful information. We hope our work inspires future research in building scientific foundation models that are tailored to the needs of application domains and are trained not only with supervision contained in data but also with the physical principles underlying scientific phenomena.

## Limitations and Ethical Considerations

To the best of our knowledge, this work poses no major ethical concerns, as it focuses on forecasting and representation learning from environmental time-series data. Potential limitations stem primarily from data quality and coverage. LakeFM’s reliability in environmental conditions that deviate significantly from the training distribution, e.g., regions with different thermal and ice-cover dynamics than the US, remains a key boundary condition. Specifically, different climatic contexts can introduce variate scales that fall outside the observed training statistics. In such scenarios, the model’s reliance on learned statistical dependencies and multivariate correlations may limit its extrapolative accuracy.

## GenAI Disclosure

Generative AI tools were used to assist with editing and improving the readability of the manuscript, including grammar, phrasing, and presentation of the manuscript\. All scientific content, experimental design, results, analysis, and conclusions were conceived, verified, and finalized by the authors\.

###### Acknowledgements\.

We sincerely thank Mary E\. Lofton from the Department of Biology, Virginia Tech for preparing and curating the FCR simulations \(comprising 1000 simulation lake datasets\) used in this study\. This work was supported in part by NSF awards \#2213549, \#2213550, and \#2239328\. We are also grateful to computing resources from Bridges\-2 at Pittsburgh Supercomputing Center available through NAIRR pilot award \#240161\. We are also grateful to the Advanced Research Computing \(ARC\) Center at Virginia Tech for providing access to GPU compute resources for this project\.

## References

- Ansari et al. (2025) Abdul Fatir Ansari, Oleksandr Shchur, Jaris Küken, Andreas Auer, Boran Han, Pedro Mercado, Syama Sundar Rangapuram, Huibin Shen, Lorenzo Stella, Xiyuan Zhang, et al. 2025. Chronos-2: From univariate to universal forecasting. *arXiv preprint arXiv:2510.15821* (2025).
- Ansari et al. (2024) Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, et al. 2024. Chronos: Learning the language of time series. *arXiv preprint arXiv:2403.07815* (2024).
- Beer and Beer (1852) August Beer and P Beer. 1852. Determination of the absorption of red light in colored liquids. *Annalen der Physik und Chemie* 86, 5 (1852), 78–88.
- Chen et al. (2023) Yuqi Chen, Kan Ren, Yansen Wang, Yuchen Fang, Weiwei Sun, and Dongsheng Li. 2023. Contiformer: Continuous-time transformer for irregular time series modeling. *Advances in Neural Information Processing Systems* 36 (2023), 47143–47175.
- Cohen et al. (2024) Ben Cohen, Emaad Khwaja, Kan Wang, Charles Masson, Elise Ramé, Youssef Doubli, and Othmane Abou-Amal. 2024. Toto: Time series optimized transformer for observability. *arXiv preprint arXiv:2407.07874* (2024).
- Corman et al. (2023) Jessica Corman, Jacob Zwart, Jennifer Klug, Denise Bruesewitz, Elvira de Eyto, Marcus Klaus, Lesley Knoll, James Rusak, Michael Vanni, Maria Belen Alfonso, et al. 2023. High-frequency dissolved oxygen, water temperature, wind speed, and radiation data; stream and in-lake nutrient concentration data; and daily metabolism and nutrient loading estimates for 16 lakes in North America and Northern Europe. (2023).
- Daw et al. (2022) Arka Daw, Anuj Karpatne, William D Watkins, Jordan S Read, and Vipin Kumar. 2022. Physics-guided neural networks (pgnn): An application in lake temperature modeling. In *Knowledge guided machine learning*. Chapman and Hall/CRC, 353–372.
- Du et al. (2023) Wenjie Du, David Côté, and Yan Liu. 2023. Saits: Self-attention-based imputation for time series. *Expert Systems with Applications* 219 (2023), 119619.
- Goswami et al. (2024) Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. 2024. Moment: A family of open time-series foundation models. *arXiv preprint arXiv:2402.03885* (2024).
- Hanson et al. (2023) P. C. Hanson, R. Ladwig, C. Buelo, E. A. Albright, A. D. Delany, and C. C. Carey. 2023. Legacy Phosphorus and Ecosystem Memory Control Future Water Quality in a Eutrophic Lake. *Journal of Geophysical Research: Biogeosciences* 128, 12 (2023), e2023JG007620. [doi:10.1029/2023JG007620](https://doi.org/10.1029/2023JG007620)
- Hipsey et al. (2019) M. R. Hipsey, L. C. Bruce, C. Boon, B. Busch, C. C. Carey, D. P. Hamilton, P. C. Hanson, J. S. Read, E. de Sousa, M. Weber, and L. A. Winslow. 2019. A General Lake Model (GLM 3.0) for linking with high-frequency sensor data from the Global Lake Ecological Observatory Network (GLEON). *Geoscientific Model Development* 12, 1 (2019), 473–523.
- Jia et al. (2019) Xiaowei Jia, Jared Willard, Anuj Karpatne, Jordan Read, Jacob Zwart, Michael Steinbach, and Vipin Kumar. 2019. Physics guided RNNs for modeling dynamical systems: A case study in simulating lake temperature profiles. (2019), 558–566.
- Ladwig et al. (2024) Robert Ladwig, Arka Daw, Elen A Albright, Cal Buelo, Anuj Karpatne, Michael Frederick Meyer, Abhilash Neog, Paul C Hanson, and Hilary A Dugan. 2024. Modular Compositional Learning Improves 1D Hydrodynamic Lake Model Performance by Merging Process-Based Modeling With Deep Learning. *Journal of Advances in Modeling Earth Systems* 16, 1 (2024), e2023MS003953.
- Langman et al. (2010) OC Langman, PC Hanson, SR Carpenter, and YH Hu. 2010. Control of dissolved oxygen in northern temperate lakes over scales ranging from minutes to days. *Aquatic Biology* 9, 2 (2010), 193–202.
- Li et al. (2026) Boyuan Li, Zhen Liu, Yicheng Luo, and Qianli Ma. 2026. Learning Recursive Multi-Scale Representations for Irregular Multivariate Time Series Forecasting. *arXiv preprint arXiv:2602.21498* (2026).
- Li et al. (2025) Boyuan Li, Yicheng Luo, Zhen Liu, Junhao Zheng, Jianming Lv, and Qianli Ma. 2025. Hyperimts: Hypergraph neural network for irregular multivariate time series forecasting. *arXiv preprint arXiv:2505.17431* (2025).
- Liu et al. (2023) Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. 2023. itransformer: Inverted transformers are effective for time series forecasting. *arXiv preprint arXiv:2310.06625* (2023).
- McAfee et al. (2025) Bennett J McAfee, Aanish Pradhan, Abhilash Neog, Sepideh Fatemi, Robert T Hensley, Mary E Lofton, Anuj Karpatne, Cayelan C Carey, and Paul C Hanson. 2025. LakeBeD-US: a benchmark dataset for lake water quality time series and vertical profiles. *Earth System Science Data* 17, 7 (2025), 3141–3165.
- Neog et al. (2026) Abhilash Neog, Arka Daw, Sepideh Fatemi, Medha Sawhney, Aanish Pradhan, Mary E Lofton, Bennett J McAfee, Adrienne Breef-Pilz, Heather L Wander, Dexter W Howard, et al. 2026. Investigating a Model-Agnostic and Imputation-Free Approach for Irregularly-Sampled Multivariate Time-Series Modeling. *Transactions on Machine Learning Research* (2026).
- Nie et al\.\(2022\)Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam\. 2022\.A time series is worth 64 words: Long\-term forecasting with transformers\.*arXiv preprint arXiv:2211\.14730*\(2022\)\.
- Prabhakar Kamarthi and Prakash \(2024\)Harshavardhan Prabhakar Kamarthi and B Aditya Prakash\. 2024\.Large Pre\-trained time series models for cross\-domain Time series analysis tasks\.*Advances in Neural Information Processing Systems*37 \(2024\), 56190–56214\.
- Pradhan et al\.\(2024\)Aanish Pradhan, Bennett J\. McAfee, Abhilash Neog, Sepideh Fatemi, Mary E\. Lofton, Cayelan C\. Carey, Anuj Karpatne, and Paul C\. Hanson\. 2024\.LakeBeD\-US: Computer Science Edition \- a benchmark dataset for lake water quality time series and vertical profiles\.[doi:10\.57967/hf/3771](https://doi.org/10.57967/hf/3771)
- Shi et al\.\(2024\)Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, and Ming Jin\. 2024\.Time\-moe: Billion\-scale time series foundation models with mixture of experts\.*arXiv preprint arXiv:2409\.16040*\(2024\)\.
- Shukla and Marlin \(2021\)Satya Narayan Shukla and Benjamin M Marlin\. 2021\.Multi\-time attention networks for irregularly sampled time series\.*arXiv preprint arXiv:2101\.10318*\(2021\)\.
- Staehr et al\.\(2010\)Peter A Staehr, Darren Bade, Matthew C Van de Bogert, Gregory R Koch, Craig Williamson, Paul Hanson, Jonathan J Cole, and Tim Kratz\. 2010\.Lake metabolism and the diel oxygen technique: state of the science\.*Limnology and Oceanography: Methods*8, 11 \(2010\), 628–644\.
- Su et al\.\(2024\)Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu\. 2024\.Roformer: Enhanced transformer with rotary position embedding\.*Neurocomputing*568 \(2024\), 127063\.
- Tashiro et al\.\(2021\)Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon\. 2021\.Csdi: Conditional score\-based diffusion models for probabilistic time series imputation\.*Advances in neural information processing systems*34 \(2021\), 24804–24816\.
- Willard et al\.\(2021\)Jared D Willard, Jordan S Read, Alison P Appling, Samantha K Oliver, Xiaowei Jia, and Vipin Kumar\. 2021\.Predicting water temperature dynamics of unmonitored lakes with meta\-transfer learning\.*Water Resources Research*57, 7 \(2021\), e2021WR029579\.
- Willard et al\.\(2022\)Jared D Willard, Jordan S Read, Simon Topp, Gretchen JA Hansen, and Vipin Kumar\. 2022\.Daily surface temperatures for 185,549 lakes in the conterminous United States estimated using deep learning \(1980–2020\)\.*Limnology and Oceanography Letters*7, 4 \(2022\), 287–301\.
- Woo et al\.\(2024\)G Woo, C Liu, A Kumar, C Xiong, S Savarese, and D Sahoo\. 2024\.Unified training of universal time series forecasting transformers\.*arXiv preprint arXiv:2402\.02592*\(2024\)\.
- Xia et al\.\(2012\)Youlong Xia, Kenneth Mitchell, Michael Ek, Justin Sheffield, Brian Cosgrove, Eric Wood, Lifeng Luo, Charles Alonge, Helin Wei, Jesse Meng, Ben Livneh, Dennis Lettenmaier, Victor Koren, Qingyun Duan, Kingtse Mo, Yun Fan, and David Mocko\. 2012\.Continental\-scale water and energy flux analysis and validation for the North American Land Data Assimilation System project phase 2 \(NLDAS\-2\): 1\. Intercomparison and application of model products\.*Journal of Geophysical Research: Atmospheres*117, D3 \(2012\)\.[doi:10\.1029/2011JD016048](https://doi.org/10.1029/2011JD016048)
- Yu et al\.\(2025\)Runlong Yu, Chonghao Qiu, Robert Ladwig, Paul Hanson, Yiqun Xie, and Xiaowei Jia\. 2025\.Physics\-Guided Foundation Model for Scientific Discovery: An Application to Aquatic Science\.*arXiv preprint arXiv:2502\.06084*\(2025\)\.

## Appendix A Dataset Details

We pretrain and evaluate LakeFM on three datasets \(over 530 million observations in total\) that together cover both real\-world observations \(21 observed lakes\) and process\-based simulations \(1000\+ diverse lake simulations\)\. Each dataset contributes distinct strengths to the modeling framework, as described below\.

### A\.1\. LakeBeD\-US

Our primary observational dataset is LakeBeD\-US \(McAfee et al\., [2025](https://arxiv.org/html/2606.11268#bib.bib19); Pradhan et al\., [2024](https://arxiv.org/html/2606.11268#bib.bib23)\), consisting of over 500 million unique lake water quality observations collected between 1981 and 2024\. The data span 21 U\.S\. lakes and include both high\- and low\-frequency measurements\. In this work, we utilize only the low\-frequency measurements\. The dataset features 17 variables organized into three categories: \(1\) static attributes, such as lake morphology and geographic location; \(2\) one\-dimensional \(1D\) variables that vary over time \(e\.g\., Secchi depth, inflow\); and \(3\) two\-dimensional \(2D\) variables that vary over both time and depth\. This rich observational dataset captures diverse temporal and spatial lake dynamics\.

In\-Distribution vs\. Out\-of\-Distribution\. To evaluate generalization beyond the training distribution, we split the LakeBeD\-US lakes into in\-distribution \(ID\) and out\-of\-distribution \(OOD\) groups using a feature\-based notion of lake similarity\. Each lake is represented by a vector of geographic and physical attributes \(latitude, longitude, surface area, mean depth, maximum depth, and elevation\)\. Since these attributes have different units and scales, we standardize all features to zero mean and unit variance before computing similarities\.

We then apply Principal Component Analysis \(PCA\) to obtain a compact representation capturing the dominant modes of variability across lakes, and perform $k$\-means clustering in this normalized feature space\. The number of clusters is selected using the silhouette score, favoring clusterings that are both tight and well\-separated\. To define an OOD subset, we select a small set of lakes that are maximally dissimilar from the bulk of the dataset in the learned PCA space \(i\.e\., located in sparse or well\-separated regions relative to cluster structure\)\. The remaining lakes, which lie in denser regions of the feature space, are treated as ID\. Figure [11](https://arxiv.org/html/2606.11268#A1.F11) displays the results of the PCA with the proportion of variance explained by each principal component\. Table [4](https://arxiv.org/html/2606.11268#A1.T4) shows the ID/OOD split of the lakes along with the hydrology type and trophic state of each lake\.
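
The cluster\-count selection described above can be sketched as follows\. This is a minimal illustration under stated assumptions, not the authors' pipeline: it uses only the Python standard library, skips the PCA step for brevity, uses a deterministic farthest\-point initialisation for $k$\-means, and runs on six hypothetical lakes with two made\-up features each instead of the six attributes listed in the text\. The helper names \(`standardize`, `kmeans`, `silhouette`\) are illustrative\.

```python
import math

def standardize(rows):
    """Scale every column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=50):
    """Plain k-means; farthest-point initialisation is an assumption of this sketch."""
    centers = [list(points[0])]
    while len(centers) < k:
        centers.append(list(max(points, key=lambda p: min(dist(p, c) for c in centers))))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    scores = []
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # mean distance within the own cluster
        b = min(                   # mean distance to the nearest other cluster
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == c) / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy attributes: two well-separated groups of three lakes.
lakes = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
z = standardize(lakes)
scores = {k: silhouette(z, kmeans(z, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
print("selected number of clusters:", best_k)
```

On real lake attributes, the same loop would run on the PCA\-projected features, and the OOD lakes would then be picked from the sparse or well\-separated regions of that space\.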

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/appendix/lake_similarity_pca.png)Figure 11\. PCA of lake similarity with clusters obtained by $k$\-means clustering\.

Table 4\. List of lakes used in the study, split into OOD and ID groups, with locations, hydrology types, and trophic states\.

| Group | Lake Name | Abbr\. | Location | Hydrology | Trophic State |
| --- | --- | --- | --- | --- | --- |
| OOD Lakes | Lake Barco | BARC | Putnam County, FL, USA | Seepage | Oligotrophic |
| OOD Lakes | Green Lake 4 | GL4 | Boulder County, CO, USA | Drainage | Oligotrophic |
| OOD Lakes | Lake Mendota | ME | Dane County, WI, USA | Drainage | Eutrophic |
| OOD Lakes | Lake Suggs | SUGG | Putnam County, FL, USA | Seepage | Mesotrophic |
| OOD Lakes | Toolik Lake | TOOK | North Slope Borough, AK, USA | Drainage | Oligotrophic |
| OOD Lakes | Trout Lake | TR | Vilas County, WI, USA | Drainage | Oligotrophic |
| ID Lakes | Allequash Lake | AL | Vilas County, WI, USA | Drainage | Mesotrophic |
| ID Lakes | Big Muskellunge Lake | BM | Vilas County, WI, USA | Seepage | Oligotrophic |
| ID Lakes | Beaverdam Reservoir | BVR | Roanoke County, VA, USA | Drainage | Meso\-Eutrophic |
| ID Lakes | Crystal Bog | CB | Vilas County, WI, USA | Seepage | Dystrophic |
| ID Lakes | Crystal Lake | CR | Vilas County, WI, USA | Perched | Oligotrophic |
| ID Lakes | Crampton Lake | CRAM | Vilas County, WI, USA | Seepage | Oligotrophic |
| ID Lakes | Falling Creek Reservoir | FCR | Roanoke County, VA, USA | Drainage | Eutrophic |
| ID Lakes | Fish Lake | FI | Dane County, WI, USA | Seepage | Mesotrophic |
| ID Lakes | Little Rock Lake | LIRO | Vilas County, WI, USA | Seepage | Mesotrophic |
| ID Lakes | Lake Monona | MO | Dane County, WI, USA | Drainage | Eutrophic |
| ID Lakes | Prairie Lake | PRLA | Stutsman County, ND, USA | Seepage | Dystrophic |
| ID Lakes | Prairie Pothole | PRPO | Stutsman County, ND, USA | Seepage | Dystrophic |
| ID Lakes | Sparkling Lake | SP | Vilas County, WI, USA | Seepage | Oligotrophic |
| ID Lakes | Trout Bog | TB | Vilas County, WI, USA | Seepage | Dystrophic |
| ID Lakes | Lake Wingra | WI | Dane County, WI, USA | Drainage | Eutrophic |

Table 5\. Variable names and corresponding water quality description
### A\.2\. FCR Simulations

The FCR Simulation dataset was generated using the General Lake Model coupled with the AED water quality module \(GLM\-AED; Hipsey et al\., [2019](https://arxiv.org/html/2606.11268#bib.bib12)\) and comprises 1,000 process\-based model runs at Falling Creek Reservoir \(FCR\), VA, at daily resolution from December 1, 2016, to December 31, 2020\. Each run represents a distinct ecological scenario defined by a unique set of phytoplankton trait parameters, drawn using Latin hypercube sampling\. Six parameters were varied across three phytoplankton groups \(cyanobacteria, green algae, and diatoms\), including group\-specific growth rates and sinking rates\. Model outputs include five key water quality variables: water temperature, soluble reactive phosphorus \(SRP\), dissolved inorganic nitrogen \(DIN\), chlorophyll\-a \(Chla\), and the light attenuation coefficient \(Kd\)\. These are reported at seven depths \(0\.1, 1\.6, 3\.8, 5, 6\.2, 8, and 9 m\), corresponding to observational depths in FCR\. Additionally, meteorological driver variables \(e\.g\., AirTemp, Shortwave, Inflow\) are included\. Each row represents a specific date and depth, enabling detailed analysis of how phytoplankton trait variation influences ecosystem dynamics, particularly nutrient\-light\-temperature interactions and emergent biogeochemical patterns\.
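
To illustrate how 1,000 parameter sets could be drawn with Latin hypercube sampling, here is a minimal standard\-library sketch\. The parameter names and bounds in `PARAM_BOUNDS` are hypothetical placeholders \(the text only states that six growth\-rate and sinking\-rate parameters were varied\), and the function `latin_hypercube` is an illustrative helper, not the code used to configure GLM\-AED\.

```python
import random

def latin_hypercube(n_samples, bounds, seed=0):
    """Draw n_samples points so that each dimension hits every one of
    n_samples equal-width strata exactly once."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniform draw inside each stratum, then shuffle the strata
        # so that dimensions are paired at random.
        strata = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(strata)
        columns.append([low + u * (high - low) for u in strata])
    return [list(row) for row in zip(*columns)]

# Hypothetical names and ranges for the six varied trait parameters.
PARAM_BOUNDS = {
    "cyano_growth": (0.5, 1.5),
    "green_growth": (0.8, 2.0),
    "diatom_growth": (0.8, 2.5),
    "cyano_sinking": (-0.1, 0.1),
    "green_sinking": (-0.3, 0.0),
    "diatom_sinking": (-0.5, 0.0),
}

scenarios = latin_hypercube(1000, list(PARAM_BOUNDS.values()))
print(len(scenarios), "scenarios with", len(scenarios[0]), "parameters each")
```

Compared with independent uniform draws, this stratification guarantees that the full range of every parameter is covered even with a modest number of runs\.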

### A\.3\. WQHanson Simulations

The WQHansonSim dataset is a set of lake water quality simulation datasets covering four lakes: Green Lake, Lake Mendota, Prairie Lake, and Trout Lake\. The synthetic data were created using a process\-based water quality model \(Hanson et al\., [2023](https://arxiv.org/html/2606.11268#bib.bib11)\) driven by meteorological forcing data from the second phase of the North American Land Data Assimilation System \(NLDAS\-2; Xia et al\., [2012](https://arxiv.org/html/2606.11268#bib.bib32)\)\. Each simulation underwent a 60\-year burn\-in period to allow slow\-changing ecosystem states to reach dynamic equilibrium, followed by a 20\-year simulation period\. The outputs are structured as daily time series, with each row representing a unique date\-depth combination\.

Each record includes six core fields: water temperature, dissolved oxygen, dissolved organic carbon, particulate organic carbon, total phosphorus, and depth, alongside the corresponding date\. Depths are lake\-specific and selected to reflect stratification layers, representing both the epilimnion and hypolimnion \(e\.g\., 5 m and 23 m for Trout Lake\), which allows realistic modeling of thermal and chemical differences between layers of the lake\.

Table 6\. Overview of available lake variables \(2D\) for each lake across all datasets, which together form the vocabulary of LakeFM\.

## Appendix B Implementation Details

### B\.1\. LakeFM

LakeFM uses a 12\-layer Transformer encoder with 4 attention heads and hidden size $d_{\text{model}}=128$\. Each scalar token combines learned variable \(48\-d\), depth \(32\-d\), and value \(96\-d\) embeddings, along with a sinusoidal time embedding \(16\-d\)\. The encoder uses RoPE for relative positional information, pre\-norm Transformer blocks, SwiGLU activations, and dropout rates of 0\.1 globally, 0\.05 for attention, and 0\.05 for heads\. The contrastive projection head uses attention pooling with projection dimension 128\. Forecasting is performed with a probabilistic decoder head that outputs Student\-$t$ parameters, with scale and degrees of freedom constrained using softplus and clamping\. We train with an Adam\-style optimizer using learning rate $5\times 10^{-5}$, weight decay $4\times 10^{-4}$, gradient clipping at 1\.0, and cosine learning\-rate scheduling without warmup\.
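
As a rough illustration of a Student\-$t$ output head with softplus and clamping constraints, the sketch below maps unconstrained decoder outputs to valid distribution parameters and evaluates the negative log\-likelihood\. The clamp range \(`df_min`, `df_max`\), the small `eps` added to the scale, and the function names are assumptions made for this example; the text does not specify them\.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def student_t_params(raw_loc, raw_scale, raw_df,
                     df_min=2.0, df_max=100.0, eps=1e-4):
    """Map raw decoder outputs to (loc, scale, df).
    df_min, df_max and eps are hypothetical choices for this sketch."""
    scale = softplus(raw_scale) + eps              # strictly positive scale
    df = min(df_min + softplus(raw_df), df_max)    # degrees of freedom clamped
    return raw_loc, scale, df

def student_t_nll(y, loc, scale, df):
    """Negative log-likelihood of y under a location-scale Student-t."""
    z = (y - loc) / scale
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi) - math.log(scale))
    return -(log_norm - (df + 1) / 2 * math.log1p(z * z / df))

loc, scale, df = student_t_params(raw_loc=0.3, raw_scale=-50.0, raw_df=1000.0)
print(loc, scale, df)
print(student_t_nll(0.5, loc, 1.0, 5.0))
```

The heavier tails of the Student\-$t$ distribution make this loss less sensitive to occasional outlying sensor readings than a Gaussian likelihood would be\.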

Hyperparameter tuning\. We perform hyperparameter sweeps over the following parameters: enc\_layers, num\_heads, weight\_decay, embed\_dim, attention\_dropout, head\_dropout, variate\_embed\_dim, depth\_embed\_dim, contrastive\_loss\_weight\.

Contrastive Sampling Strategy\. We adopt a custom balanced sampling strategy, built on top of PyTorch’s DistributedSampler, to construct batches for contrastive pretraining\. Each batch consists of multiple anchor\-positive groups, where each anchor is paired with P\_pos = 4 positive samples from the same lake\. For example, with batch\_size = 64, each group occupies five slots \(one anchor plus four positives\), which allows up to 12 such anchor\-positive sets per batch, with the remaining slots filled by negative samples drawn from different lakes\. Positive and negative pools are precomputed per lake for efficiency, and sampling is performed with deterministic seeding to support reproducibility across distributed processes\. This sampling strategy ensures within\-lake similarity and across\-lake contrast, enabling the model to learn lake\-discriminative representations\.
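
The batch layout described above can be sketched in plain Python\. This is a simplified stand\-in, not the authors' sampler: it does not subclass PyTorch's `DistributedSampler`, it assumes there are more lakes than anchor groups, and it interprets "different lakes" as lakes that host no anchor in the batch\. The function name `build_contrastive_batch` and the toy pools are hypothetical\.

```python
import random

def build_contrastive_batch(pools, batch_size=64, p_pos=4, seed=0):
    """pools: dict mapping lake id -> precomputed list of sample indices.
    Returns a list of (lake, sample_index, role) tuples."""
    rng = random.Random(seed)              # fixed seed: every process builds the same batch
    group_size = 1 + p_pos                 # one anchor plus its positives
    n_groups = batch_size // group_size    # 64 // 5 = 12 anchor-positive sets
    lakes = sorted(pools)
    anchor_lakes = rng.sample(lakes, n_groups)   # assumes len(lakes) > n_groups
    other_lakes = [lake for lake in lakes if lake not in anchor_lakes]

    batch = []
    for lake in anchor_lakes:
        picked = rng.sample(pools[lake], group_size)
        batch.append((lake, picked[0], "anchor"))
        batch += [(lake, idx, "positive") for idx in picked[1:]]

    # Fill the leftover slots (64 - 12 * 5 = 4) with negatives from other lakes.
    while len(batch) < batch_size:
        lake = rng.choice(other_lakes)
        batch.append((lake, rng.choice(pools[lake]), "negative"))
    return batch

# Hypothetical pools: 15 lakes with 20 window indices each.
pools = {f"lake_{i:02d}": list(range(i * 100, i * 100 + 20)) for i in range(15)}
batch = build_contrastive_batch(pools)
roles = [role for _, _, role in batch]
print(roles.count("anchor"), roles.count("positive"), roles.count("negative"))
```

Using the same seed on every rank is one simple way to keep batch composition reproducible across distributed processes, which is the property the text attributes to deterministic seeding\.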

Hardware\. We use a combination of NVIDIA H100 and A100 GPUs for pretraining and for running the experiments\.

### B\.2\. Baselines

We use the Samay Time\-series Foundational Models Library for foundation model baselines \(Prabhakar Kamarthi and Prakash, [2024](https://arxiv.org/html/2606.11268#bib.bib22)\) and the official iTransformer implementation\. Foundation models are evaluated zero\-shot without lake\-specific fine\-tuning\. For non\-foundation baselines, including iTransformer and IMTS models, ID evaluation uses the same 7:1:2 split as LakeFM\. For OOD evaluation, these non\-foundation baselines are trained on 60% of each OOD lake, and all models, including zero\-shot foundation models, are evaluated on the same remaining 40%\. Separate iTransformer, ReIMTS, and HyperIMTS models are trained for each lake\. All evaluations use a 30\-day context and 14\-day prediction horizon\. For baselines that cannot directly handle missing values, irregular observations are linearly interpolated along time\. For cross\-model comparisons, we report normalized metrics by re\-standardizing the denormalized model predictions using the same ground\-truth statistics across all models\.
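
Two of the evaluation steps above, linear interpolation along time and re\-standardized metrics, can be sketched as follows\. Both helpers are illustrative: holding the first and last observed values outside the observed range is an assumption of this sketch, and the statistics `mean` and `std` stand for whatever shared ground\-truth statistics are used for a given variable\.

```python
import bisect

def interpolate_linear(times, values, query_times):
    """Linearly interpolate irregular observations onto query_times.
    times must be sorted; endpoints are held constant (an assumption)."""
    out = []
    for t in query_times:
        if t <= times[0]:
            out.append(values[0])
        elif t >= times[-1]:
            out.append(values[-1])
        else:
            hi = bisect.bisect_right(times, t)   # first observation after t
            lo = hi - 1
            w = (t - times[lo]) / (times[hi] - times[lo])
            out.append(values[lo] + w * (values[hi] - values[lo]))
    return out

def normalized_mse(y_true, y_pred, mean, std):
    """MSE after standardizing physical-unit targets and predictions
    with the same ground-truth statistics for every model."""
    z_true = [(y - mean) / std for y in y_true]
    z_pred = [(y - mean) / std for y in y_pred]
    return sum((a - b) ** 2 for a, b in zip(z_true, z_pred)) / len(z_true)

# Observations on days 0, 2 and 5, resampled onto a daily grid.
daily = interpolate_linear([0, 2, 5], [10.0, 14.0, 20.0], [0, 1, 2, 3, 4, 5])
print(daily)
print(normalized_mse([10.0, 12.0, 14.0], [11.0, 12.0, 13.0], mean=12.0, std=2.0))
```

Because every model is scored against the same `mean` and `std`, the resulting errors are comparable even when the models used different internal normalizations\.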

## Appendix C Additional Results

### C\.1\. Physical Consistency Experiments

We compare iTransformer against LakeFM in the physical consistency experiments\. We use the data from 2017 and 2018 as training data \(for iTransformer\) and evaluate the models on 2019 and 2020 data \(the same evaluation split used for the Chronos 2 comparison\)\. Figure [12](https://arxiv.org/html/2606.11268#A3.F12) shows that LakeFM outperforms iTransformer in 97% of the cases in the $R^2$ comparison, and in 98% of the cases in the vertical thermal stratification experiment\.
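
One plausible way to quantify violations of vertical thermal stratification is an inversion rate over predicted temperature profiles\. The definition below is an assumption made for illustration \(this appendix does not spell out the exact metric\): it counts adjacent depth pairs in which the deeper layer is predicted warmer than the shallower one, which is physically implausible during warm\-season stratification\. Inverse stratification below roughly 4 °C is a legitimate exception that this simple sketch does not handle\.

```python
def inversion_rate(profiles, tol=0.0):
    """profiles: list of temperature profiles, each ordered shallow -> deep.
    Returns the fraction of adjacent depth pairs where the deeper layer is
    warmer than the shallower one by more than tol (hypothetical metric)."""
    inversions = 0
    pairs = 0
    for temps in profiles:
        for upper, lower in zip(temps, temps[1:]):
            pairs += 1
            if lower > upper + tol:
                inversions += 1
    return inversions / pairs if pairs else 0.0

# Two hypothetical predicted profiles (degrees C) at four depths each.
predicted = [
    [20.0, 18.0, 12.0, 8.0],   # monotone cooling with depth: no inversion
    [15.0, 15.5, 10.0, 9.0],   # second layer warmer than the surface: one inversion
]
print(inversion_rate(predicted))
```

A lower rate indicates forecasts that respect the expected ordering of water temperature with depth more often\.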

![Refer to caption](https://arxiv.org/html/2606.11268v1/x12.png)\(a\) Pearson $R^2$ comparison for LakeFM and iTransformer
![Refer to caption](https://arxiv.org/html/2606.11268v1/x13.png)\(b\) Inversion rate comparison for LakeFM and iTransformer\.

Figure 12\. Comparison of LakeFM and iTransformer on the Beer\-Lambert Law and Vertical stratification tests, evaluated across 100 simulation lake datasets\.

### C\.2\. Insights from Time\-varying Lake Embeddings

Figure [14](https://arxiv.org/html/2606.11268#A3.F14) shows trends in 2021 that mirror those observed in 2018 \(Figure [8](https://arxiv.org/html/2606.11268#S5.F8)\)\. In both years, the lake pairs \(ME, MO\) and \(SP, BM\) exhibit closely aligned trajectories within each pair, while remaining well separated in the embedding space, reflecting differences in hydrologic regime, trophic state, and regional context\. Consistent with 2018, AL’s trajectory lies near SP and BM due to geographic proximity, yet remains distinct, consistent with differences in trophic state\.

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/five_lakes_2021_regime_trophic.png)Figure 13\. Lake Embedding trajectories comparing the combination of the hydrologic and trophic states in 2021\.
![Refer to caption](https://arxiv.org/html/2606.11268v1/x14.png)Figure 14\. Static Lake Embedding representations categorized according to their respective trophic state and hydrologic regime\.

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/fcr_6lakes_vB_2018.png)\(a\)Trajectories for 2018\.
![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/fcr_6lakes_vB_2017.png)\(b\)Trajectories for 2017\.

Figure 15\. Dynamic Embedding\-based trajectories for simulation lakes sampled from low, intermediate, and high w\_p\_cyano groups for \(a\) 2018 and \(b\) 2017\.

### C\.3\. Embeddings from Simulated Lakes

To further assess whether the implicit parameter sensitivity extends beyond static representations, we analyze the temporal evolution of dynamic embeddings for lakes sampled from low, intermediate, and high w\_p\_cyano groups\. Figure [15](https://arxiv.org/html/2606.11268#A3.F15) shows embedding trajectories for two representative lakes from each group over a full annual cycle\. Within each w\_p\_cyano group, lakes follow highly similar seasonal trajectories, while trajectories diverge systematically across regimes\. This indicates that LakeFM captures parameter\-conditioned dynamical behavior, organizing lake evolution according to the underlying phytoplankton growth parameter groups\.

### C\.4\. Additional Results on Non\-US Lakes

Because the current OOD evaluation is limited in real\-world breadth, we obtained additional external lake datasets from Corman et al\. \([2023](https://arxiv.org/html/2606.11268#bib.bib7)\) and evaluated LakeFM and the baselines on them\. Table [7](https://arxiv.org/html/2606.11268#A3.T7) reports DO prediction results \(MSE\) across six additional lakes, including one from Canada \(Harp\) and five from Sweden\. These results provide additional evidence that LakeFM can generalize beyond the original benchmark and across broader real\-world lake settings\.

Table 7\. Dissolved Oxygen \(DO\) prediction MSE on non\-US lakes\.

### C\.5\. Qualitative Analysis \- Variate Masking

Figures [16](https://arxiv.org/html/2606.11268#A3.F16), [17](https://arxiv.org/html/2606.11268#A3.F17), and [18](https://arxiv.org/html/2606.11268#A3.F18) show the time\-series forecasts of variates across lakes with no masking and with individual input variates masked\.

![Refer to caption](https://arxiv.org/html/2606.11268v1/x15.png)\(a\)No Masking\. PRLA @ 1\.0m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x16.png)\(b\)DO Masked\. PRLA @ 1\.0m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x17.png)\(c\)Temp Masked\. PRLA @ 1\.0m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x18.png)\(d\)No Masking\. PRLA @ 2\.0m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x19.png)\(e\)DO Masked\. PRLA @ 2\.0m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x20.png)\(f\)Temp Masked\. PRLA @ 2\.0m

Figure 16\. Variate Masking \- No masking vs Masked prediction Plots for Lake PRLA

![Refer to caption](https://arxiv.org/html/2606.11268v1/x21.png)\(a\)No Masking\. BARC @ 0\.5m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x22.png)\(b\)DO Masked\. BARC @ 0\.5m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x23.png)\(c\)Temp Masked\. BARC @ 0\.5m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x24.png)\(d\)No Masking\. BARC @ 1\.5m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x25.png)\(e\)DO Masked\. BARC @ 1\.5m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x26.png)\(f\)Temp Masked\. BARC @ 1\.5m

Figure 17\. Variate Masking \- No masking vs Masked prediction Plots for Lake BARC

![Refer to caption](https://arxiv.org/html/2606.11268v1/x27.png)\(a\)No Masking\. CB @ 0\.0m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x28.png)\(b\)DO Masked\. CB @ 0\.0m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x29.png)\(c\)Temp Masked\. CB @ 0\.0m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x30.png)\(d\)No Masking\. CB @ 1\.0m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x31.png)\(e\)DO Masked\. CB @ 1\.0m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x32.png)\(f\)Temp Masked\. CB @ 1\.0m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x33.png)\(g\)No Masking\. CB @ 2\.0m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x34.png)\(h\)DO Masked\. CB @ 2\.0m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x35.png)\(i\)Temp Masked\. CB @ 2\.0m

Figure 18\. Variate Masking \- No masking vs Masked Prediction Plots for Lake CB

### C\.6\. Variate Importance \- Single Variate Masking

We mask out one variable in the model’s input/historical window and measure the change in the predictive performance for each of the variables \(including the masked variate\)\. This experiment lets us identify the most influential variables, as well as the variables that are least sensitive to the other variates, within each lake system\. Figure [19](https://arxiv.org/html/2606.11268#A3.F19) shows, for each lake, heatmaps of the change in the error metrics caused by masking out a single variable\.
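
The masking protocol above amounts to filling a matrix of error differences\. The sketch below shows that bookkeeping with a deliberately trivial stand\-in forecaster \(`toy_predict`, which repeats the last observed value and outputs zeros for a missing variable\); it is hypothetical and has no cross\-variable coupling, so only the diagonal entries change here, whereas a real model would generally show nonzero off\-diagonal entries\.

```python
def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def delta_mse_matrix(predict, context, targets):
    """For each masked input variable, report the MSE change of every
    target variable relative to the unmasked baseline."""
    variables = list(targets)
    base = predict(context)
    base_err = {v: mse(base[v], targets[v]) for v in variables}
    delta = {}
    for masked in variables:
        reduced = {v: x for v, x in context.items() if v != masked}
        pred = predict(reduced)
        delta[masked] = {v: mse(pred[v], targets[v]) - base_err[v]
                         for v in variables}
    return delta

def toy_predict(context, horizon=2, variables=("temp", "do")):
    """Hypothetical stand-in model: persistence forecast per variable,
    zeros when the variable is absent from the context."""
    return {v: [context[v][-1]] * horizon if v in context else [0.0] * horizon
            for v in variables}

context = {"temp": [10.0, 12.0], "do": [8.0, 7.0]}
targets = {"temp": [12.0, 12.0], "do": [7.0, 6.0]}
delta = delta_mse_matrix(toy_predict, context, targets)
print(delta)
```

Rows of `delta` correspond to the masked input and columns to the evaluated variable, matching the layout of a per\-lake heatmap; a negative entry means that masking the input improved the forecast\.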

While masking an input variable generally increases the error metrics, this is not always the case\. For example, in SUGG, removing Water DO improves the prediction performance of Water Temp\. Several ecological factors may explain such observations\. DO dynamics reflect the interaction of biological processes \(e\.g\., primary production and respiration\) and physical processes \(e\.g\., mixing and air–water exchange\) \(Staehr et al\., [2010](https://arxiv.org/html/2606.11268#bib.bib26)\), with their relative importance shifting across depths and temporal scales \(Langman et al\., [2010](https://arxiv.org/html/2606.11268#bib.bib15)\)\. Under stable physical conditions, warm temperatures and high irradiance drive photosynthesis and produce a predictable diel DO signal, while deeper waters below the photic zone are dominated by respiration and declining DO\. This structure can be disrupted by mixing events, which obscure the biological signal\. At shorter time scales, internal waves, organismal advection past sensors, and local biotic interactions introduce additional variability that may be sensitive to weather or effectively stochastic \(Langman et al\., [2010](https://arxiv.org/html/2606.11268#bib.bib15); McAfee et al\., [2025](https://arxiv.org/html/2606.11268#bib.bib19)\), making DO an instructive example of a predictor whose information content can vary dramatically across contexts\.

Figure 19\. Per\-lake performance deltas visualized as heatmaps\. For each lake, we show $\Delta$MSE relative to the baseline\.

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/v2v/AL_delta_mse_heatmap.png)

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/v2v/BARC_delta_mse_heatmap.png)

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/v2v/BM_delta_mse_heatmap.png)

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/v2v/BVR_delta_mse_heatmap.png)

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/v2v/CB_delta_mse_heatmap.png)

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/v2v/CR_delta_mse_heatmap.png)

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/v2v/CRAM_delta_mse_heatmap.png)

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/v2v/FCR_delta_mse_heatmap.png)

![[Uncaptioned image]](https://arxiv.org/html/2606.11268v1/figures/v2v/FI_delta_mse_heatmap.png)

![[Uncaptioned image]](https://arxiv.org/html/2606.11268v1/figures/v2v/GL4_delta_mse_heatmap.png)

![[Uncaptioned image]](https://arxiv.org/html/2606.11268v1/figures/v2v/LIRO_delta_mse_heatmap.png)

![[Uncaptioned image]](https://arxiv.org/html/2606.11268v1/figures/v2v/ME_delta_mse_heatmap.png)

![[Uncaptioned image]](https://arxiv.org/html/2606.11268v1/figures/v2v/MO_delta_mse_heatmap.png)

![[Uncaptioned image]](https://arxiv.org/html/2606.11268v1/figures/v2v/PRLA_delta_mse_heatmap.png)

![[Uncaptioned image]](https://arxiv.org/html/2606.11268v1/figures/v2v/PRPO_delta_mse_heatmap.png)

![[Uncaptioned image]](https://arxiv.org/html/2606.11268v1/figures/v2v/SP_delta_mse_heatmap.png)

![[Uncaptioned image]](https://arxiv.org/html/2606.11268v1/figures/v2v/SUGG_delta_mse_heatmap.png)

![[Uncaptioned image]](https://arxiv.org/html/2606.11268v1/figures/v2v/TB_delta_mse_heatmap.png)

![[Uncaptioned image]](https://arxiv.org/html/2606.11268v1/figures/v2v/TOOK_delta_mse_heatmap.png)

![[Uncaptioned image]](https://arxiv.org/html/2606.11268v1/figures/v2v/TR_delta_mse_heatmap.png)

![[Uncaptioned image]](https://arxiv.org/html/2606.11268v1/figures/v2v/WI_delta_mse_heatmap.png)

### C\.7\. Variate Importance \- Single Variate Input

In this experiment, we mask out $V-1$ of the $V$ variates, so that only one variable is present in the model’s input/historical window\. We measure the change in the predictive performance for each of the variables \(including the masked variates\)\. Figure [20](https://arxiv.org/html/2606.11268#A3.F20) shows the heatmaps corresponding to each lake\.

Figure 20\. Per\-lake performance deltas, on passing a single variate as input, visualized as heatmaps\. For each lake, we show $\Delta$MSE relative to the baseline \(no masking\)\. The left axis represents the single input variable passed\.

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/v1/AL_delta_mse_heatmap_v1.png)

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/v1/BM_delta_mse_heatmap_v1.png)

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/v1/BVR_delta_mse_heatmap_v1.png)

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/v1/CB_delta_mse_heatmap_v1.png)

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/v1/CR_delta_mse_heatmap_v1.png)

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/v1/FCR_delta_mse_heatmap_v1.png)

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/v1/GL4_delta_mse_heatmap_v1.png)

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/v1/SP_delta_mse_heatmap_v1.png)

![[Uncaptioned image]](https://arxiv.org/html/2606.11268v1/figures/v1/TB_delta_mse_heatmap_v1.png)

![[Uncaptioned image]](https://arxiv.org/html/2606.11268v1/figures/v1/TR_delta_mse_heatmap_v1.png)

### C\.8\. Qualitative Analysis \- Depth Masking

Figures [21](https://arxiv.org/html/2606.11268#A3.F21) and [22](https://arxiv.org/html/2606.11268#A3.F22) show the effects of masking shallow and deep layers of a lake on lake forecasts\. For CRAM, the shallow depths considered are 1\.0, 2\.0, and 3\.0 m, and the deeper depths are 14\.0, 14\.5, and 15\.0 m\. Similarly, for BARC, we consider 0\.5, 1\.0, and 1\.5 m as shallow depths, and 5\.5 m as the deeper depth\.
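
With observations stored as individual scalar tokens, depth masking can be expressed as a simple filter over the context window\. The sketch below assumes a hypothetical token layout of \(variable, depth in m, day, value\) and assumes that masking means dropping the tokens; the model may instead use a dedicated mask representation, which this appendix does not specify\. The depth sets are the CRAM values quoted above, while the observation values are made up\.

```python
def mask_depths(tokens, depths_to_mask):
    """Drop every context token observed at one of the masked depths.
    tokens: list of (variable, depth_m, day, value) tuples (hypothetical layout)."""
    masked = set(depths_to_mask)
    return [tok for tok in tokens if tok[1] not in masked]

CRAM_SHALLOW = (1.0, 2.0, 3.0)
CRAM_DEEP = (14.0, 14.5, 15.0)

# Made-up context observations for illustration only.
context = [
    ("temp", 1.0, 0, 21.3), ("temp", 2.0, 0, 21.0), ("temp", 14.5, 0, 6.1),
    ("do", 1.0, 1, 8.4), ("do", 3.0, 1, 8.1), ("do", 15.0, 1, 0.9),
]

without_shallow = mask_depths(context, CRAM_SHALLOW)
without_deep = mask_depths(context, CRAM_DEEP)
print(len(without_shallow), len(without_deep))
```

Forecasts for the shallow layers can then be compared across the three contexts \(full, `without_shallow`, `without_deep`\) to see how much each depth range contributes\.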

![Refer to caption](https://arxiv.org/html/2606.11268v1/x36.png)\(a\)No Masking\. CRAM @ 1\.0m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x37.png)\(b\)Shallow Layers Masked\. CRAM @ 1\.0m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x38.png)\(c\)Deeper Layers Masked\. CRAM @ 1\.0m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x39.png)\(d\)No Masking\. CRAM @ 2\.0m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x40.png)\(e\)Shallow Layers Masked\. CRAM @ 2\.0m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x41.png)\(f\)Deeper Layers Masked\. CRAM @ 2\.0m

Figure 21\. Depth Masking \- Prediction performance visualization in the shallow region under no masking, masking the shallow layers and masking the deeper layers in the input, for Lake CRAM

![Refer to caption](https://arxiv.org/html/2606.11268v1/x21.png)\(a\)No Masking\. BARC @ 0\.5m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x42.png)\(b\)Shallow Layers Masked\. BARC @ 0\.5m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x43.png)\(c\)Deeper Layers Masked\. BARC @ 0\.5m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x44.png)\(d\)No Masking\. BARC @ 1\.5m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x45.png)\(e\)Shallow Layers Masked\. BARC @ 1\.5m
![Refer to caption](https://arxiv.org/html/2606.11268v1/x46.png)\(f\)Deeper Layers Masked\. BARC @ 1\.5m

Figure 22\. Depth Masking \- Prediction performance visualization under no masking, masking the shallow layers and masking the deeper layers in the input, for Lake BARC

## Appendix D Full Results

Tables [8](https://arxiv.org/html/2606.11268#A4.T8) and [9](https://arxiv.org/html/2606.11268#A4.T9) report full results for LakeFM and non\-IMTS baselines; IMTS comparisons are provided in Tables [10](https://arxiv.org/html/2606.11268#A4.T10) and [11](https://arxiv.org/html/2606.11268#A4.T11)\.

Table 8\. Performance comparison on OOD LakeBeD\-US data\.

Table 9\. Performance comparison on ID LakeBeD\-US data\.

Table 10\. Performance comparison against IMTS baselines on OOD LakeBeD\-US data\.

Table 11\. Performance comparison against IMTS baselines on ID LakeBeD\-US data\.

## Appendix E Visualization

Figure [23](https://arxiv.org/html/2606.11268#A5.F23) shows forecasting plots across four lakes for water temperature observed at a depth of 0 m\.

![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/prediction_plot_FI_temp_0.0.png)\(a\)Lake FI
![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/prediction_plot_ME_temp_0.0.png)\(b\)Lake ME
![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/prediction_plot_MO_temp_0.0.png)\(c\)Lake MO
![Refer to caption](https://arxiv.org/html/2606.11268v1/figures/prediction_plot_WI_temp_0.0.png)\(d\)Lake WI

Figure 23\. Prediction plots for four lakes \(FI, ME, MO, WI\) for water temperature at depth 0\.0 m\.
