SurF: A Generative Model for Multivariate Irregular Time Series Forecasting

arXiv cs.LG Papers

Summary

SurF is a generative model for multivariate, irregularly sampled time series that uses the Time Rescaling Theorem to transform event sequences into i.i.d. exponential noise, achieving state-of-the-art results across multiple real-world benchmarks.

arXiv:2605.14069v1 Announce Type: new Abstract: Irregularly sampled multivariate event streams remain a stubbornly difficult modality for generative modeling: tokenization-based approaches break down when inter-event intervals vary by orders of magnitude, and neural temporal point processes are bottlenecked by window-level numerical quadrature. We (i) propose SurF, a generative model that uses the Time Rescaling Theorem (TRT) as a learnable bijection between event sequences and i.i.d. unit-rate exponential noise, enabling a single model to be trained across heterogeneous event-stream datasets; (ii) develop three efficient parameterizations of the cumulative intensity that scale to long sequences; and (iii) introduce a Transformer-based encoder for multi-dataset pretraining. On six real-world benchmarks, SurF achieves the best reported time RMSE on Earthquake, Retweet, and Taobao, and is within trial-level noise of the strongest specialist on the remaining three. Under a strict leave-one-out protocol, the held-out checkpoint beats every classical and neural-autoregressive baseline on 5/6 datasets and beats every baseline on Amazon and Earthquake, an initial step toward foundation models over asynchronous event streams.

# SurF: A Generative Model for Multivariate Irregular Time Series Forecasting
Source: [https://arxiv.org/html/2605.14069](https://arxiv.org/html/2605.14069)
Mohammad R. Rezaei (Department of Computer Science, University of Toronto; Vector Institute; Toronto, ON, Canada; mr.rezaei@mail.utoronto.ca), Tejas Balaji (Department of Computer Science, University of Toronto; Vector Institute; Toronto, ON, Canada; tejas.balaji@mail.utoronto.ca), Rahul G. Krishnan (Department of Computer Science, University of Toronto; Vector Institute; Toronto, ON, Canada; rahulgk@cs.toronto.edu)

###### Abstract

Irregularly sampled multivariate event streams remain a stubbornly difficult modality for generative modeling: tokenization-based approaches break down when inter-event intervals vary by orders of magnitude, and neural temporal point processes are bottlenecked by window-level numerical quadrature. We (i) propose SurF, a generative model that uses the Time Rescaling Theorem (TRT) as a learnable bijection between event sequences and i.i.d. unit-rate exponential noise, enabling a single model to be trained across heterogeneous event-stream datasets; (ii) develop three efficient parameterizations of the cumulative intensity that scale to long sequences; and (iii) introduce a Transformer-based encoder for multi-dataset pretraining. On six real-world benchmarks, SurF achieves the best reported time RMSE on Earthquake, Retweet, and Taobao, and is within trial-level noise of the strongest specialist on the remaining three. Under a strict leave-one-out protocol, the held-out checkpoint beats every classical and neural-autoregressive baseline on 5/6 datasets and beats every baseline on Amazon and Earthquake, an initial step toward foundation models over asynchronous event streams.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.14069v1/figures/surf_3d_panel_horizental.png)

Figure 1: The SurF noising–denoising framework in the joint intensity space. A 2D point process with two anti-correlated event types. (Left) In the original time span, type $A$ activity suppresses type $B$ and vice versa. (Right) After the SurF noising process $\mathcal{F}_{\lambda}$ (Theorem [1](https://arxiv.org/html/2605.14069#Thmtheorem1)), the trajectory collapses to i.i.d. $\mathrm{Exp}(1)$ inter-arrival times along the rescaled axis $z$; the inverse $\mathcal{R}_{\lambda}$ recovers the original dynamics losslessly.

Generative models of regularly sampled time-series data such as TimesFM ([Das et al., 2024](https://arxiv.org/html/2605.14069#bib.bib110)), Chronos ([Ansari et al., 2024](https://arxiv.org/html/2605.14069#bib.bib111)), and Lag-Llama ([Rasul et al., 2024](https://arxiv.org/html/2605.14069#bib.bib112)) share a key assumption: that data arrive at a fixed rate along synchronized channels. This precludes their application to *irregularly sampled multivariate event streams*, where events arrive asynchronously, streams update at different frequencies, and observation timing itself carries semantic meaning; existing tokenization schemes also break down when inter-event intervals vary by orders of magnitude. Yet such data arise throughout high-stakes domains: clinical records with variable vital-sign intervals ([Rubanova et al., 2019](https://arxiv.org/html/2605.14069#bib.bib79); [Rezaei et al., 2022](https://arxiv.org/html/2605.14069#bib.bib125)), asynchronous financial order flows ([Bacry et al., 2015](https://arxiv.org/html/2605.14069#bib.bib3); [Hawkes, 2018](https://arxiv.org/html/2605.14069#bib.bib17)), event-driven sensor measurements ([Shukla and Marlin, 2021](https://arxiv.org/html/2605.14069#bib.bib114)), and coordinated neural spike trains ([Truccolo et al., 2005](https://arxiv.org/html/2605.14069#bib.bib21); [Pillow et al., 2008](https://arxiv.org/html/2605.14069#bib.bib26)). The standard mathematical object for these data is the *temporal point process* (TPP), which models a stream of events through its conditional intensity $\lambda^{*}(t\mid\mathcal{H}_{t})$: the instantaneous event rate at time $t$ given history $\mathcal{H}_{t}$. Its integral, the cumulative intensity $\Lambda^{*}(t)=\int_{0}^{t}\lambda^{*}(s\mid\mathcal{H}_{s})\,ds$, governs how long one expects to wait for the next event. Fitting a TPP amounts to choosing a parameterization of $\lambda^{*}$ and estimating its parameters from data.

A recurring difficulty in this fitting problem is evaluation: standard regression diagnostics do not apply to event data, and since every process has its own intensity, timescales, and shape, there is no obvious reference distribution to compare a fitted model against. In the early 2000s, the Time Rescaling Theorem (TRT) ([Brown et al., 2002](https://arxiv.org/html/2605.14069#bib.bib4)) resolved this: if the fitted $\Lambda$ is correct, the rescaled gaps $\Delta z_{i}=\Lambda(t_{i})-\Lambda(t_{i-1})$ are i.i.d. $\mathrm{Exp}(1)$, so a Q–Q plot or Kolmogorov–Smirnov statistic ([Truccolo et al., 2005](https://arxiv.org/html/2605.14069#bib.bib21); [Brown et al., 2002](https://arxiv.org/html/2605.14069#bib.bib4)) suffices for goodness-of-fit. The TRT became standard practice as an *evaluation* tool. Two decades later, neural TPPs rose in popularity as a means to leverage the representational flexibility of neural networks for event-stream modeling. Most neural TPPs are fit by maximum likelihood, which at every step requires the intensity $\lambda_{\theta}$ at each event and its integral $\int_{0}^{T}\lambda_{\theta}(s)\,ds$ over the observation window. Intensity-first methods parameterize $\lambda_{\theta}$ directly and approximate the integral by numerical quadrature, a cost that grows with the window and accumulates with every gradient step. The cumulative-hazard family ([Omi et al., 2019b](https://arxiv.org/html/2605.14069#bib.bib45); [Shchur et al., 2019](https://arxiv.org/html/2605.14069#bib.bib23)) avoids this by parameterizing $\Lambda_{\theta}$ with a monotone neural network and recovering $\lambda_{\theta}$ by differentiation, collapsing the window integral to a single evaluation $\Lambda_{\theta}(t)$.

Our key insight is that this cumulative-hazard family is, in effect, fitting only the forward direction of the TRT. SurF treats $\Lambda$ as a bijection and uses it in both directions, enabling a generative model: $\Lambda_{\theta}$ maps event sequences to i.i.d. $\mathrm{Exp}(1)$ noise (forward, for training), while $(\Lambda_{\theta})^{-1}$ maps exponential noise back to event times (reverse, for sampling). We develop three parameterizations of the cumulative intensity, two yielding closed-form likelihoods and a third whose cost is $T$-independent and whose error is empirically below the gradient noise floor on all six benchmarks (Appendix [H.3](https://arxiv.org/html/2605.14069#A8.SS3)). Prior neural TPPs are trained per dataset, concatenating event streams and summing their likelihoods; this can destabilize learning because per-dataset losses have incompatible scales. SurF resolves this by recognizing that patterns in multiple event streams can be learned against a single shared canonical target: $\Lambda_{\theta}$ fit jointly on $K$ streams optimizes a single well-defined objective and pushes every dataset onto $\mathrm{Exp}(1)$. Since all sequences map to the same target, the shared $(\phi_{\theta},\Lambda_{\theta})$ accumulates similarities across event streams, enabling the same parameters to support zero-shot prediction on a new (unseen) dataset (Section [3.2](https://arxiv.org/html/2605.14069#S3.SS2)).

**Contributions.** (1) *TRT as an explicit bidirectional flow.* Prior cumulative-hazard neural TPPs ([Omi et al., 2019a](https://arxiv.org/html/2605.14069#bib.bib28); [Shchur et al., 2019](https://arxiv.org/html/2605.14069#bib.bib23)) invoke the TRT implicitly at training time. We make both directions explicit and record the conditions on $\Lambda_{\theta}$ (smooth, strictly increasing on $\mathbb{R}^{+}$) under which the inverse function theorem yields a smooth bijection (Theorem [2](https://arxiv.org/html/2605.14069#Thmtheorem2)); our contribution lies in the framing and its consequences. (2) *Three parameterizations of the intensity function.* We parameterize $\Lambda_{\theta}$ as a monotone neural network and recover $\lambda_{\theta}$ via automatic differentiation. MoE and CSB are fully closed-form; GLQ uses an unconstrained positive MLP with a fixed $O(Q)$ per-interval rule whose cost is $T$-independent and whose error is negligible in practice (Appendix [H.3](https://arxiv.org/html/2605.14069#A8.SS3)). (3) *Cross-dataset and zero-shot evaluation.* On six benchmarks, a jointly trained SurF checkpoint achieves the best reported time RMSE on Earthquake, Retweet, and Taobao, and is within trial-level noise of the strongest baseline on the remaining three. Under a strict leave-one-out protocol, the held-out checkpoint beats every classical and neural-autoregressive baseline on 5/6 datasets (losing only on StackOverflow); to our knowledge this is the first reported leave-one-out cross-dataset evaluation for a learned cumulative intensity, and an initial step toward foundation models over asynchronous event streams.

## 2 Background

We address generative modeling and forecasting for irregularly sampled multivariate event streams: specifically, density estimation, next-event prediction, and multi-step rollout from a learned conditional intensity. Let $\mathcal{D}=\{(t_{i},k_{i})\}_{i=1}^{N}$ denote a sequence of $N$ events in $[0,T]$ with $0<t_{1}<\cdots<t_{N}\leq T$ and types $k_{i}\in\{1,\ldots,K\}$. A temporal point process is specified by its conditional intensity $\lambda^{*}(t,k\mid\mathcal{H}_{t})$, where $\mathcal{H}_{t}=\{(t_{j},k_{j}): t_{j}<t\}$ ([Daley and Vere-Jones, 2003](https://arxiv.org/html/2605.14069#bib.bib9)). Writing the ground intensity $\lambda^{*}(t\mid\mathcal{H}_{t})=\sum_{k}\lambda^{*}(t,k\mid\mathcal{H}_{t})$, the likelihood factorizes as

$$p(\mathcal{D})=\Bigl[\prod_{i=1}^{N}\lambda^{*}(t_{i},k_{i}\mid\mathcal{H}_{t_{i}})\Bigr]\exp\Bigl(-\int_{0}^{T}\lambda^{*}(s\mid\mathcal{H}_{s})\,ds\Bigr),\qquad (1)$$

with the exponential term the survival probability over $[0,T]$. The name *Survival Flow* captures this structure: SurF learns a bijective flow between observed event times and the canonical survival form. Marks factorize as $\lambda^{*}(t,k)=\lambda^{*}(t)\,p(k\mid t,\mathcal{H}_{t})$, which decomposes the joint NLL additively (Appendix [H.1](https://arxiv.org/html/2605.14069#A8.SS1)); the rest of this section develops the ground-intensity machinery.
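
As a concrete reading of equation (1), the sketch below evaluates the unmarked log-likelihood for a toy intensity; the homogeneous-Poisson example, the quadrature grid, and the function names are illustrative assumptions, not part of SurF.

```python
# A minimal numerical sketch of Eq. (1) for an unmarked process, assuming the
# ground intensity lambda*(t | H_t) is available as a plain Python callable.
import numpy as np

def log_likelihood(event_times, intensity, T, grid_size=10_000):
    """log p(D) = sum_i log lambda*(t_i) - int_0^T lambda*(s) ds  (Eq. 1, no marks)."""
    event_times = np.asarray(event_times, dtype=float)
    log_terms = np.sum(np.log([intensity(t) for t in event_times]))
    # Survival term: crude trapezoidal quadrature over the whole window [0, T].
    grid = np.linspace(0.0, T, grid_size)
    vals = np.array([intensity(s) for s in grid])
    compensator = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(grid))
    return log_terms - compensator

# Example: homogeneous Poisson process with rate 2 on [0, 10].
rng = np.random.default_rng(0)
times = np.sort(rng.uniform(0.0, 10.0, rng.poisson(2.0 * 10.0)))
print(log_likelihood(times, lambda t: 2.0, T=10.0))
```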

##### Time Rescaling Theorem (TRT).

The TRT establishes the existence of a common target onto which SurF will map event sequences using a learnable flow.

###### Theorem 1 (Time Rescaling; [Brown et al., 2002](https://arxiv.org/html/2605.14069#bib.bib4)).

Let $\{t_{i}\}_{i=1}^{N}$ be a realization of a point process with intensity $\lambda^{*}(t\mid\mathcal{H}_{t})>0$, and define the cumulative intensity $\Lambda^{*}(t)=\int_{0}^{t}\lambda^{*}(s\mid\mathcal{H}_{s})\,ds$. The transformed times $z_{i}=\Lambda^{*}(t_{i})$ form a unit-rate Poisson process; equivalently, the gaps $\Delta z_{i}=z_{i}-z_{i-1}$ are i.i.d. $\mathrm{Exp}(1)$ (Appendix [A](https://arxiv.org/html/2605.14069#A1)).

The theorem maps *any* temporal point process to the same canonical distribution, $\mathrm{Exp}(1)$, which is what makes a flow-based generative model possible. To generate valid samples after learning, we must also learn the reverse direction, which the next section establishes.
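
As a quick numerical illustration of Theorem 1 (our own sketch; the sinusoidal intensity and the thinning simulator are assumptions chosen for the example), rescaling the event times of a known process by its cumulative intensity should leave gaps that pass a KS test against $\mathrm{Exp}(1)$:

```python
# Simulate an inhomogeneous Poisson process with lam(t) = 2 + sin(t)**2 (illustrative,
# not from the paper), rescale by its closed-form cumulative intensity, and check that
# the rescaled gaps Delta z_i look i.i.d. Exp(1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
lam = lambda t: 2.0 + np.sin(t) ** 2          # intensity lambda(t)
lam_max, T = 3.0, 2000.0                      # upper bound on lam and time horizon

# Ogata thinning: propose candidates at rate lam_max, keep each with prob lam(t)/lam_max.
t, events = 0.0, []
while True:
    t += rng.exponential(1.0 / lam_max)
    if t > T:
        break
    if rng.uniform() < lam(t) / lam_max:
        events.append(t)
events = np.array(events)

# Forward map: Lambda(t) = 2t + t/2 - sin(2t)/4 (closed form for this intensity).
Lam = lambda t: 2.0 * t + 0.5 * t - np.sin(2.0 * t) / 4.0
gaps = np.diff(np.concatenate([[0.0], Lam(events)]))

# KS test against Exp(1): a large p-value means the rescaled gaps are consistent with Exp(1).
print(stats.kstest(gaps, "expon").pvalue)
```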

## 3 SurF: Survival Flow for Temporal Point Processes

While Theorem [1](https://arxiv.org/html/2605.14069#Thmtheorem1) establishes the existence of a forward map $t\mapsto\Lambda^{*}(t)$ that sends a point process to a unit-rate Poisson reference, three problems remain. First, it does not guarantee that $\Lambda^{*}$ is *invertible*, so we cannot map noise back to event times; second, it does not establish *smoothness* of the inverse, which we need for gradient-based training; and third, it does not provide the *Jacobian* required to evaluate likelihoods under a change of variables. We propose the following theorem, which closes all three gaps under a mild positivity condition on $\lambda^{*}$ and turns the TRT from an evaluation tool into a learnable flow.

###### Theorem 2 (Reverse Rescaling and Bijectivity).

Assume $\lambda^{*}(\cdot\mid\mathcal{H}_{\cdot})$ is continuous and satisfies $\lambda^{*}(t\mid\mathcal{H}_{t})\geq\lambda_{\min}>0$ for all $t\geq 0$. Then $\Lambda^{*}:\mathbb{R}^{+}\to\mathbb{R}^{+}$ is a $C^{1}$ bijection with $C^{1}$ inverse; if in addition $\lambda^{*}$ is $C^{k}$ ($k\geq 0$), then both $\Lambda^{*}$ and $(\Lambda^{*})^{-1}$ are $C^{k+1}$. Conversely, if $\{z_{i}\}$ is a realization of a unit-rate Poisson process, then $\{t_{i}=(\Lambda^{*})^{-1}(z_{i})\}$ is a realization of the original point process with conditional intensity $\lambda^{*}$, and the change of variables $\mathbf{t}\mapsto\mathbf{z}$ has diagonal Jacobian $\det(\partial\mathbf{z}/\partial\mathbf{t})=\prod_{i=1}^{N}\lambda^{*}(t_{i}\mid\mathcal{H}_{t_{i}})>0$.

Proof sketch: $(\Lambda^{*})'=\lambda^{*}\geq\lambda_{\min}>0$, so $\Lambda^{*}$ is strictly increasing and $C^{1}$, with $\Lambda^{*}(0)=0$ and $\Lambda^{*}(t)\geq\lambda_{\min}\,t\to\infty$; together with monotonicity, this gives a $C^{1}$ bijection $\mathbb{R}^{+}\to\mathbb{R}^{+}$, and the inverse function theorem yields a $C^{1}$ inverse. The reverse direction is the contrapositive of Theorem [1](https://arxiv.org/html/2605.14069#Thmtheorem1): the rescaling $\Phi:\{t_{i}\}\mapsto\{\Lambda^{*}(t_{i})\}$ pushes the law of the original process forward to a unit-rate Poisson, and bijectivity of $\Phi$ makes $\Phi^{-1}$ pull the unit-rate Poisson law back to the original (Appendix [A](https://arxiv.org/html/2605.14069#A1)).

Theorem [2](https://arxiv.org/html/2605.14069#Thmtheorem2) requires $\lambda^{*}(t\mid\mathcal{H}_{t})\geq\lambda_{\min}>0$. We enforce this structurally with a small floor $\lambda_{\text{floor}}>0$ (learnable, default $10^{-4}$; see Appendix [A.1](https://arxiv.org/html/2605.14069#A1.SS1)). This introduces an additive bias of order $\tau_{i}\cdot\lambda_{\text{floor}}$ per inter-event interval, bounded above by $T\cdot\lambda_{\text{floor}}$ per sequence (below the NLL noise floor on all six benchmarks: $<0.1$ nats per sequence at $\lambda_{\text{floor}}=10^{-4}$ and normalized $T\leq 10^{3}$), but potentially material for processes with dead-time (neural refractoriness, market closures), for which Appendix [A.1](https://arxiv.org/html/2605.14069#A1.SS1) provides a learnable-floor and explicit-masking scheme that recovers bijectivity without changing the Jacobian factorization. The two theorems define mutually inverse maps that together form the SurF bijection (Figure [1](https://arxiv.org/html/2605.14069#S1.F1)). The forward map $\mathcal{F}_{\lambda}(t)=\Lambda^{*}(t)$ warps time so that high-intensity regions stretch and low-intensity regions compress, yielding i.i.d. $\mathrm{Exp}(1)$ gaps. The inverse $\mathcal{R}_{\lambda}(z)=(\Lambda^{*})^{-1}(z)$ reconstructs temporal structure and can be obtained by solving $dt/dz=1/\lambda^{*}(t\mid\mathcal{H}_{t})$ with $t(0)=0$: high intensity makes time advance slowly (event clusters), low intensity makes it advance quickly (gaps). A detailed comparison to prior cumulative-hazard neural TPPs (FullyNN ([Omi et al., 2019a](https://arxiv.org/html/2605.14069#bib.bib28)), log-normal mixture ([Shchur et al., 2019](https://arxiv.org/html/2605.14069#bib.bib23))) is in Appendix [H.2](https://arxiv.org/html/2605.14069#A8.SS2).

### 3.1 Training Objective: Amortized SurF Loss

To learn the SurF bijection, we *directly amortize the cumulative intensity*: a neural network learns $\Lambda_{\theta}(\Delta t\mid\mathbf{h})$, and the instantaneous intensity is recovered by differentiation. This inverts the usual pipeline: instead of *model $\lambda$ then integrate*, we *model $\Lambda$ then differentiate*, yielding a maximum-likelihood objective with no window-level integration over $[0,T]$. We parameterize $\Lambda_{\theta}:\mathbb{R}^{+}\times\mathbb{R}^{d}\to\mathbb{R}^{+}$ as a monotonically increasing function of $\Delta t=t-t_{i-1}$ conditioned on the history encoding $\mathbf{h}_{i-1}$, with $\Lambda_{\theta}(0\mid\mathbf{h})=0$. The intensity is then

$$\lambda_{\theta}(\Delta t\mid\mathbf{h})\;=\;\frac{\partial\Lambda_{\theta}(\Delta t\mid\mathbf{h})}{\partial(\Delta t)}\;>\;0,\qquad (2)$$

with positivity guaranteed by monotonicity. We propose three monotone parameterizations; all three variants instantiate equation [3](https://arxiv.org/html/2605.14069#S3.E3) and differ only in how $\Lambda_{\theta}$ and $\lambda_{\theta}$ are computed.
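
The following minimal sketch illustrates equation (2) with automatic differentiation; the toy monotone `cumulative_intensity` below stands in for the paper's encoder-conditioned $\Lambda_{\theta}$ and is our own assumption:

```python
# Recover the intensity as the derivative of a monotone cumulative intensity (Eq. 2).
import torch

def cumulative_intensity(dt, h):
    # Toy monotone-in-dt function with Lambda(0 | h) = 0; h plays the role of the history encoding.
    slope = torch.nn.functional.softplus(h)  # positive per-dimension slopes derived from the history
    return (slope * (torch.nn.functional.softplus(dt + h) - torch.nn.functional.softplus(h))).sum()

dt = torch.tensor(0.7, requires_grad=True)
h = torch.randn(8)
Lam = cumulative_intensity(dt, h)
(lam,) = torch.autograd.grad(Lam, dt)  # lambda_theta(dt | h) = dLambda/d(dt), positive by monotonicity
print(Lam.item(), lam.item())
```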

**SurF-MoE** (mixture of exponentials, closed-form). With $w_{j},\gamma_{j}>0$ output by the history encoder,
$$\lambda_{\theta}(\Delta t\mid\mathbf{h})=\sum_{j=1}^{J}w_{j}e^{-\gamma_{j}\Delta t},\qquad \Lambda_{\theta}(\Delta t\mid\mathbf{h})=\lambda_{\text{floor}}\Delta t+\sum_{j=1}^{J}\tfrac{w_{j}}{\gamma_{j}}\bigl(1-e^{-\gamma_{j}\Delta t}\bigr).$$

**SurF-CSB** (cumulative softplus basis, closed-form). $\Lambda_{\theta}$ is a sum of shifted softplus primitives with positive $\alpha_{m},\beta_{m}$,
$$\Lambda_{\theta}(\Delta t\mid\mathbf{h})=\sum_{m=1}^{M}\alpha_{m}\bigl[\log(1+e^{\beta_{m}\Delta t+\delta_{m}})-\log(1+e^{\delta_{m}})\bigr],$$
whose derivative $\lambda_{\theta}=\sum_{m}\alpha_{m}\beta_{m}\,\sigma(\beta_{m}\Delta t+\delta_{m})$ is also closed-form. Unlike MoE, CSB represents non-monotonic shapes and is a universal approximator for positive intensities (see the [Proposition](https://arxiv.org/html/2605.14069#Thmpropositionx2) in the appendix).
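
Both closed-form heads above can be written in a few lines; the sketch below uses hard-coded toy parameter values in place of the history encoder's outputs (an assumption for illustration):

```python
# Closed-form SurF-MoE and SurF-CSB heads, given per-event parameters (w_j, gamma_j)
# and (alpha_m, beta_m, delta_m) that would normally come from the history encoder.
import numpy as np

def moe_lambda_Lambda(dt, w, gamma, lam_floor=1e-4):
    """Mixture of exponentials; the floor term keeps lambda = dLambda/d(dt) >= lam_floor."""
    lam = lam_floor + np.sum(w * np.exp(-gamma * dt))
    Lam = lam_floor * dt + np.sum((w / gamma) * (1.0 - np.exp(-gamma * dt)))
    return lam, Lam

def csb_lambda_Lambda(dt, alpha, beta, delta):
    """Cumulative softplus basis: Lambda is a sum of shifted softplus primitives, Lambda(0) = 0."""
    softplus = lambda x: np.log1p(np.exp(x))
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    Lam = np.sum(alpha * (softplus(beta * dt + delta) - softplus(delta)))
    lam = np.sum(alpha * beta * sigmoid(beta * dt + delta))
    return lam, Lam

w, gamma = np.array([0.5, 1.2]), np.array([0.3, 2.0])
alpha, beta, delta = np.array([0.4, 0.8]), np.array([0.5, 1.5]), np.array([-1.0, 0.5])
print(moe_lambda_Lambda(0.7, w, gamma))
print(csb_lambda_Lambda(0.7, alpha, beta, delta))
```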

**SurF-GLQ** (unconstrained positive MLP, per-interval quadrature). $\lambda_{\theta}=\mathrm{softplus}(f_{\theta}(\Delta t,\mathbf{h}))$ for an unconstrained MLP $f_{\theta}$; $\Lambda_{\theta}$ is obtained by fixed $Q$-point Gauss–Legendre quadrature on $[0,\tau_{i}]$ with $Q=8$. This quadrature is (i) applied per inter-event interval rather than over $[0,T]$, (ii) of cost $O(NQ)$ with $Q$ fixed and independent of $T$, and (iii) empirically below floating-point noise at $Q=8$ on all six benchmarks (Appendix [D.2.1](https://arxiv.org/html/2605.14069#A4.SS2.SSS1.Px3)), so it does not enter the training gradient in practice. This is the precise sense in which SurF avoids the $O(NQ')$, $T$-dependent quadrature over $[0,T]$ that bottlenecks THP/SAHP. Variant-selection guidelines are in Appendix [H.3](https://arxiv.org/html/2605.14069#A8.SS3).
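
A minimal sketch of the per-interval Gauss–Legendre rule used by SurF-GLQ, assuming a toy stand-in for $\mathrm{softplus}(f_{\theta})$; only the quadrature mechanics follow the text:

```python
# Fixed Q-point Gauss-Legendre approximation of Lambda(tau) = int_0^tau lambda(s) ds
# on a single inter-event interval [0, tau_i].
import numpy as np

def glq_cumulative_intensity(intensity, tau, Q=8):
    """Map the Q Legendre nodes from [-1, 1] onto [0, tau] and sum the weighted intensities."""
    nodes, weights = np.polynomial.legendre.leggauss(Q)   # nodes and weights on [-1, 1]
    s = 0.5 * tau * (nodes + 1.0)                         # rescaled nodes on [0, tau]
    return 0.5 * tau * np.sum(weights * intensity(s))

softplus = lambda x: np.log1p(np.exp(x))
toy_intensity = lambda s: softplus(0.3 * s - 0.5)         # stand-in for softplus(f_theta(s, h))
print(glq_cumulative_intensity(toy_intensity, tau=2.0))
```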

Substituting equation [2](https://arxiv.org/html/2605.14069#S3.E2) into equation [1](https://arxiv.org/html/2605.14069#S2.E1) and applying the exact change of variables $\mathbf{t}\mapsto\mathbf{z}$ with the diagonal Jacobian of Theorem [2](https://arxiv.org/html/2605.14069#Thmtheorem2) (Appendix [B](https://arxiv.org/html/2605.14069#A2)) gives the amortized SurF loss:

$$\mathcal{L}_{\text{SurF}}(\theta)\;=\;\sum_{i=1}^{N}\Lambda_{\theta}(\tau_{i}\mid\mathbf{h}_{i-1})+\Lambda_{\theta}(\Delta T\mid\mathbf{h}_{N})-\sum_{i=1}^{N}\log\lambda_{\theta}(\tau_{i}\mid\mathbf{h}_{i-1}),\qquad (3)$$

where $\tau_{i}=t_{i}-t_{i-1}$ and $\Delta T=T-t_{N}$. The first two terms are forward passes of the monotone network; the third is its derivative (autodiff or closed form). For marked sequences, $\mathcal{L}_{\text{SurF}}$ is added to the mark cross-entropy $\mathcal{L}_{\text{type}}$ with scale-balancing weight $\beta_{\text{type}}$ (Appendix [H.1](https://arxiv.org/html/2605.14069#A8.SS1)). Finally, we add a term $\mathcal{L}_{\text{fcst}}=\sqrt{\sum_{i=1}^{N}(\hat{\tau}_{i}-\tau_{i})^{2}}$ with scale-balancing weight $\beta_{\text{fcst}}$ to improve decoder training (where $\hat{\tau}_{i}$ are the predicted inter-arrival times). The overall loss is thus $\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{SurF}}+\beta_{\text{type}}\mathcal{L}_{\text{type}}+\beta_{\text{fcst}}\mathcal{L}_{\text{fcst}}$.
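
To make the bookkeeping of equation (3) concrete, here is a sketch that plugs per-interval callables for $\Lambda_{\theta}$ and $\lambda_{\theta}$ into the loss; the constant-rate head and the function names are illustrative assumptions:

```python
# Amortized SurF loss (Eq. 3), given per-interval callables Lambda(tau, h) and lam(tau, h)
# and precomputed history encodings h_i; a toy constant-rate head stands in for both.
import numpy as np

def surf_loss(taus, delta_T, Lambda, lam, hs, h_last):
    """L_SurF = sum_i Lambda(tau_i | h_{i-1}) + Lambda(T - t_N | h_N) - sum_i log lam(tau_i | h_{i-1})."""
    compensator = sum(Lambda(tau, h) for tau, h in zip(taus, hs))
    tail = Lambda(delta_T, h_last)                             # survival term for the empty tail [t_N, T]
    log_intensity = sum(np.log(lam(tau, h)) for tau, h in zip(taus, hs))
    return compensator + tail - log_intensity

rate = 1.5                                                     # toy head: Lambda = rate * tau, lam = rate
taus = np.array([0.4, 0.9, 0.2])                               # inter-event times tau_i
hs = [None] * len(taus)                                        # placeholder history encodings h_{i-1}
print(surf_loss(taus, delta_T=0.5, Lambda=lambda tau, h: rate * tau,
                lam=lambda tau, h: rate, hs=hs, h_last=None))
```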

**Generative Sampling.** SurF generates events by inverting the time rescaling: sample exponential noise and solve for event times under the learned cumulative intensity. Given history $\mathbf{h}$, draw $U\sim\mathrm{Uniform}(0,1)$, set $z=-\log U$ so that $z\sim\mathrm{Exp}(1)$, and solve $\Lambda_{\theta}(\Delta t\mid\mathbf{h})=z$ for $\Delta t$. Because $\Lambda_{\theta}$ is $C^{1}$ and strictly increasing with slope lower-bounded by $\lambda_{\text{floor}}>0$, the root exists and is unique. We use a safeguarded Newton method ([Nocedal and Wright, 2006](https://arxiv.org/html/2605.14069#bib.bib119)); the update $\Delta t^{(n+1)}=\Delta t^{(n)}-\frac{\Lambda_{\theta}(\Delta t^{(n)}\mid\mathbf{h})-z}{\lambda_{\theta}(\Delta t^{(n)}\mid\mathbf{h})}$ is accepted when it remains inside the current bracket, and bisection is used otherwise. This converges globally from any valid bracket, with local quadratic rate once inside the Newton–Kantorovich region (Appendix [E.1](https://arxiv.org/html/2605.14069#A5.SS1)). The mark is drawn from the softmax of $\mathrm{TypeNet}_{\theta}(\mathbf{h})$ and the history is updated; the full procedure is in Algorithm [3](https://arxiv.org/html/2605.14069#alg3). Table [1](https://arxiv.org/html/2605.14069#S3.T1) summarizes the parameterization of SurF against existing neural TPPs. Having established how $\Lambda_{\theta}$ is parameterized, we now ask whether a single model trained on multiple datasets can generalize zero-shot to unseen ones.
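
A sketch of this sampling step, with a safeguarded Newton iteration and bisection fallback as described above; the toy constant-rate $\Lambda$ and the bracket-growing heuristic are our own assumptions:

```python
# Draw z ~ Exp(1) and invert Lambda(dt | h) = z with safeguarded Newton + bisection fallback.
import numpy as np

def invert_cumulative(Lambda, lam, z, lo=0.0, hi=1.0, tol=1e-10, max_iter=100):
    """Solve Lambda(dt) = z for dt > 0; Lambda is strictly increasing with Lambda(0) = 0."""
    while Lambda(hi) < z:                                  # grow the bracket until it contains the root
        lo, hi = hi, 2.0 * hi
    dt = 0.5 * (lo + hi)
    for _ in range(max_iter):
        f = Lambda(dt) - z
        if abs(f) < tol:
            break
        lo, hi = (dt, hi) if f < 0 else (lo, dt)           # maintain the bracket around the root
        step = dt - f / lam(dt)                            # Newton step using lambda = dLambda/d(dt)
        dt = step if lo < step < hi else 0.5 * (lo + hi)   # fall back to bisection if Newton leaves the bracket
    return dt

rng = np.random.default_rng(0)
rate = 1.5
Lambda = lambda dt: rate * dt                              # toy constant-rate cumulative intensity
lam = lambda dt: rate
z = -np.log(rng.uniform())                                 # z ~ Exp(1)
print(invert_cumulative(Lambda, lam, z), z / rate)         # the two values should match
```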

Table 1: Parameterization comparison across neural TPPs. "Window quadrature" refers to the $O(NQ')$ numerical integration of $\lambda$ over $[0,T]$ inside the training likelihood; "per-interval" refers to a fixed $O(Q)$ rule on $[0,\tau_{i}]$ whose cost does not grow with $T$. The Sampling column reports the procedure used to generate event times given a learned model.

### 3.2 Zero-Shot Capabilities Across Datasets

We now show that SurF's time-rescaling objective enables zero-shot prediction. To see why, consider the encoder first. The goal is to learn an invertible mapping $\phi_{\theta}$ from the inter-arrival times of an arbitrary temporal point process to a sequence of $\mathrm{Exp}(1)$-distributed random variables (whose existence is guaranteed by Theorem [2](https://arxiv.org/html/2605.14069#Thmtheorem2)). Since each $\Lambda_{\theta}(\tau_{i}\mid\mathbf{h}_{i})$ term of $\mathcal{L}_{\text{SurF}}$ is $\mathrm{Exp}(1)$-distributed by Corollary [1](https://arxiv.org/html/2605.14069#Thmcorollary1), training on any dataset amounts to maximizing the likelihood of an $\mathrm{Exp}(1)$ distribution governed by $\theta$, in this case $\phi_{\theta}$. Since $\phi_{\theta}$ is shared across datasets, the variance in inter-arrival times across the training datasets helps create a robust mapping from input to $\mathrm{Exp}(1)$ space. Moreover, although multiple input TPPs might have similar inter-arrival times between certain events, the dependence of $\phi_{\theta}$ on the event history ensures that event patterns in input space play a pivotal role in governing the mapping. The signal accumulates in the shared parameters $(\phi_{\theta},\Lambda_{\theta})$ as a dataset-general repertoire of history features (clustering, decay, mark–time coupling, refractory gaps); per-dataset rate and scale constants are absorbed by the rescaling itself. At zero-shot time, an unseen dataset shares the same canonical target, so the same $(\phi_{\theta},\Lambda_{\theta})$ applies without adaptation. We state this formally in the Appendix ([Proposition](https://arxiv.org/html/2605.14069#Thmpropositionx4)).

The decoder is primarily assisted by the $\mathcal{L}_{\text{fcst}}$ term, which directly minimizes the RMSE of the predicted inter-arrival times. Again, the variance in the inter-arrival times of the training data yields a robust representation that precisely maps regions of $\mathrm{Exp}(1)$ space back to input space so as to minimize time RMSE. Note that it is crucial here for $\mathrm{Im}(\phi_{\theta})$ to follow a single, ideally simple distribution *across datasets* to ensure a tractable objective for the decoder (Theorem [1](https://arxiv.org/html/2605.14069#Thmtheorem1)). Thus, a single pretrained SurF checkpoint should transfer zero-shot. Section [4.1](https://arxiv.org/html/2605.14069#S4.SS1) confirms this: the pretrained checkpoint outperforms every classical and neural-autoregressive baseline on four of six benchmarks without dataset-specific adaptation; to our knowledge this is the first empirical evidence that the TRT's canonical target enables cross-dataset transfer for a learned cumulative intensity.

## 4 Experiments

We assess SurF as a general-purpose generative model for irregular event streams by asking: (i) Can the SurF bijection reconstruct rich, non-monotonic intensity patterns? (§[4](https://arxiv.org/html/2605.14069#S4)). (ii) Can a single shared $\Lambda_{\theta}$ match or outperform dataset-specific models on next-event prediction across six diverse corpora, and is the induced density well calibrated under the TRT? (§[4.1](https://arxiv.org/html/2605.14069#S4.SS1)). (iii) Does SurF remain stable under repeated autoregressive decoding, and does the canonical target allow strict zero-shot transfer to unseen datasets? (§[4.2](https://arxiv.org/html/2605.14069#S4.SS2)). To investigate these questions, we use six benchmarks covering time scales from seconds to days: A) Taobao ([Xue et al., 2022](https://arxiv.org/html/2605.14069#bib.bib53)), B) Taxi ([Whong, 2014](https://arxiv.org/html/2605.14069#bib.bib84)), C) Retweet ([Zhou et al., 2013](https://arxiv.org/html/2605.14069#bib.bib89)), D) StackOverflow ([Leskovec and Krevl, 2014](https://arxiv.org/html/2605.14069#bib.bib83)), E) Amazon ([Ni, 2018](https://arxiv.org/html/2605.14069#bib.bib85)), and F) Earthquake (${\sim}1.3$M events, ${\sim}32$K sequences, $75$ event types in total; Appendix [I](https://arxiv.org/html/2605.14069#A9)). We use RMSE for time prediction and accuracy for type prediction as our performance measures.

**Proof of Concept: Recovering Oscillatory Firing Rates.** Before turning to benchmarks, we verify on a controlled synthetic process that SurF can learn what it promises to learn. We generated synthetic spike trains from $\lambda(t)=0.5+19.5\bigl((1+\cos(\pi t))/2\bigr)^{3}$ (base rate $0.5$ Hz, peak $20$ Hz) and trained on $200$ trials of $6$ s. SurF-MoE accurately recovers the oscillatory peaks and matches the ground-truth ISI distribution, whereas a Hawkes baseline produces a noisy intensity estimate and a heavily biased inter-spike interval (ISI) tail; SurF thus captures rate structure invisible to classical parametric models (Figure [4](https://arxiv.org/html/2605.14069#A2.F4)).
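
One way to reproduce such synthetic trials is Ogata thinning; the simulation method below is our assumption, while the intensity $\lambda(t)=0.5+19.5((1+\cos(\pi t))/2)^{3}$ and the 200 trials of 6 s follow the text:

```python
# Generate oscillatory spike trains by thinning: propose candidates at the peak rate
# and keep each candidate with probability rate(t) / rate_max.
import numpy as np

def oscillatory_rate(t):
    return 0.5 + 19.5 * ((1.0 + np.cos(np.pi * t)) / 2.0) ** 3   # base 0.5 Hz, peak 20 Hz

def simulate_trial(rate_fn, T=6.0, rate_max=20.0, rng=None):
    rng = rng or np.random.default_rng()
    t, spikes = 0.0, []
    while True:
        t += rng.exponential(1.0 / rate_max)
        if t > T:
            return np.array(spikes)
        if rng.uniform() < rate_fn(t) / rate_max:
            spikes.append(t)

rng = np.random.default_rng(0)
trials = [simulate_trial(oscillatory_rate, rng=rng) for _ in range(200)]
print(np.mean([len(tr) for tr in trials]))   # average spike count per 6 s trial
```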

### 4.1 Next-Event Prediction

With the parameterization validated on synthetic data, we now ask whether a single jointly trained $\Lambda_{\theta}$ can replace baselines trained on individual datasets across six real-world corpora. Table [2](https://arxiv.org/html/2605.14069#S4.T2) reports next-event RMSE and type accuracy. We evaluate SurF in two regimes: *joint* (a single checkpoint trained on the union of all six training corpora, the $\{A{-}F\}_{train}$ datasets, and evaluated per test set, $\{A{-}F\}_{test}$; the target's training split is included in the training mix) and *finetuned* (the same checkpoint finetuned per dataset, e.g., finetune on $A_{train}$ and test on $A_{test}$). A strict leave-one-out zero-shot protocol is reported in Table [3](https://arxiv.org/html/2605.14069#S4.T3): e.g., we train on the $\{A{-}E\}_{train}$ datasets and test on $F_{test}$. Baselines are trained from scratch, e.g., trained on $A_{train}$ and tested on $A_{test}$. Figure [7](https://arxiv.org/html/2605.14069#A12.F7) compares the next-event RMSE and type accuracy of the finetuned versions of the three SurF variants with the DTPP, NHP, and NJDTPP baselines.

Table 2: Next-event prediction: time RMSE / type accuracy (%). Bold = best, underline = second-best. "–" = not reported. †DTPP conditions its mark prediction on the ground-truth next-event time, which biases its type accuracy upward. SurF achieves the best reported time RMSE on Earthquake, Retweet, and Taobao, and is within trial-level noise of the strongest baselines on the remaining three; type accuracy is competitive across the board. On Earthquake, SurF-GLQ (finetuned) improves ${\sim}10\%$ over TimesFM and DTPP and ${\sim}14\%$ over NJDTPP; all three finetuned variants beat every baseline, and the jointly trained SurF-CSB and SurF-GLQ checkpoints already do so without any per-dataset finetuning. On Retweet, SurF-CSB (finetuned) improves ${\sim}10\%$ over NJDTPP and ${\sim}12\%$ over DTPP; its jointly trained counterpart again leads all baselines. SurF-GLQ (finetuned) attains the best Taobao time RMSE, and SurF-MoE (finetuned) the top Taobao type accuracy among TPP methods. On Amazon and StackOverflow, SurF beats DTPP on time RMSE, though both gaps are comparable to $\sigma_{\mathrm{RMSE}}{\approx}0.058$ and should be read as effective ties; on Taxi the gap is within $1.1\%$. The best SurF variant outperforms the much larger foundation-model baseline TimesFM ($231$M parameters) on every reported dataset.

The three $\Lambda_{\theta}$ variants are complementary: MoE's exponential basis fits Amazon's fast-decaying dynamics; CSB's non-monotonic capacity matches bursty social cascades (Retweet, Earthquake); and the unconstrained GLQ excels on irregular intensity profiles (Taxi, Earthquake, Taobao). The jointly trained checkpoints serve six heterogeneous datasets from a single shared encoder, consistent with the in-corpus scope of the [Proposition](https://arxiv.org/html/2605.14069#Thmpropositionx4); out-of-corpus transfer is evaluated separately under leave-one-out in Table [3](https://arxiv.org/html/2605.14069#S4.T3). Fine-tuning further improves time RMSE on at least 5/6 datasets per variant (CSB on 6/6) and type accuracy on 6/6 for CSB and GLQ. Figure [2](https://arxiv.org/html/2605.14069#S4.F2) shows the Kolmogorov–Smirnov (KS) statistic, denoted $D$, plotted for the CDF of $\Delta z$. $D=\sup_{x}\left|G(x)-F(x)\right|$ measures the maximum distance between two CDFs $G$ and $F$; in this case, we let $G$ be the CDF of the true $\mathrm{Exp}(1)$ distribution while $F$ is the empirical distribution from a SurF variant approximating $\mathrm{Exp}(1)$. We see from Figure [2](https://arxiv.org/html/2605.14069#S4.F2) that the gains in type accuracy and time RMSE are accompanied by well-calibrated densities: the time-rescaled residuals $\Delta z_{i}=\Lambda_{\theta}(\tau_{i}\mid h_{i-1})$ should be i.i.d. $\mathrm{Exp}(1)$ under a perfectly specified model (TRT; [Brown et al., 2002](https://arxiv.org/html/2605.14069#bib.bib4)). All three SurF variants achieve $D\leq 0.1$ on Taxi and StackOverflow, with empirical CDFs tracking the theoretical curve almost exactly, while the remaining datasets show only modest deviation.
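
For reference, the KS statistic $D$ of Figure 2 can be computed directly from the rescaled residuals; in this sketch, synthetic $\mathrm{Exp}(1)$ draws stand in for residuals from a fitted model (an assumption for illustration):

```python
# KS statistic D = sup_x |G(x) - F(x)| with G the Exp(1) CDF and F the empirical CDF
# of the time-rescaled residuals Delta z_i.
import numpy as np

def ks_statistic_exp1(delta_z):
    x = np.sort(np.asarray(delta_z, dtype=float))
    n = x.size
    G = 1.0 - np.exp(-x)                    # Exp(1) CDF evaluated at the sorted residuals
    F_hi = np.arange(1, n + 1) / n          # empirical CDF just after each sorted point
    F_lo = np.arange(0, n) / n              # empirical CDF just before each sorted point
    return max(np.max(np.abs(F_hi - G)), np.max(np.abs(F_lo - G)))

rng = np.random.default_rng(0)
print(ks_statistic_exp1(rng.exponential(1.0, size=5000)))   # small D for well-calibrated residuals
```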

![Refer to caption](https://arxiv.org/html/2605.14069v1/figures/all_variants_ks_cdf_top3.png)

Figure 2: Time-rescaling goodness-of-fit across datasets and variants. Empirical CDF of the residuals $\Delta z_{i}=\Lambda(\tau_{i}\mid h_{i-1})$ on the test split. A perfectly calibrated model produces residuals that are i.i.d. $\mathrm{Exp}(1)$ (black dotted curve); a smaller KS statistic $D$ indicates better calibration. SurF achieves near-ideal calibration (Figure [8](https://arxiv.org/html/2605.14069#A12.F8) presents results for all six datasets).

**Goodness-of-fit via NLL.** Beyond predictive accuracy and KS calibration, we also report held-out negative log-likelihood (NLL), the canonical TPP goodness-of-fit score, in Appendix [C](https://arxiv.org/html/2605.14069#A3). The ranking among SurF variants tracks the calibration findings of Figure [2](https://arxiv.org/html/2605.14069#S4.F2): MoE achieves the lowest NLL on 4/6 datasets (Taxi, StackOverflow, Earthquake, Taobao), GLQ on the two with the heaviest tails (Retweet, Amazon), and finetuning improves NLL on 5/6 datasets per variant, confirming that the gains in RMSE and type accuracy are accompanied by tighter density estimates rather than point-prediction artifacts. Predictive accuracy and density calibration concur, but next-event metrics only test the forward map; we next probe whether the bijection's reverse direction holds up under iterated rollout.

### 4.2 Multi-Horizon Forecasting

To test whether stability survives iterated decoding (and whether the canonical target supports strict zero-shot transfer to held-out datasets), we generate the next $h\in\{1,\ldots,10\}$ events under three protocols (joint, finetuned, leave-one-out) and measure per-horizon RMSE and type accuracy.

![Refer to caption](https://arxiv.org/html/2605.14069v1/figures/comparison_finetuned_vs_zero_shot_rmse_and_type_acc_per_horizon_v2.png)

Figure 3: Per-horizon inter-arrival RMSE (top) and type accuracy (bottom) for finetuned (solid) and zero-shot (dashed) SurF variants over $1$–$10$ future events (Earthquake to $5$). Zero-shot type accuracy is omitted: the type classifier is dataset-specific and only active after fine-tuning.

Figure [3](https://arxiv.org/html/2605.14069#S4.F3) shows that iterated autoregressive decoding remains stable. Finetuned RMSE is nearly flat on Amazon and grows only modestly elsewhere; Retweet exhibits the largest compounding, but growth is monotone and sub-linear. Type accuracy stays within a few points of the one-step value on every dataset.

**Finetuned vs. DTPP.** SurF matches or beats DTPP, the strongest baseline, on time RMSE on all six datasets, with margins ranging from a tie on Taxi (where SurF additionally leads on type accuracy) to ${\sim}41\%$ on Taobao. On type accuracy, SurF leads on 5/6 datasets among non-† comparisons, losing only Retweet by a hair. GLQ wins outright on four datasets, ties DTPP on Taxi, and is beaten only on Taobao, where MoE wins.

**Zero-shot vs. baselines.** Under the strict leave-one-out protocol, the best SurF variant beats every zero-shot baseline on 5/6 datasets, losing only on StackOverflow (gaps are largest where baselines collapse). Across variants, GLQ transfers best on 3/6, CSB on 2/6, and MoE on Taobao; the two more flexible variants together account for 5/6 of the best transfers, indicating that capacity for non-monotonic and irregular intensity shapes generalizes even when the target is entirely unseen during pretraining.

**Joint training.** In the joint training setting (top block of Table [3](https://arxiv.org/html/2605.14069#S4.T3)), the MoE variant of SurF leads on time RMSE for 3/6 datasets, CSB on 2/6, and GLQ on 1/6.

Table 3: Multi-horizon forecasting: inter-event time RMSE / type accuracy (%). Bold = best within block, underline = second-best. "–" = not reported. Note: †DTPP conditions its mark prediction on the ground-truth next-event time.

**Ablations.** The preceding sections establish that SurF works; in this section we investigate *why*. We ablate three design choices of SurF: (i) the *conformer* block inserted before multi-head attention; (ii) the *loss schedule* mixing the SurF (MLE) loss with the forecast loss (*Equal*/*Staged*/*Hybrid*); and (iii) the *quadrature configuration* of SurF-GLQ, namely the Gauss–Legendre points $Q$ and log-intensity references $J$. Full tables and per-bin breakdowns are in Appendix [G](https://arxiv.org/html/2605.14069#A7). The Hybrid schedule used by full SurF sits near the Pareto frontier across all six benchmarks: never worst on any column, tied-best on Retweet time RMSE and Taobao type accuracy, and within noise of the column-best elsewhere. The Staged schedule collapses on Retweet (RMSE jumps from ${\sim}16$ to $22.3$) because dropping the SurF likelihood during finetuning discards the gradient signal that anchors long-tail inter-arrivals. The conformer block helps most on datasets with informative local temporal context (Taxi, Earthquake, Taobao) and is mildly harmful only when paired with the Equal schedule on Amazon and StackOverflow. Across a $100$-trial random hyperparameter sweep, increasing $Q{=}4{\to}8$ reduces median time RMSE by ${\sim}0.08$ and trial-to-trial variance by ${\sim}3\times$ ($\sigma_{\text{RMSE}}$: $0.170{\to}0.058$); going further to $Q{=}12$ yields no measurable gain (median difference $0.01$, well within $\sigma{\approx}0.05$, and best-trial accuracy actually drops $49.94{\to}49.49\%$). This validates the $Q{=}8$ default of Section [3.1](https://arxiv.org/html/2605.14069#S3.SS1): per-interval GLQ has already converged at $Q{=}8$.

Table 4: Cost on Taxi (H100, batch $512$, inference averaged over $20$ sequences). Train measures time per batch (ms); Infer. measures time per sequence (ms).

A similar saturation supports the $J{=}4$ default. Across all $100$ successful trials, time RMSE varies in $[8.725, 9.242]$ with $79\%$ of trials within $1\%$ of the best run; the top-$10$ configurations span a window of $0.011$ RMSE units (an order of magnitude smaller than the gap to the closest baseline in Table [2](https://arxiv.org/html/2605.14069#S4.T2)) across a mix of $Q$, $J$, conformer, and batch-size settings, indicating several near-optimal HP regions rather than a single fragile peak.

**Computational Efficiency.** The amortized cumulative intensity is designed to eliminate the numerical-quadrature bottleneck of prior neural TPPs. Table [4](https://arxiv.org/html/2605.14069#S4.T4) compares training and inference cost on Taxi. Relative to NJDTPP (the most directly comparable quadrature-based neural TPP), SurF-MoE and SurF-CSB require no numerical integration at training time, while SurF-GLQ uses a fixed $Q{=}8$ per-interval rule whose cost is $T$-independent and whose error does not enter the gradient in practice (Appendix [H.3](https://arxiv.org/html/2605.14069#A8.SS3)). This yields a ${\sim}9\times$ training speed-up for SurF-MoE and a ${\sim}2.6\times$ speed-up for SurF-GLQ. At inference, the gap is ${\sim}17\times$ for SurF-MoE and ${\sim}14\times$ for SurF-GLQ; SurF-MoE is also ${\sim}3.7\times$ faster than TimesFM at inference, while being ${\sim}125\times$ to ${\sim}165\times$ smaller in parameter count and delivering stronger accuracy (Table [2](https://arxiv.org/html/2605.14069#S4.T2)). Inference is dominated by the Transformer encoder pass; Newton inversion of $\Lambda_{\theta}$ converges in $5$–$8$ iterations and contributes under $1.2\%$ of wall-clock time on every benchmark (Appendix [E.1](https://arxiv.org/html/2605.14069#A5.SS1)).

## 5 Related Work

**Generative models for time series.** Foundation models have advanced rapidly in language ([Brown et al., 2020](https://arxiv.org/html/2605.14069#bib.bib106); [Touvron et al., 2023](https://arxiv.org/html/2605.14069#bib.bib107)), vision ([Radford et al., 2021](https://arxiv.org/html/2605.14069#bib.bib108); [Oquab et al., 2023](https://arxiv.org/html/2605.14069#bib.bib109)), and regularly sampled time series (TimesFM ([Das et al., 2024](https://arxiv.org/html/2605.14069#bib.bib110)), Chronos ([Ansari et al., 2024](https://arxiv.org/html/2605.14069#bib.bib111)), Lag-Llama ([Rasul et al., 2024](https://arxiv.org/html/2605.14069#bib.bib112)), Moirai ([Woo et al., 2024](https://arxiv.org/html/2605.14069#bib.bib115))). All assume fixed sampling rates and synchronized channels, an assumption that fails for asynchronous event streams where inter-arrival times span orders of magnitude and timing itself carries meaning. SurF targets this regime directly.

**Neural temporal point processes.** Neural TPPs ([Shchur et al., 2021](https://arxiv.org/html/2605.14069#bib.bib36); [Rezaei et al., 2021](https://arxiv.org/html/2605.14069#bib.bib124)) offer flexible event modeling but have not reached foundation-model scale. RNN-based methods (RMTPP ([Du et al., 2016](https://arxiv.org/html/2605.14069#bib.bib40); [Rezaei et al., 2018](https://arxiv.org/html/2605.14069#bib.bib121)), Neural Hawkes ([Mei and Eisner, 2017](https://arxiv.org/html/2605.14069#bib.bib48))) suffer from gradient pathologies and sequential bottlenecks; Transformer variants (SAHP ([Zhang et al., 2020](https://arxiv.org/html/2605.14069#bib.bib41)), THP ([Zuo et al., 2020](https://arxiv.org/html/2605.14069#bib.bib39)), and extensions ([Li and Sun, 2023](https://arxiv.org/html/2605.14069#bib.bib94); [Gu, 2021](https://arxiv.org/html/2605.14069#bib.bib33); [Xue et al., 2022](https://arxiv.org/html/2605.14069#bib.bib53))) parallelize training but still require numerical quadrature of $\int_{0}^{T}\lambda^{*}(s\mid\mathcal{H}_{s})\,ds$ inside the likelihood, at a cost that grows with the observation window. Intensity-free methods (IFTPP ([Shchur et al., 2019](https://arxiv.org/html/2605.14069#bib.bib23))) avoid the integral but discard the intensity representation that makes the TRT usable. No prior neural TPP has demonstrated zero-shot cross-dataset transfer.

**Cumulative-hazard neural TPPs.** Closest to SurF are FullyNN ([Omi et al., 2019a](https://arxiv.org/html/2605.14069#bib.bib28)) and the log-normal mixture of [Shchur et al. (2019)](https://arxiv.org/html/2605.14069#bib.bib23), which parameterize $\Lambda_{\theta}$ directly and recover $\lambda_{\theta}$ by differentiation. We differ on three fronts. *Conceptually,* prior work treats $\Lambda_{\theta}$ as a likelihood-tractability trick, whereas SurF reinterprets the TRT itself as a learnable bijection to a canonical exponential space, yielding the domain (dataset) invariance result in the [Proposition](https://arxiv.org/html/2605.14069#Thmpropositionx4). *Architecturally,* prior work uses single-dataset feedforward or recurrent encoders; SurF pairs the cumulative-intensity head with a Transformer encoder and multi-dataset pretraining, and offers three monotone parameterizations (MoE, CSB, GLQ) with distinct expressiveness–cost trade-offs (Appendix [H.3](https://arxiv.org/html/2605.14069#A8.SS3)). *Empirically,* SurF is the first cumulative-hazard TPP to report zero-shot cross-dataset transfer: the pretrained checkpoint outperforms per-dataset FullyNN on five of six benchmarks, with finetuned variants recovering the sixth (Taxi); see Appendix [H.2](https://arxiv.org/html/2605.14069#A8.SS2) for a full comparison. By turning the TRT into a learnable bijection, SurF yields a generative model that is exact in likelihood, bijective in sampling, and dataset-invariant in its latent space; this last property is the qualitative departure from prior work and the basis for zero-shot transfer toward foundation models for asynchronous event streams.

## 6 Conclusion

We introduced SurF, a generative model for irregularly sampled multivariate event streams that reinterprets the Time Rescaling Theorem as a learnable, invertible mapping between event sequences and canonical $\mathrm{Exp}(1)$ noise. SurF builds on the cumulative-hazard lineage of neural TPPs and reframes the TRT as a normalizing flow. Our contributions are an explicit bidirectional formulation grounded in a bijectivity statement that follows from the inverse function theorem (Theorem [2](https://arxiv.org/html/2605.14069#Thmtheorem2)), three monotone parameterizations with different expressiveness–cost trade-offs (Appendix [H.3](https://arxiv.org/html/2605.14069#A8.SS3)), and a Transformer-amortized encoder that makes multi-dataset pretraining practical. Under a strict leave-one-out protocol, a held-out SurF checkpoint beats every classical and neural-autoregressive baseline on 5/6 benchmarks and beats every baseline outright on Amazon and Earthquake, empirical evidence of partial cross-dataset transfer for a learned cumulative intensity.

**Limitations and future work.** The bijectivity of Theorem [2](https://arxiv.org/html/2605.14069#Thmtheorem2) requires a strictly positive intensity, enforced by a small floor $\lambda_{\text{floor}}$ that introduces a bounded compensator bias (Appendix [A.1](https://arxiv.org/html/2605.14069#A1.SS1)); this is negligible on our benchmarks but potentially material for processes with genuine dead-time (refractory neural firing, market-closure windows). The MoE parameterization admits closed-form integration but is restricted to monotonically decaying between-event intensity; CSB and GLQ lift this restriction at modest cost, and richer tractable parameterizations remain a natural next step. Finally, our six-dataset pretraining corpus is sufficient to demonstrate cross-dataset transfer but remains orders of magnitude smaller than language and vision corpora; scaling it is the key milestone for foundation-scale models of asynchronous event streams.

## References

- A. F. Ansari, L. Stella, C. Turkmen, X. Zhang, P. Mercado, H. Shen, O. Shchur, S. S. Rangapuram, S. P. Arango, S. Kapoor, et al. (2024). Chronos: learning the language of time series. arXiv preprint arXiv:2403.07815.
- E. Bacry, I. Mastromatteo, and J.-F. Muzy (2015). Hawkes processes in finance. Market Microstructure and Liquidity 1(01), pp. 1550005.
- E. N. Brown, R. Barbieri, V. Ventura, R. E. Kass, and L. M. Frank (2002). The time-rescaling theorem and its application to neural spike train data analysis. Neural Computation 14(2), pp. 325–346.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems 33, pp. 1877–1901.
- R. T. Chen, B. Amos, and M. Nickel (2021). Neural spatio-temporal point processes. In International Conference on Learning Representations.
- D. J. Daley and D. Vere-Jones (2003). An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods. Springer.
- A. Das, W. Kong, A. Leach, S. Mathur, R. Sen, and R. Yu (2024). A decoder-only foundation model for time-series forecasting. arXiv preprint arXiv:2310.10688.
- N. Du, H. Dai, R. Trivedi, U. Upadhyay, M. Gomez-Rodriguez, and L. Song (2016). Recurrent marked temporal point processes: embedding event history to vector. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. [Link](https://www.kdd.org/kdd2016/papers/files/rpp1081-duA.pdf)
- Y. Gu (2021). Attentive neural point processes for event forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 7592–7600.
- A. G. Hawkes (1971). Spectra of some self-exciting and mutually exciting point processes. Biometrika. [Link](https://pdfs.semanticscholar.org/c082/06b44dd1f0ea54bd073e4effaf2e4483169b.pdf)
- A. G. Hawkes (2018). Hawkes processes and their applications to finance: a review. Quantitative Finance 18(2), pp. 193–198.
- A. Joshi and M. Hauskrecht (2025). Still competitive: revisiting recurrent models for irregular time series prediction. arXiv preprint arXiv:2510.16161.
- J. Leskovec and A. Krevl (2014). SNAP Datasets: Stanford large network dataset collection. [Link](https://snap.stanford.edu/data/)
- Z. Li and M. Sun (2023). Sparse transformer Hawkes process for long event sequences. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 172–188.
- H. Mei and J. Eisner (2017). The neural Hawkes process: a neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems (NeurIPS). [Link](https://arxiv.org/abs/1612.09328)
- J. Ni (2018). Amazon review data. [Link](https://nijianmo.github.io/amazon/)
- J. Nocedal and S. J. Wright (2006). Numerical Optimization. Springer.
- T. Omi, K. Aihara, et al. (2019a). Fully neural network based model for general temporal point processes. Advances in Neural Information Processing Systems 32.
- T. Omi, N. Ueda, and K. Aihara (2019b). Fully neural network based model for general temporal point processes. In Advances in Neural Information Processing Systems (NeurIPS). [Link](https://arxiv.org/abs/1905.09690)
- M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. (2023). DINOv2: learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.
- J. W. Pillow, J. Shlens, L. Paninski, A. Sher, A. M. Litke, E. Chichilnisky, and E. P. Simoncelli (2008). Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature 454(7207), pp. 995–999.
- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763.
- K. Rasul, A. Ashok, A. R. Williams, A. Khorasani, G. Adamopoulos, R. Bhagwatkar, M. Biloš, H. Ghonia, N. V. Hassen, A. Schneider, et al. (2024). Lag-Llama: towards foundation models for probabilistic time series forecasting. arXiv preprint arXiv:2310.08278.
- M. R. Rezaei, A. K. Gillespie, J. A. Guidera, B. Nazari, S. Sadri, L. M. Frank, U. T. Eden, and A. Yousefi (2018). A comparison study of point-process filter and deep learning performance in estimating rat position using an ensemble of place cells. In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 4732–4735.
- M. R. Rezaei, A. E. Hadjinicolaou, S. S. Cash, U. T. Eden, and A. Yousefi (2022). Direct discriminative decoder models for analysis of high-dimensional dynamical neural data. Neural Computation 34(5), pp. 1100–1135.
- M. R. Rezaei, K. Arai, L. M. Frank, U. T. Eden, and A. Yousefi (2021). Real-time point process filter for multidimensional decoding problems using mixture models. Journal of Neuroscience Methods 348, pp. 109006.
- Y. Rubanova, R. T. Chen, and D. K. Duvenaud (2019). Latent ordinary differential equations for irregularly-sampled time series. Advances in Neural Information Processing Systems (NeurIPS) 32.
- O. Shchur, M. Biloš, and S. Günnemann (2019). Intensity-free learning of temporal point processes. arXiv preprint arXiv:1909.12127.
- O. Shchur, A. C. Türkmen, T. Januschowski, and S. Günnemann (2021). Neural temporal point processes: a review. arXiv preprint arXiv:2104.03528.
- S. N. Shukla and B. M. Marlin (2021). Multi-time attention networks for irregularly sampled time series. arXiv preprint arXiv:2101.10318.
- H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. (2023). LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- W. Truccolo, U. T. Eden, M. R. Fellows, J. P. Donoghue, and E. N. Brown (2005). A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects. Journal of Neurophysiology 93(2), pp. 1074–1089.
- A. W. van der Vaart (2000). Asymptotic Statistics. Cambridge University Press. ISBN 9780521784504. [Link](https://books.google.ca/books?id=UEuQEM5RjWgC)
- C. Whong (2014). FOILing NYC's taxi trip data. [Link](https://chriswhong.com/open-data/foil_nyc_taxi/)
- G\. Woo, C\. Liu, A\. Kumar, C\. Xiong, S\. Savarese, and D\. Sahoo \(2024\)Unified training of universal time series forecasting transformers\.InForty\-first International Conference on Machine Learning,Cited by:[§5](https://arxiv.org/html/2605.14069#S5.p1.1)\.
- S\. Xue, X\. Shi, Z\. Chu, Y\. Wang, H\. Hao, F\. Zhou, C\. Jiang, C\. Pan, J\. Y\. Zhang, Q\. Wen,et al\.\(2023\)Easytpp: towards open benchmarking temporal point processes\.arXiv preprint arXiv:2307\.08097\.Cited by:[Appendix C](https://arxiv.org/html/2605.14069#A3.SS0.SSS0.Px4.p1.1)\.
- S\. Xue, X\. Shi, J\. Y\. Zhang, and H\. Mei \(2024\)Decoupled marked temporal point process using neural ordinary differential equations\.InInternational Conference on Learning Representations,Cited by:[Appendix C](https://arxiv.org/html/2605.14069#A3.SS0.SSS0.Px4.p1.1),[Appendix C](https://arxiv.org/html/2605.14069#A3.p1.1),[Table 2](https://arxiv.org/html/2605.14069#S4.T2.8.6.6.7),[Table 3](https://arxiv.org/html/2605.14069#S4.T3.8.6.6.7)\.
- S\. Xue, X\. Shi, Y\. J\. Zhang, and H\. Mei \(2022\)HYPRO: a hybridly normalized probabilistic model for long\-horizon prediction of event sequences\.InAdvances in Neural Information Processing Systems \(NeurIPS\),External Links:[Link](https://arxiv.org/abs/2210.01753)Cited by:[1st item](https://arxiv.org/html/2605.14069#A9.I1.i1.p1.1),[§4](https://arxiv.org/html/2605.14069#S4.p1.4),[§5](https://arxiv.org/html/2605.14069#S5.p1.1)\.
- C\. Yang, H\. Mei, and J\. Eisner \(2022\)Transformer embeddings of irregularly spaced events and their participants\.InProceedings of the International Conference on Learning Representations \(ICLR\),External Links:[Link](https://arxiv.org/abs/2201.00044)Cited by:[Table 2](https://arxiv.org/html/2605.14069#S4.T2.8.6.12.6.1),[Table 3](https://arxiv.org/html/2605.14069#S4.T3.8.6.11.5.1)\.
- Q\. Zhang, A\. Lipani, O\. Kirnap, and E\. Yilmaz \(2020\)Self\-attentive Hawkes process\.InProceedings of the International Conference on Machine Learning \(ICML\),External Links:[Link](http://proceedings.mlr.press/v119/zhang20q/zhang20q.pdf)Cited by:[Appendix C](https://arxiv.org/html/2605.14069#A3.p1.1),[Table 1](https://arxiv.org/html/2605.14069#S3.T1.19.7.7.3),[Table 2](https://arxiv.org/html/2605.14069#S4.T2.8.6.11.5.1),[§5](https://arxiv.org/html/2605.14069#S5.p1.1)\.
- S\. Zhang, C\. Zhou, Y\. A\. Liu, P\. Zhang, X\. Lin, and Z\. Ma \(2024\)Neural jump\-diffusion temporal point processes\.InForty\-first International Conference on Machine Learning,Cited by:[Appendix C](https://arxiv.org/html/2605.14069#A3.SS0.SSS0.Px4.p1.1),[Table 2](https://arxiv.org/html/2605.14069#S4.T2.8.6.17.11.1)\.
- K\. Zhou, H\. Zha, and L\. Song \(2013\)Learning triggering kernels for multi\-dimensional hawkes processes\.InInternational conference on machine learning,pp\. 1301–1309\.Cited by:[3rd item](https://arxiv.org/html/2605.14069#A9.I1.i3.p1.1),[§4](https://arxiv.org/html/2605.14069#S4.p1.4)\.
- S\. Zuo, H\. Jiang, Z\. Li, T\. Zhao, and H\. Zha \(2020\)Transformer Hawkes process\.InInternational Conference on Machine Learning,pp\. 11692–11702\.External Links:[Link](https://arxiv.org/pdf/2002.09291.pdf)Cited by:[Table 1](https://arxiv.org/html/2605.14069#S3.T1.17.5.5.3),[Table 2](https://arxiv.org/html/2605.14069#S4.T2.8.6.13.7.1),[Table 3](https://arxiv.org/html/2605.14069#S4.T3.8.6.12.6.1),[§5](https://arxiv.org/html/2605.14069#S5.p1.1)\.

###### Contents

1. [1 Introduction](https://arxiv.org/html/2605.14069#S1)
2. [2 Background](https://arxiv.org/html/2605.14069#S2)
3. [3 SurF: Survival Flow for Temporal Point Processes](https://arxiv.org/html/2605.14069#S3)
   1. [3.1 Training Objective: Amortized SurF Loss](https://arxiv.org/html/2605.14069#S3.SS1)
   2. [3.2 Zero-Shot Capabilities Across Datasets](https://arxiv.org/html/2605.14069#S3.SS2)
4. [4 Experiments](https://arxiv.org/html/2605.14069#S4)
   1. [4.1 Next-Event Prediction](https://arxiv.org/html/2605.14069#S4.SS1)
   2. [4.2 Multi-Horizon Forecasting](https://arxiv.org/html/2605.14069#S4.SS2)
5. [5 Related Work](https://arxiv.org/html/2605.14069#S5)
6. [6 Conclusion](https://arxiv.org/html/2605.14069#S6)
7. [References](https://arxiv.org/html/2605.14069#bib)
8. [A Time Rescaling Theorem and Bijectivity Proofs](https://arxiv.org/html/2605.14069#A1)
   1. [A.1 Near-Zero Intensity Regimes and Likelihood Consistency](https://arxiv.org/html/2605.14069#A1.SS1)
   2. [A.2 Derivation of the Survival Function $S(t)$](https://arxiv.org/html/2605.14069#A1.SS2)
9. [B Likelihood Derivation via Change of Variables](https://arxiv.org/html/2605.14069#A2)
   1. [B.1 Jacobian Analysis](https://arxiv.org/html/2605.14069#A2.SS1)
   2. [B.2 Change of Variables Formula](https://arxiv.org/html/2605.14069#A2.SS2)
   3. [B.3 Observation Window Correction](https://arxiv.org/html/2605.14069#A2.SS3)
10. [C Held-Out Negative Log-Likelihood](https://arxiv.org/html/2605.14069#A3)
11. [D Amortized Event Intensity: Detailed Derivations](https://arxiv.org/html/2605.14069#A4)
    1. [D.1 SurF-CSB: Cumulative Softplus Basis](https://arxiv.org/html/2605.14069#A4.SS1)
    2. [D.2 SurF-GLQ: Gauss–Legendre Quadrature](https://arxiv.org/html/2605.14069#A4.SS2)
       1. [D.2.1 Gradient Computation under Gauss–Legendre Quadrature](https://arxiv.org/html/2605.14069#A4.SS2.SSS1)
    3. [D.3 SurF-MoE: Mixture-of-Exponentials (Closed-Form)](https://arxiv.org/html/2605.14069#A4.SS3)
    4. [D.4 Closed-Form Loss under MoE](https://arxiv.org/html/2605.14069#A4.SS4)
12. [E Sampling Algorithms](https://arxiv.org/html/2605.14069#A5)
    1. [E.1 Safeguarded Newton: Formal Convergence Guarantee](https://arxiv.org/html/2605.14069#A5.SS1)
13. [F Gradient Flow Analysis](https://arxiv.org/html/2605.14069#A6)
    1. [F.1 Gradient Transformation](https://arxiv.org/html/2605.14069#A6.SS1)
    2. [F.2 Gradient Norm Bounds](https://arxiv.org/html/2605.14069#A6.SS2)
    3. [F.3 Parameter Gradients: SurF-MoE](https://arxiv.org/html/2605.14069#A6.SS3)
    4. [F.4 Parameter Gradients: SurF-CSB](https://arxiv.org/html/2605.14069#A6.SS4)
    5. [F.5 $O(N)$ Gradient Norm Bound](https://arxiv.org/html/2605.14069#A6.SS5)
    6. [F.6 Comparison with RNN-Based TPPs](https://arxiv.org/html/2605.14069#A6.SS6)
14. [G Extended Hyperparameter Sensitivity Study](https://arxiv.org/html/2605.14069#A7)
15. [H Additional Theoretical Material](https://arxiv.org/html/2605.14069#A8)
    1. [H.1 Joint Modeling of Times and Marks](https://arxiv.org/html/2605.14069#A8.SS1)
    2. [H.2 Relationship to Prior Cumulative-Hazard Neural TPPs](https://arxiv.org/html/2605.14069#A8.SS2)
    3. [H.3 When to Use Which Variant, and the Role of Quadrature](https://arxiv.org/html/2605.14069#A8.SS3)
    4. [H.4 Shared Canonical Target](https://arxiv.org/html/2605.14069#A8.SS4)
16. [I Datasets](https://arxiv.org/html/2605.14069#A9)
17. [J Model Architecture and Hyperparameters](https://arxiv.org/html/2605.14069#A10)
    1. [J.1 History Encoding](https://arxiv.org/html/2605.14069#A10.SS1)
    2. [J.2 Intensity Network](https://arxiv.org/html/2605.14069#A10.SS2)
    3. [J.3 Cumulative-Intensity Head](https://arxiv.org/html/2605.14069#A10.SS3)
    4. [J.4 Type Prediction Head](https://arxiv.org/html/2605.14069#A10.SS4)
    5. [J.5 Hyperparameter Search](https://arxiv.org/html/2605.14069#A10.SS5)
18. [K Computational Complexity](https://arxiv.org/html/2605.14069#A11)
19. [L Additional Empirical Analysis](https://arxiv.org/html/2605.14069#A12)
    1. [L.1 Finetuned Models vs Top Baselines](https://arxiv.org/html/2605.14069#A12.SS1)
    2. [L.2 Forecast Trajectories](https://arxiv.org/html/2605.14069#A12.SS2)

## Appendix A Time Rescaling Theorem and Bijectivity Proofs

Consider a temporal point process defined on $\mathbb{R}^{+}$ generating event times $\{t_{1},t_{2},\ldots,t_{N}\}$ with a conditional intensity function $\lambda^{*}(t\mid\mathcal{H}_{t})>0$ for all $t\geq 0$, where $\mathcal{H}_{t}$ represents the history of events up to time $t$. The intensity function characterizes the instantaneous rate of event occurrence and fully describes the probabilistic structure of the process.

###### Theorem 1 (Time Rescaling Theorem, restated).

Let $\{t_{i}\}_{i=1}^{N}$ be a realization of a point process with conditional intensity function $\lambda^{*}(t\mid\mathcal{H}_{t})>0$. Define the cumulative intensity function $\Lambda^{*}(t)=\int_{0}^{t}\lambda^{*}(s\mid\mathcal{H}_{s})\,ds$ and the rescaled event times $z_{i}=\Lambda^{*}(t_{i})$. Then $\{\Delta z_{i}=z_{i}-z_{i-1}\}_{i=1}^{N}$ with $z_{0}=0$ are i.i.d. exponential random variables with unit rate.

###### Proof.

We show that the rescaled times form a homogeneous Poisson process with unit rate. The survival function of the next event time is

$$S(t)=\Pr(\text{no event in }[0,t]\mid\mathcal{H}_{0})=\exp\Bigl(-\int_{0}^{t}\lambda^{*}(s\mid\mathcal{H}_{s})\,ds\Bigr)=\exp(-\Lambda^{*}(t)), \quad (4)$$

giving density $f_{t}(t)=\lambda^{*}(t\mid\mathcal{H}_{t})\exp(-\Lambda^{*}(t))$.

Under the transformation $z=\Lambda^{*}(t)$ we have $dz/dt=\lambda^{*}(t\mid\mathcal{H}_{t})$, so the Jacobian is $|dt/dz|=1/\lambda^{*}((\Lambda^{*})^{-1}(z)\mid\mathcal{H}_{\cdot})$. Applying the change of variables,

$$f_{z}(z)=f_{t}\bigl((\Lambda^{*})^{-1}(z)\bigr)\,\Bigl|\tfrac{dt}{dz}\Bigr|=\lambda^{*}(\cdot)\exp(-z)\cdot\frac{1}{\lambda^{*}(\cdot)}=\exp(-z), \quad (5)$$

which is the density of a unit-rate exponential. Independence of inter-arrival times follows from the memoryless property of the exponential distribution and the preservation of the Markov property under the transformation. ∎
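As a sanity check on the statement (illustrative only, not part of the SurF pipeline), the following minimal sketch simulates an inhomogeneous Poisson process with an assumed known intensity $\lambda(t)=1.5+\sin t$ via Ogata thinning, rescales the event times by the closed-form cumulative intensity, and tests the rescaled inter-arrivals against $\mathrm{Exp}(1)$; the intensity, horizon, and use of `scipy.stats.kstest` are choices made for this example.

```python
# Illustrative sanity check of the Time Rescaling Theorem (not part of SurF):
# simulate an inhomogeneous Poisson process with a known intensity, rescale
# its event times by the cumulative intensity, and test the rescaled
# inter-arrival times against a unit-rate exponential.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

lam = lambda t: 1.5 + np.sin(t)              # intensity, bounded below by 0.5 > 0
Lam = lambda t: 1.5 * t + 1.0 - np.cos(t)    # closed-form cumulative intensity
lam_max, T = 2.5, 2000.0

# Ogata thinning: propose from a homogeneous Poisson(lam_max), accept w.p. lam(t)/lam_max.
t, events = 0.0, []
while True:
    t += rng.exponential(1.0 / lam_max)
    if t > T:
        break
    if rng.uniform() < lam(t) / lam_max:
        events.append(t)

# Rescale: z_i = Lam(t_i); the increments should be i.i.d. Exp(1).
dz = np.diff(Lam(np.asarray(events)), prepend=0.0)
print(stats.kstest(dz, "expon"))             # large p-value: consistent with Exp(1)
```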

The crucial aspect of this transformation is its bijectivity. The condition $\lambda^{*}(t\mid\mathcal{H}_{t})>0$ ensures that $\Lambda^{*}(t)$ is strictly increasing.

###### Corollary 1.

Let $\{t_{i}\}_{i=1}^{N}$ be a realization of a point process with conditional intensity function $\lambda_{0}(t\mid\mathcal{H}_{t})>0$. For any $t>0$, let $t^{\prime}=\max(\mathcal{H}_{t})$, and write $\lambda^{*}(\Delta t\mid\mathcal{H}_{t})=\lambda_{0}(t\mid\mathcal{H}_{t})$, where $\Delta t=t-t^{\prime}$. Define the cumulative intensity functions $\Lambda_{0}(t)=\int_{0}^{t}\lambda_{0}(s\mid\mathcal{H}_{s})\,ds$ and $\Lambda^{*}(\Delta t\mid\mathcal{H}_{t})=\int_{0}^{\Delta t}\lambda^{*}(s\mid\mathcal{H}_{t})\,ds$. Then the $\Lambda^{*}(\tau_{i}\mid\mathcal{H}_{t_{i}})$ are i.i.d. exponential random variables with unit rate, where $\tau_{i}=t_{i}-t_{i-1}$ and $\tau_{0}=0$.

###### Proof.

We know from Theorem [1](https://arxiv.org/html/2605.14069#Thmtheorem1a) that $\Lambda_{0}(t_{i})-\Lambda_{0}(t_{i-1})$ are i.i.d. $\mathrm{Exp}(1)$ random variables. Observe that

$$\Lambda_{0}(t_{i})-\Lambda_{0}(t_{i-1})=\int_{t_{i-1}}^{t_{i}}\lambda_{0}(s\mid\mathcal{H}_{s})\,ds \quad (6)$$
$$=\int_{0}^{\tau_{i}}\lambda_{0}(s+t_{i-1}\mid\mathcal{H}_{s+t_{i-1}})\,ds \quad (7)$$
$$=\int_{0}^{\tau_{i}}\lambda^{*}(s\mid\mathcal{H}_{t_{i-1}})\,ds \quad (8)$$
$$=\Lambda^{*}(\tau_{i}\mid\mathcal{H}_{t_{i}}). \quad (9)$$
∎

###### Lemma 1 (Invertibility and Regularity).

Assume $\lambda^{*}(\cdot\mid\mathcal{H}_{\cdot})$ is continuous and $\lambda^{*}(t\mid\mathcal{H}_{t})\geq\lambda_{\min}>0$ for all $t\in\mathbb{R}^{+}$. Then $\Lambda^{*}$ is a $C^{1}$ bijection from $\mathbb{R}^{+}$ to $\mathbb{R}^{+}$ with $C^{1}$ inverse $(\Lambda^{*})^{-1}$. If, in addition, $\lambda^{*}$ is $C^{k}$ for some $k\geq 0$, then $\Lambda^{*}$ and $(\Lambda^{*})^{-1}$ are $C^{k+1}$.

###### Proof.

By continuity of $\lambda^{*}$ and the fundamental theorem of calculus, $\Lambda^{*}$ is $C^{1}$ on $\mathbb{R}^{+}$ with $(\Lambda^{*})^{\prime}(t)=\lambda^{*}(t\mid\mathcal{H}_{t})$. The lower bound $(\Lambda^{*})^{\prime}(t)\geq\lambda_{\min}>0$ implies that $\Lambda^{*}$ is strictly increasing (hence injective) and that $\Lambda^{*}(t)\geq\lambda_{\min}t\to\infty$ as $t\to\infty$. Together with $\Lambda^{*}(0)=0$ and continuity, this gives surjectivity onto $\mathbb{R}^{+}$, so $\Lambda^{*}$ is a $C^{1}$ bijection. The Inverse Function Theorem applied at every point (where $(\Lambda^{*})^{\prime}\neq 0$) yields a $C^{1}$ inverse. If $\lambda^{*}$ is $C^{k}$, then $\Lambda^{*}$ is $C^{k+1}$ by repeated differentiation, and $(\Lambda^{*})^{-1}$ inherits this regularity via the Inverse Function Theorem. ∎

###### Proposition (Reverse-direction realization).

Under the hypotheses of Lemma [1](https://arxiv.org/html/2605.14069#Thmlemma1), let $\{z_{i}\}$ be a realization of a unit-rate Poisson process on $\mathbb{R}^{+}$. Then $\{t_{i}=(\Lambda^{*})^{-1}(z_{i})\}$ is a realization of the original point process with conditional intensity $\lambda^{*}$.

###### Proof.

Let $\mathcal{L}_{\lambda}$ denote the law of the original point process and $\mathcal{L}_{1}$ the law of the unit-rate Poisson process. Theorem [1](https://arxiv.org/html/2605.14069#Thmtheorem1a) states that the map $\Phi:\{t_{i}\}\mapsto\{\Lambda^{*}(t_{i})\}$ pushes $\mathcal{L}_{\lambda}$ forward to $\mathcal{L}_{1}$, i.e. $\Phi_{\ast}\mathcal{L}_{\lambda}=\mathcal{L}_{1}$. Lemma [1](https://arxiv.org/html/2605.14069#Thmlemma1) ensures that $\Phi$ is a measurable bijection on event sequences, so the inverse pushforward gives $(\Phi^{-1})_{\ast}\mathcal{L}_{1}=\mathcal{L}_{\lambda}$. Hence $\{(\Lambda^{*})^{-1}(z_{i})\}\sim\mathcal{L}_{\lambda}$. ∎

This bijectivity is fundamental to SurF's amortized loss: it guarantees that the learned mapping $\Lambda_{\theta}$ between temporal and exponential spaces is perfectly reversible, enabling exact likelihood computation via change of variables and lossless generative sampling via inversion.

### A.1 Near-Zero Intensity Regimes and Likelihood Consistency

The bijectivity in Theorem [2](https://arxiv.org/html/2605.14069#Thmtheorem2) requires $\lambda_{\min}>0$. Real processes (neural refractoriness, market-closure windows, after-hours IoT) can exhibit regimes where the true intensity is effectively zero. We handle this structurally rather than with ad hoc clamping.

##### Structural positivity.

All three variants are structurally lower-bounded. SurF-MoE uses $\gamma_{j}=\mathrm{softplus}(\hat{\gamma}_{j})+\varepsilon$ with $\varepsilon=10^{-4}$; SurF-CSB and SurF-GLQ apply an analogous $\mathrm{softplus}+\varepsilon$ to their positive parameters. The main architecture additionally adds $\lambda_{\text{base}}>0$ to the intensity (Appendix [J](https://arxiv.org/html/2605.14069#A10)), giving a hard floor $\lambda_{\text{floor}}$ that ensures the conditions of Theorem [2](https://arxiv.org/html/2605.14069#Thmtheorem2).

##### Bias from the floor.

Let $\lambda_{\theta}^{\dagger}=\lambda_{\theta}+\lambda_{\text{floor}}$. The induced compensator bias *per inter-event interval* of length $\tau_{i}$ is $\tau_{i}\cdot\lambda_{\text{floor}}$, so the bias accumulated over a sequence of $N$ events on $[0,T]$ is at most $T\cdot\lambda_{\text{floor}}$. For $\lambda_{\text{floor}}=10^{-4}$ and typical normalized $T\leq 10^{3}$, this is at most $0.1$ nats per sequence, below the NLL noise floor on all six benchmarks. As $\lambda_{\text{floor}}\to 0$ the likelihood is consistent; for finite $\lambda_{\text{floor}}$ the estimator is biased by an additive constant that does not affect the gradient direction.

##### Piecewise-zero regimes.

For processes with genuine dead-time, we recommend one of two strategies, neither of which affects the core theory.

- Learnable floor: treat $\lambda_{\text{floor}}$ as a parameter bounded below by $10^{-6}$, letting the model absorb the mass of near-zero regions without violating bijectivity.
- Explicit masking: for known refractory windows $[t_{i},t_{i}+\delta]$, set $\lambda_{\theta}\equiv\lambda_{\text{floor}}$ inside the window and integrate only outside. The Jacobian factorization is unchanged because masked regions contribute a constant to the compensator.

We use the learnable-floor strategy throughout and observe no practical difference from explicit masking on refractoriness-relevant datasets (Retweet, Earthquake).

##### Sampling.

Newton's method (Algorithm [3](https://arxiv.org/html/2605.14069#alg3)) is unaffected: positivity of $\lambda_{\theta}$ is exactly what guarantees global convergence, and the floor improves the numerical conditioning of the update $(F-z)/F^{\prime}$ at large $\Delta t$.

##### Learnable floor in practice.

When $\lambda_{\text{floor}}$ is treated as a trainable parameter (bounded below by $10^{-6}$), the optimizer converges to values in $[2\times 10^{-5},\,8\times 10^{-5}]$ across all six datasets, comfortably inside the flat plateau, and yields metrics indistinguishable from the fixed $10^{-4}$ default. This matches the recommendation above and shows that the estimator is insensitive to the exact choice within the plateau.

### A.2 Derivation of the Survival Function $S(t)$

The conditional intensity function can be defined in terms of the density $f_{T}(t)$ and the survival function $S(t)=\Pr(T>t\mid\mathcal{H}_{t})$ as

$$\lambda^{*}(t\mid\mathcal{H}_{t})=\frac{f_{T}(t\mid\mathcal{H}_{t})}{S(t\mid\mathcal{H}_{t})}. \quad (10)$$

Since $f_{T}(t\mid\mathcal{H}_{t})=-\frac{d}{dt}S(t\mid\mathcal{H}_{t})$,

$$\lambda^{*}(t\mid\mathcal{H}_{t})=\frac{-\frac{d}{dt}S(t\mid\mathcal{H}_{t})}{S(t\mid\mathcal{H}_{t})}=-\frac{d}{dt}\ln S(t\mid\mathcal{H}_{t}). \quad (11)$$

Integrating from $0$ to $t$ and using $S(0\mid\mathcal{H}_{0})=1$,

$$\Lambda^{*}(t)=\int_{0}^{t}\lambda^{*}(s\mid\mathcal{H}_{s})\,ds=-\ln S(t\mid\mathcal{H}_{t}), \quad (12)$$

so $S(t\mid\mathcal{H}_{t})=\exp(-\Lambda^{*}(t))$.

## Appendix B Likelihood Derivation via Change of Variables

We derive the likelihood for inter-arrival times $\bm{\tau}=(\tau_{1},\ldots,\tau_{N})$ through the TRT transformation to exponential space $\bm{\Delta z}=(\Delta z_{1},\ldots,\Delta z_{N})$:

$$\Delta z_{i}=\Lambda_{\theta}(t_{i})-\Lambda_{\theta}(t_{i-1})=\int_{t_{i-1}}^{t_{i}}\lambda_{\theta}(s\mid\mathcal{H}_{s})\,ds, \quad (13)$$

where $t_{i}=\sum_{j=1}^{i}\tau_{j}$ are cumulative event times.

### B.1 Jacobian Analysis

The Jacobian $J=\partial\bm{\Delta z}/\partial\bm{\tau}$ is lower triangular in general, but its determinant depends only on the diagonal entries.

##### General case (lower triangular).

Since $t_{i}=\sum_{j=1}^{i}\tau_{j}$, both limits of $\Delta z_{i}=\int_{t_{i-1}}^{t_{i}}\lambda_{\theta}\,ds$ depend on $\tau_{1},\ldots,\tau_{i}$, and the history $\mathcal{H}_{s}$ for $s\in(t_{i-1},t_{i}]$ may depend on all earlier event times through the encoding. Hence

$$\frac{\partial\Delta z_{i}}{\partial\tau_{j}}=\begin{cases}\neq 0 & j\leq i,\\ 0 & j>i,\end{cases} \quad (14)$$

making $J$ lower triangular. By Leibniz's rule applied to the upper limit of integration,

$$\frac{\partial\Delta z_{i}}{\partial\tau_{i}}=\lambda_{\theta}(t_{i}\mid\mathcal{H}_{t_{i}}), \quad (15)$$

regardless of how the history encoding is computed. The determinant of a lower-triangular matrix is the product of its diagonal entries, so

$$\det J=\prod_{i=1}^{N}\lambda_{\theta}(t_{i}\mid\mathcal{H}_{t_{i}}). \quad (16)$$

Since $\lambda_{\theta}>0$ everywhere, $\det J>0$ and the transformation is locally invertible.

##### Simplification under our architecture (diagonal).

In SurF, the history encoding $\mathbf{h}_{i-1}$ is computed once at event time $t_{i-1}$ and held constant throughout the interval $(t_{i-1},t_{i}]$. For $s\in(t_{i-1},t_{i}]$, the intensity depends only on the elapsed time $u=s-t_{i-1}$ and the frozen encoding $\mathbf{h}_{i-1}$, so

$$\Delta z_{i}=\int_{0}^{\tau_{i}}\lambda_{\theta}(u\mid\mathbf{h}_{i-1})\,du=\Lambda_{\theta}(\tau_{i}\mid\mathbf{h}_{i-1}). \quad (17)$$

This holds regardless of whether $\Lambda_{\theta}$ is computed in closed form (MoE, CSB) or by quadrature (GLQ). For $j<i$: $\tau_{i}$ is independent of $\tau_{j}$, and $\mathbf{h}_{i-1}$ is fixed with respect to $\tau_{j}$ (it is recomputed only at event boundaries), giving $\partial\Delta z_{i}/\partial\tau_{j}=0$. For $j>i$: neither $\tau_{i}$ nor $\mathbf{h}_{i-1}$ is affected, giving $\partial\Delta z_{i}/\partial\tau_{j}=0$. Therefore

$$J=\operatorname{diag}\bigl(\lambda_{\theta}(t_{1}\mid\mathcal{H}_{t_{1}}),\ldots,\lambda_{\theta}(t_{N}\mid\mathcal{H}_{t_{N}})\bigr), \quad (18)$$

and the determinant remains $\det J=\prod_{i=1}^{N}\lambda_{\theta}(t_{i}\mid\mathcal{H}_{t_{i}})$.
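The diagonal structure is easy to verify numerically. The sketch below is an illustrative assumption: it uses a toy MoE-style compensator with per-interval parameters standing in for the frozen encodings $\mathbf{h}_{i-1}$, differentiates $\bm{\Delta z}$ with respect to $\bm{\tau}$ via `torch.autograd.functional.jacobian`, and checks that the result equals $\operatorname{diag}(\lambda_{\theta}(\tau_{i}\mid\mathbf{h}_{i-1}))$.

```python
# Illustrative check that the Jacobian d(Δz)/d(τ) is diagonal when the history
# encoding is frozen per interval (toy MoE-style compensator, not the SurF model).
import torch

torch.manual_seed(0)
N, J = 5, 3
w = torch.rand(N, J) + 0.1   # stand-ins for w_j(h_{i-1}) per interval
g = torch.rand(N, J) + 0.1   # stand-ins for γ_j(h_{i-1}) per interval

def delta_z(tau):
    # Δz_i = Σ_j (w_j/γ_j)(1 - exp(-γ_j τ_i)), one compensator term per interval.
    return ((w / g) * (1.0 - torch.exp(-g * tau[:, None]))).sum(-1)

tau = torch.rand(N) + 0.05
jac = torch.autograd.functional.jacobian(delta_z, tau)    # (N, N) matrix

lam = (w * torch.exp(-g * tau[:, None])).sum(-1)          # λ_θ(τ_i | h_{i-1})
print(torch.allclose(jac, torch.diag(lam), atol=1e-6))    # True: J is diagonal
```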

### B.2 Change of Variables Formula

For the transformation $\bm{\tau}\mapsto\bm{\Delta z}$,

$$p_{\theta}(\bm{\tau})=p_{\bm{\Delta z}}(\bm{\Delta z}(\bm{\tau}))\cdot|\det J|, \quad (19)$$
$$\log p_{\theta}(\bm{\tau})=\log p_{\bm{\Delta z}}(\bm{\Delta z})+\log|\det J|. \quad (20)$$

Since $\Delta z_{i}\sim\mathrm{Exp}(1)$ independently, $\log p_{\bm{\Delta z}}(\bm{\Delta z})=-\sum_{i}\Delta z_{i}$, so

$$\log p_{\theta}(\bm{\tau})=-\sum_{i=1}^{N}\Lambda_{\theta}(\tau_{i}\mid\mathbf{h}_{i-1})+\sum_{i=1}^{N}\log\lambda_{\theta}(\tau_{i}\mid\mathbf{h}_{i-1}). \quad (21)$$

### B.3 Observation Window Correction

For data observed in $[0,T]$, we include the survival probability of the final gap $[t_{N},T]$:

$$\Pr(\text{no event in }(t_{N},T])=\exp\bigl(-\Lambda_{\theta}(\Delta T\mid\mathbf{h}_{N})\bigr). \quad (22)$$

The complete negative log-likelihood is

$$\boxed{\;\mathcal{L}_{\text{SurF}}(\theta)=\sum_{i=1}^{N}\Lambda_{\theta}(\tau_{i}\mid\mathbf{h}_{i-1})+\Lambda_{\theta}(\Delta T\mid\mathbf{h}_{N})-\sum_{i=1}^{N}\log\lambda_{\theta}(\tau_{i}\mid\mathbf{h}_{i-1}).\;} \quad (23)$$

This is the unified amortized loss. The first two terms are forward passes of $\Lambda_{\theta}$; the third is computed via automatic differentiation (or in closed form for MoE).
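For concreteness, a minimal sketch of how equation (23) can be assembled from generic callables is given below; the function names `Lambda` and `lam` and the tensor shapes are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the amortized SurF loss in Eq. (23); `Lambda` and `lam`
# are assumed callables for the cumulative and instantaneous intensity.
import torch

def surf_nll(tau, H, h_N, dT, Lambda, lam):
    """tau: (N,) inter-arrival times; H: (N, d) encodings h_0..h_{N-1};
    h_N: (d,) final encoding; dT: scalar tensor, the final gap T - t_N."""
    comp = Lambda(tau, H).sum()                    # Σ_i Λ_θ(τ_i | h_{i-1})
    surv = Lambda(dT[None], h_N[None]).sum()       # Λ_θ(ΔT | h_N), survival term
    log_int = torch.log(lam(tau, H)).sum()         # Σ_i log λ_θ(τ_i | h_{i-1})
    return comp + surv - log_int
```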

![Refer to caption](https://arxiv.org/html/2605.14069v1/figures/Figure1.png)

Figure 4: SurF recovers the true oscillatory intensity (A3) and ISI distribution (C3); Hawkes fails on both (A2, C2).

## Appendix C Held-Out Negative Log-Likelihood

Negative log-likelihood (NLL) on held-out sequences is the canonical goodness-of-fit metric for temporal point processes [Shchur et al., [2019](https://arxiv.org/html/2605.14069#bib.bib23); Xue et al., [2024](https://arxiv.org/html/2605.14069#bib.bib116); Zhang et al., [2020](https://arxiv.org/html/2605.14069#bib.bib41)]: it directly measures how much probability mass the learned density places on the observed events and, unlike RMSE or type accuracy, is sensitive to the entire predictive distribution rather than a single point estimate. Lower is better.

We report the per-event NLL of the joint inter-arrival/type density $f_{\theta}(\tau,k\mid h_{i-1})$, averaged over the test split, for all three SurF variants in both the pretrained (combined-corpus checkpoint, evaluated zero-shot per dataset) and finetuned regimes. Results are summarized in Table [5](https://arxiv.org/html/2605.14069#A3.T5).

Table 5: Held-out per-event NLL of SurF variants (lower is better). Bold = best within row across all six configurations. "Pre" denotes the unified pretrained checkpoint evaluated zero-shot; "Fine" denotes the same checkpoint adapted per dataset.

##### Variant comparison.

MoE attains the best NLL on four of six datasets (Taxi, StackOverflow, Earthquake, Taobao), consistent with its exponential-basis density being well matched to the fast-decaying, near-Markovian dynamics dominant in these benchmarks. GLQ wins on the two datasets with the heaviest right tails, Amazon (Pre, $-1.61$) and Retweet (Fine, $+2.55$), where unconstrained log-intensity flexibility pays off in the tail. CSB is never the column-best on NLL alone, despite its competitive RMSE in Table [2](https://arxiv.org/html/2605.14069#S4.T2); this reflects a known phenomenon in density-based TPP modeling [Shchur et al., [2019](https://arxiv.org/html/2605.14069#bib.bib23)], where models with smoother (monotone-bounded) intensity surfaces can match point-prediction quality while assigning slightly less peaked mass near the mode.

##### Effect of finetuning.

Finetuning improves NLL relative to the pretrained checkpoint on 5/6 datasets for every variant, the only exception being Taobao for MoE (a marginal $0.013$ regression, well within the variance reported in Table [3](https://arxiv.org/html/2605.14069#S4.T3)). Average NLL reductions from finetuning are small but consistent ($\sim 0.05$–$0.08$ nats/event), indicating that the unified pretrained density is already close to the dataset-specific optimum in distributional terms; this matches the dataset-invariance claim (Proposition [Proposition](https://arxiv.org/html/2605.14069#Thmpropositionx4)) that representations in the canonical $\mathrm{Exp}(1)$ space transfer with minimal residual mismatch.

##### NLL vs. predictive metrics.

The NLL ranking is broadly consistent with the time-rescaling KS statistics of Figure [2](https://arxiv.org/html/2605.14069#S4.F2): variants that calibrate better also assign higher held-out density. It is, however, not a strict monotone function of next-event RMSE; e.g., on Taobao, GLQ (Fine) attains the best one-step time RMSE (Table [2](https://arxiv.org/html/2605.14069#S4.T2)) while MoE (Pre) attains the best NLL, because RMSE measures only the conditional mean of $\tau$ whereas NLL measures the full conditional density. Reporting both is therefore informative, and we include NLL primarily as a density-quality complement to the predictive evaluation in the main text.

##### Comparison with prior work.

We do not include baseline NLLs in Table [5](https://arxiv.org/html/2605.14069#A3.T5) because published TPP papers do not normalize NLL consistently: values may be aggregated per event or per sequence, may include or exclude the mark log-probability, and may report log-likelihood (higher is better) versus negative log-likelihood (lower is better) without a common reference scale, making numerical cross-paper comparison unreliable. The EasyTPP benchmark [Xue et al., [2023](https://arxiv.org/html/2605.14069#bib.bib122)] reports log-likelihood for MHP/RMTPP/NHP/SAHP/THP/AttNHP/IFTPP on Taxi, Retweet, StackOverflow, Taobao, and Amazon under a common protocol; DTPP [Xue et al., [2024](https://arxiv.org/html/2605.14069#bib.bib116)] reports held-out average log-likelihood on the same five datasets in its Figure 1; and NJDTPP [Zhang et al., [2024](https://arxiv.org/html/2605.14069#bib.bib123)] reports NLL values that are not directly comparable due to its distinct CIF parameterization. We defer dataset-by-dataset numerical comparison against these baselines to a future controlled-protocol study, and instead anchor SurF's goodness-of-fit claims on (i) the KS calibration in Figure [2](https://arxiv.org/html/2605.14069#S4.F2), which is normalization-free, and (ii) the within-method NLL ablation across variants reported here.

## Appendix D Amortized Event Intensity: Detailed Derivations

We provide detailed derivations for each of the three monotone parameterizations of $\Lambda_{\theta}$.

### D.1 SurF-CSB: Cumulative Softplus Basis

The cumulative intensity is parameterized as

$$\Lambda_{\theta}(\Delta t\mid\mathbf{h})=\sum_{m=1}^{M}\alpha_{m}(\mathbf{h})\bigl[\log(1+e^{\beta_{m}(\mathbf{h})\Delta t+\delta_{m}(\mathbf{h})})-\log(1+e^{\delta_{m}(\mathbf{h})})\bigr], \quad (24)$$

with $\alpha_{m}=\mathrm{softplus}(\hat{\alpha}_{m}(\mathbf{h}))>0$ and $\beta_{m}=\mathrm{softplus}(\hat{\beta}_{m}(\mathbf{h}))>0$.

##### Boundary condition.

At $\Delta t=0$: $\Lambda_{\theta}(0\mid\mathbf{h})=\sum_{m}\alpha_{m}[\log(1+e^{\delta_{m}})-\log(1+e^{\delta_{m}})]=0$. ✓

##### Monotonicity.

The derivative is

$$\lambda_{\theta}(\Delta t\mid\mathbf{h})=\frac{\partial\Lambda_{\theta}}{\partial(\Delta t)}=\sum_{m=1}^{M}\alpha_{m}\beta_{m}\,\sigma(\beta_{m}\Delta t+\delta_{m}), \quad (25)$$

where $\sigma$ is the sigmoid. Since $\alpha_{m},\beta_{m}>0$ and $\sigma\in(0,1)$, we have $\lambda_{\theta}>0$ everywhere, guaranteeing strict monotonicity.

##### Expressiveness.

Each term is a scaled sigmoid centered at $-\delta_{m}/\beta_{m}$ with steepness $\beta_{m}$. By varying $(\alpha_{m},\beta_{m},\delta_{m})$, the CSB family represents monotonic decay ($\delta_{m}\gg 0$), monotonic increase ($\delta_{m}\ll 0$), non-monotonic (hump-shaped) profiles (mixing signs of $\delta_{m}$), and multimodal patterns (large $M$ with diverse parameters).

###### Proposition (Universal approximation).

The CSB parameterization with $M\to\infty$ components can approximate any continuous positive function $\lambda^{*}:\mathbb{R}^{+}\to\mathbb{R}^{+}$ uniformly on compact sets. Equivalently, any continuous monotone increasing $\Lambda^{*}:\mathbb{R}^{+}\to\mathbb{R}^{+}$ with $\Lambda^{*}(0)=0$ can be approximated to arbitrary precision.

###### Proof sketch.

The derivative $\lambda_{\theta}=\sum_{m}\alpha_{m}\beta_{m}\,\sigma(\beta_{m}\Delta t+\delta_{m})$ is a mixture of scaled sigmoids. As $\beta_{m}\to\infty$, each term approaches $\alpha_{m}\beta_{m}\mathbf{1}[\Delta t>-\delta_{m}/\beta_{m}]$, a step function. Sums of step functions with arbitrary heights and positions are dense in $L^{1}([0,T])$ for any compact $[0,T]$, so the derivative can approximate any integrable positive function. The cumulative intensity, being its integral, approximates any monotone function with the correct boundary condition. ∎

##### Gradient computation.

The compensator terms $\Lambda_{\theta}(\tau_{i}\mid\mathbf{h}_{i-1})$ are forward passes; the log-intensity terms are evaluated via the standard numerically stable log-sum-exp identity. All gradients flow through the network outputs $(\hat{\alpha}_{m},\hat{\beta}_{m},\hat{\delta}_{m})$ via standard backpropagation.
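A minimal sketch of the CSB head for a single interval is given below, assuming raw network outputs `alpha_hat`, `beta_hat`, `delta` of shape $(M,)$; the use of `softplus` for the stable evaluation of $\log(1+e^{x})$ mirrors equations (24)–(25).

```python
# Sketch of the SurF-CSB cumulative intensity (Eq. 24) and its derivative
# (Eq. 25) for one interval, assuming raw head outputs of shape (M,).
import torch
import torch.nn.functional as F

def csb_cumulative(dt, alpha_hat, beta_hat, delta):
    alpha, beta = F.softplus(alpha_hat), F.softplus(beta_hat)   # α_m, β_m > 0
    # softplus(x) = log(1 + e^x) gives a numerically stable evaluation of Eq. (24).
    return (alpha * (F.softplus(beta * dt + delta) - F.softplus(delta))).sum()

def csb_intensity(dt, alpha_hat, beta_hat, delta):
    alpha, beta = F.softplus(alpha_hat), F.softplus(beta_hat)
    return (alpha * beta * torch.sigmoid(beta * dt + delta)).sum()   # Eq. (25)
```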

### D.2 SurF-GLQ: Gauss–Legendre Quadrature

The intensity is a free-form positive MLP,

$$\tilde{\lambda}_{\theta}(\Delta t\mid\mathbf{h})=\mathrm{softplus}(f_{\theta}(\Delta t,\mathbf{h})), \quad (26)$$

with $f_{\theta}:\mathbb{R}\times\mathbb{R}^{d}\to\mathbb{R}$ unconstrained. The cumulative intensity is approximated by $Q$-point Gauss–Legendre quadrature on $[0,\tau_{i}]$:

$$\Lambda_{\theta}(\tau_{i}\mid\mathbf{h})\approx\frac{\tau_{i}}{2}\sum_{q=1}^{Q}w_{q}\,\tilde{\lambda}_{\theta}\Bigl(\tfrac{\tau_{i}}{2}(1+x_{q}),\;\mathbf{h}\Bigr), \quad (27)$$

where $\{(x_{q},w_{q})\}_{q=1}^{Q}$ are the $Q$-point nodes and weights on $[-1,1]$.
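A minimal sketch of equation (27) is given below; `lam_tilde` is an assumed callable for the positive intensity $\tilde{\lambda}_{\theta}$, and the nodes and weights come from `numpy.polynomial.legendre.leggauss`.

```python
# Sketch of the SurF-GLQ compensator in Eq. (27); `lam_tilde(u, h)` is an
# assumed callable for the positive intensity, and the nodes/weights are
# fixed constants (no gradient flows through them).
import numpy as np
import torch

Q = 8
_x, _w = np.polynomial.legendre.leggauss(Q)
NODES = torch.tensor(_x, dtype=torch.float32)     # x_q on [-1, 1]
WEIGHTS = torch.tensor(_w, dtype=torch.float32)   # Gauss weights w_q

def glq_cumulative(tau, h, lam_tilde):
    u = 0.5 * tau * (1.0 + NODES)                 # map nodes to [0, tau]
    return 0.5 * tau * (WEIGHTS * lam_tilde(u, h)).sum()
```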

##### Quadrature error bound.

The $Q$-point Gauss–Legendre error on $[0,\tau]$ is

$$\bigl|\Lambda_{\theta}(\tau)-\Lambda_{\theta}^{(Q)}(\tau)\bigr|\leq\frac{(Q!)^{4}\,\tau^{2Q+1}}{(2Q+1)[(2Q)!]^{3}}\sup_{u\in[0,\tau]}\Bigl|\tfrac{d^{2Q}}{du^{2Q}}\tilde{\lambda}_{\theta}(u)\Bigr|. \quad (28)$$

For smooth intensities (guaranteed by the softplus-MLP composition), $Q=8$ gives error below $10^{-12}$ on typical inter-arrival time scales. The quadrature cost $O(Q)$ is *per interval*, not per sequence.

##### Comparison with likelihood quadrature.

Prior work (e.g. THP, SAHP) uses numerical quadrature to approximate $\int_{0}^{T}\lambda_{\theta}(s)\,ds$ *at the likelihood level*. This requires $O(N\cdot Q^{\prime})$ evaluations, where $Q^{\prime}$ must grow with $T$, and the quadrature error enters the gradient. In SurF-GLQ the quadrature approximates a fixed-length integral per interval with $Q$ fixed at $8$; the dominant cost remains the Transformer encoding.

#### D.2.1 Gradient Computation under Gauss–Legendre Quadrature

Because SurF-GLQ is the only variant whose likelihood involves a numerical step, we give an explicit derivation of how parameter gradients are computed, and state the precise condition under which quadrature error does *not* contaminate the training signal.

##### Quadrature constants are fixed.

The nodes $\{x_{q}\}_{q=1}^{Q}$ and weights $\{w_{q}\}_{q=1}^{Q}$ in equation [27](https://arxiv.org/html/2605.14069#A4.E27) are the $Q$ roots of the Legendre polynomial $P_{Q}$ on $[-1,1]$ and the associated Gauss weights: *data- and parameter-independent constants*, computed once at initialization and stored. In our implementation they are registered as non-trainable buffers, so autodiff treats them as stop-gradient. We do not backpropagate through the nodes or weights, and we use no differentiable-quadrature surrogate.

##### Gradient flows only through the integrand.

With this convention, differentiating equation [27](https://arxiv.org/html/2605.14069#A4.E27) with respect to $\theta$, with $\tau$, $\mathbf{h}$, and $(x_{q},w_{q})$ all independent of $\theta$, gives

$$\nabla_{\theta}\,\Lambda_{\theta}^{(Q)}(\tau\mid\mathbf{h})=\frac{\tau}{2}\sum_{q=1}^{Q}w_{q}\,\nabla_{\theta}\,\tilde{\lambda}_{\theta}\Bigl(\tfrac{\tau}{2}(1+x_{q}),\;\mathbf{h}\Bigr). \quad (29)$$

The gradient of the quadrature is the quadrature of the gradient, with the same nodes and weights. No derivatives of $x_{q}$ or $w_{q}$ appear.

##### Quadrature error does not enter the gradient.

Subtracting the exact parameter-gradient integral from equation [29](https://arxiv.org/html/2605.14069#A4.E29), the residual $\nabla_{\theta}\Lambda_{\theta}^{(Q)}-\nabla_{\theta}\Lambda_{\theta}$ is *itself* a Gauss–Legendre quadrature error, applied to the integrand $g:=\nabla_{\theta}\tilde{\lambda}_{\theta}$. The standard Gauss–Legendre error bound on $[0,\tau]$ for a $C^{2Q}$ integrand gives

$$\Bigl|\int_{0}^{\tau}g(u)\,du-\tfrac{\tau}{2}\sum_{q}w_{q}\,g\bigl(\tfrac{\tau}{2}(1+x_{q})\bigr)\Bigr|\leq\frac{(Q!)^{4}\,\tau^{2Q+1}}{(2Q+1)[(2Q)!]^{3}}\sup_{u\in[0,\tau]}|g^{(2Q)}(u)|. \quad (30)$$

Because $\tilde{\lambda}_{\theta}=\mathrm{softplus}(f_{\theta})$ with $f_{\theta}$ a smooth MLP, $g$ and its higher derivatives exist and are bounded on the finite interval $[0,\tau]$. The claim "error does not enter the gradient" therefore holds under two conditions: (i) the integrand is sufficiently smooth (guaranteed by our architecture), and (ii) $Q$ is large enough that equation [30](https://arxiv.org/html/2605.14069#A4.E30) is below the floating-point noise floor. We *verify* (ii) empirically rather than bounding $\sup|g^{(2Q)}|$ analytically, which would require a Lipschitz analysis of the trained MLP. Specifically, we recompute both the loss and its parameter gradients at $Q=64$ on $10{,}000$ held-out intervals from each benchmark and compare to $Q=8$: the per-interval loss differs by less than $3\times 10^{-13}$ and the per-parameter gradient by less than $8\times 10^{-13}$ (both near float32 machine epsilon) across all six datasets. Training with $Q=16$ instead of $Q=8$ produces no measurable change in test NLL.
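The check described above can be reproduced in miniature as follows; the toy softplus-MLP stands in for $f_{\theta}$ and the single interval length is an illustrative assumption, so the exact error magnitudes will differ from the paper's numbers.

```python
# Miniature version of the Q = 8 vs. Q = 64 check, with a toy softplus-MLP
# standing in for f_θ; both the compensator value and its parameter gradient
# are compared at the two quadrature orders.
import numpy as np
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
).double()

def compensator(tau, Q):
    x, w = np.polynomial.legendre.leggauss(Q)            # fixed nodes/weights
    x, w = torch.as_tensor(x), torch.as_tensor(w)
    u = 0.5 * tau * (1.0 + x)
    lam = torch.nn.functional.softplus(net(u[:, None])).squeeze(-1)
    return 0.5 * tau * (w * lam).sum()

tau = torch.tensor(1.3, dtype=torch.float64)
vals, grads = [], []
for Q in (8, 64):
    net.zero_grad()
    value = compensator(tau, Q)
    value.backward()
    vals.append(value.item())
    grads.append(torch.cat([p.grad.flatten() for p in net.parameters()]).clone())

print(abs(vals[0] - vals[1]))                     # value gap: ≈ 0
print((grads[0] - grads[1]).abs().max().item())   # gradient gap: ≈ 0
```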

##### MLE consistency.

$\mathcal{L}_{\text{SurF}}^{(Q)}$ is a perturbation of $\mathcal{L}_{\text{SurF}}$ uniform in $\theta$ on any compact parameter set, with perturbation size bounded by equation [30](https://arxiv.org/html/2605.14069#A4.E30) applied to $\tilde{\lambda}_{\theta}$ itself. Standard M-estimator theory [van der Vaart, [2000](https://arxiv.org/html/2605.14069#bib.bib120), Thm. 5.7] gives $\theta^{(Q)}\to\theta^{\star}$ as $Q\to\infty$ under the usual identifiability and equicontinuity conditions. At $Q=8$ and the empirical error magnitudes reported above, the quadrature-induced perturbation is orders of magnitude smaller than the statistical variation across training seeds; for SurF-MoE and SurF-CSB the question is moot since $\Lambda_{\theta}$ is closed-form.

##### What would break the claim.

Two scenarios re-introduce quadrature error into the gradient: (a) *learning* the quadrature nodes/weights, which would require $\partial x_{q}/\partial\theta$ and $\partial w_{q}/\partial\theta$ (we do not do this); (b) using $Q$ too small for the smoothness of $f_{\theta}$, e.g. replacing the softplus-MLP with a non-smooth layer. Neither applies to SurF as described.

### D.3 SurF-MoE: Mixture-of-Exponentials (Closed-Form)

![Refer to caption](https://arxiv.org/html/2605.14069v1/figures/surf_loss.png)

Figure 5: SurF-MoE on a toy example. The learned intensity $\lambda_{\theta}(t)=\sum_{j}w_{j}e^{-\gamma_{j}\Delta t}$ resets at each event as a sum of fast-decay (coral) and slow-decay (teal) components. The shaded region between $t_{2}$ and $t_{3}$ is $\Delta z_{3}$, one compensator term; dots mark $\log\lambda_{\theta}(t_{i})$. The vertical line at $t_{N}$ separates the event horizon from the survival horizon $[t_{N},T]$.

The between-event intensity is modeled as

$$\lambda_{\theta}(\Delta t\mid\mathbf{h})=\sum_{j=1}^{J}w_{j}\,\exp(-\gamma_{j}\Delta t), \quad (31)$$

with $w_{j}=\mathrm{softplus}(\hat{w}_{j}(\mathbf{h}))>0$ and $\gamma_{j}=\mathrm{softplus}(\hat{\gamma}_{j}(\mathbf{h}))+\epsilon>0$. The cumulative intensity admits an exact closed form,

$$\Lambda_{\theta}^{\text{MoE}}(\Delta t\mid\mathbf{h})=\sum_{j=1}^{J}\frac{w_{j}}{\gamma_{j}}\bigl(1-e^{-\gamma_{j}\Delta t}\bigr). \quad (32)$$
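A minimal sketch of the MoE head for one interval, assuming raw head outputs `w_hat` and `g_hat` of shape $(J,)$:

```python
# Sketch of the SurF-MoE intensity (Eq. 31) and closed-form compensator
# (Eq. 32) for one interval, assuming raw head outputs of shape (J,).
import torch
import torch.nn.functional as F

EPS = 1e-4  # the structural floor ε added to γ_j

def moe_params(w_hat, g_hat):
    return F.softplus(w_hat), F.softplus(g_hat) + EPS        # w_j > 0, γ_j ≥ ε

def moe_intensity(dt, w_hat, g_hat):
    w, g = moe_params(w_hat, g_hat)
    return (w * torch.exp(-g * dt)).sum()                     # Σ_j w_j e^{-γ_j Δt}

def moe_cumulative(dt, w_hat, g_hat):
    w, g = moe_params(w_hat, g_hat)
    return (w / g * (1.0 - torch.exp(-g * dt))).sum()         # Σ_j (w_j/γ_j)(1 - e^{-γ_j Δt})
```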
##### Relationship to the other variants.

MoE is the least expressive of the three amortized parameterizations: each component $(w_{j}/\gamma_{j})(1-e^{-\gamma_{j}\Delta t})$ is monotone *and concave* in $\Delta t$, so $\Lambda_{\theta}^{\text{MoE}}$ cannot represent non-concave cumulative-intensity shapes, and $\lambda_{\theta}^{\text{MoE}}$ is always a sum of strictly decreasing components. CSB lifts this restriction via its sigmoid basis; GLQ lifts it completely via an unconstrained positive MLP.

##### Expressiveness limitations of MoE.

The MoE variant cannot represent: (i) non-monotonic intensities (e.g. delayed excitation), (ii) multimodal within-interval patterns, or (iii) constant or increasing intensities between events. CSB and GLQ handle all three.

### D.4 Closed-Form Loss under MoE

Under MoE all terms of $\mathcal{L}_{\text{SurF}}$ admit closed forms. The log-intensity at $t_{i}$ uses $\lambda_{\theta}(t_{i}\mid\mathcal{H}_{t_{i}})=\sum_{j}w_{j}^{(i)}e^{-\gamma_{j}^{(i)}\tau_{i}}$ and is evaluated via log-sum-exp. The survival term is

$$\Lambda_{\theta}^{\text{MoE}}(\Delta T\mid\mathbf{h}_{N})=\sum_{j=1}^{J}\frac{w_{j}^{(N)}}{\gamma_{j}^{(N)}}\bigl(1-e^{-\gamma_{j}^{(N)}\Delta T}\bigr). \quad (33)$$

Substituting into the unified loss gives

$$\boxed{\;\mathcal{L}_{\text{SurF}}^{\text{MoE}}(\theta)=\sum_{i=1}^{N}\sum_{j=1}^{J}\frac{w_{j}^{(i)}}{\gamma_{j}^{(i)}}\bigl(1-e^{-\gamma_{j}^{(i)}\tau_{i}}\bigr)+\sum_{j=1}^{J}\frac{w_{j}^{(N)}}{\gamma_{j}^{(N)}}\bigl(1-e^{-\gamma_{j}^{(N)}\Delta T}\bigr)-\sum_{i=1}^{N}\log\Bigl(\sum_{j=1}^{J}w_{j}^{(i)}e^{-\gamma_{j}^{(i)}\tau_{i}}\Bigr).\;} \quad (34)$$
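A minimal sketch of equation (34) for one sequence is given below; the stacked parameter tensors `W` and `G` of shape $(N{+}1,J)$ (rows $0,\ldots,N-1$ for the event intervals, row $N$ for the survival term) are an assumed packaging, and the log-intensity term uses the log-sum-exp identity mentioned above.

```python
# Sketch of the closed-form MoE loss in Eq. (34) for one sequence; W and G
# stack the per-interval parameters (rows 0..N-1) and the survival-term
# parameters (row N), which is an assumed packaging for illustration.
import torch

def surf_moe_nll(tau, dT, W, G):
    """tau: (N,) inter-arrival times; dT: scalar final gap; W, G: (N+1, J)."""
    comp = (W[:-1] / G[:-1] * (1.0 - torch.exp(-G[:-1] * tau[:, None]))).sum()
    surv = (W[-1] / G[-1] * (1.0 - torch.exp(-G[-1] * dT))).sum()
    # log λ_θ(τ_i) = logsumexp_j ( log w_j^{(i)} - γ_j^{(i)} τ_i ), the stable form.
    log_int = torch.logsumexp(torch.log(W[:-1]) - G[:-1] * tau[:, None], dim=-1).sum()
    return comp + surv - log_int
```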
##### Comparison with piecewise-linear.

Under a piecewise-linear parameterization $\lambda_{\theta}(s)=\lambda_{i-1}+\alpha_{i-1}(s-t_{i-1})$, the integral is $\lambda_{i-1}\tau_{i}+\tfrac{1}{2}\alpha_{i-1}\tau_{i}^{2}$. While also closed-form, this parameterization has three problems at foundation-model scale: (i) positivity requires clamping $\max(\lambda_{\min},\cdot)$, creating non-smooth gradients; (ii) the intensity is monotonic within each interval, preventing multimodal behavior; and (iii) $\log\lambda_{i}$ is unstable when $\lambda_{i}$ is near zero. MoE avoids all three issues structurally; CSB and GLQ additionally avoid the monotone-decay limitation.

Algorithm 1 SurF Training (Unified Amortized Framework)

1: Input: training sequences $\{\mathcal{T}^{(k)}\}_{k=1}^{K}$ with observation windows $\{T^{(k)}\}_{k=1}^{K}$, learning rate $\eta$, iterations $M$, type weight $\beta_{\text{type}}$, variant $\in\{\text{MoE},\text{CSB},\text{GLQ}\}$, learned initial-history embedding $\mathbf{h}_{0}$
2: Initialize parameters $\theta$
3: for iteration $m=1$ to $M$ do
4:   Sample mini-batch $\mathcal{B}$ from training sequences
5:   $\mathcal{L}_{\text{batch}}\leftarrow 0$
6:   for each sequence $\mathcal{T}=\{t_{1},\ldots,t_{N}\}$ with window $T$, set $t_{0}:=0$ do
7:     Encode: $\{\mathbf{h}_{i}\}_{i=1}^{N}\leftarrow\text{Transformer}_{\theta}(\{t_{i},k_{i}\}_{i=1}^{N})$
8:     for $i=1$ to $N$ do
9:       $\tau_{i}\leftarrow t_{i}-t_{i-1}$  {$\tau_{1}=t_{1}$ since $t_{0}=0$}
10:      $\Delta z_{i}\leftarrow\Lambda_{\theta}(\tau_{i}\mid\mathbf{h}_{i-1})$  {forward pass; $\mathbf{h}_{0}$ used for $i=1$}
11:      $\lambda_{i}\leftarrow\lambda_{\theta}(\tau_{i}\mid\mathbf{h}_{i-1})$  {autodiff or closed-form}
12:    end for
13:    $\Delta T\leftarrow T-t_{N}$
14:    $\mathcal{L}_{\text{MLE}}\leftarrow\sum_{i=1}^{N}\Delta z_{i}+\Lambda_{\theta}(\Delta T\mid\mathbf{h}_{N})-\sum_{i=1}^{N}\log\lambda_{i}$
15:    $\{\bm{\ell}_{i}\}_{i=0}^{N-1}\leftarrow\text{TypeNet}_{\theta}(\{\mathbf{h}_{i}\}_{i=0}^{N-1})$
16:    $\mathcal{L}_{\text{type}}\leftarrow\sum_{i=1}^{N}\text{CrossEntropy}(\bm{\ell}_{i-1},k_{i})$
17:    $\mathcal{L}_{\text{batch}}\leftarrow\mathcal{L}_{\text{batch}}+\mathcal{L}_{\text{MLE}}+\beta_{\text{type}}\mathcal{L}_{\text{type}}+\beta_{\text{fcst}}\mathcal{L}_{\text{fcst}}$
18:  end for
19:  $\theta\leftarrow\theta-\eta\nabla_{\theta}\mathcal{L}_{\text{batch}}$
20: end for

## Appendix E Sampling Algorithms

Algorithm 2 SurF Sampling via Inverse Time Rescaling (Safeguarded Newton, General)

1: Input: trained parameters $\theta$, horizon $T$, initial type $k_{1}$, variant $\in\{\text{MoE},\text{CSB},\text{GLQ}\}$, tolerance $\epsilon_{\text{tol}}=10^{-4}$, max iterations $M$, floor $\epsilon=10^{-6}$
2: Initialize: $t\leftarrow 0$, $\mathcal{T}\leftarrow\emptyset$, $\mathcal{H}_{t}\leftarrow\emptyset$
3: while $t<T$ do
4:   $\Delta z\sim\text{Exponential}(1)$
5:   $\mathbf{h}\leftarrow\text{Transformer}_{\theta}(\mathcal{H}_{t})$
6:   Initialize bracket: $a\leftarrow\epsilon$, $b\leftarrow 2\Delta z/\lambda_{\theta}(0\mid\mathbf{h})$
7:   while $\Lambda_{\theta}(b\mid\mathbf{h})<\Delta z$ do
8:     $b\leftarrow 2b$
9:   end while
10:  $\Delta t\leftarrow(a+b)/2$
11:  for $n=1$ to $M$ do
12:    $F\leftarrow\Lambda_{\theta}(\Delta t\mid\mathbf{h})$; $F^{\prime}\leftarrow\lambda_{\theta}(\Delta t\mid\mathbf{h})$
13:    if $|F-\Delta z|<\epsilon_{\text{tol}}$ then
14:      break
15:    end if
16:    $\Delta t_{\text{N}}\leftarrow\Delta t-(F-\Delta z)/\max(F^{\prime},\epsilon)$
17:    if $\Delta t_{\text{N}}\in(a,b)$ then
18:      $\Delta t\leftarrow\Delta t_{\text{N}}$
19:    else
20:      $\Delta t\leftarrow(a+b)/2$  {bisection fallback}
21:    end if
22:    if $\Lambda_{\theta}(\Delta t\mid\mathbf{h})<\Delta z$ then
23:      $a\leftarrow\Delta t$
24:    else
25:      $b\leftarrow\Delta t$
26:    end if
27:  end for
28:  $t_{\text{new}}\leftarrow t+\Delta t$
29:  if $t_{\text{new}}>T$ then
30:    break
31:  end if
32:  $k_{\text{new}}\sim\text{Categorical}(\text{softmax}(\text{TypeNet}_{\theta}(\mathbf{h})))$
33:  $\mathcal{T}\leftarrow\mathcal{T}\cup\{(t_{\text{new}},k_{\text{new}})\}$; $\mathcal{H}_{t}\leftarrow\mathcal{H}_{t}\cup\{(t_{\text{new}},k_{\text{new}})\}$; $t\leftarrow t_{\text{new}}$
34: end while
35: Return $\mathcal{T}$

Algorithm 3 SurF-MoE Sampling (Safeguarded Newton, Closed-Form Specialization)

1: Input: trained parameters $\theta$, horizon $T$, initial type $k_{1}$, tolerance $\epsilon_{\text{tol}}=10^{-4}$, max iterations $M=8$, floor $\epsilon=10^{-6}$
2: Initialize: $t\leftarrow 0$, $\mathcal{T}\leftarrow\emptyset$, $\mathcal{H}_{t}\leftarrow\emptyset$
3: while $t<T$ do
4:   if $\mathcal{H}_{t}=\emptyset$ then
5:     use the initial embedding for type $k_{1}$
6:   else
7:     $\mathbf{h}\leftarrow\text{Transformer}_{\theta}(\mathcal{H}_{t})$
8:   end if
9:   $\{w_{j}\}\leftarrow\text{softplus}(\hat{w}_{j}(\mathbf{h}))$, $\{\gamma_{j}\}\leftarrow\text{softplus}(\hat{\gamma}_{j}(\mathbf{h}))+\epsilon$
10:  $U\sim\text{Uniform}(0,1)$; $z\leftarrow-\log U$
11:  Initialize bracket: $a\leftarrow\epsilon$, $b\leftarrow 2z/\sum_{j}w_{j}$
12:  while $\sum_{j}(w_{j}/\gamma_{j})(1-e^{-\gamma_{j}b})<z$ do
13:    $b\leftarrow 2b$  {expand upper bracket; terminates by Prop. [Proposition](https://arxiv.org/html/2605.14069#Thmpropositionx3)}
14:  end while
15:  $\Delta t\leftarrow(a+b)/2$  {initial guess}
16:  for $n=1$ to $M$ do
17:    $F\leftarrow\sum_{j}(w_{j}/\gamma_{j})(1-e^{-\gamma_{j}\Delta t})$; $F^{\prime}\leftarrow\sum_{j}w_{j}\,e^{-\gamma_{j}\Delta t}$
18:    if $|F-z|<\epsilon_{\text{tol}}$ then
19:      break  {converged}
20:    end if
21:    $\Delta t_{\text{N}}\leftarrow\Delta t-(F-z)/\max(F^{\prime},\epsilon)$  {candidate Newton step; under the hypotheses of Prop. [Proposition](https://arxiv.org/html/2605.14069#Thmpropositionx3), $F^{\prime}=\lambda_{\theta}\geq\lambda_{\text{floor}}>0$, so the $\max(\cdot,\epsilon)$ is purely a numerical guard against floating-point underflow at very large $\Delta t$, not a theoretical requirement}
22:    if $\Delta t_{\text{N}}\in(a,b)$ then
23:      $\Delta t\leftarrow\Delta t_{\text{N}}$  {accept Newton}
24:    else
25:      $\Delta t\leftarrow(a+b)/2$  {bisection fallback}
26:    end if
27:    $F_{\text{new}}\leftarrow\sum_{j}(w_{j}/\gamma_{j})(1-e^{-\gamma_{j}\Delta t})$
28:    if $F_{\text{new}}<z$ then
29:      $a\leftarrow\Delta t$
30:    else
31:      $b\leftarrow\Delta t$
32:    end if  {tighten bracket; the sign of $F_{\text{new}}-z$ preserves the invariant}
33:  end for
34:  $t_{\text{new}}\leftarrow t+\Delta t$
35:  if $t_{\text{new}}>T$ then
36:    break
37:  end if
38:  $k_{\text{new}}\sim\text{Categorical}(\text{softmax}(\text{TypeNet}_{\theta}(\mathbf{h})))$
39:  $\mathcal{T}\leftarrow\mathcal{T}\cup\{(t_{\text{new}},k_{\text{new}})\}$; $\mathcal{H}_{t}\leftarrow\mathcal{H}_{t}\cup\{(t_{\text{new}},k_{\text{new}})\}$; $t\leftarrow t_{\text{new}}$
40: end while
41: Return $\mathcal{T}$

### E.1 Safeguarded Newton: Formal Convergence Guarantee

The sampling step of Algorithm [3](https://arxiv.org/html/2605.14069#alg3) requires solving $\Lambda_{\theta}(\Delta t \mid \mathbf{h}) = z$ for $\Delta t > 0$ given $z \sim \mathrm{Exp}(1)$. Pure Newton's method on this univariate equation need not converge globally: curvature of $\Lambda_{\theta}$ can cause overshoot outside a neighborhood of the root. We therefore run a safeguarded variant (Newton update when the iterate lies inside the bracket, bisection otherwise), for which the following precise guarantee holds.

###### Proposition (Global convergence of safeguarded Newton).

Let $\Lambda_{\theta}(\cdot \mid \mathbf{h}) : \mathbb{R}^{+} \to \mathbb{R}^{+}$ be $C^{2}$ and satisfy: (i) $\Lambda_{\theta}(0 \mid \mathbf{h}) = 0$, (ii) $\lambda_{\theta}(\Delta t \mid \mathbf{h}) \geq \lambda_{\text{floor}} > 0$ for all $\Delta t \geq 0$, and (iii) $\Lambda_{\theta}(\Delta t \mid \mathbf{h}) \to \infty$ as $\Delta t \to \infty$. For any $z > 0$, let $[a_0, b_0]$ be a bracket with $\Lambda_{\theta}(a_0 \mid \mathbf{h}) \leq z \leq \Lambda_{\theta}(b_0 \mid \mathbf{h})$. Then the safeguarded Newton iteration (Newton update when the iterate lies in the current bracket, bisection otherwise, with the bracket tightened each step) converges to the unique root $\Delta t^{\star}$. Moreover:

- • Global (at least linear) rate. Bisection halves the bracket on any step where the Newton update is rejected, giving $|\Delta t^{(n)} - \Delta t^{\star}| \leq 2^{-n_B}(b_0 - a_0)$, where $n_B$ is the number of bisection steps.
- • Local quadratic rate. Once the iterate enters a neighborhood of $\Delta t^{\star}$ on which the Newton–Kantorovich condition $|\lambda'_{\theta}| / \lambda_{\theta}^{2} \leq L$ with $L\,|z - \Lambda_{\theta}(\Delta t^{(n)})| / \lambda_{\theta}(\Delta t^{(n)}) < \tfrac{1}{2}$ holds, Newton steps are accepted and the iteration is quadratically convergent [Nocedal and Wright, [2006](https://arxiv.org/html/2605.14069#bib.bib119), Thm. 11.5].

###### Sketch\.

Existence and uniqueness of $\Delta t^{\star}$ follow from Theorem [2](https://arxiv.org/html/2605.14069#Thmtheorem2) under the stated hypotheses. Bisection convergence is standard. The Newton step is accepted iff it lies inside the current bracket, which remains a valid bracket throughout by construction. The local quadratic rate is the standard Newton–Kantorovich theorem applied to the smooth univariate residual $r(\Delta t) = \Lambda_{\theta}(\Delta t \mid \mathbf{h}) - z$; hypothesis (ii) ensures $r' = \lambda_{\theta}$ is bounded away from zero, so the required Lipschitz constant on $r'/\lambda_{\theta}$ is finite on any compact subinterval. ∎

##### Bracket initialization.

We set $a_0 = \epsilon = 10^{-6}$ and $b_0 = 2z / \lambda_{\theta}(0 \mid \mathbf{h})$ (twice the constant-intensity estimate). If $\Lambda_{\theta}(b_0 \mid \mathbf{h}) < z$, we double $b_0$ until the upper bracket is reached; termination is guaranteed in at most $\lceil \log_2(z / (b_0\, \lambda_{\text{floor}})) \rceil$ doublings, since hypothesis (ii) implies $\Lambda_{\theta}(b) \geq b\, \lambda_{\text{floor}}$.
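To make the procedure concrete, the following is a minimal NumPy sketch of the safeguarded Newton inversion for the MoE compensator. It is written in our own notation rather than taken from any released code, and the small `lam_floor` term is our addition so that hypothesis (ii) holds and a root exists for every $z$.

```python
import numpy as np

def invert_moe_compensator(w, gamma, z, lam_floor=1e-6, tol=1e-4, max_iter=8, eps=1e-6):
    """Solve Lambda(dt) = z for dt > 0 with bracketed (safeguarded) Newton steps,
    where Lambda(dt) = sum_j (w_j/gamma_j)(1 - exp(-gamma_j dt)) + lam_floor * dt."""
    Lam = lambda dt: np.sum((w / gamma) * (1.0 - np.exp(-gamma * dt))) + lam_floor * dt
    lam = lambda dt: np.sum(w * np.exp(-gamma * dt)) + lam_floor   # derivative = intensity

    a, b = eps, 2.0 * z / np.sum(w)        # initial bracket: twice the constant-intensity estimate
    while Lam(b) < z:                      # expand until the root is bracketed
        b *= 2.0
    dt = 0.5 * (a + b)                     # initial guess
    for _ in range(max_iter):
        F = Lam(dt)
        if abs(F - z) < tol:               # converged
            break
        dt_newton = dt - (F - z) / max(lam(dt), eps)             # candidate Newton step
        dt = dt_newton if a < dt_newton < b else 0.5 * (a + b)   # bisection fallback
        if Lam(dt) < z:                    # tighten the bracket
            a = dt
        else:
            b = dt
    return dt

# Draw one unit-rate exponential noise value and invert it through a toy 3-component MoE.
rng = np.random.default_rng(0)
w = np.array([0.5, 1.0, 2.0])
gamma = np.array([0.1, 1.0, 10.0])
z = rng.exponential(1.0)
dt = invert_moe_compensator(w, gamma, z)
print(dt, np.sum((w / gamma) * (1 - np.exp(-gamma * dt))) + 1e-6 * dt)  # Lambda(dt) ~= z
```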

##### When pure Newton would fail.

Without the bracket safeguard, Newton can overshoot in regimes where $\lambda_{\theta}$ changes rapidly relative to its magnitude; concretely, when $|\lambda'_{\theta}(\Delta t^{(0)})|\,|z| / \lambda_{\theta}(\Delta t^{(0)})^{2} \gtrsim 1$. This arises in practice at the tail of a short-timescale MoE component shortly after an event; the safeguard catches exactly these cases.

##### Empirical iteration counts.

Across all six benchmarks, $M = 8$ iterations suffice to reach tolerance $10^{-4}$ in more than 99.9% of inversions; bisection is triggered only on inputs near numerical overflow of $e^{-\gamma_j \Delta t}$, where it activates automatically and yields linear convergence on the final iteration. The total wall-clock fraction spent in Newton during sampling is under 1.2% of Transformer encoder time on every dataset, consistent with the cost accounting in Appendix [K](https://arxiv.org/html/2605.14069#A11).

## Appendix F Gradient Flow Analysis

SurF transforms the optimization landscape through the time rescaling, providing natural gradient conditioning. This section analyzes how gradients transform between temporal and exponential space, derives $O(N)$ parameter-gradient bounds for all three variants, and contrasts with the $O(\sigma^{N})$ scaling of RNN-based TPPs.

### F.1 Gradient Transformation

Under SurF's diagonal Jacobian (Appendix [B](https://arxiv.org/html/2605.14069#A2)), $\partial \Delta z_i / \partial \tau_j = \lambda_{\theta}(\tau_i \mid \mathbf{h}_{i-1})\, \delta_{ij}$, so

$$\frac{\partial \mathcal{L}_{\text{SurF}}}{\partial \tau_i} = \lambda_{\theta}(\tau_i \mid \mathbf{h}_{i-1}) \cdot \frac{\partial \mathcal{L}_{\text{SurF}}}{\partial \Delta z_i}, \qquad \frac{\partial \mathcal{L}_{\text{SurF}}}{\partial \Delta z_i} = \frac{1}{\lambda_{\theta}(\tau_i \mid \mathbf{h}_{i-1})} \frac{\partial \mathcal{L}_{\text{SurF}}}{\partial \tau_i}. \tag{35}$$
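As a sanity check, the diagonal structure of this Jacobian can be verified directly with automatic differentiation. The sketch below uses a toy MoE compensator, with `w` and `gamma` as illustrative stand-ins for the per-interval head outputs; it is not the training code.

```python
import torch

# Autograd check of the diagonal Jacobian  d(dz_i)/d(tau_j) = lambda_theta(tau_i) * delta_ij
# behind Eq. (35), using a toy mixture-of-exponentials compensator.
torch.manual_seed(0)
N, J = 5, 3
w = torch.rand(N, J) + 0.1        # per-interval mixture weights
gamma = torch.rand(N, J) + 0.1    # per-interval mixture rates
tau = torch.rand(N)               # inter-event times

def compensator(t):
    # dz_i = Lambda_theta(tau_i | h_{i-1}) for the mixture-of-exponentials parameterization
    return ((w / gamma) * (1 - torch.exp(-gamma * t[:, None]))).sum(-1)

lam = (w * torch.exp(-gamma * tau[:, None])).sum(-1)         # lambda_theta(tau_i | h_{i-1})
jac = torch.autograd.functional.jacobian(compensator, tau)   # (N, N) Jacobian
print(torch.allclose(jac, torch.diag(lam)))                  # True: the Jacobian is diag(lambda_theta)
```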

### F.2 Gradient Norm Bounds

For all three variants, $\lambda_{\theta} > 0$ by construction. Let $\lambda_{\max}, \lambda_{\min}$ denote the supremum and infimum of $\lambda_{\theta}$ over the dataset. For MoE,

$$0 < \lambda_{\theta} \leq \sum_j w_j = \lambda_{\max}, \qquad \lambda_{\min} = \sum_j w_j e^{-\gamma_j \tau_{\max}} > 0. \tag{36}$$

For CSB,

$$0 < \lambda_{\theta} \leq \sum_m \alpha_m \beta_m = \lambda_{\max}^{\text{CSB}}, \tag{37}$$

with $\lambda_{\min}^{\text{CSB}}$ strictly positive (its exact minimum depends on the learned $\delta_m$ but can be bounded below from $\sum_m \alpha_m \beta_m \sigma(\delta_m)$). The gradient norms satisfy

$$\frac{1}{\lambda_{\max}} \Bigl\| \frac{\partial \mathcal{L}_{\text{SurF}}}{\partial \bm{\tau}} \Bigr\|_2 \leq \Bigl\| \frac{\partial \mathcal{L}_{\text{SurF}}}{\partial \bm{\Delta z}} \Bigr\|_2 \leq \frac{1}{\lambda_{\min}} \Bigl\| \frac{\partial \mathcal{L}_{\text{SurF}}}{\partial \bm{\tau}} \Bigr\|_2. \tag{38}$$

The condition number $\kappa = \lambda_{\max}/\lambda_{\min}$ depends on the learned parameters; for CSB it tends to be smaller than for MoE because sigmoid components can maintain non-vanishing intensity at large $\Delta t$.

### F.3 Parameter Gradients: SurF-MoE

For the integral term,

$$\nabla_{\theta} \sum_{i=1}^{N} \Delta z_i = \sum_{i=1}^{N} \sum_{j=1}^{J} \nabla_{\theta}\Bigl[\frac{w_j^{(i)}}{\gamma_j^{(i)}}\bigl(1 - e^{-\gamma_j^{(i)} \tau_i}\bigr)\Bigr]. \tag{39}$$

Gradients flow through the softplus activations:

$$\frac{\partial}{\partial \hat{w}_j} \frac{w_j}{\gamma_j}\bigl(1 - e^{-\gamma_j \tau}\bigr) = \frac{\sigma'(\hat{w}_j)}{\gamma_j}\bigl(1 - e^{-\gamma_j \tau}\bigr), \tag{40}$$

$$\frac{\partial}{\partial \hat{\gamma}_j} \frac{w_j}{\gamma_j}\bigl(1 - e^{-\gamma_j \tau}\bigr) = \sigma'(\hat{\gamma}_j) \cdot w_j \Bigl[\frac{\tau e^{-\gamma_j \tau}}{\gamma_j} - \frac{1 - e^{-\gamma_j \tau}}{\gamma_j^{2}}\Bigr], \tag{41}$$

where $\sigma'$ is the derivative of softplus (i.e. the sigmoid). For the log-intensity term,

$$\nabla_{\theta} \log \lambda = \frac{\sum_j (\nabla_{\theta} w_j)\, e^{-\gamma_j \tau} - \sum_j w_j \tau (\nabla_{\theta} \gamma_j)\, e^{-\gamma_j \tau}}{\sum_j w_j e^{-\gamma_j \tau}}. \tag{42}$$

The denominator equals $\lambda_{\theta} > 0$, so the gradient is always well-defined.
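These closed forms can be cross-checked against automatic differentiation. A minimal sketch, with `w_hat` and `gamma_hat` standing in for the raw head outputs $\hat{w}_j(\mathbf{h})$ and $\hat{\gamma}_j(\mathbf{h})$:

```python
import torch
import torch.nn.functional as F

# Verify the closed-form MoE gradient of the integral term, Eq. (40), with autograd.
torch.manual_seed(0)
J = 4
w_hat = torch.randn(J, requires_grad=True)
gamma_hat = torch.randn(J, requires_grad=True)
tau = torch.tensor(0.7)

w = F.softplus(w_hat)                    # w_j = softplus(w_hat_j)
gamma = F.softplus(gamma_hat) + 1e-6     # gamma_j = softplus(gamma_hat_j) + eps
dz = ((w / gamma) * (1 - torch.exp(-gamma * tau))).sum()   # integral term dz = Lambda_theta(tau)

(grad_w_hat,) = torch.autograd.grad(dz, w_hat, retain_graph=True)
manual = (torch.sigmoid(w_hat) / gamma * (1 - torch.exp(-gamma * tau))).detach()
print(torch.allclose(grad_w_hat, manual))   # True: autograd reproduces Eq. (40)
```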

### F.4 Parameter Gradients: SurF-CSB

$$\frac{\partial \Lambda_{\theta}^{\text{CSB}}}{\partial \hat{\alpha}_m} = \sigma'(\hat{\alpha}_m)\bigl[\log(1 + e^{\beta_m \Delta t + \delta_m}) - \log(1 + e^{\delta_m})\bigr], \tag{43}$$

$$\frac{\partial \Lambda_{\theta}^{\text{CSB}}}{\partial \hat{\beta}_m} = \sigma'(\hat{\beta}_m) \cdot \alpha_m \cdot \Delta t \cdot \sigma(\beta_m \Delta t + \delta_m), \tag{44}$$

$$\frac{\partial \Lambda_{\theta}^{\text{CSB}}}{\partial \hat{\delta}_m} = \alpha_m\bigl[\sigma(\beta_m \Delta t + \delta_m) - \sigma(\delta_m)\bigr]. \tag{45}$$

All expressions are smooth and bounded. The log-intensity gradient is

$$\nabla_{\theta} \log \lambda_{\theta}^{\text{CSB}} = \frac{\sum_m \nabla_{\theta}\bigl[\alpha_m \beta_m \sigma(\beta_m \Delta t + \delta_m)\bigr]}{\sum_m \alpha_m \beta_m \sigma(\beta_m \Delta t + \delta_m)}, \tag{46}$$

with denominator $\lambda_{\theta}^{\text{CSB}} > 0$.

### F.5 $O(N)$ Gradient Norm Bound

For a neural parameterization with Lipschitz constant $L$, $\|\nabla_{\theta} \lambda_{\theta}\| \leq L$, and

$$\|\nabla_{\theta} \mathcal{L}_{\text{SurF}}\| \leq \sum_{i=1}^{N} \|\nabla_{\theta} \Lambda_{\theta}(\tau_i)\| + \|\nabla_{\theta} \Lambda_{\theta}(\Delta T)\| + \sum_{i=1}^{N} \frac{\|\nabla_{\theta} \lambda_{\theta}(\tau_i)\|}{\lambda_{\theta}(\tau_i)}. \tag{47}$$

For sequences with average rate $\bar{\lambda}$ and $T \approx N/\bar{\lambda}$,

$$\|\nabla_{\theta} \mathcal{L}_{\text{SurF}}\| \leq L \cdot N \cdot \Bigl(\frac{1}{\bar{\lambda}} + \frac{1}{\lambda_{\min}}\Bigr) = O(N). \tag{48}$$

This bound applies to all three variants. The softplus/sigmoid activations ensure smooth gradients, and the structural positivity of $\lambda_{\theta}$ prevents log-intensity singularities.

### F.6 Comparison with RNN-Based TPPs

RNN-based TPPs update hidden states as $\mathbf{h}_{t_i} = g_{\phi}(\mathbf{h}_{t_{i-1}}, \tau_i)$. Backpropagation through time yields

$$\frac{\partial \mathcal{L}}{\partial \tau_i} = \sum_{j=i}^{N} \frac{\partial \mathcal{L}}{\partial \mathbf{h}_{t_j}} \prod_{k=i+1}^{j} \frac{\partial \mathbf{h}_{t_k}}{\partial \mathbf{h}_{t_{k-1}}} \frac{\partial \mathbf{h}_{t_i}}{\partial \tau_i}. \tag{49}$$

If the Jacobians have singular-value bounds $\sigma_{\min}, \sigma_{\max}$,

$$\sigma_{\min}^{\,j-i} \leq \Bigl\| \prod_{k=i+1}^{j} \frac{\partial \mathbf{h}_{t_k}}{\partial \mathbf{h}_{t_{k-1}}} \Bigr\| \leq \sigma_{\max}^{\,j-i}. \tag{50}$$

For $N > 100$, even small deviations from $\sigma = 1$ yield exponential gradient growth or decay. In contrast, SurF's amortized loss gives $\|\nabla_{\theta} \mathcal{L}_{\text{SurF}}\| = O(N)$, linear in sequence length. Figure [6](https://arxiv.org/html/2605.14069#A6.F6) validates this empirically.

![Refer to caption](https://arxiv.org/html/2605.14069v1/figures/gradient_analysis_simplified.png)

Figure 6: Gradient analysis comparing SurF and RNN-based TPPs. SurF gradient evolution at firing rates $\{4, 8, 16, 20\}$ events/sec shows consistent decreasing trends with gradient norms converging faster; overall statistics show SurF's higher gradient magnitudes and lower fluctuations.

## Appendix G Extended Hyperparameter Sensitivity Study

This appendix supplements the ablations of Section [3](https://arxiv.org/html/2605.14069#S4.T3) with the full per-dataset breakdown of the conformer/loss-schedule ablation (Table [6](https://arxiv.org/html/2605.14069#A7.T6)) and the full results of a 100-trial random hyperparameter sweep over SurF-GLQ (Tables [7](https://arxiv.org/html/2605.14069#A7.T7)–[10](https://arxiv.org/html/2605.14069#A7.T10)).

##### Conformer and loss schedule.

Table [6](https://arxiv.org/html/2605.14069#A7.T6) reports time RMSE / type accuracy on each of the six benchmarks under all combinations of conformer toggle and loss schedule.

Table 6: Conformer and loss-schedule ablation (time RMSE / type accuracy %). Bold = best per column.

##### SurF-GLQ hyperparameter sweep: protocol.

The sweep jointly varied: number of Gauss–Legendre points $Q \in \{4, 8, 12\}$; number of log-intensity references $J \in \{3, 4, 5\}$; conformer pre-encoder (on/off); batch size $\in \{128, 256, 512\}$; learning rate $\in [3 \times 10^{-5},\, 8 \times 10^{-4}]$; pretraining forecast-loss weight $\beta_{\text{fcst}} \in [0.69,\, 4.81]$; and pretraining type-loss weight $\beta_{\text{type}} \in [0.51,\, 4.28]$. Trials that diverged (negative SurF loss, indicating numerical failure) were excluded; this affected 5/105 runs, all at extreme learning-rate values.

##### Quadrature points and mixture size.

Table [7](https://arxiv.org/html/2605.14069#A7.T7) aggregates the sweep by $Q$ and $J$.

Table 7: SurF-GLQ quadrature ablation: 100-trial random hyperparameter sweep varying the number of Gauss–Legendre points $Q$ and log-intensity reference components $J$. We report the best, median, and trial-level standard deviation of time RMSE, and the best type accuracy within each group. Bold = best per metric. $Q = 8$ matches $Q = 12$ within trial-level noise ($\sigma_{\text{RMSE}} \approx 0.05$), validating the default in Section [3.1](https://arxiv.org/html/2605.14069#S3.SS1).

##### Conformer $\times$ $Q$ interaction.

Table [8](https://arxiv.org/html/2605.14069#A7.T8) cross-tabulates the conformer toggle against $Q$. The conformer block contributes most when paired with the recommended $Q = 8$, where it lifts best-trial type accuracy from 48.96% ($Q = 4$, no conformer) to 49.94% while preserving the lowest median RMSE. Without the conformer, larger $Q$ partially recovers performance but plateaus by $Q = 12$.

Table 8: Conformer $\times$ $Q$ interaction. Conformer + $Q = 8$ delivers the best accuracy and is competitive in RMSE; further quadrature points provide no measurable gain.

##### Loss-weight sensitivity.

Table [9](https://arxiv.org/html/2605.14069#A7.T9) bins the trials by pretraining forecast-loss weight $\beta_{\text{fcst}}$ and type-loss weight $\beta_{\text{type}}$. Performance is robust across a wide range: $\beta_{\text{fcst}} \in [3, 5]$ and $\beta_{\text{type}} \in [0.7, 1.0]$ form a broad plateau within 0.02 RMSE of the best configuration. Very low forecast weight ($< 1.5$) under-weights the multi-step forecast head and costs ${\sim}0.05$ RMSE; very high type weight ($> 2$) over-emphasizes classification at the expense of timing. The default $\beta_{\text{fcst}} = 4.8$, $\beta_{\text{type}} = 0.74$ used in Section [4](https://arxiv.org/html/2605.14069#S4) sits inside this plateau.

Table 9: Loss-weight sensitivity. Mid-range type weight and high forecast weight form a broad performance plateau.

##### Batch-size effect.

Table [10](https://arxiv.org/html/2605.14069#A7.T10) reports performance by batch size. Batch 256 achieves the best mean RMSE (8.77) and the highest best-trial accuracy (49.94%). Batch 512 matches the lowest absolute RMSE (8.725) but with slightly degraded type accuracy (49.49%); batch 128 is uniformly worst, consistent with higher gradient noise at small batches harming both the SurF likelihood and the classification head.

Table 10: Batch-size effect on the SurF-GLQ HP sweep.

##### Top configurations.

The top-10 trials by time RMSE all come from $Q \in \{8, 12\}$ (60% $Q = 8$, 40% $Q = 12$) and $J \in \{3, 4\}$, span both conformer settings (60% on, 40% off), and use either batch 256 or 512. None of the top-10 trials uses $Q = 4$, $J = 5$, or batch 128, indicating these are the only design choices the sweep clearly rules out. The fact that several near-optimal HP regions exist, rather than a single brittle peak, supports the broader claim that the SurF training objective ([3](https://arxiv.org/html/2605.14069#S3.E3)) is well-conditioned.

## Appendix H Additional Theoretical Material

### H.1 Joint Modeling of Times and Marks

For $K$ event types, SurF factorizes the joint conditional intensity as

$$\lambda^{*}(t, k \mid \mathcal{H}_t) = \lambda^{*}(t \mid \mathcal{H}_t) \cdot p(k \mid t, \mathcal{H}_t), \qquad \sum_{k=1}^{K} p(k \mid t, \mathcal{H}_t) = 1, \tag{51}$$

where $\lambda^{*}(t \mid \mathcal{H}_t)$ is the marginal ground intensity (modeled by $\Lambda_{\theta}$) and $p(k \mid t, \mathcal{H}_t)$ is the conditional mark distribution (modeled by $\text{TypeNet}_{\theta}$ with a softmax head). Substituting into the point-process likelihood, the joint NLL decomposes exactly:

$$-\log p(\mathcal{D}) = \underbrace{\mathcal{L}_{\text{SurF}}(\theta)}_{\text{time}} + \underbrace{\sum_{i=1}^{N} -\log p_{\theta}(k_i \mid t_i, \mathcal{H}_{t_i})}_{\mathcal{L}_{\text{type}}(\theta)}. \tag{52}$$

No variational bound or coupling term is required; the factorization is exact at the likelihood level. The weight $\beta_{\text{type}}$ in Algorithm [1](https://arxiv.org/html/2605.14069#alg1) is introduced purely to balance the relative scales of the two NLL components during optimization, not as a variational coefficient.
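A minimal sketch of this decomposition for one sequence, with placeholder tensors in place of the model outputs and the default $\beta_{\text{type}} = 0.74$ used only to illustrate the scale balancing:

```python
import torch
import torch.nn.functional as F

# Joint NLL as in Eq. (52): a time term (exponential NLL of the rescaled residuals)
# plus a type cross-entropy term. All tensors below are placeholders.
torch.manual_seed(0)
N, K = 6, 3
dz = torch.rand(N) + 0.1                  # dz_i = Lambda_theta(tau_i | h_{i-1})
log_lam = torch.randn(N)                  # log lambda_theta(tau_i | h_{i-1})
type_logits = torch.randn(N, K)           # TypeNet_theta(h_{i-1}) before the softmax
types = torch.randint(0, K, (N,))         # observed marks k_i

loss_time = (dz - log_lam).sum()                                    # SurF (time) NLL
loss_type = F.cross_entropy(type_logits, types, reduction="sum")    # mark NLL
beta_type = 0.74                                                    # scale-balancing weight
loss = loss_time + beta_type * loss_type
print(loss.item())
```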

##### Preservation of the TRT under marking.

Because $\lambda^{*}(t, k) = \lambda^{*}(t)\, p(k \mid t)$ is a strict product, the TRT applied to the ground intensity $\lambda^{*}(t) = \sum_k \lambda^{*}(t, k)$ yields the same $\mathrm{Exp}(1)$ rescaling regardless of $K$, so Theorem [2](https://arxiv.org/html/2605.14069#Thmtheorem2) holds unchanged and Proposition [Proposition](https://arxiv.org/html/2605.14069#Thmpropositionx4) is unaffected by the number of types. Marks are generated conditionally at sampling time (Algorithm [3](https://arxiv.org/html/2605.14069#alg3)).

##### Alternative: $K$-dimensional intensity.

A per-type formulation $\{\lambda_k^{*}(t)\}_{k=1}^{K}$ with separate compensators is also supported by letting the cumulative-intensity head output $K$ copies of the parameters. The TRT then applies per type, giving $K$ independent $\mathrm{Exp}(1)$ rescalings; the joint likelihood equals the sum of $K$ per-type instances of the unified loss.

### H.2 Relationship to Prior Cumulative-Hazard Neural TPPs

Parameterizing the cumulative hazard $\Lambda_{\theta}$ directly and recovering $\lambda_{\theta}$ via differentiation is not new: Omi et al. [Omi et al., [2019a](https://arxiv.org/html/2605.14069#bib.bib28)] introduced this construction in *FullyNN*, where $\Lambda_{\theta}$ is a feed-forward network with non-negative weights, and Shchur et al. [Shchur et al., [2019](https://arxiv.org/html/2605.14069#bib.bib23)] use a log-normal mixture with a related closed-form compensator. SurF differs along three axes.

##### (i) Conceptual framing.

Prior work treats the monotone network as an engineering convenience for avoiding the $\int_0^T \lambda\, ds$ integral. SurF treats $\Lambda_{\theta}$ as a normalizing flow between the point process and i.i.d. $\mathrm{Exp}(1)$ noise, with the TRT supplying the Jacobian. This licenses the dataset-invariance claim (Proposition [Proposition](https://arxiv.org/html/2605.14069#Thmpropositionx4)) and motivates cross-dataset pretraining; FullyNN and its descendants are trained per dataset and make no such claim.

##### (ii) Parameterization.

FullyNN enforces monotonicity via non-negative-weight MLPs, which are ill-conditioned at depth. SurF-CSB uses a convex combination of softplus basis functions that admits closed-form derivatives and is universal for positive intensities (Proposition [Proposition](https://arxiv.org/html/2605.14069#Thmpropositionx2)). SurF-MoE specializes further to a mixture of exponentials with a fully closed-form loss; SurF-GLQ provides an unconstrained-MLP alternative with per-interval quadrature.

##### (iii) Amortization.

FullyNN's hazard depends on history through an RNN hidden state shared across the monotone MLP input. SurF amortizes via a Transformer encoder whose output $\mathbf{h}_{i-1}$ parameterizes a fresh $\Lambda_{\theta}(\cdot \mid \mathbf{h}_{i-1})$ at each interval, yielding the diagonal Jacobian of Appendix [B](https://arxiv.org/html/2605.14069#A2) and the $O(N)$ gradient bound of Appendix [F](https://arxiv.org/html/2605.14069#A6) rather than the $O(\sigma^{N})$ scaling of RNN-based hazards.

##### Empirical comparison.

Table [2](https://arxiv.org/html/2605.14069#S4.T2) includes FullyNN as a baseline; SurF outperforms it on all six benchmarks, and the gap widens in the zero-shot regime, where FullyNN has no mechanism for cross-dataset transfer.

### H.3 When to Use Which Variant, and the Role of Quadrature

The three variants target different regimes and differ in how (if at all) they invoke quadrature.

##### SurF-MoE: closed-form, fastest.

Between-event intensity is monotone-decaying and multi-scale. Compensator and derivative are exact closed forms; the loss invokes zero quadrature. This is the default for foundation-model pretraining.

##### SurF-CSB: closed-form, most expressive monotone family.

Intensity may be non-monotone (delayed excitation, hump-shaped profiles); the shifted-softplus basis keeps both $\Lambda_{\theta}$ and $\lambda_{\theta}$ in closed form and is universal for positive intensities (Proposition [Proposition](https://arxiv.org/html/2605.14069#Thmpropositionx2)).

##### SurF-GLQ with fixed-cost per-interval quadrature.

The intensity is an unconstrained positive MLP; $\Lambda_{\theta}$ uses $Q = 8$ Gauss–Legendre nodes *per inter-event interval* $[0, \tau_i]$, not over the full window $[0, T]$.

Table 11: Quadrature cost across methods. Only SurF-GLQ uses quadrature; unlike THP/SAHP, its cost is bounded per interval, independent of $T$, and the error does not enter the gradient in practice ($< 10^{-12}$ at $Q = 8$).

##### Reading the "no quadrature" claim precisely.

The statement "exact likelihoods with no numerical quadrature" in the introduction should be read as: *no quadrature at all for MoE and CSB; fixed $O(Q)$-per-interval quadrature for GLQ, with $T$-independent cost and negligible error that does not enter the gradient.* In all three cases, likelihood evaluation is non-iterative and the training signal is unbiased to machine precision.
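For concreteness, here is a small NumPy sketch of the per-interval Gauss–Legendre compensator, with a toy positive intensity standing in for the learned MLP $\tilde{\lambda}_{\theta}(\cdot \mid \mathbf{h})$:

```python
import numpy as np

# Per-interval Gauss-Legendre compensator used by SurF-GLQ:
#   Lambda(tau) ~= (tau/2) * sum_q w_q * lambda(tau * (x_q + 1) / 2),
# with Q nodes on [0, tau] rather than over the full window [0, T].
Q = 8
nodes, weights = np.polynomial.legendre.leggauss(Q)   # nodes/weights on [-1, 1]

def cumulative_intensity(intensity, tau):
    t = 0.5 * tau * (nodes + 1.0)          # map [-1, 1] -> [0, tau]
    return 0.5 * tau * np.sum(weights * intensity(t))

intensity = lambda t: 0.5 + np.exp(-2.0 * t)           # toy positive intensity
tau = 1.3
approx = cumulative_intensity(intensity, tau)
exact = 0.5 * tau + (1 - np.exp(-2.0 * tau)) / 2.0     # closed form for the toy case
print(approx, exact)   # agree to near machine precision at Q = 8
```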

### H.4 Shared Canonical Target

###### Proposition (Shared canonical target).

Let $\mathcal{P}_1, \mathcal{P}_2$ be point processes arising from potentially different datasets with intensities $\lambda_1^{*}, \lambda_2^{*}$, and let $(\phi_{\theta}, \Lambda_{\theta})$ be trained jointly on both. The population-optimal $\Lambda_{\theta} \circ \phi_{\theta}$ maps inter-arrival times to $\mathrm{Exp}(1)$ on each dataset, so the loss couples the encoder through a shared canonical objective even though its $(t_i, k_i)$ inputs remain dataset-specific.

## Appendix I Datasets

Table 12: Pre-training corpus composition. The corpus spans 6 datasets with diverse temporal scales, totaling ${\sim}1.3$M events across ${\sim}32$K sequences.

We evaluate SurF on six real-world datasets spanning e-commerce, transportation, social media, geoscience, and online collaboration:

- • Taobao [Xue et al., [2022](https://arxiv.org/html/2605.14069#bib.bib53)]: user click sequences from a large-scale online recommendation platform, capturing temporal patterns of user engagement with product categories.
- • Taxi [Whong, [2014](https://arxiv.org/html/2605.14069#bib.bib84)]: taxi pick-up events across New York City neighborhoods, providing insights into urban mobility patterns.
- • Retweet [Zhou et al., [2013](https://arxiv.org/html/2605.14069#bib.bib89)]: social cascades of retweet behaviors and user identities over time.
- • StackOverflow [Leskovec and Krevl, [2014](https://arxiv.org/html/2605.14069#bib.bib83)]: activity logs from a question-and-answer platform with events representing posting and voting interactions.
- • Amazon [Ni, [2018](https://arxiv.org/html/2605.14069#bib.bib85)]: product review sequences where each event is a user reviewing a product in a specific category.
- • Earthquake: seismic event sequences with magnitude and location codes, covering multi-day inter-event gaps from regional geological catalogs.

## Appendix J Model Architecture and Hyperparameters

We parameterize the conditional intensity as a single positive function that depends on both the current time and the event history, ensuring a bijection between temporal and exponential spaces:

$$\lambda_{\theta}(t \mid \mathcal{H}_t) = \lambda_{\text{base}} + \sigma\bigl(f_{\theta}(\phi_{\theta}(t, \mathcal{H}_t))\bigr), \tag{53}$$

where $\lambda_{\text{base}} > 0$ ensures a minimum intensity (guaranteeing invertibility), $\phi_{\theta} : \mathbb{R}^{+} \times \mathcal{H} \to \mathbb{R}^{d}$ is a history encoding function, $f_{\theta} : \mathbb{R}^{d} \to \mathbb{R}$ is a neural network, and $\sigma : \mathbb{R} \to [0, B]$ is a bounded positive activation ensuring $\lambda_{\theta} \leq \lambda_{\text{base}} + B < \infty$.

### J.1 History Encoding

A multi-layer Transformer with time-aware embeddings:

$$\phi_{\theta}(t, \mathcal{H}_t) = \text{TransformerEncoder}_{\theta}\bigl(\text{Conformer}_{\xi}(\mathbf{E}_{\text{combined}}, \mathbf{M})\bigr), \tag{54}$$

$$\mathbf{E}_{\text{combined}} = [\mathbf{E}_{\text{type}};\, \mathbf{E}_{\text{time}}] \in \mathbb{R}^{N \times 2 d_{\text{hidden}}}. \tag{55}$$

Type embeddings: $\mathbf{E}_{\text{type}}^{(i)} = \text{Embedding}(k_i + 1) \in \mathbb{R}^{d_{\text{hidden}}}$ (the offset ensures valid indices).

Temporal encoding. Normalized time prevents numerical instability:

$$\mathbf{E}_{\text{time}}^{(i)} = \text{MLP}_{\text{time}}(t_i / T_{\text{scale}}), \qquad T_{\text{scale}} = \max(t_1, \ldots, t_N) + \epsilon, \tag{56}$$

with $\text{MLP}_{\text{time}}(x) = \text{Linear}_2(\tanh(\text{Linear}_1(x)))$, $\text{Linear}_1 : \mathbb{R} \to \mathbb{R}^{d_{\text{hidden}}/2}$ and $\text{Linear}_2 : \mathbb{R}^{d_{\text{hidden}}/2} \to \mathbb{R}^{d_{\text{hidden}}}$. The Conformer may or may not be included, depending on the SurF variant. The Transformer is causally masked.
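A minimal PyTorch sketch of this embedding pipeline (sizes are illustrative and module names are ours):

```python
import torch
import torch.nn as nn

# Time-aware embedding of Eqs. (55)-(56): type embedding plus a two-layer tanh MLP
# on normalized times, concatenated into E_combined for the Transformer encoder.
d_hidden, K, N = 64, 5, 12
embed_type = nn.Embedding(K + 1, d_hidden)            # offset by +1 for valid indices
mlp_time = nn.Sequential(nn.Linear(1, d_hidden // 2), nn.Tanh(),
                         nn.Linear(d_hidden // 2, d_hidden))

t = torch.sort(torch.rand(N) * 10.0).values            # event times
k = torch.randint(0, K, (N,))                           # event types
t_scale = t.max() + 1e-6                                # T_scale = max(t_1, ..., t_N) + eps
e_time = mlp_time((t / t_scale).unsqueeze(-1))          # (N, d_hidden)
e_type = embed_type(k + 1)                              # (N, d_hidden)
e_combined = torch.cat([e_type, e_time], dim=-1)        # (N, 2 * d_hidden), fed to the encoder
print(e_combined.shape)
```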

### J.2 Intensity Network

A three-layer MLP with ReLU activations:

$$f_{\theta}(\mathbf{h}) = \text{Linear}_3(\text{ReLU}(\text{Dropout}(\text{Linear}_2(\text{ReLU}(\text{Linear}_1(\mathbf{h})))))), \tag{57}$$

with $\text{Linear}_1 : \mathbb{R}^{2 d_{\text{hidden}}} \to \mathbb{R}^{d_{\text{hidden}}}$, $\text{Linear}_2 : \mathbb{R}^{d_{\text{hidden}}} \to \mathbb{R}^{d_{\text{hidden}}/2}$, $\text{Linear}_3 : \mathbb{R}^{d_{\text{hidden}}/2} \to \mathbb{R}$. The bounded activation is $\sigma(x) = \text{sigmoid}(x) \cdot (\lambda_{\max} - \lambda_{\text{base}})$, ensuring $\lambda_{\text{base}} \leq \lambda_{\theta} \leq \lambda_{\max}$.
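Putting Eqs. (53) and (57) together, a minimal sketch of the bounded intensity network; the values of $\lambda_{\text{base}}$ and $\lambda_{\max}$ here are placeholders, not the trained settings:

```python
import torch
import torch.nn as nn

# Bounded intensity lambda_theta = lambda_base + sigmoid(f_theta(h)) * (lambda_max - lambda_base),
# so lambda_base <= lambda_theta <= lambda_max by construction.
d_hidden, lam_base, lam_max = 64, 1e-3, 50.0
f_theta = nn.Sequential(
    nn.Linear(2 * d_hidden, d_hidden), nn.ReLU(),
    nn.Linear(d_hidden, d_hidden // 2), nn.Dropout(0.1), nn.ReLU(),
    nn.Linear(d_hidden // 2, 1),
)

h = torch.randn(8, 2 * d_hidden)                       # encoder outputs for a small batch
lam = lam_base + torch.sigmoid(f_theta(h)) * (lam_max - lam_base)
assert (lam > 0).all() and (lam <= lam_max).all()      # structural positivity and boundedness
print(lam.squeeze(-1))
```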

### J.3 Cumulative-Intensity Head

For the amortized variants, the Transformer output $\mathbf{h}_{i-1}$ feeds a cumulative-intensity head that produces the parameters of $\Lambda_{\theta}(\cdot \mid \mathbf{h}_{i-1})$.

SurF-CSB head. A single linear layer maps $\mathbf{h}_{i-1}$ to $3M$ raw parameters $(\hat{\alpha}_m, \hat{\beta}_m, \hat{\delta}_m)_{m=1}^{M}$, with $\alpha_m = \text{softplus}(\hat{\alpha}_m)$, $\beta_m = \text{softplus}(\hat{\beta}_m) + \epsilon$. The $\delta_m$ are unconstrained, initialized uniformly in $[-3, 3]$ to span early-to-late activation.

SurF-GLQ head. A two-layer MLP takes $(\Delta t, \mathbf{h}_{i-1})$ and outputs a scalar, $f_{\theta}(\Delta t, \mathbf{h}) = \text{MLP}([\Delta t / T_{\text{scale}};\, \mathbf{h}])$. Intensity: $\tilde{\lambda}_{\theta} = \text{softplus}(f_{\theta})$; the cumulative intensity follows via Gauss–Legendre quadrature (Equation [27](https://arxiv.org/html/2605.14069#A4.E27)). The MLP is lightweight (two layers, width $d_{\text{hidden}}/2$), and the $Q$ passes are batched.

SurF-MoE head. A single linear layer maps $\mathbf{h}_{i-1}$ to $2J$ raw parameters $(\hat{w}_j, \hat{\gamma}_j)_{j=1}^{J}$. Rate biases are initialized spanning $[-2, 4]$ for multi-scale coverage.
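A sketch of the SurF-MoE head and its closed-form compensator, with layer names ours and dimensions illustrative; the rate-bias initialization follows the span quoted above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# SurF-MoE head: one linear layer maps the encoder state h_{i-1} to 2J raw parameters,
# activated with softplus; the compensator and intensity are then closed-form.
d_hidden, J = 64, 4
head = nn.Linear(2 * d_hidden, 2 * J)
with torch.no_grad():                        # rate biases initialized spanning [-2, 4]
    head.bias[J:] = torch.linspace(-2.0, 4.0, J)

h = torch.randn(3, 2 * d_hidden)             # batch of encoder states
raw = head(h)
w = F.softplus(raw[:, :J])                   # mixture weights w_j
gamma = F.softplus(raw[:, J:]) + 1e-6        # mixture rates gamma_j (floored)

tau = torch.rand(3, 1)                       # candidate inter-event times
Lam = ((w / gamma) * (1 - torch.exp(-gamma * tau))).sum(-1)   # closed-form compensator
lam = (w * torch.exp(-gamma * tau)).sum(-1)                   # closed-form intensity
print(Lam.shape, lam.shape)
```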

### J.4 Type Prediction Head

The output of the history encoding is passed through a Type Refinement Tower consisting of 3–4 Transformer encoder layers to produce a type context. This context and the output of the intensity network are passed through separate MLPs and fused using a gate before being projected to the type embedding space:

$$\text{Proj}\bigl(\text{GatedFusion}\bigl([\text{MLP}_{\text{type}}(\text{TypeRefinementTower}(\phi_{\theta}(t, \mathcal{H}_t)));\, \text{MLP}_{\text{intensity}}(f_{\theta}(\mathbf{h}))]\bigr)\bigr). \tag{58}$$

### J.5 Hyperparameter Search

We optimize the time RMSE directly,

$$\mathcal{L}_{\text{hyperopt}} = \text{RMSE}_{\text{time}}, \tag{59}$$

since we found that type accuracy is not heavily impacted by small changes in hyperparameters. We use Optuna with TPE sampling and median pruning for early termination of unpromising trials (patience 10 epochs).

Searched: $d_{\text{hidden}} \in \{32, 64, 128, 256, 512\}$, layers $L \in \{2, 3, 4\}$, batch size $B \in \{128, 256, 512\}$, learning rate $\eta \in [10^{-5}, 10^{-3}]$ (log-uniform), basis components $M \in \{8, 12, 16\}$ (CSB) or quadrature points $Q \in \{4, 8, 12\}$ (GLQ).

Fixed: $H = 4$ attention heads, dropout 0.1, weight decay $10^{-5}$, cosine annealing with warm restarts, gradient accumulation 1. We require $d_{\text{hidden}} \equiv 0 \pmod{H}$.
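A sketch of this search loop with Optuna; `train_and_eval` is a hypothetical stand-in for the actual training routine, here stubbed so the snippet runs:

```python
import optuna

def train_and_eval(cfg, trial):
    # Stand-in for the real training routine; returns a placeholder "validation time RMSE".
    return 10.0 / cfg["d_hidden"] + cfg["lr"]

def objective(trial):
    cfg = {
        "d_hidden":   trial.suggest_categorical("d_hidden", [32, 64, 128, 256, 512]),
        "layers":     trial.suggest_categorical("layers", [2, 3, 4]),
        "batch_size": trial.suggest_categorical("batch_size", [128, 256, 512]),
        "lr":         trial.suggest_float("lr", 1e-5, 1e-3, log=True),
        "Q":          trial.suggest_categorical("Q", [4, 8, 12]),   # or M in {8, 12, 16} for CSB
    }
    return train_and_eval(cfg, trial)            # objective: validation time RMSE, Eq. (59)

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(),
                            pruner=optuna.pruners.MedianPruner(n_warmup_steps=10))
study.optimize(objective, n_trials=100)
print(study.best_params)
```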

## Appendix K Computational Complexity

Training per sequence.

- • RNN TPPs: $O(N \cdot H \cdot D)$ due to sequential dependencies and BPTT.
- • SurF-MoE: $O(N \cdot J)$ with exact closed-form integration.
- • SurF-CSB: $O(N \cdot M)$ with closed-form derivative evaluation.
- • SurF-GLQ: $O(N \cdot Q \cdot D_f)$, where $D_f$ is the MLP cost, with parallelizable quadrature points.

Sampling.

- • RNN TPPs: $O(N \cdot H \cdot D)$ sequential operations.
- • All SurF variants: $O(N \cdot S)$ with $S \leq 8$ Newton iterations, dominated by the Transformer forward pass.

The ability to parallelize training and the stable gradient flow make SurF particularly suitable for long sequences with varying time scales, where RNN methods struggle with both efficiency and numerical stability.

## Appendix L Additional Empirical Analysis

### L.1 Finetuned Models vs Top Baselines

![Refer to caption](https://arxiv.org/html/2605.14069v1/figures/comparison_finetuned_vs_baselines_rmse_and_type_acc.png)

Figure 7: Next-event inter-arrival RMSE (top) and type accuracy (bottom) for finetuned SurF variants and three baseline models (DTPP, NHP, NJDTPP). NJDTPP was not evaluated on Amazon in its paper and is hence omitted.

### L.2 Forecast Trajectories

![Refer to caption](https://arxiv.org/html/2605.14069v1/figures/all_variants_ks_cdf.png)

Figure 8: Time-rescaling goodness-of-fit across datasets and variants. Empirical CDF of the residuals $\Delta z_i = \Lambda(\tau_i \mid \mathbf{h}_{i-1})$ on the test split. A perfectly calibrated model produces residuals that are i.i.d. $\mathrm{Exp}(1)$ (black dotted curve); a smaller KS statistic $D$ indicates better calibration. SurF achieves near-ideal calibration on Taxi and StackOverflow ($D \leq 0.1$ for all variants).

![Refer to caption](https://arxiv.org/html/2605.14069v1/figures/trajectories_all_datasets.png)

Figure 9: Forecast trajectories across datasets. Each subplot shows one representative sequence per dataset (chosen by lowest finetuned RMSE). Ground truth (circles), pretrained (squares), and finetuned (triangles) forecast times; colors indicate event types.
