Latent Block-Diffusion Temporal Point Processes: A Semi-Autoregressive Framework for Asynchronous Event Sequence Generation

arXiv cs.LG Papers

Summary

Introduces a semi-autoregressive framework that combines latent block diffusion with temporal point processes for generating asynchronous event sequences, reducing error accumulation while enabling variable-length output.

arXiv:2606.24982v1 Announce Type: new

Cached at: 06/25/26, 05:09 AM

# Latent Block-Diffusion Temporal Point Processes: A Semi-Autoregressive Framework for Asynchronous Event Sequence Generation
Source: [https://arxiv.org/html/2606.24982](https://arxiv.org/html/2606.24982)
Shuai Zhang, Yancheng Chen, Chuan Zhou, Yang Liu, Xixun Lin, Xiangyu Zhao, Jun Zhu, and Zhi\-Ming Ma

Shuai Zhang, Yancheng Chen, Chuan Zhou, Yang Liu, and Zhi\-Ming Ma are with the Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China \(e\-mail: zhangshuai2021@amss\.ac\.cn; chenyancheng22@mails\.ucas\.ac\.cn; zhouchuan@amss\.ac\.cn; liuyang2020@amss\.ac\.cn; mazm@amt\.ac\.cn\)\. Xixun Lin is with the Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China \(e\-mail: linxixun@iie\.ac\.cn\)\. Xiangyu Zhao is with the Department of Data Science, City University of Hong Kong, Hong Kong 999077, China \(e\-mail: xy\.zhao@cityu\.edu\.hk\)\. Jun Zhu is with the Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China \(e\-mail: dcszj@tsinghua\.edu\.cn\)\.

###### Abstract

Modeling and sampling from the underlying distribution of asynchronous event sequences are crucial in various real\-world applications, including social networks, medical diagnosis, and financial transactions\. Existing autoregressive methods suffer from error accumulation during multi\-step generation, while non\-autoregressive diffusion methods are typically limited to fixed\-length output sequences\. In this paper, we propose Latent Block\-Diffusion Temporal Point Processes \(LBDTPP\), a novel semi\-autoregressive TPP framework that introduces a latent block diffusion mechanism for high\-quality and variable\-length event sequence generation\. The core idea is to define an autoregressive probability distribution over event blocks in latent space and perform Gaussian diffusion within each block\. By sequentially generating blocks while simultaneously sampling events in each block, LBDTPP preserves the length flexibility of autoregressive TPPs and inherits the parallel high\-quality generation capability of diffusion models\. Theoretically, we derive Wasserstein error bounds showing that, under suitable local approximation and prefix\-stability assumptions, block\-wise generation can reduce error accumulation compared with event\-wise autoregressive generation\. Extensive experiments on six real\-world benchmark datasets demonstrate that LBDTPP outperforms state\-of\-the\-art TPP baselines in both unconditional and conditional generation tasks\. Further empirical analyses verify the benefits of latent\-space diffusion and block\-wise generation, and reveal the trade\-off between generation quality and block size\. Our code is available at[https://github\.com/Zh\-Shuai/LBDTPP](https://github.com/Zh-Shuai/LBDTPP)\.

## I Introduction

Asynchronous event sequences are abundant in many real\-world applications, including social networks \[[14](https://arxiv.org/html/2606.24982#bib.bib62),[8](https://arxiv.org/html/2606.24982#bib.bib69),[17](https://arxiv.org/html/2606.24982#bib.bib67)\], medical diagnosis \[[13](https://arxiv.org/html/2606.24982#bib.bib72),[35](https://arxiv.org/html/2606.24982#bib.bib48),[53](https://arxiv.org/html/2606.24982#bib.bib73)\], and financial transactions \[[54](https://arxiv.org/html/2606.24982#bib.bib74),[69](https://arxiv.org/html/2606.24982#bib.bib17),[24](https://arxiv.org/html/2606.24982#bib.bib34)\]\. Each event in a sequence consists of a continuous timestamp and a discrete mark, representing when the event occurred and what type of event it was\. For example, in a social network, an event may record that a user interacted with a post at a specific time, with the mark indicating the corresponding interaction type such as posting, commenting, or sharing\. Faithfully modeling and generating such sequences are essential for understanding complex temporal dynamics and supporting decision\-making in various domains\. Based on the availability of historical information, event sequence generation tasks can be broadly categorized into unconditional generation \[[45](https://arxiv.org/html/2606.24982#bib.bib50),[37](https://arxiv.org/html/2606.24982#bib.bib19),[39](https://arxiv.org/html/2606.24982#bib.bib20)\], which aims to simulate high\-fidelity event sequences from the underlying data distribution, and conditional generation \[[12](https://arxiv.org/html/2606.24982#bib.bib49),[79](https://arxiv.org/html/2606.24982#bib.bib4),[65](https://arxiv.org/html/2606.24982#bib.bib12)\], which focuses on predicting event occurrences given historical observations\.

Temporal point processes \(TPPs\) \[[10](https://arxiv.org/html/2606.24982#bib.bib21),[55](https://arxiv.org/html/2606.24982#bib.bib70),[29](https://arxiv.org/html/2606.24982#bib.bib64),[46](https://arxiv.org/html/2606.24982#bib.bib65)\] are the dominant modeling framework for asynchronous event sequences\. Most existing TPP models, including Poisson processes \[[27](https://arxiv.org/html/2606.24982#bib.bib56)\], self\-correcting processes \[[23](https://arxiv.org/html/2606.24982#bib.bib40)\], Hawkes processes \[[20](https://arxiv.org/html/2606.24982#bib.bib39)\] and their Transformer variants \[[70](https://arxiv.org/html/2606.24982#bib.bib60),[79](https://arxiv.org/html/2606.24982#bib.bib4),[67](https://arxiv.org/html/2606.24982#bib.bib2)\], follow an autoregressive paradigm, where events are modeled sequentially, with each event conditioned on the preceding history\. While autoregressive TPPs naturally support variable\-length generation and have achieved strong performance in predicting the next event, their one\-by\-one generation procedure suffers from error accumulation during multi\-step generation \[[37](https://arxiv.org/html/2606.24982#bib.bib19),[25](https://arxiv.org/html/2606.24982#bib.bib29)\]\. Small errors introduced at early steps may propagate through subsequent steps and gradually amplify, leading to substantial degradation in generation quality \[[66](https://arxiv.org/html/2606.24982#bib.bib8),[69](https://arxiv.org/html/2606.24982#bib.bib17)\]\.

Recently, diffusion probabilistic models \[[56](https://arxiv.org/html/2606.24982#bib.bib25),[21](https://arxiv.org/html/2606.24982#bib.bib24),[57](https://arxiv.org/html/2606.24982#bib.bib41)\] have emerged as a powerful framework for generative modeling, with notable applications in computer vision \[[58](https://arxiv.org/html/2606.24982#bib.bib87),[51](https://arxiv.org/html/2606.24982#bib.bib55),[36](https://arxiv.org/html/2606.24982#bib.bib44),[4](https://arxiv.org/html/2606.24982#bib.bib77)\] and natural language processing \[[31](https://arxiv.org/html/2606.24982#bib.bib76),[44](https://arxiv.org/html/2606.24982#bib.bib54),[60](https://arxiv.org/html/2606.24982#bib.bib53)\]\. Building upon this framework, non\-autoregressive diffusion TPPs have been proposed for modeling event sequences \[[37](https://arxiv.org/html/2606.24982#bib.bib19),[69](https://arxiv.org/html/2606.24982#bib.bib17),[25](https://arxiv.org/html/2606.24982#bib.bib29)\]\. By generating multiple events simultaneously, these models avoid the one\-by\-one sampling of autoregressive TPPs and achieve improved performance in multi\-step forecasting tasks\. Nevertheless, existing diffusion\-based approaches for event sequences typically model the conditional distribution of a fixed\-length future sequence given historical events \[[69](https://arxiv.org/html/2606.24982#bib.bib17),[77](https://arxiv.org/html/2606.24982#bib.bib28)\]\. When applied to unconditional generation within a time interval $[0,T]$, such non\-autoregressive methods need to model the joint distribution of the entire sequence and generate it in a single shot\. As a result, this paradigm is limited to producing sequences with a pre\-specified number of events, reducing flexibility and making it unsuitable for realistic scenarios where sequence lengths are unknown and variable\.

To mitigate the issues of error accumulation in autoregressive TPPs and fixed\-length generation in non\-autoregressive diffusion TPPs, we introduce Latent Block\-Diffusion Temporal Point Processes \(LBDTPP\), a novel semi\-autoregressive TPP framework that supports high\-quality, variable\-length event sequence generation in unconditional and conditional settings\. LBDTPP decomposes event sequence generation into two levels: \(i\) sequential generation across blocks to preserve event dependencies and support variable\-length generation, and \(ii\) parallel generation of multiple events within each block via Gaussian diffusion to alleviate error accumulation\. This design enables our model to combine the strengths of prior paradigms: it retains the length flexibility of autoregressive TPPs, while obtaining the parallel high\-quality generation capability of non\-autoregressive diffusion TPPs\.

Specifically, LBDTPP draws inspiration from discrete block diffusion models \[[1](https://arxiv.org/html/2606.24982#bib.bib18)\] developed for token generation in natural language, but introduces a latent block diffusion formulation for asynchronous event sequences\. Unlike discrete text tokens, event data couple continuous timestamps with discrete marks, making both discrete and continuous diffusion unsuitable for direct application\. To this end, LBDTPP first maps each event into a continuous latent space and then factorizes the latent sequence distribution as a product of conditional distributions over event blocks, while performing Gaussian diffusion within each block\. Our model sequentially samples latent blocks, generates multiple event representations in parallel within each block, and decodes them back to the original event space, enabling variable\-length and high\-quality event sequence generation\. We further derive Wasserstein generation\-error bounds under local approximation and prefix\-stability assumptions, indicating that block\-wise sampling can reduce prefix\-level error accumulation by shortening the recursive sampling horizon from events to blocks\. Extensive experiments on six real\-world datasets demonstrate that LBDTPP outperforms state\-of\-the\-art TPP baselines, and subsequent analysis shows that the gains stem from latent\-space diffusion and block\-wise generation\.

Our main contributions are as follows:

- We introduce LBDTPP, a latent block diffusion framework for modeling asynchronous event sequences\. By factorizing the latent sequence distribution across event blocks and performing Gaussian diffusion within each block, LBDTPP forms a semi\-autoregressive generation paradigm that supports variable\-length generation and parallel high\-quality sampling, while mitigating error accumulation in autoregressive TPPs and overcoming the fixed\-length limitation of non\-autoregressive diffusion TPPs\.
- We provide a theoretical analysis of error accumulation for event\-wise and block\-wise generation\. Under explicit local approximation and prefix\-stability assumptions, we derive Wasserstein bounds showing that event\-wise autoregressive generation accumulates errors over all event\-level sampling steps, whereas block\-wise generation accumulates errors only over block\-level transitions\. This analysis explains why block\-wise generation can mitigate prefix\-level error accumulation and clarifies the block size trade\-off observed empirically\.
- We conduct extensive experiments on six real\-world benchmark datasets across multiple domains\. Experimental results demonstrate that LBDTPP outperforms both autoregressive and non\-autoregressive TPP baselines in unconditional and conditional generation tasks\. Further analysis validates the benefits of latent\-space diffusion and block\-wise generation\. Sampling time comparisons show that LBDTPP achieves competitive generation efficiency, and its fast version can be faster than all baseline models\.

## II Related Work

In this section, we review TPP\-based methods for event sequence modeling and generation, including autoregressive TPPs and non\-autoregressive diffusion TPPs\. We also briefly discuss block diffusion models, which provide methodological inspiration for our latent block diffusion TPPs\.

### II\-A Autoregressive TPPs

Temporal point processes \(TPPs\) \[[10](https://arxiv.org/html/2606.24982#bib.bib21),[16](https://arxiv.org/html/2606.24982#bib.bib66),[32](https://arxiv.org/html/2606.24982#bib.bib68),[75](https://arxiv.org/html/2606.24982#bib.bib33)\] are a class of stochastic processes for modeling sequences of random events in continuous time\. Most existing TPP models follow an autoregressive paradigm, where each event is modeled conditioned on its preceding events\. Early works \[[20](https://arxiv.org/html/2606.24982#bib.bib39),[23](https://arxiv.org/html/2606.24982#bib.bib40),[55](https://arxiv.org/html/2606.24982#bib.bib70),[15](https://arxiv.org/html/2606.24982#bib.bib75)\] rely on parametric conditional intensity functions \(CIFs\) to characterize the expected rate of event occurrences given the history\. More recent approaches employ neural networks to learn more flexible CIFs \[[64](https://arxiv.org/html/2606.24982#bib.bib71),[41](https://arxiv.org/html/2606.24982#bib.bib1),[79](https://arxiv.org/html/2606.24982#bib.bib4),[71](https://arxiv.org/html/2606.24982#bib.bib42)\] or conditional probability density functions \[[52](https://arxiv.org/html/2606.24982#bib.bib3),[73](https://arxiv.org/html/2606.24982#bib.bib86),[47](https://arxiv.org/html/2606.24982#bib.bib57),[72](https://arxiv.org/html/2606.24982#bib.bib78)\]\. Although autoregressive TPPs naturally support variable\-length generation and perform well in next\-event prediction, their sampling procedure remains inherently sequential, generating events one by one\. Consequently, errors from early steps may propagate through subsequent events and accumulate during long\-horizon generation \[[66](https://arxiv.org/html/2606.24982#bib.bib8),[37](https://arxiv.org/html/2606.24982#bib.bib19),[69](https://arxiv.org/html/2606.24982#bib.bib17)\]\.
In contrast, our LBDTPP model generates multiple high\-quality events simultaneously within each block, alleviating the error accumulation issue and having the potential to improve sampling efficiency\.

### II\-B Non\-autoregressive Diffusion TPPs

Diffusion\-based TPP models have emerged as a promising approach for event sequence modeling\. DSTPP \[[68](https://arxiv.org/html/2606.24982#bib.bib26)\] adopts denoising diffusion models to capture spatio\-temporal event dynamics\. AddThin \[[37](https://arxiv.org/html/2606.24982#bib.bib19)\] and PSDiff \[[39](https://arxiv.org/html/2606.24982#bib.bib20)\] leverage the thinning and superposition properties of point processes \[[10](https://arxiv.org/html/2606.24982#bib.bib21)\] to design diffusion\-like models on the positive real space and general metric spaces, respectively\. However, none of these methods has been explored for TPPs with discrete marks\. EventFlow \[[25](https://arxiv.org/html/2606.24982#bib.bib29)\] and EdiTPP \[[38](https://arxiv.org/html/2606.24982#bib.bib30)\] employ flow matching \[[34](https://arxiv.org/html/2606.24982#bib.bib32),[19](https://arxiv.org/html/2606.24982#bib.bib31)\] for unconditional and conditional generation of event timestamp sequences, without modeling event marks\. CDiff \[[69](https://arxiv.org/html/2606.24982#bib.bib17)\] introduces two interacting diffusion processes for long\-horizon marked event forecasting, but as a non\-autoregressive diffusion model, it models fixed\-length future sequences given the history\. When directly applied to unconditional generation, such an approach needs to model the complete sequence and generate it in one shot, and thus can only produce sequences with a pre\-specified number of events\. In contrast, LBDTPP generates marked event sequences block by block, enabling variable\-length generation in both unconditional and conditional settings\.

### II\-C Block Diffusion Models

Block diffusion models \[[18](https://arxiv.org/html/2606.24982#bib.bib23),[1](https://arxiv.org/html/2606.24982#bib.bib18),[2](https://arxiv.org/html/2606.24982#bib.bib80)\] have been proposed to integrate the strengths of autoregressive and diffusion language models, supporting variable\-length, high\-quality generation and improving inference efficiency with key\-value \(KV\) caching and parallel sampling\. These methods are mainly designed for discrete token sequences: they partition tokens into blocks and model each block conditioned on preceding blocks with discrete diffusion \[[22](https://arxiv.org/html/2606.24982#bib.bib85),[3](https://arxiv.org/html/2606.24982#bib.bib52)\], enabling parallel generation within blocks while maintaining dependencies across blocks\. Such a semi\-autoregressive paradigm has shown promise in video generation \[[50](https://arxiv.org/html/2606.24982#bib.bib59),[9](https://arxiv.org/html/2606.24982#bib.bib58),[74](https://arxiv.org/html/2606.24982#bib.bib81)\] and diffusion large language models \[[44](https://arxiv.org/html/2606.24982#bib.bib54),[60](https://arxiv.org/html/2606.24982#bib.bib53),[63](https://arxiv.org/html/2606.24982#bib.bib83),[62](https://arxiv.org/html/2606.24982#bib.bib84)\]\. Different from these works, LBDTPP targets asynchronous event sequences, where each event contains a continuous timestamp and a discrete mark\. We therefore introduce latent block diffusion, which embeds mixed\-type events into a latent space and performs continuous Gaussian diffusion within latent blocks\. This latent formulation preserves the length flexibility and intra\-block parallelism of block diffusion, while making it suitable for event sequence modeling and generation\.

## III Preliminary

In this section, we provide an overview of temporal point processes and diffusion probabilistic models\. Throughout this paper, the symbol $t$ denotes the event timestamp, and $k$ denotes the $k$\-th step in the diffusion process\.

### III\-A Temporal Point Processes

Consider a set of asynchronous event sequences drawn from the data distribution $q(\mathbf{x})$\. Each sequence is represented as $\mathbf{x}=\left(\mathbf{x}^{1},\ldots,\mathbf{x}^{L}\right)$, and the $\ell$\-th event $\mathbf{x}^{\ell}=(t^{\ell},m^{\ell})$ consists of the occurrence timestamp $t^{\ell}\in\mathbb{R}_{+}$ and the mark $m^{\ell}\in[M]:=\{1,\ldots,M\}$, with $t^{\ell-1}<t^{\ell}$\. The sequence length $L$, i\.e\., the number of events in the sequence, can vary across different sequences\. The event timestamps can be equivalently expressed as the inter\-event times $\tau^{\ell}=t^{\ell}-t^{\ell-1}\in\mathbb{R}_{+}$, where $t^{0}=0$\. If not otherwise specified, we use the inter\-event time representation, i\.e\., $\mathbf{x}^{\ell}=(\tau^{\ell},m^{\ell})$\. The goal is to fit a model $p_{\theta}(\mathbf{x})$ of $q(\mathbf{x})$ to learn the event sequence distribution, and then use the learned model to generate high\-fidelity sequences or make accurate event predictions\.
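The equivalence between timestamps and inter\-event times can be illustrated with a minimal NumPy sketch\. The helper names `to_inter_event_times` and `to_timestamps` and the toy sequence are hypothetical and are not taken from the authors' code\.

```python
import numpy as np

def to_inter_event_times(timestamps):
    """Map t^1 < ... < t^L to tau^l = t^l - t^(l-1), using t^0 = 0."""
    t = np.asarray(timestamps, dtype=float)
    return np.diff(t, prepend=0.0)

def to_timestamps(inter_event_times):
    """Invert the mapping: t^l is the cumulative sum of tau^1, ..., tau^l."""
    return np.cumsum(np.asarray(inter_event_times, dtype=float))

# Toy marked sequence (illustrative values only): three events with marks in {1, 2}.
timestamps = [0.5, 1.25, 3.0]
marks = [2, 1, 2]

taus = to_inter_event_times(timestamps)
# Each event x^l is then the pair (tau^l, m^l).
events = list(zip(taus.tolist(), marks))
```

Because the two representations are related by a cumulative sum, a model can work with inter\-event times and still recover absolute timestamps after generation\.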

The common approach for modeling event sequences is using temporal point processes \(TPPs\) \[[10](https://arxiv.org/html/2606.24982#bib.bib21)\], which characterize the occurrence of discrete events in continuous time by defining conditional intensity functions \(CIFs\) \[[12](https://arxiv.org/html/2606.24982#bib.bib49)\] or conditional probability density functions \(PDFs\) \[[52](https://arxiv.org/html/2606.24982#bib.bib3)\]\. Most existing TPP models, such as Poisson processes \[[27](https://arxiv.org/html/2606.24982#bib.bib56)\], self\-correcting processes \[[23](https://arxiv.org/html/2606.24982#bib.bib40)\], Hawkes processes \[[20](https://arxiv.org/html/2606.24982#bib.bib39)\] and their Transformer variants \[[70](https://arxiv.org/html/2606.24982#bib.bib60),[79](https://arxiv.org/html/2606.24982#bib.bib4)\], typically parameterize the distribution of $L$ events in the autoregressive \(AR\) form:

$$\log p_{\theta}(\mathbf{x})=\sum_{\ell=1}^{L}\log p_{\theta}\left(\mathbf{x}^{\ell}\mid\mathbf{x}^{<\ell}\right), \tag{1}$$

where $\mathbf{x}^{<\ell}=\left(\mathbf{x}^{1},\ldots,\mathbf{x}^{\ell-1}\right)$ denotes the historical events before the $\ell$\-th event\. Since the conditional PDF $p_{\theta}\left(\mathbf{x}^{\ell}\mid\mathbf{x}^{<\ell}\right)$ and the CIF $\lambda_{\theta}\left(\mathbf{x}^{\ell}\mid\mathbf{x}^{<\ell}\right)$ can be expressed in terms of each other \[[49](https://arxiv.org/html/2606.24982#bib.bib51)\], the above AR factorized distribution can also be equivalently specified using CIFs\.

We emphasize that the autoregressive factorization used here is defined for a finite realized event sequence of length $L$, where the modeling target is the probability of the realized sequence itself \[[6](https://arxiv.org/html/2606.24982#bib.bib88)\], rather than the full point\-process likelihood up to a fixed terminal time $T$\. Under the latter formulation, an additional survival term over the terminal interval would indeed appear\. However, under our current sequence\-level modeling setup, this term is not required, and omitting it does not affect subsequent model training or optimization objectives\.

While autoregressive TPPs have been successful in generating a single subsequent event \[[79](https://arxiv.org/html/2606.24982#bib.bib4),[67](https://arxiv.org/html/2606.24982#bib.bib2),[65](https://arxiv.org/html/2606.24982#bib.bib12)\], their one\-by\-one sequential sampling procedure can lead to error accumulation in multi\-step generation, thereby degrading the overall generation performance \[[66](https://arxiv.org/html/2606.24982#bib.bib8),[69](https://arxiv.org/html/2606.24982#bib.bib17)\]\.

### III\-B Diffusion Probabilistic Models

To simplify the exposition, we allow a slight abuse of notation\. In this subsection, we rewrite $\mathbf{x}\sim q(\mathbf{x})$ as a vector in the Euclidean space $\mathbb{R}^{L}$\. Diffusion probabilistic models \(hereafter diffusion models\) \[[56](https://arxiv.org/html/2606.24982#bib.bib25),[21](https://arxiv.org/html/2606.24982#bib.bib24)\] overcome the aforementioned sequential sampling limitation by learning the distribution $p_{\theta}(\mathbf{x})$ directly, admitting parallel generation\.

Diffusion models define a forward process that gradually adds Gaussian noise to the clean data $\mathbf{x}_{0}=\mathbf{x}$:

$$q\left(\mathbf{x}_{1:K}\mid\mathbf{x}_{0}\right)=\prod_{k=1}^{K}q\left(\mathbf{x}_{k}\mid\mathbf{x}_{k-1}\right), \tag{2}$$

$$q\left(\mathbf{x}_{k}\mid\mathbf{x}_{k-1}\right)=\mathcal{N}\left(\mathbf{x}_{k};\sqrt{\alpha_{k}}\,\mathbf{x}_{k-1},(1-\alpha_{k})\mathbf{I}\right), \tag{3}$$

where $\alpha_{1},\ldots,\alpha_{K}$ are decreasing values in $[0,1]$, and $K$ is the total number of diffusion steps\.

On the other hand, the reverse denoising process starts from $p\left(\mathbf{x}_{K}\right)=\mathcal{N}\left(\mathbf{x}_{K};\mathbf{0},\mathbf{I}\right)$ and proceeds as follows:

$$p_{\theta}\left(\mathbf{x}_{0:K}\right)=p\left(\mathbf{x}_{K}\right)\prod_{k=1}^{K}p_{\theta}\left(\mathbf{x}_{k-1}\mid\mathbf{x}_{k}\right), \tag{4}$$

$$p_{\theta}\left(\mathbf{x}_{k-1}\mid\mathbf{x}_{k}\right)=\mathcal{N}\left(\mathbf{x}_{k-1};\boldsymbol{\mu}_{\theta}\left(\mathbf{x}_{k},k\right),\sigma_{k}^{2}\mathbf{I}\right), \tag{5}$$

where $\sigma_{k}^{2}=1-\alpha_{k}$\. To learn the parameters $\theta$, the standard variational inference method involves minimizing the negative evidence lower bound \(NELBO\) \[[40](https://arxiv.org/html/2606.24982#bib.bib37)\]:

$$-\log p_{\theta}\left(\mathbf{x}_{0}\right)\leq\mathbb{E}_{q\left(\mathbf{x}_{1:K}\mid\mathbf{x}_{0}\right)}\left[-\log\frac{p_{\theta}\left(\mathbf{x}_{0:K}\right)}{q\left(\mathbf{x}_{1:K}\mid\mathbf{x}_{0}\right)}\right]. \tag{6}$$

Prior work \[[21](https://arxiv.org/html/2606.24982#bib.bib24)\] simplified this NELBO, replacing it with the following loss function:

$$\mathcal{L}_{\text{simple}}(\mathbf{x}_{0};\theta)=\mathbb{E}_{k,\mathbf{x}_{0},\boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}\left(\mathbf{x}_{k},k\right)\right\|^{2}\right], \tag{7}$$

where $\bar{\alpha}_{k}=\prod_{s=1}^{k}\alpha_{s}$, $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$, and $\boldsymbol{\epsilon}_{\theta}$ is parameterized by neural networks to predict the noise $\boldsymbol{\epsilon}$ using the noisy input $\mathbf{x}_{k}=\sqrt{\bar{\alpha}_{k}}\,\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{k}}\,\boldsymbol{\epsilon}$ and the diffusion step $k$\.
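The closed\-form noising step and the simplified loss in Eq\. \(7\) can be sketched in NumPy as follows\. This is an illustration under stated assumptions, not the authors' implementation: the linear schedule in `make_schedule` is borrowed from common DDPM practice \(the text only requires decreasing $\alpha_{k}$\), and `eps_model` is a hypothetical stand\-in for the neural noise predictor $\boldsymbol{\epsilon}_{\theta}$\.

```python
import numpy as np

def make_schedule(K, beta_min=1e-4, beta_max=0.02):
    """Assumed linear schedule: alpha_k = 1 - beta_k, decreasing in k."""
    betas = np.linspace(beta_min, beta_max, K)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # alpha_bar_k = prod_{s<=k} alpha_s
    return alphas, alpha_bars

def q_sample(x0, k, alpha_bars, rng):
    """Draw x_k = sqrt(alpha_bar_k) x_0 + sqrt(1 - alpha_bar_k) eps for step k in 1..K."""
    eps = rng.standard_normal(x0.shape)
    abar = alpha_bars[k - 1]  # arrays are 0-indexed, diffusion steps are 1-indexed
    xk = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps
    return xk, eps

def simple_loss(eps_model, x0, alpha_bars, rng):
    """One-sample Monte Carlo estimate of the loss in Eq. (7)."""
    K = len(alpha_bars)
    k = int(rng.integers(1, K + 1))  # uniform over {1, ..., K}
    xk, eps = q_sample(x0, k, alpha_bars, rng)
    return float(np.sum((eps - eps_model(xk, k)) ** 2))
```

In training, this estimate would be averaged over a mini\-batch and minimized with respect to the parameters of `eps_model`; gradient computation is omitted here because NumPy does not provide it\.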

After training, a new data point, still denoted as $\mathbf{x}_{0}$ for convenience, that follows the learned distribution $p_{\theta}(\mathbf{x}_{0})$, can be generated as follows\. We start by sampling $\mathbf{x}_{K}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$, and then conduct the reverse process $p_{\theta}\left(\mathbf{x}_{k-1}\mid\mathbf{x}_{k}\right)$ to iteratively sample $\mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ and compute $\mathbf{x}_{k-1}$ for $k=K,\ldots,1$:

$$\mathbf{x}_{k-1}=\frac{1}{\sqrt{\alpha_{k}}}\Big(\mathbf{x}_{k}-\frac{1-\alpha_{k}}{\sqrt{1-\bar{\alpha}_{k}}}\boldsymbol{\epsilon}_{\theta}\left(\mathbf{x}_{k},k\right)\Big)+\sigma_{k}\mathbf{z}. \tag{8}$$

Although the above standard diffusion model can generate all dimensions of $\mathbf{x}_{0}$ simultaneously, it is tied to a fixed\-dimensional vector of length $L$\. This fixed\-length constraint limits its ability to simulate event sequences in real\-world applications, where the number of events is unknown, varies across sequences, and may exceed the pre\-specified length $L$ as time progresses\. Moreover, if only a short sequence is needed at inference time, generating a full length\-$L$ sequence introduces unnecessary computation and sampling time\.
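The reverse recursion in Eq\. \(8\) can be sketched as a short sampling loop\. This is a minimal sketch under two assumptions that the text does not state: the noise term is dropped at the final step $k=1$, as in the usual DDPM sampler, and `eps_model` is a hypothetical callable standing in for a trained $\boldsymbol{\epsilon}_{\theta}$\. The output shape is fixed in advance, which is exactly the fixed\-length constraint described above\.

```python
import numpy as np

def ancestral_sample(eps_model, shape, alphas, rng):
    """Run Eq. (8) from k = K down to 1, starting from x_K ~ N(0, I)."""
    alphas = np.asarray(alphas, dtype=float)
    alpha_bars = np.cumprod(alphas)
    K = len(alphas)
    x = rng.standard_normal(shape)  # x_K; its shape fixes the output length
    for k in range(K, 0, -1):
        a, abar = alphas[k - 1], alpha_bars[k - 1]
        sigma = np.sqrt(1.0 - a)  # sigma_k^2 = 1 - alpha_k
        # Assumption: no noise is added at the last step (DDPM convention).
        z = rng.standard_normal(shape) if k > 1 else np.zeros(shape)
        mean = (x - (1.0 - a) / np.sqrt(1.0 - abar) * eps_model(x, k)) / np.sqrt(a)
        x = mean + sigma * z
    return x

# Placeholder noise predictor that always returns zeros; a trained network goes here.
dummy_eps_model = lambda x, k: np.zeros_like(x)
sample = ancestral_sample(dummy_eps_model, (4,),
                          1.0 - np.linspace(1e-4, 0.02, 50),
                          np.random.default_rng(0))
```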

![Refer to caption](https://arxiv.org/html/2606.24982v1/x1.png)

Figure 1: The end\-to\-end training framework of LBDTPP\. The position of each event on the timeline represents its timestamp, and the color denotes its event mark\. Our model first embeds the event sequence into a continuous latent space, then learns the latent distribution by autoregressively modeling event blocks and performing Gaussian diffusion within each block, while reconstructing the input event sequence from the latent representations\. We jointly train the block diffusion Transformer and the encoder\-decoder by minimizing a weighted combination of the latent block diffusion loss and the reconstruction loss\.

## IV Proposed Method: LBDTPP

In this section, we present Latent Block\-Diffusion Temporal Point Processes \(LBDTPP\), a novel semi\-autoregressive TPP framework for modeling asynchronous event sequences\. LBDTPP first maps events into a continuous latent space and then models the latent sequence representation in a block\-wise manner: it factorizes dependencies autoregressively across event blocks, and performs Gaussian diffusion to learn each block distribution\. At inference time, LBDTPP sequentially generates latent event blocks, produces multiple latent event representations in parallel within each block, and subsequently decodes them back into the event space\. This design preserves the variable\-length generation ability of autoregressive TPPs while inheriting the parallel high\-quality generation capability of non\-autoregressive diffusion TPPs, and meanwhile mitigates their respective limitations of error accumulation and fixed\-length generation\. The training framework of LBDTPP is illustrated in [Fig\. 1](https://arxiv.org/html/2606.24982#S3.F1)\.

In what follows, we first introduce the model architecture in [Section IV\-A](https://arxiv.org/html/2606.24982#S4.SS1), which consists of the event encoder, latent block diffusion, and event decoder\. Then, we detail the training and sampling procedures for unconditional event sequence generation in Sections [IV\-B](https://arxiv.org/html/2606.24982#S4.SS2) and [IV\-C](https://arxiv.org/html/2606.24982#S4.SS3), respectively\. We next describe the extension to conditional generation in [Section IV\-D](https://arxiv.org/html/2606.24982#S4.SS4), and then provide a theoretical analysis of generation\-error accumulation in [Section IV\-E](https://arxiv.org/html/2606.24982#S4.SS5)\.

### IV-A Model Architecture

Since asynchronous event sequences contain both continuous timestamps and discrete marks, neither standard continuous diffusion \[[21](https://arxiv.org/html/2606.24982#bib.bib24)\] nor discrete diffusion models \[[3](https://arxiv.org/html/2606.24982#bib.bib52)\] can be directly applied to raw event sequences. We therefore first construct continuous latent event representations that jointly encode temporal and mark information, providing a homogeneous space for subsequent generative modeling. Below, we describe the architecture components in detail.

Event Encoder. For each event $\mathbf{x}^{\ell}=(\tau^{\ell},m^{\ell})$ in the sequence $\mathbf{x}=(\mathbf{x}^{1},\ldots,\mathbf{x}^{L})$, we first encode its temporal and mark information separately. Specifically, the inter-event time is mapped to a time embedding $\mathbf{z}_{\tau}^{\ell}=\operatorname{TimeEmbed}(\tau^{\ell})\in\mathbb{R}^{D}$ using positional encoding \[[79](https://arxiv.org/html/2606.24982#bib.bib4)\], while the mark embedding $\mathbf{z}_{m}^{\ell}=\operatorname{MarkEmbed}(m^{\ell})\in\mathbb{R}^{D}$ is obtained by applying a linear transformation to the one-hot representation of $m^{\ell}$:

$$\left[\operatorname{TimeEmbed}(\tau^{\ell})\right]_{d}=\begin{cases}\cos\big(\tau^{\ell}/10000^{\frac{d-1}{D}}\big), & \text{if } d \text{ is odd},\\ \sin\big(\tau^{\ell}/10000^{\frac{d}{D}}\big), & \text{if } d \text{ is even},\end{cases}\tag{11}$$

$$\operatorname{MarkEmbed}(m^{\ell})=\mathbf{W}\cdot\operatorname{OneHot}(m^{\ell}),\tag{12}$$

where $d=0,\ldots,D-1$, and $\operatorname{OneHot}(\cdot):[M]\rightarrow\{0,1\}^{M}$ denotes the one-hot encoding function. The embedding matrix $\mathbf{W}\in\mathbb{R}^{D\times M}$ is initialized from a uniform distribution and kept fixed during training. Our experiments show that keeping it fixed yields slightly better empirical performance. As a result, the event encoder contains no learnable parameters.

The overall representation of event $\mathbf{x}^{\ell}$ is then obtained by adding its time and mark embeddings, i.e., $\mathbf{z}^{\ell}=\mathbf{z}_{\tau}^{\ell}+\mathbf{z}_{m}^{\ell}\in\mathbb{R}^{D}$. In this work, we use addition as the default fusion operation, while other operations such as concatenation are also compatible with our framework. By stacking the event representations in temporal order, we form the representation of the entire sequence $\mathbf{x}$ of length $L$ as $\mathbf{z}=(\mathbf{z}^{1},\ldots,\mathbf{z}^{L})\in\mathbb{R}^{L\times D}$.
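As an illustration of Eqs. (11) and (12), the parameter-free encoder could be sketched in NumPy as follows. This is a minimal sketch rather than the authors' implementation: the function names, the 0-indexed marks, and the uniform range $[-1, 1]$ used to initialize $\mathbf{W}$ are assumptions introduced here.

```python
import numpy as np

def time_embed(tau, D):
    """Sinusoidal embedding of one inter-event time, following Eq. (11)."""
    d = np.arange(D)                          # d = 0, ..., D-1
    odd = (d % 2 == 1)
    # Odd indices use exponent (d-1)/D with cos; even indices use d/D with sin.
    exponent = np.where(odd, d - 1, d) / D
    angle = tau / (10000.0 ** exponent)
    return np.where(odd, np.cos(angle), np.sin(angle))

def make_mark_matrix(D, M, seed=0):
    """Fixed (non-trainable) matrix W in R^{D x M}; the uniform range is an assumption."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-1.0, 1.0, size=(D, M))

def encode(taus, marks, W):
    """Return z in R^{L x D} with z^l = TimeEmbed(tau^l) + W @ OneHot(m^l)."""
    D = W.shape[0]
    z_tau = np.stack([time_embed(t, D) for t in taus])   # (L, D)
    z_m = W[:, marks].T        # multiplying W by a one-hot vector selects a column
    return z_tau + z_m
```

Because $\mathbf{W}$ is never updated, the encoder is a deterministic function of the event sequence, which matches the statement that only the decoder and the denoiser carry learnable parameters.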

Given the sequence representation $\mathbf{z}$, we aim to model its distribution while mitigating error accumulation caused by one-by-one autoregressive sampling and overcoming fixed-length non-autoregressive generation. To this end, we propose to model $\mathbf{z}$ at the block level, which captures event dependencies across blocks, supports variable-length generation, and enables parallel diffusion-based generation within each block.

Latent Block Diffusion. We partition the latent sequence representation $\mathbf{z}=\left(\mathbf{z}^{1},\ldots,\mathbf{z}^{L}\right)$ into $B:=L/L'$ non-overlapping blocks of length $L'$, and assume $B$ is an integer (if not, we pad the raw event sequence in advance so that $L$ is divisible by $L'$). For each $b\in[B]$, we denote the $b$-th event block $\left(\mathbf{z}^{\ell_{b}+1},\ldots,\mathbf{z}^{\ell_{b+1}}\right)$ simply as $\mathbf{z}^{b}$, where $\ell_{b}=(b-1)L'$. Thus, $\mathbf{z}^{b}$ contains $L'$ consecutive event representations. We denote the historical blocks before block $b$ as $\mathbf{z}^{<b}=\left(\mathbf{z}^{1},\ldots,\mathbf{z}^{\ell_{b}}\right)$.

Here and below, superscript $\ell$ indexes individual events, so $\mathbf{z}^{\ell}\in\mathbb{R}^{D}$, whereas superscript $b$ indexes blocks, so $\mathbf{z}^{b}\in\mathbb{R}^{L'\times D}$. This distinction avoids ambiguity between event-level and block-level representations.
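The block notation can be made concrete with a short NumPy sketch. It assumes the sequence has already been padded so that $L$ is divisible by $L'$ (the padding scheme itself is not specified here), and the helper names `partition_into_blocks` and `history` are hypothetical.

```python
import numpy as np

def partition_into_blocks(z, block_len):
    """Split z of shape (L, D) into B = L / block_len blocks of shape (block_len, D)."""
    L, D = z.shape
    if L % block_len != 0:
        raise ValueError("L must be divisible by the block length (pad beforehand)")
    return z.reshape(L // block_len, block_len, D)

def history(z, b, block_len):
    """Return z^{<b} for a 1-indexed block b: the first ell_b = (b-1)*block_len events."""
    return z[: (b - 1) * block_len]
```

With this indexing, block $b$ covers events $\ell_b+1,\ldots,\ell_{b+1}$, and the history of the first block is empty.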

Different from discrete block diffusion \[[1](https://arxiv.org/html/2606.24982#bib.bib18)\] designed for token states, our latent block diffusion operates on continuous event representations that jointly encode timestamp and mark information. Specifically, we model the distribution of the latent sequence representation by factorizing it autoregressively over blocks and performing Gaussian diffusion within each block. That is, we decompose the log-likelihood of the sequence representation $\mathbf{z}$ over blocks as:

$$\log p_{\boldsymbol{\theta}}(\mathbf{z})=\sum_{b=1}^{B}\log p_{\boldsymbol{\theta}}\left(\mathbf{z}^{b}\mid\mathbf{z}^{<b}\right),\tag{13}$$

where each conditional distribution $p_{\boldsymbol{\theta}}\left(\mathbf{z}^{b}\mid\mathbf{z}^{<b}\right)$ is learned by conducting Gaussian diffusion over a block of $L'$ event representations. The sequential factorization captures dependencies across blocks and supports variable-length sequence generation through block-by-block generation, while the within-block diffusion learns the conditional distribution of each block and enables parallel high-quality generation of multiple events to reduce error accumulation. The detailed generation procedure will be described in [Section IV-C](https://arxiv.org/html/2606.24982#S4.SS3), while this part focuses on the modeling formulation.

For each block $b\in[B]$, to learn the conditional distribution $p_{\boldsymbol{\theta}}\left(\mathbf{z}^{b}\mid\mathbf{z}^{<b}\right)$, we define a forward diffusion process that gradually adds Gaussian noise to the clean block $\mathbf{z}_{0}^{b}=\mathbf{z}^{b}$:

$$q\left(\mathbf{z}_{1:K}^{b}\mid\mathbf{z}_{0}^{b}\right)=\prod_{k=1}^{K}q\left(\mathbf{z}_{k}^{b}\mid\mathbf{z}_{k-1}^{b}\right),\tag{14}$$

$$q\left(\mathbf{z}_{k}^{b}\mid\mathbf{z}_{k-1}^{b}\right)=\mathcal{N}\left(\mathbf{z}_{k}^{b};\sqrt{\alpha_{k}}\,\mathbf{z}_{k-1}^{b},(1-\alpha_{k})\mathbf{I}\right),\tag{15}$$

where $\mathbf{z}_{k}^{b}$ denotes the noisy $b$-th block at diffusion step $k$. This noise-adding process operates solely on the currently considered block $b$ and is independent of all other blocks, including both historical and subsequent blocks. Thus, Eqs. ([14](https://arxiv.org/html/2606.24982#S4.E14)) and ([15](https://arxiv.org/html/2606.24982#S4.E15)) are not conditioned on $\mathbf{z}^{<b}$. Similar to standard diffusion models \[[21](https://arxiv.org/html/2606.24982#bib.bib24)\], we can sample $\mathbf{z}_{k}^{b}$ directly based on $\mathbf{z}_{0}^{b}$:

$$q(\mathbf{z}_{k}^{b}\mid\mathbf{z}_{0}^{b})=\mathcal{N}\left(\mathbf{z}_{k}^{b};\sqrt{\bar{\alpha}_{k}}\,\mathbf{z}_{0}^{b},(1-\bar{\alpha}_{k})\mathbf{I}\right),\tag{16}$$

where $\bar{\alpha}_{k}=\prod_{s=1}^{k}\alpha_{s}$. In other words, the noisy block at any step $k$ has the closed-form expression:

$$\mathbf{z}_{k}^{b}=\sqrt{\bar{\alpha}_{k}}\,\mathbf{z}_{0}^{b}+\sqrt{1-\bar{\alpha}_{k}}\,\boldsymbol{\epsilon}^{b},\quad\text{where}\;\;\boldsymbol{\epsilon}^{b}\sim\mathcal{N}(\mathbf{0},\mathbf{I}).\tag{17}$$

Here the added noise $\boldsymbol{\epsilon}^{b}$ is independent across different blocks.
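The closed-form forward sample in Eq. (17) amounts to a few lines of NumPy. The sketch below is illustrative only: the function name `q_sample` and the convention that `alphas[k-1]` stores $\alpha_k$ are assumptions made here, and no particular noise schedule is implied.

```python
import numpy as np

def q_sample(z0_block, k, alphas, rng):
    """Draw z_k^b ~ q(z_k^b | z_0^b) via Eq. (17); k is 1-indexed.

    alphas[k-1] holds alpha_k, so np.cumprod(alphas)[k-1] is alpha_bar_k.
    Returns the noisy block and the noise that was added.
    """
    a_bar_k = np.cumprod(alphas)[k - 1]
    eps = rng.standard_normal(z0_block.shape)     # fresh noise for this block only
    z_k = np.sqrt(a_bar_k) * z0_block + np.sqrt(1.0 - a_bar_k) * eps
    return z_k, eps
```

Calling `q_sample` separately for each block, with its own step `k` and its own noise draw, reflects the statement that the forward process does not depend on the other blocks.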

The corresponding reverse denoising process, defined on block $b$, starts from $p\left(\mathbf{z}_{K}^{b}\mid\mathbf{z}^{<b}\right)=\mathcal{N}\left(\mathbf{z}_{K}^{b};\mathbf{0},\mathbf{I}\right)$ and proceeds as follows:

$$p_{\boldsymbol{\theta}}\left(\mathbf{z}_{0:K}^{b}\mid\mathbf{z}^{<b}\right)=p\left(\mathbf{z}_{K}^{b}\mid\mathbf{z}^{<b}\right)\prod_{k=1}^{K}p_{\boldsymbol{\theta}}\left(\mathbf{z}_{k-1}^{b}\mid\mathbf{z}_{k}^{b},\mathbf{z}^{<b}\right),\tag{18}$$

$$p_{\boldsymbol{\theta}}\left(\mathbf{z}_{k-1}^{b}\mid\mathbf{z}_{k}^{b},\mathbf{z}^{<b}\right)=\mathcal{N}\left(\mathbf{z}_{k-1}^{b};\boldsymbol{\mu}_{\boldsymbol{\theta}}^{b}\left(\mathbf{z}_{k}^{b},\mathbf{z}^{<b},k\right),\sigma_{k}^{2}\mathbf{I}\right).\tag{19}$$

Notably, the reverse process at block $b$ is conditioned on all preceding blocks $\mathbf{z}^{<b}$, enabling the model to learn the conditional distribution $p_{\boldsymbol{\theta}}\left(\mathbf{z}^{b}\mid\mathbf{z}^{<b}\right)$ and capture dependencies from historical event blocks.

Based on the forward and reverse processes described above, we can derive the negative evidence lower bound (NELBO) for each term $\log p_{\boldsymbol{\theta}}\left(\mathbf{z}^{b}\mid\mathbf{z}^{<b}\right)$ in Eq. ([13](https://arxiv.org/html/2606.24982#S4.E13)), and then simplify it to the following loss for block $b$:

$$\mathcal{L}_{\text{LBD}}^{b}(\mathbf{z}^{b},\mathbf{z}^{<b};\boldsymbol{\theta})=\mathbb{E}_{k,\mathbf{z}^{b},\boldsymbol{\epsilon}^{b}}\left[\left\|\mathbf{z}^{b}-\mathbf{z}_{\boldsymbol{\theta}}^{b}\left(\mathbf{z}_{k}^{b},\mathbf{z}^{<b},k\right)\right\|^{2}\right],\tag{20}$$

where $k\sim\operatorname{Unif}\left(\{1,\ldots,K\}\right)$, and the noisy block $\mathbf{z}_{k}^{b}$ is obtained according to Eq. ([17](https://arxiv.org/html/2606.24982#S4.E17)). Moreover, the relationship between the block predictor $\mathbf{z}_{\boldsymbol{\theta}}^{b}$ and the reverse process mean $\boldsymbol{\mu}_{\boldsymbol{\theta}}^{b}$ is given by:

$$\begin{aligned}\boldsymbol{\mu}_{\boldsymbol{\theta}}^{b}\left(\mathbf{z}_{k}^{b},\mathbf{z}^{<b},k\right)={}&\frac{\sqrt{\alpha_{k}}\,(1-\bar{\alpha}_{k-1})}{1-\bar{\alpha}_{k}}\,\mathbf{z}_{k}^{b}\\&+\frac{\sqrt{\bar{\alpha}_{k-1}}\,(1-\alpha_{k})}{1-\bar{\alpha}_{k}}\,\mathbf{z}_{\boldsymbol{\theta}}^{b}\left(\mathbf{z}_{k}^{b},\mathbf{z}^{<b},k\right).\end{aligned}\tag{21}$$

In Eq. ([20](https://arxiv.org/html/2606.24982#S4.E20)), we use the denoising model $\mathbf{z}_{\boldsymbol{\theta}}^{b}\left(\mathbf{z}_{k}^{b},\mathbf{z}^{<b},k\right)$ to directly predict the clean latent block $\mathbf{z}^{b}$ (with the predicted output denoted as $\hat{\mathbf{z}}^{b}$) using the noisy block $\mathbf{z}_{k}^{b}$, historical blocks $\mathbf{z}^{<b}$, and diffusion step $k$. From our early experiments, we found that this $\mathbf{z}^{b}$-prediction approach outperforms predicting the block noise $\boldsymbol{\epsilon}^{b}$. The $B$ denoisers $\mathbf{z}_{\boldsymbol{\theta}}^{b}$ are parameterized by a single Transformer \[[59](https://arxiv.org/html/2606.24982#bib.bib43)\], $\mathbf{z}_{\boldsymbol{\theta}}$, which is equipped with a specialized attention mask to facilitate efficient training as described in [Section IV-B](https://arxiv.org/html/2606.24982#S4.SS2).
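To make Eq. (21) concrete, the following NumPy sketch maps a clean-block prediction to the reverse-process mean. It assumes the convention $\bar{\alpha}_0 = 1$ (the empty product) and stores $\alpha_k$ in `alphas[k-1]`; the function name `reverse_mean` is hypothetical.

```python
import numpy as np

def reverse_mean(z_k, z0_pred, k, alphas):
    """Reverse-process mean of Eq. (21) from a prediction of the clean block; k is 1-indexed."""
    a_bar = np.cumprod(alphas)
    a_k = alphas[k - 1]
    a_bar_k = a_bar[k - 1]
    a_bar_prev = a_bar[k - 2] if k > 1 else 1.0    # alpha_bar_0 = 1 by convention
    coef_zk = np.sqrt(a_k) * (1.0 - a_bar_prev) / (1.0 - a_bar_k)
    coef_z0 = np.sqrt(a_bar_prev) * (1.0 - a_k) / (1.0 - a_bar_k)
    return coef_zk * z_k + coef_z0 * z0_pred
```

Two sanity checks follow directly from the formula: at $k=1$ the mean reduces to the predicted clean block, and if $\mathbf{z}_k^b=\sqrt{\bar{\alpha}_k}\,\mathbf{z}_0^b$ with a perfect prediction, the mean equals $\sqrt{\bar{\alpha}_{k-1}}\,\mathbf{z}_0^b$.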

We then aggregate the latent block diffusion loss over all event blocks as:

$$\mathcal{L}_{\text{LBD}}(\mathbf{z};\boldsymbol{\theta})=\frac{1}{L}\sum_{b=1}^{B}\mathcal{L}_{\text{LBD}}^{b}(\mathbf{z}^{b},\mathbf{z}^{<b};\boldsymbol{\theta}).\tag{22}$$

Formally, we summarize the NELBO of our LBDTPP model and its simplification to the above loss function in the following proposition.

###### Proposition 1 (NELBO of LBDTPP).

Under the sequential factorization over event blocks in Eq. ([13](https://arxiv.org/html/2606.24982#S4.E13)), suppose the block-wise forward diffusion process satisfies Eqs. ([14](https://arxiv.org/html/2606.24982#S4.E14)) and ([15](https://arxiv.org/html/2606.24982#S4.E15)), and the reverse denoising process for each block $b\in[B]$ is given by Eqs. ([18](https://arxiv.org/html/2606.24982#S4.E18)) and ([19](https://arxiv.org/html/2606.24982#S4.E19)). Then the negative log-likelihood of the latent event sequence representation satisfies

$$-\log p_{\boldsymbol{\theta}}(\mathbf{z})\leq\sum_{b=1}^{B}\mathcal{J}_{b}(\mathbf{z}^{b},\mathbf{z}^{<b};\boldsymbol{\theta}),\tag{23}$$

where $\mathcal{J}_{b}(\mathbf{z}^{b},\mathbf{z}^{<b};\boldsymbol{\theta})$ is the standard NELBO for the conditional distribution $p_{\boldsymbol{\theta}}(\mathbf{z}^{b}\mid\mathbf{z}^{<b})$. Moreover, similar to \[[21](https://arxiv.org/html/2606.24982#bib.bib24)\], the upper bound in Eq. ([23](https://arxiv.org/html/2606.24982#S4.E23)) can be further simplified to the surrogate latent block diffusion loss in Eq. ([22](https://arxiv.org/html/2606.24982#S4.E22)).

The proof is provided in the supplementary material. The resulting latent block diffusion loss $\mathcal{L}_{\text{LBD}}(\mathbf{z};\boldsymbol{\theta})$ serves as the latent distribution learning objective: it trains the denoising Transformer to recover clean latent event blocks from noisy ones conditioned on historical blocks, enabling the reverse process to sample coherent latent blocks at inference time. To obtain actual event sequences, the generated latent representations need to be mapped back to the event space.

Algorithm 1: LBDTPP Training

Input: event sequence $\mathbf{x}$ of length $L$, block size $L'$, diffusion steps $K$

repeat

1. $\mathbf{z}\leftarrow\texttt{Encoder}(\mathbf{x})$; $\hat{\mathbf{x}}\leftarrow\texttt{Decoder}(\mathbf{z})$
2. Sample $k_{1},\dots,k_{B}\sim\operatorname{Unif}\left(\{1,\dots,K\}\right)$ $\triangleright$ $B=L/L'$
3. $\forall b\in\{1,\dots,B\}$: $\mathbf{z}_{k_{b}}^{b}\sim q(\,\cdot\mid\mathbf{z}^{b})$ $\triangleright$ Eq. ([16](https://arxiv.org/html/2606.24982#S4.E16))
4. $\emptyset,\mathbf{K}^{1:B},\mathbf{V}^{1:B}\leftarrow\mathbf{z}_{\boldsymbol{\theta}}(\mathbf{z})$ $\triangleright$ KV cache
5. $\forall b$: $\hat{\mathbf{z}}^{b},\emptyset,\emptyset\leftarrow\mathbf{z}_{\boldsymbol{\theta}}^{b}\left(\mathbf{z}_{k_{b}}^{b},\mathbf{K}^{1:b-1},\mathbf{V}^{1:b-1},k_{b}\right)$
6. $\hat{\mathbf{z}}\leftarrow\hat{\mathbf{z}}^{1}\oplus\cdots\oplus\hat{\mathbf{z}}^{B}$
7. Take gradient descent step on $\nabla_{\boldsymbol{\theta},\boldsymbol{\phi}}\,\mathcal{L}_{\text{Overall}}(\mathbf{x};\boldsymbol{\theta},\boldsymbol{\phi})$

until converged
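Steps 2, 3, 5, and 6 of Algorithm 1 correspond to a single-sample Monte Carlo estimate of the loss in Eq. (22). The sketch below loops over blocks for clarity and omits the KV cache and the vectorized attention mask; `denoiser` is a hypothetical callable standing in for $\mathbf{z}_{\boldsymbol{\theta}}^{b}$, and passing the clean history explicitly is an assumption of this illustration.

```python
import numpy as np

def lbd_loss(z, block_len, alphas, denoiser, rng):
    """One Monte Carlo sample of the latent block diffusion loss, Eq. (22).

    denoiser(z_k, hist, k) -> predicted clean block, same shape as z_k.
    """
    L, _ = z.shape
    a_bar = np.cumprod(alphas)
    total = 0.0
    for b in range(L // block_len):
        z_b = z[b * block_len:(b + 1) * block_len]     # clean block z^b
        hist = z[: b * block_len]                      # clean history z^{<b}
        k = int(rng.integers(1, len(alphas) + 1))      # k_b ~ Unif{1, ..., K}
        eps = rng.standard_normal(z_b.shape)
        z_k = np.sqrt(a_bar[k - 1]) * z_b + np.sqrt(1.0 - a_bar[k - 1]) * eps
        z_hat = denoiser(z_k, hist, k)
        total += np.sum((z_b - z_hat) ** 2)            # squared error of the z-prediction
    return total / L
```

A denoiser that returns the true block yields a loss of zero, whereas one that always returns zeros yields $\|\mathbf{z}\|^2/L$, which gives a quick check of the normalization.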

Event Decoder. We design an event decoder to reconstruct the event sequence $\mathbf{x}$ from its latent representation $\mathbf{z}$, so that latent blocks sampled during inference can be decoded back to the original event space. Specifically, the $\ell$-th reconstructed event $\hat{\mathbf{x}}^{\ell}=(\hat{\tau}^{\ell},\hat{m}^{\ell})$ is decoded from $\mathbf{z}^{\ell}$ by:

$$\hat{\tau}^{\ell}=\operatorname{Softplus}\left(\operatorname{MLP}_{\tau}\left(\mathbf{z}^{\ell}\right)\right),\tag{24}$$

$$\hat{m}^{\ell}=\underset{m}{\operatorname{argmax}}\,\hat{\mathbf{p}}^{\ell}[m],\quad\hat{\mathbf{p}}^{\ell}=\operatorname{Softmax}\left(\operatorname{MLP}_{m}\left(\mathbf{z}^{\ell}\right)\right),\tag{25}$$

where $\operatorname{MLP}_{\tau}(\cdot):\mathbb{R}^{D}\rightarrow\mathbb{R}$ and $\operatorname{MLP}_{m}(\cdot):\mathbb{R}^{D}\rightarrow\mathbb{R}^{M}$ are two learnable multi-layer perceptrons (MLPs) for inter-event time and event mark reconstruction, respectively. The $\operatorname{Softplus}$ activation function ensures the non-negativity of the reconstructed inter-event times. Here, $\hat{\mathbf{p}}^{\ell}[m]$ denotes the $m$-th element of the probability vector $\hat{\mathbf{p}}^{\ell}\in(0,1)^{M}$, i.e., the predicted probability assigned to the mark $m$.

Importantly, we train this event decoder to reconstruct event sequences accurately, such that latent representations generated by the reverse denoising process can be mapped to high-quality event sequences at inference time. We therefore define the reconstruction loss using the mean squared error for inter-event times and the cross-entropy loss for one-hot encoded marks:

$$\begin{aligned}\mathcal{L}_{\text{Recon}}(\mathbf{x},\hat{\mathbf{x}};\boldsymbol{\phi})&=\frac{1}{L}\sum_{b=1}^{B}\mathcal{L}_{\text{Recon}}^{b}(\mathbf{x}^{b},\hat{\mathbf{x}}^{b};\boldsymbol{\phi})\\&=\frac{1}{L}\sum_{b=1}^{B}\sum_{\ell=\ell_{b}+1}^{\ell_{b+1}}\Big(\left(\tau^{\ell}-\hat{\tau}^{\ell}\right)^{2}-\log\hat{\mathbf{p}}^{\ell}[m^{\ell}]\Big).\end{aligned}\tag{26}$$

As previously mentioned, the event encoder does not contain learnable parameters; thus, optimizing the reconstruction loss $\mathcal{L}_{\text{Recon}}$ only updates the event decoder parameters $\boldsymbol{\phi}$.
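The decoder heads in Eqs. (24) and (25) and the loss in Eq. (26) can be illustrated without committing to a particular MLP architecture. In the sketch below, `time_out` and `mark_logits` stand in for the raw outputs of $\operatorname{MLP}_{\tau}$ and $\operatorname{MLP}_{m}$; these names and the 0-indexed marks are assumptions of this example.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def log_softmax(logits):
    """Row-wise log-softmax with the usual max-shift for stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def reconstruction_loss(taus, marks, time_out, mark_logits):
    """Eq. (26): per-event squared time error plus mark cross-entropy, averaged over L.

    time_out:    (L,)   outputs of MLP_tau before Softplus
    mark_logits: (L, M) outputs of MLP_m before Softmax
    """
    tau_hat = softplus(time_out)                          # Eq. (24)
    logp = log_softmax(mark_logits)                       # log of p-hat in Eq. (25)
    nll = -logp[np.arange(len(marks)), marks]             # -log p-hat[m^l]
    return float(np.mean((taus - tau_hat) ** 2 + nll))

def decode_marks(mark_logits):
    """Hard mark decision of Eq. (25)."""
    return np.argmax(mark_logits, axis=-1)
```

Since the double sum over blocks and events in Eq. (26) visits every event exactly once, averaging over all $L$ events gives the same value.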

Algorithm 2: LBDTPP Sampling

Input: model $\mathbf{z}_{\boldsymbol{\theta}}$, generation interval $[0,T]$

$\mathbf{x},\mathbf{K},\mathbf{V}\leftarrow\emptyset$; $b=1$, $\ell_{b}=0$, $t^{\ell_{b}}=0$

while $t^{\ell_{b}}<T$ do

1. $\mathbf{z}^{b}\leftarrow\texttt{SAMPLE}\left(\mathbf{z}_{\boldsymbol{\theta}}^{b},\mathbf{K}^{1:b-1},\mathbf{V}^{1:b-1}\right)$ $\triangleright$ $\texttt{len}(\mathbf{z}^{b})=L'$
2. $\emptyset,\mathbf{K}^{b},\mathbf{V}^{b}\leftarrow\mathbf{z}_{\boldsymbol{\theta}}^{b}\left(\mathbf{z}^{b}\right)$ $\triangleright$ KV cache
3. $(\mathbf{K},\mathbf{V})\leftarrow\left(\mathbf{K}^{1:b-1}\oplus\mathbf{K}^{b},\mathbf{V}^{1:b-1}\oplus\mathbf{V}^{b}\right)$
4. $\mathbf{x}^{b}=\{(\tau^{\ell_{b}+i},m^{\ell_{b}+i})\}_{i=1}^{L'}\leftarrow\texttt{Decoder}(\mathbf{z}^{b})$
5. $\mathbf{x}\leftarrow\mathbf{x}^{1:b-1}\oplus\mathbf{x}^{b}$
6. $\ell_{b+1}=\ell_{b}+L'$; $t^{\ell_{b+1}}=t^{\ell_{b}}+\sum_{i=1}^{L'}\tau^{\ell_{b}+i}$
7. $b\leftarrow b+1$

end

return $\text{truncate}(\mathbf{x},T)$ $\triangleright$ Truncate at $T$ based on timestamps

### IV-B End-to-End Training

We train LBDTPP end-to-end by jointly optimizing two objectives: the latent block diffusion loss in Eq. ([22](https://arxiv.org/html/2606.24982#S4.E22)) and the reconstruction loss in Eq. ([26](https://arxiv.org/html/2606.24982#S4.E26)). The former loss learns the conditional distribution of latent event blocks, while the latter ensures that latent representations can be decoded into events that closely match the original sequence. The overall objective is defined as their weighted combination:

$$\mathcal{L}_{\text{Overall}}(\mathbf{x};\boldsymbol{\theta},\boldsymbol{\phi})=\mathcal{L}_{\text{LBD}}+\lambda\mathcal{L}_{\text{Recon}}.\tag{27}$$

Here, the hyperparameter $\lambda>0$ balances the two loss terms.

To efficiently compute the latent block diffusion loss $\mathcal{L}_{\text{LBD}}$, we pre-calculate the keys and values $\mathbf{K}^{1:B},\mathbf{V}^{1:B}$ for the full event sequence representation $\mathbf{z}$ in a first forward pass $(\emptyset,\mathbf{K}^{1:B},\mathbf{V}^{1:B})\leftarrow\mathbf{z}_{\boldsymbol{\theta}}(\mathbf{z})$, as shown in [Algorithm 1](https://arxiv.org/html/2606.24982#alg1), where $\mathbf{K}^{b}$ and $\mathbf{V}^{b}$ correspond to block $b$. We then compute the denoised prediction $\hat{\mathbf{z}}^{b}$ using $\mathbf{z}_{\boldsymbol{\theta}}^{b}(\mathbf{z}_{k_{b}}^{b},\mathbf{K}^{1:b-1},\mathbf{V}^{1:b-1},k_{b})$ for each $b\in[B]$, where the cached keys and values from the preceding blocks $\mathbf{z}^{<b}$ are utilized. In practice, instead of invoking the denoising network $\mathbf{z}_{\boldsymbol{\theta}}^{b}$ in a loop $B$ times, we adopt a vectorized implementation that leverages the block diffusion attention mask \[[1](https://arxiv.org/html/2606.24982#bib.bib18)\]. This specialized attention mask for the concatenation $\mathbf{z}_{\text{noisy}}\oplus\mathbf{z}$ ensures that noisy event representations attend to other noisy event representations in their block and to all clean event representations in preceding blocks, which allows us to compute $\mathcal{L}_{\text{LBD}}$ in a single forward pass on $\mathbf{z}_{\text{noisy}}\oplus\mathbf{z}$. Here, the noisy sequence representation $\mathbf{z}_{\text{noisy}}:=\mathbf{z}_{k_{1}}^{1}\oplus\cdots\oplus\mathbf{z}_{k_{B}}^{B}$ is obtained by applying a noise level $k_{b}$ to each block $\mathbf{z}^{b}$ based on Eq. ([17](https://arxiv.org/html/2606.24982#S4.E17)).
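One way to picture the attention pattern described above is as a boolean mask over the $2L$ positions of $\mathbf{z}_{\text{noisy}}\oplus\mathbf{z}$. The sketch below encodes the two rules stated in the text for noisy queries. The block-causal rule for clean queries is an assumption made here, following the usual convention of block diffusion masks, and the function name is hypothetical.

```python
import numpy as np

def block_diffusion_mask(L, block_len):
    """Boolean (2L, 2L) mask over [z_noisy ; z]; True means the query may attend to the key."""
    blk = np.arange(L) // block_len              # block index of each position
    same = blk[:, None] == blk[None, :]          # query and key lie in the same block
    earlier = blk[:, None] > blk[None, :]        # key block strictly precedes query block
    mask = np.zeros((2 * L, 2 * L), dtype=bool)
    mask[:L, :L] = same                          # noisy -> noisy, within the block
    mask[:L, L:] = earlier                       # noisy -> clean, preceding blocks only
    mask[L:, L:] = same | earlier                # clean -> clean, block-causal (assumption)
    return mask                                  # clean -> noisy stays False
```

Under this mask, each noisy block sees exactly the inputs of $\mathbf{z}_{\boldsymbol{\theta}}^{b}(\mathbf{z}_{k_b}^{b},\mathbf{z}^{<b},k_b)$, so all $B$ block predictions can be read off from one forward pass.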

### IV-C Unconditional Generation

During inference, we sample one block of $L'$ latent event representations at a time, conditioned on the previously sampled blocks, with the keys and values cached to avoid redundant computations. Note that any diffusion sampling algorithm SAMPLE, such as DDPM \[[21](https://arxiv.org/html/2606.24982#bib.bib24)\] or DDIM \[[57](https://arxiv.org/html/2606.24982#bib.bib41)\], can be utilized to perform the reverse process in Eqs. ([18](https://arxiv.org/html/2606.24982#S4.E18)) and ([19](https://arxiv.org/html/2606.24982#S4.E19)), ultimately obtaining a sample $\mathbf{z}^{b}$ from $p_{\boldsymbol{\theta}}\left(\mathbf{z}^{b}\mid\mathbf{z}^{<b}\right)$. Then, the latent block $\mathbf{z}^{b}$ is decoded to obtain the corresponding $L'$ events $\mathbf{x}^{b}$. This process is repeated until the generated sequence reaches the termination time $T$, after which the sequence is truncated at $T$, as the event timestamps from the last block may exceed $T$. We summarize this unconditional sampling procedure in [Algorithm 2](https://arxiv.org/html/2606.24982#alg2).

Crucially, the sampling procedure of LBDTPP offers two advantages: \(i\) it generates multiple high\-quality events simultaneously via Gaussian diffusion within blocks, which can reduce error accumulation caused by one\-by\-one generation in autoregressive TPPs; and \(ii\) it enables variable\-length event sequence generation in a block\-by\-block manner, overcoming the fixed\-length generation limitation of non\-autoregressive diffusion TPPs\.
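The control flow of the block-by-block sampler, including the final truncation at $T$, can be sketched as follows. The callables `sample_block` and `decode_block` are hypothetical stand-ins for SAMPLE and the event decoder. The KV cache is replaced by an explicit list of past latent blocks, and the `max_blocks` cap is a safety guard added here rather than part of the described procedure.

```python
import numpy as np

def generate(sample_block, decode_block, T, max_blocks=1000):
    """Generate events on [0, T] one block at a time.

    sample_block(history) -> latent block of shape (L', D)
    decode_block(z_block) -> (taus, marks), each of length L'
    """
    history = []                 # previously sampled latent blocks
    times, marks = [], []
    t = 0.0                      # timestamp of the last generated event
    for _ in range(max_blocks):
        if t >= T:               # stop once the sequence reaches the horizon
            break
        z_b = sample_block(history)
        taus_b, marks_b = decode_block(z_b)
        for tau, m in zip(taus_b, marks_b):
            t += float(tau)      # timestamps accumulate inter-event times
            times.append(t)
            marks.append(int(m))
        history.append(z_b)
    # The last block may overshoot T, so keep only events inside [0, T].
    keep = [i for i, ti in enumerate(times) if ti <= T]
    return [times[i] for i in keep], [marks[i] for i in keep]
```

The number of returned events is not fixed in advance: it depends on how many blocks are needed to reach $T$ and on where the truncation falls inside the last block.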

### IV-D Conditional Generation

Next, we extend the unconditional modeling and sampling methods introduced above to the context of conditional generation. We re-express an event sequence in the interval $[0,T]$ as $\mathbf{x}=(\mathbf{x}_{c},\mathbf{x}_{u})$, where $\mathbf{x}_{c}=(\mathbf{x}^{1},\ldots,\mathbf{x}^{L_{c}})$ represents the historical events in $[0,T_{c}]$, $\mathbf{x}_{u}=(\mathbf{x}^{L_{c}+1},\ldots,\mathbf{x}^{L})$ represents the future events in $(T_{c},T]$, and $0<T_{c}<T$. The goal of conditional generation is to generate the future sequence $\mathbf{x}_{u}$ based on the historical sequence $\mathbf{x}_{c}$.

To achieve this, we first encode the entire sequence $\mathbf{x}$ into its latent representation $\mathbf{z}=(\mathbf{z}_{c},\mathbf{z}_{u})$ using the event encoder described in [Section IV-A](https://arxiv.org/html/2606.24982#S4.SS1), where $\mathbf{z}_{c}$ and $\mathbf{z}_{u}$ correspond to the historical and future sequence representations, respectively. We then partition $\mathbf{z}_{u}$ into $B_{u}:=L_{u}/L'$ blocks of length $L'$, where $L_{u}=L-L_{c}$. Similar to Eq. ([13](https://arxiv.org/html/2606.24982#S4.E13)), we factorize the log-likelihood of the future sequence representation $\mathbf{z}_{u}$ conditioned on the historical representation $\mathbf{z}_{c}$ as:

$$\log p_{\boldsymbol{\theta}}(\mathbf{z}_{u}\mid\mathbf{z}_{c})=\sum_{b_{u}=1}^{B_{u}}\log p_{\boldsymbol{\theta}}\left(\mathbf{z}_{u}^{b_{u}}\mid\mathbf{z}_{c},\mathbf{z}_{u}^{<b_{u}}\right),\tag{28}$$

where $\mathbf{z}_{u}^{b_{u}}$ denotes the $b_{u}$-th block of $\mathbf{z}_{u}$, and $\mathbf{z}_{u}^{<b_{u}}$ represents all historical blocks before block $b_{u}$ within $\mathbf{z}_{u}$. Each conditional distribution $p_{\boldsymbol{\theta}}\left(\mathbf{z}_{u}^{b_{u}}\mid\mathbf{z}_{c},\mathbf{z}_{u}^{<b_{u}}\right)$ is modeled by Gaussian diffusion over block $b_{u}$, similar to Eqs. ([14](https://arxiv.org/html/2606.24982#S4.E14))–([15](https://arxiv.org/html/2606.24982#S4.E15)) and Eqs. ([18](https://arxiv.org/html/2606.24982#S4.E18))–([19](https://arxiv.org/html/2606.24982#S4.E19)). The training algorithm for conditional generation closely follows that of unconditional generation. Specifically, both of them share the same reconstruction loss. Besides, the latent block diffusion loss of conditional generation can be seen as a part of the summation term in Eq. ([22](https://arxiv.org/html/2606.24982#S4.E22)), since the log-likelihood of conditional generation in Eq. ([28](https://arxiv.org/html/2606.24982#S4.E28)) is a component of that of unconditional generation in Eq. ([13](https://arxiv.org/html/2606.24982#S4.E13)), i.e., $\log p_{\boldsymbol{\theta}}(\mathbf{z})=\log p_{\boldsymbol{\theta}}(\mathbf{z}_{c})+\log p_{\boldsymbol{\theta}}(\mathbf{z}_{u}\mid\mathbf{z}_{c})$.

For conditional generation, unlike unconditional generation where the entire sequence is sampled from scratch, we only need to generate the future sequence $\mathbf{x}_{u}$ given the historical sequence $\mathbf{x}_{c}$. Similar to [Algorithm 2](https://arxiv.org/html/2606.24982#alg2), this is achieved by sequentially sampling the future block $\mathbf{z}_{u}^{b_{u}}$ through the diffusion sampling algorithm SAMPLE, conditioned on both the historical representation $\mathbf{z}_{c}$ and the previously sampled future blocks $\mathbf{z}_{u}^{<b_{u}}$. Subsequently, we decode $\mathbf{z}_{u}^{b_{u}}$ to obtain the future events $\mathbf{x}_{u}^{b_{u}}$. This procedure is repeated until the termination time $T$, where the sequence is then truncated.

### IV-E Theoretical Analysis

We now provide a theoretical analysis under the assumptions stated below to illustrate how block-wise generation can reduce error accumulation compared with event-wise generation in unconditional generation. The conditional case follows similarly by fixing the observed history. We measure error accumulation by the Wasserstein discrepancy between the generated latent sequence distribution and the true latent sequence distribution. We consider a length-$L$ latent sequence generated from scratch by an unconditional model, and assume $L=BL'$ after padding if necessary. The stopping rule based on the termination time $T$ only determines how many generated events are retained, and is thus orthogonal to the prefix-level error accumulation analyzed below. We emphasize prefix-level accumulation because each transition is conditioned on previously generated events or blocks, so discrepancies in the generated prefix can perturb the conditioning context and propagate to subsequent transitions.

For two latent sequences $\mathbf{u},\mathbf{v}\in\mathbb{R}^{r\times D}$, define the additive sequence metric

$$d_{r}(\mathbf{u},\mathbf{v})=\sum_{\ell=1}^{r}\|\mathbf{u}^{\ell}-\mathbf{v}^{\ell}\|_{2},\tag{29}$$

and let $W_{1}^{r}(\cdot,\cdot)$ denote the Wasserstein-1 distance induced by $d_{r}$. Denote by $\mathsf{P}_{1:L}$ the true unconditional latent distribution of the length-$L$ sequence, by $\mathsf{Q}_{1:L}^{\mathrm{AR}}$ the distribution generated from scratch by an event-wise autoregressive model, and by $\mathsf{Q}_{1:L}^{\mathrm{BL}}$ the distribution generated from scratch block by block.
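The metric in Eq. (29) is simply a sum of per-event Euclidean distances, as the small NumPy sketch below shows (the function name `d_r` mirrors the notation and is otherwise an arbitrary choice).

```python
import numpy as np

def d_r(u, v):
    """Additive sequence metric of Eq. (29) for u, v of shape (r, D)."""
    return float(np.linalg.norm(u - v, axis=-1).sum())
```

Because the metric is additive over events, the distance between two concatenated sequences equals the sum of the distances between their parts, which is what allows per-block discrepancies to be added up along a generated prefix.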

###### Assumption 1 (Uniform local approximation and prefix stability).

For event-wise autoregressive generation, let $\mathsf{P}_{\ell}(\cdot\mid\mathbf{z}^{<\ell})$ and $\mathsf{Q}_{\ell}^{\mathrm{AR}}(\cdot\mid\mathbf{z}^{<\ell})$ denote the true and learned one-event transition kernels in the chain-rule factorization of the unconditional sequence distribution. There exist constants $\varepsilon_{\mathrm{AR}}\geq 0$ and $\rho_{\mathrm{AR}}\geq 0$ such that, for all $\ell$ and event-prefix realizations $\mathbf{h},\mathbf{h}'\in\mathbb{R}^{(\ell-1)\times D}$,

$$W_{1}^{1}\!\left(\mathsf{P}_{\ell}(\cdot\mid\mathbf{h}),\mathsf{Q}_{\ell}^{\mathrm{AR}}(\cdot\mid\mathbf{h})\right)\leq\varepsilon_{\mathrm{AR}},\tag{30}$$

$$W_{1}^{1}\!\left(\mathsf{P}_{\ell}(\cdot\mid\mathbf{h}),\mathsf{P}_{\ell}(\cdot\mid\mathbf{h}')\right)\leq\rho_{\mathrm{AR}}\,d_{\ell-1}(\mathbf{h},\mathbf{h}').\tag{31}$$

For block-wise unconditional generation, let $\mathsf{P}_{b}^{\mathrm{BL}}(\cdot\mid\mathbf{z}^{<b})$ and $\mathsf{Q}_{b}^{\mathrm{BL}}(\cdot\mid\mathbf{z}^{<b})$ denote the true and learned transition kernels for the $b$-th block in the block factorization of the unconditional sequence distribution. There exist constants $\varepsilon_{\mathrm{BL}}\geq 0$ and $\rho_{\mathrm{BL}}\geq 0$ such that, for all $b$ and block-prefix realizations $\mathbf{g},\mathbf{g}'\in\mathbb{R}^{(b-1)L'\times D}$,

$$W_{1}^{L'}\!\left(\mathsf{P}_{b}^{\mathrm{BL}}(\cdot\mid\mathbf{g}),\mathsf{Q}_{b}^{\mathrm{BL}}(\cdot\mid\mathbf{g})\right)\leq\varepsilon_{\mathrm{BL}},\tag{32}$$

$$W_{1}^{L'}\!\left(\mathsf{P}_{b}^{\mathrm{BL}}(\cdot\mid\mathbf{g}),\mathsf{P}_{b}^{\mathrm{BL}}(\cdot\mid\mathbf{g}')\right)\leq\rho_{\mathrm{BL}}\,d_{(b-1)L'}(\mathbf{g},\mathbf{g}').\tag{33}$$

The following theorem formalizes the comparison between event-wise and block-wise generation. Its purpose is to isolate the accumulation mechanism rather than to assert that block-wise generation always has a smaller local approximation error: when the block-level approximation and stability are comparable to their event-wise counterparts, reducing the number of recursive transitions from $L$ events to $B$ blocks reduces the opportunities for prefix-level errors to propagate.

###### Theorem 1 (Prefix-level generation-error accumulation in the unconditional setting).

Under [Assumption 1](https://arxiv.org/html/2606.24982#Thmassumption1), define

$$A_{n}(\rho)=\begin{cases}\dfrac{(1+\rho)^{n}-1}{\rho},&\rho>0,\\[4pt] n,&\rho=0.\end{cases} \tag{34}$$

Then the event-wise autoregressive generator satisfies

$$W_{1}^{L}\!\left(\mathsf{P}_{1:L},\mathsf{Q}_{1:L}^{\mathrm{AR}}\right)\leq\varepsilon_{\mathrm{AR}}A_{L}(\rho_{\mathrm{AR}}), \tag{35}$$

whereas the block-wise generator satisfies

$$W_{1}^{L}\!\left(\mathsf{P}_{1:L},\mathsf{Q}_{1:L}^{\mathrm{BL}}\right)\leq\varepsilon_{\mathrm{BL}}A_{B}(\rho_{\mathrm{BL}}). \tag{36}$$

Furthermore, if $\varepsilon_{\mathrm{BL}}\leq L'\varepsilon_{\mathrm{AR}}$ and $\rho_{\mathrm{BL}}\leq\rho_{\mathrm{AR}}=\rho$, then the relative block-wise accumulation factor, i.e., the ratio between the block-wise and event-wise accumulation upper bounds under these conditions, is bounded by

$$\frac{L'A_{B}(\rho)}{A_{L}(\rho)}\leq 1, \tag{37}$$

with strict inequality when $\rho>0$ and $L'>1$.
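One elementary way to see why the ratio in Eq. (37) cannot exceed one is sketched below. This is an illustrative argument written for this passage, assuming only the definition in Eq. (34) and $L=BL'$; it is not taken from the paper's supplementary proof, and the shorthand $q$ is introduced here for convenience.

```latex
% Geometric-sum form of Eq. (34), valid for every \rho \ge 0:
A_n(\rho) = \sum_{j=0}^{n-1} (1+\rho)^{j}.

% Split the L = B L' terms into B groups of L' terms, with q := (1+\rho)^{L'}:
A_L(\rho)
  = \sum_{b=0}^{B-1} \sum_{i=0}^{L'-1} (1+\rho)^{bL'+i}
  = \Bigl( \sum_{b=0}^{B-1} q^{\,b} \Bigr) A_{L'}(\rho).

% Since q \ge 1+\rho and every term of A_{L'}(\rho) is at least 1:
\sum_{b=0}^{B-1} q^{\,b} \;\ge\; \sum_{b=0}^{B-1} (1+\rho)^{b} = A_B(\rho),
\qquad
A_{L'}(\rho) \;\ge\; L'.

% Hence
A_L(\rho) \;\ge\; L' A_B(\rho)
\quad\Longleftrightarrow\quad
\frac{L' A_B(\rho)}{A_L(\rho)} \le 1,
% with strict inequality when \rho > 0 and L' > 1, because then A_{L'}(\rho) > L'.
```

Because $A_{B}(\cdot)$ is nondecreasing in its argument, the two conditions of the theorem give $\varepsilon_{\mathrm{BL}}A_{B}(\rho_{\mathrm{BL}})\leq L'\varepsilon_{\mathrm{AR}}A_{B}(\rho)$, which is why $L'A_{B}(\rho)/A_{L}(\rho)$ bounds the ratio of the right-hand sides of Eqs. (36) and (35).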

[Theorem 1](https://arxiv.org/html/2606.24982#Thmtheorem1) makes the generation-error accumulation explicit. The bound in Eq. ([35](https://arxiv.org/html/2606.24982#S4.E35)) shows that, for event-wise autoregressive generation, the local one-event approximation error $\varepsilon_{\mathrm{AR}}$ is multiplied by an accumulation factor $A_{L}(\rho_{\mathrm{AR}})$ over all $L$ event-level sampling steps. The bound in Eq. ([36](https://arxiv.org/html/2606.24982#S4.E36)) is the block-wise counterpart: the local block approximation error $\varepsilon_{\mathrm{BL}}$ is multiplied by $A_{B}(\rho_{\mathrm{BL}})$ over only $B=L/L'$ block-level sampling steps. Since $A_{n}(\rho)$ is nondecreasing in $n$ for $\rho\geq 0$ and $B\leq L$, block-wise generation has a shorter accumulation horizon than event-wise generation. Within each block, the $L'$ latent event representations are sampled simultaneously by diffusion rather than being recursively fed back one by one, so errors made for earlier events inside the same block do not enter as prefix-level distributional discrepancies for later events in that block.

The comparison in Eq. ([37](https://arxiv.org/html/2606.24982#S4.E37)) relies on two conditions. The condition $\varepsilon_{\mathrm{BL}}\leq L'\varepsilon_{\mathrm{AR}}$ means that learning one block is no worse, in distributional error, than accumulating the $L'$ corresponding event-wise local errors. The condition $\rho_{\mathrm{BL}}\leq\rho_{\mathrm{AR}}=\rho$ assumes that the block transition is at least as stable as the event-wise transition with respect to prefix perturbations. Under these comparable-error and comparable-stability conditions, the ratio in Eq. ([37](https://arxiv.org/html/2606.24982#S4.E37)) is no larger than one, meaning that the block-wise error accumulation upper bound is no larger than its event-wise counterpart. The ratio is strictly smaller when $\rho>0$ and $L'>1$. This is consistent with the empirical trade-off observed in [Section V-D](https://arxiv.org/html/2606.24982#S5.SS4): increasing $L'$ initially improves generation performance by reducing event-by-event recursion, whereas overly large blocks may degrade performance because the block transition distribution becomes higher-dimensional and harder to learn.
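To get a feel for the size of the ratio in Eq. (37), the following small sketch evaluates $A_n(\rho)$ from Eq. (34) and the relative block-wise accumulation factor $L'A_B(\rho)/A_L(\rho)$ numerically. The function names and the example values of $L$ and $\rho$ are illustrative assumptions, not settings taken from the paper.

```python
def accumulation_factor(n, rho):
    """A_n(rho) from Eq. (34): ((1 + rho)^n - 1) / rho for rho > 0, and n for rho = 0."""
    if n < 0 or rho < 0:
        raise ValueError("n and rho must be non-negative")
    if rho == 0:
        return float(n)
    return ((1.0 + rho) ** n - 1.0) / rho


def relative_block_factor(seq_len, block_size, rho):
    """Ratio L' * A_B(rho) / A_L(rho) from Eq. (37), with B = L / L'."""
    if block_size <= 0 or seq_len % block_size != 0:
        raise ValueError("seq_len must be a positive multiple of block_size (pad first)")
    num_blocks = seq_len // block_size
    return (
        block_size
        * accumulation_factor(num_blocks, rho)
        / accumulation_factor(seq_len, rho)
    )


if __name__ == "__main__":
    # Hypothetical values: a length-20 sequence and a prefix-stability constant of 0.1.
    for block_size in (1, 2, 4, 5, 10, 20):
        ratio = relative_block_factor(20, block_size, 0.1)
        print(f"L'={block_size:2d}  ratio={ratio:.4f}")
```

The ratio only compares upper bounds under the two stated conditions; it says nothing about how $\varepsilon_{\mathrm{BL}}$ itself grows with the block size, which is the effect that limits large blocks in practice.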

If the decoder $g_{\phi}$ is $L_{\mathrm{dec}}$-Lipschitz from $(\mathbb{R}^{L\times D},d_{L})$ to an event-space discrepancy $\Delta_{L}$, then the latent-space result can be transferred to decoded event sequences: for either generator $\mathsf{Q}\in\{\mathsf{Q}^{\mathrm{AR}},\mathsf{Q}^{\mathrm{BL}}\}$,

$$W_{\Delta_{L}}\!\left((g_{\phi})_{\#}\mathsf{P}_{1:L},(g_{\phi})_{\#}\mathsf{Q}_{1:L}\right)\leq L_{\mathrm{dec}}\,W_{1}^{L}\!\left(\mathsf{P}_{1:L},\mathsf{Q}_{1:L}\right), \tag{38}$$

where $(g_{\phi})_{\#}$ denotes the push-forward distribution. Therefore, reducing the latent distributional discrepancy directly reduces decoded event sequence discrepancy up to the decoder Lipschitz constant. The proof is provided in the supplementary material.

## V Experiments

In this section, we conduct extensive experiments to evaluate the performance of LBDTPP, addressing the following major research questions \(RQs\):

- RQ1: How does LBDTPP perform compared to state-of-the-art TPP baselines for the unconditional generation task?
- RQ2: How does LBDTPP perform compared to state-of-the-art TPP baselines for the conditional generation task?
- RQ3: What are the sources of performance improvement of LBDTPP over autoregressive and non-autoregressive TPP baselines, and how does the block size affect its performance?
- RQ4: Can LBDTPP capture the empirical distributions of event timestamps and marks?
- RQ5: How sensitive is LBDTPP with respect to different hyperparameters?
- RQ6: How do different model variants, such as utilizing a learnable mark encoder or adopting $\boldsymbol{\epsilon}^{b}$-prediction, affect the performance of LBDTPP?
- RQ7: How does the sampling time of LBDTPP compare to that of baseline models?

### V-A Experimental Setup

Datasets. We use six real-world benchmark datasets containing event sequences from multiple domains. [Table I](https://arxiv.org/html/2606.24982#S5.T1) summarizes the statistics of these datasets. All datasets are available at the CDiff repository [[69](https://arxiv.org/html/2606.24982#bib.bib17)].

- Taxi [[61](https://arxiv.org/html/2606.24982#bib.bib10)] contains time-stamped taxi pick-up and drop-off events throughout the five boroughs of New York City. Each combination of borough and event type (pick-up or drop-off) defines a mark, resulting in a total of 10 marks.
- Taobao [[78](https://arxiv.org/html/2606.24982#bib.bib14)] includes time-stamped user click behaviors on the Taobao platform. Each user has a sequence of product click events, where each event contains a timestamp and a product category.
- StackOverflow [[30](https://arxiv.org/html/2606.24982#bib.bib11)] contains badges awarded to users of a question-answering website. Each user is awarded a sequence of badges, with a total of 22 different badge marks.
- Retweet [[76](https://arxiv.org/html/2606.24982#bib.bib9)] consists of sequences of time-stamped user retweet events, categorized into three marks based on the users' following sizes: "small", "medium", and "large".
- MOOC [[28](https://arxiv.org/html/2606.24982#bib.bib15)] includes records of student interactions within an online course platform. Each type of interaction (e.g., video watching, forum posting) is treated as a distinct mark. We use the same pre-processing approach as in [[5](https://arxiv.org/html/2606.24982#bib.bib45), [69](https://arxiv.org/html/2606.24982#bib.bib17)].
- Amazon [[43](https://arxiv.org/html/2606.24982#bib.bib16)] contains time-stamped user product review behaviors, where product categories are treated as event marks.

The data pre-processing for the two generation tasks is as follows. (i) For unconditional generation ([Section V-B](https://arxiv.org/html/2606.24982#S5.SS2)), event timestamps within each sequence are normalized to a unified scale by dividing them by the maximum termination time, defined as the maximum last timestamp in the training set. During evaluation, the generated timestamps are mapped back to the original scale for comparison with the test sequences. (ii) For conditional generation ([Section V-C](https://arxiv.org/html/2606.24982#S5.SS3)), future events are predicted directly based on historical events in their natural time scale, so no explicit time normalization is applied.
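The normalization in step (i) can be sketched as follows. This is a minimal illustration assuming each sequence is a list of increasing timestamps; the helper names are hypothetical and do not come from the released code.

```python
def max_termination_time(train_sequences):
    """Maximum last timestamp over all non-empty training sequences."""
    return max(seq[-1] for seq in train_sequences if seq)


def normalize(timestamps, t_max):
    """Scale timestamps to a unified range by dividing by the maximum termination time."""
    return [t / t_max for t in timestamps]


def denormalize(timestamps, t_max):
    """Map generated timestamps back to the original time scale for evaluation."""
    return [t * t_max for t in timestamps]


if __name__ == "__main__":
    train = [[1.0, 4.0], [2.0, 8.0]]  # toy data, not from any benchmark
    t_max = max_termination_time(train)
    scaled = normalize(train[1], t_max)
    print(scaled, denormalize(scaled, t_max))
```

Note that the scale factor is computed from the training set only, so a test sequence can map to values slightly above one.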

TABLE I: Statistics of each dataset

Baselines. We compare our model with nine TPP baselines, including both autoregressive and non-autoregressive TPPs.

- NHP [[41](https://arxiv.org/html/2606.24982#bib.bib1)] introduces a continuous-time long short-term memory (LSTM) network for modeling event sequences. The conditional intensity function of NHP is capable of decaying over time, allowing it to flexibly capture the influence of past events on future event occurrences.
- LNM [[52](https://arxiv.org/html/2606.24982#bib.bib3)] models the conditional density distribution of TPPs using a log-normal mixture model. This intensity-free approach allows for a more flexible and efficient representation of the temporal dynamics, providing advantages in terms of both expressiveness and ease of sampling.
- THP [[79](https://arxiv.org/html/2606.24982#bib.bib4)] incorporates a Transformer architecture to model the conditional intensity function of TPPs. By leveraging self-attention mechanisms, THP captures long-range dependencies in event sequences, enabling it to model intricate temporal patterns effectively.
- AttNHP [[67](https://arxiv.org/html/2606.24982#bib.bib2)] utilizes a Transformer architecture to model event sequences, learning rich embeddings of actual and possible events at any given time, based on lower-level representations of these events and their context.
- S2P2 [[7](https://arxiv.org/html/2606.24982#bib.bib7)] adapts deep state-space models to marked TPPs by combining neural jump stochastic differential equations with nonlinear transformations. This architecture allows the model to efficiently capture continuous-time dynamics and long-range dependencies in event sequences.
- DualTPP [[11](https://arxiv.org/html/2606.24982#bib.bib6)] combines two components for long-horizon event forecasting: an autoregressive TPP model that captures short-term event dynamics at a microscopic level and a count model that handles the macroscopic, long-term behavior.
- HYPRO [[66](https://arxiv.org/html/2606.24982#bib.bib8)] introduces a hybridly normalized probabilistic model designed for long-horizon prediction of event sequences. It combines an autoregressive base model with an energy function, which reweights the predicted sequences to improve the realism of long-term forecasts.
- TCDDM [[33](https://arxiv.org/html/2606.24982#bib.bib5)] designs a generative framework for neural TPPs that adopts a diffusion-based probabilistic decoder. This approach enhances predictive performance by leveraging diffusion models to generate high-quality inter-event times.
- CDiff [[69](https://arxiv.org/html/2606.24982#bib.bib17)] proposes to address the task of long-horizon event forecasting by employing interacting diffusion processes. It introduces two coupled diffusion processes, one for event marks and one for inter-event times, which interact through their respective denoising functions.

The specific baseline settings for the two generation tasks are as follows. (i) For unconditional generation, we utilize the EasyTPP benchmark [[65](https://arxiv.org/html/2606.24982#bib.bib12)] to evaluate five autoregressive TPP baselines: NHP, LNM, THP, AttNHP, and S2P2. We do not compare against the other four baselines (i.e., DualTPP, HYPRO, TCDDM, and CDiff), because they are specifically designed for conditional generation and cannot be applied to variable-length unconditional generation without significant modifications. (ii) For conditional generation, we compare our model with all nine TPP baselines. We evaluate THP and S2P2 with the EasyTPP benchmark, while the results of the other seven baselines are taken from [[69](https://arxiv.org/html/2606.24982#bib.bib17)].

TABLE II: OTD and $\textbf{RMSE}_{m}$ of unconditional generation reported in mean ± s.d. Best and second best are highlighted

Evaluation Metrics. We adopt four commonly used metrics to evaluate the quality of the generated marked event sequences. In this work, we do not report density-based metrics such as log-likelihood because several diffusion TPP baselines and our LBDTPP model are implicit generators or operate in latent space, making event-space log-likelihoods unavailable or not directly comparable.

- OTD: The optimal transport distance between two marked event sequences, which defines the minimum cost required to edit a generated event sequence into the ground truth event sequence, measuring the sequence-level similarity between them. We report the average values of the OTD across different values of the deletion/insertion cost constant $C\in\{0.05,0.5,1,1.5,2,3,4\}$. More details on this OTD metric can be found in [[42](https://arxiv.org/html/2606.24982#bib.bib36)].
- $\textbf{RMSE}_{m}$: The root mean square error of the number of events for each mark, which quantifies how well the event mark distribution of the generated sequence matches that of the ground truth sequence. For $m\in[M]$, we compute the count of events of mark $m$ in the generated sequence, $\hat{C}_{m}$, and the true sequence, $C_{m}$. This metric is then calculated as $\sqrt{\frac{1}{M}\sum_{m=1}^{M}(C_{m}-\hat{C}_{m})^{2}}$.
- $\textbf{RMSE}_{\tau}$: The root mean square error between the inter-event times of the generated sequence and those of the ground truth sequence, which assesses the temporal accuracy of the generated events. For a generated sequence with inter-event times $\hat{\tau}_{1},\ldots,\hat{\tau}_{H}$ and a ground truth sequence with inter-event times $\tau_{1},\ldots,\tau_{H}$, we compute $\text{RMSE}_{\tau}=\sqrt{\frac{1}{H}\sum_{i=1}^{H}(\tau_{i}-\hat{\tau}_{i})^{2}}$, where $H$ is the number of inter-event times.
- sMAPE: The symmetric mean absolute percentage error between the inter-event times of the generated sequence and those of the ground truth sequence, which also evaluates the temporal accuracy. It is defined as $\text{sMAPE}=\frac{100}{H}\sum_{i=1}^{H}\frac{2|\tau_{i}-\hat{\tau}_{i}|}{|\tau_{i}|+|\hat{\tau}_{i}|}$.
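The three closed-form metrics above can be written directly from their definitions. The sketch below is a plain-Python illustration for a single pair of sequences; the function names are hypothetical, marks are assumed to be 0-indexed integers in `range(num_marks)`, and treating a sMAPE term with a zero denominator as zero is an assumption of this sketch rather than something stated in the paper. OTD is omitted because its definition is given in the cited reference.

```python
from math import sqrt


def rmse_m(gen_marks, true_marks, num_marks):
    """RMSE between per-mark event counts; the sequences may differ in length."""
    gen_counts = [0] * num_marks
    true_counts = [0] * num_marks
    for m in gen_marks:
        gen_counts[m] += 1
    for m in true_marks:
        true_counts[m] += 1
    squared = sum((c - c_hat) ** 2 for c, c_hat in zip(true_counts, gen_counts))
    return sqrt(squared / num_marks)


def _check_same_length(gen_taus, true_taus):
    if not gen_taus or len(gen_taus) != len(true_taus):
        raise ValueError("inter-event time lists must be non-empty and of equal length")


def rmse_tau(gen_taus, true_taus):
    """RMSE between aligned inter-event times (requires equal lengths)."""
    _check_same_length(gen_taus, true_taus)
    squared = sum((t - t_hat) ** 2 for t, t_hat in zip(true_taus, gen_taus))
    return sqrt(squared / len(true_taus))


def smape(gen_taus, true_taus):
    """Symmetric mean absolute percentage error over aligned inter-event times."""
    _check_same_length(gen_taus, true_taus)
    total = 0.0
    for t, t_hat in zip(true_taus, gen_taus):
        denom = abs(t) + abs(t_hat)
        if denom > 0:  # assumption: a 0/0 term contributes zero error
            total += 2.0 * abs(t - t_hat) / denom
    return 100.0 * total / len(true_taus)
```

In the paper's protocol these per-sequence values are then averaged over the full test set.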

Since the $\textbf{RMSE}_{\tau}$ and sMAPE metrics require the generated and ground-truth sequences to have the same number of events, they are evaluated only in the fixed-length conditional generation benchmark [[69](https://arxiv.org/html/2606.24982#bib.bib17)] ([Section V-C](https://arxiv.org/html/2606.24982#S5.SS3)), where the number of future events is fixed to $H=20$. In contrast, OTD and $\textbf{RMSE}_{m}$ are evaluated in both unconditional and conditional generation benchmarks, as they can be computed between sequences of either different or equal lengths. For each run, all metrics are averaged over the full test set. Each model is trained with 10 different random seeds, and the mean and standard deviation (s.d.) of each metric are reported in this work.

Implementation Details. As described in [Section IV-A](https://arxiv.org/html/2606.24982#S4.SS1), the event encoder contains no learnable parameters, the clean latent block $\mathbf{z}^{b}$ is predicted by a Transformer equipped with the block diffusion attention mask [[1](https://arxiv.org/html/2606.24982#bib.bib18)], and the time and mark decoders are implemented as two-layer MLPs. For both unconditional and conditional generation tasks, we set the number of forward diffusion steps to $K=100$ and use the DDIM sampler [[57](https://arxiv.org/html/2606.24982#bib.bib41)] with $S=50$ sampling steps to accelerate sampling. All experiments are conducted on an NVIDIA GeForce RTX 3090 GPU with 24GB of memory. We implement our model using PyTorch [[48](https://arxiv.org/html/2606.24982#bib.bib13)].

Training Details. We set the maximum number of epochs to 50 for all experiments and evaluate the model every 2 epochs on the validation set, selecting the checkpoint with the best performance for testing. For both training and evaluation, the batch size is fixed at 32 for all datasets. The Adam optimizer [[26](https://arxiv.org/html/2606.24982#bib.bib46)] is employed for model optimization.

Hyperparameter Setting. We perform grid search to determine the hyperparameters for LBDTPP based on the validation set. Specifically, we tune the learning rate from $\{0.0001,0.0005,0.001,0.005,0.01\}$, the embedding dimension $D$ from $\{16,32,64,128\}$, the number of attention heads from $\{1,2,4\}$, and the number of Transformer layers from $\{1,2,4\}$. In our initial experiments, setting the weight $\lambda$ in the loss function Eq. ([27](https://arxiv.org/html/2606.24982#S4.E27)) to $1$ already yields stronger performance than the baselines, so we fix this value in all main experiments. We also analyze the sensitivity of $\lambda$ in [Section V-F](https://arxiv.org/html/2606.24982#S5.SS6).
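For reference, the search space described above can be enumerated with a few lines of Python. The dictionary keys are hypothetical names chosen for this sketch, not configuration keys from the released code, and the sketch assumes the full Cartesian product is searched.

```python
from itertools import product

# Candidate values as listed in the text; the key names are illustrative.
SEARCH_SPACE = {
    "learning_rate": [0.0001, 0.0005, 0.001, 0.005, 0.01],
    "embed_dim": [16, 32, 64, 128],
    "num_heads": [1, 2, 4],
    "num_layers": [1, 2, 4],
}


def iter_configs(space):
    """Yield one dict per point of the Cartesian product of the search space."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


if __name__ == "__main__":
    configs = list(iter_configs(SEARCH_SPACE))
    print(len(configs), "candidate configurations; first:", configs[0])
```

Each configuration would then be trained and scored on the validation set, with the best-scoring one kept for testing.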

### V-B Unconditional Generation Task (A1)

We first evaluate the performance of our LBDTPP model on the unconditional generation task, where the objective is to generate new event sequences that align well with the underlying data distribution. This task serves as a fundamental benchmark for assessing the fitting capacity and generation quality of different TPP models. A model with strong performance in this setting can also benefit downstream applications, such as system simulation and data augmentation [[25](https://arxiv.org/html/2606.24982#bib.bib29)]. Compared with prior settings, unconditional generation for marked event sequences has received limited attention. Existing studies [[37](https://arxiv.org/html/2606.24982#bib.bib19), [25](https://arxiv.org/html/2606.24982#bib.bib29), [69](https://arxiv.org/html/2606.24982#bib.bib17)] have primarily focused on generation of unmarked event sequences in unconditional and conditional settings, or on conditional generation of marked event sequences. We therefore include this setting as an important evaluation scenario to assess whether a model can jointly capture temporal dynamics and mark distributions without historical conditioning.

Since there are no ground truth event sequences available in unconditional generation, we generate event sequences with the same termination times as those in the test set. We then calculate the OTD and $\text{RMSE}_{m}$ metrics between the generated and test sequences to evaluate how well the generated sequences follow the underlying data distribution. Under this setting, the generated sequences and the test sequences generally contain different numbers of events. For all datasets, we set the block size $L'=8$ in our LBDTPP model, meaning that each block contains 8 events.
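The way a termination time turns block-by-block sampling into variable-length output can be sketched as follows. The loop keeps appending blocks until the latest timestamp passes $T$ and then retains only events up to $T$. The `toy_block_sampler` below is a purely hypothetical stand-in (exponential gaps, uniform marks) for the learned latent block diffusion sampler and decoder; only the control flow is meant to be illustrative.

```python
import random


def toy_block_sampler(prefix, block_size, rng):
    """Hypothetical stand-in for the learned block sampler.

    Returns `block_size` (time, mark) events that continue after the prefix.
    """
    t = prefix[-1][0] if prefix else 0.0
    block = []
    for _ in range(block_size):
        t += rng.expovariate(1.0)            # toy inter-event time
        block.append((t, rng.randrange(3)))  # toy mark in {0, 1, 2}
    return block


def generate_until(termination_time, block_size, sample_block, rng, max_blocks=10_000):
    """Generate blocks autoregressively and keep events with time <= termination_time."""
    events = []
    for _ in range(max_blocks):
        events.extend(sample_block(events, block_size, rng))
        if events[-1][0] > termination_time:
            break
    return [event for event in events if event[0] <= termination_time]


if __name__ == "__main__":
    rng = random.Random(0)
    sequence = generate_until(5.0, 8, toy_block_sampler, rng)
    print(len(sequence), "events retained before T = 5.0")
```

The number of retained events is therefore not fixed in advance, which is why only length-agnostic metrics such as OTD and $\text{RMSE}_{m}$ apply in this setting.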

Results. [Table II](https://arxiv.org/html/2606.24982#S5.T2) summarizes the OTD and $\text{RMSE}_{m}$ results of unconditional generation. We observe that LBDTPP consistently outperforms all five TPP baselines across six real-world datasets in terms of both metrics. This shows the superior capability of LBDTPP in capturing the underlying distribution of event sequences. Moreover, the performance gap between LBDTPP and these autoregressive TPPs also highlights the effectiveness of our latent block diffusion modeling approach in mitigating error accumulation and generating high-fidelity event sequences.

TABLE III: OTD, $\textbf{RMSE}_{m}$, $\textbf{RMSE}_{\tau}$ and sMAPE of conditional generation reported in mean ± s.d. Best and second best are highlighted

### V-C Conditional Generation Task (A2)

Predicting future event occurrences based on historical observations is a crucial task in various real-world applications, including medical diagnosis and financial transactions. Below we compare the forecasting performance of LBDTPP with nine autoregressive and non-autoregressive TPP models. We follow the experimental setup of prior work [[69](https://arxiv.org/html/2606.24982#bib.bib17)], predicting the last 20 events of each sequence based on preceding events (i.e., $H=20$). Metrics are computed by comparing the generated future sequences with the ground truth, and the block size $L'$ is set to 4 for all datasets.

Results. The conditional generation results are presented in [Table III](https://arxiv.org/html/2606.24982#S5.T3). Across the six datasets, LBDTPP demonstrates superior overall performance compared with all autoregressive and non-autoregressive TPP baselines. Specifically, our model achieves the best OTD and $\text{RMSE}_{m}$ scores on 5 out of 6 datasets, the best $\text{RMSE}_{\tau}$ on all 6 datasets, and the best sMAPE on 4 out of 6 datasets, while performing comparably in the remaining cases. Its advantage over autoregressive TPP baselines indicates that generating multiple events simultaneously within each block can effectively mitigate error accumulation caused by event-by-event autoregressive generation, leading to more accurate future event predictions. Moreover, LBDTPP outperforms CDiff, the current state-of-the-art diffusion-based TPP baseline, in most cases. This demonstrates that our latent block diffusion approach can more accurately capture the conditional distribution of future sequences given historical observations and produce reliable event forecasts.

### V-D Sources of Performance Improvement (A3)

In this subsection, we empirically analyze the sources of LBDTPP's performance improvement over baseline methods. The experimental settings are kept consistent with those in [Section V-C](https://arxiv.org/html/2606.24982#S5.SS3), with only the block size $L'$ varied among the set $\{1,2,4,8,16,20\}$. Through this study, we find that the performance gain of LBDTPP mainly comes from two factors: latent-space diffusion and block-wise generation.

The results of our LBDTPP model under different block sizes are presented in [Fig. 2](https://arxiv.org/html/2606.24982#S5.F2). We observe that, on most datasets, both OTD and $\text{RMSE}_{m}$ metrics first decrease and then increase as $L'$ becomes larger, indicating that the generation quality first improves and then degrades. Moreover, regardless of the block size, LBDTPP generally outperforms the baseline methods reported in [Table III](https://arxiv.org/html/2606.24982#S5.T3). Below, a more detailed analysis explains where the improvement comes from. When $L'=1$, LBDTPP reduces to an autoregressive generation paradigm, where events are generated one by one. Even under this setting, LBDTPP still performs better than autoregressive baselines, which demonstrates the benefit of latent-space diffusion. As $L'$ increases from 1, LBDTPP generates multiple events in parallel within each block, and its performance improves accordingly. This shows the advantage of block-wise generation, which mitigates the error accumulation issue caused by strictly autoregressive event-by-event sampling.

![Refer to caption](https://arxiv.org/html/2606.24982v1/x2.png) Figure 2: Performance of LBDTPP under different block sizes ($L'$) for conditional generation. The left y-axis represents OTD and the right y-axis represents $\text{RMSE}_{m}$. The shaded regions represent the standard deviation.

![Refer to caption](https://arxiv.org/html/2606.24982v1/x3.png) Figure 3: Impact of block size ($L'$) on LBDTPP sampling time for conditional generation on Taxi and Taobao datasets.

This benefit, however, does not keep increasing with larger $L'$. Overly large event blocks lead to a coarser factorization of the sequence distribution in Eq. ([13](https://arxiv.org/html/2606.24982#S4.E13)), making the conditional distribution of each block harder to learn. They also make the denoising problem more challenging, since each diffusion step needs to jointly recover higher-dimensional event block representations. Nevertheless, when $L'=20$, LBDTPP becomes a non-autoregressive model that generates all 20 future events in one shot, and it still outperforms non-autoregressive baselines in most cases. To be specific, compared with TCDDM and CDiff, which perform diffusion directly in the raw event space, the superior performance of LBDTPP further confirms the benefit of performing diffusion in latent space.

This performance trend as the block size changes is consistent with [Theorem 1](https://arxiv.org/html/2606.24982#Thmtheorem1). Although the theorem is stated for unconditional generation, the same intuition applies to conditional generation after fixing the observed history. Increasing $L'$ reduces the number of recursive transitions needed to generate the 20 future events, and thus can reduce prefix-level error accumulation. At the same time, this advantage requires the block-level approximation error and stability to remain well controlled as the block size increases. When $L'$ becomes too large, the harder block distribution and denoising problem can increase the local block error, which explains the degradation observed after the optimal block size.

To sum up, these results suggest that the improvement of our LBDTPP model comes from the combination of latent-space diffusion and block-wise generation. We also report the sampling time on the entire test set for different block sizes on the Taxi and Taobao datasets in [Fig. 3](https://arxiv.org/html/2606.24982#S5.F3). The sampling time decreases as the block size increases, which is expected because larger blocks allow the model to generate more events in each sampling round and thus reduce the total number of block sampling rounds. In practice, we recommend using block size $L'=4$ when generation quality is the primary concern. If faster generation is preferred, larger block sizes can also be used, since the performance degradation of LBDTPP is relatively mild.
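The relationship between block size and the number of sampling rounds is simple arithmetic, sketched below for the $H=20$ conditional setting. The count of denoiser evaluations assumes one network call per DDIM step per block, which is an assumption of this sketch rather than a measured cost; actual wall-clock time also depends on per-call cost, which grows with the block size.

```python
def num_block_rounds(horizon, block_size):
    """Number of sequential block sampling rounds needed to cover `horizon` events.

    Uses ceiling division, i.e. the last block is padded if it is only partly used.
    """
    if horizon <= 0 or block_size <= 0:
        raise ValueError("horizon and block_size must be positive")
    return -(-horizon // block_size)


def denoiser_calls(horizon, block_size, sampling_steps):
    """Total denoiser evaluations, assuming one call per DDIM step per block."""
    return num_block_rounds(horizon, block_size) * sampling_steps


if __name__ == "__main__":
    for block_size in (1, 2, 4, 8, 16, 20):
        rounds = num_block_rounds(20, block_size)
        calls = denoiser_calls(20, block_size, sampling_steps=50)
        print(f"L'={block_size:2d}  rounds={rounds:2d}  denoiser calls={calls}")
```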

### V-E Distribution Evaluation (A4)

We further evaluate whether LBDTPP captures the empirical distributions of event timestamps and marks in the conditional generation task. As shown in [Fig. 4](https://arxiv.org/html/2606.24982#S5.F4), we plot the empirical density of inter-event times and the empirical frequency distribution of event marks for both the ground-truth future events and the events generated by LBDTPP. The results show that the generated inter-event times closely follow the distribution of the true future events, and the generated event marks also match the real mark distribution well. These observations indicate that LBDTPP can effectively capture both temporal and categorical distributional patterns, demonstrating its strong generative forecasting ability beyond standard evaluation metrics.

![Refer to caption](https://arxiv.org/html/2606.24982v1/x4.png) Figure 4: Distribution evaluation of LBDTPP for conditional generation. We plot the empirical density of inter-event times and the empirical frequency distribution of event marks for both the ground-truth future events and the events generated by LBDTPP.

### V-F Hyperparameter Sensitivity (A5)

We conduct a sensitivity analysis on the number of sampling steps $S$ in the DDIM sampler for conditional generation on Taxi and Taobao datasets. We maintain the same experimental settings as in [Section V-C](https://arxiv.org/html/2606.24982#S5.SS3), varying only $S$ over the set $\{10,20,30,40,50\}$. The corresponding OTD and $\text{RMSE}_{m}$ trends are illustrated in [Fig. 5](https://arxiv.org/html/2606.24982#S5.F5). To be specific, on the Taxi dataset, LBDTPP achieves OTD scores ranging from 19.039 to 19.258 and $\text{RMSE}_{m}$ scores between 0.967 and 0.993 across different sampling steps. On the Taobao dataset, LBDTPP attains OTD scores from 41.210 to 41.642 and $\text{RMSE}_{m}$ scores between 2.103 and 2.136. From these results, we observe that LBDTPP consistently sustains superior performance compared to the baseline methods in [Table III](https://arxiv.org/html/2606.24982#S5.T3), even with as few as 10 sampling steps. This indicates that LBDTPP can effectively generate high-quality event sequences using a small number of sampling iterations, highlighting its efficiency and enhancing its practicality for real-world applications. In addition, [Fig. 6](https://arxiv.org/html/2606.24982#S5.F6) shows that the sampling time decreases as the number of sampling steps $S$ is reduced, as expected.

We also study the hyperparameter sensitivity of $\lambda$ in the loss function. We vary $\lambda$ over the set $\{0.01,0.05,0.1,0.5,1.0\}$ and present the results in [Fig. 7](https://arxiv.org/html/2606.24982#S5.F7). We observe that LBDTPP achieves stable performance across different values of $\lambda$ and consistently outperforms all baselines in [Table III](https://arxiv.org/html/2606.24982#S5.T3) with respect to OTD and $\text{RMSE}_{m}$ on both datasets. This demonstrates that LBDTPP is robust to the choice of $\lambda$ and can maintain strong performance without requiring extensive hyperparameter tuning.

### V-G Model Variants (A6)

We now conduct model variant experiments to examine two design choices of LBDTPP on the conditional generation task using the Taxi and Taobao datasets. As described in [Section IV-A](https://arxiv.org/html/2606.24982#S4.SS1), the default LBDTPP encoder contains no learnable parameters: the temporal component is encoded by sinusoidal time embeddings, and the mark component is obtained from a fixed embedding matrix. To evaluate whether a learnable mark representation is beneficial, we consider a variant named LBDTPP-LM, where the embedding matrix $\mathbf{W}$ in Eq. ([12](https://arxiv.org/html/2606.24982#S4.E12)) is optimized during training. In addition, the default LBDTPP adopts $\mathbf{z}^{b}$-prediction in the latent block diffusion module, where the denoising network directly predicts the clean latent block. To assess this prediction target, we introduce another variant named LBDTPP-EP, which instead adopts $\boldsymbol{\epsilon}^{b}$-prediction and predicts the Gaussian noise added to the latent block.

![Refer to caption](https://arxiv.org/html/2606.24982v1/x5.png)
Figure 5: Impact of the number of sampling steps ($S$) on LBDTPP performance for conditional generation on Taxi and Taobao datasets.

![Refer to caption](https://arxiv.org/html/2606.24982v1/x6.png)
Figure 6: Impact of the number of sampling steps ($S$) on LBDTPP sampling time for conditional generation on Taxi and Taobao datasets.

![Refer to caption](https://arxiv.org/html/2606.24982v1/x7.png)
Figure 7: Impact of the value of $\lambda$ on LBDTPP performance for conditional generation on Taxi and Taobao datasets.

The results of the model variant experiments are shown in [Fig. 8](https://arxiv.org/html/2606.24982#S5.F8). LBDTPP slightly outperforms LBDTPP-LM on both datasets, indicating that a parameter-free event encoder is already sufficient for constructing effective latent event representations. This also suggests that introducing a learnable mark embedding matrix does not necessarily improve generation quality under our setting, and that the fixed encoder avoids additional parameters without sacrificing performance. Moreover, LBDTPP consistently performs better than LBDTPP-EP, demonstrating that $\mathbf{z}^{b}$-prediction is more effective than $\boldsymbol{\epsilon}^{b}$-prediction in our latent block diffusion framework. One possible reason is that the clean latent block $\mathbf{z}^{b}$ is exactly the input to the event decoder, so directly predicting $\mathbf{z}^{b}$ provides the decoder with the target latent representation without an additional recovery step. In contrast, $\boldsymbol{\epsilon}^{b}$-prediction first estimates the injected Gaussian noise and then recovers the clean latent block using the relation in Eq. ([17](https://arxiv.org/html/2606.24982#S4.E17)), which involves dividing by $\sqrt{\bar{\alpha}_{k}}$. As a result, prediction errors may be amplified when $k$ is large and $\bar{\alpha}_{k}$ is small, and these errors can subsequently propagate to the event decoder. Since the latent representations jointly encode continuous temporal information and discrete mark information, preserving their clean block structure is particularly important for accurate event sequence generation.
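The amplification argument can be checked numerically. The sketch below assumes that Eq. (17) takes the standard Gaussian-diffusion form $\mathbf{z}_{k}=\sqrt{\bar{\alpha}_{k}}\,\mathbf{z}_{0}+\sqrt{1-\bar{\alpha}_{k}}\,\boldsymbol{\epsilon}$; the function names and the example values of $\bar{\alpha}_{k}$ are hypothetical and chosen only for illustration.

```python
import math

def z0_from_eps(z_k, eps_hat, alpha_bar):
    """Recover the clean latent from a noise estimate (assumed standard relation)."""
    return (z_k - math.sqrt(1.0 - alpha_bar) * eps_hat) / math.sqrt(alpha_bar)

def amplification(alpha_bar):
    """Factor by which an error in eps_hat is scaled in the recovered z0."""
    return math.sqrt(1.0 - alpha_bar) / math.sqrt(alpha_bar)

if __name__ == "__main__":
    for alpha_bar in (0.99, 0.5, 0.01):  # small alpha_bar corresponds to large k
        print(f"alpha_bar={alpha_bar}: error factor = {amplification(alpha_bar):.3f}")
```

Under this assumption, an error of size $\delta$ in the noise estimate becomes an error of size $\delta\sqrt{(1-\bar{\alpha}_{k})/\bar{\alpha}_{k}}$ in the recovered latent: roughly $0.1\,\delta$ at $\bar{\alpha}_{k}=0.99$ but roughly $9.95\,\delta$ at $\bar{\alpha}_{k}=0.01$. Direct $\mathbf{z}^{b}$-prediction has no such division.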

![Refer to caption](https://arxiv.org/html/2606.24982v1/x8.png)
Figure 8: Performance of different LBDTPP variants for conditional generation on Taxi and Taobao datasets.

![Refer to caption](https://arxiv.org/html/2606.24982v1/x9.png)
Figure 9: Comparison of sampling time and model parameters with baselines. Sampling time is reported in minutes, and the number of model parameters is reported in thousands (K). Our LBDTPP model achieves comparable or lower sampling time.

### V-H Sampling Time Comparison (A7)

We compare the sampling time of our model with that of TPP baselines for both unconditional and conditional generation tasks on the Taxi and Taobao datasets. For unconditional generation, we compare LBDTPP with five autoregressive TPP baselines. For conditional generation, we compare LBDTPP with CDiff, the current state-of-the-art diffusion-based TPP baseline. The sampling time is measured on the entire test set, and the batch size is set to 32 for all models. Our model keeps the same experimental settings as in Sections [V-B](https://arxiv.org/html/2606.24982#S5.SS2) and [V-C](https://arxiv.org/html/2606.24982#S5.SS3).

The sampling time results and model parameters are shown in [Fig. 9](https://arxiv.org/html/2606.24982#S5.F9). Our LBDTPP model achieves comparable or lower sampling times in both tasks. Among the autoregressive baselines, LNM models the conditional density function of inter-event times through a mixture of log-normal distributions; its closed-form sampling expression makes sampling efficient. The other autoregressive baselines model the conditional intensity function and rely on the thinning algorithm [[45](https://arxiv.org/html/2606.24982#bib.bib50), [66](https://arxiv.org/html/2606.24982#bib.bib8)] for iterative sampling, which results in slower speeds. CDiff uses multiple sampling rounds, where the average of the sampled times and the mode of the sampled marks are used as the final generated events, thereby increasing the overall sampling time. Here, we use 5 rounds, following the original code. In contrast, our model requires only one sampling round to generate high-quality event sequences. For CDiff, we follow the optimal hyperparameter settings reported in the original paper, using 100 sampling steps on Taxi and 200 sampling steps on Taobao, whereas LBDTPP uses 50 sampling steps in all main experiments. For a more controlled comparison, we also run CDiff with 50 sampling steps, denoted as CDiff-50, and report its sampling time in [Fig. 9](https://arxiv.org/html/2606.24982#S5.F9). Even under the same sampling-step budget, our model still shows a clear sampling-time advantage.
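The gap between closed-form and thinning-based sampling can be illustrated with a small sketch. It is not the code of any baseline: the univariate Hawkes intensity with an exponential kernel and all parameter names (`mu`, `alpha`, `beta`, `weights`, `mus`, `sigmas`) are assumptions chosen to keep the example short.

```python
import math
import random

def hawkes_intensity(t, history, mu, alpha, beta):
    """Conditional intensity of a univariate Hawkes process with exponential kernel."""
    return mu + sum(alpha * math.exp(-beta * (t - s)) for s in history if s < t)

def sample_next_event_thinning(history, t_start, mu, alpha, beta, rng):
    """Thinning: propose from a dominating rate, then accept or reject (iterative)."""
    t = t_start
    while True:
        # The intensity only decays between events, so its value just after t
        # is an upper bound until the next accepted event.
        lam_bar = mu + sum(alpha * math.exp(-beta * (t - s)) for s in history if s <= t)
        t += rng.expovariate(lam_bar)
        if rng.random() * lam_bar <= hawkes_intensity(t, history, mu, alpha, beta):
            return t

def sample_lognormal_mixture(weights, mus, sigmas, rng):
    """Closed-form draw of an inter-event time: pick a component, sample once."""
    j = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.lognormvariate(mus[j], sigmas[j])
```

The thinning routine may need several proposals per event, each requiring an intensity evaluation, whereas the mixture sampler needs a single draw, which matches the speed difference described above.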

Based on the experimental results in Sections [V-D](https://arxiv.org/html/2606.24982#S5.SS4) and [V-F](https://arxiv.org/html/2606.24982#S5.SS6), LBDTPP's performance is stable with respect to both block size and sampling steps. Note that $L^{\prime}=8$ and $S=50$ are used for unconditional generation, while $L^{\prime}=4$ and $S=50$ are used for conditional generation. We further report the sampling time of a faster version for both tasks, obtained by setting $L^{\prime}=20$ and $S=10$ and denoted as LBDTPP-F. As shown in [Fig. 9](https://arxiv.org/html/2606.24982#S5.F9), LBDTPP-F is faster than all baseline models.
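A rough way to see why a larger block size and fewer sampling steps speed up generation is to count sequential denoiser calls. The cost model below is a simplifying assumption (one network call per sampling step per block, with blocks generated one after another), and the sequence length of 40 events is a hypothetical value rather than a statistic of either dataset.

```python
import math

def sequential_denoiser_calls(num_events, block_len, num_steps):
    """Blocks are generated one after another; each block runs num_steps denoising steps."""
    return math.ceil(num_events / block_len) * num_steps

if __name__ == "__main__":
    n = 40  # hypothetical number of events to generate
    for name, block_len, steps in [("unconditional default", 8, 50),
                                   ("conditional default", 4, 50),
                                   ("LBDTPP-F", 20, 10)]:
        print(name, sequential_denoiser_calls(n, block_len, steps))
```

Under these assumptions the three settings need 250, 500, and 20 sequential calls respectively. Actual wall-clock time also depends on the per-call cost, which grows with block length and history length, so the counts are only indicative.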

## VI Conclusion

We have presented LBDTPP, a novel semi-autoregressive TPP framework that introduces latent block diffusion for modeling asynchronous event sequences. By generating event sequences block by block with parallel generation within each block, LBDTPP supports high-quality, variable-length generation while mitigating error accumulation in autoregressive TPPs and overcoming the fixed-length generation limitation of non-autoregressive diffusion TPPs. Extensive experiments on six real-world datasets demonstrate the superiority of LBDTPP over state-of-the-art TPP baselines in both unconditional and conditional generation tasks. Further analysis confirms the contributions of latent-space diffusion and block-wise generation to the performance improvement. Moreover, LBDTPP achieves comparable or lower sampling times, showcasing its efficiency in generating high-quality event sequences. Future work includes investigating more efficient diffusion sampling techniques to further reduce sampling times, and extending our framework to handle spatio-temporal point processes.

## VII Acknowledgments

This work was partially supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDB0680101), the National Natural Science Foundation of China (No. 62472416 and 62402491), and the CAS Project for Young Scientists in Basic Research (No. YSBR-008). The model training was performed on the robotic AI-Scientist platform of the Chinese Academy of Sciences.

## References

- [1] (2025) Block diffusion: interpolating between autoregressive and diffusion language models. In International Conference on Learning Representations.
- [2] M. Arriola, Y. Schiff, H. Phung, A. Gokaslan, and V. Kuleshov (2025) Encoder-decoder diffusion language models for efficient training and inference. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- [3] J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. Van Den Berg (2021) Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems.
- [4] F. Bao, S. Nie, K. Xue, Y. Cao, C. Li, H. Su, and J. Zhu (2023) All are worth words: a vit backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- [5] T. Bosser and S. B. Taieb (2023) On the predictive accuracy of neural temporal point process models for continuous-time event data. Transactions on Machine Learning Research.
- [6] A. Boyd, Y. Chang, S. Mandt, and P. Smyth (2023) Probabilistic querying of continuous-time event sequences. In International Conference on Artificial Intelligence and Statistics.
- [7] Y. Chang, A. J. Boyd, C. Xiao, T. Kass-Hout, P. Bhatia, P. Smyth, and A. Warrington (2025) Deep continuous-time state-space models for marked event sequences. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- [8] C. Chen, H. Geng, N. Yang, X. Yang, and J. Yan (2024) Easydgl: encode, train and interpret for continuous-time dynamic graph learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(12), pp. 10845–10862.
- [9] J. Chen, Y. Zhao, J. Yu, R. Chu, J. Chen, S. Yang, X. Wang, Y. Pan, D. Zhou, H. Ling, et al. (2025) Sana-video: efficient video generation with block linear diffusion transformer. arXiv preprint arXiv:2509.24695.
- [10] D. J. Daley, D. Vere-Jones, et al. (2003) An introduction to the theory of point processes: volume i: elementary theory and methods. Springer.
- [11] P. Deshpande, K. Marathe, A. De, and S. Sarawagi (2021) Long horizon forecasting with temporal point processes. In International Conference on Web Search and Data Mining.
- [12] N. Du, H. Dai, R. Trivedi, U. Upadhyay, M. Gomez-Rodriguez, and L. Song (2016) Recurrent marked temporal point processes: embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- [13] J. Enguehard, D. Busbridge, A. Bozson, C. Woodcock, and N. Hammerla (2020) Neural temporal point processes for modelling electronic health records. In Machine Learning for Health, pp. 85–113.
- [14] X. Fan, Y. Li, L. Chen, B. Li, and S. A. Sisson (2022) Hawkes processes with stochastic exogenous effects for continuous-time interaction modelling. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(2), pp. 1848–1861.
- [15] M. Farajtabar, Y. Wang, M. Gomez-Rodriguez, S. Li, H. Zha, and L. Song (2017) Coevolve: a joint point process model for information diffusion and network evolution. Journal of Machine Learning Research 18(41), pp. 1–49.
- [16] J. Favreau, F. Lafarge, A. Bousseau, and A. Auvolat (2019) Extracting geometric structures in images with delaunay point processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(4), pp. 837–850.
- [17] M. Gao, C. Zhang, and J. Zhou (2024) Learning network-structured dependence from non-stationary multivariate point process data. IEEE Transactions on Information Theory 70(8), pp. 5935–5968.
- [18] X. Han, S. Kumar, and Y. Tsvetkov (2023) Ssd-lm: semi-autoregressive simplex-based diffusion language model for text generation and modular control. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- [19] M. Havasi, B. Karrer, I. Gat, and R. T. Chen (2025) Edit flows: variable length discrete flow matching with sequence-level edit operations. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- [20] A. G. Hawkes (1971) Spectra of some self-exciting and mutually exciting point processes. Biometrika 58(1), pp. 83–90.
- [21] J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems.
- [22] E. Hoogeboom, D. Nielsen, P. Jaini, P. Forré, and M. Welling (2021) Argmax flows and multinomial diffusion: learning categorical distributions. Advances in Neural Information Processing Systems.
- [23] V. Isham and M. Westcott (1979) A self-correcting point process. Stochastic processes and their applications 8(3), pp. 335–347.
- [24] Y. Jiang, J. Li, Y. Liu, D. Yang, F. Zhou, and Q. Kong (2025) Danmakutppbench: a multi-modal benchmark for temporal point process modeling and understanding. Advances in Neural Information Processing Systems.
- [25] G. Kerrigan, K. Nelson, and P. Smyth (2024) EventFlow: forecasting continuous-time event data with flow matching. arXiv e-prints, pp. arXiv–2410.
- [26] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. International Conference on Learning Representations.
- [27] J. F. C. Kingman (1992) Poisson processes. Vol. 3, Clarendon Press.
- [28] S. Kumar, X. Zhang, and J. Leskovec (2019) Predicting dynamic embedding trajectory in temporal interaction networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
- [29] C. Lacoste, X. Descombes, and J. Zerubia (2005) Point processes for unsupervised line network extraction in remote sensing. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(10), pp. 1568–1579.
- [30] J. Leskovec and A. Krevl (2014) SNAP datasets: stanford large network dataset collection.
- [31] X. Li, J. Thickstun, I. Gulrajani, P. S. Liang, and T. B. Hashimoto (2022) Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems.
- [32] H. Lin, C. Tan, L. Wu, Z. Liu, Z. Gao, and S. Z. Li (2024) An extensive survey with empirical studies on deep temporal point process. IEEE Transactions on Knowledge and Data Engineering 37(4), pp. 1599–1619.
- [33] H. Lin, L. Wu, G. Zhao, L. Pai, and S. Z. Li (2022) Exploring generative neural temporal point process. Transactions on Machine Learning Research.
- [34] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le (2023) Flow matching for generative modeling. In International Conference on Learning Representations.
- [35] S. Liu and M. Hauskrecht (2021) Event outlier detection in continuous time. In International Conference on Machine Learning.
- [36] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu (2022) Dpm-solver: a fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems.
- [37] D. Lüdke, M. Biloš, O. Shchur, M. Lienen, and S. Günnemann (2023) Add and thin: diffusion for temporal point processes. Advances in Neural Information Processing Systems.
- [38] D. Lüdke, M. Lienen, M. Kollovieh, and S. Günnemann (2026) Edit-based flow matching for temporal point processes. In International Conference on Learning Representations.
- [39] D. Lüdke, E. R. Raventós, M. Kollovieh, and S. Günnemann (2025) Unlocking point processes through point set diffusion. In International Conference on Learning Representations.
- [40] C. Luo (2022) Understanding diffusion models: a unified perspective. arXiv preprint arXiv:2208.11970.
- [41] H. Mei and J. M. Eisner (2017) The neural hawkes process: a neurally self-modulating multivariate point process. Advances in Neural Information Processing Systems.
- [42] H. Mei, G. Qin, and J. Eisner (2019) Imputing missing events in continuous-time event streams. In International Conference on Machine Learning.
- [43] J. Ni, J. Li, and J. McAuley (2019) Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
- [44] S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J. Wen, and C. Li (2025) Large language diffusion models. Advances in Neural Information Processing Systems.
- [45] Y. Ogata (1981) On lewis’ simulation method for point processes. IEEE Transactions on Information Theory.
- [46] M. Ortner, X. Descombes, and J. Zerubia (2008) A marked point process of rectangles and segments for automatic analysis of digital elevation models. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(1), pp. 105–119.
- [47] A. Panos (2024) Decomposable transformer point processes. Advances in Neural Information Processing Systems.
- [48] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. NIPS-W.
- [49] J. G. Rasmussen (2018) Lecture notes: temporal point processes and the conditional intensity function. arXiv preprint arXiv:1806.00221.
- [50] S. Ren, S. Ma, X. Sun, and F. Wei (2025) Next block prediction: video generation via semi-autoregressive modeling. arXiv preprint arXiv:2502.07737.
- [51] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer (2022) High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- [52] O. Shchur, M. Biloš, and S. Günnemann (2020) Intensity-free learning of temporal point processes. In International Conference on Learning Representations.
- [53] O. Shchur, A. C. Turkmen, T. Januschowski, J. Gasthaus, and S. Günnemann (2021) Detecting anomalous event sequences with temporal point processes. Advances in Neural Information Processing Systems.
- [54] X. Shi, S. Xue, K. Wang, F. Zhou, J. Zhang, J. Zhou, C. Tan, and H. Mei (2023) Language models can improve event prediction by few-shot abductive reasoning. Advances in Neural Information Processing Systems.
- [55] D. L. Snyder and M. I. Miller (2012) Random point processes in time and space. Springer Science & Business Media.
- [56] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning.
- [57] J. Song, C. Meng, and S. Ermon (2021) Denoising diffusion implicit models. In International Conference on Learning Representations.
- [58] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2021) Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations.
- [59] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in Neural Information Processing Systems.
- [60] X. Wang, C. Xu, Y. Jin, J. Jin, H. Zhang, and Z. Deng (2026) Diffusion llms can do faster-than-ar inference via discrete diffusion forcing. International Conference on Learning Representations.
- [61] C. Whong (2014) FOILing NYC’s taxi trip data.
- [62] C. Wu, H. Zhang, S. Xue, S. Diao, Y. Fu, Z. Liu, P. Molchanov, P. Luo, S. Han, and E. Xie (2026) Fast-dllm v2: efficient block-diffusion llm. International Conference on Learning Representations.
- [63] C. Wu, H. Zhang, S. Xue, Z. Liu, S. Diao, L. Zhu, P. Luo, S. Han, and E. Xie (2026) Fast-dllm: training-free acceleration of diffusion llm by enabling kv cache and parallel decoding. International Conference on Learning Representations.
- [64] S. Xiao, J. Yan, X. Yang, H. Zha, and S. Chu (2017) Modeling the intensity function of point process via recurrent neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence.
- [65] S. Xue, X. Shi, Z. Chu, Y. Wang, F. Zhou, H. Hao, C. Jiang, C. Pan, Y. Xu, J. Y. Zhang, et al. (2024) Easytpp: towards open benchmarking the temporal point processes. International Conference on Learning Representations.
- [66] S. Xue, X. Shi, J. Zhang, and H. Mei (2022) Hypro: a hybridly normalized probabilistic model for long-horizon prediction of event sequences. Advances in Neural Information Processing Systems.
- [67] C. Yang, H. Mei, and J. Eisner (2022) Transformer embeddings of irregularly spaced events and their participants. In International Conference on Learning Representations.
- [68] Y. Yuan, J. Ding, C. Shao, D. Jin, and Y. Li (2023) Spatio-temporal diffusion point processes. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
- [69] M. Zeng, F. Regol, and M. Coates (2024) Interacting diffusion processes for event sequence forecasting. In International Conference on Machine Learning.
- [70] Q. Zhang, A. Lipani, O. Kirnap, and E. Yilmaz (2020) Self-attentive hawkes process. In International Conference on Machine Learning.
- [71] S. Zhang, C. Zhou, Y. A. Liu, P. Zhang, X. Lin, and Z. Ma (2024) Neural jump-diffusion temporal point processes. In International Conference on Machine Learning.
- [72] S. Zhang, C. Zhou, Y. Liu, P. Zhang, X. Lin, and S. Pan (2025) Conformal anomaly detection in event sequences. In International Conference on Machine Learning.
- [73] S. Zhang, C. Zhou, P. Zhang, Y. Liu, Z. Li, and H. Chen (2023) Multiple hypothesis testing for anomaly detection in multi-type event sequences. In 2023 IEEE International Conference on Data Mining.
- [74] Z. Zhang, S. Chang, Y. He, Y. Han, J. Tang, F. Wang, and B. Zhuang (2025) BlockVid: block diffusion for high-quality and consistent minute-long video generation. arXiv preprint arXiv:2511.22973.
- [75] F. Zhou, Q. Kong, J. Qiao, C. Wan, Y. Zhang, and R. Cai (2026) Advances in temporal point processes: bayesian, neural, and llm approaches. Transactions on Machine Learning Research.
- [76] K. Zhou, H. Zha, and L. Song (2013) Learning triggering kernels for multi-dimensional hawkes processes. In International Conference on Machine Learning.
- [77] W. Zhou, Z. Kang, L. Tian, J. Zhang, and Y. Liu (2025) Non-autoregressive diffusion-based temporal point processes for continuous-time long-term event prediction. Expert Systems with Applications.
- [78] H. Zhu, X. Li, P. Zhang, G. Li, J. He, H. Li, and K. Gai (2018) Learning tree-based deep model for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
- [79] S. Zuo, H. Jiang, Z. Li, T. Zhao, and H. Zha (2020) Transformer hawkes process. In International Conference on Machine Learning.

## Supplementary Material

### VII-A Derivation of NELBO for LBDTPP in Latent Space

###### Proof of [Proposition 1](https://arxiv.org/html/2606.24982#Thmproposition1).

Given the latent event sequence representation $\mathbf{z}=(\mathbf{z}^{1},\ldots,\mathbf{z}^{L})$ partitioned into $B:=L/L^{\prime}$ blocks of length $L^{\prime}$ (with padding applied if $L$ is not divisible by $L^{\prime}$), we denote the index range of the $b$-th block as $\ell_{b}+1=(b-1)L^{\prime}+1$ to $\ell_{b+1}=bL^{\prime}$, and the $b$-th block as $\mathbf{z}^{b}=(\mathbf{z}^{\ell_{b}+1},\ldots,\mathbf{z}^{\ell_{b+1}})$. We use $\mathbf{z}^{<b}=(\mathbf{z}^{1},\ldots,\mathbf{z}^{\ell_{b}})$ to denote all the historical blocks before the $b$-th block. For each $b\in[B]$, we define a forward diffusion process that gradually adds Gaussian noise to the clean block $\mathbf{z}_{0}^{b}=\mathbf{z}^{b}$:

$$
\begin{aligned}
q\left(\mathbf{z}_{1:K}^{b}\mid\mathbf{z}_{0}^{b}\right) &= \prod_{k=1}^{K} q\left(\mathbf{z}_{k}^{b}\mid\mathbf{z}_{k-1}^{b}\right),\\
q\left(\mathbf{z}_{k}^{b}\mid\mathbf{z}_{k-1}^{b}\right) &= \mathcal{N}\left(\mathbf{z}_{k}^{b};\sqrt{\alpha_{k}}\,\mathbf{z}_{k-1}^{b},(1-\alpha_{k})\mathbf{I}\right).
\end{aligned}
$$

The reverse denoising process for the $b$-th block starts from $p\left(\mathbf{z}_{K}^{b}\mid\mathbf{z}^{<b}\right)=\mathcal{N}\left(\mathbf{z}_{K}^{b};\mathbf{0},\mathbf{I}\right)$ and proceeds as follows:

$$
\begin{aligned}
p_{\theta}\left(\mathbf{z}_{0:K}^{b}\mid\mathbf{z}^{<b}\right) &= p\left(\mathbf{z}_{K}^{b}\mid\mathbf{z}^{<b}\right)\prod_{k=1}^{K} p_{\theta}\left(\mathbf{z}_{k-1}^{b}\mid\mathbf{z}_{k}^{b},\mathbf{z}^{<b}\right),\\
p_{\theta}\left(\mathbf{z}_{k-1}^{b}\mid\mathbf{z}_{k}^{b},\mathbf{z}^{<b}\right) &= \mathcal{N}\left(\mathbf{z}_{k-1}^{b};\boldsymbol{\mu}_{\theta}^{b}\left(\mathbf{z}_{k}^{b},\mathbf{z}^{<b},k\right),\sigma_{k}^{2}\mathbf{I}\right).
\end{aligned}
$$
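The forward process above can be illustrated for a single block with a short sketch. This is only an illustration under stated assumptions: the function name `forward_diffuse_block`, the toy block, and the values in `alphas` are hypothetical and are not taken from the paper or its released code.

```python
import math
import random


def forward_diffuse_block(block, alphas, rng):
    """Return the trajectory [z_0, z_1, ..., z_K] for one latent block.

    block  : list of L' latent events, each a list of D floats (clean block z_0^b).
    alphas : list of K values alpha_k in (0, 1); a hypothetical noise schedule.
    rng    : random.Random instance used to draw the Gaussian noise.
    """
    trajectory = [block]
    current = block
    for alpha in alphas:
        # q(z_k | z_{k-1}) = N(sqrt(alpha_k) * z_{k-1}, (1 - alpha_k) * I)
        scale = math.sqrt(alpha)
        noise_std = math.sqrt(1.0 - alpha)
        current = [
            [scale * x + noise_std * rng.gauss(0.0, 1.0) for x in event]
            for event in current
        ]
        trajectory.append(current)
    return trajectory


rng = random.Random(0)
clean_block = [[0.5, -1.0], [1.5, 0.2], [0.0, 0.7]]  # L' = 3 events, D = 2 (toy values)
alphas = [0.99, 0.95, 0.90, 0.80]                    # K = 4, illustrative values only
trajectory = forward_diffuse_block(clean_block, alphas, rng)
print(len(trajectory), "noise levels, including the clean block")
```

All events in the block are noised together at each step, which is what allows the reverse process to denoise a whole block in parallel while still conditioning on the clean history $\mathbf{z}^{<b}$.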
Then the NELBO of our model is obtained as follows:

$$
\begin{aligned}
-\log p_{\theta}(\mathbf{z})
&= -\sum_{b=1}^{B}\log p_{\theta}\left(\mathbf{z}^{b}\mid\mathbf{z}^{<b}\right)\\
&= -\sum_{b=1}^{B}\log\int p_{\theta}\left(\mathbf{z}_{0:K}^{b}\mid\mathbf{z}^{<b}\right)\mathrm{d}\mathbf{z}_{1:K}^{b}\\
&= -\sum_{b=1}^{B}\log\int\frac{p_{\theta}\left(\mathbf{z}_{0:K}^{b}\mid\mathbf{z}^{<b}\right)q\left(\mathbf{z}_{1:K}^{b}\mid\mathbf{z}_{0}^{b}\right)}{q\left(\mathbf{z}_{1:K}^{b}\mid\mathbf{z}_{0}^{b}\right)}\mathrm{d}\mathbf{z}_{1:K}^{b}\\
&= -\sum_{b=1}^{B}\log\mathbb{E}_{q\left(\mathbf{z}_{1:K}^{b}\mid\mathbf{z}_{0}^{b}\right)}\left[\frac{p_{\theta}\left(\mathbf{z}_{0:K}^{b}\mid\mathbf{z}^{<b}\right)}{q\left(\mathbf{z}_{1:K}^{b}\mid\mathbf{z}_{0}^{b}\right)}\right]\\
&\leq -\sum_{b=1}^{B}\mathbb{E}_{q\left(\mathbf{z}_{1:K}^{b}\mid\mathbf{z}_{0}^{b}\right)}\left[\log\frac{p_{\theta}\left(\mathbf{z}_{0:K}^{b}\mid\mathbf{z}^{<b}\right)}{q\left(\mathbf{z}_{1:K}^{b}\mid\mathbf{z}_{0}^{b}\right)}\right]\\
&= -\sum_{b=1}^{B}\mathbb{E}_{q\left(\mathbf{z}_{1:K}^{b}\mid\mathbf{z}_{0}^{b}\right)}\left[\log\frac{p\left(\mathbf{z}_{K}^{b}\mid\mathbf{z}^{<b}\right)\prod_{k=1}^{K}p_{\theta}\left(\mathbf{z}_{k-1}^{b}\mid\mathbf{z}_{k}^{b},\mathbf{z}^{<b}\right)}{\prod_{k=1}^{K}q\left(\mathbf{z}_{k}^{b}\mid\mathbf{z}_{k-1}^{b}\right)}\right]\\
&= -\sum_{b=1}^{B}\mathbb{E}_{q\left(\mathbf{z}_{1:K}^{b}\mid\mathbf{z}_{0}^{b}\right)}\left[\log\frac{p\left(\mathbf{z}_{K}^{b}\mid\mathbf{z}^{<b}\right)p_{\theta}\left(\mathbf{z}_{0}^{b}\mid\mathbf{z}_{1}^{b},\mathbf{z}^{<b}\right)}{q\left(\mathbf{z}_{1}^{b}\mid\mathbf{z}_{0}^{b}\right)}
+\log\prod_{k=2}^{K}\frac{p_{\theta}\left(\mathbf{z}_{k-1}^{b}\mid\mathbf{z}_{k}^{b},\mathbf{z}^{<b}\right)}{q\left(\mathbf{z}_{k}^{b}\mid\mathbf{z}_{k-1}^{b},\mathbf{z}_{0}^{b}\right)}\right]\\
&= -\sum_{b=1}^{B}\mathbb{E}_{q\left(\mathbf{z}_{1:K}^{b}\mid\mathbf{z}_{0}^{b}\right)}\left[\log\frac{p\left(\mathbf{z}_{K}^{b}\mid\mathbf{z}^{<b}\right)p_{\theta}\left(\mathbf{z}_{0}^{b}\mid\mathbf{z}_{1}^{b},\mathbf{z}^{<b}\right)}{q\left(\mathbf{z}_{1}^{b}\mid\mathbf{z}_{0}^{b}\right)}
+\log\prod_{k=2}^{K}\frac{p_{\theta}\left(\mathbf{z}_{k-1}^{b}\mid\mathbf{z}_{k}^{b},\mathbf{z}^{<b}\right)}{\dfrac{q\left(\mathbf{z}_{k-1}^{b}\mid\mathbf{z}_{k}^{b},\mathbf{z}_{0}^{b}\right)q\left(\mathbf{z}_{k}^{b}\mid\mathbf{z}_{0}^{b}\right)}{q\left(\mathbf{z}_{k-1}^{b}\mid\mathbf{z}_{0}^{b}\right)}}\right]\\
&= \sum_{b=1}^{B}\Bigg[\underbrace{-\mathbb{E}_{q\left(\mathbf{z}_{1}^{b}\mid\mathbf{z}_{0}^{b}\right)}\left[\log p_{\theta}\left(\mathbf{z}_{0}^{b}\mid\mathbf{z}_{1}^{b},\mathbf{z}^{<b}\right)\right]}_{\text{reconstruction term}}
+\underbrace{D_{\mathrm{KL}}\left(q\left(\mathbf{z}_{K}^{b}\mid\mathbf{z}_{0}^{b}\right)\,\|\,p\left(\mathbf{z}_{K}^{b}\mid\mathbf{z}^{<b}\right)\right)}_{\text{prior matching term}}\\
&\qquad\quad+\underbrace{\sum_{k=2}^{K}\mathbb{E}_{q\left(\mathbf{z}_{k}^{b}\mid\mathbf{z}_{0}^{b}\right)}\left[D_{\mathrm{KL}}\left(q\left(\mathbf{z}_{k-1}^{b}\mid\mathbf{z}_{k}^{b},\mathbf{z}_{0}^{b}\right)\,\|\,p_{\theta}\left(\mathbf{z}_{k-1}^{b}\mid\mathbf{z}_{k}^{b},\mathbf{z}^{<b}\right)\right)\right]}_{\text{denoising matching term}}\Bigg]\\
&=: \sum_{b=1}^{B}\mathcal{J}_{b}\left(\mathbf{z}^{b},\mathbf{z}^{<b};\theta\right).
\end{aligned}
$$
In the above derivations, we use Jensen’s inequality together with the Markov properties of the forward and reverse processes (Eqs. ([14](https://arxiv.org/html/2606.24982#S4.E14)) and ([18](https://arxiv.org/html/2606.24982#S4.E18))), closely following prior work \[[21](https://arxiv.org/html/2606.24982#bib.bib24),[40](https://arxiv.org/html/2606.24982#bib.bib37),[1](https://arxiv.org/html/2606.24982#bib.bib18)\]. We can further leverage the Gaussian form of the transition distributions (Eqs. ([15](https://arxiv.org/html/2606.24982#S4.E15)) and ([19](https://arxiv.org/html/2606.24982#S4.E19))) to simplify this NELBO into the surrogate latent block diffusion loss function $\mathcal{L}_{\text{LBD}}(\mathbf{z};\theta)$ in Eq. ([22](https://arxiv.org/html/2606.24982#S4.E22)). Importantly, the Kullback–Leibler (KL) divergence in the denoising matching term admits a closed-form Gaussian expression and can be reduced, up to constants and weighting coefficients, to a mean squared error (MSE)-based denoising loss. We omit the detailed simplification steps here and refer the reader to \[[40](https://arxiv.org/html/2606.24982#bib.bib37)\]; in our setting, the same derivation applies after carrying out the simplification blockwise. ∎
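The reduction of the Gaussian KL term to a squared-error term can be checked numerically. The sketch below uses the standard closed form for the KL divergence between diagonal Gaussians; the helper names and the toy means and standard deviations are hypothetical, and this is not the paper's loss implementation. It shows that the KL differs from a variance-weighted squared error between the means only by a quantity that does not depend on the model mean, and that the two coincide when both Gaussians share the same variance.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2))), summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total


def weighted_squared_error(mu_q, mu_p, sigma_p):
    """Mean-dependent part of the KL: sum_i (mu_q_i - mu_p_i)^2 / (2 * sigma_p_i^2)."""
    return sum((mq - mp) ** 2 / (2.0 * sp ** 2) for mq, mp, sp in zip(mu_q, mu_p, sigma_p))


mu_q = [0.2, -0.4, 1.0]       # toy posterior mean
sigma_q = [0.5, 0.5, 0.5]     # toy posterior standard deviation
sigma_p = [0.8, 0.8, 0.8]     # toy model standard deviation (plays the role of sigma_k)

# Two different candidate model means: the gap KL - squared error is the same constant.
gap_a = kl_diag_gaussians(mu_q, sigma_q, [0.0, 0.0, 0.0], sigma_p) \
    - weighted_squared_error(mu_q, [0.0, 0.0, 0.0], sigma_p)
gap_b = kl_diag_gaussians(mu_q, sigma_q, [1.0, -2.0, 0.5], sigma_p) \
    - weighted_squared_error(mu_q, [1.0, -2.0, 0.5], sigma_p)

# With matching variances the constant vanishes and the KL is exactly the squared error.
kl_equal = kl_diag_gaussians(mu_q, sigma_q, [0.0, 0.0, 0.0], sigma_q)
mse_equal = weighted_squared_error(mu_q, [0.0, 0.0, 0.0], sigma_q)
print(gap_a, gap_b, kl_equal, mse_equal)
```

Because the gap is constant in the model mean, minimizing the KL with respect to the model mean is equivalent to minimizing the weighted squared error, which is the sense in which the denoising matching term reduces to an MSE-type loss.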

### VII-B Proof of the Generation-Error Accumulation Results

We provide the detailed proof of [Theorem 1](https://arxiv.org/html/2606.24982#Thmtheorem1) for unconditional generation. The kernels written with “$\mid$” below are internal transition kernels from the chain-rule factorization of an unconditional sequence distribution; no external observed sequence is provided. Recall that $d_{r}(\mathbf{u},\mathbf{v})=\sum_{\ell=1}^{r}\|\mathbf{u}^{\ell}-\mathbf{v}^{\ell}\|_{2}$, and $W_{1}^{r}$ is the Wasserstein-1 distance induced by $d_{r}$.

###### Proof of [Theorem 1](https://arxiv.org/html/2606.24982#Thmtheorem1).

We first consider event-wise autoregressive generation. Let $\mathsf{P}_{1:\ell}$ be the true joint distribution of the first $\ell$ generated latent events, and let $\mathsf{Q}_{1:\ell}^{\mathrm{AR}}$ be the corresponding joint distribution produced by the event-wise autoregressive generator. Define

$$
D_{\ell}=W_{1}^{\ell}\left(\mathsf{P}_{1:\ell},\mathsf{Q}_{1:\ell}^{\mathrm{AR}}\right),\qquad D_{0}=0. \tag{39}
$$

Here $D_{\ell}$ measures the accumulated generation error up to event $\ell$ through the distributional discrepancy between the true and generated length-$\ell$ prefixes. Fix $\ell\geq 1$. For any $\eta>0$, choose a coupling $\gamma_{\ell-1}$ of the true and generated length-$(\ell-1)$ event prefixes, namely $\mathsf{P}_{1:\ell-1}$ and $\mathsf{Q}_{1:\ell-1}^{\mathrm{AR}}$, such that

$$
\mathbb{E}_{(\mathbf{h},\widehat{\mathbf{h}})\sim\gamma_{\ell-1}}\left[d_{\ell-1}(\mathbf{h},\widehat{\mathbf{h}})\right]\leq D_{\ell-1}+\eta. \tag{40}
$$

For each paired event prefix $(\mathbf{h},\widehat{\mathbf{h}})$ with $\mathbf{h},\widehat{\mathbf{h}}\in\mathbb{R}^{(\ell-1)\times D}$, the next true latent event is sampled from the true unconditional transition kernel $\mathsf{P}_{\ell}(\cdot\mid\mathbf{h})$, while the next generated latent event is sampled from the learned autoregressive transition kernel $\mathsf{Q}_{\ell}^{\mathrm{AR}}(\cdot\mid\widehat{\mathbf{h}})$. By the triangle inequality for $W_{1}^{1}$ and [Assumption 1](https://arxiv.org/html/2606.24982#Thmassumption1),

$$
\begin{aligned}
W_{1}^{1}\left(\mathsf{P}_{\ell}(\cdot\mid\mathbf{h}),\mathsf{Q}_{\ell}^{\mathrm{AR}}(\cdot\mid\widehat{\mathbf{h}})\right)
&\leq W_{1}^{1}\left(\mathsf{P}_{\ell}(\cdot\mid\mathbf{h}),\mathsf{P}_{\ell}(\cdot\mid\widehat{\mathbf{h}})\right)+W_{1}^{1}\left(\mathsf{P}_{\ell}(\cdot\mid\widehat{\mathbf{h}}),\mathsf{Q}_{\ell}^{\mathrm{AR}}(\cdot\mid\widehat{\mathbf{h}})\right)\\
&\leq \rho_{\mathrm{AR}}\,d_{\ell-1}(\mathbf{h},\widehat{\mathbf{h}})+\varepsilon_{\mathrm{AR}}.
\end{aligned}
\tag{41}
$$

For each $(\mathbf{h},\widehat{\mathbf{h}})$, choose an optimal, or arbitrarily close to optimal, coupling of $\mathsf{P}_{\ell}(\cdot\mid\mathbf{h})$ and $\mathsf{Q}_{\ell}^{\mathrm{AR}}(\cdot\mid\widehat{\mathbf{h}})$. Combining it with $\gamma_{\ell-1}$ gives a valid coupling of $\mathsf{P}_{1:\ell}$ and $\mathsf{Q}_{1:\ell}^{\mathrm{AR}}$. Under this coupling, the expected length-$\ell$ sequence distance is bounded by

$$
\begin{aligned}
D_{\ell}
&\leq \mathbb{E}\left[d_{\ell-1}(\mathbf{h},\widehat{\mathbf{h}})\right]+\mathbb{E}\left[W_{1}^{1}\left(\mathsf{P}_{\ell}(\cdot\mid\mathbf{h}),\mathsf{Q}_{\ell}^{\mathrm{AR}}(\cdot\mid\widehat{\mathbf{h}})\right)\right]\\
&\leq (1+\rho_{\mathrm{AR}})\,\mathbb{E}\left[d_{\ell-1}(\mathbf{h},\widehat{\mathbf{h}})\right]+\varepsilon_{\mathrm{AR}}\\
&\leq (1+\rho_{\mathrm{AR}})(D_{\ell-1}+\eta)+\varepsilon_{\mathrm{AR}}.
\end{aligned}
\tag{42}
$$

Letting $\eta\downarrow 0$ yields the recurrence

$$
D_{\ell}\leq(1+\rho_{\mathrm{AR}})D_{\ell-1}+\varepsilon_{\mathrm{AR}}. \tag{43}
$$

Unrolling it from $D_{0}=0$ gives

$$
D_{L}\leq\varepsilon_{\mathrm{AR}}\sum_{j=0}^{L-1}(1+\rho_{\mathrm{AR}})^{j}=\varepsilon_{\mathrm{AR}}A_{L}(\rho_{\mathrm{AR}}), \tag{44}
$$

which proves Eq. ([35](https://arxiv.org/html/2606.24982#S4.E35)).
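The unrolling from the recurrence in Eq. (43) to the closed form in Eq. (44) can be checked with a few lines of code. This is a minimal sketch in which the recurrence is taken to hold with equality (the worst case); the function names and the values of `rho`, `eps`, and `length` are hypothetical and chosen only for illustration.

```python
import math


def geometric_factor(n, rho):
    """A_n(rho) = sum_{j=0}^{n-1} (1 + rho)^j."""
    return sum((1.0 + rho) ** j for j in range(n))


def unroll_recurrence(n, rho, eps):
    """Iterate D_l = (1 + rho) * D_{l-1} + eps starting from D_0 = 0."""
    d = 0.0
    for _ in range(n):
        d = (1.0 + rho) * d + eps
    return d


rho, eps, length = 0.1, 0.05, 20   # illustrative values, not from the paper
iterated = unroll_recurrence(length, rho, eps)
closed_form = eps * geometric_factor(length, rho)
print(math.isclose(iterated, closed_form, rel_tol=1e-9))
```

The same check applies to the block-wise recurrence by replacing `length` with the number of blocks and `rho`, `eps` with their block-level counterparts.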

We next prove the block-wise result for unconditional generation. Let $\mathsf{P}_{1:b}^{\mathrm{BL}}$ and $\mathsf{Q}_{1:b}^{\mathrm{BL}}$ denote the true and generated joint distributions of the first $b$ latent blocks, equivalently the first $bL^{\prime}$ latent events. Define

$$
E_{b}=W_{1}^{bL^{\prime}}\left(\mathsf{P}_{1:b}^{\mathrm{BL}},\mathsf{Q}_{1:b}^{\mathrm{BL}}\right),\qquad E_{0}=0. \tag{45}
$$

Here $E_{b}$ quantifies the corresponding generation error after $b$ generated blocks. Repeating the preceding argument at the block level, for any coupling of previous block prefixes $(\mathbf{g},\mathbf{g}^{\prime})$ with $\mathbf{g},\mathbf{g}^{\prime}\in\mathbb{R}^{(b-1)L^{\prime}\times D}$, the triangle inequality and [Assumption 1](https://arxiv.org/html/2606.24982#Thmassumption1) give

$$
\begin{aligned}
W_{1}^{L^{\prime}}\left(\mathsf{P}_{b}^{\mathrm{BL}}(\cdot\mid\mathbf{g}),\mathsf{Q}_{b}^{\mathrm{BL}}(\cdot\mid\mathbf{g}^{\prime})\right)
&\leq W_{1}^{L^{\prime}}\left(\mathsf{P}_{b}^{\mathrm{BL}}(\cdot\mid\mathbf{g}),\mathsf{P}_{b}^{\mathrm{BL}}(\cdot\mid\mathbf{g}^{\prime})\right)
+W_{1}^{L^{\prime}}\left(\mathsf{P}_{b}^{\mathrm{BL}}(\cdot\mid\mathbf{g}^{\prime}),\mathsf{Q}_{b}^{\mathrm{BL}}(\cdot\mid\mathbf{g}^{\prime})\right)\\
&\leq \rho_{\mathrm{BL}}\,d_{(b-1)L^{\prime}}(\mathbf{g},\mathbf{g}^{\prime})+\varepsilon_{\mathrm{BL}}.
\end{aligned}
\tag{46}
$$

Since the sequence metric is additive across blocks, this yields

$$
E_{b}\leq(1+\rho_{\mathrm{BL}})E_{b-1}+\varepsilon_{\mathrm{BL}}. \tag{47}
$$

Unrolling the recurrence from $E_{0}=0$ gives

$$
E_{B}\leq\varepsilon_{\mathrm{BL}}\sum_{j=0}^{B-1}(1+\rho_{\mathrm{BL}})^{j}=\varepsilon_{\mathrm{BL}}A_{B}(\rho_{\mathrm{BL}}), \tag{48}
$$

which proves Eq. ([36](https://arxiv.org/html/2606.24982#S4.E36)).

It remains to prove Eq. ([37](https://arxiv.org/html/2606.24982#S4.E37)). Since $A_{n}(\rho)=\sum_{j=0}^{n-1}(1+\rho)^{j}$ is nondecreasing in $\rho$, the conditions $\varepsilon_{\mathrm{BL}}\leq L^{\prime}\varepsilon_{\mathrm{AR}}$ and $\rho_{\mathrm{BL}}\leq\rho_{\mathrm{AR}}=\rho$ imply that the block-wise upper bound is at most $L^{\prime}\varepsilon_{\mathrm{AR}}A_{B}(\rho)$, whereas the event-wise upper bound is $\varepsilon_{\mathrm{AR}}A_{L}(\rho)$. If $\rho=0$, then $A_{B}(0)=B$ and $A_{L}(0)=L=BL^{\prime}$, so $L^{\prime}A_{B}(0)/A_{L}(0)=1$. If $\rho>0$, set $a=1+\rho>1$. Since $L=BL^{\prime}$,

$$
\frac{A_{L}(\rho)}{A_{B}(\rho)}=\frac{a^{BL^{\prime}}-1}{a^{B}-1}=1+a^{B}+a^{2B}+\cdots+a^{(L^{\prime}-1)B}\geq L^{\prime}. \tag{49}
$$

Therefore, $L^{\prime}A_{B}(\rho)/A_{L}(\rho)\leq 1$, with strict inequality whenever $\rho>0$ and $L^{\prime}>1$. This completes the proof. ∎
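As a small worked instance of Eq. (49), consider the illustrative choice $B=2$ and $L^{\prime}=2$, so that $L=4$. These values are chosen only for exposition and are not taken from the paper's experiments. With $a=1+\rho$:

```latex
\frac{A_{4}(\rho)}{A_{2}(\rho)}
  = \frac{a^{4}-1}{a^{2}-1}
  = 1 + a^{2} \;\geq\; 2 = L^{\prime},
\qquad
\frac{L^{\prime}A_{B}(\rho)}{A_{L}(\rho)}
  = \frac{2}{1+a^{2}}
  = \frac{2}{1+(1+\rho)^{2}} .
```

At $\rho=0$ the ratio equals $1$, and at $\rho=0.5$ it equals $2/3.25\approx 0.615$, so in this toy setting the block-wise bound is strictly smaller than the event-wise bound as soon as $\rho>0$.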

###### Proof of Eq. ([38](https://arxiv.org/html/2606.24982#S4.E38)).

Let $\mu$ and $\nu$ be two latent sequence distributions on $\mathbb{R}^{L\times D}$, and let $\pi$ be any coupling of $\mu$ and $\nu$. If $(\mathbf{U},\mathbf{V})\sim\pi$, then $\big(g_{\phi}(\mathbf{U}),g_{\phi}(\mathbf{V})\big)$ is a coupling of the push-forward distributions $(g_{\phi})_{\#}\mu$ and $(g_{\phi})_{\#}\nu$. If $g_{\phi}$ is $L_{\mathrm{dec}}$-Lipschitz,

$$
\mathbb{E}_{\pi}\left[\Delta_{L}\left(g_{\phi}(\mathbf{U}),g_{\phi}(\mathbf{V})\right)\right]\leq L_{\mathrm{dec}}\,\mathbb{E}_{\pi}\left[d_{L}(\mathbf{U},\mathbf{V})\right]. \tag{50}
$$

Taking the infimum over all couplings $\pi$ of $\mu$ and $\nu$ gives

$$
W_{\Delta_{L}}\left((g_{\phi})_{\#}\mu,(g_{\phi})_{\#}\nu\right)\leq L_{\mathrm{dec}}\,W_{1}^{L}(\mu,\nu). \tag{51}
$$

Substituting $\mu=\mathsf{P}_{1:L}$ and $\nu=\mathsf{Q}_{1:L}$ proves Eq. ([38](https://arxiv.org/html/2606.24982#S4.E38)). ∎
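The push-forward bound can be illustrated in one dimension, where the Wasserstein-1 distance between two equal-size empirical distributions is the mean absolute difference of the sorted samples. The sketch below is a toy check only: `toy_decoder` is a hypothetical stand-in for $g_{\phi}$ (a scaled `tanh`, which is 2-Lipschitz), both the latent and the decoded metric are taken to be the absolute difference, and the sample parameters are arbitrary.

```python
import math
import random


def w1_empirical_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical distributions."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    # In 1-D the optimal coupling matches sorted samples in order.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def toy_decoder(x, lipschitz=2.0):
    """Hypothetical decoder: x -> lipschitz * tanh(x); tanh itself is 1-Lipschitz."""
    return lipschitz * math.tanh(x)


rng = random.Random(0)
latent_true = [rng.gauss(0.0, 1.0) for _ in range(500)]   # stand-in for samples of mu
latent_gen = [rng.gauss(0.3, 1.2) for _ in range(500)]    # stand-in for samples of nu

w_latent = w1_empirical_1d(latent_true, latent_gen)
w_decoded = w1_empirical_1d(
    [toy_decoder(x) for x in latent_true],
    [toy_decoder(x) for x in latent_gen],
)
print(w_decoded <= 2.0 * w_latent)
```

The inequality holds for the empirical distributions themselves, not just in expectation, because the bound applies to any pair of distributions pushed through a Lipschitz map.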
