Tag
Introduces TUBE, a variational upper bound on log-likelihood for discrete diffusion language models, enabling better evaluation and revealing that masked diffusion models still underperform autoregressive models.
This paper addresses the degradation of likelihood-based machine-generated text detectors by identifying a Simpson's paradox in token-score aggregation. It proposes a learned local calibration step that significantly improves detection performance across various models and datasets.
This paper proposes using Annealed Importance Sampling to estimate log-likelihoods for decoder-based generative models (VAEs, GANs, etc.), whose likelihoods are intractable to compute exactly. The authors validate their method and provide evaluation code to analyze model performance, overfitting, and mode coverage.