easyaligner is an open-source forced alignment library with GPU acceleration and flexible text normalization that works with all wav2vec2 models on Hugging Face Hub. It addresses practical workflows like handling partial transcripts, irrelevant speech segments, and long audio without chunking while preserving original text formatting.
I have built [`easyaligner`](https://kb-labb.github.io/easyaligner/), a forced alignment library designed to be performant and easy to use.

Having worked with preprocessing hundreds of thousands of hours of audio and text for training speech-to-text models, I found that the available open source forced alignment libraries often missed some convenience features. For our purposes it was, in particular, important for the tooling to be able to:

* Handle cases where the transcript does not cover all of the spoken content in the audio (by automatically detecting the relevant audio region).
* Handle some irrelevant speech at the start/end of audio segments to be aligned.
* Ideally handle long segments of audio and text without the need for chunking.
* Normalize ground-truth texts for better alignment quality, while maintaining a mapping between the normalized text and the original text, so that the original text's formatting can be recovered after alignment.

`easyaligner` is an attempt to package all of these workflow improvements into a forced alignment library.

The documentation has tutorials for different [alignment scenarios](https://kb-labb.github.io/easyaligner/get-started/overview.html#tutorials), and for [custom text processing](https://kb-labb.github.io/easyaligner/get-started/text_processing.html). The aligned outputs can be segmented at any level of granularity (sentence, paragraph, etc.), while preserving the original text’s formatting.

The forced alignment backend uses [PyTorch's forced alignment API](https://docs.pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html) with a GPU-based implementation of the Viterbi algorithm. It's both fast and memory-efficient, handling hours of audio/text in one pass without the need to chunk the audio.

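
For readers unfamiliar with what the Viterbi step does, here is a minimal pure-Python sketch of CTC forced alignment. It is an illustration only, not `easyaligner`'s code and not the GPU implementation mentioned above (in practice you would call something like `torchaudio.functional.forced_align`). The function name `ctc_forced_align`, the toy 3-symbol vocabulary, and the choice of `blank=0` are all assumptions made for this example.

```python
import math

def ctc_forced_align(log_probs, targets, blank=0):
    """Return the most likely per-frame label path that spells `targets`.

    log_probs: list of T frames, each a list of V log-probabilities.
    targets:   list of token ids (without blanks).
    """
    if not targets:
        return [blank] * len(log_probs)
    # Interleave blanks: [blank, t1, blank, t2, ..., tL, blank]
    ext = [blank]
    for tok in targets:
        ext += [tok, blank]
    T, S = len(log_probs), len(ext)
    NEG = -math.inf
    score = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    # A path may start on the leading blank or on the first token
    score[0][0] = log_probs[0][blank]
    score[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            # Predecessors: stay, advance by one, or skip over a blank
            # (skipping is only allowed between two different tokens)
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            best = max(cands, key=lambda p: score[t - 1][p])
            if score[t - 1][best] > NEG:
                score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]
                back[t][s] = best
    # A valid path ends on the last token or on the trailing blank
    s = S - 1 if score[T - 1][S - 1] >= score[T - 1][S - 2] else S - 2
    if score[T - 1][s] == NEG:
        raise ValueError("not enough frames to fit the target sequence")
    # Follow the backpointers from the last frame to the first
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = back[t][s]
    return path[::-1]

# Toy emissions over the vocabulary {0: blank, 1: "a", 2: "b"}
probs = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
]
emissions = [[math.log(p) for p in frame] for frame in probs]
print(ctc_forced_align(emissions, [1, 2]))  # [0, 1, 1, 0, 2]
```

Each frame gets exactly one label, so consecutive frames with the same token id give that token's start and end time once multiplied by the model's frame duration. The table here is O(T x S) in memory, which is exactly the cost a GPU implementation has to manage for hours-long inputs.
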
I've adapted the API to support emission extraction from all wav2vec2 models on Hugging Face Hub. You can force-align audio and text in any language, as long as there's a w2v2 model on HF Hub that can transcribe the language.

`easyaligner` supports aligning both from ground-truth transcripts and from ASR model outputs. Check out its companion library [`easytranscriber`](https://kb-labb.github.io/easytranscriber/) for an example where `easyaligner` is used as a backend to align ASR outputs. It works the same way as `WhisperX`, but transcribes [35% to 102% faster](https://kb-labb.github.io/easytranscriber/benchmarks.html), depending on the hardware.

Documentation: [https://kb-labb.github.io/easyaligner/](https://kb-labb.github.io/easyaligner/)

Source code on GitHub (MIT licensed): [https://github.com/kb-labb/easyaligner](https://github.com/kb-labb/easyaligner)
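
To make the "normalize but keep a mapping to the original text" idea from the feature list concrete, here is a rough sketch of how such a mapping can work. This is not `easyaligner`'s API: the function names `normalize_with_mapping` and `original_span`, and the specific normalization rules (lowercase, drop punctuation, collapse whitespace), are assumptions chosen for illustration. See the text processing tutorial in the docs for the library's actual approach.

```python
def normalize_with_mapping(text):
    """Lowercase, drop punctuation, collapse whitespace.

    Returns (normalized, index_map), where index_map[i] is the index in
    `text` of the character that produced normalized[i].
    """
    chars, index_map = [], []
    for i, ch in enumerate(text):
        if ch.isspace():
            # Collapse whitespace runs into one space; skip leading whitespace
            if chars and chars[-1] != " ":
                chars.append(" ")
                index_map.append(i)
        elif ch.isalnum() or ch == "'":
            for low in ch.lower():  # lower() can expand to several characters
                chars.append(low)
                index_map.append(i)
        # Anything else (punctuation, symbols) is dropped
    if chars and chars[-1] == " ":
        chars.pop()
        index_map.pop()
    return "".join(chars), index_map


def original_span(text, index_map, start, end):
    """Map a normalized [start, end) span back to the original text,
    keeping trailing punctuation attached to the last word."""
    o_start = index_map[start]
    o_end = index_map[end - 1] + 1
    # Extend over characters that normalization dropped (e.g. "!" or "?")
    while (o_end < len(text)
           and not text[o_end].isspace()
           and not text[o_end].isalnum()):
        o_end += 1
    return text[o_start:o_end]


text = "Hello, World!  How are you?"
norm, idx = normalize_with_mapping(text)
print(norm)                               # hello world how are you
print(original_span(text, idx, 6, 11))    # World!
print(original_span(text, idx, 0, 11))    # Hello, World!
```

The aligner only ever sees `norm`, so timestamps come back attached to normalized character spans. Passing those spans through `original_span` recovers the original casing and punctuation for whatever segment level (word, sentence, paragraph) you want to output.
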
This paper documents the Montreal Forced Aligner 3.0, a widely used open-source tool for forced alignment, achieving state-of-the-art performance across English, Japanese, and Korean with mean boundary errors below 15 ms.
WavAlign introduces a modality-aware adaptive post-training method that uses constrained preference updates and explicit anchoring to boost both semantic quality and speech expressiveness in end-to-end spoken dialogue models.
PolyAlign is a distribution-aware alignment framework that aligns language models to context-specific human response distributions rather than a single global style, improving naturalness and faithfulness across bilingual settings.
A novel method for multilingual word-level forced alignment combines self-supervised representations from MMS and a phoneme boundary detector with a learned dynamic programming decoder, outperforming existing aligners on English and unseen languages without further training.
FAST-GOAL is a fine-tuning method that enhances CLIP's ability to align global and local semantics in images and lengthy text, introducing FLISM and TSL modules and the GLIT100k dataset. It achieves improvements on long caption datasets.