Released a free 9.8M doc Indic multilingual corpus — Hindi, Bengali, Tamil, Telugu + 7 more (CC0, HuggingFace) [P]
Summary
A free, CC0-licensed multilingual Indic corpus of 9.8 million documents (11 languages, approximately 8.4 billion tokens) has been released on Hugging Face for multilingual research.
Similar Articles
@cognitivelab_ai: Launching NayanaOCR Corpus 1M+ Document images across 22 languages Largest open source synthetic > multilingual > multi…
Launch of NayanaOCR Corpus, an open-source synthetic document corpus with over 1 million images across 22 languages, designed for multilingual, multimodal, and multitask OCR research.
1M datasets on HF!
Celebrating a community milestone of 1 million datasets on Hugging Face, highlighting the collaborative effort to advance AI through open data.
ForMaT: Dataset for Visually-Grounded Multilingual PDF Translation
This paper introduces ForMaT, a parallel corpus of 3,956 PDFs across 15 language pairs designed for visually-grounded multilingual translation, preserving layout metadata to benchmark layout-aware MT systems.
huggingface/transformers Release 5.8.0
Hugging Face has released version 5.8.0 of the Transformers library, a widely used open-source framework for natural language processing and deep learning.
@tom_doerr: Multilingual NLP library supporting 130 languages https://github.com/hankcs/HanLP
HanLP is an open-source multilingual NLP library supporting 130 languages with 10 joint tasks, built on PyTorch and TensorFlow 2.x.