Released a free 9.8M doc Indic multilingual corpus — Hindi, Bengali, Tamil, Telugu + 7 more (CC0, HuggingFace) [P]

Reddit r/MachineLearning Tools

Summary

Released a free 9.8 million document multilingual Indic corpus (11 languages, CC0 license) on HuggingFace, containing approximately 8.4 billion tokens, built for multilingual research.

Built this over the past few weeks as part of a multilingual research project. Figured I'd share it here. Check it out! ~9.8M web documents across 11 languages: hi, bn, ta, te, mr, gu, kn, ml, pa, ur, en. ~8.4B tokens. CC0 license. 🤗 [https://huggingface.co/datasets/AM0908/indic-hplt-v1](https://huggingface.co/datasets/AM0908/indic-hplt-v1)
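
For anyone who wants to poke at it without downloading everything, here is a minimal sketch using the `datasets` library in streaming mode. The repo ID comes from the link above; everything else is an assumption, since the post does not describe the schema: the `train` split, the default config, and the `lang` column name are all hypothetical, so check the dataset card for the real names before relying on this.

```python
from collections import Counter
from itertools import islice

# Language codes as listed in the post.
LANGS = ["hi", "bn", "ta", "te", "mr", "gu", "kn", "ml", "pa", "ur", "en"]


def stream_corpus(repo_id="AM0908/indic-hplt-v1", split="train"):
    """Open the corpus as a stream instead of downloading it in full.

    Assumption: the repo has a default config with a "train" split.
    """
    from datasets import load_dataset  # pip install datasets

    return load_dataset(repo_id, split=split, streaming=True)


def count_by_language(rows, lang_field="lang", limit=1000):
    """Tally the first `limit` rows by language code.

    Assumption: each row is a dict with a language column called
    `lang_field`. That column name is a guess, not taken from the post.
    Rows without the column are counted under "unknown".
    """
    counts = Counter()
    for row in islice(rows, limit):
        counts[row.get(lang_field, "unknown")] += 1
    return counts


# Example usage (needs network access):
# ds = stream_corpus()
# print(next(iter(ds)).keys())   # inspect the real column names first
# print(count_by_language(ds))
```

Streaming keeps memory and disk use small, which helps for a quick sanity check on a corpus of this size. If the languages turn out to be split into separate configs rather than tagged per row, the counting helper would not apply as written.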