Labadain-30k+: A Monolingual Tetun Document-Level Audited Dataset

Labadain-30k+ is a monolingual Tetun dataset containing 33,550 documents spanning June 2001 to September 2023, excluding 2004 and 2005, for which no documents are available. The dataset was acquired through web crawling and is distributed in text format. Each document includes a title, URL, source, category, publication date, and content. Documents are separated from one another by two consecutive newlines.
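The separator convention described above can be sketched in Python. This is a minimal illustration, not the authors' loader: it assumes "two consecutive newlines" means a single blank line ("\n\n") between documents, the function name iter_documents is hypothetical, and the sample text is made up rather than taken from the dataset. The layout of the fields inside each document is not specified here, so the sketch returns each document as one raw block instead of guessing at field positions.

```python
from typing import Iterator


def iter_documents(raw_text: str) -> Iterator[str]:
    """Yield one raw document block per chunk separated by a blank line."""
    # Assumption: documents are separated by "\n\n". Normalize Windows
    # line endings first so "\r\n\r\n" is treated the same way.
    normalized = raw_text.replace("\r\n", "\n")
    for block in normalized.split("\n\n"):
        block = block.strip()
        if block:  # skip empty chunks left by extra blank lines
            yield block


# Made-up sample text, NOT real dataset content.
sample = (
    "Title A\nhttps://example.org/a\nBody A\n"
    "\n"
    "Title B\nhttps://example.org/b\nBody B\n"
)
docs = list(iter_documents(sample))
print(len(docs))
```

If a document's content can itself contain blank lines, a plain split like this would break it into several blocks, so the resulting count is worth checking against the 33,550 documents reported for the full file.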

Data and Resources

Text Information Retrieval in Tetun (TXT)
Full dataset
Labadain crawler (PYTHON)
Labadain crawler is a data collection pipeline for low-resource languages...

Additional Information

Field	Value
Author	Gabriel de Jesus, Sérgio Nunes
Last Updated	March 27, 2025, 16:04 (UTC)
Created	April 3, 2024, 09:16 (UTC)
Acknowledgement	The Labadain-30k+ dataset was developed within the context of a Ph.D. research project financed by national funds through the Portuguese funding agency FCT - Fundação Para a Ciência e a Tecnologia under the Ph.D. scholarship grant number SFRH/BD/151437/2021.
Citation	de Jesus, G., & Nunes, S. (2024). Labadain-30k+: A Monolingual Tetun Document-Level Audited Dataset [Data set]. INESC TEC. https://doi.org/10.25747/YDWR-N696
Creation Date	January 30, 2024
DOI	https://doi.org/10.25747/ydwr-n696
Instrument Name	Labadain Crawler
Instrument Type	Web crawling
Language	Tetun
Relation	de Jesus, G., & Nunes, S. (2024). Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Lingotto Conference Centre - Torino (Italia). Zenodo. https://doi.org/10.5281/zenodo.10911381
Size	84 MB
Spatial Coverage	Timor-Leste
Temporal Coverage	2001-2023 (except 2004-2005)