Labadain-30k+: A Monolingual Tetun Document-Level Audited Dataset
Data and Resources
- Text Information Retrieval in Tetun (TXT)
  Full dataset
- Labadain crawler (Python)
  Labadain crawler is a data collection pipeline for low-resource languages...
Additional Info
Field | Value |
---|---|
Author | Gabriel de Jesus, Sérgio Nunes |
Last Updated | April 29, 2024, 13:34 (UTC) |
Created | April 3, 2024, 09:16 (UTC) |
Acknowledgement | The Labadain-30k+ dataset was developed as part of a Ph.D. research project financed by national funds through the Portuguese funding agency FCT - Fundação para a Ciência e a Tecnologia, under Ph.D. scholarship grant number SFRH/BD/151437/2021. |
Citation | de Jesus, G., & Nunes, S. (2024). Labadain-30k+: A Monolingual Tetun Document-Level Audited Dataset [Data set]. INESC TEC. https://doi.org/10.25747/YDWR-N696 |
Creation Date | January 30, 2024 |
DOI | https://doi.org/10.25747/ydwr-n696 |
Instrument Name | Labadain Crawler |
Instrument Type | Web crawling |
Language | Tetun |
Relation | de Jesus, G., & Nunes, S. (2024). Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Lingotto Conference Centre - Torino (Italia). Zenodo. https://doi.org/10.5281/zenodo.10911381 |
Size | 84 MB |
Spatial Coverage | Timor-Leste |
Temporal Coverage | 2001-2023 (except 2004-2005) |