Labadain-30k+: A Monolingual Tetun Document-Level Audited Dataset
Data and Resources
- Text Information Retrieval in Tetun (TXT): Full dataset
- Labadain crawler (Python): Labadain crawler is a data collection pipeline for low-resource languages...
Additional Information
Field | Value |
---|---|
Author | Gabriel de Jesus, Sérgio Nunes |
Last Updated | March 27, 2025, 16:04 (UTC) |
Created | April 3, 2024, 09:16 (UTC) |
Acknowledgement | The Labadain-30k+ dataset was developed within the context of a Ph.D research project financed by national funds through the Portuguese funding agency FCT - Fundação Para a Ciência e a Tecnologia under the Ph.D scholarship grant number SFRH/BD/151437/2021. |
Citation | de Jesus, G., & Nunes, S. (2024). Labadain-30k+: A Monolingual Tetun Document-Level Audited Dataset [Data set]. INESC TEC. https://doi.org/10.25747/YDWR-N696 |
Creation Date | January 30, 2024 |
DOI | https://doi.org/10.25747/ydwr-n696 |
Instrument Name | Labadain Crawler |
Instrument Type | Web crawling |
Language | Tetun |
Relation | de Jesus, G., & Nunes, S. (2024). Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Lingotto Conference Centre - Torino (Italia). Zenodo. https://doi.org/10.5281/zenodo.10911381 |
Size | 84MB |
Spatial Coverage | Timor-Leste |
Temporal Coverage | 2001-2023 (except 2004-2005) |