Labadain-30k+: A Monolingual Tetun Document-Level Audited Dataset
Data and resources
- Text Information Retrieval in Tetun (TXT): Full dataset
- Labadain crawler (Python): Labadain crawler is a data collection pipeline for low-resource languages...
Additional information
| Field | Value |
|---|---|
| Author | Gabriel de Jesus, Sérgio Nunes |
| Last updated | March 27, 2025, 16:04 (UTC) |
| Created | April 3, 2024, 09:16 (UTC) |
| Acknowledgement | The Labadain-30k+ dataset was developed within the context of a Ph.D research project financed by national funds through the Portuguese funding agency FCT - Fundação Para a Ciência e a Tecnologia under the Ph.D scholarship grant number SFRH/BD/151437/2021. |
| Citation | de Jesus, G., & Nunes, S. (2024). Labadain-30k+: A Monolingual Tetun Document-Level Audited Dataset [Data set]. INESC TEC. https://doi.org/10.25747/YDWR-N696 |
| Creation Date | January 30, 2024 |
| DOI | https://doi.org/10.25747/ydwr-n696 |
| Instrument Name | Labadain Crawler |
| Instrument Type | Web crawling |
| Language | Tetun |
| Relation | de Jesus, G., & Nunes, S. (2024). Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Lingotto Conference Centre - Torino (Italia). Zenodo. https://doi.org/10.5281/zenodo.10911381 |
| Size | 84MB |
| Spatial Coverage | Timor-Leste |
| Temporal Coverage | 2001-2023 (except 2004-2005) |