1. Overview
Labadain-ZSRunS is a dataset of run files produced by classical sparse and zero-shot dense retrieval models in experiments on Tetun ad-hoc text retrieval. It also includes document summaries generated by a large language model (LLM) from the full content of each document in the Labadain-Avaliadór collection. The dataset is intended to support research in Tetun information retrieval and, more broadly, in multilingual, cross-lingual, and zero-shot retrieval scenarios involving underrepresented languages.
2. Dataset Components
The dataset comprises model-specific run files and LLM-generated document summaries.
2.1. Model-Specific Run Files
This component contains retrieval run files produced by a diverse set of models, including:
- Classical sparse models: BM25, DFR BM25, and Hiemstra LM
- Zero-shot dense retrieval models: pretrained dense retrievers applied to Tetun in a zero-shot setting to generate document embeddings. The models used are DPR, mDPR, Contriever, mContriever, ColBERTv2, and ColBERT-X.
2.2. LLM-Generated Document Summaries
Summaries were automatically generated using Claude Haiku 3, based on the full content of each Tetun document.
3. Dataset Structure
All files are organized under the root folder of Labadain-ZSRunS.
3.1. Run Files
Run files are stored in CSV format, comprising qid, docno, score, and rank columns.
- Sparse retrieval models: Each sparse model (BM25, DFR BM25, Hiemstra LM) has its own run file. All sparse model run files are located in sparse-baseline-run-files/[files].
- Dense retrieval models: Each dense model (e.g., DPR, mDPR, Contriever) has two run files: one using title embeddings and the other using contextual (title + summary) embeddings. All dense model run files are located in dense-zero-shot-run-files/[folders].
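As a minimal sketch of how a run file with the qid, docno, score, and rank columns might be read, the following Python snippet groups rows by query and orders them by rank. It assumes the CSV has a header row with exactly those column names; the function name read_run and the sample rows are hypothetical and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict


def read_run(handle):
    """Group run rows by qid and order each group by rank (ascending)."""
    run = defaultdict(list)
    for row in csv.DictReader(handle):
        # Each entry is (docno, score, rank); assumes a header row is present.
        run[row["qid"]].append(
            (row["docno"], float(row["score"]), int(row["rank"]))
        )
    for rows in run.values():
        rows.sort(key=lambda entry: entry[2])
    return dict(run)


# Made-up rows for illustration only; real files live under the folders above.
sample = io.StringIO(
    "qid,docno,score,rank\n"
    "1,d2,7.5,2\n"
    "1,d1,9.1,1\n"
    "2,d3,4.0,1\n"
)
run = read_run(sample)

# Top-ranked document per query.
top_docs = {qid: rows[0][0] for qid, rows in run.items()}
```

To read an actual run file, the same function could be called on a file opened with open(path, newline="", encoding="utf-8"), where path points to one of the run files.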
3.2. LLM-Generated Summaries
The LLM-generated summaries file is provided in CSV format with docno and contextual_document columns and is located in llm-generated-summaries/[file].
- The docno column represents the document identifier as used in the Labadain-Avaliadór document collection.
- The contextual_document column contains a concatenation of the document's title and its LLM-generated summary, separated by a newline character (\n).
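The two columns described above can be unpacked as in the following sketch, which separates each contextual_document value into its title and summary at the first newline. It assumes a header row, a title that contains no newline itself, and fields quoted so that embedded newlines survive CSV parsing; the function name split_contextual and the sample row are hypothetical.

```python
import csv
import io


def split_contextual(handle):
    """Map docno to a (title, summary) pair, splitting at the first newline."""
    documents = {}
    for row in csv.DictReader(handle):
        # partition() splits only once, so newlines inside the summary are kept.
        title, _, summary = row["contextual_document"].partition("\n")
        documents[row["docno"]] = (title, summary)
    return documents


# Made-up row for illustration only; the quoted field spans two lines.
sample = io.StringIO(
    "docno,contextual_document\n"
    'd1,"Example title\nExample summary text."\n'
)
documents = split_contextual(sample)
title, summary = documents["d1"]
```

When reading the real file, opening it with newline="" lets the csv module handle newlines inside quoted fields itself.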