Labadain-ZSRunS: Sparse and Zero-Shot Dense Retrieval Runs with LLM-Generated Summaries for Tetun Ad-Hoc Text Retrieval

1. Overview

Labadain-ZSRunS is a dataset of run files produced by classical sparse and zero-shot dense retrieval models in experiments on Tetun ad-hoc text retrieval. It also includes document summaries generated by a large language model (LLM) from the full content of each document in the Labadain-Avaliadór collection. The dataset is intended to support research in Tetun information retrieval and, more broadly, in multilingual, cross-lingual, and zero-shot retrieval scenarios involving underrepresented languages.

2. Dataset Components

The dataset comprises model-specific run files and LLM-generated document summaries.

2.1. Model-Specific Run Files

This component contains retrieval run files produced by a diverse set of models, including:

  • Classical sparse models: BM25, DFR BM25, and Hiemstra LM
  • Zero-shot dense retrieval models: pretrained dense retrievers applied to Tetun in a zero-shot setting to generate document embeddings. The models used are DPR, mDPR, Contriever, mContriever, ColBERTv2, and ColBERT-X.

2.2. LLM-Generated Document Summaries

Summaries were automatically generated using Claude Haiku 3, based on the full content of each Tetun document.

3. Dataset Structure

All files are organized under the root folder of Labadain-ZSRunS.

3.1. Run Files

Run files are stored in CSV format with four columns: qid, docno, score, and rank.

  • Sparse retrieval models: Each sparse model (BM25, DFR BM25, Hiemstra LM) has its own run file. All sparse model run files are located in sparse-baseline-run-files/[files].
  • Dense retrieval models: Each dense model (e.g., DPR, mDPR, Contriever) has two run files: one using title embeddings and the other using contextual (title + summary) embeddings. All dense model run files are located in dense-zero-shot-run-files/[folders].
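
As an illustration of how the run files described above might be consumed, the following is a minimal sketch that reads a run CSV and renders it in the common six-column TREC run layout. It assumes the files are comma-delimited and start with a header row naming the qid, docno, score, and rank columns; the function names, the run tag, and the sample rows are hypothetical and not taken from the dataset.

```python
import csv
import io
from collections import defaultdict


def load_run(handle):
    """Read a run CSV into {qid: [(docno, score, rank), ...]}, ordered by rank.

    Assumes a header row with qid, docno, score, and rank columns.
    """
    run = defaultdict(list)
    for row in csv.DictReader(handle):
        run[row["qid"]].append(
            (row["docno"], float(row["score"]), int(row["rank"]))
        )
    for ranking in run.values():
        ranking.sort(key=lambda item: item[2])  # order each ranking by rank
    return dict(run)


def to_trec_lines(run, tag="run"):
    """Render a loaded run as 'qid Q0 docno rank score tag' lines."""
    lines = []
    for qid, ranking in run.items():
        for docno, score, rank in ranking:
            lines.append(f"{qid} Q0 {docno} {rank} {score} {tag}")
    return lines


# Hypothetical sample rows standing in for a real run file.
sample = io.StringIO(
    "qid,docno,score,rank\n"
    "1,d3,9.1,2\n"
    "1,d7,12.5,1\n"
    "2,d3,4.0,1\n"
)
run = load_run(sample)
trec_lines = to_trec_lines(run, tag="bm25")
```

With a real file, the `io.StringIO` sample would be replaced by `open(path, newline="", encoding="utf-8")`, where `path` points to one of the run files in the folders listed above.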

3.2. LLM-Generated Summaries

The LLM-generated summaries file is provided in CSV format with docno and contextual_document columns and is located in llm-generated-summaries/[file].

  • The docno column represents the document identifier as used in the Labadain-Avaliadór document collection.
  • The contextual_document column contains a concatenation of the document's title and its LLM-generated summary, separated by a newline character (\n).
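
The column layout above suggests that the title and summary can be recovered by splitting contextual_document at its first newline. The sketch below shows one way to do this, assuming the file is comma-delimited with a header row and that the separator is stored as an actual newline character inside a quoted CSV field; the function name and the sample row are hypothetical.

```python
import csv
import io


def load_summaries(handle):
    """Map each docno to a (title, summary) pair.

    Splits contextual_document at the first newline only, so any further
    newlines remain part of the summary.
    """
    summaries = {}
    for row in csv.DictReader(handle):
        title, _, summary = row["contextual_document"].partition("\n")
        summaries[row["docno"]] = (title.strip(), summary.strip())
    return summaries


# Hypothetical sample row; real rows contain Tetun titles and summaries.
sample = io.StringIO(
    "docno,contextual_document\n"
    'd1,"Example title\nExample summary text."\n'
)
summaries = load_summaries(sample)
title, summary = summaries["d1"]
```

If the separator turns out to be stored as the two literal characters `\` and `n` rather than a real newline, the argument to `partition` would need to be `"\\n"` instead.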

Additional Info

Field: Value
Author: Gabriel de Jesus, Siddharth AK Singh, Sérgio Nunes, Andrew Yates
Last Updated: June 12, 2025, 10:26 (UTC)
Created: April 29, 2025, 15:44 (UTC)
Language: Tetun