LabadainLog-17k+: Search Logs from Tetun-Speaking Users Across Chat, Web, and News Platforms

1. Overview

LabadainLog-17k+ is a dataset of interaction logs in Tetun, collected from three different platforms:

  • Labadain Chat (16,952 prompts): An LLM-powered conversational assistant tailored for Tetun speakers, accessible at www.labadain.com.
  • Labadain Search (400 queries): A monolingual search engine designed specifically for Tetun, available at www.labadain.tl.
  • Timor News (400 queries): An online news portal exclusively publishing content in Tetun (www.timornews.tl), which logs incoming search queries through Google Search Console.

This dataset offers a snapshot of real user interactions in Tetun across chat and web-based search scenarios.

2. Dataset Structure

The root directory of LabadainLog-17k+ contains the following CSV files.

2.1. labadain_chat_logs_16952.csv

Logs from the Labadain Chat assistant.

  • datetime: Timestamp of when the user submitted the prompt.
  • location: Geographic location of the user, resolved from IP address.
  • username: Anonymized identifier that stands in for the username (not personally identifiable).
  • original_prompt: The raw input entered by the user in Tetun.
  • revised_prompt: A corrected version of the prompt, typically modified to fix misspellings.
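
As a minimal sketch of how these columns might be used, the Python snippet below counts how many prompts were changed during revision. It assumes the CSV has a header row with the column names listed above; the helper name count_revised and the sample rows are invented for illustration and are not taken from the dataset. Treating an empty revised field as "no correction" is also an assumption.

```python
import csv
import io


def count_revised(rows, original_col="original_prompt", revised_col="revised_prompt"):
    """Return (total, changed): rows seen, and rows whose revised text differs."""
    total = 0
    changed = 0
    for row in rows:
        total += 1
        original = (row.get(original_col) or "").strip()
        revised = (row.get(revised_col) or "").strip()
        # Assumption: an empty revised field means no correction was made.
        if revised and revised != original:
            changed += 1
    return total, changed


# Invented placeholder rows (NOT real dataset content), using the documented columns.
SAMPLE = """datetime,location,username,original_prompt,revised_prompt
t1,loc1,user_a,bondia,bondia
t2,loc2,user_b,obrigadu barak,obrigadu barak
t3,loc1,user_a,bon dia,bondia
"""

sample_total, sample_changed = count_revised(csv.DictReader(io.StringIO(SAMPLE)))
print(sample_total, sample_changed)
```

To run the same count on the real file, open labadain_chat_logs_16952.csv with newline="" and an encoding that matches the file (UTF-8 is assumed here, not confirmed), and pass csv.DictReader(f) to count_revised.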

2.2. labadain_search_logs_400.csv

Queries submitted to the Labadain Search engine.

  • session_id: Unique identifier for a single search session.
  • location: Geographic location of the search request, resolved from IP address.
  • datetime: Timestamp of when the query was made.
  • original_query: The original search query as entered by the user.
  • revised_query: A corrected version of the query, typically modified to fix misspellings.
  • note: Additional notes or comments.
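
Because each row carries a session_id, queries can be grouped into sessions, for example to look at reformulations within one session. The sketch below is one possible approach, not a procedure prescribed by the dataset authors. The function name queries_by_session and the sample rows are invented, and the code keeps queries in file order, which is only chronological if the file itself is sorted by datetime.

```python
import csv
import io
from collections import defaultdict


def queries_by_session(rows):
    """Group original queries by session_id, preserving the order of the input rows."""
    sessions = defaultdict(list)
    for row in rows:
        sessions[row["session_id"]].append(row["original_query"])
    return dict(sessions)


# Invented placeholder rows (NOT real dataset content), using the documented columns.
SAMPLE = """session_id,location,datetime,original_query,revised_query,note
s1,loc1,t1,notisia,notisia,
s1,loc1,t2,notisia ohin,notisia ohin,
s2,loc2,t3,tempu,tempu,
"""

sessions = queries_by_session(csv.DictReader(io.StringIO(SAMPLE)))

# Sessions with more than one query are candidates for reformulation analysis.
multi_query = {sid: qs for sid, qs in sessions.items() if len(qs) > 1}
print(multi_query)
```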

2.3. google_search_timor_news_400.csv

Search queries that led users to the Timor News site, as tracked via Google Search Console.

  • original_query: The raw query string entered in Google Search that resulted in a visit to Timor News.
  • revised_query: A corrected version of the query, typically modified to fix misspellings.
  • note: Additional notes or comments.
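
Before analysis, it can help to check that each file's header matches the columns documented in this section. The following sketch encodes the three documented schemas and reports any missing column. EXPECTED_COLUMNS, read_header, and missing_columns are hypothetical names introduced here, and the check assumes that the first CSV row is a header.

```python
import csv
import io

# Column names as documented for each file in this section.
EXPECTED_COLUMNS = {
    "labadain_chat_logs_16952.csv": [
        "datetime", "location", "username", "original_prompt", "revised_prompt",
    ],
    "labadain_search_logs_400.csv": [
        "session_id", "location", "datetime", "original_query", "revised_query", "note",
    ],
    "google_search_timor_news_400.csv": [
        "original_query", "revised_query", "note",
    ],
}


def read_header(stream):
    """Return the first CSV row of a text stream, or an empty list if the stream is empty."""
    return next(csv.reader(stream), [])


def missing_columns(filename, header):
    """List the documented columns of `filename` that are absent from `header`."""
    present = {name.strip() for name in header}
    return [col for col in EXPECTED_COLUMNS[filename] if col not in present]


# Demonstration with an invented header that lacks the `note` column.
demo_header = read_header(io.StringIO("original_query,revised_query\n"))
demo_missing = missing_columns("google_search_timor_news_400.csv", demo_header)
print(demo_missing)
```

With the real files, read_header would be called on a file opened with newline="" and a suitable encoding (UTF-8 is an assumption).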

3. Data Usage

The LabadainLog-17k+ dataset is best suited for analyzing search behavior and user interaction patterns in Tetun across chat, web search, and news platforms. Key applications include search behavior analysis, cross-platform comparison of interaction patterns, studies of information access in Tetun, and analysis of query formulation.
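
One simple starting point for query formulation analysis is comparing input length across platforms. The sketch below summarizes length in whitespace-separated tokens. Whitespace tokenization is a simplifying assumption rather than a method endorsed by the dataset authors, and the helper names and sample strings are invented for illustration.

```python
from statistics import mean


def token_lengths(texts):
    """Length of each non-blank text in whitespace-separated tokens."""
    return [len(text.split()) for text in texts if text.strip()]


def summarize(texts):
    """Return the count, mean, and maximum token length of the non-blank texts."""
    lengths = token_lengths(texts)
    if not lengths:
        return {"count": 0, "mean_tokens": 0.0, "max_tokens": 0}
    return {
        "count": len(lengths),
        "mean_tokens": mean(lengths),
        "max_tokens": max(lengths),
    }


# Invented placeholder inputs (NOT real dataset content), one list per platform.
chat_prompts = ["bondia", "obrigadu barak"]
search_queries = ["notisia ohin"]

chat_summary = summarize(chat_prompts)
search_summary = summarize(search_queries)
print(chat_summary, search_summary)
```

In practice, the lists would be filled from the original_prompt and original_query columns of the three CSV files.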

Additional Info

  • Author: Gabriel de Jesus & Sérgio Nunes
  • Last Updated: June 12, 2025, 10:25 (UTC)
  • Created: March 28, 2025, 10:04 (UTC)
  • Language: Tetun
  • Spatial Coverage: Timor-Leste