1. Overview
LabadainLog-17k+ is a dataset of interaction logs in Tetun, collected from three different platforms:
- Labadain Chat (16,952 prompts): An LLM-powered conversational assistant tailored for Tetun speakers, accessible at www.labadain.com.
- Labadain Search (400 queries): A monolingual search engine designed specifically for Tetun, available at www.labadain.tl.
- Timor News (400 queries): An online news portal exclusively publishing content in Tetun (www.timornews.tl), which logs incoming search queries through Google Search Console.
This dataset offers a snapshot of real user interactions in Tetun across chat and web-based search scenarios.
2. Dataset Structure
The root directory of LabadainLog-17k+ contains the following three CSV files.
2.1. labadain_chat_logs_16952.csv
Logs from the Labadain Chat assistant.
- datetime: Timestamp of when the user submitted the prompt.
- location: Geographic location of the user, resolved from IP address.
- username: Anonymized identifier that stands in for the username (not personally identifiable).
- original_prompt: The raw input entered by the user in Tetun.
- revised_prompt: A corrected version of the prompt, typically modified to fix misspellings.
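As a minimal sketch of how this file might be read, the snippet below uses Python's standard csv module and checks that the five columns listed above are present. The helper name read_rows, the UTF-8 encoding, and the sample rows are assumptions for illustration only; the sample values are placeholders, not real log entries.

```python
import csv
import io

# Column names as listed above for labadain_chat_logs_16952.csv.
CHAT_COLUMNS = ["datetime", "location", "username", "original_prompt", "revised_prompt"]


def read_rows(fileobj, expected_columns):
    """Read CSV rows as dicts and check that the expected columns are present."""
    reader = csv.DictReader(fileobj)
    header = reader.fieldnames or []
    missing = [c for c in expected_columns if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Tiny in-memory stand-in with placeholder values (not real data).
sample = io.StringIO(
    "datetime,location,username,original_prompt,revised_prompt\n"
    "t1,loc1,user_a,prompt one,prompt one\n"
    "t2,loc2,user_b,promt two,prompt two\n"
)
rows = read_rows(sample, CHAT_COLUMNS)
print(len(rows))

# For the real file (encoding is an assumption):
# with open("labadain_chat_logs_16952.csv", newline="", encoding="utf-8") as f:
#     rows = read_rows(f, CHAT_COLUMNS)
```

The datetime column is left as a plain string here because its exact format is not stated in this document.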
2.2. labadain_search_logs_400.csv
Queries submitted to the Labadain Search engine.
- session_id: Unique identifier for a single search session.
- location: Geographic location of the search request, resolved from IP address.
- datetime: Timestamp of when the query was made.
- original_query: The original search query as entered by the user.
- revised_query: A corrected version of the query, typically modified to fix misspellings.
- note: Additional notes or comments.
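Because each row carries a session_id, queries can be grouped into sessions. The following is a rough sketch of that grouping; the function name queries_per_session is hypothetical, and the rows are placeholders rather than real queries. It assumes rows have already been loaded as dictionaries keyed by the column names above.

```python
from collections import defaultdict


def queries_per_session(rows):
    """Map each session_id to the list of original queries it contains, in input order."""
    sessions = defaultdict(list)
    for row in rows:
        sessions[row["session_id"]].append(row["original_query"])
    return dict(sessions)


# Placeholder rows (not real data) using the column names listed above.
rows = [
    {"session_id": "s1", "original_query": "query a"},
    {"session_id": "s1", "original_query": "query b"},
    {"session_id": "s2", "original_query": "query c"},
]
sessions = queries_per_session(rows)
print({sid: len(queries) for sid, queries in sessions.items()})
```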
2.3. google_search_timor_news_400.csv
Search queries that led users to the Timor News site, as tracked via Google Search Console.
- original_query: The raw query string entered into Google Search that resulted in a visit to Timor News.
- revised_query: A corrected version of the query, typically modified to fix misspellings.
- note: Additional notes or comments.
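One simple use of the original and revised columns is to estimate how often a query needed correction. The sketch below is illustrative only: the function name revision_rate is hypothetical, and it assumes that an empty revised field means no correction was made, which this document does not confirm. The sample rows are placeholders.

```python
def revision_rate(rows, original_key="original_query", revised_key="revised_query"):
    """Fraction of rows whose revised text differs from the original.

    Assumption: an empty revised field means "no correction was made".
    """
    if not rows:
        return 0.0
    changed = 0
    for row in rows:
        original = row[original_key].strip()
        revised = row[revised_key].strip()
        if revised and revised != original:
            changed += 1
    return changed / len(rows)


# Placeholder rows (not real data): one corrected, two unchanged, one empty.
rows = [
    {"original_query": "qurey a", "revised_query": "query a"},
    {"original_query": "query b", "revised_query": "query b"},
    {"original_query": "query c", "revised_query": "query c"},
    {"original_query": "query d", "revised_query": ""},
]
print(revision_rate(rows))
```

The same function can be applied to the chat logs by passing original_key="original_prompt" and revised_key="revised_prompt".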
3. Data Usage
The LabadainLog-17k+ dataset is best suited for analyzing search behavior and user interaction patterns in Tetun across chat, web search, and news platforms. Key applications include search behavior analysis, cross-platform comparison of interaction patterns, the study of information access in Tetun, and query formulation analysis.
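As one possible starting point for query formulation analysis, the sketch below compares average input length across platforms. It assumes whitespace splitting is an acceptable approximation of Tetun tokens, which is a simplification; the function name mean_token_count and the sample strings are placeholders, not real data.

```python
def mean_token_count(texts):
    """Average number of whitespace-separated tokens per text (0.0 if empty)."""
    if not texts:
        return 0.0
    return sum(len(text.split()) for text in texts) / len(texts)


# Placeholder inputs standing in for one text column from each file.
platforms = {
    "chat": ["w1 w2 w3 w4", "w1 w2"],
    "search": ["w1 w2", "w1"],
    "news": ["w1", "w1 w2 w3"],
}
lengths = {name: mean_token_count(texts) for name, texts in platforms.items()}
print(lengths)
```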