Wikipedia and Simple Wikipedia Lead Section Pairs for Nine Categories

The dataset (in the categorized_dataset folder) contains nine CSV files, each a collection of 10,000 lead section pairs sourced from Wikipedia (https://www.wikipedia.org/) and Simple Wikipedia (https://simple.wikipedia.org/) for a given category. The included categories are Culture, Education, Employment, Entertainment, Health, Leisure, Objects, Science and Time. The dataset was created to study how effectively an open-source large language model (Llama3) assesses the readability of texts and simplifies them across multiple domains. It was collected using the Wikipedia API.
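
The description states only that the Wikipedia API was used for collection, without naming endpoints or parameters. As an illustration, a lead section for the same title could be requested from both wikis through the MediaWiki Action API with the TextExtracts `exintro` option. This is a minimal sketch under that assumption, not the authors' collection script; the helper names `lead_section_url` and `extract_lead` are hypothetical.

```python
from urllib.parse import urlencode

# Action API endpoints for English Wikipedia and Simple English Wikipedia.
API_ENDPOINTS = {
    "en": "https://en.wikipedia.org/w/api.php",
    "simple": "https://simple.wikipedia.org/w/api.php",
}


def lead_section_url(title: str, wiki: str) -> str:
    """Build a request URL for the plain-text lead section of one article.

    Assumption: the parameter choice below is illustrative; the dataset
    description does not say which parameters the authors used.
    """
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,       # only the text before the first section heading
        "explaintext": 1,   # plain text instead of HTML
        "titles": title,
        "format": "json",
    }
    return f"{API_ENDPOINTS[wiki]}?{urlencode(params)}"


def extract_lead(response: dict) -> str:
    """Pull the extract out of a decoded JSON response for a single title."""
    pages = response["query"]["pages"]
    page = next(iter(pages.values()))
    # Missing pages carry no "extract" key, so fall back to an empty string.
    return page.get("extract", "")
```

Calling `lead_section_url("Solar System", "en")` and `lead_section_url("Solar System", "simple")` yields the two URLs for one candidate pair; the decoded JSON of each response can then be passed to `extract_lead`.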

Data and Resources

Categorized dataset (CSV)
Contains nine files, each a selection of 10,000 lead section pairs.
Model responses (CSV)
Contains nine files, each with the model's raw and processed responses for...
README (TXT)
Full Dataset (ZIP)
Folder containing the complete dataset.
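
After downloading the categorized dataset, a quick sanity check is to confirm that each of the nine category files holds the stated 10,000 pairs. The sketch below counts data rows per CSV file using only the Python standard library. It assumes that each file starts with a header row and is UTF-8 encoded; neither the column names nor the exact file names are given in this description, so the code does not rely on them. The function name `count_pairs` is hypothetical.

```python
import csv
from pathlib import Path


def count_pairs(folder: str) -> dict[str, int]:
    """Return the number of data rows in every .csv file inside `folder`.

    Assumptions (not stated in the dataset description): each file has
    one header row and uses UTF-8 encoding.
    """
    counts = {}
    for path in sorted(Path(folder).glob("*.csv")):
        # newline="" lets the csv module handle line breaks inside quoted
        # fields, which is likely for multi-paragraph lead sections.
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the assumed header row
            counts[path.stem] = sum(1 for _ in reader)
    return counts
```

For example, `count_pairs("categorized_dataset")` returns a dictionary keyed by file name without the extension, which can be compared against the expected count of 10,000 per category.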

Additional Information

Field	Value
Author	José Frederico Rodrigues, Carla Teixeira Lopes & Henrique Lopes Cardoso
Maintainer	João A. Castro (joao.a.castro@inesctec.pt)
Last Updated	August 9, 2024, 14:14 (UTC)
Created	August 9, 2024, 13:41 (UTC)
Citation	Rodrigues, J. F., Teixeira Lopes, C., & Lopes Cardoso, H. (2024). Wikipedia and Simple Wikipedia Lead Section Pairs for Nine Categories [Data set]. INESC TEC. https://doi.org/10.25747/4VC9-ZS43
Creation Date	May 2024
DOI	doi.org/10.25747/4VC9-ZS43
Language	EN
Relation	Master Thesis: Readability Assessment and Text Simplification through Open-Source Large Language Models
Size	454 MB