Wikipedia and Simple Wikipedia Lead Section Pairs for Nine Categories

The dataset (categorized_dataset folder) contains nine files in .csv format, each a collection of 10,000 lead section pairs sourced from Wikipedia (https://www.wikipedia.org/) and Simple Wikipedia (https://simple.wikipedia.org/) for a given category. The included categories are Culture, Education, Employment, Entertainment, Health, Leisure, Objects, Science, and Time. The dataset was created to study how effectively an open-source large language model (Llama3) assesses the readability of texts and simplifies text across multiple domains. It was collected using the Wikipedia API.
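The description does not say which API endpoint or parameters were used for collection. As a rough illustration only, the sketch below builds request URLs for the lead section of the same article title on both wikis, assuming the public MediaWiki `api.php` endpoints, the English edition of Wikipedia, and the TextExtracts `prop=extracts` query with `exintro`. The function names `lead_section_url` and `lead_pair_urls` are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed endpoints: the English and Simple English MediaWiki APIs.
ENDPOINTS = {
    "wikipedia": "https://en.wikipedia.org/w/api.php",
    "simple": "https://simple.wikipedia.org/w/api.php",
}


def lead_section_url(wiki: str, title: str) -> str:
    """Build a query URL asking for the plain-text intro of one article."""
    params = {
        "action": "query",
        "prop": "extracts",   # TextExtracts extension
        "exintro": 1,         # only the text before the first heading
        "explaintext": 1,     # plain text instead of HTML
        "titles": title,
        "format": "json",
    }
    return f"{ENDPOINTS[wiki]}?{urlencode(params)}"


def lead_pair_urls(title: str) -> dict:
    """Return one request URL per wiki for the same article title."""
    return {wiki: lead_section_url(wiki, title) for wiki in ENDPOINTS}


if __name__ == "__main__":
    for wiki, url in lead_pair_urls("Solar System").items():
        print(wiki, url)
```

Keeping URL construction separate from any network call makes it easy to check the parameters before fetching anything. Article titles do not always match across the two wikis, so a real collection pipeline would need its own pairing logic.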

Data and Resources

Categorized dataset (CSV)
Contains nine files, each a selection of 10,000 lead section pairs.
Model responses (CSV)
Contains nine files, each with the model's raw and processed responses for...
README (TXT)
Full Dataset (ZIP)
Folder containing the complete dataset.
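
As a minimal sketch of how the categorized files might be inspected after download, the code below reads every .csv file in a folder with Python's standard `csv` module and counts the rows in each. The folder name `categorized_dataset` comes from the description above; the file names and column names are not documented here, so the code takes whatever header row each file provides. The helpers `load_pairs` and `summarize` are hypothetical names.

```python
import csv
from pathlib import Path


def load_pairs(folder):
    """Read each .csv file in `folder` into a list of row dicts, keyed by file stem."""
    data = {}
    for path in sorted(Path(folder).glob("*.csv")):
        # newline="" lets the csv module handle line breaks inside quoted fields.
        with path.open(newline="", encoding="utf-8") as handle:
            data[path.stem] = list(csv.DictReader(handle))
    return data


def summarize(data):
    """Map each file stem to its number of data rows (header excluded)."""
    return {name: len(rows) for name, rows in data.items()}


if __name__ == "__main__":
    # Assumes the archive was extracted next to this script.
    counts = summarize(load_pairs("categorized_dataset"))
    for name, n_rows in counts.items():
        print(f"{name}: {n_rows} rows")
```

If the description holds, each of the nine files should report 10,000 rows.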

Additional Information

Field	Value
Author	José Frederico Rodrigues, Carla Teixeira Lopes & Henrique Lopes Cardoso
Maintainer	João A. Castro (joao.a.castro@inesctec.pt)
Last updated	August 9, 2024, 14:14 (UTC)
Created	August 9, 2024, 13:41 (UTC)
Citation	Rodrigues, J. F., Teixeira Lopes, C., & Lopes Cardoso, H. (2024). Wikipedia and Simple Wikipedia Lead Section Pairs for Nine Categories [Data set]. INESC TEC. https://doi.org/10.25747/4VC9-ZS43
Creation Date	May, 2024
DOI	doi.org/10.25747/4VC9-ZS43
Language	EN
Relation	Master Thesis: Readability Assessment and Text Simplification through Open-Source Large Language Models
Size	454 MB