Wikipedia and Simple Wikipedia Lead Section Pairs for Nine Categories

The dataset (in the categorized_dataset folder) contains nine files in .csv format, one per category, each holding 10,000 lead section pairs sourced from Wikipedia (https://www.wikipedia.org/) and Simple Wikipedia (https://simple.wikipedia.org/). The categories are Culture, Education, Employment, Entertainment, Health, Leisure, Objects, Science and Time. The dataset was created to study how effectively an open-source large language model (Llama3) assesses the readability of texts and simplifies text across multiple domains. The data was collected using the Wikipedia API.
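
As a quick check after downloading, the sketch below lists each .csv file in the folder with its column names and row count. It is a minimal illustration, not part of the dataset: the function name summarize_dataset is hypothetical, and the code assumes that each file is UTF-8 encoded and that its first row is a header, neither of which is stated in the description. File names and column names are read from disk rather than assumed.

```python
import csv
from pathlib import Path


def summarize_dataset(folder):
    """Return {file stem: (column names, row count)} for every .csv in folder.

    Assumption (not confirmed by the dataset description): each file is
    UTF-8 encoded and its first row is a header.
    """
    summary = {}
    for path in sorted(Path(folder).glob("*.csv")):
        # newline="" lets the csv module handle line breaks inside quoted fields,
        # which matters for multi-paragraph lead sections.
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])
            row_count = sum(1 for _ in reader)
        summary[path.stem] = (header, row_count)
    return summary


if __name__ == "__main__":
    # "categorized_dataset" is the folder name given in the description.
    for name, (columns, rows) in summarize_dataset("categorized_dataset").items():
        print(f"{name}: {rows} rows, columns = {columns}")
```

If the description holds, this should report nine files with 10,000 rows each. A different count would suggest an incomplete download or a file without a header row.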

Data and Resources

Additional Information

Field Value
Author José Frederico Rodrigues, Carla Teixeira Lopes & Henrique Lopes Cardoso
Maintainer João A. Castro (joao.a.castro@inesctec.pt)
Last Updated August 9, 2024, 14:14 (UTC)
Created August 9, 2024, 13:41 (UTC)
Citation Rodrigues, J. F., Teixeira Lopes, C., & Lopes Cardoso, H. (2024). Wikipedia and Simple Wikipedia Lead Section Pairs for Nine Categories [Data set]. INESC TEC. https://doi.org/10.25747/4VC9-ZS43
Creation Date May, 2024
DOI doi.org/10.25747/4VC9-ZS43
Language EN
Relation Master Thesis: Readability Assessment and Text Simplification through Open-Source Large Language Models
Size 454 MB