Wikipedia and Simple Wikipedia Lead Section Pairs for Nine Categories

The dataset (in the categorized_dataset folder) contains nine .csv files, one per category, each holding 10,000 lead section pairs drawn from Wikipedia (https://www.wikipedia.org/) and Simple Wikipedia (https://simple.wikipedia.org/). The categories are Culture, Education, Employment, Entertainment, Health, Leisure, Objects, Science and Time. The dataset was created to study how effectively an open-source large language model (Llama3) assesses the readability of texts and simplifies them across multiple domains. The data was collected using the Wikipedia API.
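As a quick sanity check after downloading, the following minimal sketch counts the data rows in each .csv file of a folder. It relies only on the Python standard library. The helper name count_pairs is hypothetical, and the sketch assumes that each file starts with a header row and stores one lead section pair per row; neither the file names nor the column layout are documented here, so adjust as needed.

```python
import csv
from pathlib import Path


def count_pairs(folder):
    """Return {file stem: number of data rows} for every .csv file in folder.

    Assumption: the first row of each file is a header and every
    following row holds one lead section pair.
    """
    counts = {}
    for path in sorted(Path(folder).glob("*.csv")):
        # newline="" lets the csv module handle line breaks inside quoted fields
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the assumed header row
            counts[path.stem] = sum(1 for _ in reader)
    return counts
```

Calling count_pairs("categorized_dataset") should, under these assumptions, return nine entries with 10,000 rows each if the download is complete.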

Data and Resources

Additional Information

Field Value
Author José Frederico Rodrigues, Carla Teixeira Lopes & Henrique Lopes Cardoso
Maintainer João A. Castro (joao.a.castro@inesctec.pt)
Last Updated August 9, 2024, 14:14 (UTC)
Created August 9, 2024, 13:41 (UTC)
Citation Rodrigues, J. F., Teixeira Lopes, C., & Lopes Cardoso, H. (2024). Wikipedia and Simple Wikipedia Lead Section Pairs for Nine Categories [Data set]. INESC TEC. https://doi.org/10.25747/4VC9-ZS43
Creation Date May 2024
DOI doi.org/10.25747/4VC9-ZS43
Language EN
Relation Master's Thesis: Readability Assessment and Text Simplification through Open-Source Large Language Models
Size 454 MB