Wikipedia and Simple Wikipedia Lead Section Pairs for Nine Categories

The dataset (in the categorized_dataset folder) contains nine .csv files, each holding 10,000 lead section pairs sourced from Wikipedia (https://www.wikipedia.org/) and Simple Wikipedia (https://simple.wikipedia.org/) for a given category. The included categories are Culture, Education, Employment, Entertainment, Health, Leisure, Objects, Science, and Time. The dataset was created to study how effectively an open-source large language model (Llama3) assesses the readability of texts and simplifies text across multiple domains. It was collected using the Wikipedia API.
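
As a rough illustration of how the nine category files might be read, the sketch below loads every .csv file found in a folder using only the Python standard library. It assumes that each file is UTF-8 encoded and starts with a header row; the column names are not documented here, so the code does not rely on any particular name. The function name load_category_pairs and its limit parameter are hypothetical helpers introduced only for this example.

```python
import csv
from pathlib import Path


def load_category_pairs(folder, limit=None):
    """Read every .csv file in `folder` into a dict keyed by file name (without extension).

    Assumptions (not confirmed by the dataset description):
    - each file is UTF-8 encoded and has a header row;
    - one row corresponds to one lead section pair.
    `limit` optionally caps the number of rows read per file.
    """
    pairs = {}
    for path in sorted(Path(folder).glob("*.csv")):
        rows = []
        with path.open(newline="", encoding="utf-8") as handle:
            # DictReader maps each row to the column names found in the header.
            for index, row in enumerate(csv.DictReader(handle)):
                if limit is not None and index >= limit:
                    break
                rows.append(row)
        pairs[path.stem] = rows
    return pairs
```

Calling load_category_pairs("categorized_dataset", limit=5) would return a small sample from each file, which is a cheap way to inspect the actual column names before loading the full 454 MB.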

Data and Resources

Additional Information

Field
Author José Frederico Rodrigues, Carla Teixeira Lopes & Henrique Lopes Cardoso
Maintainer João A. Castro (joao.a.castro@inesctec.pt)
Last Updated August 9, 2024, 14:14 (UTC)
Created August 9, 2024, 13:41 (UTC)
Citation Rodrigues, J. F., Teixeira Lopes, C., & Lopes Cardoso, H. (2024). Wikipedia and Simple Wikipedia Lead Section Pairs for Nine Categories [Data set]. INESC TEC. https://doi.org/10.25747/4VC9-ZS43
Creation Date May 2024
DOI doi.org/10.25747/4VC9-ZS43
Language EN
Relation Master's Thesis: Readability Assessment and Text Simplification through Open-Source Large Language Models
Size 454 MB