Dataset of synthetic clinical notes in European Portuguese generated using an open-source large language model, along with prompting and evaluation data
Data and Resources
- ReadMe (TXT)
- Synthetic Subset (TXT)
  Contains the subset of synthetic clinical notes (in TXT format, without...
- Full synthetic dataset (ZIP)
  Contains the dataset of synthetic clinical notes in European Portuguese,...
- Prompts and Responses (CSV)
  Contains a CSV file with the prompts and responses from prompt engineering....
- Human Evaluation Results (CSV)
  Contains CSV files with the results from the human evaluation process
Additional Info
Field | Value |
---|---|
Author | Daniel Félix & Carla Teixeira Lopes |
Last Updated | June 27, 2025, 14:54 (UTC) |
Created | June 26, 2025, 13:51 (UTC) |
Citation | Félix, D., & Teixeira Lopes, C. (2025). Dataset of synthetic clinical notes in European Portuguese generated using an open-source large language model, along with prompting and evaluation data [Data set]. INESC TEC. https://doi.org/10.25747/4GC6-DK48 |
Creation Date | December 2024 |
DOI | https://doi.org/10.25747/4GC6-DK48 |
Data Collection Method | The synthetic dataset was generated with one-shot and few-shot prompting using an open-source large language model; the evaluation was done by medical students and professionals |
Format | XML (each clinical note is an XML file), CSV (prompting and evaluation data), TXT (each clinical note in the evaluation subset is a TXT file) |
Instrument Name | Llama 3.2 with 3 billion parameters and instruction tuning (synthetic dataset), notasclinicas.inesctec.pt (evaluations) |
Language | Portuguese |
Project | This repository contains data produced during the dissertation project titled "Generating Synthetic Clinical Data in European Portuguese Using an Open-Source Large Language Model". The project was conducted by Daniel Filipe Souto Félix (up201905189@up.pt) at the Faculty of Engineering of the University of Porto, for the Master in Informatics and Computing Engineering, under the supervision of Professor Carla Teixeira Lopes, as part of the Health from Portugal project. The project's goal is to increase the amount of publicly available clinical data in European Portuguese, so that it can be freely used in research and development. To that end, we generated a dataset of synthetic clinical notes in European Portuguese using an open-source large language model. The dataset was evaluated using specialized human evaluation and various automatic methods. The specialized evaluation was performed through a web application we developed. |
Relation | 10.1186/s13326-022-00269-1 (dataset used for fine-tuning the model and for prompting examples), 10.18653/v1/W19-5024 (dataset used for fine-tuning the model) |
Size | 643.9 MB |
Spatial Coverage | The synthetic dataset is intended to represent the country of Portugal |
Temporal Coverage | The synthetic dataset was generated in December 2024 |
Type | Clinical notes, prompting records, evaluation results |
Type of Instrument | Open-source large language model to generate the synthetic dataset, web application created by us to collect evaluations |