Dataset of synthetic clinical notes in European Portuguese generated using an open-source large language model, along with prompting and evaluation data

This dataset was generated using an open-source large language model and carefully curated prompts, simulating realistic clinical narratives while ensuring no real patient data is included.

The primary purpose of this dataset is to support the development, evaluation, and benchmarking of Artificial Intelligence tools for clinical and biomedical applications in the Portuguese language, especially European Portuguese. It is particularly valuable for information extraction (IE) tasks such as named entity recognition, clinical note classification, summarization, and synthetic data generation in low-resource language settings.

The dataset promotes research on the responsible use of synthetic data in healthcare and aims to serve as a foundation for training or fine-tuning domain-specific Portuguese language models in clinical IE and other natural language processing tasks.

About the dataset

- XML files comprising 98,571 fully synthetic clinical notes in European Portuguese, divided into 4 types: 24,759 admission notes, 24,411 ambulatory notes, 24,639 discharge summaries, and 24,762 nursing notes
- CSV file with prompts and responses from prompt engineering
- CSV files with prompts and responses from synthetic dataset generation
- CSV file with results from human evaluation
- TXT files containing 1,000 clinical notes (250 of each type) taken from the synthetic dataset and used during automatic evaluation
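
As a minimal sketch of how the XML notes might be read once the archive has been extracted locally: the folder name `synthetic_notes` and the helper names `read_note_text` and `load_notes` are hypothetical, and the tag structure of the notes is not described here, so the code simply collects all text content regardless of tag names.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def read_note_text(xml_path):
    """Return all text content of one XML note as a single string."""
    root = ET.parse(xml_path).getroot()
    # itertext() walks every element, so no tag names need to be known.
    pieces = (t.strip() for t in root.itertext())
    return " ".join(p for p in pieces if p)


def load_notes(root_dir):
    """Map each XML file path (relative to root_dir) to its text."""
    root_dir = Path(root_dir)
    return {
        str(p.relative_to(root_dir)): read_note_text(p)
        for p in sorted(root_dir.rglob("*.xml"))
    }
```

Under these assumptions, `notes = load_notes("synthetic_notes")` would return one entry per XML file found under that folder, which could then be passed to a tokenizer or an annotation tool.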

Data and Resources

ReadMe (TXT)
Synthetic Subset (TXT)
Contains the subset of synthetic clinical notes (in TXT format, without...
Full synthetic dataset (ZIP)
Contains the dataset of synthetic clinical notes in European Portuguese,...
Prompts and Responses (CSV)
Contains a CSV file with the prompts and responses from prompt engineering....
Human Evaluation Results (CSV)
Contains CSV files with the results from the human evaluation process.

Additional Info

Field	Value
Author	Daniel Félix & Carla Teixeira Lopes
Last Updated	June 27, 2025, 14:54 (UTC)
Created	June 26, 2025, 13:51 (UTC)
Citation	Félix, D., & Teixeira Lopes, C. (2025). Dataset of synthetic clinical notes in European Portuguese generated using an open-source large language model, along with prompting and evaluation data [Data set]. INESC TEC. https://doi.org/10.25747/4GC6-DK48
Creation Date	December, 2024
DOI	https://doi.org/10.25747/4GC6-DK48
Data Collection Method	The synthetic dataset was generated with one-shot and few-shot prompting using an open-source large language model; evaluation was carried out by medical students and professionals
Format	XML (each clinical note is an XML file), CSV (prompting and evaluation data), TXT (each clinical note in the evaluation subset is a TXT file)
Instrument Name	Llama 3.2 with 3 billion parameters and instruction tuning (synthetic dataset), notasclinicas.inesctec.pt (evaluations)
Language	Portuguese
Project	This repository contains data produced during the dissertation project titled "Generating Synthetic Clinical Data in European Portuguese Using an Open-Source Large Language Model". The project was conducted by Daniel Filipe Souto Félix (up201905189@up.pt) at the Faculty of Engineering of the University of Porto, for the Master in Informatics and Computing Engineering, under the supervision of Professor Carla Teixeira Lopes, as part of the Health from Portugal project. The project's goal is to increase the amount of publicly available clinical data in European Portuguese, so that it can be freely used in research and development. To this end, we generated a dataset of synthetic clinical notes in European Portuguese using an open-source large language model. The dataset was assessed through specialized human evaluation and various automatic methods. The specialized evaluation was performed through a web application we developed.
Relation	10.1186/s13326-022-00269-1 (dataset used for fine-tuning model and prompting examples), 10.18653/v1/W19-5024 (dataset used for fine-tuning model)
Size	643.9 MB
Spatial Coverage	Synthetic dataset is intended to represent the country of Portugal
Temporal Coverage	Synthetic dataset was generated in December 2024
Type	Clinical notes, prompting records, evaluation results
Type of Instrument	Open-source large language model (to generate the synthetic dataset), web application developed by the authors (to collect evaluations)