Dataset of synthetic clinical notes in European Portuguese generated using an open-source large language model, along with prompting and evaluation data

This dataset was generated using an open-source large language model and carefully curated prompts, simulating realistic clinical narratives while ensuring no real patient data is included.

The primary purpose of this dataset is to support the development, evaluation, and benchmarking of Artificial Intelligence tools for clinical and biomedical applications in the Portuguese language, especially European Portuguese. It is particularly valuable for information extraction (IE) tasks such as named entity recognition, clinical note classification, summarization, and synthetic data generation in low-resource language settings.

The dataset promotes research on the responsible use of synthetic data in healthcare and aims to serve as a foundation for training or fine-tuning domain-specific Portuguese language models in clinical IE and other natural language processing tasks.

About the dataset

XML files comprising 98,571 fully synthetic clinical notes in European Portuguese, divided into 4 types: 24,759 admission notes, 24,411 ambulatory notes, 24,639 discharge summaries, and 24,762 nursing notes;
CSV file with prompts and responses from prompt engineering;
CSV files with prompts and responses from synthetic dataset generation;
CSV file with results from human evaluation;
TXT files containing 1,000 clinical notes (250 of each type) taken from the synthetic dataset and used during automatic evaluation.
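As a rough illustration of how the XML notes might be read, the sketch below walks a local folder and extracts the plain text of each note. It assumes the archive has been unpacked into a directory of `.xml` files. The function name `load_notes` is hypothetical, and because the element names inside each note are not described here, the code joins all text nodes rather than relying on a particular schema.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def load_notes(root_dir):
    """Yield (file name, plain text) for every XML file under root_dir."""
    for path in sorted(Path(root_dir).rglob("*.xml")):
        tree = ET.parse(path)
        # The note schema is an unknown here, so collect every text node
        # instead of looking up specific element names.
        pieces = (t.strip() for t in tree.getroot().itertext())
        text = " ".join(p for p in pieces if p)
        yield path.name, text
```

Iterating over `load_notes("path/to/unpacked/dataset")` would then give one `(file name, text)` pair per note; the path is a placeholder for wherever the files were extracted.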

Data and resources

Additional info

Field Value
Author Daniel Félix & Carla Teixeira Lopes
Last Updated June 27, 2025, 14:54 (UTC)
Created June 26, 2025, 13:51 (UTC)
Citation Félix, D., & Teixeira Lopes, C. (2025). Dataset of synthetic clinical notes in European Portuguese generated using an open-source large language model, along with prompting and evaluation data [Data set]. INESC TEC. https://doi.org/10.25747/4GC6-DK48
Creation Date December 2024
DOI https://doi.org/10.25747/4GC6-DK48
Data Collection Method The synthetic dataset was generated with one-shot and few-shot prompting using an open-source large language model; evaluation was carried out by medical students and professionals
Format XML (each clinical note is an XML file), CSV (prompting and evaluation data), TXT (each clinical note in the evaluation subset is a TXT file)
Instrument Name Llama 3.2 with 3 billion parameters and instruction tuning (synthetic dataset), notasclinicas.inesctec.pt (evaluations)
Language Portuguese
Project This repository contains data produced during the dissertation project titled "Generating Synthetic Clinical Data in European Portuguese Using an Open-Source Large Language Model". The project was conducted by Daniel Filipe Souto Félix (up201905189@up.pt) at the Faculty of Engineering of the University of Porto, for the Master in Informatics and Computing Engineering, under the supervision of Professor Carla Teixeira Lopes, and was integrated in the Health from Portugal project. Its goal is to increase the amount of publicly available clinical data in European Portuguese so that it can be freely used in research and development. To this end, we generated a dataset of synthetic clinical notes in European Portuguese using an open-source large language model. The dataset was evaluated through specialized human evaluation and various automatic methods. The specialized evaluation was performed through a web application we developed.
Relation 10.1186/s13326-022-00269-1 (dataset used for fine-tuning model and prompting examples), 10.18653/v1/W19-5024 (dataset used for fine-tuning model)
Size 643.9 MB
Spatial Coverage Synthetic dataset is intended to represent the country of Portugal
Temporal Coverage Synthetic dataset was generated in December 2024
Type Clinical notes, prompting records, evaluation results
Type of Instrument Open-source large language model to generate the synthetic dataset, web application created by us to collect evaluations