Typewritten Digital Representations of Portuguese Cultural Heritage Documents from the 20th century

This dataset contains typewritten Portuguese documents from the Arquivo Nacional da Torre do Tombo (https://digitarq.arquivos.pt/). It includes records from two 20th-century fonds: the General Administration of National Treasury (DGFP) and the National Secretariat of Information (SNI). A single archival record can have multiple one-page digital representations. In total, we extracted 8,146 archival records with 26,812 one-page digital representations. The digital representations were classified by writing format (handwritten, typewritten, or blank), because some archival records contain handwritten or blank pages that are not relevant to our work. The dataset has 23,589 typewritten, 1,681 handwritten, and 1,542 blank digital representations.

The dataset was also classified into ten types of digital representations: letters, structured reports, non-structured reports, processes' covers, minutes' covers, minutes' content, books' covers, books' content, theatre plays' covers, and theatre plays' content. The typologies relevant to the OCR task were identified by examining the existing textual descriptions of the records and the document layouts of the digital representations. The classification was performed manually. The typewritten dataset has 3,264 letters, 6,560 structured reports, 1,970 non-structured reports, 82 processes' covers, 6 minutes' covers, 1,473 minutes' contents, 19 books' covers, 182 books' contents, 165 theatre plays' covers, and 8,845 theatre plays' contents.
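As a quick sanity check on the figures above, the following minimal sketch tallies the three writing-format counts and compares them with the reported total of 26,812 one-page digital representations. The counts are copied from the description; the variable names are illustrative only and are not part of the dataset or its .csv file.

```python
# Counts copied from the dataset description; names are illustrative.
format_counts = {
    "typewritten": 23_589,
    "handwritten": 1_681,
    "blank": 1_542,
}
reported_total = 26_812  # one-page digital representations, as reported

# Sum the per-format counts and derive each format's share of the total.
total = sum(format_counts.values())
shares = {name: count / total for name, count in format_counts.items()}

for name, count in format_counts.items():
    print(f"{name:<12} {count:>6}  {shares[name]:6.1%}")
print(f"{'total':<12} {total:>6}  matches reported total: {total == reported_total}")
```

The same pattern could be applied to the per-typology counts if a breakdown by document type is needed.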

Additional Info

Field Value
Author Mariana Dias, Carla Teixeira Lopes
Last Updated May 27, 2024, 12:47 (UTC)
Created May 9, 2022, 14:24 (UTC)
Citation Dias, M., & Lopes, C. T. (2022). Typewritten Digital Representations of Portuguese Cultural Heritage Documents from the 20th century [Data set]. INESC TEC. https://doi.org/10.25747/ZC25-1531
DOI https://doi.org/10.25747/ZC25-1531
dc.Coverage.Spatial Arquivo Nacional da Torre do Tombo, Palácio Nacional da Ajuda
dc.Coverage.Temporal 1910 - 1974
dc.Created February 5, 2021 to July 29, 2022
dc.File Size .zip file: 5.17 GB, .csv file: 1.5 MB
dc.Format .zip with *.tif, .csv
dc.Language PT
dc.Type Images and associated files