SIGARRA News Corpus

This dataset was collected from SIGARRA, the information system of the University of Porto (UP). Each organic unit has its own domain and publishes academic news. We collected a sample of 1000 news articles and manually annotated 905 of them using the brat rapid annotation tool. The dataset consists of three files. The first is a CSV file containing news published between 2016-12-14 and 2017-03-01. The second is a ZIP archive containing one directory per organic unit, with a text file and an annotations file per news article. The third is an XML file containing the complete set of news in a format similar to that of the HAREM dataset. The dataset is particularly well suited to training named entity recognition models.

Data and Resources

sigarra_news_corpus-1000-20170302T1422 (CSV)
Comma-separated file with the following columns: news id, title, subtitle,...
sigarra-news-corpus (ZIP)
Annotated news in the standoff format. Each directory represents an organic...
sigarra-news-corpus (XML)
Merged version of the individually annotated news articles, in XML format...
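
The ZIP resource stores annotations in the brat standoff format, where each text-bound annotation is a tab-separated line holding an ID, a type with character offsets, and the covered text. The following is a minimal sketch of reading such lines in Python. The names `Entity` and `parse_standoff` are hypothetical helpers, and the sample sentence and the labels `Organization` and `Location` are illustrative assumptions, not taken from the corpus.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    ann_id: str  # annotation ID, e.g. "T1"
    label: str   # entity type exactly as written in the .ann file
    start: int   # start character offset into the companion text file
    end: int     # end character offset (exclusive)
    text: str    # surface string covered by the annotation


def parse_standoff(ann_content: str) -> List[Entity]:
    """Parse text-bound ("T") annotations from brat standoff content."""
    entities = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes and blank lines
        parts = line.split("\t", 2)
        if len(parts) < 3:
            continue  # skip malformed lines instead of failing
        ann_id, type_and_span, text = parts
        label, span = type_and_span.split(" ", 1)
        # Discontinuous spans look like "0 5;10 15"; keep the outer bounds.
        offsets = [int(x) for frag in span.split(";") for x in frag.split()]
        entities.append(Entity(ann_id, label, offsets[0], offsets[-1], text))
    return entities


# Hypothetical sample: text and labels are illustrative, not from the corpus.
sample_text = "A Universidade do Porto fica no Porto."
sample_ann = (
    "T1\tOrganization 2 23\tUniversidade do Porto\n"
    "T2\tLocation 32 37\tPorto\n"
)

for entity in parse_standoff(sample_ann):
    # For continuous spans, the offsets should recover the annotated string.
    assert sample_text[entity.start:entity.end] == entity.text
    print(entity.label, entity.start, entity.end, entity.text)
```

In practice the content would be read from each annotations file inside an organic unit's directory, paired with the text file for the same article. The actual entity labels used in the corpus should be taken from the files themselves.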

Additional Information

Field	Value
Source	https://sigarra.up.pt
Author	André Pires
Last Updated	February 19, 2020, 15:37 (UTC)
Created	June 13, 2017, 14:35 (UTC)
DOI	https://doi.org/10.25747/s5jn-q370
dc.Contributor	José Devezas, Sérgio Nunes
dc.Coverage.Spatial	Porto
dc.Coverage.Temporal	2016-12-14 to 2017-03-01
dc.Date	2017
dc.Format	.csv; .xml; .zip
dc.Format.Extent	4.22 MB
dc.Language	PT
dc.Publisher	INESC TEC
dc.Relation	Master's thesis: PIRES, André (2017). Named entity recognition on Portuguese web text. Porto: Faculdade de Engenharia da Universidade do Porto. http://hdl.handle.net/10216/106094
dc.Type	Entity Annotated News