This dataset contains two files created for the dissertation "A Social Media Tool for Domain-Specific Information Retrieval - A Case Study in Human Trafficking" by Tito Griné for the Master in Informatics and Computing Engineering at the Faculty of Engineering of the University of Porto (FEUP). Both files were built between 02/03/2022 and 09/03/2022.
The first file, "Topic profile dataset", includes Twitter profiles, each identified by its handle and associated with a topic to which it is highly related. The profiles were gathered by first selecting specific topics and finding lists of famous people within each. The Twitter API was then used to search for profiles using the names from those lists. The first profile returned for each search was manually analyzed and kept only if its relation to the topic was corroborated.
This dataset was used to evaluate the performance of an agnostic classifier designed to identify Twitter profiles related to a given topic. Each topic was specified as a set of keywords highly related to it.
The file contains 271 pairs of topics and Twitter profile handles. There are profiles spanning six different topics: Ambient Music (102 profiles); Climate Activism (69 profiles); Quantum Information (9 profiles); Contemporary Art (26 profiles); Tennis (52 profiles); and Information Retrieval (13 profiles). At the time this dataset was created, all Twitter handles were from publicly visible profiles.
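As a consistency check, the per-topic counts above can be tallied and compared against the stated total. The following is a minimal sketch, not part of the original work: the column name "topic" and the sample rows (with placeholder handles) are assumptions made for illustration, since the actual CSV header is not documented here.

```python
import csv
import io
from collections import Counter

# Profile counts per topic, as stated in the dataset description.
EXPECTED_COUNTS = {
    "Ambient Music": 102,
    "Climate Activism": 69,
    "Quantum Information": 9,
    "Contemporary Art": 26,
    "Tennis": 52,
    "Information Retrieval": 13,
}


def count_profiles_per_topic(csv_file, topic_column="topic"):
    """Tally rows per topic. The column name is an assumption."""
    reader = csv.DictReader(csv_file)
    return Counter(row[topic_column] for row in reader)


# The stated per-topic counts add up to the stated total of 271 pairs.
assert sum(EXPECTED_COUNTS.values()) == 271

# Tiny made-up sample (placeholder handles, not real data) to show the call.
sample = io.StringIO(
    "topic,handle\n"
    "Tennis,example_handle_1\n"
    "Tennis,example_handle_2\n"
    "Ambient Music,example_handle_3\n"
)
sample_counts = count_profiles_per_topic(sample)
print(dict(sample_counts))
```

With the real file, the same function could be called on a handle opened with open("topic-profile_dataset.csv", newline="", encoding="utf-8") and the result compared to EXPECTED_COUNTS, adjusting topic_column to the actual header.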
The second file, "Profile-website dataset", includes Twitter profiles, identified by their handle, linked to URLs of websites related to the entities behind the profiles. The starting list of Twitter handles was taken from the profiles in "topic-profile_dataset.csv". The links in each profile's description were gathered using the Twitter API, and each link was manually inspected to assess its relatedness to the profile from which it was taken.
This dataset helped evaluate the efficacy of an algorithm developed to classify websites as related or unrelated to a given Twitter profile.
Of the initial 271 profiles, at least one related link was found for 196. The remaining 75 were not included in this dataset. Hence, the dataset contains 196 unique Twitter handles and 325 distinct pairs of Twitter handle and URL, since a handle appears in more than one row when multiple URLs are related to it.
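The relationship between unique handles and handle-URL pairs can be checked with a short script. This is a sketch under stated assumptions: the column names "handle" and "url" and the sample rows are hypothetical, chosen only to show how a handle with several related URLs produces several rows.

```python
import csv
import io


def summarize_pairs(csv_file, handle_column="handle", url_column="url"):
    """Return (unique handles, distinct handle-URL pairs).

    Both column names are assumptions; adjust them to the real header.
    """
    pairs = {
        (row[handle_column], row[url_column])
        for row in csv.DictReader(csv_file)
    }
    handles = {handle for handle, _ in pairs}
    return len(handles), len(pairs)


# Arithmetic from the description: 271 starting profiles, 75 left out.
assert 271 - 75 == 196

# Made-up sample: one handle with two related URLs, one handle with one.
sample = io.StringIO(
    "handle,url\n"
    "example_handle_1,https://example.org/a\n"
    "example_handle_1,https://example.org/b\n"
    "example_handle_2,https://example.com/\n"
)
unique_handles, distinct_pairs = summarize_pairs(sample)
print(unique_handles, distinct_pairs)
```

Run on the real file, the function would be expected to return 196 and 325 if the description above holds and the assumed column names are replaced by the actual ones.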