Classification of online health messages 

The dataset has 487 annotated messages taken from MedHelp, an online health forum with several health communities (https://www.medhelp.org/). It was built for a master's thesis entitled "Automatic categorization of health-related messages in online health communities", part of the Master in Informatics and Computing Engineering of the Faculty of Engineering of the University of Porto. It expands a dataset created in previous work [1], whose objective was to propose a classification scheme to analyze messages exchanged in online health forums.

A website was built to allow the classification of additional messages collected from MedHelp. After using a Python script to scrape the five most recent discussions from popular forums (https://www.medhelp.org/forums/list), we sampled 285 of their messages to annotate. Between April 2022 and the end of May 2022, each message was classified three times by anonymous raters into 11 categories. For each message, the rater picked the categories associated with the message and its emotional polarity (positive, neutral, or negative).

Our dataset is organized in two CSV files: one containing information regarding the 855 (=3*285) classifications collected via crowdsourcing (CrowdsourcingClassification.csv), and the other containing the 487 messages with their final consensual classifications (FinalClassification.csv).

The CrowdsourcingClassification.csv file has one line per message. The first four columns describe the message. The remaining columns (marked with # below) give the number of raters who found that label representative of what the message conveys. As each message was classified three times, a value of 1 in one of these columns means that the other two raters did not select that label. A value of at least 2 means that the label was selected by the majority. The structure of this file is as follows:
        - text: the textual content of the message 
        - timestamp: a string with the format year-month-day hour:minute:second
        - userID: the user ID in MedHelp after being mapped to a sequential ID
        - threadID: incremental identifier to group the different threads
        - informational_support (#): Sharing knowledge and reducing uncertainty. It can include advice, situation appraisal and teaching. Example: I highly recommend seeing a nutritionist that specializes in addiction too. They can recommend supplements that will replace the things that the drugs took from your system and will help the baby replenish what it hasn't been getting
        - emotional_support (#): Expresses empathy towards people. It can include sympathy, confidentiality, encouragement, prayer, relief of blame, virtual affection and relationship. Example 1: I talked to my daughter about you and your wife. She feels awful for all of you because she knows what it's like for her and for her loved ones. But, she is hoping that your wife will agree to testing and get the help she needs. Example 2: I fully understand what you all are referring to when you talk about it being stressful mentally and emotionally. It's horrible being sick so regularly and not have an answer and a fix for it.
        - esteem_support (#): Helps improve one's confidence. It can include compliments or validation. Example 1: You are strong for standing up to it and should not feel embarrassed. Many women are in need of some help during and after pregnancy, it takes a brave and courageous woman to seek help. You can do it I promise. Example 2: I know how you’re feeling, I have ADD too and I don't do it to the extent that you do but when I get excited or bored I fidget with my hands a lot. I talk with my hands a lot too. Its definitely normal for us so don't worry, you are not a freak.
        - network_support (#): Allows a user to improve their social contacts and get to know other people in similar situations. Example: Please keep me posted!!! And good luck,, Us 40 year olds need to stick together!!!!
        - seeking_support (#): The message asks other users for support.
        - agreement (#): Simply agreeing with something said previously. Example: I certainly agree with that last perspective.
        - disagreement (#): Disagreeing with what was previously said. Example: I think you are in the wrong there.
        - gratitude (#): Expresses gratitude for an answer, recognising the help given. Example: Thank you everyone! Your answers have helped a lot!
        - congratulations (#): Expresses joy towards something accomplished. Example: Congratulations! Hope all goes well :).
        - sharing_experiences (#): Sharing of a user's experiences, thoughts, and feelings related to a certain event in their life. Example: I have the same kind of headache and I'm about one week from the start of my miscarriage. The pounding is terrible!!! I'm looking for answers as well.
        - negative (#): Negative emotions.
        - positive (#): Positive emotions.
        - neutral (#): Neither positive nor negative emotions.
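
The majority rule described above can be sketched in Python as follows. This is only an illustration, not the script used to build the dataset: the function name majority_labels and the sample row are hypothetical, and the column names are assumed to match the list above exactly.

```python
# Count columns of CrowdsourcingClassification.csv, as listed above
# (assumed to match the real CSV header exactly).
CATEGORY_COLUMNS = [
    "informational_support", "emotional_support", "esteem_support",
    "network_support", "seeking_support", "agreement", "disagreement",
    "gratitude", "congratulations", "sharing_experiences",
]
POLARITY_COLUMNS = ["negative", "positive", "neutral"]
N_RATERS = 3  # each message was classified three times


def majority_labels(row, columns, n_raters=N_RATERS):
    """Return the columns selected by more than half of the raters.

    `row` maps column names to counts (strings or ints, as produced by
    csv.DictReader); missing columns are treated as a count of 0.
    """
    return [name for name in columns if 2 * int(row.get(name, 0)) > n_raters]


# Hypothetical row, with made-up counts, for illustration only.
sample_row = {
    "text": "Thank you everyone! Your answers have helped a lot!",
    "gratitude": "3",
    "emotional_support": "1",
    "positive": "2",
    "neutral": "1",
}

sample_categories = majority_labels(sample_row, CATEGORY_COLUMNS)
sample_polarity = majority_labels(sample_row, POLARITY_COLUMNS)
```

With three raters, the condition is equivalent to a count of at least 2, so in the hypothetical row only gratitude and positive are kept.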

The FinalClassification.csv file contains all the messages, one per line, with their final consensual labels, marked with 1 for a positive classification and 0 for a negative one. For each message classified via crowdsourcing, we kept the labels with more than one selection, that is, those chosen by the majority of raters. The collected labels map to the labels of the dataset as follows: every label stays the same, except that "Congratulations", "Gratitude", "Agreement" and "Disagreement" are all mapped to "Interpersonal". The structure of the file is as follows:
-	text
-	emotional-support
-	informational-support
-	network-support
-	sharing-experiences
-	esteem-support
-	seeking-support
-	interpersonal
-	past-dataset: a boolean stating whether the instance comes from the past dataset (1) or from the crowdsourcing collection (0)
-	positive
-	negative
-	neutral
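
As a rough sketch of the conversion from crowdsourced counts to final labels, the Python function below applies the majority threshold and the "Interpersonal" mapping. It rests on assumptions not stated in this description: the function name to_final_labels is hypothetical, "interpersonal" is set to 1 when any one of its four source labels reaches the majority on its own, and the underscore column names are rewritten with hyphens to match the list above.

```python
# Labels that are merged into "interpersonal" in the final file.
INTERPERSONAL_SOURCES = ("congratulations", "gratitude", "agreement", "disagreement")

# Labels that keep their meaning in the final file.
KEPT_LABELS = (
    "emotional_support", "informational_support", "network_support",
    "sharing_experiences", "esteem_support", "seeking_support",
    "positive", "negative", "neutral",
)


def to_final_labels(counts, threshold=2):
    """Convert per-label rater counts into 0/1 final labels.

    Assumption: "interpersonal" is 1 when at least one of its source
    labels reaches the threshold by itself (counts are not summed).
    """
    final = {
        name.replace("_", "-"): int(int(counts.get(name, 0)) >= threshold)
        for name in KEPT_LABELS
    }
    final["interpersonal"] = int(
        any(int(counts.get(name, 0)) >= threshold for name in INTERPERSONAL_SOURCES)
    )
    final["past-dataset"] = 0  # rows built this way come from crowdsourcing
    return final


# Hypothetical counts for one message, for illustration only.
sample_counts = {"gratitude": 2, "agreement": 1, "emotional_support": 3, "positive": 2}
sample_final = to_final_labels(sample_counts)
```

Summing the four counts before thresholding would be a different rule and could produce different labels, so the choice above should be checked against the published files before relying on it.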



[1] Carla Teixeira Lopes and Bárbara Guimarães Da Silva. "A classification scheme for analyses of messages exchanged in online health forums." Proceedings of the Information Behaviour Conference (ISIC 2018). 2018.