This repository contains data produced for the dissertation "Real-time prediction of Wikipedia articles' quality". The project was conducted by Pedro Miguel Moás (up201705208@edu.fe.up.pt) at FEUP, University of Porto, for the Master in Informatics and Computing Engineering.

Our end goal is to provide Wikipedia users with a reliable and transparent tool for automatically assessing article quality within Wikipedia. That way, readers know beforehand whether an article is worth reading, while editors can easily detect existing flaws in the articles they encounter. We therefore propose a Google Chrome extension that uses machine learning to predict, in real time, the quality of Wikipedia articles.

The repository is structured in the following manner:

- Wikipedia Titles: Lists the titles of English Wikipedia's articles, for each quality level. Obtained through the categorymembers method of the MediaWiki API, in May 2022. Titles belonging to multiple quality levels are ignored.
  - FA.txt: Featured Articles. Obtained through category "Featured_articles"
  - FL.txt: Featured Lists. Obtained through category "FL-Class_articles"
  - GA.txt: Good Articles. Obtained through category "Wikipedia_good_articles"
  - A.txt: A-Class articles. Obtained through category "A-Class_articles"
  - B.txt: B-Class articles. Obtained through category "B-Class_articles"
  - C.txt: C-Class articles. Obtained through category "C-Class_articles"
  - Start.txt: Start-Class articles. Obtained through category "Start-Class_articles"
  - Stub.txt: Stub-Class articles. Obtained through category "All_stub_articles"
- Wikipedia Graph: Nodes and edges of Wikipedia's network graph, as of May 2022. Generated from the compressed dataset provided at https://law.di.unimi.it/webdata/enwiki-2022/. The CSV files use a '|' separator instead of commas to reduce the number of parsing errors.
  - enwiki-2022-nodes.csv: Supplies the mapping of each node Id to its respective article Title. Each node represents a Wikipedia article.
  - enwiki-2022-edges.csv: Lists the connections between the articles. An edge between a Source and a Target indicates that the Source article links to the Target article somewhere in its text.
- Default Dataset: Balanced English Wikipedia dataset used to train the prediction models. The dataset considers 6 classes (FA, GA, B, C, Start, Stub) and all 145 features. Titles for each quality level were randomly picked from the previously generated "Wikipedia Titles". Network features use the "Wikipedia Graph" information. A loading and training sketch is shown after this section.
  - 6000x6-csrhn_train.csv: Training data. Comprises ~70% of the dataset.
  - 6000x6-csrhn_test.csv: Testing data. Comprises ~30% of the dataset.
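As a quick-start illustration, the sketch below loads the default dataset with pandas and fits a Random Forest classifier, one of the algorithms listed under "ML Training Reports". This is only a minimal sketch, not the actual training pipeline: the comma separator, the label column name ("quality"), and the numeric-only feature selection are assumptions; adjust them to match the real CSV headers.

```python
# Minimal sketch: load the default dataset and fit one of the listed classifiers.
# The label column name ("quality") and the comma separator are assumptions;
# check the CSV headers before running.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

train = pd.read_csv("Default Dataset/6000x6-csrhn_train.csv")
test = pd.read_csv("Default Dataset/6000x6-csrhn_test.csv")

LABEL = "quality"  # hypothetical label column holding FA/GA/B/C/Start/Stub

# Keep only numeric feature columns (drops any title/identifier columns, if present).
X_train = train.drop(columns=[LABEL]).select_dtypes(include="number")
y_train = train[LABEL]
X_test = test.drop(columns=[LABEL]).select_dtypes(include="number")[X_train.columns]
y_test = test[LABEL]

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```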
- Dataset Construction Times: This folder fully details the results of the experiments measuring feature computation times. It contains a folder for each of the 6 classes (FA, GA, B, C, Start, Stub), all organized in the same manner. Each experiment measured the feature computation times of 500 random articles of each quality level and then averaged them, to minimize the impact of outliers. Each report therefore shows the time spent, in seconds, on each stage of feature calculation.
  - FA/GA/B/C/Start/Stub:
    - titles.txt: Article titles used in each experiment
    - clean_wiki.txt: Time measurements for cleaning wikitext
    - content.txt: Time measurements for calculating content features
    - history.txt: Time measurements for calculating history features
    - readability.txt: Time measurements for calculating readability features
    - revs.txt: Time measurements for fetching the article's revision history
    - style.txt: Time measurements for calculating style features
    - syllables.txt: Time measurements for estimating syllable counts
    - tokenizer.txt: Time measurements for running the word and sentence tokenizers
    - wikitext.txt: Time measurements for fetching the article's wikitext
    - total.txt: Total time measurements
- ML Training Reports: Complete reports of the machine learning training phase. Each subfolder contains the reports for experiments with a different feature subset or number of classes. Classification and regression reports show different information, as the appropriate metrics differ between the two types of tasks.
  - Subfolder naming convention: each subfolder is named with the initials of the feature categories used, followed by the number of distinct quality classes. For example, CSRHN6 lists the reports for models trained with Content, Style, Readability, History and Network features, and 6 levels of quality (all features and all classes). The report name indicates the used algorithm, as defined below.
    - ada_c: AdaBoost Classifier
    - ada_r: AdaBoost Regressor
    - forest_c: Random Forest Classifier
    - forest_r: Random Forest Regressor
    - gboost_c: Gradient Boosting Classifier
    - gboost_r: Gradient Boosting Regressor
    - gnb_c: Gaussian Naive Bayes Classifier
    - knn_c: K-Nearest Neighbors Classifier
    - linreg_r: Linear Regression
    - logreg_c: Logistic Regression
    - mlp_c: Multi-layer Perceptron Classifier
    - mlp_r: Multi-layer Perceptron Regressor
    - svc_c: Support Vector Classifier
    - svr_r: Support Vector Regressor
    - tree_c: Decision Tree Classifier
    - tree_r: Decision Tree Regressor
- Multi-Language Datasets: These datasets were designed for assessing and comparing our model's performance across different Wikipedia language versions. We used MediaWiki's API to obtain random Wikipedia articles of any quality, so the class distributions are extremely unbalanced. They have the same structure as the "Default Dataset", but without the Network features. Quality values were obtained from each Wikipedia's own content quality scale. See the sketch after this list.
  - multi-en-11096-csrh.csv: English dataset, contains 11096 articles.
  - multi-fr-11195-csrh.csv: French dataset, contains 11195 articles.
  - multi-pt-10525-csrh.csv: Portuguese dataset, contains 10525 articles.
  - multi-ru-10341-csrh.csv: Russian dataset, contains 10341 articles.
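The sketch below illustrates how one of the multi-language files could be inspected and used as a quick baseline: it prints the (highly unbalanced) class distribution and fits a simple classifier on a stratified split. The folder path, the label column name ("quality"), and the comma separator are assumptions and may differ from the actual files.

```python
# Minimal sketch: inspect the class balance of one multi-language dataset and
# run a quick baseline on it. Path, separator, and label column are assumptions.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("Multi-Language Datasets/multi-en-11096-csrh.csv")

LABEL = "quality"  # hypothetical label column
print(df[LABEL].value_counts())  # the class distribution is highly unbalanced

# Keep only numeric feature columns (drops any title/identifier columns, if present).
X = df.drop(columns=[LABEL]).select_dtypes(include="number")
y = df[LABEL]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

clf = GradientBoostingClassifier(random_state=42)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```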