Social Bot Detection

When you read a tweet, do you ever wonder whether it was sent by automated software rather than a human? A portion of social media accounts are in fact bots: accounts controlled by software and created with all kinds of intentions. Some are used to manipulate public opinion during important events such as elections. This project replicates a study that detects social bots with machine learning.

In detecting these bots, we aim for both scalability and generalizability: scalability enables faster real-time detection, and generalizability makes the classifier more robust to new bots that differ from previously seen ones. We replicate the paper Scalable and Generalizable Social Bot Detection through Data Selection (Yang, Varol, Hui, Menczer 2020) and test the robustness of its results. The paper proposed training a random forest model on user profile information alone, using only a subset of the available training data. Our results align with Yang et al. (2020) in that training on subsets of user-metadata datasets improves classifier performance. We differ from their models in that our best models achieve higher performance while using fewer training datasets. These differences might be due to dataset differences, feature calculation discrepancies, API rate limits, and the more recent time at which we sampled account features for certain datasets. We investigate relationships across all training and testing datasets to explain which models perform best, and analyze feature importance in the training datasets.
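The core idea can be sketched as follows: train a random forest on lightweight user-profile features only, so classification needs just a single profile lookup rather than tweet histories or network data. This is a minimal illustration, not the paper's implementation; the feature choices mimic the style of profile metadata (follower, friend, and status counts, account age) and the data here is entirely synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Hypothetical user-profile (metadata) features, one row per account
X = np.column_stack([
    rng.lognormal(5, 2, n),    # followers_count
    rng.lognormal(5, 2, n),    # friends_count
    rng.lognormal(6, 2, n),    # statuses_count
    rng.uniform(1, 4000, n),   # account age in days
])
# Synthetic labels loosely tied to the friends/followers ratio,
# with 10% label noise, purely for demonstration
y = ((X[:, 1] / (X[:, 0] + 1) > 1.5) ^ (rng.random(n) < 0.1)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"AUC: {auc:.3f}")
```

Because the model depends only on profile fields, "data selection" in the paper's sense amounts to choosing which labeled account datasets to include when building `X_tr`, then evaluating cross-dataset generalization on held-out datasets.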