Survey Statistics and Data Analytics MSc, 2019.
Thesis supervisor: Márton Rakovics
Machine learning algorithms have emerged as an alternative to mainstream statistics. They push prediction accuracy to its limits, but the resulting models are often hard to interpret. Leo Breiman argued for what he called the algorithmic modelling culture, in which the most accurate model is preferred to a less accurate but more interpretable one. In my thesis, I use Leo Breiman’s and Adele Cutler’s Random Forest classifier to re-evaluate a study of learning types that used logistic regression. My goal is to find out what information, both new and already known, can be drawn from the Random Forest model, which is expected to be more accurate, in the social sciences, where interpretation is key. After introducing Random Forests, a complex ensemble of decision trees, I demonstrate the three main sources for evaluating the model: out-of-bag estimates, variable importance, and multidimensional scaling. In my analysis, I produce a marginally better RF classifier and find both similarities to and differences from the original research. One notable similarity is the correspondence between partial dependence, based on the class vote proportions of the trees, and the logistic regression coefficients.
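The out-of-bag idea mentioned above can be illustrated with a minimal sketch. It is not the thesis's analysis: it assumes synthetic one-feature data and replaces full Random Forest trees with bagged decision stumps to stay short, and all function names (`bootstrap_indices`, `fit_stump`, `oob_error`) are hypothetical. Each stump is trained on a bootstrap sample, and every observation is scored only by the stumps that did not see it during training.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one tree's training sample)."""
    return [rng.randrange(n) for _ in range(n)]


def fit_stump(xs, ys, idx):
    """Return the threshold t minimising the error of 'predict 1 if x > t'
    on the bootstrap sample given by idx."""
    best_t, best_err = None, None
    for t in sorted({xs[i] for i in idx}):
        err = sum((xs[i] > t) != bool(ys[i]) for i in idx)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def oob_error(xs, ys, n_trees=50, seed=0):
    """Bag n_trees stumps and return (OOB error, mean OOB share per tree)."""
    rng = random.Random(seed)
    n = len(xs)
    votes = [[0, 0] for _ in range(n)]  # votes[i][c]: OOB votes for class c
    oob_share = []
    for _ in range(n_trees):
        idx = bootstrap_indices(n, rng)
        in_bag = set(idx)
        t = fit_stump(xs, ys, idx)
        # Observations absent from the bootstrap sample are out-of-bag.
        oob = [i for i in range(n) if i not in in_bag]
        oob_share.append(len(oob) / n)
        for i in oob:
            votes[i][int(xs[i] > t)] += 1
    # Score only observations that were out-of-bag for at least one stump.
    scored = [(v, y) for v, y in zip(votes, ys) if sum(v) > 0]
    wrong = sum((v[1] > v[0]) != bool(y) for v, y in scored)
    return wrong / len(scored), sum(oob_share) / n_trees


# Synthetic data (an assumption for illustration): noisy threshold at 0.5.
data_rng = random.Random(42)
xs = [data_rng.random() for _ in range(200)]
ys = [int(x + data_rng.gauss(0, 0.15) > 0.5) for x in xs]

err, share = oob_error(xs, ys)
print(f"OOB error: {err:.3f}, mean OOB share per tree: {share:.3f}")
```

The mean out-of-bag share should be close to 1/e (about 0.37), since each observation is left out of a bootstrap sample of size n with probability (1 - 1/n)^n. That is why the out-of-bag error can serve as a built-in estimate of test error without a separate hold-out set.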