Jakab Buda – Text classification with a recurrent neural network-based language model

2020, Survey Statistics and Data Analytics MSc. Supervisor: Márton Rakovics

In my thesis I study text classification with recurrent neural networks, more precisely author profiling by age and gender with language models. The requirements in this field change continuously due to technological developments and the ever-changing forms of online content, so in the last couple of years many different solutions have been developed for this task. After a review of the most relevant natural language processing literature on word embeddings, text classification, and language models, I discuss the theoretical background of recurrent neural networks and the most important methodological questions of machine learning. Lastly, I test models of varying architecture and size on the PAN 2013 author profiling dataset. The central question of the thesis is whether a classifier built from separate language models, one fitted to each class, which labels an item with the class of the model that fits it best, can be a viable alternative to standard classifier architectures. Although, among the models fitted in the thesis, these classifiers do not achieve better overall performance than those with a standard classifier architecture, they appear capable of more balanced performance across the different classes.
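
To illustrate the classifier design described above, the following is a minimal sketch of a per-class language-model classifier: one recurrent language model is fitted to each class, and a new item is labelled with the class whose model fits it best (lowest average negative log-likelihood). It assumes PyTorch and a toy character-level setup; the model sizes, training settings, and data are illustrative assumptions, not the configuration used in the thesis.

```python
import torch
import torch.nn as nn

class CharLM(nn.Module):
    """Small recurrent (LSTM) character-level language model."""
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.out(h)

def encode(text, stoi):
    return torch.tensor([[stoi[c] for c in text]], dtype=torch.long)

def train_lm(texts, stoi, vocab_size, epochs=80):
    """Fit one language model to the texts of a single class."""
    model = CharLM(vocab_size)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for t in texts:
            x = encode(t, stoi)
            logits = model(x[:, :-1])  # predict the next character at each position
            loss = loss_fn(logits.reshape(-1, vocab_size), x[:, 1:].reshape(-1))
            opt.zero_grad(); loss.backward(); opt.step()
    return model

def avg_nll(model, text, stoi, vocab_size):
    """Average next-character negative log-likelihood of a text under a model."""
    x = encode(text, stoi)
    with torch.no_grad():
        logits = model(x[:, :-1])
        return nn.functional.cross_entropy(
            logits.reshape(-1, vocab_size), x[:, 1:].reshape(-1)
        ).item()

# Toy corpora standing in for the per-class training texts (hypothetical data).
corpus = {
    "class_a": ["hello there", "hello again", "hello world"],
    "class_b": ["goodbye now", "goodbye friend", "goodbye all"],
}
chars = sorted(set("".join(t for ts in corpus.values() for t in ts)))
stoi = {c: i for i, c in enumerate(chars)}
V = len(chars)

# One language model fitted per class.
models = {label: train_lm(texts, stoi, V) for label, texts in corpus.items()}

# Classify by the class whose model fits the item best (lowest average NLL).
item = "hello friend"
scores = {label: avg_nll(m, item, stoi, V) for label, m in models.items()}
print(min(scores, key=scores.get), scores)
```

A standard classifier would instead map the text directly to a class label with a single model; the design sketched here trades that for one generative model per class, which is what allows the per-class comparison discussed in the thesis.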