Theses – ELTE Research Center for Computational Social Science

Réka Berbekár – Examining Trianon’s Memory Politics Using Machine Learning and Text Analytics

2022 Survey Statistics and Data Analytics MSC Supervisor Renáta Németh, PhD

Réka Berbekár (LinkedIn, Email)

More than 100 years after signing the Treaty of Trianon, the presence of Trianon in public discourse is still very active. Monuments are unveiled, commemorations are held, and the situation of Hungarians beyond the borders is a constant topic of discussion among journalists and politicians.

In my thesis, I examine whether articles on Trianon published on news portals with different political orientations differ in style and subject matter. I extracted topics from the articles using LDA topic modeling and analysed their style using the NarrCat tool, with the help of Tibor Pólya (Eötvös Loránd Research Network, Research Centre for Natural Sciences). I measured the differences between the portals’ communication by the success rate of classification algorithms. Topic affiliation probabilities and NarrCat scores were the explanatory variables, and the political affiliation of the website publishing each article was the target variable. The best algorithm classified the articles into one of the 4 political groups with 61.2% accuracy; the most important variables in this classification were the topic affiliation probabilities.

Zsolt Varga – Distance metric learning using Siamese networks for human pose similarity estimation

2022 Survey Statistics and Data Analytics MSC Supervisor Márton Rakovics

Zsolt Varga (LinkedIn)

This thesis proposes the use of deep similarity learning, specifically distance metric learning with a Siamese neural network architecture, to embed human poses into a lower dimensional space for similarity comparison. The goal is to create a map between the original input and the embedding such that the Euclidean distance is small for similar data points and large for dissimilar data points in the embedding space. The approach is shown to be effective in creating a semantic similarity-based human pose embedding that outperforms traditional approaches. The results demonstrate that using these embeddings leads to better classification performance and faster convergence during training. This approach has implications for creating systems that require non-trivial similarity measures, such as invariance to sidedness and the position of body parts, and can serve as input to further models. Overall, this thesis contributes to the development of more advanced techniques for human pose understanding and has potential applications in healthcare, education, fitness, and other fields.
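The distance-based objective described above can be illustrated with the contrastive loss that is often paired with Siamese networks. This is only a minimal sketch in plain Python: the abstract does not say which loss the thesis uses (a triplet loss would be another common choice), and the function names and the margin value are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, similar, margin=1.0):
    """Contrastive loss for a single pair of embeddings.

    Similar pairs are penalised by their squared distance, so training pulls
    them together. Dissimilar pairs are penalised only while they are closer
    than `margin` (a hypothetical value), so training pushes them apart.
    """
    d = euclidean(emb_a, emb_b)
    if similar:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

In a real Siamese setup both embeddings would come from the same network with shared weights, and the loss would be averaged over a batch of labelled pose pairs.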

Bendegúz Zaboretzky – Depression and COVID-19 – topic modeling of online forums

2021 Survey Statistics and Data Analytics MSC Supervisor Renáta Németh, PhD

Bendegúz Zaboretzky (GitHub, LinkedIn, Email)

The key role in a thorough understanding of depression lies with the person who is struggling with it. People in this situation can be effectively approached and examined through online forums on depression and related issues. A recent study did this very effectively, and the present work builds closely on it. The novelty of this research lies in examining the impact of COVID-19 and the resulting global pandemic on the discourse of depression. The aim of this paper is to build on the previous research, supplement its findings, and continue the line of investigation, taking this new effect into account. Accordingly, this study is also based on topic modeling and uses NLP (Natural Language Processing) methods, mainly LDA (Latent Dirichlet Allocation) and STM (Structural Topic Models), to present the results.

The research was carried out in connection with the ELTE RC2S2 research group project, as a continuation of this paper.

Bernadett Csala-Ferencz – Cluster analysis of online depression forum posts – Applying the scatter / gather method on textual data

2021 Survey Statistics and Data Analytics MSC Supervisor Renáta Németh, PhD

Bernadett Csala-Ferencz

Cases of depression are increasingly common, and internet forums provide a great opportunity to better understand the nature of mental illnesses and to identify severe cases of depression. For the latter, examining divergent uses of pronouns (such as increased use of the first person singular) is an effective means of identification. For my research, I performed a cluster analysis of 66,295 posts from English-language forums on depression to examine the groups into which these posts can be organized. Getting to know and understand these forums was not the only goal of this research. Methodologically, I wanted to find the optimal level of text preprocessing and examine whether the scatter/gather algorithm can be used effectively to find interpretable clusters. In the course of my work, 15 clusters were identified, and the applied scatter/gather clustering method proved to be a mostly useful tool for isolating well-interpretable clusters. The use of first person singular pronouns helped me discover a cluster at increased risk, but it would be useful to examine whether posts reflecting severe depression can also be identified through other linguistic markers.
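The pronoun-based marker mentioned above can be sketched as the share of first person singular pronouns among the tokens of a post. The pronoun list, the tokenisation rule, and the function name below are assumptions made for illustration and are not taken from the thesis.

```python
import re

# Assumed pronoun list; the thesis may use a different one.
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}


def first_person_share(post):
    """Share of tokens in `post` that are first person singular pronouns."""
    tokens = re.findall(r"[a-z']+", post.lower())
    if not tokens:
        return 0.0
    # Cut contractions at the apostrophe so that "i'm" is counted as "i".
    hits = sum(1 for t in tokens if t.split("'")[0] in FIRST_PERSON_SINGULAR)
    return hits / len(tokens)
```

Averaging this share over the posts of each cluster would give one simple way to compare clusters on this marker.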

Lilla Békési – Holocaust denial and Holocaust-related distortions on the far-right portal Kuruc.info

2021 Sociology BA Supervisor Ildikó Barna, PhD

Lilla Békési

In my thesis, I examined the phenomenon of Holocaust denial and Holocaust-related distortions in articles and comments published on the far-right portal Kuruc.info. For my thesis, I conducted a qualitative secondary analysis of the texts collected by Ildikó Barna and Árpád Knap, who used topic modelling to research antisemitism on said portal. Using the category system developed by Manfred Gerstenfeld, I sought to answer questions such as which types of Holocaust-related distortions appear on the portal and which are the most frequent. I also investigated whether antisemitic views related to Holocaust distortion are detectable and to what extent users of the portal try to obscure their views. I have also tried to give some insight into the extent to which articles and comments differ in content or wording.

Anna Farkas – Social biases in machine learning: A case study of Google Translate

2020 Sociology BA Supervisor Renáta Németh, PhD

In recent years, several studies have been published about the phenomenon that machine learning algorithms are prone to reinforce or amplify human biases. This paper is a case study that investigates gender bias in Google Translate and its translations of occupations from Hungarian (a gender-neutral language) to English (a gender-based language). Using quantitative methods, the study aims to measure the extent of gender bias in machine translations. It examines the use of pronouns in the English translation of sentences such as “ő egy orvos” (“he/she is a doctor”).

To measure the bias in the algorithm, the study compares Google Translate’s translations to the proportion of men and women in each occupation, and to society’s perception of those occupations. To assess whether people find those occupations feminine or masculine, we used an omnibus survey created with the help of the research company Inspira Group. The study found that Google Translate mirrors people’s perception of occupations to a greater extent than the proportion of men and women in those occupations.

The paper also examines how using attributives such as “good”, “very good”, “bad”, and “very bad” in the sentences modifies the translation of the pronouns.

Dániel Tóbiás – Analyzing gender disparity on Twitch.tv channels with text mining techniques

2020 Sociology MA

Dániel Tóbiás (LinkedIn; tobiasdani88@gmail.com)

Digitalization has opened a new era, and sociology has gained a new set of tools to analyze and survey society. Here I use one of these tools, text mining, to uncover gender disparity and gendered conversation on an online video game live-streaming platform and to demonstrate the potential of text mining. The results show some minor differences between female and male channels; however, there is no sign of gender disparity or objectification in the data.

Jakab Buda – Text classification with a recurrent neural network based language model

2020 Survey Statistics and Data Analytics MSC Supervisor Márton Rakovics

In my thesis, I study text classification with recurrent neural networks, more precisely profiling authors by age and gender with language models. The requirements in this field are continuously changing due to technological developments and the ever-changing forms of online content, so many different solutions have been developed for this task in the last couple of years. After a review of the most relevant natural language processing literature on word embeddings, text classification, and language models, I discuss the theoretical background of recurrent neural networks and the most important methodological questions of machine learning. Lastly, I test different models with varying architecture and size on the PAN 2013 author profiling database. The research question is whether a classifier that consists of separate models fitted to each class, and that labels an item according to the class of the model that fits it best, can be a viable alternative to the standard classifier architectures. Although, among the models fitted in the thesis, these classifiers do not have a better overall performance than those with a standard classifier architecture, they seem capable of more balanced performance across the different classes.
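The idea of fitting one language model per class and labelling an item with the class whose model fits it best can be sketched as follows. The thesis uses recurrent neural network language models; this sketch swaps in add-one smoothed unigram models purely to keep the example short. The class labels and training texts are invented for illustration and are not from the PAN 2013 data.

```python
import math
from collections import Counter


def train_unigram(texts):
    """Count word frequencies for one class; returns (counts, total tokens)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return counts, sum(counts.values())


def log_likelihood(text, model, vocab_size):
    """Add-one smoothed log-probability of `text` under a unigram model."""
    counts, total = model
    return sum(
        math.log((counts[word] + 1) / (total + vocab_size))
        for word in text.lower().split()
    )


def classify(text, models, vocab_size):
    """Return the class whose language model gives `text` the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(text, models[label], vocab_size))


# Toy training data, invented for illustration only.
training = {
    "teen": ["omg school was so boring today", "homework again lol"],
    "adult": ["the meeting ran late again", "mortgage rates are rising"],
}
models = {label: train_unigram(texts) for label, texts in training.items()}
vocab_size = len({w for texts in training.values() for t in texts for w in t.split()})
```

Calling `classify("boring homework", models, vocab_size)` compares the likelihood of the text under each class model. With recurrent models the same decision rule applies, only the likelihood comes from the network instead of word counts.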

Krisztián Boros – Meta-analysis of missing data handling methods with text-mining

2020 Survey Statistics and Data Analytics MSC

Krisztián Boros (LinkedIn; GitHub)

The ubiquity of missing data in quantitative research is undeniable. We may encounter missing data due to, for example, non-response, incorrect sampling, or data processing errors. During the past 50 years, researchers have developed a wide variety of missing data handling methods; the spectrum of available techniques extends from basic deletion methods (e.g. listwise and pairwise deletion) to more involved techniques (e.g. Multiple Imputation, EM-algorithm).
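Two of the simpler techniques on this spectrum, listwise deletion and mean imputation, can be sketched in a few lines. The data set and function names below are invented for illustration, missing values are represented by `None`, and the sketch assumes every column has at least one observed value.

```python
def listwise_deletion(rows):
    """Drop every row that contains at least one missing value (None)."""
    return [row for row in rows if None not in row]


def mean_imputation(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


# Toy data set invented for illustration: (age, income) with gaps.
data = [[25, 300], [40, None], [None, 500], [35, 400]]
```

Listwise deletion keeps only the two complete rows here, while mean imputation keeps all four rows but fills the gaps with column means, which understates the variance of the imputed variables.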

The aim of my thesis is twofold. On the one hand, I introduce a text-mining approach to collect and analyze papers, pointing out the advantages and disadvantages of this approach using the Total Survey Error Framework. On the other hand, I examine possible trends in missing data handling methods across years and scientific fields.

The results show that the popularity of advanced techniques (e.g. Multiple Imputation, EM-algorithm) has been growing over the past 20 years, but less advanced techniques (e.g. deletion methods, mean imputation) are still in widespread use. Regarding the methodology, several limitations of the text-mining approach were identified, such as the questionable generalizability and reliability of the results.

Norbert Kerekes – Multi-label classification of online forum posts

2020 Survey Statistics and Data Analytics MSC Supervisor Renáta Németh, PhD

Norbert Kerekes (LinkedIn)

Multi-label classification is a machine learning task that is seldom mentioned, considering how prevalent the problem is in everyday life. The thesis addresses this problem, aiming to review and compare algorithms suited to solving multi-label problems. The most important representatives of the two main algorithm families (problem transformation and algorithm adaptation methods) are presented on a text classification problem. The database contains depression-related online forum entries categorized according to the biopsychosocial model.
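To make the problem transformation family concrete, here is a minimal sketch of two widely used transformations, binary relevance and label powerset. The abstract does not say which specific methods the thesis compares, and the label names and annotations below are hypothetical.

```python
def binary_relevance_targets(label_sets, labels):
    """Binary relevance: one binary target vector per label.

    Each vector can then be learned by an ordinary binary classifier.
    """
    return {label: [int(label in s) for s in label_sets] for label in labels}


def label_powerset_targets(label_sets):
    """Label powerset: each distinct label combination becomes one class."""
    return [tuple(sorted(s)) for s in label_sets]


# Hypothetical annotations of three forum posts with biopsychosocial categories.
posts_labels = [{"bio", "psycho"}, {"social"}, {"psycho", "social"}]
```

Binary relevance ignores dependencies between labels, while label powerset captures them at the cost of a potentially large number of classes.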

András Hering – Applications of Random Forest methods

2019 Survey Statistics and Data Analytics MSC Supervisor Márton Rakovics

András Hering

Machine learning algorithms have emerged as an alternative to mainstream statistics, optimising prediction accuracy to its limits, but they are often hard to interpret. Leo Breiman argued for what he called algorithmic culture: the most accurate model is preferred to a less accurate but more interpretable one. In my thesis, I use Leo Breiman’s and Adele Cutler’s Random Forest classifier to re-evaluate a study of learning types that used logistic regression. My goal is to see what new and already known information the Random Forest model, which is expected to be more accurate, can provide in the social sciences, where interpretation is key. After introducing Random Forests as a complex ensemble of decision trees, I demonstrate the three main sources for evaluating the model: out-of-bag estimates, variable importance, and multidimensional scaling. In my analysis, I produce a marginally better RF classifier and find both similarities to and differences from the original research: one particular similarity is the connection between partial dependence based on the class vote proportions of trees and the logistic regression coefficients.
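The out-of-bag idea rests on the bootstrap sample drawn for each tree: rows that are not drawn can serve as a built-in test set for that tree. The sketch below shows only this sampling step, not a full Random Forest; the function name, sample size, and seed are assumptions made for illustration.

```python
import random


def bootstrap_split(n_rows, rng):
    """Draw a bootstrap sample of row indices and return (in_bag, out_of_bag)."""
    # Sample n_rows indices with replacement, as done for each tree.
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Rows never drawn are out-of-bag for this tree.
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the sketch is reproducible
in_bag, out_of_bag = bootstrap_split(1000, rng)
oob_share = len(out_of_bag) / 1000  # expected to be near 1/e, roughly 0.37
```

Each tree is evaluated on its own out-of-bag rows, and aggregating these predictions over all trees gives an error estimate without a separate validation set.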

Beáta Gallina – Sentiment analysis on articles from online news sites

2019 Survey Statistics and Data Analytics MSC Supervisor Renáta Németh, PhD

Beáta Gallina (https://github.com/bgallina, www.linkedin.com/in/bgallina)

In my thesis, I focus on sentiment analysis (SA) of Hungarian online news articles. In this case study, I present the methodological steps of text mining and sentiment analysis, with special emphasis on preprocessing, and the most important SA models, and then carry out a comparative analysis. In addition, I contrast two traditional models (lexicon-based and machine-learning-based) with a combination of the two, and use the best-performing model to answer the following social science research questions: To what extent do emotional attitudes towards political actors appear in the Hungarian online press? Did journalists’ perception of political actors change as a result of the elections? And is there a parallel between the results of traditional popularity polls and the results of SA, more specifically, is there a relationship between voters’ preferences and the valence of a political actor’s media presence?

After the model evaluation, I worked with a Naive Bayes classifier. Based on the results, the largest sentiment category is neutral, but the dominant class is strongly influenced by which political actor appears in the given text. The work revealed that election day had an impact on how politicians were portrayed in the media: most opposition politicians appeared in a more negative light in the opposition media after the vote than before. For some parties, polls and SA show a similar tendency.

The accuracy of the models could be further improved by including other features (namely topics, n-grams, and article authors), a larger training set, and a more comprehensive sentiment dictionary.

Keywords: elections, text mining, sentiment analysis, polls, machine learning, Naive Bayes classifier

Balázs Mayer – The effect of homophily on opinion dynamics processes in social networks – agent based social simulation

2018 Survey Statistics and Data Analytics MSC Supervisor Márton Rakovics

Balázs Mayer

I studied the effect of homophily on opinion dynamics processes in social networks using agent-based social simulation. My main hypothesis (based on the findings of Gargiulo and Gandica, 2017) was that greater opinion homophily leads to an increased chance of consensus formation.

In contrast to the original paper, where the opinion variable had a random uniform distribution and similarity between agents was measured only along this one dimension, my own growing network model considers the well-known phenomenon of preferential attachment, the homophily of agents by their demographic attributes (derived from ego-network data on Hungarian society in the 2000s using a case-control framework), and five different (simulated and real-world) opinion distributions, according to which homophily could be tuned.
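A growing network that combines preferential attachment with opinion homophily can be sketched as follows. This is a simplified illustration and not the thesis model: the exponential similarity kernel, the `beta` homophily parameter, the uniform opinions on [-1, 1], and the single link per new node are all assumptions, and demographic homophily is left out.

```python
import math
import random


def attachment_weights(new_opinion, opinions, degrees, beta):
    """Unnormalised link weights: preferential attachment times opinion similarity."""
    return [
        (deg + 1) * math.exp(-beta * abs(new_opinion - op))
        for op, deg in zip(opinions, degrees)
    ]


def grow_network(n_nodes, beta, rng):
    """Grow a network one node at a time; each newcomer links to one existing node."""
    opinions = [rng.uniform(-1, 1)]
    degrees = [0]
    edges = []
    for new in range(1, n_nodes):
        op = rng.uniform(-1, 1)
        weights = attachment_weights(op, opinions, degrees, beta)
        # Pick the target with probability proportional to its weight.
        target = rng.choices(range(new), weights=weights, k=1)[0]
        edges.append((new, target))
        degrees[target] += 1
        opinions.append(op)
        degrees.append(1)
    return opinions, edges
```

With `beta = 0` the sketch reduces to plain preferential attachment, and larger values of `beta` make links between agents with similar opinions more likely.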

The resulting graphs captured the phenomenon of more similar agents being connected with greater probability, in terms of both their opinions and their demographic attributes, and networks with increased opinion homophily displayed greater modularity than simple preferential attachment ones. However, the networks created by considering only the similarity of demographic attributes did not show increased modularity.

Analysis of the opinion dynamics processes in the networks confirmed the initial hypothesis: it seems that the consensus-stimulating effect of opinion homophily does not depend greatly on the distribution of the opinion variable, nor does introducing demographic homophily change this association.

References:

Gargiulo, F., Gandica, Y. (2017). The role of homophily in the emergence of opinion controversies. Journal of Artificial Societies and Social Simulation, 20 (3). URL: http://jasss.soc.surrey.ac.uk/20/3/8.html