Data Science in Social Research

One challenge of applying data analytics in sociology is that data science has been institutionalized outside of sociology, whereas sociology's expertise has traditionally rested on its own research methods. Another challenge is epistemological: it concerns the noisiness and validity of digital data and the question of explanation and causation, which is highly important for sociology. These challenges form the background to the tension between Big Data based findings about society and the sociological skepticism that questions the potential of this kind of knowledge production. The challenges can be addressed by redefining the methodological basis of sociological research and organically incorporating data science know-how into sociology's own methods. The solution also requires combining qualitative and quantitative approaches to analysis, and adopting knowledge-driven science instead of the data-driven approach.

Previous publications in epistemology/sociology of science

  • Katona, Eszter, Németh, Renáta, Kmetty, Zoltán: Text analytics in social sciences – An example for NLP’s application (in Hung., submitted)
  • Bárdits, Anna, Németh, Renáta (2017): The rite of statistical significance testing – contemporary critics; the rite in sociology. Szociológiai Szemle, 27:(1) pp. 119-125. (in Hung.)
  • Bárdits, Anna, Németh, Renáta, Terplán, Győző (2016): An old problem in the spotlight again. The mistaken practice of the null-hypothesis significance test. (in Hung.) Statistical Review, 94:(1) pp. 52-75.
  • Németh, Renáta (2015): Causal inference in empirical sociological research. Szociológiai Szemle, 25:(2) pp. 2-30. (in Hung.)
  • Németh, Renáta (2015): Do numbers really speak for themselves? Replika, Special issue on Big data and Sociology, 92-92, pp. 203-208. (in Hung.)
  • Németh, Renáta (2014): Methods of quantitative social research paradigms., 2014/3, pp. 1-16. (in Hung.)

Foregoing results

Our research stream is motivated by the continuously growing interest of the social sciences in data science. Automated text analytics is one example: the following figure shows that its popularity has been growing continuously in recent years, both in general and in each discipline investigated (to access publication data we used Dimensions). Each trend line keeps rising even after normalizing for the total number of publications in the discipline. The topic’s percentage share in sociology increased faster than in the sciences in general. In summary, automated text analytics is becoming an increasingly recognized approach in sociology.
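The normalization mentioned above can be illustrated with a minimal sketch. It assumes yearly publication counts are available as plain dictionaries; the function name and the numbers are hypothetical toy values for illustration only and are not Dimensions data.

```python
def topic_share(topic_counts, total_counts):
    """Return the topic's percentage share of all publications, per year.

    topic_counts: {year: number of publications on the topic}
    total_counts: {year: number of all publications in the discipline}
    Years with no publications at all are skipped to avoid division by zero.
    """
    shares = {}
    for year, total in sorted(total_counts.items()):
        if total > 0:
            shares[year] = 100.0 * topic_counts.get(year, 0) / total
    return shares


# Hypothetical toy counts (NOT real publication data).
topic = {2016: 10, 2017: 18, 2018: 30}
total = {2016: 1000, 2017: 1200, 2018: 1500}

shares = topic_share(topic, total)
for year, share in shares.items():
    print(f"{year}: {share:.2f}% of publications")
```

Comparing shares rather than raw counts separates genuine growth of the topic from the general growth in the number of publications.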

Related publications

Németh, Renáta; Koltai, Júlia (2019): Sociological knowledge discovery through text analytics. In: Rudas, Tamás – Péli, Gábor (eds.) Pathways Between Social Science and Computational Social Science – Theories, Methods and Interpretations. New York, NY, Springer. (forthcoming)

In this work, drawing on recent research reports, we discuss the advances, challenges and opportunities of Big Data text analytics in sociology. The advances include developments in information technology, data science, AI and NLP, which were originally and primarily driven by business and technology, as well as the rapid growth of computing capacity. These advances create opportunities: social behavior can be observed directly rather than only through self-reports, observation and analysis can happen in real time, and, thanks to the development of NLP methods, the understanding of content is becoming deeper.

As our paper shows, there are new possibilities for sociological research that are in some sense just a byproduct of information science. We introduce recently developed methods that can be applied to specific sociological problems outside the scope of business applications. We present sociological topics not yet studied in this area and show the new insights the approach can offer into classical sociological questions. As our aim is to encourage sociologists to enter this field, we discuss the new methods on the basis of the classic quantitative approach, using its concepts and terminology, and we also address the question of the new skills required of traditionally trained sociologists.


Koltai, Júlia – Kmetty, Zoltán – Bozsonyi, Károly (2019) From Durkheim to machine learning – finding the relevant sociological content in a social media discourse. In: Rudas, Tamás – Péli, Gábor (eds.) Pathways Between Social Science and Computational Social Science – Theories, Methods and Interpretations. New York, NY, Springer. (forthcoming)

Suicide has been a focus of social scientists since Durkheim. The internet and social media sites provide new ways for people to express positive feelings, but they are also platforms for expressing suicidal ideation or depressed thoughts. Most of this content is not about actual suicides, but some of it is a cry for help. Moreover, suicide and depression related content varies among platforms, and it is not evident how a researcher can find such content in the mass of social media data. Our paper uses a corpus of more than 4 million Instagram posts related to mental health problems. After defining the initial corpus, we present two different strategies for finding the relevant sociological content in the noisy environment of social media. The first approach starts with topic modelling (Latent Dirichlet Allocation), whose output serves as the basis of a supervised classification method based on advanced machine learning techniques. The other strategy is built on a word embedding language model based on artificial neural networks.


Bartus, Tamás – Kisfalusi, Dorottya – Koltai, Júlia (2019) Logisztikus regressziós együtthatók összehasonlítása (The Comparison of Coefficients in Logistic Regression) In: Statisztikai Szemle (Hungarian Statistical Review) 97(3): 221-240.

Recently, increasing attention has been devoted to the problem that estimated coefficients of logistic (and other non-linear) regression models cannot be compared across groups, samples, or nested model specifications due to the possible differences in the magnitude of unobserved heterogeneity. This study reviews methods which aim to solve this problem and investigates their effectiveness through simulation. Parameter estimates of nested model specifications can be made comparable using y-standardization or by comparing the estimates of the multivariate model to the estimates of a special, quasi-univariate model. Methods which aim to make coefficients comparable across groups and samples (such as testing the proportionality of interaction effects and heterogeneous choice models), however, do not provide adequate solutions for the problem. Causes behind this failure are discussed. 
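As a reminder of what y-standardization does, the following sketch uses the usual latent-variable formulation of the logit model. The symbols are notation assumed here for illustration and are not taken from the paper itself.

```latex
% Notation is an assumption made for this illustration.
\begin{align*}
  % Latent-variable form of the logit model; the residual variance is fixed by convention.
  y^{*} &= \mathbf{x}^{\top}\boldsymbol{\beta} + \varepsilon,
  \qquad \operatorname{Var}(\varepsilon) = \frac{\pi^{2}}{3} \\
  % Variance of the latent outcome implied by the fitted model.
  \widehat{\operatorname{Var}}(y^{*}) &= \hat{\boldsymbol{\beta}}^{\top}\,
  \widehat{\operatorname{Var}}(\mathbf{x})\,\hat{\boldsymbol{\beta}} + \frac{\pi^{2}}{3} \\
  % y-standardized coefficient of predictor k.
  \hat{\beta}_{k}^{\,Sy^{*}} &= \frac{\hat{\beta}_{k}}{\sqrt{\widehat{\operatorname{Var}}(y^{*})}}
\end{align*}
```

Because the residual variance is fixed, adding a predictor changes the implied variance of the latent outcome and thereby rescales every coefficient. Dividing by the estimated standard deviation of the latent outcome expresses the coefficients on a common scale across nested specifications.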


Related presentations

Németh, Renáta: Data Science and Statistics. Presentation and opening of a debate. Meeting of the Hungarian Society for Clinical Biostatistics, October 19, 2018.

The presentation gave an overview of the methodological paradigm of “Big Data” and its relation to classical statistics. The debate concerned the relevance of the problem from a biostatistical point of view.


Kmetty, Zoltán – Koltai, Júlia: Big data based decision making mechanisms from the viewpoint of social sciences. Presentation at the HUB Design House event ‘The Power of Big Data’, January 9, 2019.

We presented the possibilities of decision making based on large-scale data, focusing on the dangers that arise when this type of decision making does not work properly. Within the latter topic, we emphasised the importance of interpretation and causality.


Kmetty, Zoltán – Koltai, Júlia: Understanding Cultural Choices with NLP (2019). Presentation at the Data Science Meetup Budapest, May 9, 2019.

In parallel with the rise of digital textual data, natural language processing methods have developed rapidly in the last decade. In our presentation, we will focus on word embedding methods based on artificial neural networks, which have become widespread in recent years. These methods are applied in different fields: by linguists for dictionary building, by developers for music video recommendation systems, by companies for the analysis of product reviews, and so on. However, their application to understanding human behaviour and culture has been limited so far, even though the huge amount of available digital data (text) provides a lot of information about our preferences, choices and the way we think. We will show several examples of the use of word embedding methods in this field. The presentation also provides details about the methodology, the problems to be solved and the directions of further development.
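To give a sense of how word embeddings are typically queried, here is a minimal sketch of nearest-neighbour lookup by cosine similarity. It is not the method used in the presentation: the function names are hypothetical, and the three-dimensional vectors are invented toy values, whereas real embeddings are learned from a corpus and usually have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings, top_n=2):
    """Rank all other words by cosine similarity to `word`, highest first."""
    target = embeddings[word]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy vectors (NOT learned from any real corpus).
toy_embeddings = {
    "opera": [0.9, 0.1, 0.0],
    "theatre": [0.8, 0.2, 0.1],
    "football": [0.0, 0.9, 0.3],
}

neighbours = most_similar("opera", toy_embeddings)
for other, score in neighbours:
    print(f"{other}: {score:.3f}")
```

In this toy space "theatre" lies closer to "opera" than "football" does. With embeddings trained on real text, the same kind of query can indicate which cultural items tend to appear in similar contexts.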