Structured Insights or Preprocessing Artifacts? “Breaking Down” the Impact of Text Chunking Strategies on Topic Model Interpretability
Date
2025-07-09
Authors
Abstract
Computational text analysis (CTA) has become an essential tool for sociologists seeking to extract cultural meaning from texts, particularly as historical text corpora are increasingly digitized. Structural Topic Modeling (STM) is a popular exploratory tool for highlighting latent themes in text that warrant further investigation, but new tools bring new challenges for reliability and validity. Most researchers adhere to the text preprocessing methods suggested in prominent CTA literature, though with some variability in the size of the text chunk, word tokenization, and vocabulary simplification and filtering. These decisions are often made somewhat uncritically as part of an established methodological procedure, yet they have the potential to yield markedly different topics and interpretations. Supplementing existing literature on preprocessing strategies for topic models, this study takes a mixed-methods data science approach to evaluate the effect that text unit size has on topic content and prevalence. It analyzes a corpus of presidential addresses from meetings of the American Economic Association from 1888 to 2022 and includes meeting year as a topic prevalence covariate. Contributing to a larger research effort on social science presidential addresses, this study not only uncovers preliminary patterns in institutional and academic discourse over time but also discusses how text chunking can be aligned with different empirical questions.
Description
Submitted as a data science thesis to the Department of Mathematics, Statistics, and Data Science
Keywords
Data Science, Computational Social Science, Computational Text Analysis (CTA), Topic Model Interpretation, Text Preprocessing, Document Chunking, American Economic Association (AEA), Natural Language Processing (NLP), Structural Topic Modeling (STM), Sociology Research