Structured Insights or Preprocessing Artifacts? “Breaking Down” the Impact of Text Chunking Strategies on Topic Model Interpretability

Marcus, Becky

dc.contributor	Shaus, Arie
dc.contributor	Townsley, Eleanor
dc.contributor.advisor	Gebre-Medhin, Benjamin
dc.contributor.author	Marcus, Becky
dc.date.accessioned	2025-07-09T17:22:43Z
dc.date.available	2025-07-09T17:22:43Z
dc.date.gradyear	2025
dc.date.issued	2025-07-09
dc.description	Submitted as a data science thesis to the Department of Mathematics, Statistics, and Data Science
dc.description.abstract	Computational text analysis (CTA) has become an essential tool for sociologists seeking to extract cultural meaning from texts, particularly with the increased digitization of historical text corpora. Structural Topic Modeling (STM) is a popular exploratory tool for highlighting latent themes in text that warrant further investigation, but with new tools come new challenges for reliability and validity. Most researchers adhere to the text preprocessing methods suggested in prominent CTA literature, though with some variability in text chunk size, word tokenization, and vocabulary simplification and filtering. These decisions are often made somewhat uncritically as part of an established methodological procedure, yet they have the potential to yield markedly different topics and interpretations. Supplementing existing literature on preprocessing strategies for topic models, this study takes a mixed-methods data science approach to evaluate the effect that text unit size has on topic content and prevalence. It analyzes a corpus of presidential addresses from meetings of the American Economic Association from 1888 to 2022 and includes meeting year as a topic prevalence covariate. Contributing to a larger research effort on social science presidential addresses, this study not only uncovers preliminary patterns in institutional and academic discourse over time but also discusses how text chunking can be aligned with different empirical questions.
dc.description.sponsorship	Mathematics & Statistics
dc.description.sponsorship	Other or Special Major
dc.identifier.uri	https://hdl.handle.net/10166/6761
dc.language.iso	en
dc.rights.restricted	public
dc.subject	Data Science
dc.subject	Computational Social Science
dc.subject	Computational Text Analysis (CTA)
dc.subject	Topic Model Interpretation
dc.subject	Text Preprocessing
dc.subject	Document Chunking
dc.subject	American Economic Association (AEA)
dc.subject	Natural Language Processing (NLP)
dc.subject	Structural Topic Modeling (STM)
dc.subject	Sociology Research
dc.title	Structured Insights or Preprocessing Artifacts? “Breaking Down” the Impact of Text Chunking Strategies on Topic Model Interpretability
dc.type	Thesis
mhc.degree	Undergraduate
mhc.institution	Mount Holyoke College

Files

Original bundle

Name: marcus_2025_mhc_data_science_thesis.pdf
Size: 10.82 MB
Format: Adobe Portable Document Format

Collections

Student Theses and Honors Collection