Structured Insights or Preprocessing Artifacts? “Breaking Down” the Impact of Text Chunking Strategies on Topic Model Interpretability

dc.contributor: Shaus, Arie
dc.contributor: Townsley, Eleanor
dc.contributor.advisor: Gebre-Medhin, Benjamin
dc.contributor.author: Marcus, Becky
dc.date.accessioned: 2025-07-09T17:22:43Z
dc.date.available: 2025-07-09T17:22:43Z
dc.date.gradyear: 2025
dc.date.issued: 2025-07-09
dc.description: Submitted as a data science thesis to the Department of Mathematics, Statistics, and Data Science
dc.description.abstract: Computational text analysis (CTA) has become an essential tool for sociologists seeking to extract cultural meaning from texts, particularly with the increased digitization of historical text corpora. Structural Topic Modeling (STM) is a popular exploratory tool for highlighting latent themes in text that warrant further investigation, but with new tools come new challenges for reliability and validity. Most researchers will adhere to the text preprocessing methods suggested in prominent CTA literature, though with some variability in the size of the text chunk, word tokenization, and vocabulary simplifying and filtering. These decisions are often made somewhat uncritically as part of an established methodological procedure yet have the potential to yield markedly different topics and interpretations. Supplementing existing literature on preprocessing strategies for topic models, this study takes a mixed-methods data science approach to evaluate the effect that text unit size has on topic content and prevalence. It analyzes a corpus of presidential addresses from meetings of the American Economic Association from 1888-2022 and includes meeting year as a topic prevalence covariate. Contributing to a larger research effort on social science presidential addresses, this study not only uncovers preliminary patterns in institutional and academic discourse over time but also discusses how text chunking can be aligned with different empirical questions.
dc.description.sponsorship: Mathematics & Statistics
dc.description.sponsorship: Other or Special Major
dc.identifier.uri: https://hdl.handle.net/10166/6761
dc.language.iso: en
dc.rights.restricted: public
dc.subject: Data Science
dc.subject: Computational Social Science
dc.subject: Computational Text Analysis (CTA)
dc.subject: Topic Model Interpretation
dc.subject: Text Preprocessing
dc.subject: Document Chunking
dc.subject: American Economic Association (AEA)
dc.subject: Natural Language Processing (NLP)
dc.subject: Structural Topic Modeling (STM)
dc.subject: Sociology Research
dc.title: Structured Insights or Preprocessing Artifacts? “Breaking Down” the Impact of Text Chunking Strategies on Topic Model Interpretability
dc.type: Thesis
mhc.degree: Undergraduate
mhc.institution: Mount Holyoke College

Files

Original bundle
Name: marcus_2025_mhc_data_science_thesis.pdf
Size: 10.82 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 2.13 KB
Format: Item-specific license agreed upon to submission