Natural Language Processing and Spanglish: Approaches towards Part-of-Speech Tagging Code-Switched Text
Natural language processing (NLP) is a field dedicated to the computational understanding of human language. Through computational analysis of human text and speech, the field has made remarkable strides and produced tools that many people use every day, such as Siri and Google Translate. At the core of these powerful systems are a number of common NLP tasks that are key to a system's ability to process language. One of these tasks is part-of-speech (POS) tagging, which involves assigning a part of speech to each word in an input text. Although NLP has been a heavily researched subfield of computing since the 1950s, to date the vast majority of work in the field has focused on a small number of monolingual world languages with large amounts of available data and many systems designed for their analysis, commonly referred to as high-resource languages. While there has been significant research into multilingual text and speech, this research has historically not included code-switched language. Code-switching, sometimes referred to as code-mixing, occurs when multilingual speakers alternate between two or more languages or dialects within a single conversation or utterance. Although code-switching is nearly universal within multilingual communities, the vast majority of NLP tools and products are optimized for monolingual input, ignoring the natural speech and writing patterns of many people and communities throughout the world. POS tagging research, specifically, has yielded many models and methods for tagging monolingual data, but much less work has been done on POS tagging code-switched language. This thesis takes one such monolingual model, the Stanford Part-of-Speech Tagger, and analyzes approaches through which its performance can be greatly improved for Spanglish (code-switched Spanish and English) input.
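To make the POS tagging task concrete, the sketch below assigns a tag to each token in a sentence. The tiny lexicon and the noun fallback are hypothetical stand-ins for illustration only; real systems such as the Stanford POS Tagger use trained statistical models rather than word lists.

```python
# Minimal illustration of POS tagging: each token receives a
# part-of-speech label. The lexicon here is a toy example.
LEXICON = {
    "the": "DET",
    "cat": "NOUN",
    "sleeps": "VERB",
}

def tag(tokens):
    # Unknown words fall back to NOUN, a common baseline heuristic.
    return [(tok, LEXICON.get(tok.lower(), "NOUN")) for tok in tokens]

print(tag("The cat sleeps".split()))
# → [('The', 'DET'), ('cat', 'NOUN'), ('sleeps', 'VERB')]
```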
Specifically, two models are developed and analyzed: 1) a multilingual approach that integrates separate monolingual models for the matrix and secondary languages (the language splitting model) and 2) a translation approach in which the code-switched input is translated before tagging (the translation model). Both models tagged monolingual input less accurately than the baseline, but both substantially outperformed the baseline on code-switched data, achieving results on par with previous studies that developed and trained POS taggers specifically for code-switched language.
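The language splitting model can be sketched as follows: identify each token's language, then route it to the monolingual tagger for that language. All components below (the word lists, the per-token language check, the noun fallback) are hypothetical simplifications; the thesis integrates actual monolingual models from the Stanford POS Tagger.

```python
# Sketch of the language splitting approach for Spanglish input.
# Toy Spanish and English lexicons stand in for trained monolingual taggers.
ES_WORDS = {"el": "DET", "gato": "NOUN", "duerme": "VERB"}
EN_WORDS = {"the": "DET", "cat": "NOUN", "sleeps": "VERB"}

def tag_language_splitting(tokens):
    tagged = []
    for tok in tokens:
        low = tok.lower()
        if low in ES_WORDS:
            # Token identified as Spanish: use the Spanish tagger.
            tagged.append((tok, ES_WORDS[low]))
        else:
            # Otherwise fall back to the matrix-language (English) tagger,
            # guessing NOUN for out-of-vocabulary words.
            tagged.append((tok, EN_WORDS.get(low, "NOUN")))
    return tagged

# A Spanglish sentence mixing Spanish and English within one utterance:
print(tag_language_splitting("el gato sleeps".split()))
# → [('el', 'DET'), ('gato', 'NOUN'), ('sleeps', 'VERB')]
```

The translation model, by contrast, would translate the whole utterance into one language first and then apply a single monolingual tagger, trading per-token routing for a translation step.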