A Sociolinguistic Analysis of a Deep Learning-Based Classification Model of South American Voseo in X Posts
Author ORCID Identifier
0000-0002-0839-7001
Document Type
Paper Presentation
Presenter Language
English
Research Area
Language Variation in Digital Spaces
Location
MBSC Gallery Room 308
Start Date
19-10-2024 9:30 AM
End Date
19-10-2024 10:00 AM
Abstract
Here, I present the implementation of a dialect classification system that uses voseo in posts on X (formerly Twitter) to identify speakers of Colombian (Paisa and Caleño) and Argentine (Buenos Aires and La Plata) Spanish. Two datasets of over 18,000 posts were collected from recent X posts according to each post's geolocation. The data were used to train and evaluate a transformer-based machine learning classifier of South American voseo. Results show that the system identifies the voseo region with a high degree of accuracy (0.84 F1 and 0.88 AUC ROC, the area under the receiver operating characteristic curve). A sociolinguistic analysis of each dataset gave further insight into the accuracy of the classifier, the status of voseo, and the discourse function of voseo and other second-person singular (2PS) forms of address, particularly in the context of Colombian voseo. An examination of the lexical, syntactic, and grammatical properties of Colombian and Argentine voseo also offered more detailed information on properties not considered by the model. The natural language processing (NLP) methods presented here aim to pave the way for innovative, high-potential approaches in Spanish sociolinguistics research.