A Sociolinguistic Analysis of a Deep Learning Based Classification model of South American Voseo in X Posts

Presenter Information

Falcon Restrepo-Ramos

Author ORCID Identifier

0000-0002-0839-7001

Document Type

Paper Presentation

Presenter Language

English

Research Area

Language Variation in Digital Spaces

Location

MBSC Gallery Room 308

Start Date

19-10-2024 9:30 AM

End Date

19-10-2024 10:00 AM

Abstract

Here, I present the implementation of a dialectal classification system that uses voseo in X (formerly Twitter) posts to identify speakers of Colombian (Paisa and Caleño) and Argentine (Buenos Aires and La Plata) Spanish. Two datasets of over 18,000 posts were collected from recent X posts according to the geolocation of each post. The data was used to train and evaluate a transformer-based machine learning classifier of South American voseo. Results show that the system identifies the voseo region with a high degree of accuracy (0.84 F1 and 0.88 AUC ROC, the area under the receiver operating characteristic curve). A sociolinguistic analysis of each dataset gave further insight into the accuracy of the classifier, the status of voseo, and the discourse function of voseo and other second-person singular (2PS) forms of address, particularly in the context of Colombian voseo. An examination of the lexical, syntactic, and grammatical properties of Colombian and Argentine voseo also offered more detailed information on properties not considered by the model. The natural language processing (NLP) methods presented here aim to pave the way for innovative approaches with high potential in Spanish sociolinguistics research.
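
For readers unfamiliar with the two metrics reported in the abstract, the following is a minimal sketch of how F1 and AUC ROC can be computed for a two-class task. It is not the author's evaluation code. The abstract does not state the label encoding, the decision threshold, or any averaging scheme, so the binary setup, the 0.5 threshold, and the toy data below are all assumptions made for illustration only.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """AUC ROC as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy data. Hypothetical encoding: 1 = Argentine voseo, 0 = Colombian voseo.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]            # classifier scores for class 1
y_pred = [1 if s >= 0.5 else 0 for s in scores]    # assumed 0.5 decision threshold

print(round(f1_score(y_true, y_pred), 3))  # 0.667
print(round(roc_auc(y_true, scores), 3))   # 0.889
```

F1 depends on the chosen threshold, whereas AUC ROC summarizes ranking quality across all thresholds, which is why the two figures are usually reported together.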
