Lost in translation: Exploring Google Translate’s handling of Spanish forms of address in film dialogue
Document Type
Paper Presentation
Presenter Language
English
Research Area
Pragmatics and discourse analysis
Location
MBSC Dodge Room 302A
Start Date
17-10-2024 4:30 PM
End Date
17-10-2024 5:00 PM
Abstract
How does a machine know when it is appropriate to ask ¿Cómo estás tú? (How are you? Second person singular, informal) versus ¿Cómo está usted? (How are you? Third person singular, formal)? While machine translation has improved greatly, especially with the development of neural techniques (Choudhary et al., 2020), machines still struggle with pragmatics (Farwell & Helmreich, 2023). Spanish forms of address (FOA) are uniquely challenging because they are rationalized by extralinguistic conditions (Farwell & Helmreich, 2023). This study investigates Google Translate’s handling of the Spanish bipartite singular address system, using transcribed dialogue from Spanish and Mexican films in a round-trip translation. Specifically, data from ACTIV-ES (Francom et al., 2014) is translated into English and then back into Spanish to examine whether the same Spanish FOA are reproduced after being compressed into English’s unipartite address system. ACTIV-ES, a cross-dialect corpus of Spanish, provided dialogue via opensubtitles.org from 115 Mexican and 170 Spanish films (Francom et al., 2014). A Python script extracted a total of 4,180 tú and 3,911 usted pronouns, along with their corresponding verbs, from 307,836 sentences.
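The extraction step described above can be illustrated with a minimal sketch. It is not the study's script: the function name `count_pronouns` and the regular expression are assumptions made for illustration, and verb extraction (which would need a morphological tagger) is left out.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical pattern: matches the subject pronouns "tú" and "usted" as whole
# words. The accent keeps possessive "tu" out; the word boundary keeps plural
# "ustedes" out.
PRONOUN_RE = re.compile(r"\b(tú|usted)\b", re.IGNORECASE)

def count_pronouns(sentences):
    """Count tú / usted occurrences across an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        # Normalize so that "u" + combining accent compares equal to "ú".
        text = unicodedata.normalize("NFC", sentence)
        for match in PRONOUN_RE.finditer(text):
            counts[match.group(1).lower()] += 1
    return counts

sample = [
    "¿Cómo estás tú?",
    "¿Cómo está usted?",
    "Es tu libro.",             # possessive "tu": not counted
    "Ustedes saben la verdad.", # plural: not counted
]
print(count_pronouns(sample))
```

A whole-word match on the accented form is the simplest way to separate the pronoun from the possessive, though subtitles that omit accents would be missed under this assumption.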
In testing the two dependent variables, pronoun mismatch and verb inflection mismatch, this investigation offers several significant findings: (1) Google Translate struggles to reproduce ustedeante FOA in general, both pronominal and verbal, and relies heavily on tuteante forms instead, regardless of the context; further, ustedeante FOA translations lag behind tuteante forms, with 95% of all usted pronouns being misrepresented as tú or dropped; (2) pronouns preceding verbs in the subjunctive mood are less likely to be reproduced in translation than pronouns preceding verbs in the indicative mood; (3) verb tense is a significant predictor of pronoun and verbal inflection mismatch, with the present tense showing the highest rate of pronominal mismatch; (4) the film’s country of origin is a significant predictor of pronoun mismatch, with mismatch rates significantly higher in Mexican data than in Spanish data; (5) the decade in which the film was produced significantly influences the rate of pronoun mismatch, with older data producing more errors and newer data producing fewer; and (6) film genre is the most significant predictor of both dependent variables.
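As a rough illustration of how the pronoun-mismatch variable could be coded, the sketch below compares a source pronoun with the pronouns found in a back-translated sentence. The function name and the three outcome labels ("reproduced", "swapped", "dropped") are hypothetical, and the translation step itself is omitted; the study's actual coding scheme may differ.

```python
import re

# Hypothetical pattern for the two singular address pronouns.
PRONOUN_RE = re.compile(r"\b(tú|usted)\b", re.IGNORECASE)

def classify_round_trip(source_pronoun, back_translation):
    """Label the fate of a source pronoun after Spanish -> English -> Spanish."""
    found = {m.group(1).lower() for m in PRONOUN_RE.finditer(back_translation)}
    if source_pronoun.lower() in found:
        return "reproduced"  # same pronoun survives the round trip
    if found:
        return "swapped"     # the other pronoun appears instead
    return "dropped"         # no overt singular address pronoun remains

# Example back-translations are invented for illustration, not corpus data.
print(classify_round_trip("usted", "¿Cómo está usted?"))  # reproduced
print(classify_round_trip("usted", "¿Cómo estás tú?"))    # swapped
print(classify_round_trip("usted", "¿Cómo estás?"))       # dropped
```

Under this scheme, both "swapped" and "dropped" would count toward pronoun mismatch; verb inflection mismatch would require a separate comparison of verb forms.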
Keywords: Pragmatics, Forms of Address, Machine Translation, Ustedeante, Tuteante