Lost in translation: Exploring Google Translate’s handling of Spanish forms of address in film dialogue

Document Type

Paper Presentation

Presenter Language

English

Research Area

Pragmatics and discourse analysis

Location

MBSC Dodge Room 302A

Start Date

17-10-2024 4:30 PM

End Date

17-10-2024 5:00 PM

Abstract

How does a machine know when it is appropriate to ask ¿Cómo estás tú? (How are you? Second person singular, informal) versus ¿Cómo está usted? (How are you? Third person singular, formal)? While machine translation has improved greatly, especially with the development of neural techniques (Choudhary et al., 2020), machines still struggle with pragmatics (Farwell & Helmreich, 2023). Spanish forms of address (FOA) are uniquely challenging because they are rationalized by extralinguistic conditions (Farwell & Helmreich, 2023). This study investigates Google Translate’s handling of a Spanish bipartite singular address system through a round-trip translation of Spanish and Mexican movie dialogue transcriptions. Specifically, data from ACTIV-ES (Francom et al., 2014) is translated into English and then back into Spanish to examine whether the same Spanish FOA are reproduced after being compressed into English’s unipartite address system. ACTIV-ES, a cross-dialect corpus of Spanish, provided dialogue via opensubtitles.org from 115 Mexican and 170 Spanish films (Francom et al., 2014). A Python script extracted a total of 4,180 tú and 3,911 usted pronouns, along with their corresponding verbs, from 307,836 sentences.
In testing the dependent variables (pronoun mismatch and verb inflection mismatch), this investigation offers several significant findings: (1) Google Translate struggles to reproduce ustedeante FOA in general, both pronominal and verbal, and relies heavily on tuteante forms instead, regardless of the context; further, ustedeante FOA translations lag behind tuteante forms, with 95% of all usted pronouns being misrepresented as tú or dropped; (2) pronouns preceding verbs in the subjunctive mood are less likely to be reproduced in translation than pronouns preceding verbs in the indicative mood; (3) verb tense is a significant predictor of pronoun and verbal inflection mismatch, with the present tense suffering the highest rate of pronominal mismatch; (4) the film’s country of origin is a significant predictor of pronoun mismatch, with mismatch rates significantly higher in Mexican data than in Spanish data; (5) the decade in which the film was produced significantly influences the rate of pronoun mismatch, with older data producing more errors and newer data producing fewer; and (6) film genre is the most highly significant predictor of both dependent variables.

Keywords: Pragmatics, Forms of Address, Machine Translation, Ustedeante, Tuteante
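
As an illustration only, the pronoun-mismatch check described in the abstract might be sketched as follows. This is not the authors' script: the function names, the regular expressions, and the example sentence pairs are all assumptions made for this sketch, and the round-trip output is supplied by hand rather than obtained from Google Translate. It covers only overt pronouns, not verb inflection.

```python
import re
from typing import Iterable, Optional, Tuple

# Hypothetical patterns: match the overt subject pronouns only.
# The accent matters: "tú" (pronoun) is distinct from "tu" (possessive).
USTED = re.compile(r"\busted\b", re.IGNORECASE)
TU = re.compile(r"\btú\b", re.IGNORECASE)


def address_form(sentence: str) -> Optional[str]:
    """Return 'usted', 'tú', 'mixed', or None if no overt pronoun is found."""
    has_usted = bool(USTED.search(sentence))
    has_tu = bool(TU.search(sentence))
    if has_usted and has_tu:
        return "mixed"
    if has_usted:
        return "usted"
    if has_tu:
        return "tú"
    return None


def pronoun_mismatch(original: str, round_trip: str) -> bool:
    """True when the source pronoun is replaced or dropped after the round trip."""
    source_form = address_form(original)
    return source_form is not None and address_form(round_trip) != source_form


def mismatch_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Share of source sentences with an overt pronoun that show a mismatch."""
    scored = [
        pronoun_mismatch(original, round_trip)
        for original, round_trip in pairs
        if address_form(original) is not None
    ]
    return sum(scored) / len(scored) if scored else 0.0


# Invented (original, round-trip) pairs, for illustration only.
example_pairs = [
    ("¿Cómo está usted?", "¿Cómo estás?"),            # pronoun dropped
    ("¿Usted sabe la verdad?", "¿Tú sabes la verdad?"),  # usted replaced by tú
    ("¿Tú vienes mañana?", "¿Tú vienes mañana?"),      # reproduced
    ("Usted tiene razón.", "Usted tiene razón."),      # reproduced
]

print(mismatch_rate(example_pairs))
```

A regex-based check like this treats a dropped pronoun and a swapped pronoun as the same kind of mismatch; separating the two, or detecting verb inflection mismatch, would require morphological analysis that this sketch does not attempt.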
