Examining the representativeness of Spanish collocations with corpus and experimental data
Document Type
Paper Presentation
Presenter Language
English
Research Area
variation and change
Location
MBSC Dodge Room 302A
Start Date
19-10-2024 10:00 AM
End Date
19-10-2024 10:30 AM
Abstract
To document collocations (common strings of words), researchers often reference corpora (e.g., Sonbul et al., 2023), which are assumed to reflect the language usage of native speakers. However, few researchers have asked whether the top collocations in corpora accurately represent the top collocations in native speakers’ minds.
To answer this question, 120 native Spanish speakers from Mexico produced collocations in an experiment, and their responses were then compared with collocations in a Spanish corpus. The experiment consisted of 36 incomplete phrases: 12 lacked an adjective (un hombre __________), 12 lacked a verb (__________ un trabajo), and 12 lacked a noun (tomar un/una __________). The participants completed the phrases in Qualtrics, and the top three collocations per item were calculated.
I then searched for these phrases in the Web/Dialects corpus (Davies, 2016) to document the top three collocations per item. For a fair comparison, only corpus data from Mexico were included. For thoroughness, the corpus collocations were ranked in two different ways: by raw frequency (RF) and by mutual information (MI) score (a measure of word association strength). The experimental group’s top collocations were then scored based on whether they matched the corpus collocations (e.g., 3 points for producing the top corpus collocation, 2 points for the second corpus collocation, etc.).
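The scoring scheme described above can be illustrated with a minimal sketch. It is not the study's actual analysis code: the function names are hypothetical, the 1-point value for the third corpus collocation is an assumption read from the "etc.", and the MI function uses the common pointwise formulation, which may differ from the exact score reported by the corpus interface.

```python
from math import log2


def match_score(experimental_top, corpus_top):
    """Score one item against the corpus ranking.

    Assumed scheme: 3 points for matching the top corpus collocation,
    2 for the second, 1 for the third, 0 for anything else.
    """
    # Map each corpus collocate to its point value by rank (3, 2, 1).
    points = {word: len(corpus_top) - rank for rank, word in enumerate(corpus_top)}
    return sum(points.get(word, 0) for word in experimental_top)


def mutual_information(pair_count, word1_count, word2_count, corpus_size):
    """Pointwise mutual information (log base 2) for a word pair.

    Assumed formulation: log2 of observed co-occurrence over the
    co-occurrence expected if the two words were independent.
    """
    return log2((pair_count * corpus_size) / (word1_count * word2_count))


# Item from the abstract: no overlap between the two top-three lists.
corpus = ["casado", "bueno", "joven"]
experimental = ["alto", "fuerte", "guapo"]
print(match_score(experimental, corpus))  # 0

# Hypothetical responses, for illustration only.
print(match_score(["bueno", "alto", "casado"], corpus))  # 5
```

Under this scheme an item's score ranges from 0 (no overlap) to 6 (all three corpus collocations produced), and the same function can be run once with the RF ranking and once with the MI ranking as `corpus_top`.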
A t-test revealed that the experimental group’s responses more closely matched the corpus collocations when using RF rather than MI (p = 0.03). Nevertheless, most of the participants’ responses (66%/78%) did not match the corpus collocations with either ranking method, and there was often no overlap at all between experimental and corpus data. For example, the top corpus collocations for un hombre __________ were un hombre casado/bueno/joven, but the experimental group produced un hombre alto/fuerte/guapo. In fact, not a single participant produced un hombre casado, the top corpus collocation.
The results suggest that corpora might not always reflect the strongest collocations in native speakers’ minds. This might be especially relevant in L2 research because corpora are often used as a baseline to measure L2 learning. For a full picture of collocations, a combination of both corpus and experimental data is likely necessary.