Article published in: Romance Parsed Corpora
Edited by Christina Tortora, Beatrice Santorini and Frances Blanchette
[Linguistic Variation 18:1] 2018
pp. 100–119
Special issue articles
The challenges and benefits of annotating oral bilingual corpora
The Spanish in Texas Corpus Project
Published online: 13 July 2018
https://doi.org/10.1075/lv.00006.bul
Abstract
This article describes efforts to collect, process, and automatically annotate a corpus of Spanish as spoken in Texas. It details the protocols for developing the corpus and the procedures for automatic annotation, illustrating common pitfalls in language identification for bilingual corpora and potential methods for circumventing them. The benefits of a comparative corpus approach to contact varieties are illustrated by a case study of a putative verbal calque from the Spanish in Texas data. The relative frequency of the verb is shown to be much higher in Texas than in its source Mexican variety, and the verb selects different complements in Texas than it does in other varieties. The article concludes with a discussion of how computational tools might be fruitfully exploited to resolve long-standing debates about language variation in contact settings.
Article outline
- 1. Introduction
- 2. The Spanish in Texas Corpus
- 2.1 Protocols for developing the Corpus
- 2.2 Procedures for annotation
- 3. The benefits of a corpus approach to contact phenomena
- 4. Case Study: Is an innovation contact induced or internally motivated?
- 4.1 Detection of the potentially innovative uses of the verb
- 4.2 Possibilities and limitations of a computational approach to calques
- 5. Discussion and conclusion
- Notes
- References
References (54)
Adamou, Evangelia. 2016. A corpus-driven approach to language contact: Endangered languages in a comparative perspective. Walter de Gruyter GmbH & Co KG.
Balam, Osmer, Ana de Prada Pérez & Damaris Mayans. 2014. A congruence approach to the study of bilingual compound verbs in Northern Belize contact Spanish. Spanish in Context 11. 243–265.
Bullock, Barbara E. & A. Jacqueline Toribio. 2013. The Spanish in Texas Corpus project. Center for Open Education Resources and Language Learning (COERLL), the University of Texas at Austin. [URL].
Bybee, Joan L. 2007. Frequency of use and the organization of language. New York & Oxford: Oxford University Press.
Çetinoğlu, Özlem, Sarah Schulz & Ngoc Thang Vu. 2016. Challenges of computational processing of codeswitching. arXiv preprint arXiv:1610.02213.
Coetsem, Frans van. 1990. Review of Thomason and Kaufman (1988), Lehiste (1988), and Wardhaugh (1987). Language in Society 19. 260–268.
Deuchar, Margaret & Jonathan R. Stammers. 2012. What IS the “Nonce Borrowing Hypothesis” anyway? Bilingualism: Language and Cognition 15. 649–650.
Davies, Mark. 2002. Corpus del Español: 100 million words, 1200s-1900s. [URL]. (12 March 2014.)
Diab, Mona & Ankit Kamboj. 2011. Feasibility of leveraging crowd sourcing for the creation of a large scale annotated resource for Hindi English code switched data: a pilot annotation. 9th Workshop on Asian Language Resources, 36–40. Chiang Mai, Thailand.
Donnelly, Kevin & Margaret Deuchar. 2011. Using constraint grammar in the Bangor Autoglosser to disambiguate multilingual spoken text. In Constraint Grammar Applications: Proceedings of the NODALIDA 2011 Workshop, Riga, Latvia, 17–25.
Elfardy, Heba, Mohamed Al-Badrashiny & Mona Diab. 2013. Code switch point detection in Arabic. In Elisabeth Métais, Farid Meziane, Mohamad Saraee, Vijayan Sugumaran & Sunil Vadera (eds.), Natural Language Processing and Information Systems: Proceedings of the 18th International Conference on Applications of Natural Language to Information Systems (NLDB2013), Salford, UK, 412–416. Heidelberg: Springer.
González-Vilbazo, Kay & Luis López. 2011. Some properties of light verbs in code-switching. Lingua 121. 832–850.
Guzmán, Gualberto, Joseph Ricard, Jacqueline Serigos, Barbara E. Bullock & Almeida Jacqueline Toribio. 2017. Moving code-switching research towards more empirically grounded methods. CDH 2017 Corpora in the Digital Humanities, CEUR Workshop Proceedings, 1–9.
Guzmán, Gualberto, Jacqueline Serigos, Barbara E. Bullock & Almeida Jacqueline Toribio. 2016. Simple tools for exploring variation in code-switching for linguists. Proceedings of EMNLP (Empirical Methods in Natural Language Processing 2016), Second Workshop on Computational Approaches to Code-switching, 12–20. Association for Computational Linguistics.
Jarvis, Scott & Scott Crossley. 2012. Approaching language transfer through text classification: Explorations in the detection-based approach. Bristol, UK: Multilingual Matters.
Jarvis, Scott & Aneta Pavlenko. 2008. Crosslinguistic influence in language and cognition. New York & London: Routledge.
Jenkins, Devin. 2003. Bilingual verb constructions in southwestern Spanish. Bilingual Review 27. 195–204.
King, Ben & Steven Abney. 2013. Labeling the languages of words in mixed-language documents using weakly supervised methods. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1110–1119. Association for Computational Linguistics.
Koehn, Philipp. 2005. Europarl: A parallel corpus for statistical machine translation. Machine Translation Summit 2005, 79–86.
Li, Ying, Yue Yu & Pascale Fung. 2012. A Mandarin-English code-switching corpus. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, 2515–2519. European Language Resources Association.
LIPPS Group. 2000. The LIDES coding manual: A document for preparing and analyzing language interaction data. International Journal of Bilingualism 4. 131–270.
Lipski, John M. 1985. Linguistic aspects of Spanish-English language switching. Tempe: Arizona State University Center for Latin American Studies.
Mackey, William F. 1970. Interference, integration and the synchronic fallacy. In James E. Alatis (ed.), Bilingualism and Language Contact: Anthropological, Linguistic, Psychological, and Sociological Aspects. Monograph Series on Languages and Linguistics (Georgetown University Round Table on Languages and Linguistics), vol. 23, 195–227. Washington: Georgetown University School of Languages and Linguistics.
MacWhinney, Brian. 2007. The TalkBank Project. In Joan C. Beal, Karen P. Corrigan & Hermann L. Moisl (eds.), Creating and Digitizing Language Corpora: Synchronic Databases, vol. 1, 163–180. Houndmills, UK: Palgrave Macmillan.
Mougeon, Raymond, Terry Nadasdi & Katherine Rehner. 2005. Contact-induced linguistic innovations on the continuum of language use: The case of French in Ontario. Bilingualism: Language and Cognition 8. 99–115.
Muysken, Pieter. 2000. Bilingual speech: A typology of code-mixing. Cambridge, UK: Cambridge University Press.
Otheguy, Ricardo. 1995. When contact speakers talk, linguistic theory listens. In Ellen Contini-Morava & Barbara S. Goldberg (eds.), Meaning as explanation: Advances in linguistic sign theory (Trends in Linguistics, Studies and Monographs), vol. 84, 213–242. Berlin: Mouton de Gruyter.
Otheguy, Ricardo & Nancy Stern. 2011. On so-called Spanglish. International Journal of Bilingualism 15. 85–100.
Otheguy, Ricardo & Ana Celia Zentella. 2012. Spanish in New York: Language contact, dialectal leveling, and structural continuity. New York & Oxford: Oxford University Press.
Polinsky, Maria & Olga Kagan. 2007. Heritage languages: In the ‘wild’ and in the classroom. Language and Linguistics Compass 1. 368–395.
Poplack, Shana. 1980. Sometimes I’ll start a sentence in Spanish y termino en español: Toward a typology of code-switching. Linguistics 18. 581–618.
Poplack, Shana. 2012. What does the Nonce Borrowing Hypothesis hypothesize? Bilingualism: Language and Cognition 15. 644–648.
Putnam, Michael T. & Liliana Sánchez. 2013. What’s so incomplete about incomplete acquisition? A prolegomenon to modeling heritage language grammars. Linguistic Approaches to Bilingualism 3. 478–508.
R Development Core Team. 2009. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0. [URL].
Roggia, Aaron B. 2011. Unaccusativity and word order in Mexican Spanish: An examination of syntactic interfaces and the split intransitivity hierarchy. Ph.D. dissertation. State College, Pennsylvania: The Pennsylvania State University.
Schmid, Helmut. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of international conference on new methods in language processing, Manchester, UK, 44–49.
Sebba, Mark. 1998. A congruence approach to the syntax of codeswitching. International Journal of Bilingualism 2(1). 1–19.
Serigos, Jacqueline Larsen. 2013. The social stratification of loanwords: A computational and corpus-based approach to Anglicisms in Argentina. Austin, TX: University of Texas at Austin master’s report.
Solorio, Thamar & Yang Liu. 2008a. Learning to predict code-switching points. The Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, 973–981. Honolulu, HI: Association for Computational Linguistics.
Solorio, Thamar & Yang Liu. 2008b. Part-of-speech tagging for English-Spanish code-switched text. The Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, 1051–1060. Honolulu, HI: Association for Computational Linguistics.
Solorio, Thamar, Elizabeth Blair, Suraj Maharjan, Steven Bethard, Mona Diab, Mahmoud Ghoneim, Abdelati Hawwari, Fahad AlGhamdi, Julia Hirschberg, Alison Chang & Pascale Fung. 2014. Overview for the first shared task on language identification in code-switched data. First Workshop on Computational Approaches to Code Switching. Proceedings of the Workshop. EMNLP 2014, 62–72. Doha, Qatar: Association for Computational Linguistics.
Stammers, Jonathan & Margaret Deuchar. 2012. Testing the Nonce Borrowing Hypothesis: Counter-evidence from English-origin verbs in Welsh. Bilingualism: Language and Cognition 15. 630–643.
Thomason, Sarah & Terrence Kaufman. 1988. Language contact, creolization, and genetic linguistics. Berkeley, CA: University of California Press.
Torres Cacoullos, Rena & Catherine E. Travis. 2010. Testing convergence via code-switching: Priming and the structure of variable subject expression. International Journal of Bilingualism 14. 1–27.
Toribio, Almeida Jacqueline & Barbara E. Bullock. 2016. A new look at heritage Spanish and its speakers. In Diego Pascual y Cabo (ed.), Advances in Spanish as a Heritage Language, 27–50. John Benjamins.
Tortora, Christina, Beatrice Santorini, Frances Blanchette & C. E. A. Diertani. 2017. The Audio-Aligned and Parsed Corpus of Appalachian English (AAPCAppE). [URL].
Villa, Daniel J. 2005. Back to patrás: A process of grammaticalization in a contact variety of Spanish. In James Cohen, Kara T. McAlister, Kellie Rolstad & Jeff MacSwan (eds.) Proceedings of the 4th International Symposium on Bilingualism, 2310–2316. Somerville, MA: Cascadilla Press.
Cited by (2)
Alvero, AJ & Rebecca Pattichis
This list is based on CrossRef data as of 27 November 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.
