In: Language Acquisition Beyond Parameters: Studies in honour of Juana M. Liceras
Edited by Anahí Alba de la Fuente, Elena Valenzuela and Cristina Martínez Sanz
[Studies in Bilingualism 51] 2016
pp. 281–301
Applying computing innovations to bilingual corpus analysis
Diana Carter | University of British Columbia | Centre for Research on Bilingualism, Bangor University
Published online: 16 December 2016
https://doi.org/10.1075/sibil.51.11car
Abstract
With current innovations in corpus analysis, it is now possible to extract and analyze large amounts of monolingual and bilingual data in minutes, rather than spending the many hours previously needed to analyze a much smaller amount of data by hand. In this chapter, we review innovative techniques in bilingual corpus building and analysis, including automated glossing, which allows data to be extracted and then analyzed statistically using mixed-effects models. We discuss the application of these and other techniques and provide examples from three bilingual corpora. We end by suggesting how researchers may benefit from the increasingly powerful computing capability now available.
Article outline
- 1. Introduction
- 2. Triggered codeswitching
- 3. The Miami, Patagonia, and Siarad corpora
- 3.1 Participants
- 3.2 Transcription
- 4. Automatic glossing
- 5. Data preparation
- 6. Data analysis and results
- 6.1 Data analysis
- 6.2 Results
- 7. Tips and tricks for processing corpus data
- 8. Conclusions
- Acknowledgements
- Notes
- References
