Applying computing innovations to bilingual corpus analysis

Carter, Diana; Broersma, Mirjam; Donnelly, Kevin

doi:10.1075/sibil.51.11car

In: Language Acquisition Beyond Parameters: Studies in honour of Juana M. Liceras
Edited by Anahí Alba de la Fuente, Elena Valenzuela and Cristina Martínez Sanz
[Studies in Bilingualism 51] 2016
pp. 281–301

Diana Carter | University of British Columbia | Centre for Research on Bilingualism, Bangor University

Mirjam Broersma | Centre for Language Studies, Radboud University | Max Planck Institute for Psycholinguistics

Kevin Donnelly | Centre for Research on Bilingualism, Bangor University

Published online: 16 December 2016

https://doi.org/10.1075/sibil.51.11car

Abstract

With current innovations in corpus analysis, it is now possible to extract and analyze large amounts of monolingual and bilingual data in minutes, rather than spending the many hours previously needed to analyze a much smaller amount of data by hand. In this chapter, we review innovative techniques in bilingual corpus building and analysis, including automated glossing, which allows the extraction of data that can then be analyzed statistically using mixed-effects models. We discuss the application of these and other techniques and provide examples from three bilingual corpora. We end by suggesting how researchers may benefit from the increasingly powerful computing capability now available.

Keywords: codeswitching, bilingual corpus, autoglossing, automated clause-splitting

Article outline

1. Introduction
2. Triggered codeswitching
3. The Miami, Patagonia, and Siarad corpora
- 3.1 Participants
- 3.2 Transcription
4. Automatic glossing
5. Data preparation
6. Data analysis and results
- 6.1 Data analysis
- 6.2 Results
7. Tips and tricks for processing corpus data
8. Conclusions
Acknowledgements
Notes
References

Cited by two other publications

Broersma, Mirjam, Diana Carter, Kevin Donnelly & Agnieszka Konopka

2020. Triggered codeswitching: Lexical processing and conversational dynamics. Bilingualism: Language and Cognition 23:2, pp. 295 ff.

[no author supplied]

2018. Building and Using the Siarad Corpus [Studies in Corpus Linguistics, 81]

This list is based on CrossRef data as of 1 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.