Modelling crosslinguistic n‑gram correspondence in typologically different languages

Milička, Jiří; Cvrček, Václav; Lukešová, Lucie

doi:10.1075/lic.19018.mil

Article published in: Languages in Contrast
Vol. 21:2 (2021), pp. 217–249

Jiří Milička | Charles University, Czech Republic

Václav Cvrček | Charles University, Czech Republic

Lucie Lukešová | Charles University, Czech Republic

Published online: 12 January 2021

https://doi.org/10.1075/lic.19018.mil

Abstract

N‑gram analysis (popularized e.g. by Biber et al.) has become a popular method for the identification of recurrent language patterns. Although the extraction of n‑grams from a corpus may seem straightforward, it proves to be very challenging when applied cross-linguistically. The major issue is that the quantities of n‑grams of a certain length in typologically different languages do not correspond. Consequently, n‑grams of a given length may function differently across languages, rendering a direct comparison inadequate. Our paper introduces a function capable of modelling the relation between the quantities of n‑grams in typologically distant languages, using the example of Czech and English (and some other language pairs). Based on our model, we can suggest what n‑gram lengths should be contrasted to better reflect the size of n‑gram inventories in each language. The correspondence may not be intuitive (e.g. a Czech 2-gram may best correspond to an English 2.5-gram), but it still provides researchers with a general guide as to what might be useful to include in their analysis (e.g. in this case 2-grams in Czech and 2- and 3-grams in English).

Keywords: n‑grams, parallel corpus, correspondence, Czech/English/Spanish
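
The idea of matching n‑gram lengths by inventory size can be illustrated with a minimal Python sketch. It is not the function introduced in the paper, which this page does not reproduce. It assumes pre-tokenized input, counts n‑gram types without any frequency threshold, and uses plain log-linear interpolation between adjacent integer lengths as a stand-in model. The names `ngram_types`, `type_counts` and `corresponding_length` are hypothetical.

```python
import math


def ngram_types(tokens, n):
    """Return the number of distinct n-grams of length n in a token list."""
    return len({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})


def type_counts(tokens, max_n):
    """Map each length 1..max_n to its number of n-gram types."""
    return {n: ngram_types(tokens, n) for n in range(1, max_n + 1)}


def corresponding_length(target, counts):
    """Estimate the fractional n at which `counts` would reach `target` types.

    Assumption: type counts grow roughly exponentially between adjacent
    integer lengths, so we interpolate linearly on the log scale.
    Returns None if `target` lies outside the observed range.
    """
    lengths = sorted(counts)
    for lo, hi in zip(lengths, lengths[1:]):
        a, b = counts[lo], counts[hi]
        if a < b and a <= target <= b:
            return lo + (math.log(target) - math.log(a)) / (math.log(b) - math.log(a))
    return None
```

Given token lists for two aligned texts, say `cs_tokens` and `en_tokens` (hypothetical names), `corresponding_length(type_counts(cs_tokens, 4)[2], type_counts(en_tokens, 4))` would return the fractional English length whose inventory size matches that of the Czech 2-grams under this simplified interpolation. A non-integer result such as 2.5 would suggest including both 2-grams and 3-grams on the English side.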

Article outline

1. Introduction
- 1.1 N‑grams in corpus linguistics
- 1.2 Major issues in cross-linguistic n‑gram correspondence
  - 1.2.1 N‑gram length
  - 1.2.2 The number of n‑gram types
  - 1.2.3 The frequency threshold
- 1.3 Our research questions
2. Data
- 2.1 Corpus material
- 2.2 N‑gram extraction
  - 2.2.1 Word order and syntactic boundaries
  - 2.2.2 N‑gram settings for this study
3. Searching for a model
- 3.1 From a basic formula to an adequate model
- 3.2 Fitting the model
4. Results
- 4.1 Czech-English texts
- 4.2 Czech-Spanish texts
- 4.3 Parameters for other language pairs
5. Conclusion
Acknowledgements
Notes
References

References (23)

Baker, M. 2004. A Corpus-Based View of Similarity and Difference in Translation. International Journal of Corpus Linguistics 9(2): 167–193.

Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. 1999. Longman Grammar of Spoken and Written English. Harlow: Longman.

Biber, D., Kim, Y. and Tracy-Ventura, N. 2010. A Corpus-Driven Approach to Comparative Phraseology: Lexical Bundles in English, Spanish, and Korean. In Japanese/Korean Linguistics, Volume 171, S. Iwasaki, H. Hoji, P. M. Clancy and S.-O. Sohn (eds), 75–94. Stanford: Center for the Study of Language and Information (CSLI).

Cheng, W., Greaves, C. and Warren, M. 2006. From N‑gram to Skipgram to Concgram. International Journal of Corpus Linguistics 11(4): 411–433.

Cortes, V. 2008. A Comparative Analysis of Lexical Bundles in Academic History Writing in English and Spanish. Corpora 3(1): 43–57.

Cvrček, V. 2019. Calc: Corpus Calculator. Prague: Czech National Corpus. Available at [URL]

Čermák, F. and Rosen, A. 2012. The Case of InterCorp, a Multilingual Parallel Corpus. International Journal of Corpus Linguistics 13(3): 411–427.

Čermáková, A. and Chlumská, L. 2017. Expressing Place in Children’s Literature: Testing the Limits of the N‑gram Method in Contrastive Linguistics. In Cross-Linguistic Correspondences: From Lexis to Genre, T. Egan and H. Dirdal (eds), 75–95. Amsterdam: John Benjamins.

Ebeling, J. and Ebeling, S. Oksefjell. 2013. Patterns in Contrast. Studies in Corpus Linguistics 58. Amsterdam: John Benjamins.

Ebeling, J. and Ebeling, S. Oksefjell. 2017. A Cross-Linguistic Comparison of Recurrent Word Combinations in a Comparable Corpus of English and Norwegian Fiction. In Contrasting English and other Languages through Corpora, M. Janebová, E. Lapshinova-Koltunski and M. Martínková (eds), 2–31. Newcastle upon Tyne: Cambridge Scholars Publishing.

Forchini, P. and Murphy, A. C. 2008. N‑grams in Comparable Specialized Corpora: Perspectives on Phraseology, Translation, and Pedagogy. International Journal of Corpus Linguistics 13(3): 351–367.

Granger, S. 2014. A Lexical Bundle Approach to Comparing Languages: Stems in English and French. Languages in Contrast 14(1): 58–72.

Granger, S. and Lefer, M.-A. 2013. Enriching the Phraseological Coverage of High-Frequency Adverbs in English–French Bilingual Dictionaries. In Advances in Corpus-Based Contrastive Linguistics: Studies in Honour of Stig Johansson, K. Aijmer and B. Altenberg (eds), 157–176. Amsterdam: John Benjamins.

Hasselgård, H. 2017. Temporal Expression in English and Norwegian. In Contrasting English and other Languages through Corpora, M. Janebová, E. Lapshinova-Koltunski and M. Martínková (eds), 75–101. Newcastle upon Tyne: Cambridge Scholars Publishing.

Kim, Y. 2009. Korean Lexical Bundles in Conversations and Academic Texts. Corpora 4(2): 135–165.

Mahlberg, M. 2012. Corpus Stylistics and Dickens’s Fiction. London: Routledge.

Milička, J. 2013. Rank-Frequency Relation & Type-Token Relation: Two Sides of the Same Coin. In Methods and Applications of Quantitative Linguistics, M. Obradovič, E. Kelih, R. Köhler (eds), 163–172. Belgrade: University of Belgrade and Academic Mind.

Nebeský, L. and Novák, P. 1996. Větné faktory a jejich podíl na analýze věty. Slovo a Slovesnost 57(4): 282–295.

Rapoport, A. 1982. Zipf’s Law Re-Visited. Quantitative Linguistics 16(1): 1–28.

Rosen, A., Vavřín, M. and Zasina, A. J. 2018. The InterCorp Corpus, Version 11 of 11 October 2018. Praha: Institute of the Czech National Corpus. FF UK. Available at [URL]

Sinclair, J. 2004. The Search for Units of Meaning. In Trust the Text: Language, Corpus and Discourse, R. Carter (ed.), 24–48. London: Routledge.

Tracy-Ventura, N., Cortes, V. and Biber, D. 2007. Lexical Bundles in Spanish Speech and Writing. In Working with Spanish Corpora, G. Parodi (ed.), 217–231. London: Continuum.

Zipf, G. K. 1949. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Cambridge: Addison-Wesley Press.

Cited by (1)

Wang, Guanfang, Xianshan Chen, Geng Tian, Jiasheng Yang & Huiling Chen. 2022. A Novel N-Gram-Based Image Classification Model and Its Applications in Diagnosing Thyroid Nodule and Retinal OCT Images. Computational and Mathematical Methods in Medicine 2022, pp. 1 ff.