In: Parallel Corpora for Contrastive and Translation Studies: New resources and applications
Edited by Irene Doval and M. Teresa Sánchez Nieto
[Studies in Corpus Linguistics 90] 2019
pp. 267–279
Discovering bilingual collocations in parallel corpora
A first attempt at using distributional semantics
Published online: 20 March 2019
https://doi.org/10.1075/scl.90.16gon
This chapter presents a method that exploits parallel corpora to automatically extract bilingual collocation equivalents. First, we use dependency parsing and statistical measures to identify collocation candidates in corpora. Then, we leverage the parallel corpora to extract bilingual word-embeddings. Finally, we use these distributional models as probabilistic dictionaries in order to identify bilingual collocation equivalents. To evaluate our strategy, we carry out a set of experiments in Portuguese and Spanish focusing on verb-object collocations, for example, “reach the maturity” (“atingir a maturidade” in Portuguese, “alcanzar la madurez” in Spanish). The results of our experiments show that this method can automatically identify thousands of bilingual collocation equivalents with a precision of 86%.
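The final alignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors, the `pt:`/`es:` key prefixes, the `align` function, the 0.8 threshold, and the choice to score a pair by the weaker of its verb and noun similarities are all assumptions made for illustration. It only shows the general idea of treating a shared bilingual embedding space as a soft dictionary for matching verb-object collocations.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy vectors standing in for a bilingual embedding model
# in which Portuguese and Spanish words share one vector space.
EMB = {
    "pt:atingir": [0.9, 0.1, 0.0],
    "pt:maturidade": [0.1, 0.9, 0.1],
    "es:alcanzar": [0.88, 0.12, 0.02],
    "es:madurez": [0.12, 0.88, 0.1],
    "es:tomar": [0.2, 0.1, 0.9],
    "es:decisión": [0.0, 0.5, 0.8],
}

def align(src_collocations, tgt_collocations, emb, threshold=0.8):
    """Pair each source (verb, noun) collocation with its closest target one.

    A pair is scored by the weaker of its verb and noun similarities
    (an assumption of this sketch), so both members must be close.
    Source collocations with no target at or above the threshold are skipped.
    """
    pairs = []
    for s_verb, s_noun in src_collocations:
        best, best_score = None, threshold
        for t_verb, t_noun in tgt_collocations:
            verb_sim = cosine(emb["pt:" + s_verb], emb["es:" + t_verb])
            noun_sim = cosine(emb["pt:" + s_noun], emb["es:" + t_noun])
            score = min(verb_sim, noun_sim)
            if score >= best_score:
                best, best_score = (t_verb, t_noun), score
        if best is not None:
            pairs.append(((s_verb, s_noun), best, best_score))
    return pairs

result = align(
    [("atingir", "maturidade")],
    [("alcanzar", "madurez"), ("tomar", "decisión")],
    EMB,
)
```

With these toy vectors, “atingir a maturidade” is paired with “alcanzar la madurez”, while the unrelated Spanish candidate falls below the threshold. In a real setting the monolingual candidates would come from dependency parsing and association measures, and the vectors from a model trained on the parallel corpus.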
Keywords: collocations, distributional semantics, phraseology, parallel corpora
Article outline
- 1. Introduction
- 2. Previous research on bilingual collocation extraction
- 3. The proposed strategy
- 3.1 Extracting monolingual collocation candidates
- 3.2 Bilingual distributional semantics model
- 3.3 Bilingual alignment of monolingual collocations
- 4. Evaluation
- 4.1 Data
- 4.2 Monolingual extraction and bilingual alignment
- 4.3 Results
- 4.4 Error analysis
- 5. Conclusions
- Acknowledgements
- Note
- References
