In: Corpus Use in Cross-linguistic Research: Paving the way for teaching, translation and professional communication
Edited by Marlén Izquierdo and Zuriñe Sanz-Villar
[Studies in Corpus Linguistics 113] 2023
pp. 195–215
Chapter 11. Word alignment in the Russian-Chinese parallel corpus
Published online: 2 November 2023
https://doi.org/10.1075/scl.113.11pol
The Russian-Chinese parallel corpus (RuZhCorp) was created in 2016 by sinologists and computational linguists. So
far, it has accumulated 1,074 texts and over 4.6 million words, aligned at the sentence level. To produce word alignment for the
entire corpus, we used deep neural networks trained both on the whole of RuZhCorp and on a gold dataset manually aligned at the word level.
Building on principles presented in previous publications, we compiled the first word-to-word alignment guideline for the Russian-Chinese
language pair, which makes the manual alignment process less ambiguous and more consistent. Joint fine-tuning of the LaBSE deep
learning model on RuZhCorp and the gold dataset achieved the best alignment error rate (AER), 18.9%.
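For readers unfamiliar with the metric, AER is commonly computed from the predicted links A, the gold "sure" links S and the gold "possible" links P (with S contained in P) as AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The sketch below illustrates that standard definition only. Whether the chapter's gold dataset distinguishes sure from possible links is not stated here, so the function name, the link representation as index pairs, and the toy data are all assumptions made for illustration.

```python
def alignment_error_rate(predicted, sure, possible=None):
    """Standard AER: 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), with S ⊆ P.

    Each link is a (source_token_index, target_token_index) pair.
    This is an illustrative sketch, not the chapter's evaluation code.
    """
    a = set(predicted)
    s = set(sure)
    # Possible links always include the sure links; with no P given, P = S.
    p = set(possible or ()) | s
    if not a and not s:
        return 0.0  # nothing predicted, nothing expected: no error
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


# Hypothetical toy sentence pair (not taken from RuZhCorp).
gold_sure = {(0, 0), (1, 2), (2, 1)}
gold_possible = {(3, 3)}
predicted = {(0, 0), (1, 2), (3, 3), (2, 4)}

aer = alignment_error_rate(predicted, gold_sure, gold_possible)
print(f"AER = {aer:.3f}")
```

In the toy example, two predicted links are sure and a third is possible, so the score is 1 − (2 + 3) / (4 + 3) = 2/7. A lower AER means better agreement with the gold alignment.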
Article outline
- 1. Introduction
- 2. Corpus
- 2.1 Building the gold dataset
- 2.1.1 Types of alignment
- 2.1.2 Alignment tool
- 2.1.3 The alignment process
- 2.2 Alignment manifesto
- 2.3 Alignment rules
- 2.3.1 Punctuation
- 2.3.2 Pronouns and classifiers
- 2.3.3 Chinese particles and verb complements
- 2.3.4 Prepositions
- 2.3.5 Chinese verbs “to be” and “to have”
- 2.3.6 Alignment of speech figures
- 3. Evaluation
- 4. Conclusion
- Acknowledgements
- Abbreviations
- Notes
- References
