Chapter 2. Mining parallel corpora from Wikipedia

Kraif, Olivier

doi:10.1075/scl.121.02kra

In: Investigating Wikipedia: Linguistic corpus building, exploration and analysis
Edited by Céline Poudat, Harald Lüngen and Laura Herzberg
[Studies in Corpus Linguistics 121] 2024
pp. 45–74

Olivier Kraif | Université Grenoble Alpes, LIDILEM

Published online: 31 October 2024

Abstract

In this article, we examine Wikipedia as a multilingual resource for extracting parallel corpora that are useful for multilingual terminology extraction or machine translation. While most previous work in this field treats Wikipedia as a source of comparable corpora, we concentrate on the actual place of translation in Wikipedia's editorial process in order to assess whether parallel corpora, that is, texts whose source segments can be linked to their translations, can be extracted. After identifying the projects, tools and recommendations that allow contributors to enrich Wikipedia by exercising their skills as translators, we conduct an experiment in which we download pairs of articles containing translations. We show the importance of temporally aligning the versions to be downloaded before launching the actual sentence-level alignment. This strategy allows us to obtain a large volume of parallel texts with good-quality sentence-to-sentence alignment.
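
The abstract does not say how the temporal alignment of versions is carried out. As a purely illustrative sketch, one plausible reading is to pair a translated article with the source revision that was current when the translation was made. The function name `pick_source_revision`, the `(timestamp, revision_id)` data layout and the sample revision ids are all assumptions introduced here, not the chapter's procedure.

```python
from bisect import bisect_right
from datetime import datetime


def pick_source_revision(source_revisions, translation_time):
    """Return the id of the latest source revision saved at or before
    translation_time, or None if no such revision exists.

    source_revisions is a hypothetical list of (timestamp, revision_id)
    tuples, assumed to be sorted by ascending timestamp.
    """
    timestamps = [ts for ts, _ in source_revisions]
    # Index of the first revision strictly later than the translation.
    idx = bisect_right(timestamps, translation_time)
    if idx == 0:
        return None
    return source_revisions[idx - 1][1]


# Made-up revision history of a source article.
revisions = [
    (datetime(2020, 1, 5), 101),
    (datetime(2020, 6, 1), 102),
    (datetime(2021, 3, 9), 103),
]

# A translation created in July 2020 is paired with revision 102,
# not with the later revision 103.
chosen = pick_source_revision(revisions, datetime(2020, 7, 15))
```

Under this reading, sentence alignment would then be run on the selected source revision and the corresponding target revision, rather than on the current versions of both articles, which may have diverged since the translation.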

Keywords: translation, parallel corpora, Wikipedia as a corpus, sentence alignment

Article outline

1. Introduction
2. Wikipedia as a comparable corpus for NLP and contrastive studies
- 2.1 Aligning documents according to domain or content
- 2.2 Bilingual lexicon extraction
- 2.3 Aligning sentences and chunks using machine translation
- 2.4 Sentence alignment using a monotonic algorithm
- 2.5 Parallel sentence extraction
3. The translation process in Wikipedia
- 3.1 Translation projects
- 3.2 Translation guidelines
- 3.3 Review process
- 3.4 Translation tools
- 3.5 Content translation tool statistics
- 3.6 Translation into languages other than English
4. Experiments
- 4.1 Preliminary observations
- 4.2 Downloading potentially alignable items
- 4.3 First experiment: Sentence alignment of articles
- 4.4 Second experiment: Filtering using dotplot
- 4.5 Third experiment: Using Content Translation application markup
5. Conclusion and future perspectives
Notes
References
Appendix

