In: Investigating Wikipedia: Linguistic corpus building, exploration and analysis
Edited by Céline Poudat, Harald Lüngen and Laura Herzberg
[Studies in Corpus Linguistics 121] 2024
pp. 45–74
Chapter 2. Mining parallel corpora from Wikipedia
Published online: 31 October 2024
https://doi.org/10.1075/scl.121.02kra
Abstract
In this article, we examine Wikipedia as a multilingual resource for extracting parallel corpora, which are useful for multilingual terminology extraction and machine translation. While most previous work in this field assumes that Wikipedia is suitable for mining comparable corpora, we concentrate on the actual place of translation in Wikipedia's editorial process in order to assess whether parallel corpora can be extracted, that is, texts whose source segments can be linked to their translations. After identifying the projects, tools and recommendations that allow contributors to enrich Wikipedia by exercising their skills as translators, we conduct an experiment in which we download pairs of articles containing translations. We show the importance of temporally aligning the versions to be downloaded before launching the actual sentence-level alignment. This strategy allows us to obtain a large volume of parallel texts with good-quality sentence-to-sentence alignment.
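As a rough illustration of what a temporal alignment of versions could look like, the sketch below picks the source-article revision that was current when a translation was created, so that sentence alignment compares the text the translator actually saw. This is not the chapter's implementation: the function names, the (revision id, timestamp) input shape and the ISO 8601 timestamp format are assumptions made for the example.

```python
from bisect import bisect_right
from datetime import datetime, timezone


def parse_ts(ts):
    """Parse an ISO 8601 UTC timestamp such as '2020-06-15T12:00:00Z' (assumed format)."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def pick_source_revision(source_revisions, translation_ts):
    """Return the id of the latest source revision made at or before translation_ts.

    source_revisions: list of (revision_id, timestamp_string) pairs (hypothetical shape).
    translation_ts: timestamp string of the first revision of the translated article.
    Returns None if the source article had no revision yet at that time.
    """
    # Sort by time so the revision history can be searched with bisection.
    ordered = sorted(source_revisions, key=lambda rev: parse_ts(rev[1]))
    times = [parse_ts(ts) for _, ts in ordered]
    # Index of the first revision strictly later than the translation date.
    idx = bisect_right(times, parse_ts(translation_ts))
    return ordered[idx - 1][0] if idx else None


# Hypothetical revision history of a source article.
revisions = [
    (101, "2019-01-01T00:00:00Z"),
    (102, "2020-06-15T12:00:00Z"),
    (103, "2023-03-01T08:30:00Z"),
]
chosen = pick_source_revision(revisions, "2021-02-10T09:00:00Z")
```

Aligning the translation against the revision chosen this way, rather than against the current version of the source article, avoids comparing it with text that was added or rewritten after the translation was made.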
Article outline
- 1. Introduction
- 2. Wikipedia as a comparable corpus for NLP and contrastive studies
- 2.1 Aligning documents according to domain or content
- 2.2 Bilingual lexicon extraction
- 2.3 Aligning sentences and chunks using machine translation
- 2.4 Sentence alignment using a monotonic algorithm
- 2.5 Parallel sentence extraction
- 3. The translation process in Wikipedia
- 3.1 Translation projects
- 3.2 Translation guidelines
- 3.3 Review process
- 3.4 Translation tools
- 3.5 Content translation tool statistics
- 3.6 Translation into languages other than English
- 4. Experiments
- 4.1 Preliminary observations
- 4.2 Downloading potentially alignable items
- 4.3 First experiment: Sentence alignment of articles
- 4.4 Second experiment: Filtering using dotplot
- 4.5 Third experiment: Using Content Translation application markup
- 5. Conclusion and future perspectives
