Schemes, techniques and combinations: Arabic preprocessing for Statistical Machine Translation

Habash, Nizar; Sadat, Fatiha

doi:10.1075/nlp.9.05hab

In: Challenges for Arabic Machine Translation
Edited by Abdelhadi Soudi, Ali Farghaly, Günter Neumann and Rabih Zbib
[Natural Language Processing 9] 2012
pp. 73–94

Nizar Habash | Center for Computational Learning Systems, Columbia University

Fatiha Sadat | Department of Computer Science, Université du Québec à Montréal

Published online: 1 August 2012

https://doi.org/10.1075/nlp.9.05hab

Arabic is a morphologically rich language, which poses problems for statistical machine translation (SMT) approaches. In this chapter, we study the effect of different Arabic word-level preprocessing schemes and techniques on the quality of phrase-based SMT. We also present and evaluate different methods for combining preprocessing schemes. Our results show that, given large training data sets, splitting off only proclitics performs best. For small training data sets, however, it is best to apply English-like tokenization using part-of-speech tags together with sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing scheme produces a significant increase in BLEU score when there is a change in genre between training and test data. We also found that combining different preprocessing schemes leads to improved translation quality.
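To make "splitting off proclitics" concrete, the following is a minimal sketch of a surface-level splitter over Buckwalter-style transliterated tokens. It is not the chapter's implementation: the clitic inventories, the `MIN_STEM_LEN` guard, and the function names `split_proclitics` and `preprocess` are all assumptions introduced here for illustration only.

```python
# Hypothetical proclitic inventories in Buckwalter-style transliteration.
# These lists are illustrative assumptions, not the chapter's scheme definitions.
CONJUNCTIONS = ("w", "f")          # conjunction proclitics
PARTICLES = ("b", "l", "k", "s")   # particle proclitics

MIN_STEM_LEN = 3  # assumed guard so that very short words are left intact


def split_proclitics(token: str) -> list[str]:
    """Peel at most one conjunction, then at most one particle, from the left."""
    pieces = []
    rest = token
    for inventory in (CONJUNCTIONS, PARTICLES):
        for clitic in inventory:
            # Split only if the remaining stem stays long enough.
            if rest.startswith(clitic) and len(rest) - len(clitic) >= MIN_STEM_LEN:
                pieces.append(clitic + "+")
                rest = rest[len(clitic):]
                break
    pieces.append(rest)
    return pieces


def preprocess(sentence: str) -> str:
    """Apply the splitter to every whitespace-separated token."""
    return " ".join(p for tok in sentence.split() for p in split_proclitics(tok))


print(split_proclitics("wbktAb"))  # ['w+', 'b+', 'ktAb']
print(split_proclitics("ktAb"))    # ['k+', 'tAb'] (an over-split)
```

The second call shows the limitation of prefix matching alone: a word that merely begins with a clitic-like letter is split incorrectly. This kind of ambiguity is one plausible reason why morphological analysis and disambiguation, as mentioned in the abstract, can matter when training data is limited.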

Cited by two other publications

Mallek, Fatma, Ngoc Tan Le & Fatiha Sadat

2018. Automatic Machine Translation for Arabic Tweets. In Intelligent Natural Language Processing: Trends and Applications [Studies in Computational Intelligence, 740], pp. 101 ff.

Mallek, Fatma, Billal Belainine & Fatiha Sadat

2017. Arabic Social Media Analysis and Translation. Procedia Computer Science 117, pp. 298 ff.

This list is based on CrossRef data as of 28 November 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.