Sentence splitting in Arabic to Spanish translation

Roldán, Juan; Feria García, Manuel

doi:10.1075/resla.21008.rol

Article published in: Revista Española de Lingüística Aplicada/Spanish Journal of Applied Linguistics
Vol. 36:2 (2023), pp. 585–614

Juan Roldán | University of Granada

Manuel Feria García | University of Granada

Published online: 4 July 2023

Abstract

Modern Standard Arabic makes extensive use of coordination particles, whereas punctuation marks are scarce and erratic, leading to long clauses. This is generally assumed to hinder Sentence Boundary Detection and to promote sentence splitting when translating from Arabic into English. Previous literature on translation from Arabic into Spanish is practically nonexistent. We tested this hypothesis for translation from Arabic into Spanish on a sample of 282,714 graphic words extracted from a bilingual corpus of 8,681,110 graphic words and found that each Arabic sentence yielded an average of 1.5 Spanish sentences. Furthermore, our data show the potential impact of directionality, in that sentence splitting when translating from Arabic into Spanish is 50% more frequent than from English into Arabic. We also determined statistically that five elements (wa [و], ḥaythu [حيث], kamā [كما], wa-qad [وقد], and wa-dhalika [وذلك]) are the most salient potential markers for sentence splitting in the resulting Spanish translations. Our findings should be of particular interest for Computational Linguistics and translator training.

Keywords: Arabic to Spanish translation, sentence splitting, sentence boundary detection
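The figures reported in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names, the toy aligned pairs, and the prefix-matching heuristic for markers are all assumptions made for illustration. In particular, matching a token by its first letters is crude for Arabic, since any word whose first root letter is و would be counted as the particle wa.

```python
from collections import Counter

# Markers named in the abstract, longest first so that "وذلك" and "وقد"
# are not counted as the bare particle "و".
MARKERS = ["وذلك", "وقد", "كما", "حيث", "و"]


def sample_fraction(sample_words, corpus_words):
    """Share of the corpus covered by the sample."""
    return sample_words / corpus_words


def splitting_ratio(alignments):
    """Average number of target sentences per source sentence."""
    if not alignments:
        raise ValueError("alignments must not be empty")
    total_targets = sum(len(targets) for _, targets in alignments)
    return total_targets / len(alignments)


def marker_counts(alignments):
    """Count marker-initial tokens in source sentences that were split.

    Hypothetical heuristic: a token counts for the first (longest) marker
    it starts with. This over-counts words that merely begin with "و".
    """
    counts = Counter()
    for source, targets in alignments:
        if len(targets) < 2:
            continue  # not split in translation
        for token in source.split():
            for marker in MARKERS:
                if token.startswith(marker):
                    counts[marker] += 1
                    break
    return counts


# Toy data (invented): each pair is (Arabic sentence, list of Spanish sentences).
pairs = [
    ("ذهب الولد إلى المدرسة وقد تأخر",
     ["El niño fue a la escuela.", "Había llegado tarde."]),
    ("وصل القطار", ["Llegó el tren."]),
]

print(f"sample share of corpus: {sample_fraction(282_714, 8_681_110):.2%}")
print(f"target sentences per source sentence: {splitting_ratio(pairs)}")
print(f"markers in split sentences: {dict(marker_counts(pairs))}")
```

A real study would need sentence-aligned data and proper clitic segmentation rather than whitespace tokens; the sketch only shows how the reported ratio and marker tallies relate to aligned sentence pairs.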

Resumen

La división de oraciones en la traducción del árabe al español

El árabe moderno estándar tiende a la parataxis y emplea los signos de puntuación escasa e incoherentemente. Como consecuencia, las oraciones suelen ser largas. Se asume que esto dificulta la detección del límite entre oraciones y fomenta la división de oraciones al traducir al inglés. Prácticamente no existe investigación previa al respecto sobre la traducción del árabe al español. Con una muestra de 282 714 palabras gráficas tomadas de un corpus bilingüe de 8 681 110 palabras gráficas testamos la tendencia a dividir oraciones al traducir del árabe al español. Concluimos que cada oración árabe generó un promedio de 1,5 oraciones en español, que esa tendencia es un 50% mayor que al traducir del inglés al árabe y que cinco elementos (wa [و], ḥaythu [حيث], kamā [كما], wa-qad [وقد] y wa-dhalika [وذلك]) destacan como potenciales marcas de división de oraciones. Las conclusiones son de interés, en particular, para la Lingüística Computacional y la formación de traductores.

Palabras clave: Traducción del árabe al español, división de oraciones, detección del límite entre oraciones

Article outline

1. Introduction
2. State of the art
- 2.1 Automatic segmentation and alignment in Arabic
- 2.2 Translation studies
- 2.3 Conclusions
3. Methodology
- 3.1 Corpus
- 3.2 Sample
- 3.3 Elements of distortion
- 3.4 Segmentation
4. Data
5. Discussion
- 5.1 Representativity, sentences, words, and average number of words per sentence
- 5.2 Splitting types
6. Conclusions and future work
References

Cited by (1)

Feria, Manuel & Juan Roldán. 2025. Sentence splitting in Arabic to English and Spanish translation: a statistical based, training-oriented study. The Interpreter and Translator Trainer 19:1, pp. 70 ff.