Article published in: Revista Española de Lingüística Aplicada/Spanish Journal of Applied Linguistics
Vol. 36:2 (2023), pp. 585–614
Sentence splitting in Arabic to Spanish translation
Published online: 4 July 2023
https://doi.org/10.1075/resla.21008.rol
Abstract
Modern Standard Arabic makes extensive use of coordination particles whereas punctuation marks are scarce and
erratic, leading to long clauses. This is generally assumed to hinder Sentence Boundary Detection and to promote sentence
splitting when translating from Arabic into English. Previous literature on translation from Arabic to Spanish is practically
non-existent. We have tested this hypothesis regarding translation from Arabic to Spanish on a sample of 282,714 graphic words
extracted from a bilingual corpus of 8,681,110 graphic words and found that each Arabic sentence yielded an average of 1.5 Spanish
sentences. Furthermore, our data shows the potential impact of directionality in that sentence splitting when translating from
Arabic into Spanish is 50% more frequent than from English into Arabic. We also determined statistically that five elements
(wa [و], ḥaythu [حيث], kamā [كما],
wa-qad [وقد], and wa-dhalika [وذلك]) are the most salient potential markers for sentence splitting in the resulting Spanish
translations. Our findings should be particularly interesting for Computational Linguistics and translator training.
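The average of 1.5 Spanish sentences per Arabic sentence reported above is, in essence, a ratio of sentence counts over aligned source and target segments. A minimal sketch of that arithmetic follows. It assumes a naive punctuation-based boundary rule and uses invented placeholder strings. The function names `count_sentences` and `split_ratio` are hypothetical, and nothing here reproduces the authors' own segmentation procedure.

```python
import re
from typing import Iterable, Tuple

# Assumption: a sentence ends at '.', '!', '?' or the Arabic question mark.
# Real Arabic punctuation is erratic, so this rule is only illustrative.
_SENTENCE_END = re.compile(r"[.!?؟]+")


def count_sentences(text: str) -> int:
    """Count non-empty chunks separated by terminal punctuation."""
    return sum(1 for chunk in _SENTENCE_END.split(text) if chunk.strip())


def split_ratio(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average number of target sentences per source sentence."""
    source_total = 0
    target_total = 0
    for source, target in pairs:
        source_total += count_sentences(source)
        target_total += count_sentences(target)
    if source_total == 0:
        raise ValueError("no source sentences found")
    return target_total / source_total


# Toy placeholder data (not taken from the corpus): the first source
# segment is one long sentence rendered as two target sentences.
toy_pairs = [
    ("wa A wa B.", "A. B."),
    ("C.", "C."),
]

ratio = split_ratio(toy_pairs)
print(ratio)
```

On the toy input, two source sentences correspond to three target sentences, so the ratio is 1.5. Applied to real text, the result depends on how sentence boundaries are defined, which is the difficulty the abstract highlights for Arabic.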
Summary
Sentence splitting in translation from Arabic to Spanish
Modern Standard Arabic tends towards parataxis and uses punctuation marks sparingly and inconsistently.
As a consequence, sentences tend to be long. This is assumed to hinder the detection of sentence boundaries and to encourage
sentence splitting when translating into English. There is practically no previous research on this matter for translation from
Arabic to Spanish. Using a sample of 282,714 graphic words taken from a bilingual corpus of 8,681,110 graphic words,
we tested the tendency to split sentences when translating from Arabic to Spanish. We conclude that each Arabic sentence yielded
an average of 1.5 Spanish sentences, that this tendency is 50% greater than when translating from English into Arabic, and that
five elements (wa [و], ḥaythu [حيث], kamā [كما],
wa-qad [وقد], and wa-dhalika [وذلك]) stand out as potential markers of sentence splitting. The conclusions are of
particular interest for Computational Linguistics and translator training.
Article outline
- 1. Introduction
- 2. State of the art
- 2.1 Automatic segmentation and alignment in Arabic
- 2.2 Translation studies
- 2.3 Conclusions
- 3. Methodology
- 3.1 Corpus
- 3.2 Sample
- 3.3 Elements of distortion
- 3.4 Segmentation
- 4. Data
- 5. Discussion
- 5.1 Representativity, sentences, words, and average number of words per sentence
- 5.2 Splitting types
- 6. Conclusions and future work
