Corpora are an important resource for both teaching and research. Arabic lacks sufficient resources in this field, so a research project has been designed to compile a corpus that represents the current state of the Arabic language and meets the needs of end-users. This report presents the results of a survey of the needs of teachers of Arabic as a foreign language (TAFL) and language engineers. The survey shows that a wide range of text types should be included in the corpus. Overall, it confirms our view that existing corpora are too narrowly limited in source type and genre, and that there is a need for a freely accessible corpus of contemporary Arabic covering a broad range of text types. We have collected and published an initial version of the Corpus of Contemporary Arabic (CCA) to meet these design requirements. The CCA can be downloaded free of charge from http://www.comp.leeds.ac.uk/arabic.