Readability assessment for text simplification: From analysing documents to identifying sentential simplifications

Vajjala, Sowmya; Meurers, Detmar

doi:10.1075/itl.165.2.04vaj

Article published in: Recent Advances in Automatic Readability Assessment and Text Simplification
Edited by Thomas François and Delphine Bernhard
[ITL - International Journal of Applied Linguistics 165:2] 2014
pp. 194–222

Sowmya Vajjala | LEAD Graduate School and Seminar für Sprachwissenschaft

Detmar Meurers | Eberhard-Karls Universität Tübingen, Germany

Published online: 23 January 2015

Readability assessment can play a role both in evaluating a simplification algorithm and in identifying what to simplify. While some previous research has used traditional readability formulas to evaluate text simplification, there is little research into the utility of readability assessment for identifying and analyzing sentence-level targets for text simplification. We explore this aspect by first constructing a readability model that generalizes across corpora and genres and then adapting this model to make sentence-level readability judgments.

First, we report on experiments establishing that the readability model, which integrates a broad range of linguistic features, works well at the document level, performing on par with the best systems on a standard test corpus. Next, we confirm that the model transfers to different text genres. Moving from documents to sentences, we investigate the model’s ability to correctly identify the difference in reading level between a sentence and its human-simplified version. We conclude that readability models can be useful for identifying simplification targets for human writers and for evaluating machine-generated simplifications.

Keywords: generalizability of readability models, readability assessment, text simplification, sentence readability, simplification evaluation
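
The sentence-level evaluation described in the abstract (checking whether a model assigns a higher reading level to a sentence than to its human-simplified version) can be illustrated with a minimal sketch. This is not the authors' model: the scorer below is the traditional Flesch-Kincaid grade-level formula applied to a single sentence, used only as a stand-in, and the function names (`toy_grade_level`, `pairwise_accuracy`), the syllable heuristic, and the example sentences are all assumptions made for illustration.

```python
import re
from typing import Callable, Iterable, Tuple


def count_syllables(word: str) -> int:
    """Rough syllable estimate: number of consecutive-vowel groups (at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def toy_grade_level(sentence: str) -> float:
    """Flesch-Kincaid grade level for one sentence.

    Stand-in scorer only (an assumption); the paper's model instead
    integrates a broad range of linguistic features.
    """
    words = re.findall(r"[A-Za-z']+", sentence)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    # With a single sentence, words-per-sentence is just the word count.
    return 0.39 * len(words) + 11.8 * (syllables / len(words)) - 15.59


def pairwise_accuracy(
    pairs: Iterable[Tuple[str, str]],
    scorer: Callable[[str], float],
) -> float:
    """Share of (original, simplified) pairs where the original scores as harder."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one sentence pair")
    correct = sum(1 for original, simplified in pairs
                  if scorer(original) > scorer(simplified))
    return correct / len(pairs)


# Hypothetical sentence pair, invented for this example.
ORIGINAL = ("The committee postponed the deliberation because of "
            "insufficient documentation.")
SIMPLIFIED = "The group waited because papers were missing."

if __name__ == "__main__":
    print(pairwise_accuracy([(ORIGINAL, SIMPLIFIED)], toy_grade_level))
```

Any function that maps a sentence to a numeric difficulty score can be passed as `scorer`, so a feature-based model could be swapped in for the formula without changing the evaluation loop.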
