Article published in: Register Studies
Vol. 2:2 (2020), pp. 306–349
Pinning down text complexity
An Exploratory Study on the Registers of the Stockholm-Umeå Corpus (SUC)
Published online: 13 August 2020
https://doi.org/10.1075/rs.19005.san
Abstract
In this article, we present the results of a corpus-based study in which we explore whether different facets of text complexity can be
singled out automatically in a general-purpose corpus. To this end, we use factor analysis as applied in Biber’s
multi-dimensional analysis framework. We evaluate the factor solution by correlating factor scores with readability scores to
ascertain whether the selected solution matches an independent measurement of readability, a notion tightly linked to text
complexity. The corpus used in the study is the Swedish national corpus, the Stockholm-Umeå Corpus (SUC). The SUC
contains subject-based text varieties (e.g., hobby), press genres (e.g., editorials), and mixed categories (e.g., miscellaneous). We refer
to them collectively as ‘registers’. Results show that, despite some caveats, it is indeed possible to elicit and interpret facets of text
complexity using factor analysis. We propose a tentative text complexity profiling of the SUC registers.
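The evaluation step described in the abstract, correlating factor scores with readability scores, can be illustrated with a minimal sketch. It assumes LIX (the readability index named in the article outline) computed with its standard formula, a deliberately naive tokenizer and sentence splitter, and made-up factor scores; the function names `lix` and `pearson` are hypothetical and are not taken from the authors' R code.

```python
import math
import re


def lix(text: str) -> float:
    """LIX = words per sentence + percentage of words longer than six letters."""
    words = re.findall(r"\w+", text)
    if not words:
        return 0.0
    # Assumption: a run of ., ! or ? ends a sentence; text with none counts as one sentence.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / sentences + 100.0 * long_words / len(words)


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Toy texts and invented factor scores, for illustration only (not SUC data).
    texts = [
        "Jag gillar det. Du också.",
        "Kommunstyrelsen beslutade om förändringar i organisationen.",
        "Hon sprang hem. Sedan ringde hon sin bror.",
    ]
    factor_scores = [1.2, -0.9, 0.7]  # hypothetical scores on one factor
    lix_scores = [lix(t) for t in texts]
    print("LIX:", [round(s, 1) for s in lix_scores])
    print("r(factor, LIX):", round(pearson(factor_scores, lix_scores), 3))
```

A strong correlation, positive or negative, between a factor's scores and LIX would suggest that the factor captures something readability also measures. The sign only indicates which pole of the dimension goes with harder texts.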
Article outline
- 1. Introduction
- 2. Text complexity and readability
- 3. Previous work
  - 3.1 Multi-dimensional analysis
  - 3.2 Readability-text complexity: Automatic approaches
- 4. Method
  - 4.1 The SUC corpus and dataset
  - 4.2 Multi-dimensional analysis: Technicalities
    - 4.2.1 Variable screening
    - 4.2.2 Running multi-dimensional analysis
    - 4.2.3 Three-Factors solution
- 5. Meaningful factors? Evaluation and interpretation
  - 5.1 Evaluation: Correlating LIX scores & factor scores
    - 5.1.1 Factor 1 scores & LIX scores
    - 5.1.2 Factor 2 scores & LIX scores
    - 5.1.3 Factor 3 scores & LIX scores
    - 5.1.4 Summary
  - 5.2 Interpretation: Signed dimensions & text complexity facets
    - 5.2.1 Factor 1: Dim1+ & Dim1−
    - 5.2.2 Dim1+: Pronominal-Adverbial (spoken-emotional) facet – Average readability
    - 5.2.3 Dim1−: Nominal (informational) facet – Difficult readability
    - 5.2.4 Factor 2: Dim2+
    - 5.2.5 Dim2+: Adjectival (information elaboration) facet – Difficult readability
    - 5.2.6 Factor 3: Dim3+ & Dim3−
    - 5.2.7 Dim3+: Verbal (engaged) facet – Difficult readability
    - 5.2.8 Dim3−: Appositional (information expansion) facet – Difficult readability
    - 5.2.9 Summary
- 6. Profiling SUC registers
- 7. Discussion
- 8. Conclusion and future work
- Notes
Companion website
The study described in this paper is fully reproducible. Datasets, radar charts and R code are available here: <http://santini.se/registerstudies2020>.
