Pinning down text complexity: An Exploratory Study on the Registers of the Stockholm-Umeå Corpus (SUC)

Santini, Marina; Jönsson, Arne

doi:10.1075/rs.19005.san

Article published in: Register Studies
Vol. 2:2 (2020) ► pp.306–349

Marina Santini | RISE Research Institutes of Sweden

Arne Jönsson | Linköping University

Published online: 13 August 2020

Abstract

In this article, we present the results of a corpus-based study where we explore whether it is possible to automatically single out different facets of text complexity in a general-purpose corpus. To this end, we use factor analysis as applied in Biber’s multi-dimensional analysis framework. We evaluate the results of the factor solution by correlating factor scores and readability scores to ascertain whether the selected factor solution matches the independent measurement of readability, which is a notion tightly linked to text complexity. The corpus used in the study is the Swedish national corpus, called Stockholm-Umeå Corpus or SUC. The SUC contains subject-based text varieties (e.g., hobby), press genres (e.g., editorials), and mixed categories (e.g., miscellaneous). We refer to them collectively as ‘registers’. Results show that it is indeed possible to elicit and interpret facets of text complexity using factor analysis despite some caveats. We propose a tentative text complexity profiling of the SUC registers.

Keywords: text complexity, readability, corpus-based analysis, multivariate statistics, factor analysis, Multi-Dimensional Analysis (MDA), correlation tests
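
The evaluation step summarized in the abstract (correlating per-text factor scores with readability scores) can be sketched as follows. This is a minimal illustration, not the authors' code, which is in R on the companion website. It assumes LIX (Björnsson, 1968) as the readability measure, as named in the article outline, and Pearson's r as the correlation test. The tokenization is deliberately naive, and the function names `lix` and `pearson` are hypothetical.

```python
import math
import re


def lix(text: str) -> float:
    """LIX = average sentence length + percentage of words longer than six letters."""
    # Naive segmentation (an assumption of this sketch): sentences end at
    # '.', '!' or '?'; words are runs of letters, including Swedish å, ä, ö.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Given one LIX score and one factor score per text, `pearson(factor_scores, lix_scores)` returns a value in [-1, 1]. Values near zero would suggest that the factor captures something other than what LIX measures.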

Article outline

1. Introduction
2. Text complexity and readability
3. Previous work
- 3.1 Multi-dimensional analysis
- 3.2 Readability-text complexity: Automatic approaches
4. Method
- 4.1 The SUC corpus and dataset
- 4.2 Multi-dimensional analysis: Technicalities
  - 4.2.1 Variable screening
  - 4.2.2 Running multi-dimensional analysis
  - 4.2.3 Three-Factors solution
5. Meaningful factors? Evaluation and interpretation
- 5.1 Evaluation: Correlating LIX scores & factor scores
  - 5.1.1 Factor 1 scores & LIX scores
  - 5.1.2 Factor 2 scores & LIX scores
  - 5.1.3 Factor 3 scores & LIX scores
  - 5.1.4 Summary
- 5.2 Interpretation: Signed dimensions & text complexity facets
  - 5.2.1 Factor 1: Dim1+ & Dim1−
  - 5.2.2 Dim1+: Pronominal-Adverbial (spoken-emotional) facet – Average readability
  - 5.2.3 Dim1−: Nominal (informational) facet – Difficult readability
  - 5.2.4 Factor 2: Dim2+
  - 5.2.5 Dim2+: Adjectival (information elaboration) facet – Difficult readability
  - 5.2.6 Factor 3: Dim3+ & Dim3−
  - 5.2.7 Dim3+: Verbal (engaged) facet – Difficult readability
  - 5.2.8 Dim3−: Appositional (information expansion) facet – Difficult readability
  - 5.2.9 Summary
6. Profiling SUC registers
7. Discussion
8. Conclusion and future work
Notes
Companion website
References

Companion website

The study described in this paper is fully reproducible. Datasets, radar charts and R code are available here: <http://santini.se/registerstudies2020>.

References

Adesam, Y., Bouma, G., & Johansson, R. (2018). The Koala part-of-speech and morphological tagset for Swedish. SLTC.

Asención-Delaney, Y., & Collentine, J. (2011). A multidimensional analysis of a written L2 Spanish corpus. Applied linguistics, 32(3), 299–322.

Biber, D. (1988). Variation across speech and writing. Cambridge University Press.

Biber, D. (1989). A typology of English texts. Linguistics, 27(1), 3–44.

Biber, D. (1995). Dimensions of register variation: A cross-linguistic comparison. Cambridge University Press.

Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman grammar of spoken and written English. Longman.

Biber, D., & Kurjian, J. (2007). Towards a taxonomy of web registers and text types: A multi-dimensional analysis. In Corpus Linguistics and the Web (pp. 109–131).

Biber, D., & Conrad, S. (2009). Register, genre, and style. Cambridge University Press.

Biber, D., & Egbert, J. (2016). Register variation on the searchable web: A multi-dimensional analysis. Journal of English Linguistics, 44(2), 95–137.

Björnsson, C. H. (1968). Läsbarhet. Liber.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate behavioral research, 1(2), 245–276.

Collins-Thompson, K. (2014). Computational assessment of text readability: A survey of current and future research. ITL-International Journal of Applied Linguistics, 165(2), 97–135.

Common Core State Standards Initiative. (2010). Common Core State Standards for English Language Arts & Literacy in History/Social Studies, Science, and Technical Subjects. Appendix A: Research Supporting Key Elements of the Standards, Glossary of Key Terms.

Cvrček, V., Komrsková, Z., Lukeš, D., Poukarová, P., Řehořková, A., Zasina, A. J., & Benko, V. (2020). Comparing web-crawled and traditional corpora. Language Resources and Evaluation, 1–33.

Dahl, Ö. (2004). The growth and maintenance of linguistic complexity (Vol. 71). John Benjamins Publishing.

Dale, E., & Chall, J. S. (1949). The concept of readability. Elementary English, 26(1), 19–26.

Dell’Orletta, F., Montemagni, S., & Venturi, G. (2013, September). Linguistic profiling of texts across textual genres and readability levels. An exploratory study on Italian fictional prose. In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013 (pp. 189–197).

Dell’Orletta, F., Montemagni, S., & Venturi, G. (2014). Assessing document and sentence readability in less resourced languages and across textual genres. ITL-International Journal of Applied Linguistics, 165(2), 163–193.

DiStefano, C., Zhu, M., & Mindrila, D. (2009). Understanding and using factor scores: Considerations for the applied researcher. Practical Assessment, Research & Evaluation, 14(20), 1–11.

Fahlborg, D., & Rennes, E. (2016). Introducing SAPIS–an API service for text analysis and simplification. In the second national Swe-Clarin workshop: Research collaborations for the digital age, Umeå, Sweden.

Falkenjack, J. (2018). Towards a model of general text complexity for Swedish (Doctoral dissertation, Linköping University Electronic Press).

Falkenjack, J., Mühlenbock, K. H., & Jönsson, A. (2013, May). Features indicating readability in Swedish text. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013) (pp. 27–40).

Falkenjack, J., Santini, M., & Jönsson, A. (2016). An exploratory study on genre classification using readability features. In Proceedings of the Sixth Swedish Language Technology Conference (SLTC 2016), Umeå, Sweden.

Feng, L. (2010). Automatic readability assessment (Doctoral dissertation, CUNY Academic Works).

Field, A. (2000). Discovering statistics using SPSS for Windows. London: Sage Publications.

Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32(3), 221–233.

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Sage publications.

Hayton, J. C., Allen, D. G., & Scarpello, V. (2004). Factor retention decisions in exploratory factor analysis: A tutorial on parallel analysis. Organizational research methods, 7(2), 191–205.

Hiebert, E. H. (2012). Readability and the common core’s staircase of text complexity. Text Matters, 11.

Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179–185.

Housen, A., De Clercq, B., Kuiken, F., & Vedder, I. (2019). Multiple approaches to complexity in second language research. Second Language Research, 35(1), 3–21.

Jelen, B. (2013). Excel 2013 charts and graphs. Que Publishing Company.

Jönsson, S., Rennes, E., Falkenjack, J., & Jönsson, A. (2018). A component based approach to measuring text complexity. In Proceedings of The Seventh Swedish Language Technology Conference 2018 (SLTC-18).

Kate, R. J., Luo, X., Patwardhan, S., Franz, M., Florian, R., Mooney, R. J., & Welty, C. (2010, August). Learning to predict readability using diverse linguistic features. In Proceedings of the 23rd international conference on computational linguistics (pp. 546–554). Association for Computational Linguistics.

Källgren, G., Gustafson-Capková, S., & Hartmann, B. (2006). Manual of the Stockholm Umeå Corpus version 2.0. Department of Linguistics, Stockholm University, December. Sofia Gustafson-Capková and Britt Hartmann (eds.).

Ledesma, R. D., Valero-Mora, P., & Macbeth, G. (2015). The scree test and the number of factors: A dynamic graphics approach. The Spanish Journal of Psychology, 18.

Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4), 474–496.

Mühlenbock, K. H. (2013). I see what you mean: Assessing readability for specific target groups. (Doctoral dissertation, University of Gothenburg, Gothenburg, Sweden).

Napolitano, D., Sheehan, K. M., & Mundkowsky, R. (2015, June). Online readability and text complexity analysis with Text Evaluator. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations (pp. 96–100).

Nenkova, A., Chae, J., Louis, A., & Pitler, E. (2010). Structural features for predicting the linguistic quality of text. In Empirical methods in natural language generation (pp. 222–241). Springer, Berlin, Heidelberg.

Nivre, J. (2006). Inductive dependency parsing (pp. 87–120). Springer Netherlands.

Pallotti, G. (2015). A simple view of linguistic complexity. Second Language Research, 31(1), 117–134.

Petersen, S. (2007). Natural language processing tools for reading level assessment and text simplification for bilingual education. (Doctoral dissertation, University of Washington, Seattle, WA, USA).

Petersen, S. E., & Ostendorf, M. (2009). A machine learning approach to reading level assessment. Computer Speech & Language, 23(1), 89–106.

Pilán, I., Vajjala, S., & Volodina, E. (2016). A readable read: Automatic assessment of language learning materials based on linguistic complexity. arXiv preprint arXiv:1603.08868.

Pitler, E., & Nenkova, A. (2008, October). Revisiting readability: A unified framework for predicting text quality. In Proceedings of the 2008 conference on empirical methods in natural language processing (pp. 186–195).

Rello, L., Baeza-Yates, R., Bott, S., & Saggion, H. (2013a). Simplify or help? Text simplification strategies for people with dyslexia. In Proceedings of the 10th International Cross-Disciplinary Conference on Web Accessibility (pp. 1–10).

Rello, L., Baeza-Yates, R., Dempere-Marco, L., & Saggion, H. (2013b). Frequent words improve readability and short words improve understandability for people with dyslexia. In IFIP Conference on Human-Computer Interaction (pp. 203–219). Springer.

Saggion, H. (2017). Automatic text simplification. Synthesis Lectures on Human Language Technologies, 10(1), 1–137.

Santini, M., Danielsson, B., & Jönsson, A. (2019, August). Introducing the Notion of ‘Contrast’ Features for Language Technology. In International Conference on Database and Expert Systems Applications (pp. 189–198). Springer, Cham.

Sardinha, T. B., Kauffmann, C., & Acunzo, C. M. (2014). A multi-dimensional analysis of register variation in Brazilian Portuguese. Corpora, 9(2), 239–271.

Sardinha, T. B., & Pinto, M. V. (Eds.). (2014). Multi-dimensional analysis, 25 years on: A tribute to Douglas Biber (Vol. 60). John Benjamins Publishing Company.

Štajner, S., & Saggion, H. (2018, August). Data-Driven Text Simplification. In Proceedings of the 27th International Conference on Computational Linguistics: Tutorial Abstracts (pp. 19–23).

Vega, B., Feng, S., Lehman, B., Graesser, A., & D’Mello, S. (2013, July). Reading into the text: Investigating the influence of text complexity on cognitive engagement. In Educational Data Mining 2013.

Wray, D., & Janan, D. (2013). Readability revisited? The implications of text complexity. The Curriculum Journal.

Cited by three other publications

de Brún, Jacqueline, Pádraig Ó Duibhir & Eithne Kennedy

2025. The teaching and learning of reading in an immersion setting: A focus on word recognition. International Journal of Bilingual Education and Bilingualism 28:10 ► pp. 1329 ff.

Tao, Xuelian & Vahid Aryadoust

2024. A Multidimensional Analysis of a High-Stakes English Listening Test: A Corpus-Based Approach. Education Sciences 14:2 ► pp. 137 ff.

Vahrusheva, Alexandra, Valery Solovyev, Marina Solnyshkina, Elzara Gafiaytova & Svetlana Akhtyamova

2023. Revisiting Assessment of Text Complexity: Lexical and Syntactic Parameters Fluctuations. In Speech and Computer [Lecture Notes in Computer Science, 14338], ► pp. 430 ff.

This list is based on CrossRef data as of 30 November 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.