Article published in: International Journal of Learner Corpus Research
Vol. 8:1 (2022), pp. 121–138
Corpus reports
The University of Pittsburgh English Language Institute Corpus (PELIC)
Published online: 8 March 2022
https://doi.org/10.1075/ijlcr.21002.nai
Abstract
This report introduces the University of Pittsburgh English Language Institute Corpus (PELIC;
Juffs, Han, & Naismith, 2020), a publicly available 4.2-million-word learner corpus of
written texts. Collected over seven years in the University of Pittsburgh’s Intensive English Program, these texts were produced
by more than 1,100 students with diverse linguistic backgrounds and proficiency levels. Unlike most learner corpora, which are
cross-sectional, PELIC is longitudinal, offering greater opportunities for tracking development in a natural classroom setting.
This potential is illustrated in an overview of the research conducted to date with these data. The report also provides a
description of PELIC’s creation and contents, including how the texts have been managed to facilitate natural language processing.
Overall, the corpus contributes to the field of learner corpus research by adding to the pool of freely and publicly available
learner corpora, supplemented by a useful set of Python tools and tutorials for accessing these data.
Keywords: ESL, IEP, longitudinal development, multi-L1 corpus, PELIC
Article outline
- 1. Introduction
- 2. Corpus description
- 2.1 PELIC background, context, and design
- 2.2 Corpus size
- 2.3 Participants
- 2.4 Corpus summary
- 3. Data collection and processing
- 3.1 Data collection
- 3.2 Ethical and legal concerns
- 3.3 Data cleaning
- 3.4 Data processing
- 3.4.1 Tokenization
- 3.4.2 Part-of-speech tagging and lemmatization
- 4. Additional resources
- 4.1 Tutorials
- 4.1.1 Corpus compilation
- 4.1.2 Exploratory data analysis (EDA)
- 4.1.3 Concordancing tutorial
- 4.2 Pitt ELI toolkit (PELITK)
- 4.2.1 Concordancing package
- 4.2.2 Lexical proficiency
- 4.3 PELIC spelling
- 5. Current PELIC research
- 6. Future developments
- 7. Conclusion
- Acknowledgements
- Notes
References
Alexopoulou, T., Geertzen, J., Korhonen, A., & Meurers, D. (2015). Exploring big educational learner corpora for SLA research: Perspectives on relative clauses. International Journal of Learner Corpus Research, 1(1), 96–129.
Atkinson, K. (2019). Spell Checking Oriented Word Lists (SCOWL) (Version 2019). [URL]
Biber, D., Reppen, R., Staples, S., & Egbert, J. (2020). Exploring the longitudinal development of grammatical complexity in the disciplinary writing of L2-English university students. International Journal of Learner Corpus Research, 6(1), 38–71.
Blanchard, D., Tetreault, J., Higgins, D., Cahill, A., & Chodorow, M. (2014). ETS Corpus of Non-Native Written English LDC2014T06. Linguistic Data Consortium.
Callies, M. (2015). Learner corpus methodology. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 35–56). Cambridge University Press.
Centre for English Corpus Linguistics. (2021a). Longitudinal Database of Learner English (LONGDALE). Université catholique de Louvain. [URL]
Centre for English Corpus Linguistics. (2021b). Learner corpora around the world. Université catholique de Louvain. [URL]
Davies, M. (2008–). The Corpus of Contemporary American English (COCA): 560 million words, 1990-present. [URL]
Dunlap, S. (2012). Orthographic quality in English as a second language (Unpublished doctoral dissertation). University of Pittsburgh.
Etaiwi, W., & Naymat, G. (2017). The impact of applying different preprocessing steps on review spam detection. Procedia Computer Science, 113(1), 273–279.
Gablasova, D., Brezina, V., & McEnery, T. (2017). Exploring learner language through corpora: Comparing and interpreting corpus frequency information. Language Learning, 67(1), 130–154.
Garbe, W. (2020). SymSpell (Version 6.7). [URL]
Gilquin, G. (2015). From design to collection of learner corpora. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 9–34). Cambridge University Press.
Granger, S., Dupont, M., Meunier, F., Naets, H., & Paquot, M. (2020). The International Corpus of Learner English. Version 3. Presses universitaires de Louvain. [URL]
Honnibal, M. (2013). A good part-of-speech tagger in about 200 lines of Python. Explosion. [URL]
Juffs, A., & Han, N.-R. (2019, March 12). Combining formal and usage-based theories with data science techniques in measuring the development of syntactic complexity in written production. Paper presented at the International Conference of the American Association of Applied Linguistics, Atlanta, GA.
Juffs, A., Han, N.-R., & Naismith, B. (2020). The University of Pittsburgh English Language Corpus (PELIC) [Data set].
Marcus, M. P., Santorini, B., Marcinkiewicz, M. A., & Taylor, A. (1999). Treebank-3 LDC99T42 [Web Download]. Linguistic Data Consortium. [URL]
Meunier, F. (2016). Introduction to the LONGDALE Project. In E. Castello, K. Ackerley, & F. Coccetta (Eds.), Studies in learner corpus linguistics. Research and applications for foreign language teaching and assessment (pp. 123–126). Peter Lang.
Naismith, B., Han, N.-R., Juffs, A., Hill, B. L., & Zheng, D. (2018). Accurate measurement of lexical sophistication with reference to ESL learner data. In K. E. Boyer & M. Yudelson (Eds.), Proceedings of the 11th International Conference on Educational Data Mining (pp. 259–265).
Naismith, B., & Juffs, A. (2021). Finding the sweet spot: Learners’ productive knowledge of mid-frequency lexical items. Language Teaching Research.
Nation, I. S. P. (2013). Learning vocabulary in another language (2nd ed.). Cambridge University Press.
Picoral, A., Staples, S., & Reppen, R. (2021). Automated annotation of learner English. International Journal of Learner Corpus Research, 7(1), 17–52.
Rankin, T., & Schiftner, B. (2011). Marginal prepositions in learner English: Applying local corpus data. International Journal of Corpus Linguistics, 16(3), 412–34.
Someya, Y. (1998). Someya Lemma List. [URL]
Tidball, F., & Treffers-Daller, J. (2008). Analysing lexical richness in French learner language: what frequency lists and teacher judgements can tell us about basic and advanced words. Journal of French Language Studies, 18(3), 299–313.
van Rooy, B., & Schäfer, L. (2009). The effect of learner errors on POS tag errors during automatic POS tagging. Southern African Linguistics and Applied Language Studies, 20(4), 325–335.
