Article published in: International Journal of Corpus Linguistics
Vol. 22:2 (2017), pp. 212–241
Using word n-grams to identify authors and idiolects
A corpus approach to a forensic linguistic problem
Published online: 16 October 2017
https://doi.org/10.1075/ijcl.22.2.03wri
Abstract
Forensic authorship attribution is concerned with identifying the writers of anonymous criminal documents. Over the last twenty years, computer scientists have developed a wide range of statistical procedures using a number of different linguistic features to measure similarity between texts. However, much of this work is not of practical use to forensic linguists who need to explain in reports or in court why a particular method of identifying potential authors works. This paper sets out to address this problem using a corpus linguistic approach and the 176-author 2.5 million-word Enron Email Corpus. Drawing on literature positing the idiolectal nature of collocations, phrases and word sequences, this paper tests the accuracy of word n-grams in identifying the authors of anonymised email samples. Moving beyond the statistical analysis, the usage-based concept of entrenchment is offered as a means by which to account for the recurring and distinctive production of idiolectal word n-grams.
Keywords: forensic linguistics, idiolect, authorship attribution, entrenchment, Enron
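The attribution task described in the abstract (comparing an anonymised email sample against candidate authors by the word n-grams they share) can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, tokenisation or n-gram lengths used in the study, so everything below is an assumption made for illustration: Jaccard overlap as the similarity measure, lowercased whitespace tokenisation, n = 3, and the hypothetical function names `word_ngrams`, `jaccard` and `attribute`. The two "authors" and their texts are invented toy data, not material from the Enron Email Corpus.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) in a text.

    Tokenisation here is a simple lowercase whitespace split, which is an
    assumption for this sketch rather than the procedure used in the paper.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items divided by all items."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def attribute(sample, candidates, n=3):
    """Score a disputed sample against each candidate author's known writing.

    `candidates` maps an author label to that author's known text.
    Returns the best-scoring author label and the full score table.
    """
    sample_grams = word_ngrams(sample, n)
    scores = {
        author: jaccard(sample_grams, word_ngrams(known_text, n))
        for author, known_text in candidates.items()
    }
    return max(scores, key=scores.get), scores


# Invented toy data, for illustration only.
candidates = {
    "author_a": "please let me know if you have any questions thanks",
    "author_b": "give me a call when you get a chance to discuss",
}
sample = "let me know if you have any comments"

best, scores = attribute(sample, candidates, n=3)
print(best, scores)
```

Because the score is a ratio of shared to total n-grams, the specific strings an author shares with the disputed sample can be listed directly (`word_ngrams(sample, 3) & word_ngrams(candidates["author_a"], 3)`), which is the kind of explainability the abstract argues forensic linguists need in reports and in court.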
Article outline
- 1. The linguistic individual, corpora and forensic linguistics
- 2. Word strings as features in authorship analysis
- 2.1 Word strings, routine and the individual
- 2.2 Empirical evidence for idiolectal word strings
- 2.3 ‘Word n-grams’ in this study
- 3. Methodology
- 3.1 The Enron Email Corpus
- 3.2 The authorship attribution experiment
- 4. Attribution results
- 4.1 Effect of sample size
- 4.2 Performance of different n-gram lengths
- 4.3 Performance across authors
- 5. Identifying idiolectal word n-grams
- 6. Conclusions and implications
- Acknowledgements
- References
