In: Corpora and Rhetorically Informed Text Analysis: The diverse applications of DocuScope
Edited by David West Brown and Danielle Zawodny Wetzel
[Studies in Corpus Linguistics 109] 2023
pp. 148–166
Public policy research applications of DocuScope’s linguistic taxonomy
Mining style and stance for sociocultural insight
Published online: 29 June 2023
https://doi.org/10.1075/scl.109.07mar
Abstract
Computer scientists in natural language processing (NLP) have focused
on the lexical level of language: word counts, ratios, distance, and
context. This attention to the lexical level is well suited to
semantic tasks as well as syntactic analyses. Corpus linguists, on the
other hand, have had a broader focus that also accounts for the
lexicogrammatical level of language, and their approach is thus well
suited to pragmatic tasks. DocuScope, with its linguistic taxonomy at
the lexicogrammatical level, is therefore a unique and complementary
tool for the data-driven analysis of large collections of text,
addressing the stance and style choices pervasive in linguistic
behavior. This chapter looks at how DocuScope’s taxonomy has informed
work on a range of problems in public policy at the RAND Corporation.
One section of the chapter examines how the DocuScope taxonomy has
been used as a statistical tool to find patterns in text corpora,
scaling up human qualitative analysis into a mixed-methods text
analysis approach, for example in analyzing open-text responses in a
large survey of U.S. special forces operators. A second section shows
how the DocuScope taxonomy has improved machine learning efforts, in
terms of both accuracy and interpretability, for example in detecting
and understanding conspiracy theory discourse on social media. The
chapter ultimately calls for humanistic knowledge as a valuable and
necessary complement to technical advances in data-centric
disciplines like NLP.
Article outline
- 1. Introduction
- 2. Overview of DocuScope’s usage at RAND
- 2.1 The RAND-Lex instantiation of the DocuScope dictionaries: Quantifying stance
- 2.1.1 Machine + human reading: Scaling up qualitative analysis
- 2.1.2 Quantitative representations of stance for machine learning
- 3. Example applications of the DocuScope dictionaries in public policy research
- 3.1 Scaling up human reading: Analyzing attitudes in survey responses and measuring changes in news presentation
- 3.1.1 Analyzing attitudes in survey responses from special operations members
- 3.1.2 Measuring style at scale: Has U.S. news reporting become more subjective over time?
- 3.2 Improving machine reading through linguistic stance
- 3.2.1 Election interference: Understanding Russian trolls and U.S. partisanship
- 3.2.2 Stance across language: Understanding the Arabic Bin Laden archive
- 3.2.3 Hybrid modeling: Improving machine learning performance and insight with the DocuScope dictionaries
- 3.2.4 Stance’s value is document-length dependent
- 3.2.5 Modeling with stance: Improved interpretability
- 4. Filling in NLP gaps through humanistic theory
