In: Spanish Sociolinguistics in the 21st Century: Current trends and methodologies
Edited by Cecilia Montes-Alcalá and Miguel García
[Issues in Hispanic and Lusophone Linguistics 42] 2025
pp. 203–229
Chapter 9. A sociolinguistic analysis of a deep learning based classification model of South American voseo in X posts
Published online: 15 May 2025
https://doi.org/10.1075/ihll.42.09res
Abstract
Here, I present the implementation of a dialectal
classification system that uses voseo in X
(formerly Twitter) posts to identify speakers of Colombian
(Paisa and Caleño) and
Argentine (Buenos Aires and La Plata) Spanish. Two datasets of over
18,000 posts were collected from recent X posts according to each
post's geolocation. The data was used to train and
evaluate a transformer-based machine learning classifier of South
American voseo. Results show that the system is
able to identify the voseo region with a high
degree of accuracy (an F1 score of 0.84 and an AUC ROC, the area under the
receiver operating characteristic curve, of 0.88). A sociolinguistic
analysis of each dataset gave further insight into the accuracy of
the classifier, the status of voseo, and the
discourse function of voseo and other second-person
singular forms of address (2PS), particularly in the context of
Colombian voseo. An examination of the lexical,
syntactical, and grammatical properties of Colombian and Argentine
voseo also offered more detailed information on
the properties not considered by the model. The natural language
processing (NLP) methods presented here aim to pave the way for
innovative approaches with high potential in Spanish
sociolinguistics research.
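
As an illustration of the two evaluation metrics reported above, the following minimal sketch computes F1 and AUC ROC for a binary region classifier. Everything in it is an assumption made for illustration: the toy labels (1 for an Argentine post, 0 for a Colombian post), the scores, the 0.5 decision threshold, and the function names are hypothetical and are not taken from the chapter's data or code.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC ROC as the probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute AUC ROC.")
    # Count pairwise wins; ties count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = Argentine voseo post, 0 = Colombian voseo post.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]  # made-up classifier probabilities
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

f1 = f1_score(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"F1 = {f1:.2f}, AUC ROC = {auc:.2f}")
```

F1 depends on the chosen decision threshold, whereas AUC ROC summarizes ranking quality across all thresholds, which is one reason the two figures are commonly reported together.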
Article outline
- 1. Introduction
- 1.1 The current situation of Colombian and Argentinian voseo
- 1.2 Voseo and social media
- 1.3 New computational approaches to Sociolinguistics
- 2. Methodology
- 3. Results
- 3.1 Machine learning model
- 3.2 Sociolinguistic analysis: Status of 2PS
- 3.3 Sociolinguistic analysis: Lexical features
- 3.4 Sociolinguistic analysis: Emotions and (Im)polite speech acts
- 3.5 Syntactic and grammatical features
- 4. Discussion and conclusion
