In: Challenges in Corpus Linguistics: Rethinking corpus compilation and analysis
Edited by Mark Kaunisto and Marco Schilk
[Studies in Corpus Linguistics 118] 2024
pp. 142–170
Modeling fine-grained sociolinguistic variation
The promises and pitfalls of Twitter corpora and neural word embeddings
Published online: 19 September 2024
https://doi.org/10.1075/scl.118.09mil
Abstract
This chapter examines the use of recent data sources and
computational methods to study fine-grained sociolinguistic phenomena. We
deploy a custom-built corpus of tweets (Miletić et al. 2020) and neural
word embeddings to investigate the use of contact-induced semantic shifts
in Quebec English. Drawing on an analysis of 40 lexical items, we show
that our approach facilitates manual inspection of vast amounts of data
and helps establish fine-grained patterns of language variation. While the
approach is affected by a range of noise-related issues, which we describe
in detail, coarse-grained annotation provides an efficient way of
circumventing them. We use the results filtered in this way to conduct a
quantitative analysis of sociolinguistic constraints on contact-induced
semantic shifts, further confirming the relevance of our approach.
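The chapter outline refers to clustering the uses of a lexical item and to regionally specific clusters. As a rough illustration only, the sketch below shows one way such clusters might be screened before manual annotation: given a cluster label and a region label for each occurrence of a word, it computes the share of occurrences from a target region in each cluster. The function names, the region labels ("quebec", "elsewhere"), the toy data, and the 0.75 threshold are all hypothetical; none of them are taken from the chapter, and the cluster labels are assumed to come from some earlier embedding and clustering step that is not shown.

```python
from collections import Counter


def regional_share_by_cluster(cluster_ids, regions, target_region):
    """Return {cluster_id: share of occurrences coming from target_region}."""
    if len(cluster_ids) != len(regions):
        raise ValueError("cluster_ids and regions must have the same length")
    # Total number of occurrences assigned to each cluster.
    totals = Counter(cluster_ids)
    # Number of occurrences per cluster that come from the target region.
    hits = Counter(c for c, r in zip(cluster_ids, regions) if r == target_region)
    return {c: hits[c] / totals[c] for c in totals}


def flag_regional_clusters(shares, threshold=0.75):
    """Return the sorted cluster ids whose regional share meets the threshold.

    The threshold is an illustrative assumption, not a value from the chapter.
    """
    return sorted(c for c, share in shares.items() if share >= threshold)


# Toy data: nine occurrences of one lexical item, already assigned to clusters.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1, 1]
regions = [
    "quebec", "elsewhere", "elsewhere", "quebec",          # cluster 0
    "quebec", "quebec", "quebec", "quebec", "elsewhere",  # cluster 1
]

shares = regional_share_by_cluster(cluster_ids, regions, "quebec")
flagged = flag_regional_clusters(shares)
print(shares)
print(flagged)
```

A cluster flagged in this way would still need manual inspection, since the abstract notes that the results are affected by noise-related issues.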
Article outline
- 1. Introduction
- 2. Theoretical and methodological background
  - 2.1 Semantic shifts in Quebec English: The need for corpus studies
  - 2.2 Twitter-based corpora for language variation
  - 2.3 Vector space models for lexical semantic variation
- 3. Data and method
  - 3.1 A corpus of tweets
  - 3.2 A set of semantic shifts in Quebec English
  - 3.3 Neural word embeddings
  - 3.4 Clustering and annotating the uses of a lexical item
- 4. Results
  - 4.1 An overview of regionally specific clusters
  - 4.2 Types of variation captured by the analysis
    - 4.2.1 True positives
      - A clear-cut distinction
      - A subtler distinction
    - 4.2.2 False positives
      - Cultural effects
      - Proper names
      - French homographs in codeswitched tweets
      - Structural patterns affecting model performance
  - 4.3 Deploying coarsely annotated data for linguistic description
- 5. Discussion and conclusion
References (60)
Bamman, David, Eisenstein, Jacob & Schnoebelen, Tyler. 2014. Gender identity and lexical variation in social media. Journal of Sociolinguistics 18(2): 135–160.
Bird, Steven, Loper, Edward & Klein, Ewan. 2009. Natural Language Processing with Python. Sebastopol CA: O’Reilly Media.
Boberg, Charles. 2005. The North American Regional Vocabulary Survey: New variables and methods in the study of North American English. American Speech 80(1): 22–60.
Boberg, Charles & Hotton, Jenna. 2015. English in the Gaspé region of Quebec. English World-Wide 36(3): 277–314.
Boleda, Gemma. 2020. Distributional semantics and linguistic theory. Annual Review of Linguistics 6: 213–234.
Cajolet-Laganière, Hélène, Martel, Pierre, Masson, Chantal-Édith & Mercier, Louis. 2014. Usito. <[URL]> (20 May 2024).
Chambers, J. K. & Heisler, Troy. 1999. Dialect topography of Québec City English. Canadian Journal of Linguistics/Revue Canadienne de Linguistique 44(1): 23–48.
De Pascale, Stefano. 2019. Token-based Vector Space Models as Semantic Control in Lexical Lectometry. PhD dissertation, KU Leuven.
Del Tredici, Marco & Fernández, Raquel. 2017. Semantic variation in online communities of practice. In IWCS 2017 – 12th International Conference on Computational Semantics – Long papers. <[URL]> (20 May 2024).
Dendien, Jacques & Pierrel, Jean-Marie. 2003. Le trésor de la langue française informatisé. Un exemple d’informatisation d’un dictionnaire de langue de référence. Traitement Automatique des Langues 44(2): 11–37.
Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton & Toutanova, Kristina. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Minneapolis MN: Association for Computational Linguistics.
Dollinger, Stefan. 2015. The Written Questionnaire in Social Dialectology: History, Theory, Practice. Amsterdam: John Benjamins.
Dollinger, Stefan & Fee, Margery. 2017. DCHP-2: The Dictionary of Canadianisms on Historical Principles, 2nd edn. <[URL]> (20 May 2024).
Donoso, Gonzalo & Sánchez, David. 2017. Dialectometric analysis of language variation in Twitter. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), 16–25. Valencia: Association for Computational Linguistics.
Durkin, Philip. 2012. Variation in the lexicon: The ‘Cinderella’ of sociolinguistics? Why does variation in word forms and word meanings present such challenges for empirical research? English Today 28(4): 3–9.
Fee, Margery. 1991. Frenglish in Quebec English newspapers. In Papers of the Fifteenth Annual Meeting of the Atlantic Provinces Linguistic Association, 12–23. New Brunswick: Atlantic Provinces Linguistic Association.
Fee, Margery. 2008. French borrowing in Quebec English. Anglistik: International Journal of English Studies 19(2): 173–188.
Firth, John R. 1957. A synopsis of linguistic theory, 1930–1955. In Studies in Linguistic Analysis, 1–32. Oxford: Blackwell.
Gimpel, Kevin, Schneider, Nathan, O’Connor, Brendan, Das, Dipanjan, Mills, Daniel, Eisenstein, Jacob, Heilman, Michael, Yogatama, Dani, Flanigan, Jeffrey & Smith, Noah A. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 42–47. Portland OR: Association for Computational Linguistics.
Giulianelli, Mario, Del Tredici, Marco & Fernández, Raquel. 2020. Analysing lexical semantic change with contextualised word representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 3960–3973. Stroudsburg PA: Association for Computational Linguistics.
Grant, Pamela. 2010a. English usage in contemporary Quebec: Reflections of the local. In Canadian English: A Linguistic Reader [Strathy Occasional Papers on Canadian English 6], Elaine Gold & Janice McAlpine (eds), 177–197. Kingston ON: Queen’s University.
Grant, Pamela. 2010b. Is Quebec English distinct? English usage in contemporary Quebec [lecture slides]. <[URL]> (20 May 2024).
Grant-Russell, Pamela. 1999. The influence of French on Quebec English: Motivation for lexical borrowing and integration of loanwords. In LACUS Forum 26, Shin Ja J. Hwang & Arle R. Lommel (eds), 473–486. Fullerton CA: The Linguistic Association of Canada and the United States.
Grieve, Jack, Montgomery, Chris, Nini, Andrea, Murakami, Akira & Guo, Diansheng. 2019. Mapping lexical dialect variation in British English using Twitter. Frontiers in Artificial Intelligence 2: 11.
Hengchen, Simon, Tahmasebi, Nina, Schlechtweg, Dominik & Dubossarsky, Haim. 2021. Challenges for computational lexical semantic change. In Computational Approaches to Semantic Change, Nina Tahmasebi, Lars Borin, Adam Jatowt, Yang Xu & Simon Hengchen (eds), 341–372. Berlin: Language Science Press.
Jones, Taylor. 2015. Toward a description of African American Vernacular English dialect regions using “Black Twitter.” American Speech 90(4): 403–440.
Josselin, Amélie. 2001. L’emprunt lexical en France et au Canada: Le cas particulier des anglicismes et des gallicismes et leur traitement lexicographique. DEA thesis, Université de Lyon II.
Laicher, Severin, Kurtyigit, Sinan, Schlechtweg, Dominik, Kuhn, Jonas & Schulte im Walde, Sabine. 2021. Explaining and improving BERT performance on lexical semantic change detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, 192–202. Stroudsburg PA: Association for Computational Linguistics.
Martinc, Matej, Montariol, Syrielle, Zosa, Elaine & Pivovarova, Lidia. 2020. Capturing evolution in word usage: Just add more clusters? In Companion Proceedings of the Web Conference 2020 (WWW ’20), 343–349. New York NY: Association for Computing Machinery.
McArthur, Tom. 1989. The English Language as Used in Quebec: A Survey [Strathy Occasional Papers on Canadian English 3]. Kingston ON: Queen’s University.
Miletić, Filip. 2019. Contact-induced lexical variation in Quebec English: An accountable description. In RJC2019 – 22èmes rencontres des jeunes chercheurs en sciences du langage, Paris, France. <[URL]>
Miletić, Filip, Przewozny-Desriaux, Anne & Tanguy, Ludovic. 2020. Collecting tweets to investigate regional variation in Canadian English. In Proceedings of the 12th Language Resources and Evaluation Conference, 6255–6264. Marseille: European Language Resources Association.
Miletić, Filip, Przewozny-Desriaux, Anne & Tanguy, Ludovic. 2021. Detecting contact-induced semantic shifts: What can embedding-based methods do in practice? In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 10852–10865. Punta Cana, Dominican Republic: Association for Computational Linguistics.
Miletić, Filip, Przewozny-Desriaux, Anne & Tanguy, Ludovic. 2023. Understanding computational models of semantic change: New insights from the speech community. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 9209–9220. Singapore: Association for Computational Linguistics.
Montariol, Syrielle, Martinc, Matej & Pivovarova, Lidia. 2021. Scalable and interpretable semantic change detection. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4642–4652. Stroudsburg PA: Association for Computational Linguistics.
Nguyen, Dong. 2021. Dialect variation on social media. In Similar Languages, Varieties, and Dialects. A Computational Perspective, Marcos Zampieri & Preslav Nakov (eds), 204–218. Cambridge: CUP.
Nguyen, Dat Quoc, Vu, Thanh & Tuan Nguyen, Anh. 2020. BERTweet: A pre-trained language model for English Tweets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 9–14. Stroudsburg PA: Association for Computational Linguistics.
Owoputi, Olutobi, O’Connor, Brendan, Dyer, Chris, Gimpel, Kevin, Schneider, Nathan & Smith, Noah A. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 380–390. Atlanta GA: Association for Computational Linguistics.
Pavalanathan, Umashanthi & Eisenstein, Jacob. 2015. Confounds and consequences in geotagged Twitter data. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2138–2148. Lisbon: Association for Computational Linguistics.
Pedregosa, Fabian, Varoquaux, Gaël, Gramfort, Alexandre, Michel, Vincent, Thirion, Bertrand, Grisel, Olivier & Blondel, Mathieu et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12: 2825–2830.
Poplack, Shana, Walker, James A. & Malcolmson, Rebecca. 2006. An English ‘like no other’? Language contact and change in Quebec. Canadian Journal of Linguistics/Revue Canadienne de Linguistique 51(2–3): 185–213.
Rodda, Martina A., Lenci, Alessandro & Senaldi, Marco S. G. 2017. Panta rei: Tracking semantic change with distributional semantics in Ancient Greek. Italian Journal of Computational Linguistics 3(1): 11–24.
Rouaud, Julie. 2019. Lexical and Phonological Integration of French Loanwords into Varieties of Canadian English Since the Seventeenth Century. PhD dissertation, Université Toulouse – Jean Jaurès.
Schlechtweg, Dominik, Hätty, Anna, Del Tredici, Marco & Schulte im Walde, Sabine. 2019. A wind of change: Detecting and evaluating lexical semantic change across times and domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 732–746. Florence: Association for Computational Linguistics.
Schlechtweg, Dominik, McGillivray, Barbara, Hengchen, Simon, Dubossarsky, Haim & Tahmasebi, Nina. 2020. SemEval-2020 task 1: Unsupervised lexical semantic change detection. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, A. Herbelot, X. Zhu, A. Palmer, N. Schneider, J. May & E. Shutova (eds), 1–23. Barcelona: International Committee for Computational Linguistics.
Shoemark, Philippa, Sur, Debnil, Shrimpton, Luke, Murray, Iain & Goldwater, Sharon. 2017. Aye or naw, whit dae ye hink? Scottish independence and linguistic identity on social media. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Vol. 1: Long Papers, 1239–1248. Valencia: Association for Computational Linguistics.
Statistics Canada. 2022. Table 98-10-0218-01. Mother tongue by age: Canada, provinces and territories. <[URL]> (20 May 2024).
Tagliamonte, Sali A. 2002. Comparative sociolinguistics. In The Handbook of Language Variation and Change, Jack K. Chambers, Peter Trudgill & Natalie Schilling-Estes (eds), 729–763. Malden MA: Blackwell.
Tahmasebi, Nina, Borin, Lars & Jatowt, Adam. 2021. Survey of computational approaches to lexical semantic change. In Computational Approaches to Semantic Change, Nina Tahmasebi, Lars Borin, Adam Jatowt, Yang Xu & Simon Hengchen (eds), 1–91. Berlin: Language Science Press.
Takamura, Hiroya, Nagata, Ryo & Kawasaki, Yoshifumi. 2017. Analyzing semantic change in Japanese loanwords. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Vol. 1: Long Papers, 1195–1204. Valencia: Association for Computational Linguistics.
Turney, Peter D. & Pantel, Patrick. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37: 141–188.
Uban, Ana, Ciobanu, Alina Maria & Dinu, Liviu P. 2019. Studying laws of semantic divergence across languages using cognate sets. In Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change, 161–166. Florence: Association for Computational Linguistics.
Wolf, Thomas, Debut, Lysandre, Sanh, Victor, Chaumond, Julien, Delangue, Clement, Moi, Anthony & Cistac, Pierric et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45. Stroudsburg PA: Association for Computational Linguistics.
