In: Romance Languages and Linguistic Theory 11: Selected papers from the 44th Linguistic Symposium on Romance Languages (LSRL), London, Ontario
Edited by Silvia Perpiñán, David Heap, Itziri Moreno-Villamar and Adriana Soto-Corominas
[Romance Languages and Linguistic Theory 11] 2017
pp. 169–188
Chapter 8. Automatic detection of syntactic patterns from texts with application to Spanish clitic doubling
Published online: 19 October 2017
https://doi.org/10.1075/rllt.11.08est
Abstract
We developed an automated algorithm to retrieve examples of direct object clitic doubling (DOCLD) in Spanish from texts and the web. We focused on the Rioplatense dialect, where this kind of doubling is rather common. Given an electronic text, our procedure has two steps: first, tagging the text with an available part-of-speech (PoS) tagger (TreeTagger); then, inputting the tagged text into Java-based code that extracts all sentences containing direct object clitics and attempts to match each clitic to a candidate doubled NP in its sentence. The procedure identified 100% of the DOCLD cases in a short story (edited text), but only 50% of those in unedited, raw text. Missed DOCLD cases are mainly due to misspellings and missing punctuation in the raw texts. We discuss how to improve accuracy, mainly by reducing the number of false negatives.
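The two-step procedure described in the abstract can be illustrated with a minimal Java sketch. This is not the authors' code: the class name DoublingSketch, the tag names (PPC for clitics, NC and NP for nouns, FS for sentence-final punctuation), the tab-separated "form, tag, lemma" input layout, and the matching heuristic (pair each proclitic lo/la/los/las with the first noun to its right that agrees in number, judged crudely by a final -s) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class DoublingSketch {

    /** One line of tagger output: surface form plus part-of-speech tag. */
    static final class Token {
        final String form;
        final String tag;

        Token(String form, String tag) {
            this.form = form;
            this.tag = tag;
        }
    }

    // Assumed tag names; check them against the tagset actually in use.
    static final String CLITIC = "PPC";       // clitic pronoun
    static final String SENTENCE_END = "FS";  // sentence-final punctuation
    static final Set<String> NOUN_TAGS = Set.of("NC", "NP");
    static final Set<String> DO_CLITICS = Set.of("lo", "la", "los", "las");

    /** Parses "form<TAB>tag<TAB>lemma" lines; lines with fewer than two columns are skipped. */
    static List<Token> parse(List<String> lines) {
        List<Token> tokens = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (cols.length >= 2) {
                tokens.add(new Token(cols[0], cols[1]));
            }
        }
        return tokens;
    }

    /** Splits the token stream into sentences at sentence-final punctuation. */
    static List<List<Token>> sentences(List<Token> tokens) {
        List<List<Token>> result = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        for (Token t : tokens) {
            current.add(t);
            if (t.tag.equals(SENTENCE_END)) {
                result.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    /**
     * Returns "clitic -> noun" pairs for one sentence. Hypothetical heuristic:
     * each direct object clitic is paired with the first noun to its right
     * whose number (final -s or not) matches the clitic's.
     */
    static List<String> candidates(List<Token> sentence) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < sentence.size(); i++) {
            Token clitic = sentence.get(i);
            String cliticForm = clitic.form.toLowerCase();
            if (!clitic.tag.equals(CLITIC) || !DO_CLITICS.contains(cliticForm)) {
                continue;
            }
            boolean cliticPlural = cliticForm.endsWith("s");
            for (int j = i + 1; j < sentence.size(); j++) {
                Token noun = sentence.get(j);
                boolean nounPlural = noun.form.toLowerCase().endsWith("s");
                if (NOUN_TAGS.contains(noun.tag) && nounPlural == cliticPlural) {
                    pairs.add(cliticForm + " -> " + noun.form);
                    break;
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> tagged = List.of(
                "Lo\tPPC\tlo", "vi\tVLfin\tver", "a\tPREP\ta", "Juan\tNP\tJuan", ".\tFS\t.");
        for (List<Token> sentence : sentences(parse(tagged))) {
            candidates(sentence).forEach(System.out::println);
        }
    }
}
```

Because sentence splitting in this sketch depends entirely on punctuation tokens, a text without punctuation collapses into one long sentence, and a misspelled clitic or noun that the tagger mislabels is never considered. This is consistent with the abstract's observation that missed cases stem mainly from misspellings and missing punctuation. The sketch also ignores enclitics attached to the verb and gender agreement.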
Article outline
- 1. Corpus linguistics and the World Wide Web
- 2. Our case study: Identifying DOCLD from web texts
- 3. Precision and recall
- 4. Limitations of off-the-shelf tools (corpora and parsers)
- 5. Pattern identification vs. parsing
- 6. Curated vs. raw text
- 7. Our strategy
- 7.1 Description of pattern matching algorithm (CLDFinder)
- 8. Results
- 8.1 Edited, curated text
- 8.2 Raw text from the web
- 8.2.1 False positives
- 8.2.2 False negatives
- 9. Discussion
References (41)
Alonso, Jaime, Juan José del Coz, Jorge Díez, Oscar Luaces, and Antonio Bahamonde. 2008. “Learning to Predict One or More Ranks in Ordinal Regression Tasks.” In Machine Learning and Knowledge Discovery in Databases: European Conference, Antwerp, Belgium, September 15–19, 2008, Proceedings, edited by Walter Daelemans, Bart Goethals, and Katharina Morik, 39–54. Berlin Heidelberg: Springer Science & Business Media.
Añez, Juancarlo. 2011. “Reply to ‘Efficient Context-Free Grammar Parser, Preferably Python-Friendly.’” Stack Overflow. [URL].
Baroni, Marco, and Adam Kilgarriff. 2006. “Large Linguistically-Processed Web Corpora for Multiple Languages.” EACL 2006 – 11th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, 87.
Barrenechea, Ana María. 1987. El habla culta de la ciudad de Buenos Aires: materiales para su estudio. Edited by Instituto de Filología y Literaturas Hispánicas “Dr. Amado Alonso.” 2 vols. Buenos Aires: Universidad Nacional de Buenos Aires, Facultad de Filosofía y Letras.
Belloro, Valeria A. 2007. “Spanish Clitic Doubling: A Study of the Syntax-Pragmatics Interface.” PhD dissertation, Buffalo, NY: State University of New York at Buffalo. [URL].
Belloro, Valeria A. 2011. “Dislocaciones y doblados: entre la concordancia anafórica y la gramatical.” Hechos y Proyecciones del Lenguaje 20: 127–49.
Belloro, Valeria A. 2012. “Encoding Information Structure via Object Agreement in Spanish Interactions.” In Proceedings of BLS 34, 391–402. Berkeley, CA.
Davies, Mark. 2002. “Corpus Del Español.” Corpus of Spanish. [URL].
Dufter, Andreas. 2009. “Clefting and Discourse Organization: Comparing Germanic and Romance.” In Focus and Background in Romance Languages, edited by Andreas Dufter and Daniel Jacob, 83–121. Amsterdam: John Benjamins Publishing. [URL].
Estigarribia, Bruno. 2005. “Direct Object Clitic Doubling in OT-LFG: A New Look at Rioplatense Spanish.” In The Proceedings of the LFG ’05 Conference, edited by Miriam Butt and Tracy Holloway King. University of Bergen, Norway. [URL].
Estigarribia, Bruno. 2006. “Why Clitic Doubling? A Functional Analysis for Rioplatense Spanish.” In Selected Proceedings of the 8th Hispanic Linguistics Symposium, edited by Timothy L. Face and Carol A. Klee, 123–36. Somerville, MA: Cascadilla Proceedings Project.
Estigarribia, Bruno. 2013. “Rioplatense Spanish Clitic Doubling and ‘Tripling’ in Lexical-Functional Grammar.” In Selected Proceedings of the 15th Hispanic Linguistics Symposium, edited by Chad Howe, Sarah E. Blackwell, and Margaret Lubbers Quesada, 297–309. University of Georgia, Athens: Cascadilla Proceedings Project.
Estigarribia, Bruno. 2014. “La estructura informacional en la triplicación con clíticos del español rioplatense.” Signo y Seña | Revista del Instituto de Lingüística, no. 25: 105–32.
Estigarribia, Bruno. Forthcoming. “The Semantics of Spanish Clitic Left-Dislocations with Epithets.” To appear in Probus.
Fletcher, William H. 2012. “Corpus Analysis of the World Wide Web.” In The Encyclopedia of Applied Linguistics, n.p. Blackwell Publishing Ltd. [URL].
Fontanarrosa, Roberto. 1995a. “Beto.” In La Mesa De Los Galanes y otros cuentos, 42–50. Buenos Aires: Ediciones De La Flor.
Fontanarrosa, Roberto. 1995b. “Periodismo investigativo.” In La Mesa De Los Galanes y otros cuentos, 7–26. Buenos Aires: Ediciones De La Flor.
Gimpel, K., N. Schneider, B. O’Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith. 2011. “Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments.” In Proc. of ACL.
Gries, Stefan Thomas. 2009. Quantitative Corpus Linguistics with R: A Practical Introduction. 1st ed. Routledge.
Gries, Stefan Thomas, Stefanie Wulff, and Mark Davies, eds. 2009. Corpus-Linguistic Applications: Current Studies, New Directions. Rodopi.
Gutiérrez-Rexach, Javier. 1999. “The Formal Semantics of Clitic Doubling.” Journal of Semantics 16 (4): 315–80.
Hopper, Paul J., and Sandra A. Thompson. 1980. “Transitivity in Grammar and Discourse.” Language 56 (2): 251–99.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning (with Applications in R). Springer Texts in Statistics 417. New York-Heidelberg-Dordrecht-London: Springer. [URL].
Keller, Frank, and Mirella Lapata. 2003. “Using the Web to Obtain Frequencies for Unseen Bigrams.” Computational Linguistics 29 (3): 459–84.
Kilgarriff, Adam, and Gregory Grefenstette. 2003. “Introduction to the Special Issue on the Web As Corpus.” Computational Linguistics 29 (3): 333–47.
Ligatto, Dolorès. 1996. Matériau pour l’étude de l’espagnol parlé: la variante argentine. Presses Univ. Limoges.
López, Luis. 2009. A Derivational Syntax for Information Structure. Oxford-New York: Oxford University Press.
Mazzuchino, María Gabriela. 2013. “El doblado de acusativo en el español de Argentina: definitud, especificidad, presuposicionalidad y otras nociones conexas.” Lengua y Habla 17 (0): 118–52.
Real Academia Española. 2014. “Corpus de Referencia Del Español Actual. Banco de Datos (CREA) [en Línea].” Corpus of Spanish. [URL].
Russell, Matthew A. 2013. Mining the Social Web. Second edition. Sebastopol, CA: O’Reilly Media. [URL].
Schmid, Helmut. 1995. “Improvements in Part-of-Speech Tagging with an Application to German.” In Proceedings of the ACL SIGDAT-Workshop. Dublin, Ireland.
Sportiche, Dominique. 1996. “Clitic Constructions.” In Phrase Structure and the Lexicon, edited by Johan Rooryck and Laurie Ann Zaring, 213–76. Dordrecht, Netherlands: Kluwer Academic Publishers.
Subirats, Carlos, and Marc Ortega. 2014. “Corpus Del Español Actual (CEA).” Corpus of Spanish. [URL].
Suñer, Margarita. 1988. “The Role of Agreement in Clitic-Doubled Constructions.” Natural Language & Linguistic Theory 6 (3): 391–434.
Torrego, Esther. 1992. “Case and Argument Structure.” Unpublished manuscript. Boston, University of Massachusetts.
Torrego, Esther. 1995. On the Nature of Clitic Doubling. [URL].
