In: Spoken Corpora and Linguistic Studies
Edited by Tommaso Raso and Heliana Mello
[Studies in Corpus Linguistics 61] 2014
pp. 27–68
Methodological issues for spontaneous speech corpora compilation
The case of C-ORAL-BRASIL
Published online: 14 November 2014
https://doi.org/10.1075/scl.61.01mel
Spontaneous speech corpus compilation has grown steadily over the past 20 years. This growth is due mainly to technological advances that allow highly accurate in vivo recording, new insights from empirically based linguistic theory, concern for the documentation of threatened languages, and the high relevance of findings to speech recognition applications. This paper discusses methodologies associated with spontaneous speech corpus compilation that shed light on aspects relevant to understanding linguistic phenomena specific to spoken language. The compilation process of C-ORAL-BRASIL I, a corpus of informal spontaneous Brazilian Portuguese speech, is used, among other examples, as the basis for the discussion.
