Article published in: Compilation, transcription, markup and annotation of spoken corpora
Edited by John M. Kirk and Gisle Andersen
[International Journal of Corpus Linguistics 21:3] 2016
pp. 396–418
Good practices in the compilation of FOLK, the Research and Teaching Corpus of Spoken German
Published online: 29 September 2016
https://doi.org/10.1075/ijcl.21.3.05sch
This paper presents practices in the compilation of FOLK, the Research and Teaching Corpus of Spoken German, a large collection of spontaneous verbal interaction from diverse discourse domains. After introducing the aims and organisational circumstances of FOLK's construction, the paper argues that good practices cannot be developed without considering methodological, technological and organisational aspects on an equal footing. Starting from this idea, it examines some actual practices in FOLK more closely, namely the handling of legal issues (especially privacy protection), the decisions taken for the transcription and annotation workflow, and the question of how best to disseminate a corpus like FOLK. The final section sketches some possible future improvements to practices in FOLK.
Keywords: oral corpora, corpus interface, transcription, spoken language
Cited by 29 other publications
